CSE545 Assignment 1.
Overview
Goals.
General Requirements. You must use Python version 3.6 or later, along with PyTorch 1.4.0. You must fill in the functions of the template code, which you can download here: Only edit the contents of methods marked "#X.X IMPLEMENT" (where "X.X" is the corresponding step number below). Do not edit methods or sections of code marked "#DONT EDIT".
Python Libraries. No libraries beyond those provided in the template code are permitted unless listed below. Of these libraries, you may not use any subcomponents that specifically implement a concept the instructions ask you to implement yourself (e.g., a complete logistic regression). The project can be completed without any additional libraries. However, if any additional libraries are deemed permissible, they will be listed here:
Submission.
Academic Integrity. Copying chunks of code from other students, websites, or other resources outside of the materials provided in class is prohibited. You are responsible both for (1) not copying others' work and (2) making sure your work is not accessible to others. Assignments will be extensively checked for copying of others' work. Please see the syllabus for additional policies.
Data
Who says magazines are a dying medium? There were over 89,000 reviews left on Amazon between 2001 and 2018 for magazines, and we're going to use 2,375 of them.
Before beginning, download the data for this assignment, containing the 2,374 valid records here:
The file is a gzip of a json encoding of Amazon review records for magazines. It comes from Jianmo Ni's Amazon Review Data.*
Each review contains a lot of information, but for this assignment we only care about "overall", the rating given to the product, and "reviewText", the natural-language review itself.
Note that the template code already reads in the file and converts each record to simply the text and a positive/negative score. Still, it is recommended to take a look at the data to understand what the code is doing.
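For orientation, the loading step amounts to something like the sketch below. This is only an illustration, not the template's actual code: the filename, the assumption that the file is line-delimited JSON (one record per line, as in the source dataset), and the mapping of star ratings to a 0/1 label are all guesses here.

import gzip
import json

def load_reviews(path="Magazine_Subscriptions.json.gz"):  # filename is an assumption
    """Read gzipped, line-delimited JSON reviews and keep [text, label] pairs."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            review = json.loads(line)
            if "reviewText" not in review:   # skip records with no review text
                continue
            rating = review["overall"]       # the 1-5 star rating
            # Guess at the template's labeling: 4-5 stars -> 1, 1-2 stars -> 0,
            # with 3-star reviews dropped as neutral.
            if rating >= 4.0:
                records.append([review["reviewText"], 1])
            elif rating <= 2.0:
                records.append([review["reviewText"], 0])
    return records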
Part 1: Regular Expressions for Tokenization
1.1 Implement the tokenize(text) method using a regular expression.
At this point, you should be able to run the code and see the tokenization of the first 3 reviews.
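One plausible way to write it (a sketch, not the required pattern) is a single re.findall call that keeps apostrophes and hyphens inside word tokens and emits common punctuation marks as their own tokens, producing output similar to the example below:

import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens.
    The pattern keeps apostrophes and hyphens inside word tokens
    (e.g. "i'm", "self-worth") and matches common punctuation marks
    as separate tokens; anything else (e.g. '&') is dropped."""
    return re.findall(r"[\w'-]+|[.,!?;:]", text.lower())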
Part 1 Hints.
Example output:
LOADING DATA...
DONE.
2374 records.
First 3 records:
["I'm old, and so is my computer. Any advice that can help me maximize my computer perfomance is very welcome. MaximumPC has some good tips on computer parts, vendors, and usefull tests", 1]
["There's nothing to say, but if you want a REAL men's magazine, this is it. They have great articles and stories, and I love the bits of knowledge that starts the Know & Tell section, and I love the last page, which has an interview with a celebrity. Just get this magazine and forget all the other mature men's mags.", 1]
["If you're the kind of man who looks at himself in a full length mirrror every morning, this is the magazine for you.\nIf you think the car you drive defines you, the clothes you wear are the measure of your self-worth, the watch on your wrist tells the value of you as a person, and the woman you're seen with is a measure of your rank in society- you need this magazine.\nOther men- who might value human relationships- may profitably give it a pass.", 0]
TOKENIZING TEXT...
DONE.
First 3 records:
[["i'm", 'old', ',', 'and', 'so', 'is', 'my', 'computer', '.', 'any', 'advice', 'that', 'can', 'help', 'me', 'maximize', 'my', 'computer', 'perfomance', 'is', 'very', 'welcome', '.', 'maximumpc', 'has', 'some', 'good', 'tips', 'on', 'computer', 'parts', ',', 'vendors', ',', 'and', 'usefull', 'tests'], 1]
[["there's", 'nothing', 'to', 'say', ',', 'but', 'if', 'you', 'want', 'a', 'real', "men's", 'magazine', ',', 'this', 'is', 'it', '.', 'they', 'have', 'great', 'articles', 'and', 'stories', ',', 'and', 'i', 'love', 'the', 'bits', 'of', 'knowledge', 'that', 'starts', 'the', 'know', 'tell', 'section', ',', 'and', 'i', 'love', 'the', 'last', 'page', ',', 'which', 'has', 'an', 'interview', 'with', 'a', 'celebrity', '.', 'just', 'get', 'this', 'magazine', 'and', 'forget', 'all', 'the', 'other', 'mature', "men's", 'mags', '.'], 1]
[['if', "you're", 'the', 'kind', 'of', 'man', 'who', 'looks', 'at', 'himself', 'in', 'a', 'full', 'length', 'mirrror', 'every', 'morning', ',', 'this', 'is', 'the', 'magazine', 'for', 'you', '.', 'if', 'you', 'think', 'the', 'car', 'you', 'drive', 'defines', 'you', ',', 'the', 'clothes', 'you', 'wear', 'are', 'the', 'measure', 'of', 'your', 'self-worth', ',', 'the', 'watch', 'on', 'your', 'wrist', 'tells', 'the', 'value', 'of', 'you', 'as', 'a', 'person', ',', 'and', 'the', 'woman', "you're", 'seen', 'with', 'is', 'a', 'measure', 'of', 'your', 'rank', 'in', 'society-', 'you', 'need', 'this', 'magazine', '.', 'other', 'men-', 'who', 'might', 'value', 'human', 'relationships-', 'may', 'profitably', 'give', 'it', 'a', 'pass', '.'], 0]
Part 2: Applying a Lexicon for Sentiment Classification
2.1 Implement the lexicaScore(lexica, tokens) method. This function should count the number of tokens that appear in each category of the lexica, then return a dictionary of the relative frequency of each category (the count of tokens found in that category's word list divided by the total number of tokens).
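A minimal sketch of such a scorer, assuming lexica maps category names (e.g. 'neg', 'pos', 'verbs', matching the keys in the example output below) to collections of words:

def lexicaScore(lexica, tokens):
    """Return the relative frequency of each lexicon category in tokens."""
    scores = {}
    total = len(tokens)
    for category, words in lexica.items():
        wordset = set(words)  # set membership checks are O(1)
        count = sum(1 for tok in tokens if tok in wordset)
        scores[category] = count / total if total > 0 else 0.0
    return scores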
2.2 Adjust the "thresh" parameter within the posNegLexClassify method. Once 2.1 is finished, running your code should print the true rating (1 or 0) as well as the prediction based on the lexicon scores and threshold. Adjust thresh to try to increase the accuracy over these 20 examples, until more than 60% are correct and at least one of the true "0" records is predicted as "0".
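The decision rule itself lives in the provided posNegLexClassify method, but it is presumably of roughly the following form (a guess for illustration only; check the template for the real rule), which is why raising or lowering thresh trades positive predictions for negative ones:

def posNegLexClassify(lexScores, thresh=0.01):
    """Predict 1 (positive) when the positive-word rate exceeds the
    negative-word rate by more than thresh, else predict 0.
    This is only a sketch of what the template might do."""
    return 1 if (lexScores['pos'] - lexScores['neg']) > thresh else 0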
Part 2 Hints.
Example output:
SCORING BY LEXICON...
DONE.
First 20 lexicon predictions:
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.02702702702702703, 'verbs': 0.13513513513513514}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.04477611940298507, 'verbs': 0.1044776119402985}
rating: 0, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.03225806451612903}
rating: 0, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.0}
rating: 1, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.02}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.011904761904761904, 'verbs': 0.05952380952380952}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.10526315789473684, 'verbs': 0.0}
rating: 0, lex pred: 1, lex scores: {'neg': 0.0029585798816568047, 'pos': 0.015779092702169626, 'verbs': 0.0650887573964497}
rating: 1, lex pred: 0, lex scores: {'neg': 0.002652519893899204, 'pos': 0.010610079575596816, 'verbs': 0.04509283819628647}
rating: 0, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.016666666666666666, 'verbs': 0.06666666666666667}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.07142857142857142, 'verbs': 0.14285714285714285}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0029411764705882353, 'pos': 0.014705882352941176, 'verbs': 0.052941176470588235}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.01834862385321101, 'verbs': 0.06422018348623854}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.019801980198019802, 'verbs': 0.08415841584158416}
rating: 1, lex pred: 0, lex scores: {'neg': 0.012048192771084338, 'pos': 0.0, 'verbs': 0.04819277108433735}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.05555555555555555, 'verbs': 0.1111111111111111}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.3333333333333333, 'verbs': 0.3333333333333333}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 1.0, 'verbs': 0.0}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.09523809523809523, 'verbs': 0.047619047619047616}
rating: 1, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.25}
Lexicon Overall Accuracy: 0.638
Part 3: Logistic Regression for Sentiment Classification
3.1 Implement the method extractMultiHot(tokens, vocab). This method takes the tokens of a review along with the vocabulary words and returns a single vector: a multi-hot encoding indicating which vocabulary words appear among the tokens. The vector returned should be a list of size len(vocab).
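A minimal sketch, assuming vocab is an ordered list of vocabulary words:

def extractMultiHot(tokens, vocab):
    """Return a list of length len(vocab): position i is 1.0 if vocab[i]
    occurs anywhere in tokens, and 0.0 otherwise."""
    tokenSet = set(tokens)  # O(1) membership tests
    return [1.0 if word in tokenSet else 0.0 for word in vocab]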
3.2 Define normalizedLogLoss(ypred, true). This should be the normalized log likelihood loss function covered in class: the log of the likelihood function, negated (subtracted from zero), and normalized by the total number of observations. Here ypred is the output of the logistic regression model (i.e., an estimated probability) and true is the correct label (either 0 or 1). The output should be a single torch scalar floating-point value.
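In equation form, the loss is -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]. A minimal sketch, assuming ypred and true are tensors of matching shape (the epsilon clamp is an added safeguard against log(0), not something the assignment requires):

import torch

def normalizedLogLoss(ypred, true):
    """Negative log likelihood of the true labels under the predicted
    probabilities, averaged over the observations."""
    eps = 1e-8
    ypred = torch.clamp(ypred, eps, 1 - eps)  # guard against log(0)
    ll = true * torch.log(ypred) + (1 - true) * torch.log(1 - ypred)
    return -torch.mean(ll)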
Part 3 Hints.
Example Output:
EXTRACTING FEATURES...
Vocabulary Size: 2000
Done.
Xtrain shape: torch.Size([1899, 2000]) , ytrain shape: torch.Size([1899, 1])
Training Logistic Regression...
epoch: 0, loss: 0.70983
epoch: 20, loss: 0.35653
epoch: 40, loss: 0.30757
epoch: 60, loss: 0.27947
epoch: 80, loss: 0.25972
epoch: 100, loss: 0.24448
epoch: 120, loss: 0.23207
epoch: 140, loss: 0.22162
epoch: 160, loss: 0.21262
epoch: 180, loss: 0.20473
epoch: 200, loss: 0.19773
epoch: 220, loss: 0.19144
epoch: 240, loss: 0.18574
epoch: 260, loss: 0.18055
epoch: 280, loss: 0.17578
Done.
First 20 test set predictions:
rating: 0, logreg pred: 1 (prob: 0.664), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.608), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.788), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.908), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.947), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.930), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.668), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.832), lex pred: 1
rating: 0, logreg pred: 0 (prob: 0.253), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.983), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.939), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.995), lex pred: 1
rating: 1, logreg pred: 0 (prob: 0.342), lex pred: 0
rating: 0, logreg pred: 0 (prob: 0.070), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.830), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.954), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.955), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.883), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.907), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.995), lex pred: 1
LogReg Model Test Set Accuracy: 0.899
Lexicon Test Set Accuracy: 0.832