CSE545 Assignment 1.
Overview
Goals.
General Requirements. You must use Python version 3.6 or later, along with PyTorch 1.4.0. You must fill in the functions of the template code, which you can download here: Only edit the contents of methods marked "#X.X IMPLEMENT" (where "X.X" is the corresponding step number below). Do not edit methods or sections of code marked "#DONT EDIT".
Python Libraries. No libraries beyond those provided in the template code are permitted unless listed below. Of these libraries, you may not use any subcomponents that specifically implement a concept the instructions ask you to implement yourself (e.g., a complete logistic regression). The project can be completed without any additional libraries. However, if any additional libraries are deemed permissible, they will be listed here:
Submission.
Academic Integrity. Copying chunks of code from other students, websites, or other resources outside of the materials provided in class is prohibited. You are responsible both for (1) not copying others' work and (2) making sure your work is not accessible to others. Assignments will be extensively checked for copying of others' work. Please see the syllabus for additional policies.
Data
Who says magazines are a dying medium? There were over 89,000 reviews left on Amazon between 2001 and 2018 for magazines, and we're going to use 2,375 of them.
Before beginning, download the data for this assignment, containing the 2,374 valid records here:
The file is a gzip of a json encoding of Amazon review records for magazines. It comes from Jianmo Ni's Amazon Review Data.*
Each review contains a lot of information, but for this assignment we only care about "overall", the rating given to the product, and "reviewText", the natural-language review itself.
Note that the template code already reads in the file and converts each record to simply the text and a positive/negative score. Still, it is recommended to take a look at the data to understand what the code is doing.
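For orientation, the loading step amounts to something like the sketch below. This is only an illustration, not the template's actual code: the filename, the assumption that the file is line-delimited JSON (one record per line, as in the source dataset), and the mapping of star ratings to a 0/1 label are all guesses here.

import gzip
import json

def load_reviews(path="Magazine_Subscriptions.json.gz"):  # filename is an assumption
    """Read gzipped, line-delimited JSON reviews and keep [text, label] pairs."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            review = json.loads(line)
            if "reviewText" not in review:   # skip records with no review text
                continue
            rating = review["overall"]       # the 1-5 star rating
            # Guess at the template's labeling: 4-5 stars -> 1, 1-2 stars -> 0,
            # with 3-star reviews dropped as neutral.
            if rating >= 4.0:
                records.append([review["reviewText"], 1])
            elif rating <= 2.0:
                records.append([review["reviewText"], 0])
    return records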
Part 1: Regular Expressions for Tokenization
1.1 Implement the tokenize(text) method using a regular expression.
At this point, you should be able to run the code and see the tokenization of the first 3 reviews.
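One plausible way to write it (a sketch, not the required pattern) is a single re.findall call that keeps apostrophes and hyphens inside word tokens and emits common punctuation marks as their own tokens, producing output similar to the example below:

import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens.
    The pattern keeps apostrophes and hyphens inside word tokens
    (e.g. "i'm", "self-worth") and matches common punctuation marks
    as separate tokens; anything else (e.g. '&') is dropped."""
    return re.findall(r"[\w'-]+|[.,!?;:]", text.lower())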
Part 1 Hints.
Example output:
LOADING DATA...
DONE.
2374 records.
First 3 records:
["I'm old, and so is my computer. Any advice that can help me maximize my computer perfomance is very welcome. MaximumPC has some good tips on computer parts, vendors, and usefull tests", 1]
["There's nothing to say, but if you want a REAL men's magazine, this is it. They have great articles and stories, and I love the bits of knowledge that starts the Know & Tell section, and I love the last page, which has an interview with a celebrity. Just get this magazine and forget all the other mature men's mags.", 1]
["If you're the kind of man who looks at himself in a full length mirrror every morning, this is the magazine for you.\nIf you think the car you drive defines you, the clothes you wear are the measure of your self-worth, the watch on your wrist tells the value of you as a person, and the woman you're seen with is a measure of your rank in society- you need this magazine.\nOther men- who might value human relationships- may profitably give it a pass.", 0]
TOKENIZING TEXT...
DONE.
First 3 records:
[["i'm", 'old', ',', 'and', 'so', 'is', 'my', 'computer', '.', 'any', 'advice', 'that', 'can', 'help', 'me', 'maximize', 'my', 'computer', 'perfomance', 'is', 'very', 'welcome', '.', 'maximumpc', 'has', 'some', 'good', 'tips', 'on', 'computer', 'parts', ',', 'vendors', ',', 'and', 'usefull', 'tests'], 1]
[["there's", 'nothing', 'to', 'say', ',', 'but', 'if', 'you', 'want', 'a', 'real', "men's", 'magazine', ',', 'this', 'is', 'it', '.', 'they', 'have', 'great', 'articles', 'and', 'stories', ',', 'and', 'i', 'love', 'the', 'bits', 'of', 'knowledge', 'that', 'starts', 'the', 'know', 'tell', 'section', ',', 'and', 'i', 'love', 'the', 'last', 'page', ',', 'which', 'has', 'an', 'interview', 'with', 'a', 'celebrity', '.', 'just', 'get', 'this', 'magazine', 'and', 'forget', 'all', 'the', 'other', 'mature', "men's", 'mags', '.'], 1]
[['if', "you're", 'the', 'kind', 'of', 'man', 'who', 'looks', 'at', 'himself', 'in', 'a', 'full', 'length', 'mirrror', 'every', 'morning', ',', 'this', 'is', 'the', 'magazine', 'for', 'you', '.', 'if', 'you', 'think', 'the', 'car', 'you', 'drive', 'defines', 'you', ',', 'the', 'clothes', 'you', 'wear', 'are', 'the', 'measure', 'of', 'your', 'self-worth', ',', 'the', 'watch', 'on', 'your', 'wrist', 'tells', 'the', 'value', 'of', 'you', 'as', 'a', 'person', ',', 'and', 'the', 'woman', "you're", 'seen', 'with', 'is', 'a', 'measure', 'of', 'your', 'rank', 'in', 'society-', 'you', 'need', 'this', 'magazine', '.', 'other', 'men-', 'who', 'might', 'value', 'human', 'relationships-', 'may', 'profitably', 'give', 'it', 'a', 'pass', '.'], 0]
Part 2: Applying a Lexicon for Sentiment Classification
2.1 Implement the lexicaScore(lexica, tokens) method. This function should count the number of tokens that appear in each category of the lexica, then return a dictionary of the relative frequency of each category (the count of tokens found in that category's word list divided by the total number of tokens).
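A minimal sketch of such a scorer, assuming lexica maps category names (e.g. 'neg', 'pos', 'verbs', matching the keys in the example output below) to collections of words:

def lexicaScore(lexica, tokens):
    """Return the relative frequency of each lexicon category in tokens."""
    scores = {}
    total = len(tokens)
    for category, words in lexica.items():
        wordset = set(words)  # set membership checks are O(1)
        count = sum(1 for tok in tokens if tok in wordset)
        scores[category] = count / total if total > 0 else 0.0
    return scores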
2.2 Adjust the "thresh" parameter within the posNegLexClassify method. Once 2.1 is finished, running your code should print the true rating (1 or 0) as well as the prediction based on the lexicon scores and threshold. Adjust thresh to try to increase the accuracy over these 20 examples, until more than 60% are correct and at least one of the true "0" records is predicted as "0".
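The decision rule itself lives in the provided posNegLexClassify method, but it is presumably of roughly the following form (a guess for illustration only; check the template for the real rule), which is why raising or lowering thresh trades positive predictions for negative ones:

def posNegLexClassify(lexScores, thresh=0.01):
    """Predict 1 (positive) when the positive-word rate exceeds the
    negative-word rate by more than thresh, else predict 0.
    This is only a sketch of what the template might do."""
    return 1 if (lexScores['pos'] - lexScores['neg']) > thresh else 0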
Part 2 Hints.
Example output:
SCORING BY LEXICON...
DONE.
First 20 lexicon predictions:
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.02702702702702703, 'verbs': 0.13513513513513514}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.04477611940298507, 'verbs': 0.1044776119402985}
rating: 0, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.03225806451612903}
rating: 0, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.0}
rating: 1, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.02}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.011904761904761904, 'verbs': 0.05952380952380952}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.10526315789473684, 'verbs': 0.0}
rating: 0, lex pred: 1, lex scores: {'neg': 0.0029585798816568047, 'pos': 0.015779092702169626, 'verbs': 0.0650887573964497}
rating: 1, lex pred: 0, lex scores: {'neg': 0.002652519893899204, 'pos': 0.010610079575596816, 'verbs': 0.04509283819628647}
rating: 0, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.016666666666666666, 'verbs': 0.06666666666666667}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.07142857142857142, 'verbs': 0.14285714285714285}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0029411764705882353, 'pos': 0.014705882352941176, 'verbs': 0.052941176470588235}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.01834862385321101, 'verbs': 0.06422018348623854}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.019801980198019802, 'verbs': 0.08415841584158416}
rating: 1, lex pred: 0, lex scores: {'neg': 0.012048192771084338, 'pos': 0.0, 'verbs': 0.04819277108433735}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.05555555555555555, 'verbs': 0.1111111111111111}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.3333333333333333, 'verbs': 0.3333333333333333}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 1.0, 'verbs': 0.0}
rating: 1, lex pred: 1, lex scores: {'neg': 0.0, 'pos': 0.09523809523809523, 'verbs': 0.047619047619047616}
rating: 1, lex pred: 0, lex scores: {'neg': 0.0, 'pos': 0.0, 'verbs': 0.25}
Lexicon Overall Accuracy: 0.638
Part 3: Logistic Regression for Sentiment Classification
3.1 Implement the method extractMultiHot(tokens, vocab). This method takes the tokens of a review along with the vocabulary words and returns a single vector: a multi-hot encoding indicating which vocabulary words appear among the tokens. The vector returned should be a list of size len(vocab).
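A minimal sketch, assuming vocab is an ordered list of vocabulary words:

def extractMultiHot(tokens, vocab):
    """Return a list of length len(vocab): position i is 1.0 if vocab[i]
    occurs anywhere in tokens, and 0.0 otherwise."""
    tokenSet = set(tokens)  # O(1) membership tests
    return [1.0 if word in tokenSet else 0.0 for word in vocab]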
3.2 Define normalizedLogLoss(ypred, true). This should be the normalized log likelihood loss function covered in class: the log of the likelihood function, negated (subtracted from zero), and normalized by the total number of observations. Here ypred is the output of the logistic regression model (i.e., an estimated probability) and true is the correct label (either 0 or 1). The output should be a single torch scalar floating-point value.
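In equation form, the loss is -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]. A minimal sketch, assuming ypred and true are tensors of matching shape (the epsilon clamp is an added safeguard against log(0), not something the assignment requires):

import torch

def normalizedLogLoss(ypred, true):
    """Negative log likelihood of the true labels under the predicted
    probabilities, averaged over the observations."""
    eps = 1e-8
    ypred = torch.clamp(ypred, eps, 1 - eps)  # guard against log(0)
    ll = true * torch.log(ypred) + (1 - true) * torch.log(1 - ypred)
    return -torch.mean(ll)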
Part 3 Hints.
Example Output:
EXTRACTING FEATURES...
Vocabulary Size: 2000
Done.
Xtrain shape: torch.Size([1899, 2000]) , ytrain shape: torch.Size([1899, 1])
Training Logistic Regression...
epoch: 0, loss: 0.70983
epoch: 20, loss: 0.35653
epoch: 40, loss: 0.30757
epoch: 60, loss: 0.27947
epoch: 80, loss: 0.25972
epoch: 100, loss: 0.24448
epoch: 120, loss: 0.23207
epoch: 140, loss: 0.22162
epoch: 160, loss: 0.21262
epoch: 180, loss: 0.20473
epoch: 200, loss: 0.19773
epoch: 220, loss: 0.19144
epoch: 240, loss: 0.18574
epoch: 260, loss: 0.18055
epoch: 280, loss: 0.17578
Done.
First 20 test set predictions:
rating: 0, logreg pred: 1 (prob: 0.664), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.608), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.788), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.908), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.947), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.930), lex pred: 1
rating: 0, logreg pred: 1 (prob: 0.668), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.832), lex pred: 1
rating: 0, logreg pred: 0 (prob: 0.253), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.983), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.939), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.995), lex pred: 1
rating: 1, logreg pred: 0 (prob: 0.342), lex pred: 0
rating: 0, logreg pred: 0 (prob: 0.070), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.830), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.954), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.955), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.883), lex pred: 0
rating: 1, logreg pred: 1 (prob: 0.907), lex pred: 1
rating: 1, logreg pred: 1 (prob: 0.995), lex pred: 1
LogReg Model Test Set Accuracy: 0.899
Lexicon Test Set Accuracy: 0.832