Assignment 2
For assignment 2, we will use a collection of articles published by the Russian state media outlet Sputnik. All English-language articles on its website that contain “ukraine” were downloaded at the end of 2022 and compiled into a corpus of 8,063 documents. You can download the dataset “data_corpus_sputnik2022.rds” from Moodle.
You are asked to complete the following tasks:
1) Train word embeddings using word2vec on this corpus, and perform a sentiment analysis based on the word embeddings and the relative distances to the following seed words:
Positive: good, rich, happy, perfect, great, important, worth
Negative: bad, poor, sad, shame, regret, disappointment, frustrated
. In training the model, you can decide on the number of dimensions, the number of iterations, and which model architecture (e.g. CBOW or skip-gram) you would like to use.
. Choose a reasonable distance (or similarity) measure.
. Find a reasonable way to aggregate the distances to the positive and negative seed words into a single sentiment score per article.
2) Plot the article-level sentiment scores by date.
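As a rough illustration of how the scoring and plotting steps could fit together, here is a minimal base R sketch. It is only one possible approach, not the required solution. Everything in it is an assumption made for illustration: the embedding matrix is random toy data standing in for the output of your trained word2vec model, cosine similarity is just one reasonable choice of measure, and the names emb, cosine_sim, word_polarity and score_article are hypothetical helpers.

```r
# Toy embedding matrix (ASSUMPTION): rows are words, columns are dimensions.
# In the assignment this would come from your trained word2vec model.
set.seed(1)
vocab <- c("good", "great", "bad", "poor", "peace", "war", "talks")
emb <- matrix(rnorm(length(vocab) * 5), nrow = length(vocab),
              dimnames = list(vocab, NULL))

# Shortened seed lists for the toy example.
pos_seeds <- c("good", "great")
neg_seeds <- c("bad", "poor")

# Cosine similarity between every row of a and every row of b.
cosine_sim <- function(a, b) {
  a <- a / sqrt(rowSums(a^2))  # normalise each row to unit length
  b <- b / sqrt(rowSums(b^2))
  a %*% t(b)
}

# Word polarity: mean similarity to the positive seeds
# minus mean similarity to the negative seeds.
word_polarity <- function(emb, pos, neg) {
  pos <- intersect(pos, rownames(emb))  # keep only seeds in the vocabulary
  neg <- intersect(neg, rownames(emb))
  sim_pos <- cosine_sim(emb, emb[pos, , drop = FALSE])
  sim_neg <- cosine_sim(emb, emb[neg, , drop = FALSE])
  rowMeans(sim_pos) - rowMeans(sim_neg)
}

polarity <- word_polarity(emb, pos_seeds, neg_seeds)

# Article score: average polarity of the tokens found in the vocabulary.
# Returns NA when none of the tokens has an embedding.
score_article <- function(tokens, polarity) {
  known <- tokens[tokens %in% names(polarity)]
  if (length(known) == 0) return(NA_real_)
  mean(polarity[known])
}

# Toy tokenised articles with publication dates (ASSUMPTION).
articles <- list(
  c("peace", "talks", "good"),
  c("war", "bad", "poor"),
  c("unknown")
)
dates <- as.Date(c("2022-03-01", "2022-03-02", "2022-03-03"))

scores <- vapply(articles, score_article, numeric(1), polarity = polarity)
result <- data.frame(date = dates, score = scores)

# Article-level sentiment scores by date.
plot(result$date, result$score, xlab = "Date", ylab = "Sentiment score")
```

Averaging over seed words and then over tokens is a design choice; alternatives such as weighting tokens by frequency, or smoothing the scores over time before plotting, may suit the real corpus better.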
Please submit your R Markdown file containing both the code used to complete the above tasks and the resulting plots as output. The deadline is 23:59 on 19 March.