Homework 2: Parsing Rotten Tomatoes reviews
In this homework, you will write a script that parses data from Rotten Tomatoes, a movie review website. The work is similar to what we covered in Lectures 4-6, but for a different website. Please read and follow the instructions below very carefully.
Step 1
Your script should begin by importing libraries, and then define two variables:
1. movie: a string variable indicating the movie for which reviews will be parsed.
2. pageNum: the number of review pages to parse.
For example, to parse the first 3 pages of the Gangs of New York reviews, I would have to open your script and set movie='gangs_of_new_york' and pageNum=3. Your code should then go to the movie’s All Critics reviews page, and parse the first 3 pages of reviews.
- Note: pagination on RT happens by clicking on the “Load more” button.
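A minimal sketch of how the top of such a script might look. The address pattern in BASE_URL and the helper build_reviews_url are assumptions made for illustration; this handout does not state the exact address, so check what your browser shows on the movie's All Critics page. The sketch does not click “Load more”; that part depends on the tools covered in the lectures.

```python
# Libraries are imported first (this sketch only needs the standard library).
from urllib.parse import quote

# The two variables the grader will edit.
movie = 'gangs_of_new_york'   # movie identifier, as it appears in the page address
pageNum = 3                   # number of review pages to parse

# ASSUMPTION: the reviews address follows this pattern; verify it in your browser.
BASE_URL = 'https://www.rottentomatoes.com/m/{}/reviews'


def build_reviews_url(movie_id):
    """Return the (assumed) reviews address for a movie identifier."""
    return BASE_URL.format(quote(movie_id))


reviews_url = build_reviews_url(movie)
print(reviews_url)
```

Keeping movie and pageNum at the very top means the grader can change them without reading the rest of the code.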
Step 2
For each review contained in each of the pages you requested, parse the following information:
1. The critic. This should be None if the review doesn't have a critic’s name.
2. The rating. This should be 'rotten', 'fresh', or None if the review doesn't have a rating.
3. The source. This should be None if the review doesn't have a source.
4. The text. This should be None if the review doesn't have text.
5. The date. This should be None if the review doesn't have a date.
Continuing with our Gangs of New York example:
The load more button is at the bottom of the review page:
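The None rules in Step 2 could be handled in one place, after the raw strings have been pulled out of the page. The sketch below is one possible approach, not a required design: the helper names text_or_none, rating_or_none, and make_review are hypothetical, and the sample values are made up. It does not show how to locate the elements on the page.

```python
def text_or_none(value):
    """Return the stripped string, or None if it is missing or empty."""
    if value is None:
        return None
    cleaned = value.strip()
    return cleaned if cleaned else None


def rating_or_none(value):
    """Return 'fresh' or 'rotten', or None for anything else."""
    cleaned = text_or_none(value)
    if cleaned is None:
        return None
    cleaned = cleaned.lower()
    return cleaned if cleaned in ('fresh', 'rotten') else None


def make_review(critic=None, rating=None, source=None, text=None, date=None):
    """Build one review dict; every missing field becomes None."""
    return {
        'critic_name': text_or_none(critic),
        'rating': rating_or_none(rating),
        'source': text_or_none(source),
        'text': text_or_none(text),
        'date': text_or_none(date),
    }


# Made-up values: the source is empty and the text is missing.
review = make_review(critic='  Jane Doe ', rating='Fresh', source='', date='Jan 1, 2003')
print(review)
```

Because every field defaults to None, a review that lacks an element still produces a dict with all five keys.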
Step 3
After parsing the data, save them in a file named firstname_lastname_movie.json.
The JSON file should be a list of dicts, one for each review, with the following structure:
[
{ "critic_name": ..., "rating": ..., "source": ..., "text": ..., "date": ... },
{ "critic_name": ..., "rating": ..., "source": ..., "text": ..., "date": ... }
]
Note: This notebook introduces you to JSON. We can think of JSON files as lists of dicts, all with the same keys.
For example, I would save my data to “apostolos_filippas_gangs_of_new_york.json”. If I had to parse the first three pages of reviews for that movie, my .json output would look and be named like this.
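The saving step can be sketched with Python's built-in json module. The helper names output_filename and save_reviews are hypothetical, chosen for illustration; the sketch assumes the reviews are already collected in a list of dicts.

```python
import json


def output_filename(first_name, last_name, movie_id):
    """Build a file name of the form firstname_lastname_movie.json."""
    return '{}_{}_{}.json'.format(first_name, last_name, movie_id)


def save_reviews(reviews, path):
    """Write the list of review dicts to path as JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        # Python's None is written as JSON null.
        json.dump(reviews, f, indent=2, ensure_ascii=False)


filename = output_filename('apostolos', 'filippas', 'gangs_of_new_york')
print(filename)
```

Calling save_reviews(reviews, filename) at the end of the script would then write the file next to the script. Reading the file back with json.load returns the same list of dicts, with null converted back to None.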
Deliverables and Grading
Deliverables
Submit only your python code (script or notebook) on Blackboard
Grading
To grade your exercise, I will use your script to parse reviews from a few randomly selected movies. In other words, I will change the movie variable and the pageNum variable at the beginning of your code. Your grade will be the percentage of items successfully parsed, compared to my own script. As such, please run your parser on several different movies to make sure your code works across many different cases.
If you submit Homework 2 within 2 weeks, you can get up to a +30% bonus to your final grade.
I will release a hint thereafter, and then you will be able to get the normal 100% of the homework grade.
Collaboration
I encourage you to talk to other students about this assignment and help one another. However, this is not a group assignment. This means that you have to write your own scripts. If you have any questions about collaborating, please read course policies #5 and #7. If you still have questions, contact the instructor.