Assignment 1
Big Data Analytics with Python
MET CS777
Description
The goal of this assignment is to write programs using Python (not PySpark) to answer some data analysis questions given the datasets.
Problem 1 (50 points)
“Social Computing Research at the University of Minnesota” has released movie rating datasets of different sizes on the “grouplens.org” website. Load the MovieLens 10M dataset, which consists of 10 million movie ratings. To download the data, go to grouplens.org and, under the “datasets” tab, download the “MovieLens 10M Dataset”, which is 63 MB (direct link: https://grouplens.org/datasets/movielens/10m/).
Tasks
a. (5 points) Divide the data into five files of almost equal size and use these five files in the rest of the assignment.
b. (15 points) Sort the data from the highest-rated movie to the lowest-rated one. Measure how much time the sorting takes.
• Part 1: Don’t use any built-in sort function and write the “sort” function yourself.
• Part 2: Use a built-in sort function.
c. (5 points) Create a histogram of the movie ratings.
Measure how much time it takes to create the histogram.
d. (10 points) Data contains more than 10M ratings of 10681 movies by 71567 users.
Create a histogram of the number of times each movie got rated.
Measure how much time it takes to create the histogram.
e. (15 points) Choose the lowest three bins of the histogram in part c and create a histogram of movie ratings for these three bins. Do the same for the top three bins of the histogram.
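The steps in tasks a to d can be sketched on a small scale as follows. This is a minimal sketch under stated assumptions, not a reference solution: it assumes each rating line has the form `UserID::MovieID::Rating::Timestamp` (the layout of `ratings.dat` in the MovieLens 10M download), the sample lines are made up, and the helper names `parse_rating`, `split_evenly`, `merge_sort`, `timed`, and `histogram` are hypothetical. Merge sort is only one possible choice for the hand-written sort.

```python
import time
from collections import Counter


def parse_rating(line):
    """Parse 'UserID::MovieID::Rating::Timestamp' (assumed layout)."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)


def split_evenly(items, parts=5):
    """Split a list into `parts` chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def merge_sort(records, key):
    """Hand-written merge sort, descending by key (one possible choice)."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    left = merge_sort(records[:mid], key)
    right = merge_sort(records[mid:], key)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # '>=' keeps records with equal keys in their original order.
        if key(left[i]) >= key(right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def timed(func, *args, **kwargs):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def histogram(values):
    """Count how often each distinct value occurs."""
    return Counter(values)


def rating_of(record):
    """Sort key: the rating field of a parsed record."""
    return record[2]


# Tiny made-up sample standing in for the real ratings file.
sample_lines = [
    "1::10::5::838985046",
    "1::20::3.5::838983525",
    "2::10::4::838983421",
    "2::30::1::838983392",
    "3::10::3.5::838984474",
    "3::20::5::838983653",
    "4::30::2::838984885",
]
ratings = [parse_rating(line) for line in sample_lines]

chunks = split_evenly(ratings, parts=5)                      # task a
own_sorted, own_secs = timed(merge_sort, ratings, rating_of)  # task b, part 1
builtin_sorted, builtin_secs = timed(
    sorted, ratings, key=rating_of, reverse=True)             # task b, part 2
rating_hist = histogram(r[2] for r in ratings)                # task c
ratings_per_movie = histogram(r[1] for r in ratings)          # task d

print([len(c) for c in chunks])
print(f"own sort: {own_secs:.6f}s, built-in sort: {builtin_secs:.6f}s")
print(sorted(rating_hist.items()))
print(sorted(ratings_per_movie.items()))
```

On the real data, the same functions could be applied to the lines read from each of the five files. The counts returned by `histogram` can then be passed to whichever plotting tool you prefer.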
Problem 2 (50 points)
Taxi Data set
The data set consists of New York City taxi trip reports from 2013, which were released under FOIL (the Freedom of Information Law) and made public by Chris Whong (https://chriswhong.com/open-data/foil_nyc_taxi/).
The data set itself is a simple text file. Each taxi trip report is a separate line in the file. Among other things, each trip report includes the starting point, the drop-off point, the corresponding timestamps, and information related to the payment. The data are reported by the time the trip ended, i.e., the lines appear in order of their drop-off timestamps. The attributes on each line of the file appear in the order shown in Table 1. The data files are in comma-separated values (CSV) format.
Table 1: Taxi data set fields.
Obtaining the Dataset
The data set (93 MB compressed, 384 MB uncompressed), namely taxi-data-sorted-small.csv.bz2, is attached to the assignment.
Tasks
Task 1 (10 points): Top-10 Active Taxis
Many different taxis have had multiple drivers. Write and execute a Python program that computes the top ten taxis that have had the largest number of drivers. Your output should be a set of (medallion, number of drivers) pairs.
Note: This is a real-world data set that might include wrongly formatted data lines; for example, a line might not include all the fields. You should clean up the data before the main processing. If a data line is not correctly formatted, drop that line and do not consider it.
Report the processing time of the task as well.
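One way to approach Task 1 is sketched below. It is a minimal sketch, not a reference solution. Because the body of Table 1 is not reproduced here, the column positions are assumptions: 17 fields per line, the medallion at index 0, and the driver ID at index 1. The helper names and the sample lines are made up for illustration.

```python
import time
from collections import defaultdict

EXPECTED_FIELDS = 17   # assumption: number of columns listed in Table 1
MEDALLION_COL = 0      # assumption: position of the taxi medallion
DRIVER_COL = 1         # assumption: position of the driver ID


def clean_rows(lines):
    """Yield field lists for well-formed lines and drop everything else."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) != EXPECTED_FIELDS:
            continue  # wrong number of fields
        if not fields[MEDALLION_COL] or not fields[DRIVER_COL]:
            continue  # missing key fields
        yield fields


def top_taxis_by_drivers(lines, n=10):
    """Return up to n (medallion, number of distinct drivers) pairs."""
    drivers_per_taxi = defaultdict(set)
    for fields in clean_rows(lines):
        drivers_per_taxi[fields[MEDALLION_COL]].add(fields[DRIVER_COL])
    counts = [(taxi, len(drivers)) for taxi, drivers in drivers_per_taxi.items()]
    # Most drivers first; ties are broken by medallion for a stable result.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]


def make_line(medallion, driver):
    """Build a made-up 17-field line; only the first two fields matter here."""
    return ",".join([medallion, driver] + ["0"] * (EXPECTED_FIELDS - 2))


sample = [
    make_line("TAXI_A", "D1"),
    make_line("TAXI_A", "D2"),
    make_line("TAXI_A", "D2"),   # same driver again, counted once
    make_line("TAXI_B", "D1"),
    "TAXI_C,D9,too,short",       # malformed line, dropped
]

start = time.perf_counter()
result = top_taxis_by_drivers(sample)
elapsed = time.perf_counter() - start
print(result)
print(f"processing time: {elapsed:.6f}s")
```

For the real file, the lines could be streamed without decompressing to disk, for example with `bz2.open("taxi-data-sorted-small.csv.bz2", "rt")` from the standard library, and passed to `top_taxis_by_drivers` directly.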
Task 2 (20 Points): Top-10 Best Drivers
We would like to figure out who the top 10 best drivers are in terms of their average earned money per minute spent carrying a customer. The total amount field is the total money earned on a trip. In the end, we are interested in computing a set of (driver, money per minute) pairs.
Report the processing time of the task as well.
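A possible sketch for Task 2 follows. It assumes, since the body of Table 1 is not reproduced here, that each line has 17 fields with the driver ID at index 1, the trip time in seconds at index 4, and the total amount at index 16. It also takes "average money per minute" to mean a driver's total money divided by the driver's total minutes; averaging the per-trip ratios would be a different, equally defensible reading. All names and sample values are hypothetical.

```python
import heapq
import time
from collections import defaultdict

EXPECTED_FIELDS = 17    # assumption: number of columns listed in Table 1
DRIVER_COL = 1          # assumption: position of the driver ID
TRIP_SECS_COL = 4       # assumption: position of trip time in seconds
TOTAL_AMOUNT_COL = 16   # assumption: position of the total amount


def parse_trip(line):
    """Return (driver, minutes, total_amount), or None for an unusable line."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != EXPECTED_FIELDS:
        return None
    try:
        seconds = float(fields[TRIP_SECS_COL])
        amount = float(fields[TOTAL_AMOUNT_COL])
    except ValueError:
        return None  # non-numeric field
    if seconds <= 0 or amount < 0 or not fields[DRIVER_COL]:
        return None  # zero-length trips would cause a division by zero
    return fields[DRIVER_COL], seconds / 60.0, amount


def top_drivers(lines, n=10):
    """Return up to n (driver, money per minute) pairs, best first."""
    totals = defaultdict(lambda: [0.0, 0.0])  # driver -> [money, minutes]
    for line in lines:
        trip = parse_trip(line)
        if trip is None:
            continue
        driver, minutes, amount = trip
        totals[driver][0] += amount
        totals[driver][1] += minutes
    rates = [(driver, money / minutes)
             for driver, (money, minutes) in totals.items()]
    return heapq.nlargest(n, rates, key=lambda pair: pair[1])


def make_line(driver, seconds, amount):
    """Build a made-up 17-field line with only the relevant fields set."""
    fields = ["0"] * EXPECTED_FIELDS
    fields[DRIVER_COL] = driver
    fields[TRIP_SECS_COL] = str(seconds)
    fields[TOTAL_AMOUNT_COL] = str(amount)
    return ",".join(fields)


sample = [
    make_line("D1", 600, 20.0),
    make_line("D1", 600, 10.0),
    make_line("D2", 300, 10.0),
    make_line("D3", 0, 50.0),      # zero duration, dropped
    make_line("D4", "abc", 5.0),   # non-numeric duration, dropped
]

start = time.perf_counter()
best = top_drivers(sample)
elapsed = time.perf_counter() - start
print(best)
print(f"processing time: {elapsed:.6f}s")
```

`heapq.nlargest` avoids sorting every driver when only the top ten are needed; a full sort followed by slicing would give the same pairs.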
Task 3 (20 Points): Best Time of the Day to Work on a Taxi
We would like to know which hour of the day is the best time for drivers, i.e., the hour with the highest profit per mile. Consider the surcharge amount in dollars for each taxi ride (without the tip amount) and the distance in miles, and sum them over the rides for each hour of the day (24 hours). Use the pickup time for your calculation. The profit ratio is the surcharge in dollars divided by the travel distance in miles for each specific hour of the day.
Profit Ratio = (Surcharge Amount in US Dollars) / (Travel Distance in Miles)
We are interested in the time of the day that has the highest profit ratio.
Report the processing time of the task as well.
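The profit ratio above can be sketched as follows. This is a minimal sketch under assumptions, since the body of Table 1 is not reproduced here: 17 fields per line, the pickup timestamp at index 2 in the form `YYYY-MM-DD HH:MM:SS`, the trip distance at index 5, and the surcharge at index 12. The function names and sample rides are made up.

```python
import time
from datetime import datetime

EXPECTED_FIELDS = 17          # assumption: number of columns listed in Table 1
PICKUP_COL = 2                # assumption: position of the pickup timestamp
DISTANCE_COL = 5              # assumption: position of trip distance in miles
SURCHARGE_COL = 12            # assumption: position of the surcharge
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"   # assumption: timestamp layout


def hourly_profit_ratio(lines):
    """Return {hour: total surcharge / total miles} for hours with data."""
    surcharge = [0.0] * 24
    miles = [0.0] * 24
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) != EXPECTED_FIELDS:
            continue
        try:
            hour = datetime.strptime(fields[PICKUP_COL], TIME_FORMAT).hour
            distance = float(fields[DISTANCE_COL])
            extra = float(fields[SURCHARGE_COL])
        except ValueError:
            continue  # bad timestamp or non-numeric field
        if distance <= 0 or extra < 0:
            continue
        surcharge[hour] += extra
        miles[hour] += distance
    return {h: surcharge[h] / miles[h] for h in range(24) if miles[h] > 0}


def make_line(pickup, distance, extra):
    """Build a made-up 17-field line with only the relevant fields set."""
    fields = ["0"] * EXPECTED_FIELDS
    fields[PICKUP_COL] = pickup
    fields[DISTANCE_COL] = str(distance)
    fields[SURCHARGE_COL] = str(extra)
    return ",".join(fields)


sample = [
    make_line("2013-01-01 08:15:00", 2.0, 0.5),
    make_line("2013-01-01 08:45:00", 2.0, 0.5),
    make_line("2013-01-01 20:05:00", 1.0, 1.0),
    make_line("not a date", 3.0, 1.0),   # malformed timestamp, dropped
]

start = time.perf_counter()
ratios = hourly_profit_ratio(sample)
best_hour = max(ratios, key=ratios.get)
elapsed = time.perf_counter() - start
print(ratios)
print(f"best hour: {best_hour}, processing time: {elapsed:.6f}s")
```

Summing surcharge and miles per hour before dividing follows the wording of the task; computing a ratio per ride and averaging those would weight short rides more heavily.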
Submission Guidelines
● Naming Convention:
METCS777-Assignment1-[ProblemX-]FIRST+LASTNAME.[zip/pdf/py/ipynb] Where:
o ProblemX doesn’t apply to .[zip/pdf] files
o No space between first and lastname
● Folder Structure
o Create one PDF document that contains screenshots of the results of running all tasks. Explain the results clearly and precisely.
o Create one code file for each coding problem as follows:
METCS777-Assignment1-ProblemX-FIRST+LASTNAME.[py/ipynb]
o Please zip the folder containing all your code and document files (use .zip only!!!).
● For example, John Doe’s submission for Assignment 1 should be a single file, METCS777-Assignment1-JohnDoe.zip, which includes:
o METCS777-Assignment1-JohnDoe.pdf
o METCS777-Assignment1-Problem1-JohnDoe.py
o METCS777-Assignment1-Problem2-JohnDoe.py
Evaluation Criteria for Coding Tasks
Criteria | Excellent | Good | Fair | Poor | Points
Correctness | Code accurately completes all tasks | Code completes most tasks correctly | Code shows understanding but has inaccuracies | Code fails most tasks | 40%
Efficiency | Highly optimized code | Somewhat optimized code | Code works but not optimized | Inefficient code | 20%
Code Structure and Organization | Exceptionally well-organized code | Mostly organized code | Somewhat disorganized code | Poorly structured code | 20%
Error Handling and Data Cleaning | Robust error handling and data cleaning | Handles most data issues | Some issues with error handling | Poor error handling and data cleaning | 10%
Reporting Processing Time | Accurate processing time reported | Mostly accurate processing time | Significant inaccuracies in time reporting | Inaccurate or no time reporting | 10%
Total | | | | | 100%
Academic Misconduct Regarding Programming
In a programming class like this, there is sometimes a very fine line between “cheating” and acceptable and beneficial interaction between peers. Thus, it is very important that you fully understand what is and what is not allowed in terms of collaboration with your classmates. We want to be 100% precise, so that there can be no confusion.
The rule on collaboration and communication with your classmates is as follows. You cannot transmit code to, or receive code from, anyone in the class in any way, whether visually (by showing someone your code), electronically (by emailing, posting, or otherwise sending someone your code), verbally (by reading code to someone), or in any other way we have not yet imagined. Any other collaboration is acceptable.
You are not allowed to collaborate or communicate with people who are not your classmates, TAs, or instructor. This means posting questions of any nature to programming forums such as StackOverflow is strictly prohibited. As far as going to the web and using Google, we will apply the “two-line rule”. Go to any web page you like and do any search that you like. But you cannot take more than two lines of code from an external resource and include them in your assignment in any form. Note that changing variable names or otherwise transforming or obfuscating code you found on the web does not render the “two-line rule” inapplicable. It is still a violation to obtain more than two lines of code from an external resource and turn them in, whatever you do to those lines after you first obtain them.
Furthermore, you must always cite your sources. Add a comment to your code that includes the URL(s) that you consulted when constructing your solution. This turns out to be very helpful when you’re looking at something you wrote a while ago and you need to remind yourself what you were thinking.