MET CS777 Big Data Analytics with Python

Assignment 1


Description

The goal of this assignment is to write programs in plain Python (not PySpark) that answer data analysis questions about the given datasets.

Problem 1 (50 points)

The GroupLens research group (“Social computing research at the University of Minnesota”) has released movie rating datasets of different sizes on the grouplens.org website. Load the MovieLens 10M dataset, which consists of 10 million movie ratings. To download the data, go to grouplens.org and, under the “datasets” tab, download the “MovieLens 10M Dataset”, which is 63 MB (direct link: https://grouplens.org/datasets/movielens/10m/).

Tasks

a.   (5 points) Divide the data into five files of almost equal size, and use these five files in the rest of the assignment.

b.   (15 points) Sort the data from the highest-rated movie to the lowest-rated one. Measure how much time the sorting takes.

•    Part 1: Do not use any built-in sort function; write the sort function yourself.

•    Part 2: Use the built-in sort function.

c.    (5 points) Create a histogram of the movie ratings.

Measure how much time it takes to create the histogram.

d.   (10 points) The data contains more than 10M ratings of 10,681 movies by 71,567 users.

Create a histogram of the number of times each movie got rated.

Measure how much time it takes to create the histogram.

e.   (15 points) Choose the lowest three bins of the histogram in part c and create a histogram of movie ratings for these three bins. Do the same for the top three bins of the histogram.
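
Tasks a through d share one pattern: parse each rating line, process the records, and time the work. The following is a minimal sketch on a tiny in-memory sample, not a reference solution. It assumes the `UserID::MovieID::Rating::Timestamp` line layout for the ratings file, which this handout does not state. The function names (`parse_rating`, `split_into_chunks`, `merge_sort_desc`, `timed`) are hypothetical, and merge sort is only one of several possible hand-written sorts.

```python
import time
from collections import Counter

# Assumed line layout (not stated in the handout): UserID::MovieID::Rating::Timestamp
SAMPLE_LINES = [
    "1::122::5::838985046",
    "1::185::3.5::838983525",
    "2::122::4::868245777",
    "2::231::0.5::868245645",
    "3::185::3.5::1136075500",
    "3::122::4.5::1136075500",
    "4::231::3::844416936",
]


def parse_rating(line):
    """Return (movie_id, rating) from one '::'-separated line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(movie_id), float(rating)


def split_into_chunks(items, n_chunks=5):
    """Split a list into n_chunks consecutive parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def merge_sort_desc(records):
    """Hand-written merge sort on (movie_id, rating) pairs, highest rating first."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    left, right = merge_sort_desc(records[:mid]), merge_sort_desc(records[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][1] >= right[j][1]:  # >= keeps equal ratings in input order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


records = [parse_rating(line) for line in SAMPLE_LINES]
chunks = split_into_chunks(records)

own_sorted, own_secs = timed(merge_sort_desc, records)
builtin_sorted, builtin_secs = timed(sorted, records, key=lambda r: r[1], reverse=True)

# Histogram of rating values (task c) and of ratings per movie (task d).
rating_hist = Counter(rating for _, rating in records)
ratings_per_movie = Counter(movie_id for movie_id, _ in records)
count_hist = Counter(ratings_per_movie.values())

print(f"hand-written sort: {own_secs:.6f} s, built-in sort: {builtin_secs:.6f} s")
print(sorted(rating_hist.items()))
```

On the real dataset the same functions would be fed lines read from the ratings file, and each chunk would be written to its own file. `time.perf_counter()` is used because it is a monotonic clock intended for measuring short durations.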

Problem 2 (50 points)

Taxi Data set

The data set consists of New York City taxi trip reports from 2013, which were released under the Freedom of Information Law (FOIL) and made public by Chris Whong (https://chriswhong.com/open-data/foil_nyc_taxi/).

The data set itself is a simple text file. Each taxi trip report is a separate line in the file. Among other things, each trip report includes the starting point, the drop-off point, the corresponding timestamps, and information related to the payment. The reports are ordered by the time the trip ended, i.e., by drop-off timestamp. The attributes on each line appear in the order shown in Table 1. The data files are in comma-separated values (CSV) format.

Table 1: Taxi data set fields.

Obtaining the Dataset

The data set (93 MB compressed, 384 MB uncompressed), named taxi-data-sorted-small.csv.bz2, is attached to the assignment.

Tasks

Task 1 (10 points): Top-10 Active Taxis

Many different taxis have had multiple drivers. Write and execute a Python program that computes the top ten taxis that have had the largest number of drivers. Your output should be a set of (medallion, number of drivers) pairs.

Note: This is a real-world data set that might include wrongly formatted lines; for example, a line might not include all the fields. Clean up the data before the main processing: if a line is not correctly formatted, drop it and do not consider it.

Report the processing time of the task as well.
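
One way to combine the cleanup step with the count of distinct drivers per taxi is sketched below. It is a minimal sketch under stated assumptions, not a reference solution. Table 1 is not reproduced in this handout, so the field count (17) and the column positions of the medallion (0) and the driver identifier (1) are assumptions. The helper names `make_line`, `clean`, and `top_taxis_by_driver_count` are hypothetical.

```python
import heapq
import time
from collections import defaultdict

# Assumptions (Table 1 is not reproduced above): 17 comma-separated fields per
# line, with the medallion in column 0 and the driver identifier in column 1.
EXPECTED_FIELDS = 17
MEDALLION, DRIVER = 0, 1


def make_line(medallion, driver):
    """Build a well-formed sample line; the other 15 fields are placeholders."""
    return ",".join([medallion, driver] + ["0"] * (EXPECTED_FIELDS - 2))


SAMPLE_LINES = [
    make_line("taxiA", "driver1"),
    make_line("taxiA", "driver2"),
    make_line("taxiA", "driver2"),  # repeated driver counts once
    make_line("taxiB", "driver1"),
    "taxiC,driver3,too,short",      # malformed: dropped
    "",                             # empty: dropped
]


def clean(lines):
    """Yield field lists for lines with the expected field count and non-empty IDs."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) == EXPECTED_FIELDS and fields[MEDALLION] and fields[DRIVER]:
            yield fields


def top_taxis_by_driver_count(lines, n=10):
    """Return up to n (medallion, number of distinct drivers) pairs, largest first."""
    drivers_per_taxi = defaultdict(set)
    for fields in clean(lines):
        drivers_per_taxi[fields[MEDALLION]].add(fields[DRIVER])
    counts = ((taxi, len(drivers)) for taxi, drivers in drivers_per_taxi.items())
    return heapq.nlargest(n, counts, key=lambda pair: pair[1])


start = time.perf_counter()
top_taxis = top_taxis_by_driver_count(SAMPLE_LINES)
elapsed = time.perf_counter() - start
print(top_taxis, f"{elapsed:.6f} s")
```

Because `top_taxis_by_driver_count` accepts any iterable of lines, the compressed file could be streamed with the standard library's `bz2.open(path, "rt")` instead of being decompressed first. Using a set per medallion ensures that a driver who appears on many trips is counted once.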

Task 2 (20 points): Top-10 Best Drivers

We would like to figure out who the top 10 best drivers are in terms of their average earned money per minute spent carrying a customer. The total amount field is the total money earned on a trip. In the end, we are interested in computing a set of (driver, money per minute) pairs.

Report the processing time of the task as well.
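
The phrase “average earned money per minute” can be read in more than one way. The sketch below takes one interpretation, total money earned divided by total minutes driven per driver, and is not a reference solution. The column positions for the driver (1), the trip time in seconds (4), and the total amount (16) are assumptions, since Table 1 is not reproduced in this handout. The names `make_line` and `money_per_minute` are hypothetical.

```python
import heapq
from collections import defaultdict

# Assumed layout (Table 1 is not reproduced above): 17 fields, driver in
# column 1, trip time in seconds in column 4, total amount in column 16.
EXPECTED_FIELDS = 17
DRIVER, TRIP_SECS, TOTAL_AMOUNT = 1, 4, 16


def make_line(driver, trip_secs, total_amount):
    """Build a sample line with placeholder values in the unused columns."""
    fields = ["0"] * EXPECTED_FIELDS
    fields[DRIVER] = driver
    fields[TRIP_SECS] = trip_secs
    fields[TOTAL_AMOUNT] = total_amount
    return ",".join(fields)


SAMPLE_LINES = [
    make_line("driver1", "600", "20.0"),  # 10 minutes, $20
    make_line("driver1", "600", "10.0"),  # 10 minutes, $10
    make_line("driver2", "300", "15.0"),  # 5 minutes, $15
    make_line("driver3", "0", "8.0"),     # zero duration: dropped
    make_line("driver4", "abc", "8.0"),   # non-numeric duration: dropped
]


def money_per_minute(lines, n=10):
    """Return up to n (driver, money per minute) pairs, highest rate first."""
    totals = defaultdict(lambda: [0.0, 0.0])  # driver -> [money, minutes]
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != EXPECTED_FIELDS:
            continue
        try:
            seconds = float(fields[TRIP_SECS])
            amount = float(fields[TOTAL_AMOUNT])
        except ValueError:
            continue  # drop lines whose numeric fields do not parse
        if seconds <= 0 or amount < 0:
            continue  # drop trips that would give a meaningless rate
        totals[fields[DRIVER]][0] += amount
        totals[fields[DRIVER]][1] += seconds / 60.0
    rates = ((driver, money / minutes) for driver, (money, minutes) in totals.items())
    return heapq.nlargest(n, rates, key=lambda pair: pair[1])


best_drivers = money_per_minute(SAMPLE_LINES)
print(best_drivers)
```

Averaging the per-trip rates instead would weight a one-minute trip the same as a one-hour trip, so the two interpretations can rank drivers differently. Whichever one is used should be stated in the report.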

Task 3 (20 points): Best Time of the Day to Work on a Taxi

We would like to know which hour of the day is the best time for drivers, i.e., the hour with the highest profit per mile. Consider the surcharge amount in dollars for each taxi ride (without the tip amount) and the distance in miles, and sum them over the rides for each hour of the day (24 hours), using the pickup time for your calculation. The profit ratio is the surcharge in dollars divided by the travel distance in miles for each hour of the day.

Profit Ratio = (Surcharge Amount in US Dollars) / (Travel Distance in Miles)

We are interested in the hour of the day that has the highest profit ratio.

Report the processing time of the task as well.
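
The per-hour aggregation can be sketched as follows. This is an illustration under stated assumptions, not a reference solution. The column positions for the pickup time (2), the trip distance (5), and the surcharge (12), and the `YYYY-MM-DD HH:MM:SS` timestamp layout, are all assumptions, since Table 1 is not reproduced in this handout. The names `make_line` and `profit_ratio_by_hour` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumed layout (Table 1 is not reproduced above): 17 fields, pickup time in
# column 2, trip distance in miles in column 5, surcharge in column 12.
EXPECTED_FIELDS = 17
PICKUP_TIME, DISTANCE, SURCHARGE = 2, 5, 12
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def make_line(pickup, distance, surcharge):
    """Build a sample line with placeholder values in the unused columns."""
    fields = ["0"] * EXPECTED_FIELDS
    fields[PICKUP_TIME] = pickup
    fields[DISTANCE] = distance
    fields[SURCHARGE] = surcharge
    return ",".join(fields)


SAMPLE_LINES = [
    make_line("2013-01-01 08:15:00", "2.0", "0.5"),
    make_line("2013-01-01 08:45:00", "2.0", "0.5"),
    make_line("2013-01-01 20:05:00", "1.0", "1.0"),
    make_line("2013-01-02 20:30:00", "3.0", "1.0"),
    make_line("not a timestamp", "1.0", "0.5"),  # unparseable time: dropped
]


def profit_ratio_by_hour(lines):
    """Return {hour: total surcharge / total miles} over all valid rides."""
    sums = defaultdict(lambda: [0.0, 0.0])  # hour -> [surcharge, miles]
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != EXPECTED_FIELDS:
            continue
        try:
            hour = datetime.strptime(fields[PICKUP_TIME], TIME_FORMAT).hour
            miles = float(fields[DISTANCE])
            surcharge = float(fields[SURCHARGE])
        except ValueError:
            continue  # drop lines with malformed time or numbers
        if miles <= 0 or surcharge < 0:
            continue  # drop rides that cannot contribute a valid ratio
        sums[hour][0] += surcharge
        sums[hour][1] += miles
    return {hour: s / m for hour, (s, m) in sums.items()}


ratios = profit_ratio_by_hour(SAMPLE_LINES)
best_hour = max(ratios, key=ratios.get)
print(best_hour, ratios[best_hour])
```

Summing surcharge and distance per hour before dividing follows the task's instruction to sum up the rides for each hour. It also avoids dividing by the distance of any single ride.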

Submission Guidelines

● Naming Convention:

METCS777-Assignment1-[ProblemX-]FIRST+LASTNAME.[zip/pdf/py/ipynb], where:

o ProblemX does not apply to .[zip/pdf] files

o There is no space between the first and last name

● Folder Structure

o Create one document in pdf that has screenshots of running results of all tasks. Explain clearly and precisely the results.

o Create one code file for each coding problem as follows:

METCS777-Assignment1-ProblemX-FIRST+LASTNAME.[py/ipynb]

Please zip the folder containing all your code and document files (use .zip only!!!).

● For example, a sample submission of John Doe’s Assignment 1 should be one file, METCS777-Assignment1-JohnDoe.zip, which includes:

o METCS777-Assignment1-JohnDoe.pdf

o METCS777-Assignment1-Problem1-JohnDoe.py

o METCS777-Assignment1-Problem2-JohnDoe.py

Evaluation Criteria for Coding Tasks

Criteria | Excellent | Good | Fair | Poor | Points

Correctness | Code accurately completes all tasks | Code completes most tasks correctly | Code shows understanding but has inaccuracies | Code fails most tasks | 40%

Efficiency | Highly optimized code | Somewhat optimized code | Code works but not optimized | Inefficient code | 20%

Code Structure and Organization | Exceptionally well-organized code | Mostly organized code | Somewhat disorganized code | Poorly structured code | 20%

Error Handling and Data Cleaning | Robust error handling and data cleaning | Handles most data issues | Some issues with error handling | Poor error handling and data cleaning | 10%

Reporting Processing Time | Accurate processing time reported | Mostly accurate processing time | Significant inaccuracies in time reporting | Inaccurate or no time reporting | 10%

Total | | | | | 100%

Academic Misconduct Regarding Programming

In a programming class like this, there is sometimes a very fine line between “cheating” and acceptable and beneficial interaction between peers. Thus, it is very important that you fully understand what is and what is not allowed in terms of collaboration with your classmates. We want to be 100% precise, so that there can be no confusion.

The rule on collaboration and communication with your classmates is as follows: you cannot transmit code to or receive code from anyone in the class in any way: visually (by showing someone your code), electronically (by emailing, posting, or otherwise sending someone your code), verbally (by reading code to someone), or in any other way we have not yet imagined. Any other collaboration is acceptable.

You are not allowed to collaborate or communicate with people who are not your classmates (or your TAs or instructor). This means that posting questions of any nature to programming forums such as StackOverflow is strictly prohibited. As far as going to the web and using Google, we will apply the “two-line rule”. Go to any web page you like and do any search that you like. But you cannot take more than two lines of code from an external resource and include it in your assignment in any form. Note that changing variable names or otherwise transforming or obfuscating code you found on the web does not render the “two-line rule” inapplicable. It is still a violation to obtain more than two lines of code from an external resource and turn it in, whatever you do to those lines after you first obtain them.

Furthermore, you must always cite your sources. Add a comment to your code that includes the URL(s) that you consulted when constructing your solution. This turns out to be very helpful when you’re looking at something you wrote a while ago and you need to remind yourself what you were thinking.

