Assignment 2
Big Data Processing with SPARK RDDs
MET CS777
Description
The goal of this assignment is to explore Spark RDDs and to write PySpark RDD programs that answer data analysis questions about the given datasets.
Problem 1 (5 points)
What are the differences between the following RDD operations in terms of functionality and computational cost? Assign each operation a computational cost level from 1 (least costly) to 3 (most costly).
1. aggregateByKey()
2. reduceByKey()
3. groupByKey()
4. combineByKey()
You can find the Spark documentation online at https://spark.apache.org.
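As a reference for the call signatures only (not an answer to the question), the sketch below computes the same per-key sum with each of the four operations. It assumes `sc` is an existing SparkContext; the sample pairs and the function names `per_key_sums` and `expected_sums` are made up for illustration.

```python
from operator import add

# Made-up sample data: (key, value) pairs.
PAIRS = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]


def per_key_sums(sc, pairs=PAIRS):
    """Compute per-key sums four ways; `sc` is assumed to be a SparkContext."""
    rdd = sc.parallelize(pairs)
    return {
        # reduceByKey(func): merge values per key with one function.
        "reduceByKey": dict(rdd.reduceByKey(add).collect()),
        # aggregateByKey(zeroValue, seqFunc, combFunc)
        "aggregateByKey": dict(rdd.aggregateByKey(0, add, add).collect()),
        # combineByKey(createCombiner, mergeValue, mergeCombiners)
        "combineByKey": dict(rdd.combineByKey(lambda v: v, add, add).collect()),
        # groupByKey() collects all values per key, then we sum them.
        "groupByKey": dict(rdd.groupByKey().mapValues(sum).collect()),
    }


def expected_sums(pairs=PAIRS):
    """Plain-Python reference result for checking the Spark output."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals
```

All four entries returned by `per_key_sums(sc)` should equal `expected_sums()`; how they differ in cost is what the question asks you to work out.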
Problem 2 (5 points)
Name at least three differences between Spark and Hadoop MapReduce.
Problem 3 (5 points)
How does Spark run an application, and what are the functions of the driver layer? Explain from the perspective of the Spark architecture.
Problem 4 (5 points)
What are the differences between running on a multi-core computer and running in a multi-worker/executor environment? Provide a list of the advantages and disadvantages of each approach.
Problem 5 (10 points)
Why are RDDs immutable? Is this immutability a design flaw in RDDs, or does it offer some advantages?
Problem 6 (10 points)
Spark transformations are categorized into narrow transformations and wide transformations. Referencing the Spark documentation, explain the differences between these two types of transformations.
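To make the two categories concrete, here is a minimal sketch that chains one narrow transformation (`map`) with one wide transformation (`reduceByKey`) and returns the lineage description so the shuffle boundary can be inspected. It assumes `sc` is an existing SparkContext; `word_to_pair` and `count_words` are hypothetical names.

```python
def word_to_pair(word):
    """Per-record function for map(): needs no data from other partitions."""
    return (word.lower(), 1)


def count_words(sc, words):
    """`sc` is assumed to be a SparkContext; `words` is a list of strings."""
    # Narrow: each output partition depends on exactly one input partition.
    pairs = sc.parallelize(words).map(word_to_pair)
    # Wide: records with the same key must be brought together (a shuffle).
    counts = pairs.reduceByKey(lambda a, b: a + b)
    # toDebugString() describes the RDD lineage, including shuffle stages.
    lineage = counts.toDebugString()
    return dict(counts.collect()), lineage
```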
Problem 7 (10 points)
List 10 Spark RDD transformation operations, each with a one-line example.
List 5 Spark RDD action operations, each with a one-line example.
Note: For the following coding problems, please use Spark RDDs only.
Problem 8 (10 points)
Given the data file assignment2-student-data.csv, which consists of the following columns:
- Row number
- First name of student
- Last name of student
- Course number
- Grade
Write a Spark program to calculate:
1. Min grade of each student
2. Max grade of each student
3. GPA
4. Number of courses taken
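Since the rubric rewards data cleaning, one possible first step is a parser that turns each line into a record and rejects malformed rows. The sketch below covers only that step, not the aggregation. It assumes the file is comma-delimited with exactly the five columns listed above and that the grade is numeric; neither is confirmed by this handout, and `parse_student_line` and `load_students` are hypothetical names.

```python
import csv


def parse_student_line(line):
    """Return ((first, last), course, grade) or None for a malformed row.

    Assumes comma-delimited rows with five columns and a numeric grade.
    """
    rows = list(csv.reader([line]))
    if not rows or len(rows[0]) != 5:
        return None
    _row_number, first, last, course, grade = (f.strip() for f in rows[0])
    try:
        grade_value = float(grade)
    except ValueError:
        # Also skips a header row, if the file has one.
        return None
    return ((first, last), course, grade_value)


def load_students(sc, path):
    """`sc` is assumed to be a SparkContext; returns an RDD of clean records."""
    return (
        sc.textFile(path)
        .map(parse_student_line)
        .filter(lambda record: record is not None)
    )
```

Keying each record by `(first, last)` keeps students with the same first name apart in later by-key operations.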
Problem 9 (20 points)
Estimating the area of a circle:
Write a Spark program to estimate the area of the unit circle by "throwing darts" at it. Assume you don’t know how to calculate the area of a circle in closed form, but you do know how to calculate the area of a square. Throw random darts (points) into the 2-by-2 square from (-1, -1) to (1, 1) and count how many fall inside the unit circle, that is, the circle of radius one centered at the origin. That fraction can be used to estimate the area of the unit circle.
(Hint: Generate random numbers as the coordinates of each point within the square shown in the illustration below. Given the number of points that land inside the circle, the total number of throws, and the area of the square, you can estimate the area of the circle accordingly.)
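The arithmetic behind the hint can be sketched in plain Python, without Spark, as follows. The estimate is the area of the square times the fraction of darts that land inside the circle. All function names here are hypothetical, and the local loop stands in for whatever RDD pipeline you build.

```python
import random

SQUARE_AREA = 4.0  # the 2-by-2 square from (-1, -1) to (1, 1)


def is_inside_unit_circle(point):
    """True if the point (x, y) lies within distance 1 of the origin."""
    x, y = point
    return x * x + y * y <= 1.0


def area_from_counts(inside, total):
    """Estimated circle area = square area * (darts inside / darts thrown)."""
    return SQUARE_AREA * inside / total


def estimate_area_locally(num_darts, seed=0):
    """Single-machine illustration of the estimate; not a Spark program."""
    rng = random.Random(seed)
    inside = sum(
        is_inside_unit_circle((rng.uniform(-1, 1), rng.uniform(-1, 1)))
        for _ in range(num_darts)
    )
    return area_from_counts(inside, num_darts)
```

In an RDD version, the two counts passed to `area_from_counts` could come from a `filter` with `is_inside_unit_circle` followed by `count`.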
Problem 10 (20 points)
Load the data from a file called “assignment2-customer-orders.csv”.
Write a Spark program to report:
- Top 5 customers who spent the most.
- Among the top 10 customers who spent the most, the item that was purchased the most.
Submission Guidelines
● Naming Convention:
METCS777-Assignment2-[ProblemX-]FIRST+LASTNAME.[pdf/py/ipynb] Where:
o The ProblemX part doesn’t apply to .pdf files
o No space between first name and last name
● Files:
o Create one PDF document that has the answers to all problems and screenshots of the running results of all coding problems. Explain the results clearly and precisely.
o Create one code file for each coding problem as follows:
METCS777-Assignment2-ProblemX-FIRST+LASTNAME.[py/ipynb]
o Please submit each and every file separately (DO NOT ZIP them this time!!!).
● For example, John Doe’s submission for Assignment 2 should consist of the following files:
o METCS777-Assignment2-JohnDoe.pdf
o METCS777-Assignment2-Problem8-JohnDoe.ipynb
o METCS777-Assignment2-Problem9-JohnDoe.ipynb
o METCS777-Assignment2-Problem10-JohnDoe.ipynb
| Criteria | Excellent | Good | Fair | Poor | Points |
|---|---|---|---|---|---|
| Correctness | Code accurately completes all tasks | Code completes most tasks correctly | Code shows understanding but has inaccuracies | Code fails most tasks | 40% |
| Efficiency | Highly optimized code | Somewhat optimized code | Code works but not optimized | Inefficient code | 20% |
| Code Structure and Organization | Exceptionally well-organized code | Mostly organized code | Somewhat disorganized code | Poorly structured code | 20% |
| Error Handling and Data Cleaning | Robust error handling and data cleaning | Handles most data issues | Some issues with error handling | Poor error handling and data cleaning | 10% |
| Reporting Processing Time | Accurate processing time reported | Mostly accurate processing time | Significant inaccuracies in time reporting | Inaccurate or no time reporting | 10% |
| Total | | | | | 100% |
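For the "Reporting Processing Time" criterion, one possible approach is a small wall-clock helper like the sketch below. Because Spark transformations are lazy, the timed call should be an action (for example `count` or `collect`), otherwise almost no work is measured. The name `timed` is hypothetical, and `rdd` in the usage comment stands for any RDD you have built.

```python
import time


def timed(action, *args, **kwargs):
    """Run `action` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = action(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with a hypothetical RDD named `rdd`:
#   count, seconds = timed(rdd.count)
#   print(f"count={count} in {seconds:.3f} s")
```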
Academic Misconduct Regarding Programming
In a programming class like this, there is sometimes a very fine line between “cheating” and acceptable and beneficial interaction between peers. Thus, it is very important that you fully understand what is and what is not allowed in terms of collaboration with your classmates. We want to be 100% precise, so that there can be no confusion.
The rule on collaboration and communication with your classmates is as follows: you cannot transmit code to, or receive code from, anyone in the class in any way: visually (by showing someone your code), electronically (by emailing, posting, or otherwise sending someone your code), verbally (by reading code to someone), or in any other way we have not yet imagined. Any other collaboration is acceptable.
You may not collaborate or communicate with people who are not your classmates (or your TAs or instructor). This means that posting questions of any nature to programming forums such as StackOverflow is strictly prohibited. As for going to the web and using Google, we will apply the “two-line rule”. Go to any web page you like and do any search that you like. But you cannot take more than two lines of code from an external resource and include it in your assignment in any form. Note that changing variable names or otherwise transforming or obfuscating code you found on the web does not render the “two-line rule” inapplicable. It is still a violation to obtain more than two lines of code from an external resource and turn it in, whatever you do to those lines after you first obtain them.
Furthermore, you must always cite your sources. Add a comment to your code that includes the URL(s) that you consulted when constructing your solution. This turns out to be very helpful when you’re looking at something you wrote a while ago and you need to remind yourself what you were thinking.