MET CS777 Big Data Processing with SPARK RDDs

Assignment 2

Description

The goal of this assignment is to explore Spark RDDs and to write PySpark RDD programs that answer data analysis questions about the provided datasets.

Problem 1 (5 points)

What are the differences between the following RDD operations in terms of functionality and computational cost? Assign each operation a computational complexity level from 1 (least costly) to 3 (most costly).

1.   aggregateByKey()

2.   reduceByKey()

3.   groupByKey()

4.   combineByKey()

You can find the Spark documentation online at https://spark.apache.org.

Problem 2 (5 points)

Name at least three differences between Spark and Hadoop MapReduce.

Problem 3 (5 points)

How does Spark run an application, and what are the functions of the driver layer? Explain from the perspective of the Spark architecture.

Problem 4 (5 points)

What are the differences between running on a multi-core computer and running in a multi-worker/executor environment? Provide a list of the advantages and disadvantages associated with each approach.

Problem 5 (10 points)

Why are RDDs immutable? Is this immutability a design flaw in RDDs, or does it offer some advantages?

Problem 6 (10 points)

Spark transformations are categorized into narrow transformations and wide transformations. Referencing the Spark documentation, explain the differences between these two types of transformations.

Problem 7 (10 points)

List 10 Spark RDD transformation operations, each with a one-line example.

List 5 Spark RDD action operations, each with a one-line example.

Note: For the following coding problems, please use Spark RDDs only.

Problem 8 (10 points)

Given the data file assignment2-student-data.csv, which consists of the following columns:

-     Row number

-     First name of student

-     Last name of student

-     Course number

-     Grade

Write a Spark program to calculate:

1.   Min grade of each student

2.   Max grade of each student

3.   GPA

4.   Number of courses taken
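
One way to organize these four results is to carry a single accumulator per student that holds the minimum, maximum, sum, and count together. The sketch below shows that accumulator logic in plain Python, without Spark, so it can be checked locally. The three helper functions mirror the createCombiner, mergeValue, and mergeCombiners arguments that combineByKey() takes. It assumes that grades are already numeric and that GPA means the arithmetic mean of a student's grades; the handout defines neither, and the sample rows and function names are made up for illustration.

```python
def create_combiner(grade):
    # First grade seen for a student: (min, max, total, count).
    return (grade, grade, grade, 1)


def merge_value(acc, grade):
    # Fold one more grade into an existing accumulator.
    lo, hi, total, count = acc
    return (min(lo, grade), max(hi, grade), total + grade, count + 1)


def merge_combiners(a, b):
    # Combine accumulators that were built on two different partitions.
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])


def summarize(partitions):
    """Aggregate (first, last, course, grade) rows held in several partitions."""
    merged = {}
    for rows in partitions:
        local = {}
        for first, last, _course, grade in rows:
            key = (first, last)
            if key in local:
                local[key] = merge_value(local[key], grade)
            else:
                local[key] = create_combiner(grade)
        for key, acc in local.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    # ASSUMPTION: GPA is the mean of the numeric grades.
    return {
        key: {"min": lo, "max": hi, "gpa": total / count, "courses": count}
        for key, (lo, hi, total, count) in merged.items()
    }


# Hypothetical sample rows, split across two "partitions".
sample = [
    [("Ann", "Lee", "C101", 4.0), ("Bob", "Ray", "C101", 3.0)],
    [("Ann", "Lee", "C102", 3.0), ("Ann", "Lee", "C103", 2.0)],
]
report = summarize(sample)
print(report[("Ann", "Lee")])
```

Keeping all four values in one tuple means the data is traversed once per student rather than once per statistic.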

Problem 9 (20 points)

Estimating the area of a circle:

Write a Spark program to estimate the area of the unit circle by "throwing darts" at it. Assume you don’t know how to calculate the area of a circle in closed form, but you do know how to calculate the area of a square. Throw random darts (points) into the 2-by-2 square from (-1, -1) to (1, 1) and count how many fall inside the unit circle, a circle with a radius of one. That fraction can be used to estimate the area of the unit circle.

(Hint: Generate random numbers as coordinates for each point within the square shown in the illustration below. Given the count of points that land in the circle, the total number of throws, and the area of the square, you can estimate the area of the circle.)
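
The arithmetic in the hint can be sketched in plain Python before moving it to RDDs: the estimate is the area of the square (4) multiplied by the fraction of darts that land inside the circle. This is a minimal local sketch, not a Spark solution; the function name, the seed parameter, and the number of darts are illustrative choices.

```python
import random


def estimate_unit_circle_area(num_darts, seed=None):
    """Estimate the unit circle's area from random points in [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # a seeded generator makes the run repeatable
    hits = 0
    for _ in range(num_darts):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # the dart landed inside the unit circle
            hits += 1
    square_area = 2.0 * 2.0
    return square_area * hits / num_darts


print(estimate_unit_circle_area(200_000, seed=42))
```

The estimate gets closer to the true area as the number of darts grows, which is what makes the problem a good fit for distributing the throws across workers.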


Problem 10 (20 points)

Load the data from a file called “assignment2-customer-orders.csv”.

Write a Spark program to report:

-     Top 5 customers who spent the most.

-     Considering the top 10 customers who spent the most, which item was purchased the most.
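
The two reports can be prototyped on a small in-memory list before writing the RDD version. The sketch below is plain Python and rests on assumptions the handout does not confirm: each order is taken to be a (customer_id, item_id, amount) triple, and "purchased the most" is read as the largest number of order rows. Check both against the actual file before relying on them; the function names are hypothetical.

```python
import heapq
from collections import Counter, defaultdict


def top_spenders(orders, n):
    """Return the n (customer_id, total_spent) pairs with the largest totals."""
    totals = defaultdict(float)
    for customer_id, _item_id, amount in orders:
        totals[customer_id] += amount
    return heapq.nlargest(n, totals.items(), key=lambda pair: pair[1])


def most_purchased_item(orders, customers):
    """Return (item_id, order_count) for the item ordered most by the given customers."""
    wanted = set(customers)
    counts = Counter(item for cust, item, _amount in orders if cust in wanted)
    return counts.most_common(1)[0] if counts else None


# ASSUMPTION: hypothetical rows in (customer_id, item_id, amount) form.
orders = [(1, "A", 10.0), (2, "B", 5.0), (1, "B", 20.0), (3, "A", 1.0), (2, "B", 7.5)]
best = top_spenders(orders, 2)
print(best)
print(most_purchased_item(orders, [cust for cust, _total in best]))
```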

Submission Guidelines

● Naming Convention:

METCS777-Assignment2-[ProblemX-]FIRST+LASTNAME.[pdf/py/ipynb] Where:

o ProblemX does not apply to .pdf files

o No space between first and last name

● Files:

o Create one PDF document that contains the answers to all problems and screenshots of the results of running each coding problem. Explain the results clearly and precisely.

o Create one code file for each coding problem as follows:

METCS777-Assignment2-ProblemX-FIRST+LASTNAME.[py/ipynb]

Please submit each and every file separately (DO NOT ZIP them this time!!!).

● For example, sample submission of John Doe’s Assignment 2 should be the following files:

o METCS777-Assignment2-JohnDoe.pdf

o METCS777-Assignment2-Problem8-JohnDoe.ipynb

o METCS777-Assignment2-Problem9-JohnDoe.ipynb

o METCS777-Assignment2-Problem10-JohnDoe.ipynb
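
To double-check file names against the convention above, a small helper can build them. This is an illustrative sketch only; the function name and its parameters are hypothetical and not part of the assignment.

```python
def submission_filename(first, last, ext, problem=None):
    """Build a file name following the naming convention in this handout."""
    if ext not in ("pdf", "py", "ipynb"):
        raise ValueError("extension must be pdf, py, or ipynb")
    if ext == "pdf" and problem is not None:
        raise ValueError("ProblemX does not apply to pdf files")
    if ext != "pdf" and problem is None:
        raise ValueError("code files need a problem number")
    # No space between first and last name.
    name = (first + last).replace(" ", "")
    part = f"Problem{problem}-" if problem is not None else ""
    return f"METCS777-Assignment2-{part}{name}.{ext}"


print(submission_filename("John", "Doe", "pdf"))
print(submission_filename("John", "Doe", "ipynb", problem=8))
```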

Evaluation Criteria for Coding Tasks


Criteria | Excellent | Good | Fair | Poor | Points
Correctness | Code accurately completes all tasks | Code completes most tasks correctly | Code shows understanding but has inaccuracies | Code fails most tasks | 40%
Efficiency | Highly optimized code | Somewhat optimized code | Code works but not optimized | Inefficient code | 20%
Code Structure and Organization | Exceptionally well-organized code | Mostly organized code | Somewhat disorganized code | Poorly structured code | 20%
Error Handling and Data Cleaning | Robust error handling and data cleaning | Handles most data issues | Some issues with error handling | Poor error handling and data cleaning | 10%
Reporting Processing Time | Accurate processing time reported | Mostly accurate processing time | Significant inaccuracies in time reporting | Inaccurate or no time reporting | 10%
Total | | | | | 100%
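
For the processing-time criterion, one simple approach is to wrap the work in a wall-clock timer. Spark transformations are lazy, so the timed call should include an action (for example collect() or count()); otherwise only the construction of the plan is measured. The helper below is a plain-Python sketch with a hypothetical name, demonstrated on an ordinary function rather than a Spark job.

```python
import time


def timed(action, *args, **kwargs):
    """Run action(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in a Spark program this would be a call that triggers an action.
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, processing time={seconds:.4f} s")
```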

Academic Misconduct Regarding Programming

In a programming class like this, there is sometimes a very fine line between “cheating” and acceptable and beneficial interaction between peers. Thus, it is very important that you fully understand what is and what is not allowed in terms of collaboration with your classmates. We want to be 100% precise, so that there can be no confusion.

The rule on collaboration and communication with your classmates is as follows: you cannot transmit code to, or receive code from, anyone in the class in any way: visually (by showing someone your code), electronically (by emailing, posting, or otherwise sending someone your code), verbally (by reading code to someone), or in any other way we have not yet imagined. Any other collaboration is acceptable.

You are not allowed to collaborate or communicate with people who are not your classmates (or your TAs or instructor). This means posting any questions of any nature to programming forums such as StackOverflow is strictly prohibited. As far as going to the web and using Google, we will apply the “two-line rule”. Go to any web page you like and do any search that you like. But you cannot take more than two lines of code from an external resource and include it in your assignment in any form. Note that changing variable names or otherwise transforming or obfuscating code you found on the web does not render the “two-line rule” inapplicable. It is still a violation to obtain more than two lines of code from an external resource and turn it in, whatever you do to those lines after you first obtain them.

Furthermore, you must always cite your sources. Add a comment to your code that includes the URL(s) that you consulted when constructing your solution. This turns out to be very helpful when you’re looking at something you wrote a while ago and you need to remind yourself what you were thinking.
