Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
Assignment 2
Instructions - Read this first!
This is an individual homework assignment. This means that:
- You may discuss the problems in this assignment with other students in this course and your instructor/TA, but YOUR WORK MUST BE YOUR OWN.
- Do not show other students code or your own work on this assignment.
- You may consult external references, but not actively receive help from individuals not involved in this course.
- Cite all references outside of the course you used, including conversations with other students which were helpful. (This helps us give credit where it is due!). All references must use a commonly accepted reference format, for example, APA or IEEE (or another citation style of your choice).
If any of these rules seem ambiguous, please check with your professors for help interpreting them.
Please complete this homework assignment within the Jupypter notebook environment. You may use Anaconda at home or https://portal.cloudlabs.ai/. However, all submitted code must be runnable via a cloudlabs vm.
Submission instructions
Your submission will be graded manually. To ensure that everything goes smoothly, please follow these instructions to prepare your notebook for submission to the D2L Dropbox for Assignment 1:
- Please remove any print statments used to test your work (you can comment them out)
- Please provide your solutions where asked; please do not alter any other parts of this notebook.
- If you need to add cells to test your code please move them to the end of the notebook before submission- or you may included your commented out answers and tests in the cells provided.
In this assignment, we will focus on familiarizing you with using Python to work with files, and performing basic inferences on data. Included should be a file "all_alpha_25.csv" which comes from the official u.s. government source for fuel economy information (Datafile3Datafile3 2025 Guide)
Question 1 (1 Point)
In the cell below, use the csv package and open the csv file. Use a for loop and the readlines() method to read each line of the file and print it out. Make sure you're using the correct file.
Question 2 (1 Point)
In the cell below, using mysql.connector, create a function which can be used to connect to your vm, print out the databases you have access to.
REMEMBER: Handle your errors! If something goes wrong, we don't want production code crashing.
Question 3 (1 Point)
Now that we have a database, lets go ahead and create a table for storing the all_alpha_25.csv data into the database.
Question 4 (2 Points)
Now that the table exists, create a function to take your opened csv file, and insert the values into your newly created table. Remember to clean your data!
Let's sort the data. In python we might use the built-in sorted method like so:
Explanation (For those who are curious)
We can break the function call above into smaller pieces. sorted() is being given two arguments: the first being the 2D array you created previously. But what is the second argument? the key= is telling sorted to use a key comparison function. A key comparison function is a function which returns a value that can specify the sort order of an object. Under the hood python is going to determine the order of your objects similar to the following pseudo code
if item_1 > item_2: store item 2 first else: store item 1 firstIf you're familiar with programming you'll know that sorting is a lot more complicated than this, but it gets the point across. We can tell python how to store items by specifying if items are greater than, equal to, or less than one another. So how does lambda x: (x[0], x[1]) achieve this?
Firstly, lambda x: x+1 is just syntax for specifying a function that you don't want to give a name to. It is identical to
def _(x): return x+1except we don't bother keeping a name for it because we only need it for this very specific use-case. So what does lambda x: (x[0], x[1]) mean? It means the following
def _(x): return (x[0],x[1])Essentially it is taking a list, and returning the first and second item in that list as a pair.
This is useful to us because when python is checking the order of every item it will execute your function on the item to get the keys you want to order by. If we look at our first example of how python sorts, we could include the use of our lambda function like so:
if lambdaFunction(item_1) > lambdaFunction(item_2): store item 2 first else store item 1 firstMore specifically, when comparing pairs of things, Python compares the first item in a pair, then the second item in a pair like so:
if lambdaFunction(item_1)[0] > lambdaFunction(item_2)[0]: store item 2 first else if lambdaFunction(item_1)[0] < lambdaFunction(item_2)[0]: store item 1 first else: if lambdaFunction(item_1)[0] > lambdaFunction(item_2)[0]: store item 2 first else: store item 1 firstSo by specifying lambda x: (x[0], x[1]) we are saying to Python that we want to use a custom compare, and that we want to first compare the [0]th element of each item, then [1]st element. We then want them sorted by the order in which they compare.
This gives us a lot of power with regards to how we want to sort things. For example, try to guess what sorted(your_2D_array, key=lambda x: 1) does. It doesn't do anything! this is because we are saying that when we want to check which item is greater than one another, our compare function should return that both items are 1 and so they are equal. If every item in a list is equal to every other item then there's nothing to sort.
Question 5 (2 Points)
Fortunately, a DBMS can abstract a lot of those details away for us. It's important to remember that everything in a database is a set meaning that there is no inherent order in how it is stored. If we simply SELECT from our database we have to assume that the values are returned in a random order. However, we can use ORDER BY statement to specify the order we want. In the following cell, retrieve all the values in your database first by the Model attribute, then the City MPG, Hwy MPG, and Cmb MPG attributes. Print the result to the console.
Question 1 (1 Points)
Now that we have our data in a format we can use, let's take a look at the contents of the dataset. There are lots of things we could infer from this data. Write 5 inferences you could make from the data. For example, a possible answer might be
- The average Air Pollution Score of 4 Cyclinder vehicles.
Use the markdown cell below to write your 5 inferences (you cannot use mine)
Type Markdown and LaTeX: α2
Question 2 (10 Points)
In the following cell, write functions to query the database and calulate the values for each of your answers from above.
Question 3 (2 Points)
A metric I am interested in is the MPG for vehicles. In the following cell write a function which asks your database to calculate the average MPG for each vehicle class, for each MPG category using SQL.
Make your function return the following format:
{"mpgs": [ {"vehicle_class":"Vehicle_Class","avg_city":0,"avg_hwy":0,"avg_bmb":0, }, ] }Question 4 (1 Point)
What's interesting about averages, however, is that it is not so straigthforward as summing the values, and dividing by the number of elements. This question will show you why this may lead to erroneous conclusions.
Say I had 3 vehicles. I drive each for the same number of miles. The first vehicle has a mpg of 30, the second, a mpg of 10, and the third, a mpg of 25. The following code shows a quick average calculation for the average mpg I would use to drive the full 30 miles. What we are calculating is what's called the arithmetic mean.
Question 5 (1 Point)
However, let's break this down a bit. Assume that for each vehicle, I drive 50km. Calculate the total number of gallons I would consume after driving each vehicle 50km
Question 6 (1 Point)
Now calculate the average mpg I get by dividing the total number of miles I drove divided by the number of gallons I consumed.
Question 7 (3 Point)
Is the suspected average the same as the calulated average? Why do you think this? Use the markdown cell below to answer.
Type Markdown and LaTeX: α2
What happened is that the values I am wanting to average are reciprocal values (i.e. there is a numerator, and a denominator in the units). Notice how in question 4 you had to do some unit conversion first in order to calculate the number of gallons consumed. For vehicle 1, for example, rather than calculate (30 mpg) * 50 miles, we had to reciprocate the value to (1/ 30)mpg * 50 miles to get the units correct.
Calculating averages like this is so common it has it's own name: The Harmonic Mean.
Knowing that the harmonic mean is what is used to recirprocal units, how would we achieve this in SQL? The AVG function in sql only returns the arithmetic mean so we can't use that.
What we can do, however, is write a procedure for sql to use. These are called Stored Procedures.
Below I've created code to insert a stored procedure into the database. When you run this code, the mysql connector will tell your database to create a procedure called harmonic_mean which you will then be able to use in your queries exactly like you would use the AVG function.
Question 8 (1 Points)
Now use the harmonic_mean udf to calculate the harmonic averages of the mpgs. This will require learning how to invoke a stored procedure via mysql.connector!
Make your function return the following format:
{"mpgs": [ {"vehicle_class":"Vehicle_Class","h_avg_city":0,"h_avg_hwy":0,"h_avg_bmb":0, }, ] }The arithmetic mean can be expressed as
The harmonic mean can be expressed as
A third type of mean known as the geometric mean can be expressed as
Or, in other words
When using the arithmetic mean we are considering values which get added together. When using the harmonic mean we are considering values which get reciprocaled then added together. The geometric mean is used when we want to average values which get multiplied together. For example, interest rates would be something where a geometric mean could be used since they are values which compound.
interest is calculated as where
- A = Final amount
- P = Principal amount
- r = Interest rate
- n = number of compounding periods per unit of time
- t = time.
In this example, the calculation can be expanded to thus this would require a geometric mean.
Question 9 (3 Points)
Give two real-world examples for when using the arithmetic mean would be appropriate, two real-world examples for when using the harmonic mean would be appropriate, and two real-world examples for when using the geometric mean would be appropriate
Type Markdown and LaTeX: α2
Question 1 (2 Points)
Describe something you struggled with on this assignment, and why you think you struggled with it. (This gives us an opportunity to see where students are struggling and how we can help; answering this as descriptively as possible is in your best interest)
Type Markdown and LaTeX: α2
Question 2 (2 Points)
Name one thing about data science you learned in this assignment. Also name one thing you learned about SQL in particular in this assignment. Be descriptive; single word answers will not receive marks.
Type Markdown and LaTeX: α2
Question 3 (3 Points)
In 100-200 workds, How do you think this assignment prepares you for a career as a data scientist?
Type Markdown and LaTeX: α2