Python to work

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

Assignment 2

Instructions - Read this first!

This is an individual homework assignment. This means that:

  • You may discuss the problems in this assignment with other students in this course and your instructor/TA, but YOUR WORK MUST BE YOUR OWN.
  • Do not show other students code or your own work on this assignment.
  • You may consult external references, but not actively receive help from individuals not involved in this course.
  • Cite all references outside of the course you used, including conversations with other students which were helpful. (This helps us give credit where it is due!). All references must use a commonly accepted reference format, for example, APA or IEEE (or another citation style of your choice).

If any of these rules seem ambiguous, please check with your professors for help interpreting them.

Please complete this homework assignment within the Jupypter notebook environment. You may use Anaconda at home or https://portal.cloudlabs.ai/. However, all submitted code must be runnable via a cloudlabs vm.

Submission instructions

Your submission will be graded manually. To ensure that everything goes smoothly, please follow these instructions to prepare your notebook for submission to the D2L Dropbox for Assignment 1:

  • Please remove any print statments used to test your work (you can comment them out)
  • Please provide your solutions where asked; please do not alter any other parts of this notebook.
  • If you need to add cells to test your code please move them to the end of the notebook before submission- or you may included your commented out answers and tests in the cells provided.

In this assignment, we will focus on familiarizing you with using Python to work with files, and performing basic inferences on data. Included should be a file "all_alpha_25.csv" which comes from the official u.s. government source for fuel economy information (Datafile3Datafile3 2025 Guide)

Part A: Setup (7 Points)

Question 1 (1 Point)

In the cell below, use the csv package and open the csv file. Use a for loop and the readlines() method to read each line of the file and print it out. Make sure you're using the correct file.

importcsv## PLACE YOUR CODE HERE

Important!

While pandas can be incredibly convenient, it's important to get an understanding for what is happening "under the hood". For the rest of this assignment, we're going to do things through mysql.connector.

importmysql.connectorfrommysql.connectorimportError## ALSO BE SURE TO ADD YOUR DB USERNAME AND PASSWORD HEREPASSWORD="your_password"

You are not allowed to import any other packages for this assignment

Question 2 (1 Point)

In the cell below, using mysql.connector, create a function which can be used to connect to your vm, print out the databases you have access to.

REMEMBER: Handle your errors! If something goes wrong, we don't want production code crashing.

try:connection=mysql.connector.connect(host="localhost",user="student",password=PASSWORD)# Add code here to print your databasesexceptErrorase:print("Error while connecting to MySQL",e)finally:ifconnection.is_connected():cursor.close()# This is a hint! Since cursor isn't defined you'll obviously need one somewhereconnection.close()

Question 3 (1 Point)

Now that we have a database, lets go ahead and create a table for storing the all_alpha_25.csv data into the database.

defcreate_table(db_name,table_name):try:connection=mysql.connector.connect(host='localhost',user='student',password=PASSWORD,database=db_name)ifconnection.is_connected():cursor=connection.cursor()create_table_query=f"""PLACE YOUR QUERY HERE"""cursor.execute(create_table_query)connection.commit()print("Table 'car_emissions' created successfully.")exceptErrorase:print(f"Error while connecting to MySQL:{e}")finally:ifconnection.is_connected():cursor.close()connection.close()create_table("student","table1")

Question 4 (2 Points)

Now that the table exists, create a function to take your opened csv file, and insert the values into your newly created table. Remember to clean your data!

definsert_data(db_name,table_name,csv_file_path):try:connection=mysql.connector.connect(host='localhost',user='student',password=PASSWORD,database=db_name)ifconnection.is_connected():cursor=connection.cursor()insert_query=f"""PLACE YOUR QUERY HERE"""withopen(csv_file_path,mode='r')ascsv_file:csv_reader=csv.reader(csv_file)next(csv_reader)forrowincsv_reader:cursor.execute(insert_query,tuple(row))connection.commit()print("Data inserted successfully.")exceptErrorase:print(f"Error while connecting to MySQL:{e}")finally:ifconnection.is_connected():cursor.close()connection.close()insert_data("student","table1","all_alpha_25.csv")

Let's sort the data. In python we might use the built-in sorted method like so:

# after_sort = sorted(array, key=lambda x: (x[0], x[1]))

Explanation (For those who are curious)

We can break the function call above into smaller pieces. sorted() is being given two arguments: the first being the 2D array you created previously. But what is the second argument? the key= is telling sorted to use a key comparison function. A key comparison function is a function which returns a value that can specify the sort order of an object. Under the hood python is going to determine the order of your objects similar to the following pseudo code

if item_1 > item_2: store item 2 first else: store item 1 first

If you're familiar with programming you'll know that sorting is a lot more complicated than this, but it gets the point across. We can tell python how to store items by specifying if items are greater than, equal to, or less than one another. So how does lambda x: (x[0], x[1]) achieve this?

Firstly, lambda x: x+1 is just syntax for specifying a function that you don't want to give a name to. It is identical to

def _(x): return x+1

except we don't bother keeping a name for it because we only need it for this very specific use-case. So what does lambda x: (x[0], x[1]) mean? It means the following

def _(x): return (x[0],x[1])

Essentially it is taking a list, and returning the first and second item in that list as a pair.

This is useful to us because when python is checking the order of every item it will execute your function on the item to get the keys you want to order by. If we look at our first example of how python sorts, we could include the use of our lambda function like so:

if lambdaFunction(item_1) > lambdaFunction(item_2): store item 2 first else store item 1 first

More specifically, when comparing pairs of things, Python compares the first item in a pair, then the second item in a pair like so:

if lambdaFunction(item_1)[0] > lambdaFunction(item_2)[0]: store item 2 first else if lambdaFunction(item_1)[0] < lambdaFunction(item_2)[0]: store item 1 first else: if lambdaFunction(item_1)[0] > lambdaFunction(item_2)[0]: store item 2 first else: store item 1 first

So by specifying lambda x: (x[0], x[1]) we are saying to Python that we want to use a custom compare, and that we want to first compare the [0]th element of each item, then [1]st element. We then want them sorted by the order in which they compare.

This gives us a lot of power with regards to how we want to sort things. For example, try to guess what sorted(your_2D_array, key=lambda x: 1) does. It doesn't do anything! this is because we are saying that when we want to check which item is greater than one another, our compare function should return that both items are 1 and so they are equal. If every item in a list is equal to every other item then there's nothing to sort.

Question 5 (2 Points)

Fortunately, a DBMS can abstract a lot of those details away for us. It's important to remember that everything in a database is a set meaning that there is no inherent order in how it is stored. If we simply SELECT from our database we have to assume that the values are returned in a random order. However, we can use ORDER BY statement to specify the order we want. In the following cell, retrieve all the values in your database first by the Model attribute, then the City MPG, Hwy MPG, and Cmb MPG attributes. Print the result to the console.

deffetch_sorted_data(db_name,table_name):try:connection=mysql.connector.connect(host='localhost',user='student',password=PASSWORD,database=db_name)ifconnection.is_connected():cursor=connection.cursor()query=f"""PLACE YOUR QUERY HERE"""cursor.execute(query)results=cursor.fetchall()forrowinresults:print(row)exceptErrorase:print(f"Error while connecting to MySQL:{e}")finally:ifconnection.is_connected():cursor.close()connection.close()fetch_sorted_data("student","table1")

Part B: Understanding our data (23 Points)

Question 1 (1 Points)

Now that we have our data in a format we can use, let's take a look at the contents of the dataset. There are lots of things we could infer from this data. Write 5 inferences you could make from the data. For example, a possible answer might be

  1. The average Air Pollution Score of 4 Cyclinder vehicles.

Use the markdown cell below to write your 5 inferences (you cannot use mine)

Type Markdown and LaTeX: α2

Question 2 (10 Points)

In the following cell, write functions to query the database and calulate the values for each of your answers from above.

defqueries(db_name,table_name):try:connection=mysql.connector.connect(host='localhost',user='student',password=PASSWORD,database=db_name)ifconnection.is_connected():cursor=connection.cursor()## PUT YOUR QUERIES, AND THEIR EXECUTION HERE. MAKE SURE TO PRINT THE VALUES TO THE CONSOLE FOR READINGexceptErrorase:print(f"Error while connecting to MySQL:{e}")finally:ifconnection.is_connected():cursor.close()connection.close()queries("student","table1")

Question 3 (2 Points)

A metric I am interested in is the MPG for vehicles. In the following cell write a function which asks your database to calculate the average MPG for each vehicle class, for each MPG category using SQL.

Make your function return the following format:

{"mpgs": [ {"vehicle_class":"Vehicle_Class","avg_city":0,"avg_hwy":0,"avg_bmb":0, }, ] }
deffetch_average_mpg_by_vehicle_class(db_name,table_name):try:connection=mysql.connector.connect(host='localhost',user='student',password=PASSWORD,database=db_name)ifconnection.is_connected():cursor=connection.cursor()# HANDLE STORING THE RESULTS IN THE OUTPUT OBJECToutput={"mpgs": []}query=f"""PLACE YOUR QUERY HERE"""returnoutputexceptErrorase:print(f"Error while connecting to MySQL:{e}")finally:ifconnection.is_connected():cursor.close()connection.close()avgs=fetch_average_mpg_by_vehicle_class("student","table1")print(avgs)

Question 4 (1 Point)

What's interesting about averages, however, is that it is not so straigthforward as summing the values, and dividing by the number of elements. This question will show you why this may lead to erroneous conclusions.

Say I had 3 vehicles. I drive each for the same number of miles. The first vehicle has a mpg of 30, the second, a mpg of 10, and the third, a mpg of 25. The following code shows a quick average calculation for the average mpg I would use to drive the full 30 miles. What we are calculating is what's called the arithmetic mean.

mpgs=[30,10,25]# PUT YOUR CODE HEREprint(f"{arithmetic_mean}miles per gallon")

Question 5 (1 Point)

However, let's break this down a bit. Assume that for each vehicle, I drive 50km. Calculate the total number of gallons I would consume after driving each vehicle 50km

miles_driven_per_vehcile=50total_consumed=## WRITE CODE TO SET THIS ##print(total_consumed)

Question 6 (1 Point)

Now calculate the average mpg I get by dividing the total number of miles I drove divided by the number of gallons I consumed.

calculated_avg=## WRITE CODE TO SET THIS ##print(calculated_avg)

Question 7 (3 Point)

Is the suspected average the same as the calulated average? Why do you think this? Use the markdown cell below to answer.

Type Markdown and LaTeX: α2

What happened is that the values I am wanting to average are reciprocal values (i.e. there is a numerator, and a denominator in the units). Notice how in question 4 you had to do some unit conversion first in order to calculate the number of gallons consumed. For vehicle 1, for example, rather than calculate (30 mpg) * 50 miles, we had to reciprocate the value to (1/ 30)mpg * 50 miles to get the units correct.

Calculating averages like this is so common it has it's own name: The Harmonic Mean.

Knowing that the harmonic mean is what is used to recirprocal units, how would we achieve this in SQL? The AVG function in sql only returns the arithmetic mean so we can't use that.

What we can do, however, is write a procedure for sql to use. These are called Stored Procedures.

Below I've created code to insert a stored procedure into the database. When you run this code, the mysql connector will tell your database to create a procedure called harmonic_mean which you will then be able to use in your queries exactly like you would use the AVG function.

defcreate_harmonic_mean_udf(db_name):try:connection=mysql.connector.connect(host='localhost',user='student',password=PASSWORD,database=db_name)ifconnection.is_connected():cursor=connection.cursor()harmonic_mean_udf="""CREATE PROCEDURE calculate_harmonic_mean(IN tbl_name VARCHAR(64),IN col_name VARCHAR(64),OUT harmonic_mean DOUBLE)BEGINSET @query = CONCAT('SELECT COUNT(', col_name, ') / SUM(1 / ', col_name, ') INTO @hm ','FROM ', tbl_name, ' WHERE ', col_name, ' <> 0');PREPARE stmt FROM @query;EXECUTE stmt;DEALLOCATE PREPARE stmt;SET harmonic_mean = @hm;END"""# Drop the stored procedure if it existscursor.execute("DROP PROCEDURE IF EXISTS calculate_harmonic_mean")# Create the stored procedurecursor.execute(harmonic_mean_udf)print("Harmonic Mean stored procedure created successfully.")exceptErroraserr:print(f"Error:{err}")finally:ifconnection.is_connected():cursor.close()connection.close()create_harmonic_mean_udf("table1")

Question 8 (1 Points)

Now use the harmonic_mean udf to calculate the harmonic averages of the mpgs. This will require learning how to invoke a stored procedure via mysql.connector!

Make your function return the following format:

{"mpgs": [ {"vehicle_class":"Vehicle_Class","h_avg_city":0,"h_avg_hwy":0,"h_avg_bmb":0, }, ] }
deffetch_harmonic_average_mpg_by_vehicle_class(db_name,table_name):try:connection=mysql.connector.connect(host='localhost',user='student',password=PASSWORD,database=db_name)ifconnection.is_connected():cursor=connection.cursor()# HANDLE STORING THE RESULTS IN THE OUTPUT OBJECToutput={"mpgs": []}query=f"""PLACE YOUR QUERY HERE"""returnoutputexceptErrorase:print(f"Error while connecting to MySQL:{e}")finally:ifconnection.is_connected():cursor.close()connection.close()avgs=fetch_harmonic_average_mpg_by_vehicle_class("student","table1")

The arithmetic mean can be expressed as

The harmonic mean can be expressed as

A third type of mean known as the geometric mean can be expressed as

Or, in other words

When using the arithmetic mean we are considering values which get added together. When using the harmonic mean we are considering values which get reciprocaled then added together. The geometric mean is used when we want to average values which get multiplied together. For example, interest rates would be something where a geometric mean could be used since they are values which compound.

interest is calculated as  where

  • A = Final amount
  • P = Principal amount
  • r = Interest rate
  • n = number of compounding periods per unit of time
  • t = time.

In this example, the calculation  can be expanded to  thus this would require a geometric mean.

Question 9 (3 Points)

Give two real-world examples for when using the arithmetic mean would be appropriate, two real-world examples for when using the harmonic mean would be appropriate, and two real-world examples for when using the geometric mean would be appropriate

Type Markdown and LaTeX: α2

Part C: Reflection (6 marks)

Question 1 (2 Points)

Describe something you struggled with on this assignment, and why you think you struggled with it. (This gives us an opportunity to see where students are struggling and how we can help; answering this as descriptively as possible is in your best interest)

Type Markdown and LaTeX: α2

Question 2 (2 Points)

Name one thing about data science you learned in this assignment. Also name one thing you learned about SQL in particular in this assignment. Be descriptive; single word answers will not receive marks.

Type Markdown and LaTeX: α2

Question 3 (3 Points)

In 100-200 workds, How do you think this assignment prepares you for a career as a data scientist?

Type Markdown and LaTeX: α2

发表评论

电子邮件地址不会被公开。 必填项已用*标注