DATA 8 Foundations of Data Science

2024-08-12 Admin

DATA 8

Foundations of Data Science

Fall 2023

Final Exam

INSTRUCTIONS

• You have 2 hours and 50 minutes to complete the exam.

• The exam is closed book, closed notes, closed computer, closed calculator, except the provided reference sheet.

• Mark your answers on the exam itself in the spaces provided. We will not grade answers written on scratch paper or outside the designated answer spaces.

• If you need to use the restroom, bring your phone and exam to the front of the room.

• For questions with circular bubbles, you should select exactly one choice.

o You must choose either this option

o Or this one, but not both!

• For questions with square checkboxes, you may select multiple choices.

□ You could select this choice.

□ You could select this one too!

Preliminaries

You can complete and submit these questions before the exam starts. The exam is worth 140 points.

The sections are as follows:

• True or False - 30 points

• Community - 12 points

• Merch - 40 points

• Spotify - 30 points

• Bears - 28 points

There is also a Just For Fun section, worth 0 points, and a Last Words section, where you can state any assumptions you made on the exam, also worth 0 points.

(a) What is your full name?

(b) What is your student ID number?

(c) Who is your lab GSI? You may write Unknown if you don’t know their name.

(d) Sign here to confirm that all work on this exam is your own (or type your name if online).

1. (30.0 points) True or False

(a) (2.0 pt) If a scatterplot has a correlation coefficient of 1, all of the points must lie perfectly on a straight line.

o True

o False

(b) (2.0 pt) When building a classifier, ensuring that you have a large and diverse training set is a good way to mitigate overfitting.

o True

o False

(c) (2.0 pt) According to the Central Limit Theorem, if a sample is large, and drawn at random from the population with replacement, then the probability distribution of the sample mean is roughly normal.

o True

o False

(d) (2.0 pt) If we use linear regression to predict y-values based on our x-values, where both x and y are standardized, the estimate of the intercept could be negative.

o True

o False

(e) (2.0 pt) If you are sampling a numerical attribute that can only take on values of 0 or 1, the SD of your sample could have a value of 0.5.

o True

o False

(f) (2.0 pt) If we use linear regression to predict y-values based on our x-values, the average of our residuals will always be zero, regardless of whether x and y are standardized.

o True

o False

(g) (2.0 pt) If you use k-nearest neighbors on a data set that has only 2 possible categories for class (e.g. 0 or 1) and a k of 4, there is guaranteed to be a unique class that has the majority among the k nearest neighbors in the training set.

o True

o False

(h) (2.0 pt) The total variation distance can only be applied to categorical distributions in which there are 3 or more unique categories.

o True

o False

(i) (2.0 pt) The recommended way to estimate a classifier’s accuracy on the population is to evaluate its accuracy on the training set.

o True

o False

(j) (2.0 pt) For any distribution, the percent of data that lies within 3 SDs of the average is at least 80%.

o True

o False

(k) (2.0 pt) When conducting a randomized control experiment, random assignment of treatment and control serves as a way to simulate data from the null hypothesis.

o True

o False

(l) (2.0 pt) You can have two individuals whose distance is zero if calculated using only 1 numerical attribute, but whose distance is greater than zero if calculated using 2 numerical attributes.

o True

o False

(m) (2.0 pt) Chebychev’s Rule allows us to model subjective beliefs about events that involve randomness.

o True

o False

(n) (2.0 pt) Modern neural networks are powerful machine learning models for classifying images because their features are learned (as opposed to being inputted as columns from the training set).

o True

o False

(o) (2.0 pt) If a scatterplot has a correlation coefficient of 0, there is no way that all of the points lie on a straight line.

o True

o False

2. (12.0 points) Community

Writers for the upcoming Community movie are writing the script as having three acts.

For each act, they will randomly select a theme for it to be about. The themes are randomly chosen from the following distribution generated from a public poll from X (formerly Twitter):

• 60% chance of paintball fight

• 40% chance of multiverse

Note: Assume each act is sampled with the same set of probabilities regardless of what is picked for the other acts.

(a) (3.0 pt) What is the probability that the three acts are multiverse, paintball fight and multiverse, in that order?

o (2 × 0.4) × (0.6)

o (2 × 0.4) + 0.6

o 0.4² × 0.6

o 0.6 × 0.4 × 0.6

o 1 - (0.6 × 0.4 × 0.6)
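Because each act is sampled independently, the probability of one specific ordered sequence of themes is the product of the per-act probabilities. The sketch below illustrates that rule; the helper name `sequence_probability` and the dictionary `THEME_PROBS` are illustrative assumptions, not part of the exam.

```python
from math import prod

# Theme probabilities from the poll described above.
THEME_PROBS = {"paintball": 0.6, "multiverse": 0.4}

def sequence_probability(themes):
    """Probability of an exact ordered sequence of independently drawn themes."""
    return prod(THEME_PROBS[theme] for theme in themes)

p = sequence_probability(["multiverse", "paintball", "multiverse"])
print(round(p, 3))
```

The order of the factors does not change the product, so any arrangement of two multiverse acts and one paintball act has the same probability.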

(b) Donald Glover, an actor from the original Community TV show, hasn’t yet confirmed whether he will return for the movie.

Suppose we know the following conditional probabilities:

• If the third act has a paintball fight, there is a 20% chance Glover will return for the movie

• If the third act has a multiverse theme, there is a 50% chance Glover will return for the movie

i. (3.0 pt) What is the chance that the third act has a paintball fight and Glover does not return for the movie?

o 0.4 × 0.8

o 0.8

o 1 - (0.2 + 0.5)

o 0.6 × 0.8 + 0.4 × 0.5

o None of the above.

ii. (3.0 pt) Suppose the script has now been finalized and Glover announces that he will be returning for the movie.

What is the probability that the third act will be a paintball fight?

o (0.2 × 0.6) / (0.2 × 0.6 + 0.5 + 0.8)

o 0.6 × 0.2

o (0.6 × 0.2) / (0.6 × 0.2 + 0.4 × 0.2)

o (0.6 × 0.2) / (0.6 × 0.2 + 0.4 × 0.5)

o 0.6 × 0.2 + 0.4 × 0.5

o None of the above.

iii. (3.0 pt) Suppose that before the script is finalized, there is a leak on social media that indicates the chance of the third act being a paintball fight is 90%.

The script then gets finalized and Glover announces that he will be returning for the movie.

Given the new information in the leak, what is our updated probability that the third act will be a paintball fight?

o (0.2 × 0.9) / (0.2 × 0.9 + 0.5 × 0.8)

o (0.9 × 0.2) / (0.9 × 0.2 + 0.1 × 0.2)

o 0.9 × 0.2

o (0.9 × 0.2) / (0.9 × 0.2 + 0.4 × 0.5)

o 0.9 × 0.2 + 0.1 × 0.5

o None of the above.
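Parts ii and iii both update a prior for the third act's theme after learning that Glover returns, which is an application of Bayes' rule. The following is a minimal sketch of that update for a two-way split (paintball fight versus multiverse); the function name `posterior` and its parameter names are hypothetical, and the block is an illustration rather than an official solution.

```python
def posterior(prior_a, p_evidence_given_a, p_evidence_given_not_a):
    """P(A | evidence) via Bayes' rule when the only alternatives are A and not-A."""
    joint_a = prior_a * p_evidence_given_a
    joint_not_a = (1 - prior_a) * p_evidence_given_not_a
    return joint_a / (joint_a + joint_not_a)

# Chance Glover returns given each third-act theme (from the question).
P_RETURN_GIVEN_PAINTBALL = 0.2
P_RETURN_GIVEN_MULTIVERSE = 0.5

# Prior from the poll (60%) versus prior from the leak (90%).
poll_based = posterior(0.6, P_RETURN_GIVEN_PAINTBALL, P_RETURN_GIVEN_MULTIVERSE)
leak_based = posterior(0.9, P_RETURN_GIVEN_PAINTBALL, P_RETURN_GIVEN_MULTIVERSE)
print(round(poll_based, 3), round(leak_based, 3))
```

Only the prior changes between the two calls; the conditional probabilities of Glover returning stay the same.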

3. (40.0 points) Merch

Ernest and Mollie watched Barbie on opening day and were surprised that in the weeks after they saw several people wearing shirts that say “I am Kenough”.

(a) (3.0 pt) Suppose that Mollie wants to randomly sample movies and use their merchandise sales to create a 95% confidence interval for the population mean of 2023 merchandise sales.

If she knows that the population SD is $10 million, what is the minimum sample size she needs to create a confidence interval that has a width of $2 million?

Please draw a box around your final answer.
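One way to reason about this kind of question is sketched below. It assumes the common approximation that a 95% confidence interval for the mean spans 2 SDs of the sample mean on either side of the center, and that the SD of the sample mean is the population SD divided by the square root of the sample size. The function name `min_sample_size` and the parameter `z` are illustrative assumptions.

```python
import math

def min_sample_size(population_sd, max_width, z=2):
    """Smallest n such that 2 * z * population_sd / sqrt(n) <= max_width."""
    return math.ceil((2 * z * population_sd / max_width) ** 2)

# Population SD of 10 (million dollars), target width of 2 (million dollars).
n = min_sample_size(population_sd=10, max_width=2)
print(n)
```

Passing `z=1.96` instead of the default gives the slightly smaller sample size implied by the exact normal-curve multiplier.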

(b) (2.0 pt) Suppose that Mollie uses bootstrapping to create a 95% confidence interval using a sample size smaller than the one from part (a). Ernest states that the interval is guaranteed to be wider than $2 million.

Is Ernest’s statement true or false?

Note: Assume your answer in part (a) is correct.

o True

o False

(c) (3.0 pt) Suppose that Mollie uses the sample size from part (a) and constructs a 95% confidence interval of [35.1, 78.9].

What is the probability that the true population mean of merchandise sales is outside of this interval?

o 2.5%

o 5%

o 10%

o 95%

o There is not enough information to answer because we don’t know the endpoints of the confidence interval.

o None of the above. There is no chance involved in whether our confidence interval contains the true parameter.

(d) (3.0 pt) Suppose that Mollie wants to create a 95% confidence interval for the population 75th percentile of merchandise sales.

Which of the following methods could be used to create such an interval? Select all that apply.

□ Chebychev’s Inequality

□ Bootstrapping

□ Central Limit Theorem

□ Randomized Control Experiment

□ None of the above

(e) Mollie suspects that a movie’s Rotten Tomatoes score might have a relationship with the amount of merchandise it sells within the first month of theatrical release. She randomly samples movies released in 2023 from Rotten Tomatoes and collects them into a table called movies. The first few rows are shown here:

Name	Critics	Audience	Sales
Barbie	88	83	202.3
Hunger Games	64	89	51.8
The Flash	63	83	36.9

. . . (57 rows omitted)

The table has the following columns:

• Name: (string) the movie’s name

• Critics: (int) the movie’s Rotten Tomatoes Tomatometer score (a percentage from 0 to 100)

• Audience: (int) the movie’s Rotten Tomatoes Audience score (a percentage from 0 to 100)

• Sales: (float) the amount of the movie’s merchandise sold (in millions of dollars)

Note: The table has exactly 60 rows in it.

i. Mollie wants to fit a regression line to predict Sales from Critics, so she writes the following partially completed code:

def su(array):
    return (array - np.mean(array)) / np.std(array)

def intercept(x, y):
    correlation = ______(a)______
    slope = correlation * ______(b)______
    return ______(c)______

The intercept function returns the intercept of the regression line. Note: Both functions take in arrays as input.

A. (3.0 pt) Write a Python expression to fill in blank (a).

B. (3.0 pt) Write a Python expression to fill in blank (b).

C. (3.0 pt) Write a Python expression to fill in blank (c).
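For reference, the sketch below shows one possible way such a pair of functions could be completed, using the standard relationships: the correlation is the mean of the products of standard units, the slope is the correlation times SD of y over SD of x, and the intercept is the mean of y minus the slope times the mean of x. It is an illustration, not the official answer key, and the test data is made up.

```python
import numpy as np

def su(array):
    """Convert an array to standard units."""
    return (array - np.mean(array)) / np.std(array)

def intercept(x, y):
    """Intercept of the least-squares line predicting y from x."""
    correlation = np.mean(su(x) * su(y))
    slope = correlation * np.std(y) / np.std(x)
    return np.mean(y) - slope * np.mean(x)

# Made-up data lying exactly on y = 3x + 5, so the intercept should be 5.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3 * x + 5
print(intercept(x, y))
```

Checking against points that lie exactly on a known line is a quick sanity test, since the regression line must then coincide with that line.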

ii. (3.0 pt) Mollie fits a regression line to predict Sales from Critics and gets a slope of -2.1. Which of the following would she expect to happen with the regression line’s predictions?

Select all that apply.

□ The regression line will tend to overestimate Sales for movies with a below average Critics score.

□ The regression line will tend to underestimate Sales for movies with a below average Critics score.

□ The regression line will tend to overestimate Sales for movies with an above average Critics score.

□ The regression line will tend to underestimate Sales for movies with an above average Critics score.

□ None of the above.

iii. (3.0 pt) Ernest thinks the true slope of the regression line in the population is 0 and that the value observed in the sample above is due to chance. He bootstraps the data in movies to generate a confidence interval for the true slope.

Which of the following statements are true? Select all that apply.

□ Every bootstrapped estimate of the slope will be negative.

□ The size of the bootstrap resamples will all be exactly 60.

□ All 60 movies in the original sample will appear in every bootstrap resample.

□ The bootstrap process is equivalent to permuting the rows of the dataset repeatedly.

□ None of the above.
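The bootstrap procedure described in this part can be sketched as follows. The data here is synthetic (a stand-in for the 60-row movies table, not the real values), and the function names `slope` and `bootstrap_slope_interval`, the number of repetitions, and the seeds are all illustrative assumptions. Each resample draws as many rows as the original sample, with replacement, and the interval is taken from the percentiles of the resampled slopes.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope: r * SD(y) / SD(x)."""
    r = np.mean((x - x.mean()) / x.std() * (y - y.mean()) / y.std())
    return r * y.std() / x.std()

def bootstrap_slope_interval(x, y, repetitions=2000, level=95, seed=0):
    """Percentile-method confidence interval for the slope."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(repetitions)
    for i in range(repetitions):
        # Draw n row indices WITH replacement: same size as the original
        # sample, but some rows repeat and others are left out.
        idx = rng.integers(0, n, size=n)
        slopes[i] = slope(x[idx], y[idx])
    tail = (100 - level) / 2
    return np.percentile(slopes, tail), np.percentile(slopes, 100 - tail)

# Synthetic stand-in for the 60-row movies table (not the real data).
data_rng = np.random.default_rng(1)
critics = data_rng.uniform(20, 100, size=60)
sales = 250 - 2.1 * critics + data_rng.normal(0, 30, size=60)

lower, upper = bootstrap_slope_interval(critics, sales)
print(lower, upper)
```

Resampling rows with replacement is different from permuting them: a permutation keeps every row exactly once, while a bootstrap resample typically repeats some rows and omits others.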

iv. (3.0 pt) Ernest constructs a 90% confidence interval for the true slope and finds it to be [-4.5, -1.1].

Assuming a p-value cutoff of 5%, which of the following can Ernest conclude based on his confidence interval?

Select all that apply.

□ The true slope in the population is 0.

□ The true slope in the population is not 0.

□ The true slope in the population is less than 0.

□ None of the above.

v. (3.0 pt) Mollie’s sister, Anna, argues that Ernest should have made a confidence interval for the correlation coefficient instead.

Which of the following statements are true? Select all that apply.

□ The correlation coefficient should be used instead because it is unitless.

□ The correlation coefficient should be used instead because the magnitude of the slope could be affected by the units of the x-axis and y-axis.

□ It doesn’t matter which value is used since the slope is equal to the correlation coefficient.

□ It doesn’t matter which value is used since a slope of 0 implies the correlation coefficient is 0 as well.

□ None of the above.

(f) Rather than using the critics’ scores, Mollie thinks it’s a better idea to use the audience scores to predict merchandise sales.

Suppose she knows the following:

• the Audience column has a mean of 70 and a standard deviation of 10

• the Sales column has a mean of 100 and a standard deviation of 50

• the correlation between the Audience and Sales columns is 0.4

i. (3.0 pt) If Mollie wants to predict Sales from Audience, what would be the intercept of her regression line?

Please draw a box around your final answer.

ii. (3.0 pt) For a movie that has an audience score of 80, what would the regression line above predict as the merchandise sales?

o 200

o 150

o 140

o 120

o 110

o None of the above
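The regression line in part (f) is determined entirely by the five summary numbers given above. A minimal sketch of that computation, with the helper name `regression_line` being an illustrative assumption:

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the regression line predicting y from x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Audience: mean 70, SD 10. Sales: mean 100, SD 50. Correlation: 0.4.
line_slope, line_intercept = regression_line(
    mean_x=70, sd_x=10, mean_y=100, sd_y=50, r=0.4
)
prediction_at_80 = line_slope * 80 + line_intercept
print(line_slope, line_intercept, prediction_at_80)
```

An audience score of 80 is one SD above the mean, so the prediction sits 0.4 SDs of Sales above the mean of Sales, which is another way to check the result.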

iii. (2.0 pt) What are the units for the slope in the above regression?

o Dollars per Percent

o Millions of Dollars

o Millions of Dollars per Tomato

o Dollars per Ounce of Ketchup

o None of the above

4. (30.0 points) Spotify

Barbara and Jeanine were quarantined for a week with 13 other friends due to unforeseen circumstances. They are curious to understand what songs each friend listened to during the quarantine period.

To evaluate this, they randomly sample song “plays” by the 15 people during quarantine and put them into a table called spotify. Here are the first few rows:

Username	Artist	Song	Genre	Duration
barbz23	Olivia Rodrigo	Vampire	Pop	3.14
jea9	The Weeknd	Popular	R&B	2.78
ronnieboi	Doja Cat	Paint the Town Red	Hip-Hop	3.05

. . . (328 rows omitted)

The table has the following columns:

• Username: (string) the Spotify username of the person who played the song

• Artist : (string) the song’s artist

• Song: (string) the song’s name

• Genre : (string) the song’s genre

• Duration: (float) the number of minutes the song was played on that occasion

Note: There is a row for each time a song was played, so many rows will be repeated. For example, if Jeanine listened to the song Vampire 3 times, then there will be 3 rows in the table for those “plays”.

(a) (3.0 pt) Which of the following Python expressions return the name of the artist with the most plays in the table?

Hint: Each row of the table is equivalent to a single play.

Select all that apply.

□ spotify.sort('Duration', descending=True).column(1).item(0)

□ spotify.group('Artist').sort(1, descending=True).column(0).item(0)

□ spotify.sort('Duration', descending=True).column('Artist').item(0)

□ spotify.select('Artist', 'Duration').group(0, max).sort(1, descending=True).column(0).item(0)

□ None of the above.

(b) (3.0 pt) Write a Python expression that returns a table with more than 3 columns that displays the average play duration for each unique combination of artist and song.

(c) (3.0 pt) Write a Python expression that returns the name of the artist that has the largest number of unique songs in the table.
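Parts (a) and (c) hinge on the difference between counting plays (rows) and counting unique songs. The sketch below illustrates that distinction in plain Python rather than with the table methods the questions ask for; the rows, artist names, and song names are hypothetical placeholders.

```python
from collections import Counter, defaultdict

# Hypothetical plays as (artist, song) pairs; one pair per play.
plays = [
    ("Artist A", "Song 1"),
    ("Artist A", "Song 1"),
    ("Artist A", "Song 1"),
    ("Artist B", "Song 2"),
    ("Artist B", "Song 3"),
]

# Most plays: count rows per artist.
play_counts = Counter(artist for artist, _ in plays)
most_played = play_counts.most_common(1)[0][0]

# Most unique songs: collapse repeated (artist, song) pairs first.
songs_by_artist = defaultdict(set)
for artist, song in plays:
    songs_by_artist[artist].add(song)
most_varied = max(songs_by_artist, key=lambda a: len(songs_by_artist[a]))

print(most_played, most_varied)
```

In this made-up data the two answers differ: one artist has the most rows, while the other has more distinct songs.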

(d) While looking at a table of song plays is helpful, Barbara notices that the table doesn’t contain the names of people who played the songs.

She creates a separate table called accounts that contains their friends’ Spotify accounts. The first few rows are shown here:

Identifier	DisplayName
jmarsdenofficial	James Marsden
margarita23	Inez De Leon
ken_the_og	Ken Hyun

. . . (12 rows omitted)

The table has the following columns:

• Identifier: (string) the account’s ID in Spotify’s database

• DisplayName: (string) the account’s display name (first and last name)

Barbara notices that one of the friends, 'Todd Gregory', tends to skip Pop songs after listening to them for just a few seconds.

She writes the following partially completed code, which assigns result to an array containing the average play duration for every unique Pop song that Todd played.

combined = ______(a)______

todd_pop_songs = ______(b)______

result = todd_pop_songs.______(c)______

Recall: The spotify table has columns Username, Artist, Song, Genre and Duration.

i. (3.0 pt) Write a Python expression to fill in blank (a).

ii. (3.0 pt) Write a Python expression to fill in blank (b).

iii. (3.0 pt) Write a Python expression to fill in blank (c).

(e) Jeanine notices that average play durations for ’Pop’ songs are typically lower than those for ’Hip-Hop’ songs across all 15 friends.

Barbara argues that any differences observed in the sample are only due to chance.

Recall: The spotify table has columns Username, Artist, Song, Genre and Duration.

i. (3.0 pt) Which of the following is an alternative hypothesis that Jeanine could use to assess her claims?

Select all that apply.

□ ’Pop’ song plays have a lower Duration on average than ’Hip-Hop’ song plays.

□ ’Pop’ song plays have the same Duration distribution as ’Hip-Hop’ song plays.

□ All ’Pop’ song plays have a lower Duration than all ’Hip-Hop’ song plays.

□ ’Hip-Hop’ song plays have a higher Duration on average than ’Pop’ song plays.

□ None of the above.

ii. (3.0 pt) Which of the following test statistics could Jeanine use to assess her claims? Select all that apply.

□ The total variation distance between the Duration distribution of ’Pop’ song plays and the Duration distribution of ’Hip-Hop’ song plays.

□ The mean Duration among ’Hip-Hop’ song plays minus the mean Duration among ’Pop’ song plays.

□ The mean Duration among ’Pop’ song plays minus the mean Duration among ’Hip-Hop’ song plays.

□ The mean Duration among ’Pop’ song plays.

□ The mean Duration among ’Hip-Hop’ song plays plus the mean Duration among ’Pop’ song plays.

□ None of the above.

iii. (3.0 pt) Jeanine chooses a test statistic such that large values favor the alternative.

She simulates the test statistic many times and stores these in an array called test_stats. Suppose the observed value of the test statistic is 12.1.

Write a Python expression that returns the p-value for this hypothesis test.
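When large values of the statistic favor the alternative, the empirical p-value is the proportion of simulated statistics that are at least as large as the observed one. A minimal sketch follows; the values in `test_stats` are made up for illustration, whereas in practice they would come from the simulation described above.

```python
import numpy as np

observed = 12.1
# Hypothetical simulated statistics (placeholder values, not real results).
test_stats = np.array([3.2, 8.9, 12.1, 5.5, 14.0, 9.7, 1.3, 10.4, 7.8, 6.1])

# Proportion of simulated statistics at least as large as the observed value.
p_value = np.count_nonzero(test_stats >= observed) / len(test_stats)
print(p_value)
```

With these placeholder values, two of the ten simulated statistics are at least 12.1.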

iv. (3.0 pt) Jeanine uses a p-value cutoff of 5% and finds that this corresponds to a simulated test statistic of 10.2.

Given the information in part (iii), which of the following can she conclude? Select all that apply.

□ The data are consistent with the null hypothesis.

□ The data are consistent with the alternative hypothesis.

□ There is a 5% chance that the null hypothesis is true.

□ There is a 5% chance that the alternative hypothesis is true.

□ ’Pop’ song plays had a lower duration on average than ’Hip-Hop’ songs.

□ There is not enough information to make a conclusion of any kind.
