DS1000A Data Science

DS1000A - Assignment #2

Due: June 6, 2023 @ 11:55pm

• Assignment submissions must be done via Gradescope.

• You will receive a grade of zero in each case below:

a.   Submission is not in PDF format.

b.   Questions have no pages assigned to them.

c.   Submission is illegible – blurry or too small (zoom won’t be used to enlarge print)

• You must submit a single PDF file.

a.   Part 1 – Written Answer: Refer to the file “Scanning & Uploading a Document to Gradescope” for instructions.

b.   Part 2 – Python: Write your Python code (e.g. in Jupyter notebook) then save it as a pdf file.

c.   Combine all the pdf files above into one pdf file.

• Each student must submit their own work.

Scholastic offences are taken seriously, and students are directed to read the policy on what constitutes a scholastic offence:

http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf

Grade Breakdown:

Part 1: Written Answer

Question 1          12
Question 2          13
Question 3          10
Question 4          10
Total Points =      45

Part 2: Python

Question 5           8
Question 6          11
Question 7          16
Total Points =      35

Total Points:       80

Part 1 - Written Answer (round answers to 2 decimal places)

Question 1 [12 Points]

The height of adult males in a certain population is approximately Normal, with a mean of 1.75 metres and a standard deviation of 0.1 metres. For each part below, sketch a Normal curve and shade in the area representing the proportion being calculated.

a.   [4 Points] What proportion of adult males have heights greater than 1.75 metres?

b.   [4 Points] What proportion of adult males have heights of exactly 1.3 metres?

c.    [4 Points] How tall must a male be to be in the top 2.5 % of adult males?
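The three proportions in Question 1 can be sanity-checked numerically once the curves have been sketched. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as the calculator. The question itself asks for hand-drawn sketches, and the variable names are illustrative only.

```python
from statistics import NormalDist

# Heights ~ Normal(mean = 1.75 m, sd = 0.1 m), as stated in Question 1.
heights = NormalDist(mu=1.75, sigma=0.1)

# (a) Area to the right of 1.75 m, which is the area to the right of the mean.
prop_above_mean = 1 - heights.cdf(1.75)

# (b) For a continuous distribution, the area above a single point is zero:
#     P(X = 1.3) is the area between 1.3 and 1.3.
prop_exactly = heights.cdf(1.3) - heights.cdf(1.3)

# (c) Height with 2.5% of the area above it (the 97.5th percentile).
cutoff_top = heights.inv_cdf(0.975)

print(round(prop_above_mean, 2), round(prop_exactly, 2), round(cutoff_top, 2))
```

Part (c) can also be cross-checked against the 68-95-99.7 rule, which places about 2.5% of the area above two standard deviations from the mean.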

Question 2 [13 Points]

The table below gives the self-reported ages of 10 first-born children (Child’s age) along with the ages of their mothers (Mother’s age), both in years.

Mother’s age (x)    Child’s age (y)
45                  17
26                  2
50                  19
29                  9
55                  30
60                  28
38                  13
40                  5
48                  11
52                  27

a.   [6 Points] Draw, by hand, a scatterplot for this dataset. Comment on the direction, form, and strength of this relationship.

b.   [5 Points] Find the correlation between the mother’s and child’s age. Show all your work.

c.    [2 Points] Do the value and the sign (positive or negative) of the correlation in part (b) make sense based on the scatterplot from part (a)? Briefly explain.
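A hand calculation for part (b) can be checked with a short script. The following is a minimal sketch, assuming the ten mother's ages and ten child's ages listed in the table pair up in the order given. The function name `correlation` is illustrative and not taken from the course labs.

```python
from math import sqrt

# Data from the Question 2 table. Pairing the i-th mother's age with the
# i-th child's age is an assumption about the original table layout.
mothers_age = [45, 26, 50, 29, 55, 60, 38, 40, 48, 52]
childs_age = [17, 2, 19, 9, 30, 28, 13, 5, 11, 27]

def correlation(x, y):
    """Sample correlation r = (1 / (n - 1)) * sum(z_x * z_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (divide by n - 1).
    sd_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sd_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    # Products of the standardized values for each (x, y) pair.
    z_products = [((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y)
                  for xi, yi in zip(x, y)]
    return sum(z_products) / (n - 1)

r = correlation(mothers_age, childs_age)
print(round(r, 2))
```

Writing the formula out step by step mirrors the "show all your work" requirement: each intermediate quantity (means, standard deviations, standardized products) corresponds to a column in a hand-computed table.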

Question 3 [10 Points]

A study investigates the relationship between students' daily study hours and final exam scores. The students in the study have an average daily study time (x) of approximately 6 hours, with a standard deviation of about 1 hour. The average final exam score (y) is around 75 points, with a standard deviation of about 10 points. Suppose the correlation between daily study hours and final exam scores is r = 0.6.

a. [2 Points] Calculate the slope and intercept of the regression line of final exam scores on daily study hours. Explain the meaning of the slope in the context of this study.

b. [1 Point] Using the calculated regression line equation, predict the final exam score for a student who studies 6 hours daily.

c. [4 Points] Draw, by hand, the regression line representing the relationship between daily study hours and final exam scores for study times ranging from 2 to 6 hours. On the graph, mark the predicted exam score for a student who studies 6 hours daily.

e. [1 Point] Calculate the percentage of the final exam score variation explained by the straight-line relationship with daily study hours.

f. [2 Points] Briefly discuss whether you expect the prediction made in part c to be accurate and explain your reasoning.
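The regression line in Question 3 is determined entirely by the five summary statistics given. The following is a minimal sketch, assuming the standard least-squares formulas (slope = r · s_y / s_x, intercept = ȳ − slope · x̄). The helper name `predict` is illustrative.

```python
# Summary statistics stated in Question 3.
mean_x, sd_x = 6.0, 1.0     # daily study hours
mean_y, sd_y = 75.0, 10.0   # final exam score
r = 0.6

# Least-squares line y_hat = intercept + slope * x from summary statistics:
#   slope = r * s_y / s_x, intercept = mean_y - slope * mean_x.
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

def predict(hours):
    """Predicted exam score for a given number of daily study hours."""
    return intercept + slope * hours

# The fraction of variation in y explained by the line is r squared.
r_squared = r ** 2

print(slope, intercept, predict(6), r_squared)
```

Because the regression line always passes through the point (x̄, ȳ), the prediction at the mean study time equals the mean exam score, which is a quick check on the hand calculation.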

Question 4 [10 Points]

A study investigates the relationship between the number of hours of sleep a person gets per night and their job performance score. Based on the observations in the table below, assume  that you have already calculated the regression line to be: y=5.17x+42.06.

Hours of Sleep (x)    Job Performance Score (y)
0                     32
2                     43
4                     70
5                     75
6                     78
7                     80
8                     95
8                     88
9                     90
12                    85

a.   [4 Points] Use the regression line to calculate the 10 residuals for the observations. What do the residuals add to?  Does this make sense?

b.   [4 Points] Draw a residual plot. What does this plot tell us about the regression model?

c.    [2 Points] What is an influential observation?  Do you think this dataset contains any?
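The residual calculation in part (a) can be checked with a few lines of code. The following is a minimal sketch, assuming the ten x values and ten y values in the table pair up in the order listed, and using the regression line given in the question.

```python
# Data from the Question 4 table (paired in the order listed, which is an
# assumption about the original layout) and the line given in the question.
hours_sleep = [0, 2, 4, 5, 6, 7, 8, 8, 9, 12]
job_score = [32, 43, 70, 75, 78, 80, 95, 88, 90, 85]
slope, intercept = 5.17, 42.06

# Residual = observed y minus predicted y.
residuals = [y - (intercept + slope * x)
             for x, y in zip(hours_sleep, job_score)]

for x, e in zip(hours_sleep, residuals):
    print(f"x = {x:2d}  residual = {e:6.2f}")

# The sum is close to, but not exactly, zero because the coefficients
# given in the question are rounded to 2 decimal places.
print("sum of residuals:", round(sum(residuals), 2))
```

Plotting these residuals against `hours_sleep` gives the residual plot asked for in part (b).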

Part 2 - Python

• Be sure to show all code and results.

• You do not have to use the exact code learnt in the labs to earn full marks.

• All parts of the questions should be done in Python; use comments in Python for the written-answer questions.

Question 5 [8 points]

The length of the fin in a population of a specific type of fish is approximately normal, with mean 50 cm and standard deviation 29 cm.

a.   [2 points] What proportion of fish have fin length less than 2.9 cm?

b.   [2 points] What proportion of fish have fin length greater than 150 cm?

c.    [2 points] What proportion of fish have fin length between 79 cm and 89 cm?

d.   [2 points] What value of fin length gives a 15% proportion of fish above it?
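Each part of Question 5 maps onto one of four standard normal-curve calculations: area below a value, area above a value, area between two values, and the value that leaves a given area above it. The following is a minimal sketch, assuming the standard-library `statistics.NormalDist` rather than whichever functions were used in the labs. The four helper names are hypothetical.

```python
from statistics import NormalDist

# Fin length ~ Normal(mean = 50 cm, sd = 29 cm), as stated in Question 5.
fin = NormalDist(mu=50, sigma=29)

def prop_below(dist, x):
    """Area under the curve to the left of x."""
    return dist.cdf(x)

def prop_above(dist, x):
    """Area under the curve to the right of x."""
    return 1 - dist.cdf(x)

def prop_between(dist, low, high):
    """Area under the curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

def value_with_prop_above(dist, p):
    """Value that has proportion p of the distribution above it."""
    return dist.inv_cdf(1 - p)

print("a:", prop_below(fin, 2.9))
print("b:", prop_above(fin, 150))
print("c:", prop_between(fin, 79, 89))
print("d:", value_with_prop_above(fin, 0.15))
```

Note that part (d) asks for an upper-tail proportion, so the inverse CDF is evaluated at 1 − 0.15 = 0.85 rather than at 0.15.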

Question 6 [11 points]

Simulation studies play an important role in statistics and data science. They provide a fundamental tool for studying the properties of statistical estimators and models under various situations because, in a simulation study, we know the “ground truth” of the data, which we seldom have in real-world data analysis. In this question, we are going to perform a basic simulation study related to the standard normal distribution.

a.   [2 points] Compute the proportion of values smaller than two standard deviations above the mean for a standard normal distribution.

b.   [5 points] Write a function called `stats_normal` with argument `n` to perform the following task: (Hint: review what we have learned in Lab 2)

1.   Generate a sample from the standard normal distribution with a sample size equal to n.

2.   Compute the mean of the sample.

3.   Compute the standard deviation of the sample.

4.   Compute the proportion of values smaller than two sample standard deviations above the sample mean in the sample.

5.   Return a dictionary (Recall Lab 1) containing the three numerical quantities above.

c.    [2 points] Set the random seed to 5. Run the function `stats_normal` with n = 100, 1000, 10000. (Hint: you can simply write three lines of code. Running a loop is optional.)

d.   [2 points] What kind of pattern can you see from the results in part c?
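The steps in parts (a) to (c) can be sketched as follows. This is a minimal illustration, assuming the standard-library `random` and `statistics` modules in place of whatever generator was introduced in Lab 2, so the numbers produced for seed 5 will not match those from a NumPy- or SciPy-based solution. The dictionary keys are illustrative choices.

```python
import random
from statistics import NormalDist, mean, stdev

# Part (a): theoretical proportion below mean + 2 sd for a standard normal.
theoretical_prop = NormalDist().cdf(2)

def stats_normal(n):
    """Simulate n standard normal values and summarize them."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    sample_mean = mean(sample)
    sample_sd = stdev(sample)  # sample standard deviation (n - 1 divisor)
    # Proportion of values below sample mean + 2 sample sd.
    cutoff = sample_mean + 2 * sample_sd
    prop_below = sum(v < cutoff for v in sample) / n
    return {"mean": sample_mean, "sd": sample_sd, "prop_below": prop_below}

# Part (c): fix the seed once, then run the three sample sizes.
random.seed(5)
print("theoretical:", round(theoretical_prop, 4))
for n in (100, 1000, 10000):
    print(n, stats_normal(n))
```

Comparing each returned dictionary with the known values (mean 0, standard deviation 1, and `theoretical_prop`) is one way to approach the pattern asked about in part (d).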

Question 7 [16 points]

The dataset (waterlev.csv) records water level measurements from January 1, 2000, to August 27, 2019. It includes two main fields: the date of the measurement and the mean water gauge height in feet. Each row represents a measurement taken on a specific date.

a.   [3 points] Create a histogram of the mean water gauge heights with the density curve estimate on the same plot. Comment on the shape of the distribution.

b.   [2 points] Calculate the mean (x̄) and standard deviation (s) of the mean water gauge heights.

c.    [4 points] Using the results of part b, follow the steps below:

1.   Find the number of data points with mean water gauge heights between x̄ − s and x̄ + s.

2.   Divide this number by the total number of data points in the dataset to obtain a proportion.

3.   Compare this proportion to the proportion between μ − σ and μ + σ given by the normal density curve. Which one is larger?

4.   Based on your results, does there appear to be a departure from the normal distribution? (Hint: for the tail part of the distribution, is it heavier/lighter than normal?)

d.   [7 points] Follow the steps below:

1.   Make a boxplot of the mean water gauge heights.

2.   Generate a sample from a normal distribution with mean and standard deviation equal to the ones computed in part b. The sample size should be equal to the total number of points in the dataset. Set the random seed as 2023. (Hint: use rnorm)

3.   Make a boxplot of this normal sample.

4.   Set the random seed as 2022. Redo steps 2 and 3.

5.   By comparing these three boxplots, do they support your findings in part c (regarding whether there is a departure from the normal distribution)? Why? (Hint: pay attention to the outlier part. The reason we redraw another boxplot for the normal sample is to reduce the effects of randomness, so that you can see a general outlier pattern of a normal distribution.)
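The numerical side of parts (c) and (d) can be sketched without plotting. The following is a minimal illustration under several loudly stated assumptions: the column names in waterlev.csv are not given here, so the `heights` list is synthetic stand-in data and not the real measurements; the helper names are hypothetical; the standard-library generator is used instead of the lab's; and outliers are counted with the common 1.5 × IQR fence rule, whose quartiles may differ slightly from those drawn by a plotting library.

```python
import random
from statistics import NormalDist, mean, stdev, quantiles

def prop_within_one_sd(values):
    """Proportion of values strictly between x_bar - s and x_bar + s."""
    x_bar, s = mean(values), stdev(values)
    inside = sum(x_bar - s < v < x_bar + s for v in values)
    return inside / len(values)

def count_boxplot_outliers(values):
    """Count points beyond the usual 1.5 * IQR boxplot fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(v < lower or v > upper for v in values)

def normal_sample(n, mu, sigma, seed):
    """Reproducible normal sample of size n for a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Reference proportion between mu - sigma and mu + sigma for a normal curve.
normal_prop = NormalDist().cdf(1) - NormalDist().cdf(-1)

# STAND-IN VALUES ONLY (hypothetical mean 10, sd 2): replace this list with
# the mean gauge-height column read from waterlev.csv.
heights = normal_sample(n=500, mu=10.0, sigma=2.0, seed=0)
x_bar, s = mean(heights), stdev(heights)

print("within one sd:", round(prop_within_one_sd(heights), 3),
      "normal reference:", round(normal_prop, 3))
print("data outliers:", count_boxplot_outliers(heights))
for seed in (2023, 2022):
    simulated = normal_sample(len(heights), x_bar, s, seed)
    print("seed", seed, "outliers:", count_boxplot_outliers(simulated))
```

With the real data in place of `heights`, a proportion within one standard deviation that differs noticeably from the normal reference, or an outlier count well above those of the two simulated samples, would point in the same direction as the boxplot comparison in step 5.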
