DS1000B Data Science Concepts

DS1000B – Assignment #2

Due: Feb 18, 2024 @ 11:55pm

You may work with a partner for this assignment. If you choose to do so, only one of you should submit the assignment. Be sure to include your partner's name in the designated place in Gradescope so that the grade is linked to both of you. If you forget this step, be certain that both names are on the submitted PDF file. If you are not linked to an assignment or your name is not on a submission file, then you will receive a grade of zero.

You must carefully assign pages to their corresponding questions. You will receive a grade of zero in each case below:

a.   Submission is not in PDF format.

b.   Questions with no pages assigned to them.

You must submit a single PDF file. Here is a recommended way to achieve this:

a)   If you write your answers on paper, scan them into a PDF file (if they are images, paste the images into a Word document and save it as a PDF file).

b)   Write your Python code (e.g. in a Jupyter notebook), then save it as a PDF file.

c)   Combine all the PDF files above into one PDF file.

Each assignment submission, whether individual or partnered, must be your own work. Scholastic offences are taken seriously. Please refer to this website for details:

http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf

Grade Breakdown:

Part 1: Written Answer

Question 1          12

Question 2          13

Question 3          10

Question 4          10

Total Points =    45

Part 2: Python

Question 5         10

Question 6         15

Question 7         20

Total Points =    45

Total Points: 90

Part 1 – Written Answer (Be sure to show all your work)

Question 1 [12 Points]

The heights of elephants are approximately Normally distributed, with a mean of 2.5 metres and a standard deviation of 0.5 metres. For each part below, sketch a Normal curve, shade the area representing the region, and use Table A to solve.

a.   [4 Points] What proportion of elephants have heights greater than 3.5 metres?

b.   [4 Points] What proportion of elephants have heights between 2.2 and 3.5 metres?

c.    [4 Points] What height must an elephant be to fall in the top 25% of elephants?
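
For reference, the three parts above all rest on standardizing a height to a z-score before consulting Table A. The relationships can be sketched as follows, assuming Table A lists the area to the left of z (written here as \Phi, a symbol not used in the original) and with z^{*} denoting the table value that has area 0.75 to its left:

```latex
\begin{align*}
z &= \frac{x - \mu}{\sigma}, \qquad \mu = 2.5,\ \sigma = 0.5 \\
P(X > x) &= 1 - \Phi(z) \\
P(a < X < b) &= \Phi\!\left(\frac{b - \mu}{\sigma}\right) - \Phi\!\left(\frac{a - \mu}{\sigma}\right) \\
x_{\text{cutoff}} &= \mu + z^{*}\sigma, \qquad \Phi(z^{*}) = 0.75
\end{align*}
```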

Question 2 [13 Points]

The table below gives the self-reported heights of 10 university men (Son’s height) along with the heights of their fathers (Father’s height), both in inches.

Father’s height (x)   73   75   69   72   74   68   76   71   75   70
Son’s height (y)      72   74   68   71   73   69   74   71   73   70

a.   [6 Points] Draw, by hand, a scatterplot for this dataset. Comment on the direction, form and strength of this relationship.

b.   [5 Points] Find the correlation between the father’s and son’s heights. Show all of your work. Round each calculation to 2 decimal places.

c.    [2 Points] Do the value and the sign (positive or negative) of the correlation in part (b) make sense based on the scatterplot from part (a)?  Explain.
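
A hand calculation of the correlation can be checked with a short script. The sketch below computes r as the sum of products of z-scores divided by n - 1. It assumes the first ten values in the table are the fathers' heights and the next ten are the sons' heights, paired by position. The helper names `sample_sd` and `correlation` are illustrative only.

```python
from math import sqrt

# Heights in inches, paired by position. ASSUMPTION: the first ten values in
# the table are the fathers' heights and the next ten are the sons' heights.
fathers = [73, 75, 69, 72, 74, 68, 76, 71, 75, 70]
sons = [72, 74, 68, 71, 73, 69, 74, 71, 73, 70]


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(x, y):
    """Correlation r: sum of z-score products divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x, sd_y = sample_sd(x), sample_sd(y)
    z_products = [
        ((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y) for xi, yi in zip(x, y)
    ]
    return sum(z_products) / (n - 1)


r = correlation(fathers, sons)
print(f"r = {r:.4f}")
```

Because the script keeps full precision, its result may differ slightly from a hand calculation that rounds every intermediate step to 2 decimal places.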

Question 3 [10 Points]

Consider a study investigating the relationship between husband’s height and wife’s height in young couples. The mean height of wives (x) in their thirties participating in the study is about 66.2 inches, with a standard deviation of about 3.6 inches. The mean height of husbands (y) of the same age is about 75.7 inches, with a standard deviation of about 2.2 inches. Suppose that the correlation between the heights of husbands and wives is about r = 0.64.

a.   [3 Points] What are the slope and intercept of the regression line of the husband’s height on the wife’s height in young couples? Interpret the slope in the context of the problem.

b.   [4 Points] Draw a graph, by hand, of this regression line for heights of wives between 55 and 70 inches. Predict the height of the husband of a woman who is 67 inches tall and plot the wife’s height and predicted husband’s height on your graph.

c.    [3 Points] What percentage of the variation in husband’s height is explained by the straight-line relationship with wife’s height? Do you expect the prediction made in part (b) to be accurate?
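
The quantities in this question follow from the summary statistics alone: the least-squares slope is r times the ratio of the standard deviations, and the line passes through the point of means. A minimal sketch for checking hand arithmetic follows. The function name `regression_from_summary` is an illustrative helper, not something from the course materials.

```python
def regression_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Least-squares slope and intercept from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept


# Summary values stated in the question (x = wife's height, y = husband's height).
slope, intercept = regression_from_summary(66.2, 3.6, 75.7, 2.2, 0.64)
predicted = intercept + slope * 67  # predicted husband's height for a 67-inch wife
r_squared = 0.64 ** 2               # fraction of variation in y explained by x

print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
print(f"prediction at x = 67: {predicted:.2f} inches")
print(f"r^2 = {r_squared:.4f}")
```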

Question 4 [10 Points]

Consider a study investigating the relationship between poverty and male life expectancy. Based on the observations in the table below, assume that you have already calculated the regression line to be: y = -0.1824242x + 85.33

Poverty Rank (x)           10   20   30   40   50   60   70   80   90   100
Male Life Expectancy (y)   83   82   80   78   76   75   72   71   70   66


a.    [4 Points] Use the regression line to calculate the 10 residuals for the observations.  What do the residuals add to?  Does this make sense?

b.    [4 Points] Draw a residual plot.  What does this plot tell us about the relationship between our two variables?

c.    [2 Points] What is an influential observation?  Do you think this dataset contains any?
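
A residual is the observed y minus the y predicted by the regression line. The sketch below shows one way to tabulate the residuals for the ten observations, using the line stated in the question. The function name `residuals` is illustrative.

```python
# Observations from the table: poverty rank (x) and male life expectancy (y).
poverty_rank = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
life_expectancy = [83, 82, 80, 78, 76, 75, 72, 71, 70, 66]

# Regression line given in the question: y = -0.1824242x + 85.33.
SLOPE, INTERCEPT = -0.1824242, 85.33


def residuals(x, y, slope, intercept):
    """Residual = observed y minus the y predicted by the line."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


res = residuals(poverty_rank, life_expectancy, SLOPE, INTERCEPT)
for xi, ri in zip(poverty_rank, res):
    print(f"x = {xi:3d}  residual = {ri:+.3f}")
print(f"sum of residuals = {sum(res):.3f}")
```

Any small nonzero sum reflects rounding in the stated coefficients rather than a property of least-squares residuals.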

Part 2 – Python (Be sure to show all your code and results)

Important Note: Because this is the Python part, all numbers and graphs must be produced using Python unless stated otherwise.

Question 5 [10 points]

The common fruit fly Drosophila melanogaster is the most studied organism in genetic research because it is small, is easy to grow, and reproduces rapidly. The length of the thorax (where the wings and legs attach) in a population of male fruit flies is approximately normal, with mean 0.830 millimeter (mm) and standard deviation 0.075 mm.

a.   [2 points] What proportion of flies have thorax length less than 0.7 mm?

b.   [2 points] What proportion of flies have thorax length greater than 1.0 mm?

c.    [2 points] Explain how you can calculate the proportion of flies with thorax length between 0.7 mm and 1.0 mm. (You can use words, math, diagrams, or a combination of these. You can write or draw them by hand.)

d.   [2 points] Calculate the proportion illustrated in part c.

e.   [2 points] What thorax length has 25% of flies above it?
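
Normal proportions and cutoffs like those in parts (a) to (e) can be computed from a cumulative distribution function and its inverse. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in. This is an assumption about tooling, and the course labs may instead use `scipy.stats.norm`, whose `cdf` and `ppf` methods play the same roles.

```python
from statistics import NormalDist

# Thorax length model from the question: mean 0.830 mm, sd 0.075 mm.
thorax = NormalDist(mu=0.830, sigma=0.075)

below_07 = thorax.cdf(0.7)                   # proportion below 0.7 mm
above_10 = 1 - thorax.cdf(1.0)               # proportion above 1.0 mm
between = thorax.cdf(1.0) - thorax.cdf(0.7)  # proportion between the two
top_quarter_cutoff = thorax.inv_cdf(0.75)    # 25% of flies lie above this length

print(f"P(X < 0.7)       = {below_07:.4f}")
print(f"P(X > 1.0)       = {above_10:.4f}")
print(f"P(0.7 < X < 1.0) = {between:.4f}")
print(f"top-25% cutoff   = {top_quarter_cutoff:.4f} mm")
```

The three proportions partition the whole curve, so they should add to 1, which is a quick sanity check on any answer.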

Question 6 [15 points]

Simulation studies play an important role in statistics and data science. They provide a fundamental tool for studying the properties of statistical estimators and models under various conditions because, in a simulation study, we know the “ground truth” of the data, which we seldom have in real-world data analysis. In this question, we are going to perform a basic simulation study related to the standard normal distribution.

a.    [2 points] Compute the proportion of values smaller than 0.8 for a standard normal distribution.

b.   [7 points] Write a function called `stats_normal` with argument `n` to perform the following tasks (Hint: review what we learned in Lab 2):

1.   Generate a sample from standard normal with a sample size equal to n.

2.   Compute the mean of the sample.

3.   Compute the standard deviation of the sample.

4.   Compute the proportion of values smaller than 0.8 in the sample.

5.   Return a dictionary (Recall Lab 1) containing the three numerical quantities above.

c.    [2 points] Set the random seed to 3. Run the function `stats_normal` with n = 50, 500, 5000. (Hint: you can simply write three lines of code. Using a loop is entirely optional.)

d.   [4 points] Explain in words: what kind of pattern can you see from the results in part c? (Hint: for a standard normal distribution, recall the population mean, standard deviation and proportion of values smaller than 0.8.)
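
The general shape of such a simulation study can be sketched as below. This is one possible structure, not the expected solution: it uses the standard-library `random` and `statistics` modules as stand-ins for whatever generator Lab 2 used, so the seeded values will not match a NumPy or SciPy run. The dictionary keys are illustrative names.

```python
import random
from statistics import NormalDist, mean, stdev


def stats_normal(n):
    """Draw n standard normal values and summarize them in a dictionary."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    return {
        "mean": mean(sample),
        "sd": stdev(sample),  # sample standard deviation (n - 1 divisor)
        "prop_below_0.8": sum(v < 0.8 for v in sample) / n,
    }


# Population value for comparison: proportion of a standard normal below 0.8.
population_prop = NormalDist().cdf(0.8)

random.seed(3)  # fixes the random draws so the run is repeatable
results = {n: stats_normal(n) for n in (50, 500, 5000)}
for n, summary in results.items():
    print(n, summary)
print("population proportion below 0.8:", round(population_prop, 4))
```

Setting the seed once before all three calls means the three samples are drawn consecutively from the same stream. Reseeding before each call is an equally valid design, but it gives different numbers.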

Question 7 [20 points]

The file RBC_Returns.csv contains the % daily change in the Royal Bank stock price from 1995 until October 1st, 2021.

a.   [3 points] Make a histogram of the % daily changes with the density curve estimate on the same plot. Comment on the shape of the distribution.

b.   [2 points] Calculate the mean (x̄) and standard deviation (s) of the % daily changes.

c.    [7 points] Using the results of part b, follow the steps below:

1.   Find the number of data points with % daily change between x̄ − 3s and x̄ + 3s.

2.   Divide this number by the total number of data points in the dataset obtaining a proportion.

3.   Explain in words: compare this proportion with the proportion between μ − 3σ and μ + 3σ given by the normal density curve. Which one is larger?

4.   Explain in words: based on your results, does there appear to be a departure from the normal distribution? (Hint: are the tails of the distribution heavier or lighter than normal?)
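
The counting in steps 1 and 2 and the normal benchmark in step 3 can be sketched as follows. The function `proportion_within_k_sd` is an illustrative helper, and the data are a hypothetical toy list rather than the RBC returns, since the layout of the CSV file is not described here.

```python
from statistics import NormalDist, mean, stdev


def proportion_within_k_sd(values, k=3):
    """Proportion of values strictly between mean - k*s and mean + k*s."""
    x_bar, s = mean(values), stdev(values)
    lower, upper = x_bar - k * s, x_bar + k * s
    inside = sum(lower < v < upper for v in values)
    return inside / len(values)


# Benchmark from the normal density curve: area between mu - 3*sigma and mu + 3*sigma.
normal_within_3 = NormalDist().cdf(3) - NormalDist().cdf(-3)

# HYPOTHETICAL toy data (not the RBC returns): 99 zeros and one extreme value.
toy = [0.0] * 99 + [100.0]
print("toy proportion within 3 sd:", proportion_within_k_sd(toy))
print("normal proportion within 3 sd:", round(normal_within_3, 4))
```

A sample proportion below the normal benchmark means more of the data lies beyond three standard deviations than a normal curve would predict.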

d.   [8 points] Follow the steps below:

1.   Make a boxplot of these % daily changes.

2.   Generate a sample from a normal distribution with mean and standard deviation equal to those computed in part b. The sample size should equal the total number of points in the dataset. Set the random seed to 123. (Hint: use `norm.rvs`)

3.   Make a boxplot of this normal sample.

4.   Set the random seed to 456. Redo steps 2 and 3.

5.   Explain in words: comparing these three boxplots, do they support your findings in part c (regarding whether there is a departure from the normal distribution)? Why? (Hint: pay attention to the outliers. We draw a second boxplot of a normal sample to reduce the effects of randomness, so that you can see the general outlier pattern of a normal distribution.)
