STAT0045. In-course Assessment 2 (2023/24 Session)
Department of Statistical Science
General Instructions
This assessment is classified as Coursework as defined in the UCL Student Regulations for Exams and Assessments (link (https://www.ucl.ac.uk/academic-manual/chapters/chapter-4-assessment-framework-taught-programmes/student-regulations-exams-and-assessments)). It contributes 40% to the overall mark for this module.
· The release date for this assessment is 16:00 (UK time) on Thursday, 29 February 2024.
· The submission deadline is 16:00 (UK time) on Monday, 11 March 2024.
· This assessment is individual work: you are required to work alone.
· Individual extensions to the submission deadline can only be granted where a student has been issued with a Summary of Reasonable Adjustments (SoRA) or has made a valid claim for extenuating circumstances.
· If you have a SoRA, your extension should be set up automatically and you should see it reflected in the deadline displayed in the submission portal. If you think that your SoRA adjustment has not been applied, please contact the module lead at the earliest opportunity.
· Extenuating circumstances are handled by your parent department and all claims should be submitted via Portico (link (https://www.ucl.ac.uk/academic-manual/chapters/chapter-2-student-support-framework/2-short-term-illness-and-other-extenuating-1)). Depending on the nature and severity of the circumstances, an alternative type of mitigation to a deadline extension may be considered more suitable.
· In preparation for this assessment, please ensure that you are familiar with the Department of Statistical Science’s guidance on academic integrity (link (https://www.ucl.ac.uk/statistics/sites/statistics/files/shbpc.pdf)). When submitting your work, you will be required to make a declaration that you have read and understood this guidance.
· Parts of your submission may be scanned using similarity detection software. If any breach of the assessment regulations is suspected, it will be investigated in accordance with UCL’s Student Academic Misconduct Procedure (link (https://www.ucl.ac.uk/academic-manual/chapters/chapter-6-student-casework-framework/section-9-student-academic-misconduct-procedure)).
· To facilitate anonymous marking, you should not write your name anywhere on your work, including in file names or file descriptions requested as part of the submission process.
· You must only submit your work via the designated portal in Moodle. If you try to submit via email or any other channel this will not count as a submission and will not be marked.
· There are strict, non-negotiable penalties for late submission, which for coursework are as follows:
  · Up to 2 working days late: deduction of 10 percentage points, but no lower than the pass mark.
  · 2-5 working days late: capped at the pass mark.
  · More than 5 working days late: mark of 1.00%.
· If the module lead becomes aware of a significant technical issue or outage affecting Moodle during the assessment, a message will be circulated to explain what has happened and the steps being taken to mitigate the issue. If you do not receive notification of a more widespread issue and you experience technical difficulties, you should refer to the Help & Support resources provided by UCL’s central IT service (link (https://www.ucl.ac.uk/isd/help-support)). However, last-minute technical issues will not be considered as valid grounds for missing the deadline, so ensure that you leave plenty of time to prepare, upload and check your submission.
· Non-submission (in the absence of any valid extenuating circumstances) will mean that your mark for this component is recorded as 0.00% and you will be deemed to have made an attempt.
· You should expect to receive feedback on this assessment within one calendar month of the submission deadline. In the event of a delay, the module lead will contact students directly with details of the revised timeline.
The assessment
· This assessment consists of two parts. For Part A, you can submit scanned/photographed hand-written solutions. Make sure that scanned work can be read clearly. Note the UCL advice on submitting scanned/photographed work (link (https://www.ucl.ac.uk/news/2020/apr/seven-simple-steps-submit-handwritten-answers-moodle-exams-or-assessments)). For Part B you are required to write a report and this report should be typed. Include a word count for this part.
· Part A and Part B are both marked on a scale of 0-100, and are equally weighted for the final mark. For Part A, marks for the constituent parts are listed in bold face. Marks are given for correct answers, but also for succinctness and clarity of explanation.
· To ensure anonymous marking, only provide your Student ID number at the top of Part A and B (and not your name). Part A and B should be submitted together in one PDF file. Submit the file with your Student ID as name; for example, if your ID is 20001234, use the name 20001234.pdf.
· You can use R for the questions in Part A, but do not hand in R code. R code in the submission will be ignored in the marking.
· For Part B, you are allowed to use an AI tool (such as ChatGPT), but you should acknowledge the use of this and explain the way you used it.
· You can use the course Forum to raise queries during the assessment, but only if the queries concern clarification of tasks in the assessment. The forum will be closed from 12 noon March 8th till March 12th.
Part A
Question 1
For this question, you have to download a data set that is identified by your Student ID number.
You can find the data in the section ICA 2 on Moodle. Be careful to select the data set that matches your own Student ID number: marking is partly based on student-specific data analysis.
· If your ID is 20001234 for example, then select and download the text file 20001234.txt and put the file in the working directory of your R session.
· Read in the file in your R session by the command dta <- read.csv(file="20001234.txt"), and have a look at the data. Example of using R for this:
> dta <- read.csv(file="20001234.txt")
> head(dta)
      y x
1 21.27 1
2 22.06 2
3 22.14 3
4 24.92 4
5 26.30 5
6 19.81 1
· If you cannot read in the data in R, contact the module lead as soon as possible via email: [email protected] .
The data are created in the format of a 30 × 2 table. The first column (y) is for response Y, and the second column (x) identifies the level of the treatment variable.
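A quick structural check of the data after reading it in can be sketched as follows. This is only an illustration under stated assumptions: the stand-in file below is built from the six example rows shown earlier, and it assumes the file is comma-separated with a header line (as the use of read.csv suggests). The name example_file is hypothetical; with your own data, read in the file named after your Student ID instead.

```r
# Stand-in for the downloaded file, built from the six example rows shown earlier.
# Assumption: the real file is comma-separated with a header line "y,x".
example_file <- tempfile(fileext = ".txt")
writeLines(c("y,x",
             "21.27,1", "22.06,2", "22.14,3",
             "24.92,4", "26.30,5", "19.81,1"),
           example_file)

dta <- read.csv(file = example_file)

dim(dta)       # number of rows and columns (the full data set should be 30 x 2)
names(dta)     # column names: response y and treatment level x
str(dta)       # storage type of each column
table(dta$x)   # number of observations at each treatment level
```

With the full data set, dim(dta) should report 30 rows and 2 columns, and table(dta$x) should list the five treatment levels.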
Your data concern a one-way ANOVA experiment for the time it takes to cycle to work in a big city in the UK. Response Y is in minutes, and the five treatment levels correspond with five different routes. The aim of the experiment is to establish how the time it takes to get into work is affected by the choice of route.
For the statistical inference, use a significance level of 5%.
(a) Define a one-way linear ANOVA model for response Y. Define the model such that the intercept can be estimated by the mean of the observed values for Y. Write down the model equation and specify this equation completely for your data. [7]
(b) Fit the model in (a) to your data and report the ANOVA table with clearly defined rows and columns. Using the model definition in (a), define the hypothesis for testing whether all five treatment group means are equal. Test this hypothesis using the ANOVA table. Be explicit about the distribution you use for this test. [8]
(c) Provide the point estimates for all the model parameters in (a). [5]
(d) For this experiment explicitly, give two examples of data collection that are not in line with the required randomisation for a one-way ANOVA. [10]
(e) Consider a one-way ANOVA defined for a treatment variable with three groups. Sample sizes for the three groups are given by $n_i$, for $i = 1, 2, 3$. The standard linear model is defined by $E[Y_{ij}] = \mu + \alpha_i$, where values $i$ and $j$ are defined by the design. To identify the model, the following corner-point parameterisation is used: $\alpha_1 = 0$.
Define the estimators of $\mu$ and $\alpha_2$ as functions of the response means $\bar{Y}_{i\cdot}$, for $i = 1, 2, 3$, and derive their variances as functions of the sample sizes $n_i$ and the error variance $\sigma^2$. Clearly explain your derivations. [10]
Question 2
Consider a city with three football clubs, labelled X, Y, and Z. The numbers of members are 400, 300, and 60, respectively. A sport scientist wants to investigate health outcome C for football players in the clubs. Outcome C is a continuous random variable.
(a) Give two reasons why a stratified sample design is a good choice in this case. [5]
Stratified sampling is used to collect data on C. Sample means and variances are derived:
Club   Sample size   Sample mean   Sample variance
X      25            2.0           0.5
Y      30            2.3           1.0
Z      25            2.9           0.5
(b) Calculate the 95% confidence interval for the mean of health outcome C. You may consider the sample sizes to be large. Show the details of your calculation. [5]
(c) Consider the situation where the sample means and sample variances are fixed to the values in the table. Illustrate numerically that another choice of sample sizes (with same total sample size) would be more efficient. Explain your reasoning and show the details of your calculation. [8]
(d) A journalist reads the report of the scientist and reports the interval in (b) as news about football players in the city. Assume that the data were correctly collected. What is likely to be incorrect about this news? [7]
As a follow-up study, the scientist plans to sample 50 of the city’s households at random and collect data on C for all the adult occupants of each sampled home.
(e) Explain why the sampling of households in the follow-up study is cluster sampling, and give one advantage of this sample design in this case. [5]
Question 3
(a) Consider Theorem 2.3 in Section 2.8.2 of the lecture notes. Using the notation in Section 2.8, provide the final details of the proof of Theorem 2.3. That is, derive that $E(\hat{\mu}_{CL}) = \mu$, and that
$$\frac{1}{m(m-1)} \sum_{i=1}^{m} (\mu_i - \hat{\mu}_{CL})^2$$
is indeed an unbiased estimator of $\mathrm{Var}(\hat{\mu}_{CL})$. Provide the details of the derivation. You do not have to explain the equations that are used in the proof of Theorem 2.3 in the lecture notes, but be clear which of these equations you use in your derivation. [12]
There are ten schools in a particular area. As part of an investigation into teaching standards, an inspection team proposes to visit three of the schools and administer a test to all of the 14-year-old students in each school visited. The school sizes (in hundreds of pupils) are as follows:
School 1 2 3 4 5 6 7 8 9 10
Size 22 18 17 21 11 23 16 22 26 24
(b) Three pseudo-random numbers, distributed uniformly on (0, 1), have been obtained using R. They are 0.821, 0.228 and 0.307. Use these to select a PPS sample of three schools, explaining your procedure clearly. [10]
(c) Suppose that schools 4, 7 and 2 were selected (note that these are not necessarily the schools that would be chosen using the random numbers provided above), and that the average test results (out of 20) for these three schools were 14.5, 16.7 and 13.6 respectively. Use these data to estimate the average test result across all ten schools. Provide an estimated standard error for your estimate. [8]
Part B
For this part you are required to write a short report discussing aspects of data ethics for a given scenario.
The scenario: At a UK university, the head of the Department of Data Science wants to predict which undergraduates will end up in good positions after graduation. The idea is to use data from past undergraduates to define a statistical prediction model, and use this model to identify high-potential current undergraduates and provide them with additional study options and extended personal mentoring.
You are asked to lead this project. The main statistical parts of the project are: collecting relevant data, data analysis, defining a model that can be used for prediction, and using the model to make a prediction for current undergraduates in the department.
Assume that the chosen prediction model is a logistic regression model for a binary response variable with value 1 for a good position and value 0 otherwise.
Instructions and guidelines for the report:
· Write a report that discusses the scenario with a focus on data ethics. Limit the scope of data ethics to the material that is discussed in STAT0045.
· You should explicitly use the following terms in the report (and reflect on the concepts attached to these terms): “data subject”, “model subject”, “fairness”, and “transparency”.
· You should discuss to some extent the importance of GDPR in this project and give at least one concrete example of a measure that you would implement to ensure that GDPR guidelines are followed. In the discussion of GDPR you should explicitly use the term “personal data” in the report.
· Assume the reader knows the logistic regression model; do not discuss standard aspects; for example, how the model is defined or how to estimate model parameters.
· Give the report a title.
· Type the report in a text editor and add the word count at the end of the report. Use font size 12.
· Write the report in paragraphs and complete sentences. Using a few bullet points is OK, but do not write the report as a list of bullet points.
· The maximum word count for the report (including the title) is 700 words. Reports longer than 700 words will be penalised.
· If you use an AI tool (see instructions), then use an appendix to acknowledge this use. This appendix does not count towards the maximum word count.
· You can add literature references to the report. References do not count towards the maximum word count. No need to add references to the STAT0045 course material.
Hints:
· There is no need to use AI tools for this report, and it is not likely to be helpful. Mind the danger of using AI tools; see Slides 23-25.
· Although it is fine to refer to literature beyond the course material, there is no specific need to do so.
· The aim of this assignment is to see whether you are able to critically reflect on aspects of data ethics in a practical scenario. Do not just enumerate definitions or aspects of data ethics; focus instead on some of the aspects and explain why they are important in this scenario.
· You are not asked to solve potential problems in this scenario, or provide details of specific actions. The report should focus on potential issues with respect to data ethics - not knowing how the issues can be addressed in detail is OK.
Marking criteria: adherence to the above instructions and guidelines, and the quality of the presentation (readability, structure, language). [100]
Submission check
Make sure that you only use your Student ID number in the submission (and not your name), that you answer all the questions in Part A, that you include Part B (with a word count), and that the PDF file you submit has your Student ID as name.