MATH1041 Statistics for Life and Social Science
Term 1, 2024
MATH1041 Assignment
Data: Together with this document, you should have received your unique dataset in an e-mail sent to your official university email address. The data (that is, your dataset) are available in a text file with the name 5380675.csv. If you have not received your dataset (double check your UNSW email inbox and the spam folder), please contact your lecturer.
Submission due date: Tuesday 9th April (Week 9) before 11:59 PM (Sydney time, AEST). Note that a late penalty of 5% of the maximal possible mark per day will apply. No assignment will be accepted more than five days after the deadline.
Your submission must contain your full name and student zID at the top of your assignment. Submit your assignment through Turnitin via Moodle. See the “Assessments Hub” section on Moodle for further information regarding online submission.
Please submit a neatly typed assignment either as a Microsoft Word document (.doc or .docx) or as a PDF document (.pdf) created, for instance, using Google Docs, LaTeX, RMarkdown or similar tools. See the information and help about the assignment in the assessment section on Moodle. For your convenience, a Microsoft Word template that is already in a format appropriate for this assignment can be downloaded from Moodle.
Verify that your assignment has been submitted correctly by downloading the submission receipt and clicking on the link to check that it displays correctly in the Turnitin viewer. If not, it is your responsibility to make the necessary amendment.
Typesetting (*) | Q1 | Q2 | Q3  | Q4  | Q5  | Q6 | Total
/2              | /5 | /9 | /13 | /17 | /15 | /4 | /65
(*) See the next pages and the “Assessments Hub” on Moodle for details, help and explanations about the assignment and typesetting.
Note that your assignment and dataset are unique. You cannot show your dataset or your assignment to anyone. It is your responsibility to keep your dataset and your assignment secret. Also, your assignment must be your own work. You cannot get any outside help in any form. If you have a question about the assignment, the only places where you can ask it are the MATH1041 Assignment forum, provided you do not reveal your data, or a staff consultation.
Computing assignment format
Keep in mind that this assignment is not only about assessing your Statistical skills; it is also about giving you feedback on your Mathematical writing skills. The assignment must be typeset correctly and provide complete explanations in complete English sentences and paragraphs. Think of this as practice for a document you might produce in your future studies or career that includes mathematical explanations.
Here are some more details that may assist you:
• Regarding the overall assignment structure, please answer all questions in the given order (that is, 1.a., 1.b., ... etc). Do not re-write the assignment questions again, only their label (write “3.e.” for instance when you start question 3.e.). Keep your answers brief, clear and concise. There is NO need to reproduce the cover sheet, i.e., the first 5 pages of the pdf file sent to you, in your assignment.
• Start your answer to each Question (1, 2, etc.) on a new page. Each Question should start on a new page, but sub-parts of a Question (such as Question 3.d., 3.e.) should continue on the same page.
• You are required to type up your entire assignment (in Microsoft Word, Google Docs, LaTeX, Overleaf or RMarkdown), including any equations. The only exception is the plots produced by RStudio: you can save these figures (use “export” in the bottom right window in RStudio) and then paste them into your assignment. Nothing can be handwritten and then scanned. As a UNSW student, you can download Microsoft Word for free, see: https://www.myit.unsw.edu.au/software-students.
• As in any properly typeset document containing mathematical symbols, you must use an equation editor for all maths symbols. For instance, you should write “$X$ is normal”, rather than “X is normal” (notice how the ‘X’ looks different?), and you should write “$t_{\text{obs}} = 1.23$”, rather than “tobs = 1.23”. The marking scheme for this criterion is the following: Are mathematical symbols typeset using the equation editor? 2 marks for ‘almost always’, 1 mark for ‘sometimes’, 0 marks for ‘rarely’. Help about the Microsoft equation editor can be found in a document called Microsoft Word Equation editor help for MATH1041, located on Moodle in the Assignment (20%) section within the Assessments Hub section of the MATH1041 Moodle page.
• You should add some working out for the questions involving calculations; do not just give the final answer. Note that you may get partial marks for clear explanations and a correct method even if you get the wrong answer. However, try to keep your solutions brief and concise. Depending on what the question is asking, your working out could consist of RStudio commands, a formula, or perhaps the main steps explaining how you arrived at your answer. You do not need to add all your R-code.
• Keeping your results to 3 or 4 significant figures should be fine. If there are multiple steps in a calculation, do not round any numbers until you have reached the final step. To help you do calculations correctly in RStudio without rounding, values should be stored as variables, rather than copying the output number into a further calculation. For example, if you are constructing a confidence interval and need to calculate t*, you should write the code: tstar <- qt(0.975, df = 10) and then use the variable tstar in calculating your confidence interval, rather than pasting in the number 2.228139.
• There is no requirement for font size and line spacing but please make sure your assignment is readable — do not make the font size too small or the spacing too compact.
• If the question asks you to produce a graph/plot, you should always include that graph in your answer, unless otherwise specified.
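The advice above about storing intermediate values can be sketched as follows. This is only an illustration under stated assumptions: the vector x holds made-up values (it is not taken from any assignment dataset), and the two-sided 95% interval for a mean is used purely as a convenient example of reusing tstar.

```r
# Hypothetical sample of n = 11 measurements (made-up values, illustration only).
x <- c(12.1, 9.8, 11.4, 10.6, 13.2, 9.5, 10.9, 12.7, 11.1, 10.2, 11.8)
n <- length(x)

# Store the critical value in a variable instead of retyping 2.228139.
tstar <- qt(0.975, df = n - 1)   # here df = 10

# Reuse the stored, unrounded value in the next step of the calculation.
half_width <- tstar * sd(x) / sqrt(n)
ci <- c(lower = mean(x) - half_width, upper = mean(x) + half_width)

# Round only when reporting the final result.
print(signif(ci, 4))
```

Rounding happens only in the final print() call, so no precision is lost in the intermediate steps.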
Scenario
Do NOT copy-paste the data shown in this section.
Parkinson’s disease (PD), or simply Parkinson’s, is a chronic degenerative disorder of the central nervous system in the brain that affects both the motor system and non-motor systems. The symptoms usually emerge slowly, and as the disease progresses, non-motor symptoms become more common. Early symptoms are tremor, rigidity, slowness of movement, and difficulty with walking, speaking or swallowing.
Problems may also arise with cognition, behaviour, sleep, and sensory systems. [a] The original dataset [b] analysed by J. Hlavnička et al. in 2017 [c] includes a random sample of 30 patients with early untreated Parkinson’s disease (PD); a second independent random sample of 50 patients with Rapid Eye Movement (REM) sleep behaviour disorder (RBD), who are at high risk of developing Parkinson’s disease; and a third independent random sample of 50 healthy controls (HC). All patients were scored clinically by a well-trained professional neurologist with experience in movement disorders.
All subjects were also examined during a single session with a speech specialist. In the (first) column Code, an entry such as RBD01 would indicate that this is Patient 01 out of 50 in the REM sleep Behaviour Disorder group.
The data you received by email are a random sample extracted from the original data described above. A limited number of rows of your personal dataset is shown below. The variables considered here are: Age, Sex, Duration (duration of pause intervals, in ms) and RateSpeech (rate of speech timing, in -/min), which are acoustic measures of the rhythmic organization of speech describing its quality, and FingerTaps (giving an ordered score in {0, 1, 2, 3, 4} to a finger tapping task, where 0 indicates “no problem” and 4 indicates “cannot or can only barely perform the task”). It is usually assumed that people with Parkinson’s disease tend to have, on average, a higher Duration of pause intervals and a lower Rate of speech timing.
[a] Source: Wikipedia.
[b] Source: UC Irvine Machine Learning Repository.
[c] Source: Sci Rep, (2017) Feb 2;7(1):12. The original paper conducts an entirely different analysis from this assignment. Reading this paper will not help you complete this assignment, and you should not refer to it in any of your answers.
## Code Age Sex Duration RateSpeech FingerTaps
## PD20 70 M 140 312 1
## RBD40 60 M 245 296 1
## PD28 60 M 154 340 1
## RBD17 69 F 201 279 1
## PD08 59 F 145 338 1
## PD21 70 M 155 334 0
## PD14 70 M 146 338 1
## RBD37 68 M 158 301 0
## PD12 37 F 129 365 2
## PD06 58 M 186 317 3
## HC34 66 M 119 402 <NA>
## PD16 64 F 137 386 1
## HC50 54 M 171 264 <NA>
## RBD10 69 M 145 329 0
## RBD20 75 M 132 339 0
## PD10 66 M 213 281 1
## RBD29 56 M 226 293 0
## HC42 68 M 158 315 <NA>
## PD17 73 F 146 339 1
## RBD38 65 M 190 309 2
## HC11 65 M 130 403 <NA>
## HC31 72 M 244 279 <NA>
## HC19 58 M 130 381 <NA>
## RBD07 64 M 175 270 0
## RBD41 65 M 203 264 0
## RBD12 63 M 250 285 0
## HC06 65 M 148 347 <NA>
## HC44 54 M 154 350 <NA>
## RBD42 68 M 181 302 1
## HC40 67 M 156 321 <NA>
## RBD35 62 M 162 337 0
## RBD16 61 M 181 278 0
## RBD08 74 M 133 352 0
## PD03 68 M 377 211 1
## HC07 45 M 138 312 <NA>
## RBD22 59 M 126 354 0
## HC20 60 M 129 329 <NA>
## HC21 40 F 105 399 <NA>
## PD25 77 M 220 311 1
## PD04 75 M 360 140 1
## .........................................
Reading the data into RStudio
The data are in a text file with the name 5380675.csv. This file was sent to you by e-mail (see page 1). To complete this assignment, you need to use the FULL dataset provided to you by email. Do NOT copy and paste the data on the previous page as your dataset.
The first step is to read the data into RStudio. The data format is like what you have already worked with in the Weekly Mobius lessons. Follow the instructions given in section R1.4 “How to import a text file into RStudio” of the RStudio “How-To-Manual” available on Moodle. Alternatively, you can also review your lecture slides to find the R function to use to import a CSV file into RStudio. Two arguments that are often used when calling this function are header = TRUE (to indicate that names of variables are present on the first line in the file) and row.names = 1 (to indicate that names of the cases are provided in the first column). Another very important argument is colClasses, which takes a vector of classes to be assumed for the columns (such as "character" for strings, "factor" for a categorical variable, and "numeric" for a quantitative variable). Once you have imported the data, you are ready to start your analysis!
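As a rough sketch of how these three arguments fit together, the example below writes a tiny stand-in file and reads it back. Several things here are assumptions made for illustration rather than statements from the course material: the use of read.csv() as the import function, the temporary stand-in file (three rows taken from the preview above, so that the example runs on its own), and the choice of classes passed to colClasses. Your own analysis must read the full file you received by email.

```r
# Stand-in file for illustration only: three rows from the preview table.
# The real assignment must use the full CSV file received by email.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Code,Age,Sex,Duration,RateSpeech,FingerTaps",
  "PD20,70,M,140,312,1",
  "RBD40,60,M,245,296,1",
  "HC34,66,M,119,402,NA"
), tmp)

# header = TRUE  : variable names are on the first line of the file
# row.names = 1  : case labels are taken from the first column
# colClasses     : one class per column of the file (assumed classes)
student.data <- read.csv(
  tmp,
  header = TRUE,
  row.names = 1,
  colClasses = c("character", "numeric", "factor",
                 "numeric", "numeric", "factor")
)

# Inspect the structure to confirm each column has the intended class.
str(student.data)
```

Calling str() right after the import is a quick way to spot a column that was read with an unintended class.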
Checkpoint: To make sure everything is all right, we suggest that you first calculate the average of the n = 105 values read from your file 5380675.csv for each quantitative variable, and check that they match the values given below. If your data have been stored in an R object called student.data, you can type print(colMeans(student.data[, unlist(lapply(student.data, is.numeric))], na.rm = TRUE), digits = 5) where na.rm = TRUE indicates to remove not available (<NA>) (i.e., missing) values.
## Age Duration RateSpeech
## 63.571 164.667 329.514
They do? It means you imported the data correctly in RStudio. You are ready to start!
IMPORTANT: Completing this checkpoint is essential. If you load in the dataset incorrectly, you will have incorrect answers throughout the entire assignment, and you will have marks removed for every incorrect answer.
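If you prefer to make this comparison programmatically rather than by eye, something along the following lines could be used. It is a minimal sketch, not part of the official instructions: the helper check_means() and its tolerance are hypothetical, and the toy data frame (three rows from the preview table) is there only so that the example runs on its own.

```r
# Hypothetical helper: compares column means of the numeric variables
# with a named vector of expected values, up to a rounding tolerance.
check_means <- function(data, expected, tol = 1e-3) {
  numeric_cols <- vapply(data, is.numeric, logical(1))
  observed <- colMeans(data[, numeric_cols, drop = FALSE], na.rm = TRUE)
  all(abs(observed[names(expected)] - expected) < tol)
}

# Toy data for illustration only (three rows from the preview table).
toy <- data.frame(
  Age = c(70, 60, 66),
  Sex = factor(c("M", "M", "M")),
  Duration = c(140, 245, 119)
)
print(check_means(toy, c(Age = 65.333, Duration = 168)))

# With the full dataset, the expected values are those printed above:
# check_means(student.data,
#             c(Age = 63.571, Duration = 164.667, RateSpeech = 329.514))
```

The tolerance allows for the fact that the reference values are printed to only five significant digits.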
The Analysis Tasks
The questions below follow a logical order that can be used for analysing real data. Also, working through these questions will help you better understand some concepts presented in the slides, which will be helpful for the final exam.
PART I: Study Design
Q1. In this question, you will think about the research questions and aspects of study design. For all parts in Q1, your answers should be no more than one sentence long.
1.a. Briefly explain the research question that the stakeholders are interested in, based on what is described in the scenario. Keep this in mind when you analyse the data in Parts II and III.
1.b. What is the population that is of interest to researchers?
1.c. What are the cases here? (We do not expect a list of all cases here.)
1.d. Is it an observational study or is it an experiment? Provide a brief justification for your answer.
With the markers in mind, in your assignment, please start every question on a new page.
Q2. In this question, you will describe the organisation of the data. For each of parts 2.a to 2.c, your answer should be no more than two sentences.
2.a. Your data is provided to you in a specific file format. What is the extension of the data file and what does the extension stand for?
2.b. What is the sample size? (We expect a value here.)
2.c. What are the IDs (labels)? Give only the ID of the first observation.
2.d. Complete the table below so that it lists all of the variables that are contained in the dataset and the type of each variable. You should add rows to the table as required. When describing the type of each variable, you should be more specific than just saying that the variable is categorical or quantitative, i.e. you should specify what kind of categorical or quantitative variable it is.
Table 1: Table to be completed and submitted with your assignment.
Variable Name Variable Type
With the markers in mind, in your assignment, please start every question on a new page.
PART II: Exploratory Data Analysis
Q3. Your second task, as for any statistician, is to explore your data with univariate analyses to gain a good understanding of each variable in the data set. This is always a good strategy to help you detect problems in a data set, and also to know enough about your data to better answer the research questions.
3.a. Let us deal with missing values first, if any. How many missing values are there in your dataset? You can determine this using the R function is.na(). (They are indicated by NA entries after importation into R, a code meaning “Not Available”.) Just state the number of missing values.
3.b. When doing initial data exploration, it is always good to consider the potential reasons for missing data and where they appear in the data set. One way to handle missing values is sometimes to replace all of them with a suitably chosen value. Other times, it is more appropriate to leave them as they are. Considering the scenario, and looking closely at your data, what is the appropriate strategy here? Justify your answer. Your answer should be no more than two sentences.
3.c. We now move on to univariate graphical summaries. Create a boxplot of the variable Age. Include it in your submitted assignment properly labelled.
3.d. Comment on the presence or absence of outliers in the boxplot you produced in part 3.c (in no more than one sentence).
3.e. Create an appropriate graphical summary for the variable FingerTaps (only for the subjects that are NOT healthy controls). Include it in your submitted assignment properly labelled.
3.f. Comment in no more than one sentence on the trend that you see in the graphical summary in part 3.e.
3.g. We now move on to univariate numerical summaries. Create an appropriate numerical summary for the variable Sex.
3.h. In no more than one sentence, comment on the result of part 3.g.
3.i. Compute the five number summary of variable Duration for all subjects combined (healthy and non-healthy). (Do NOT use the fivenum() function.)
3.j. In no more than one sentence, comment on the result of part 3.i.
With the markers in mind, in your assignment, please start every question on a new page.
Q4.
4.a. We now want to study the relationship between the variables Duration and RateSpeech. What type of graphical summary is appropriate for this? Just state the name of the summary (no justification needed here).
4.b. It is sometimes appropriate to add a least-squares line to graphical summaries of the kind referred to in part 4.a. Is it the case here? Just answer yes or no for this part.
4.c. Justify your answer to part 4.b. Write no more than 3 sentences.
4.d. Now, produce the graphical summary referred to in part 4.a. Ensure that your plot is properly labelled and include it in your assignment.
4.e. Describe the nature of the relationship observed on the plot you produced in part 4.d, using the four adjectives (or their antonyms) given in the lecture slides. Your answer should be no more than four sentences, but writing only one sentence should suffice. What else do you notice on this plot?
4.f. What is an appropriate numerical summary to describe the relationship between Duration and RateSpeech? Just state the name of the numerical summary.
4.g. Compute the value of the numerical summary referred to in part 4.f. Give your answer to at least two decimal places.
4.h. Comment on the value of the numerical summary you computed in part 4.g in no more than one sentence.
4.i. Given the results you obtained in the previous parts, it is only necessary to study either Duration or Rate of Speech. (By the way, do you understand why?) We will now focus on Duration. Produce two (side-by-side) boxplots to compare the Duration of healthy controls to the other subjects. Ensure that your plot is properly labelled and include it in your assignment.
4.j. Comment on the trend you see in the plot you produced in part 4.i in no more than one sentence.
With the markers in mind, in your assignment, please start every question on a new page.
PART III: Modeling and Inference
Q5. Now, we are going to do some modeling and statistical inference.
5.a. Let µ1 be the mean of variable Duration for healthy people. Let µ2 be the mean of variable Duration for non-healthy people. We want to compare µ1 to µ2 and we assume that µ1 is KNOWN (equal to 146) while µ2 is UNKNOWN. Recall the name of the hypothesis test strategy you can use here.
5.b. Perform an appropriate hypothesis test to compare the true means of Duration between healthy and non-healthy people. You must summarise all steps in your solution:
• (1) state the null (give both H0 and H̃0) and alternative (Ha) hypotheses relevant to the research objectives stated in this scenario,
• (2) an expression/formula for a suitable test statistic,
• (3) its observed value in the sample,
• (4) the null distribution for this statistic,
• (5) the expression of the P-value,
• (6) the numerical value of the P-value,
• (7) your interpretation of the P-value and
• (8) your conclusion in plain language.
5.c. Given what you know about the scenario, and by referring to the boxplot obtained in part II, and to any other calculation you can do, briefly discuss the validity of all the assumptions needed to safely apply this hypothesis test. What other graphs could you do here to verify some of these assumptions? (No need to include the graph in your assignment, just give the name of the graph and what it can be used for.)
With the markers in mind, in your assignment, please start every question on a new page.
Q6.
6.a. Produce a one-sided 95% confidence interval for the difference in means of Duration between healthy controls and non-healthy ones, still assuming that the mean for healthy people is known.
6.b. Does this confidence interval include the value µ1 given in part 5.a? Is your answer to this consistent with your conclusions from the hypothesis test in part 5.b?
6.c. Referring back to the scenario, write a one-sentence plain-language interpretation of the confidence interval obtained above.
END OF ASSIGNMENT