HDAT9400
ASSIGNMENT 2: DATA CLEANING
Due: Friday 1st November by 11:59 pm (AEST)
Weight: 20% of the total grade
Submission platform: Open Learning
Assignment components
For this assignment you will submit two documents:
1) Written documentation (saved as Word or PDF) of reproducible notes (data cleaning notes, flowcharts, data dictionary), and answers to assignment questions, AND
2) SAS code (saved as .sas)
Learning outcomes assessed
CLO2: Design and document data management plans involving data dictionaries and generation of metadata.
CLO3: Evaluate data quality.
CLO6: Generate syntax (code) required to produce analysis-ready datasets.
Assessment and Submission
Your assignment will be assessed on quality, based on the rubric provided on page 6. If you have multiple documents, zip them into a single folder and submit the zipped folder via Open Learning. Name the zipped file in the following format:
zID_assignment2
For example: z123456_assignment2
Penalty for late submission
A penalty of 5% per day will apply to late submissions of assessment tasks if special consideration has not been granted. Assessments submitted more than 5 days after the due date will not be marked and will receive a mark of 0 (in line with UNSW policy). For example, if you submit your assessment 2 days late, 10% (5% x 2 days) will be deducted from the assessment mark. Thus, if your assessment was marked as 75% but was submitted 2 days late, your final mark will be 65%.
This assessment is eligible for a short extension of 2 days. All applications for the short extension must be submitted via https://specialconsideration.unsw.edu.au/ before the assignment due date.
In case of illness or misadventure, you may apply for an extension, provided the request is made no later than 3 days after the assignment due date. Special consideration requests are handled by central student administration and should be submitted via https://student.unsw.edu.au/special-consideration. Documentation is required.
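The penalty arithmetic can be checked with a small sketch. It is written in Python purely for illustration and assumes that the deduction is in percentage points (as in the worked example) and that lateness is counted in whole days; the function name `final_mark` is hypothetical.

```python
def final_mark(raw_mark: float, days_late: int) -> float:
    """Apply the late penalty described above.

    Assumptions (not stated in the policy text): the 5% per day is deducted
    as percentage points, and days_late is a whole number of days.
    """
    if days_late <= 0:
        return raw_mark          # on time: no penalty
    if days_late > 5:
        return 0.0               # more than 5 days late: not marked
    return max(0.0, raw_mark - 5.0 * days_late)


# Worked example from the text: marked 75%, submitted 2 days late
print(final_mark(75, 2))
```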
Instructions
Read the assignment questions and familiarise yourself with the dataset. Start your work early to give yourself ample time to address any challenges that may arise. Maintain clear and reproducible notes, ensuring that your SAS code is well-annotated for ease of understanding and future reference. Regularly save your progress and make sure to back up your work frequently to prevent any data loss.
Assignment dataset
For this assignment you will be using a simulated (made-up) dataset. It contains information about the people and services in Riverside, a fictitious neighbourhood in Australia with a high proportion of culturally and linguistically diverse residents.
St Mungo's General Practice (SMGP) data
St Mungo's GP serves 60% of Riverside’s population and has multilingual staff. The SMGP data contains information about the clients who visited the practice. It was extracted by the Practice Manager with one record per client, based on the information recorded at the client’s most recent GP visit in 2023. The variables in the SMGP data are described in the SMGP data dictionary (page 5).
The dataset and a format program are saved in the zipped folder on Open Learning:
SMGP data: assignmt2_data.sas7bdat
SAS format program: assignmt2_formats.sas
Examine the contents of the SMGP data dictionary, the SAS dataset and formats program to familiarise yourself with the data for this assignment.
Assignment questions
Part 1: Data cleaning, documentation and data dictionary update 60% (out of 100%)
Explore the data and decide on the approach to clean the data. For Part 1, you are required to:
A. Present your work in cleaning the data in the following forms (38 marks):
- A written Word or PDF document explaining the process of data exploration, data cleaning, decisions made, and results of your analyses (20 marks);
- A flowchart to graphically communicate the procedures taken for cleaning data (8 marks); and
- Annotated SAS code showing your data exploration and data cleaning that results in a cleaned analysis file (10 marks).
B. Create new variables in the SMGP data (see definitions in Table 1) (14 marks):
- Create variable smoke_status to indicate a person’s smoking status. Describe/justify your decision in the written document (8 marks);
- Create variable risky_alcohol to classify health risk alcohol consumption (1 mark);
- Calculate BMI score (variable BMI) (1 mark);
- Create variable obese to indicate whether a person is obese (1 mark);
- Create variable highBP to indicate whether a person has high blood pressure (1 mark);
- Create variable multi_risk to indicate whether a person has multiple risk factors (2 marks).
Table 1: New variable definitions
| New variable | Values, definition, and label of values |
| smoke_status | 0 = Never smoked; 1 = Current smoker; 2 = Ex-smoker |
| risky_alcohol | 0 = No (≤ 2 drinks per day); 1 = Yes (> 2 drinks per day) |
| BMI | Weight / Height² (weight divided by the square of the height; weight is measured in kg, height is measured in meters) |
| obese | 0 = No (BMI < 30); 1 = Yes (BMI ≥ 30) |
| highBP | 0 = Normal blood pressure (Systolic < 140 mmHg and Diastolic < 90 mmHg); 1 = High blood pressure (Systolic ≥ 140 mmHg or Diastolic ≥ 90 mmHg) |
| multi_risk | 0 = Has fewer than two risk factors out of (current smoker, risky alcohol, obesity, highBP); 1 = Has at least two risk factors out of (current smoker, risky alcohol, obesity, highBP) |
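The threshold definitions above can be illustrated with a minimal sketch. It is written in Python only to show the logic; the submitted code must be SAS. The function names and the handling of missing inputs (returning `None`, and not counting a missing flag towards multi_risk) are assumptions that you would need to decide on and justify yourself. The derivation of smoke_status is deliberately left out because it depends on your cleaning decisions.

```python
from typing import Optional


def bmi(weight_kg: Optional[float], height_m: Optional[float]) -> Optional[float]:
    """Weight (kg) divided by height (m) squared; None when an input is missing."""
    if weight_kg is None or height_m is None or height_m <= 0:
        return None
    return weight_kg / height_m ** 2


def obese(bmi_value: Optional[float]) -> Optional[int]:
    """1 if BMI >= 30, 0 if BMI < 30, None if BMI is missing (assumption)."""
    if bmi_value is None:
        return None
    return 1 if bmi_value >= 30 else 0


def risky_alcohol(drinks_day: Optional[float]) -> Optional[int]:
    """1 if more than 2 drinks per day, else 0; None if missing (assumption)."""
    if drinks_day is None:
        return None
    return 1 if drinks_day > 2 else 0


def high_bp(syst_bp: Optional[float], diast_bp: Optional[float]) -> Optional[int]:
    """1 if systolic >= 140 or diastolic >= 90; None if either is missing (assumption)."""
    if syst_bp is None or diast_bp is None:
        return None
    return 1 if (syst_bp >= 140 or diast_bp >= 90) else 0


def multi_risk(current_smoker, risky, obese_flag, high_bp_flag) -> int:
    """1 if at least two of the four flags equal 1, else 0.

    Assumption: a missing flag (None) is not counted as a risk factor.
    """
    flags = (current_smoker, risky, obese_flag, high_bp_flag)
    return 1 if sum(1 for f in flags if f == 1) >= 2 else 0


# Made-up patient, not taken from the SMGP data
b = bmi(95, 1.75)
flags = (0, risky_alcohol(1), obese(b), high_bp(150, 85))
print(round(b, 1), flags, multi_risk(*flags))
```

When translating this logic to SAS, keep in mind that a missing numeric value compares as smaller than any number, so an unguarded condition such as `BMI < 30` would classify a missing BMI as not obese.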
C. Update the data dictionary. If you decide not to update a data dictionary, you should provide reasons for not updating the data dictionary (10 marks).
- Present an updated data dictionary based on results of parts A and B above.
Part 2: Research Questions 40% (out of 100%)
The manager of the St Mungo’s GP practice wants to plan a new health risk management program and needs to understand the characteristics of their patients. You will be helping the practice manager to analyse the GP dataset and report on patient characteristics. The analysis will be based on a cleaned GP dataset and the key findings in your report should be reproducible.
For Part 2, you are required to analyse the dataset that you cleaned in Part 1 and present your findings to the following questions in a Word or PDF document. You will also submit your SAS code which must clearly show how your results were generated.
A. Describe the patient cohort seen at St Mungo’s GP practice in 2023 in terms of their Age and Sex distribution and Country of Birth. Present your results in table format with accompanying text. Is the cohort representative of the Australian population? (14 marks)
- Your table(s) should be presented in an academic format similar to what would be found in the results section of a published journal article. You can present more than one table.
- You will need to conduct some very basic research on population characteristics in Australia.
The next questions focus only on 35–64-year-olds in the St Mungo’s patient cohort.
B. Examining just 35–64-year-olds, investigate and report on the relationship between the various health risk factors and socio-demographic characteristics (e.g. age, sex, country of birth).
- Your written interpretation of results should be presented in academic writing style.
- You will need to support your answer with relevant tables.
C. The practice manager wishes to use some available funding to run a risk reduction program for 35–64-year-olds. It will focus only on people with multiple risk factors (e.g. as measured using the newly created multi_risk variable). They wish to start the prevention program in just one region initially. Which region in Riverside would you recommend, based on risk factor rates and/or prevalence in each region? Explain your choice. (10 marks)
- You will need to explain and justify your choice by including supporting tables or results.
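As an illustration of what "prevalence in each region" means computationally, the sketch below calculates the proportion of 35–64-year-olds with multi_risk = 1 per region. It is written in Python only to show the logic (your analysis must be in SAS), the records are invented, and excluding records with a missing age or multi_risk from the denominator is an assumption you would need to justify.

```python
from collections import defaultdict


def multi_risk_prevalence(records, min_age=35, max_age=64):
    """Return {region: proportion with multi_risk == 1} for the given age band.

    Assumption: records with missing age or multi_risk are excluded
    from both numerator and denominator.
    """
    counts = defaultdict(lambda: [0, 0])  # region -> [cases, denominator]
    for r in records:
        age, flag = r.get("age"), r.get("multi_risk")
        if age is None or flag is None or not (min_age <= age <= max_age):
            continue
        counts[r.get("region")][0] += flag
        counts[r.get("region")][1] += 1
    return {region: cases / n for region, (cases, n) in counts.items()}


# Invented toy records, not taken from the SMGP data
toy = [
    {"region": "Northlands", "age": 40, "multi_risk": 1},
    {"region": "Northlands", "age": 50, "multi_risk": 0},
    {"region": "Westlands", "age": 36, "multi_risk": 1},
    {"region": "Westlands", "age": 70, "multi_risk": 1},  # outside 35-64, excluded
]
print(multi_risk_prevalence(toy))
```

A proportion alone can mislead when a region has few patients, so it is worth reporting the counts alongside it.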
D. Discuss whether any of the data quality issues found in Part 1 could impact the results to the questions above. If so, describe what impact it may have on your conclusions. (6 marks)
SMGP data dictionary
| Variable | Description | Variable type | Format name | Allowable entries |
| ID | Unique person ID | Number | | |
| region | Riverside region of residence | Character | | ‘Southlands’, ‘Westlands’, ‘Northlands’, ‘Other’ |
| GP_last | Date of most recent GP visit | Date | DDMMYY10. | Dates in the range 01/01/2023–31/12/2023 |
| age | Age of patient at the most recent GP visit in 2023 | Number | | |
| sex | Gender of the patient | Character | | 1=male 2=female |
| cob | In what country were you born? | Number | cobf. | 1= Born in Australia 2= Born overseas |
| healthcare_card | Do you have a healthcare card?¹ | Number | ynf. | 1= Yes 0= No |
| ever_smoked | Have you ever been a regular smoker? | Number | ynf. | 1= Yes 0= No |
| smoke_now | Are you a regular smoker now? | Number | ynf. | 1= Yes 0= No |
| age_start | How old were you when you started smoking regularly? | Number | | Invalid if <10 or >105 |
| age_stop | How old were you when you stopped smoking? Or when did you stop smoking? | Number | | Invalid if <10 or >105 |
| drinks_day | About how many alcoholic drinks do you drink per day? | Number | | Invalid if >20 |
| height | How tall are you without shoes? (meters) | Number | | Invalid if <0.55m or >2.40m |
| weight | About how much do you weigh? (kilograms) | Number | | Invalid if <5.0kg or >270kg |
| adverse_reaction | Have you had any adverse reaction to any medication? | Number | ynf. | 1= Yes 0= No |
| syst_bp | Systolic blood pressure (mmHg) | Number | | |
| diast_bp | Diastolic blood pressure (mmHg) | Number | | |
¹ Australian residents may be eligible for a Health Care Card if they receive financial support from the government. Benefits include a lower fee for prescription medicines under the Pharmaceutical Benefits Scheme, higher refunds for medical expenses through the Medicare Safety Net, and some other social concessions.
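The allowable entries in the data dictionary translate directly into range and membership checks. The sketch below shows one way to express them, in Python purely for illustration (the assignment code must be SAS). The names `RANGES`, `REGIONS` and `invalid_fields` are hypothetical, the example record is invented, and treating a missing value as "not out of range" is an assumption; missing values would normally be reported separately.

```python
from datetime import date

# (lowest valid, highest valid) taken from the data dictionary; None = no bound stated
RANGES = {
    "age_start": (10, 105),
    "age_stop": (10, 105),
    "drinks_day": (None, 20),
    "height": (0.55, 2.40),   # meters
    "weight": (5.0, 270),     # kilograms
}
REGIONS = {"Southlands", "Westlands", "Northlands", "Other"}


def invalid_fields(record: dict) -> list:
    """Return the names of variables whose values break a data dictionary rule."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = record.get(name)
        if value is None:
            continue  # assumption: missing values are handled in a separate check
        if (low is not None and value < low) or (high is not None and value > high):
            problems.append(name)
    region = record.get("region")
    if region is not None and region not in REGIONS:
        problems.append("region")  # exact match: case and spacing matter
    visit = record.get("GP_last")
    if visit is not None and not (date(2023, 1, 1) <= visit <= date(2023, 12, 31)):
        problems.append("GP_last")
    return problems


# Invented record: height entered in centimetres, lower-case region, visit in 2024
example = {"region": "southlands", "GP_last": date(2024, 1, 5),
           "height": 175, "weight": 80, "drinks_day": 3}
print(invalid_fields(example))
```

Flagging a value is only the first step; whether to correct it, set it to missing, or keep it is a cleaning decision to document in your notes and flowchart.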