HDAT9400 ASSIGNMENT 2: DATA CLEANING

Posted 2024-11-14 by Admin

Due: Friday 1st November by 11:59 pm (AEST)

Weight: 20% of the total grade

Submission platform: Open Learning

Assignment components

For this assignment you will submit two documents:

1) Written documentation (saved as Word or PDF) of reproducible notes (data cleaning notes, flowcharts, data dictionary), and answers to assignment questions, AND

2) SAS code (saved as .sas)

Learning outcomes assessed

CLO2 : Design and document data management plans involving data dictionaries and generation of metadata.

CLO3 : Evaluate data quality.

CLO6 : Generate syntax (code) required to produce analysis ready datasets.

Assessment and Submission

Your assignment will be assessed on quality based on the rubric provided on page 6. If you have multiple documents, please zip them into a single folder and submit the zipped folder via Open Learning. Name the zipped file in the following format:

zID_assignment2

Example: z123456_assignment2

Penalty for late submission

A penalty will apply for late submissions of assessment tasks (5% per day) if special consideration has not been granted. Assessments will not be marked if submitted more than 5 days after the assessment due date and will receive a value of 0 (in line with UNSW policy). For example, if you submit your assessment 2 days late, then 10% (5% x 2 days) will be deducted from the assessment mark. Thus, if your assessment was marked as 75% but was submitted 2 days late, then your final mark will be 65% only.
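
The penalty arithmetic above can be sketched as follows. This is only an illustration of the rule as worded here, written in Python for convenience; the function name `late_penalty_mark` is hypothetical. It assumes the deduction is in percentage points (consistent with the 75% to 65% example), that days late are counted as whole days, and that no special consideration or short extension applies.

```python
def late_penalty_mark(raw_mark: float, days_late: int) -> float:
    """Return the mark (in percent) after the late penalty.

    Assumptions (not stated in full by the handout): whole days only,
    deduction in percentage points, and a floor of 0.
    """
    if days_late <= 0:
        return raw_mark              # submitted on time: no deduction
    if days_late > 5:
        return 0.0                   # more than 5 days late: not marked
    return max(raw_mark - 5.0 * days_late, 0.0)  # 5 points per day late


# The handout's own example: marked 75%, submitted 2 days late.
print(late_penalty_mark(75, 2))
```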

This assessment is eligible for a short extension of 2 days. All applications for the short extension need to be submitted via https://specialconsideration.unsw.edu.au/ before the assignment due date.

In case of illness or misadventure, you may apply for an extension, provided the request is made no later than 3 days after the assignment due date. Special consideration requests are handled by central student administration and should be submitted via https://student.unsw.edu.au/special-consideration. Documentation is required.

Instructions

Read the assignment questions and familiarise yourself with the dataset. Start your work early to give yourself ample time to address any challenges that may arise. Maintain clear and reproducible notes, ensuring that your SAS code is well-annotated for ease of understanding and future reference. Regularly save your progress and make sure to back up your work frequently to prevent any data loss.

Assignment dataset

For this assignment you will be using a simulated (made-up) dataset, which contains information about the people and services in a fictitious neighbourhood in Australia called Riverside which has a high proportion of culturally and linguistically diverse populations.

St Mungo's General Practice (SMGP) data

St Mungo's GP serves 60% of Riverside’s population and has multilingual staff. The SMGP data contains information about the clients who visited the practice. It was extracted by the Practice Manager, with one record per client, based on the client’s information recorded at their most recent GP visit in 2023. The variables in the SMGP data are described in the SMGP data dictionary (page 5).

The dataset and a format program are saved in the zipped folder on Open Learning:

SMGP data: assignmt2_data.sas7bdat

SAS format program: assignmt2_formats.sas.

Examine the contents of the SMGP data dictionary, the SAS dataset and formats program to familiarise yourself with the data for this assignment.

Assignment questions

Part 1: Data cleaning, documentation and data dictionary update 60% (out of 100%)

Explore the data and decide on the approach to clean the data. For Part 1, you are required to:

A. Present your work in cleaning the data in the following forms (38 marks):

A written Word or PDF document explaining the process of data exploration, data cleaning, decisions made, and results of your analyses (20 marks);
A flowchart to graphically communicate the procedures taken for cleaning data (8 marks); and
Annotated SAS code showing your data exploration and data cleaning that results in a cleaned analysis file (10 marks).

B. Create new variables in the SMGP data (see definitions in Table 1) (14 marks):

Create variable smoke_status to indicate a person’s smoking status. Describe/justify your decision in the written document (8 marks);
Create variable risky_alcohol to classify health risk alcohol consumption (1 mark);
Calculate BMI score (variable BMI) (1 mark);
Create variable obese to indicate whether a person is obese (1 mark);
Create variable highBP to indicate whether a person has high blood pressure (1 mark);
Create variable multi_risk to indicate whether a person has multiple risk factors (2 marks).

Table 1: New health risk factors variables, values and definitions (in people aged >=18 years)

New variable	Values, definition, and label of values
smoke_status	0 = Never smoked; 1 = Current smoker; 2 = Ex-smoker
risky_alcohol	0 = No (≤2 drinks per day); 1 = Yes (>2 drinks per day)
BMI	Weight/Height² (weight divided by the square of the height; weight is measured in kg, height is measured in meters)
obese	0 = No (BMI < 30); 1 = Yes (BMI ≥ 30)
highBP	0 = Normal blood pressure (Systolic < 140 mmHg & Diastolic < 90 mmHg); 1 = High blood pressure (Systolic ≥ 140 mmHg or Diastolic ≥ 90 mmHg)
multi_risk	0 = Has less than two risk factors out of (current smoker, risky alcohol, obesity, highBP); 1 = Has at least two risk factors out of (current smoker, risky alcohol, obesity, highBP)
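
The threshold definitions in Table 1 can be sketched as below. The assignment itself requires SAS; this Python fragment only illustrates the logic, and the function name `derive_risk_factors` and its parameter names are hypothetical. It assumes every input is present and already cleaned, and it takes the current-smoker flag as given, because the derivation of smoke_status is a decision the assignment asks you to make and justify. How missing values should be handled is likewise left open here.

```python
def derive_risk_factors(current_smoker, drinks_day, weight_kg, height_m,
                        syst_bp, diast_bp):
    """Derive the Table 1 indicators for one adult (aged >= 18) record.

    Assumes complete, cleaned inputs; missing-value handling is not shown.
    """
    bmi = weight_kg / height_m ** 2                    # kg / m^2
    risky_alcohol = int(drinks_day > 2)                # >2 drinks per day
    obese = int(bmi >= 30)
    high_bp = int(syst_bp >= 140 or diast_bp >= 90)    # either reading is high
    n_risks = int(current_smoker) + risky_alcohol + obese + high_bp
    return {
        "BMI": bmi,
        "risky_alcohol": risky_alcohol,
        "obese": obese,
        "highBP": high_bp,
        "multi_risk": int(n_risks >= 2),               # at least two of the four
    }


# Hypothetical patient: non-smoker, 3 drinks/day, 95 kg, 1.75 m, BP 120/80.
print(derive_risk_factors(False, 3, 95, 1.75, 120, 80))
```

In this hypothetical record the patient is obese and drinks at a risky level, so multi_risk is 1 even though blood pressure is normal.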

C. Update the data dictionary. If you decide not to update a data dictionary, you should provide reasons for not updating the data dictionary (10 marks).

Present an updated data dictionary based on results of parts A and B above.

Part 2: Research Questions 40% (out of 100%)

The manager of the St Mungo’s GP practice wants to plan a new health risk management program and needs to understand the characteristics of their patients. You will be helping the practice manager to analyse the GP dataset and report on patient characteristics. The analysis will be based on a cleaned GP dataset and the key findings in your report should be reproducible.

For Part 2, you are required to analyse the dataset that you cleaned in Part 1 and present your findings to the following questions in a Word or PDF document. You will also submit your SAS code which must clearly show how your results were generated.

A. Describe the patient cohort seen at St Mungo’s GP practice in 2023 in terms of their Age and Sex distribution and Country of Birth. Present your results in table format with accompanying text. Is the cohort representative of the Australian population? (14 marks)

Your table(s) should be presented in an academic format similar to what would be found in the results section of a published journal article. You can present more than one table.
You will need to conduct some very basic research on population characteristics in Australia.

The next questions are focussed just on 35–64-year-olds in the St Mungo’s patient cohort.

B. Examining just 35–64-year-olds, investigate and report on the relationship between the various health risk factors and socio-demographic characteristics (e.g. age, sex, country of birth).

Your written interpretation of results should be presented in academic writing style.
You will need to support your answer with relevant tables.

C. The practice manager wishes to use some available funding to run a risk reduction program for 35–64-year-olds. It will focus only on people with multiple risk factors (e.g. as measured using the newly created multi_risk variable). They wish to start the prevention program in just one region initially. Which region in Riverside would you recommend based on risk factor rates and/or prevalence in each region? Explain your choice. (10 marks)

You will need to explain and justify your choice by including supporting tables or results.

D. Discuss whether any of the data quality issues found in Part 1 could affect your answers to the questions above. If so, describe what impact they may have on your conclusions. (6 marks)

SMGP data dictionary

Variable	Description	Variable type	Format name	Allowable entries
ID	Unique person ID	Number		
region	Riverside region of residence	Character		‘Southlands’, ‘Westlands’, ‘Northlands’, ‘Other’
GP_last	Date of most recent GP visit	Date	DDMMYY10.	Dates in the range 01/01/2023–31/12/2023
age	Age of patient at the most recent GP visit in 2023	Number		
sex	Gender of the patient	Character		1=male 2=female
cob	In what country were you born?	Number	cobf.	1= Born in Australia 2= Born overseas
healthcare_card	Do you have a healthcare card¹?	Number	ynf.	1= Yes 0= No
ever_smoked	Have you ever been a regular smoker?	Number	ynf.	1= Yes 0= No
smoke_now	Are you a regular smoker now?	Number	ynf.	1= Yes 0= No
age_start	How old were you when you started smoking regularly?	Number		Invalid if <10 or >105
age_stop	How old were you when you stopped smoking? Or when did you stop smoking?	Number		Invalid if <10 or >105
drinks_day	About how many alcoholic drinks do you drink per day?	Number		Invalid if >20
height	How tall are you without shoes? (meters)	Number		Invalid if <0.55m or >2.40m
weight	About how much do you weigh? (kilograms)	Number		Invalid if <5.0kg or >270kg
adverse_reaction	Have you had any adverse reaction to any medication?	Number	ynf.	1= Yes 0= No
syst_bp	Systolic blood pressure (mmHg)	Number		
diast_bp	Diastolic blood pressure (mmHg)	Number		
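
The "Allowable entries" column can be turned into simple validity checks. The sketch below is an illustration in Python rather than the SAS the assignment requires, and the function name `invalid_fields` is hypothetical. It covers only the numeric ranges, the 2023 date window and the region list given in the dictionary, and it assumes that missing values (represented as `None`) are skipped rather than flagged, which is a choice you would need to make and document yourself.

```python
from datetime import date

# (lowest valid value, highest valid value); None means no bound is given.
NUMERIC_RANGES = {
    "age_start": (10, 105),
    "age_stop": (10, 105),
    "drinks_day": (None, 20),
    "height": (0.55, 2.40),   # meters
    "weight": (5.0, 270),     # kilograms
}
REGIONS = {"Southlands", "Westlands", "Northlands", "Other"}


def invalid_fields(record):
    """Return the names of fields whose values break the data dictionary."""
    problems = []
    for name, (low, high) in NUMERIC_RANGES.items():
        value = record.get(name)
        if value is None:
            continue  # assumption: missing values are handled separately
        if (low is not None and value < low) or (high is not None and value > high):
            problems.append(name)
    visit = record.get("GP_last")
    if visit is not None and not (date(2023, 1, 1) <= visit <= date(2023, 12, 31)):
        problems.append("GP_last")
    region = record.get("region")
    if region is not None and region not in REGIONS:
        problems.append("region")
    return problems


# Hypothetical record: height entered in centimeters, visit dated in 2024.
example = {"height": 175, "weight": 80, "drinks_day": 2,
           "GP_last": date(2024, 1, 5), "region": "Southlands"}
print(invalid_fields(example))  # ['height', 'GP_last']
```

A check like this only flags values; whether a flagged value is corrected, set to missing or excluded is a cleaning decision to record in your notes and flowchart.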

¹ Australian residents may be eligible to have a Health Care Card if they receive financial support from the government. Benefits include a lower fee for prescription medicines under the Pharmaceutical Benefits Scheme, higher refunds for medical expenses through the Medicare Safety Net, and some other social concessions.
