HDAT9400 ASSIGNMENT 2: DATA CLEANING

Due: Friday 1st November by 11:59 pm (AEST)

Weight: 20% of the total grade


Submission platform: Open Learning

Assignment components

For this assignment you will submit two documents:

1) Written documentation (saved as Word or PDF) containing reproducible notes (data cleaning notes, flowcharts, data dictionary) and answers to the assignment questions, AND

2) SAS code (saved as .sas)

Learning outcomes assessed

CLO2: Design and document data management plans involving data dictionaries and generation of metadata.

CLO3: Evaluate data quality.

CLO6: Generate syntax (code) required to produce analysis-ready datasets.

Assessment and Submission

Your assignment will be assessed for quality against the rubric provided on page 6. If you have multiple documents, please zip them into a single folder and submit the zipped folder via Open Learning. Name the zipped file in the following format:

zID_assignment2

Example: z123456_assignment2


Penalty for late submission

A penalty will apply for late submission of assessment tasks (5% per day) if special consideration has not been granted. Assessments submitted more than 5 days after the due date will not be marked and will receive a mark of 0 (in line with UNSW policy). For example, if you submit your assessment 2 days late, 10% (5% x 2 days) will be deducted from the assessment mark. Thus, if your assessment was marked as 75% but was submitted 2 days late, your final mark will be 65%.
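
The penalty arithmetic can be sketched in a few lines of Python. This is only an illustration of the rule as worded here, not an official calculator: it assumes the deduction is in percentage points (consistent with the 75% to 65% example), that lateness is counted in whole days, and that no special consideration applies. The function name `late_adjusted_mark` is hypothetical.

```python
def late_adjusted_mark(raw_mark: float, days_late: int) -> float:
    """Return the mark after the late penalty (assumed: 5 points per whole day)."""
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > 5:
        # More than 5 days late: the assessment is not marked and receives 0.
        return 0.0
    # Deduct 5 percentage points per day, never going below zero.
    return max(raw_mark - 5.0 * days_late, 0.0)


print(late_adjusted_mark(75, 2))  # 65.0
```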

This assessment is eligible for a short extension of 2 days. All applications for the short extension need to be submitted via https://specialconsideration.unsw.edu.au/ before the assignment due date.

In case of illness or misadventure, you may apply for an extension, provided the request is made no later than 3 days after the assignment due date. Special consideration requests are handled by central student administration and should be submitted via https://student.unsw.edu.au/special-consideration. Documentation is required.

Instructions

Read the assignment questions and familiarise yourself with the dataset. Start your work early to give yourself ample time to address any challenges that may arise. Maintain clear and reproducible notes, ensuring that your SAS code is well-annotated for ease of understanding and future reference. Regularly save your progress and make sure to back up your work frequently to prevent any data loss.

Assignment dataset

For this assignment you will be using a simulated (made-up) dataset containing information about the people and services in Riverside, a fictitious neighbourhood in Australia that has a high proportion of culturally and linguistically diverse populations.

St Mungo's General Practice (SMGP) data

St Mungo's GP serves 60% of Riverside’s population and has multilingual staff. The SMGP data contains information about the clients who visited the practice. It was extracted by the Practice Manager, with one record per client, based on the client’s information recorded at their most recent GP visit in 2023. The variables in the SMGP data are described in the SMGP data dictionary (page 5).

The dataset and a format program are saved in the zipped folder on Open Learning:

SMGP data: assignmt2_data.sas7bdat

SAS format program: assignmt2_formats.sas

Examine the contents of the SMGP data dictionary, the SAS dataset and formats program to familiarise yourself with the data for this assignment.


Assignment questions

Part 1: Data cleaning, documentation and data dictionary update 60% (out of 100%)


Explore the data and decide on the approach to clean the data. For Part 1, you are required to:

A. Present your work in cleaning the data in the following forms (38 marks):


  • A written Word or PDF document explaining the process of data exploration, data cleaning, decisions made, and results of your analyses (20 marks);
  • A flowchart to graphically communicate the procedures taken for cleaning data (8 marks); and
  • Annotated SAS code showing your data exploration and data cleaning that results in a cleaned analysis file (10 marks).


B. Create new variables in the SMGP data (see definitions in Table 1) (14 marks):


  • Create variable smoke_status to indicate a person’s smoking status. Describe/justify your decision in the written document (8 marks);
  • Create variable risky_alcohol to classify health risk alcohol consumption (1 mark);
  • Calculate BMI score (variable BMI) (1 mark);
  • Create variable obese to indicate whether a person is obese (1 mark);
  • Create variable highBP to indicate whether a person has high blood pressure (1 mark);
  • Create variable multi_risk to indicate whether a person has multiple risk factors (2 marks).

Table 1: New health risk factors variables, values and definitions (in people aged ≥18 years)

New variable | Values, definition, and label of values
smoke_status | 0 = Never smoked; 1 = Current smoker; 2 = Ex-smoker
risky_alcohol | 0 = No (≤2 drinks per day); 1 = Yes (>2 drinks per day)
BMI | Weight/Height² (weight divided by the square of the height; weight is measured in kg, height is measured in meters)
obese | 0 = No (BMI < 30); 1 = Yes (BMI ≥ 30)
highBP | 0 = Normal blood pressure (Systolic < 140 mmHg & Diastolic < 90 mmHg); 1 = High blood pressure (Systolic ≥ 140 mmHg or Diastolic ≥ 90 mmHg)
multi_risk | 0 = Has less than two risk factors out of (current smoker, risky alcohol, obesity, highBP); 1 = Has at least two risk factors out of (current smoker, risky alcohol, obesity, highBP)
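
To make the Table 1 definitions concrete, the following Python sketch applies the stated thresholds to a single hypothetical patient. It is an illustration of the logic only: the submitted code must be written in SAS, the function names are invented for this example, and the handling of missing values (returning `None` whenever an input is missing) is an assumption rather than a prescribed rule. The sketch does not apply the age ≥18 restriction, and it takes the current-smoker flag as an input because deriving smoke_status is a decision you must make and justify yourself.

```python
def bmi(weight_kg, height_m):
    """BMI = weight (kg) divided by the square of height (m)."""
    if weight_kg is None or height_m is None or height_m <= 0:
        return None  # assumption: missing or unusable input gives a missing BMI
    return weight_kg / height_m ** 2


def obese(bmi_value):
    """1 if BMI >= 30, otherwise 0."""
    if bmi_value is None:
        return None
    return 1 if bmi_value >= 30 else 0


def risky_alcohol(drinks_day):
    """1 if more than 2 drinks per day, otherwise 0."""
    if drinks_day is None:
        return None
    return 1 if drinks_day > 2 else 0


def high_bp(syst_bp, diast_bp):
    """1 if systolic >= 140 mmHg or diastolic >= 90 mmHg, otherwise 0."""
    if syst_bp is None or diast_bp is None:
        return None  # assumption: both readings are needed
    return 1 if syst_bp >= 140 or diast_bp >= 90 else 0


def multi_risk(current_smoker, risky, obese_flag, highbp_flag):
    """1 if at least two of the four risk factors are present, otherwise 0."""
    flags = [current_smoker, risky, obese_flag, highbp_flag]
    if any(flag is None for flag in flags):
        return None  # assumption: any missing flag gives a missing result
    return 1 if sum(flags) >= 2 else 0


# Hypothetical patient: 95 kg, 1.75 m, 1 drink per day, BP 135/92, non-smoker.
patient_bmi = bmi(95, 1.75)
flags = (0, risky_alcohol(1), obese(patient_bmi), high_bp(135, 92))
print(round(patient_bmi, 1), flags, multi_risk(*flags))
```

Here the patient is obese (BMI of about 31.0) and has high blood pressure through the diastolic reading alone, so two risk factors are present and multi_risk is 1.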

C. Update the data dictionary. If you decide not to update the data dictionary, provide your reasons for not doing so (10 marks).


  • Present an updated data dictionary based on results of parts A and B above.


Part 2: Research Questions 40% (out of 100%)

The manager of the St Mungo’s GP practice wants to plan a new health risk management program and needs to understand the characteristics of their patients. You will be helping the practice manager to analyse the GP dataset and report on patient characteristics. The analysis will be based on a cleaned GP dataset and the key findings in your report should be reproducible.

For Part 2, you are required to analyse the dataset that you cleaned in Part 1 and present your findings to the following questions in a Word or PDF document. You will also submit your SAS code which must clearly show how your results were generated.

A. Describe the patient cohort seen at St Mungo’s GP practice in 2023 in terms of age and sex distribution and country of birth. Present your results in table format with accompanying text. Is the cohort representative of the Australian population? (14 marks)


  • Your table(s) should be presented in an academic format similar to what would be found in the results section of a published journal article. You can present more than one table.
  • You will need to conduct some very basic research on population characteristics in Australia.



The next questions are focussed just on 35–64-year-olds in the St Mungo’s patient cohort.

B. Examining just 35–64-year-olds, investigate and report on the relationship between the various health risk factors and socio-demographic characteristics (e.g. age, sex, country of birth).


  • Your written interpretation of results should be presented in academic writing style.
  • You will need to support your answer with relevant tables.


C. The practice manager wishes to use some available funding to run a risk reduction program for 35–64-year-olds. It will focus only on people with multiple risk factors (e.g. as measured using the newly created multi_risk variable). The program will initially start in just one region. Which region in Riverside would you recommend based on risk factor rates and/or prevalence in each region? Explain your choice. (10 marks)


  • You will need to explain and justify your choice by including supporting tables or results.


D. Discuss whether any of the data quality issues found in Part 1 could affect your results for the questions above. If so, describe the impact they may have on your conclusions. (6 marks)


SMGP data dictionary

Variable | Description | Variable type | Format name | Allowable entries
ID | Unique person ID | Number | |
region | Riverside region of residence | Character | | ‘Southlands’, ‘Westlands’, ‘Northlands’, ‘Other’
GP_last | Date of most recent GP visit | Date | DDMMYY10. | Dates in the range 01/01/2023–31/12/2023
age | Age of patient at the most recent GP visit in 2023 | Number | |
sex | Gender of the patient | Character | | 1 = male; 2 = female
cob | In what country were you born? | Number | cobf. | 1 = Born in Australia; 2 = Born overseas
healthcare_card | Do you have a healthcare card¹? | Number | ynf. | 1 = Yes; 0 = No
ever_smoked | Have you ever been a regular smoker? | Number | ynf. | 1 = Yes; 0 = No
smoke_now | Are you a regular smoker now? | Number | ynf. | 1 = Yes; 0 = No
age_start | How old were you when you started smoking regularly? | Number | | Invalid if <10 or >105
age_stop | How old were you when you stopped smoking? Or when did you stop smoking? | Number | | Invalid if <10 or >105
drinks_day | About how many alcoholic drinks do you drink per day? | Number | | Invalid if >20
height | How tall are you without shoes? (meters) | Number | | Invalid if <0.55m or >2.40m
weight | About how much do you weigh? (kilograms) | Number | | Invalid if <5.0kg or >270kg
adverse_reaction | Have you had any adverse reaction to any medication? | Number | ynf. | 1 = Yes; 0 = No
syst_bp | Systolic blood pressure (mmHg) | Number | |
diast_bp | Diastolic blood pressure (mmHg) | Number | |

¹ Australian residents may be eligible to have a Health Care Card if they receive financial support from the government. Benefits include a lower fee for prescription medicines under the Pharmaceutical Benefits Scheme, higher refunds for medical expenses through the Medicare Safety Net, and some other social concessions.
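
The "Allowable entries" column of the data dictionary can be read as a set of range checks. The Python sketch below shows one way such checks might be expressed for a single record. It is illustrative only, since the assignment itself must be completed in SAS: the `RANGES` table and `invalid_fields` function are hypothetical names, the record is made up, and skipping missing values (rather than flagging them) is an assumption. Whether a flagged value should be corrected, set to missing, or excluded remains a data cleaning decision to document.

```python
from datetime import date

# (lowest valid value, highest valid value); None means no limit on that side.
RANGES = {
    "GP_last": (date(2023, 1, 1), date(2023, 12, 31)),
    "age_start": (10, 105),
    "age_stop": (10, 105),
    "drinks_day": (None, 20),
    "height": (0.55, 2.40),  # meters
    "weight": (5.0, 270),    # kilograms
}


def invalid_fields(record):
    """Return the names of fields whose values fall outside the allowed range."""
    flagged = []
    for name, (low, high) in RANGES.items():
        value = record.get(name)
        if value is None:
            continue  # assumption: missing values are handled separately
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return flagged


# Hypothetical record with two out-of-range values.
example = {
    "GP_last": date(2023, 6, 14),
    "age_start": 8,
    "age_stop": 40,
    "drinks_day": 3,
    "height": 175,
    "weight": 80,
}
print(invalid_fields(example))  # ['age_start', 'height']
```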
