BUSI2045 Data Analytics for Business Decision Making

BUSI2045 - Homework 1

Note: Please include both R code and results in your solutions. (You may use the “Compile Report” function under the “File” menu in RStudio to generate a Word/PDF report containing both the R code and the results.)

Question 1: Data Exploration and Visualization (20 points)

Read the dataset income.csv into R and answer the questions below.

A.  Visualize the distribution of the variable income with a histogram. Your output should look similar to the reference figure.

Hints: (1) set binwidth to 800; (2) set the aesthetic attributes fill to “gold” and color to “black”; (3) change the x and y axis labels, then add a plot title.
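The hints above can be sketched as follows. This is only a minimal illustration, assuming the ggplot2 package is installed; the toy data frame `income_df` stands in for income.csv, and the axis labels and title are placeholders rather than the ones in the reference figure.

```r
library(ggplot2)

# Toy stand-in for income.csv (assumption); the real data would come from
# read.csv("income.csv").
set.seed(1)
income_df <- data.frame(income = round(rnorm(200, mean = 6000, sd = 2000)))

# Histogram using the settings named in the hints.
p_hist <- ggplot(income_df, aes(x = income)) +
  geom_histogram(binwidth = 800, fill = "gold", color = "black") +
  labs(x = "Income", y = "Count", title = "Distribution of income")

print(p_hist)
```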

B.  Create a bar plot to visualize the average income for people with different education levels. Your output should look similar to the reference figure.

Hints: (1) visualize education levels with different colors; (2) change the x and y axis labels as well as the scale name (i.e., fill), and add a plot title.
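One possible way to approach this is to aggregate first and plot second. The sketch below assumes ggplot2 is installed and uses a toy data frame; the column name `education` and its levels are assumptions, since the actual columns of income.csv are not listed here.

```r
library(ggplot2)

# Toy data; the column name `education` and its levels are assumptions.
edu_df <- data.frame(
  education = c("High school", "High school", "Bachelor",
                "Bachelor", "Master", "Master"),
  income    = c(3000, 3400, 5000, 5600, 7000, 7800)
)

# Average income per education level.
avg_income <- aggregate(income ~ education, data = edu_df, FUN = mean)

# One bar per level, colored by level; labs() renames the axes, the fill
# scale, and the title.
p_bar <- ggplot(avg_income, aes(x = education, y = income, fill = education)) +
  geom_col() +
  labs(x = "Education level", y = "Average income",
       fill = "Education", title = "Average income by education level")

print(p_bar)
```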

Question 2: Describe Data (40 points)

Read the file Assignment1_Q2.csv into R and answer the following questions.

A.  How many observations are in this data set? For each variable, what is the data type (numeric, integer, etc.)?

B.  Which variable(s) are discrete? Check the unique values for all discrete variables. Which variable(s) are continuous? Check the mean, standard deviation, and range for all continuous variables.
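The checks in parts A and B map onto a handful of base R functions. The sketch below uses a toy data frame in place of Assignment1_Q2.csv; which columns are discrete or continuous, and their values, are assumptions made for illustration.

```r
# Toy stand-in for Assignment1_Q2.csv; columns and values are assumptions.
q2 <- data.frame(
  X1 = c(1L, 2L, 3L, 4L),
  X2 = c("a", "b", "a", "b"),
  X4 = c(-1.5, 2.0, 3.5, 4.0)
)

nrow(q2)            # number of observations
sapply(q2, class)   # data type of each variable

# Unique values of a discrete variable.
unique(q2$X2)

# Mean, standard deviation, and range of a continuous variable.
cont_summary <- c(mean = mean(q2$X4), sd = sd(q2$X4),
                  min = min(q2$X4), max = max(q2$X4))
cont_summary
```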

C.  Construct a frequency table matching the reference table.

Hints: you may inspect the dataset to find suitable variables first. Ignore the order of the columns and rows.
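As an illustration of the mechanics only, `table()` builds a frequency table from one or more discrete variables. The sketch below assumes a two-way table over two hypothetical discrete columns, `X2` and `X3`; the variables used in the reference table may differ.

```r
# Toy data; X2 and X3 are hypothetical discrete variables.
q2 <- data.frame(
  X2 = c("a", "b", "a", "b", "a"),
  X3 = c("yes", "yes", "no", "no", "yes")
)

# Two-way frequency table: rows are levels of X2, columns are levels of X3.
freq <- table(q2$X2, q2$X3)
freq
```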

D.  Create a boxplot to show the distribution of X4 across each level of X2. The result should look like the reference figure.

What summary statistics does a boxplot visualize?
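A grouped boxplot and the numbers behind it can be sketched as below, assuming ggplot2 is installed and using toy values for X2 and X4. The base R function `boxplot.stats()` returns the five values a standard boxplot draws for each group: lower whisker, lower hinge, median, upper hinge, and upper whisker.

```r
library(ggplot2)

# Toy data; the real X2 and X4 come from Assignment1_Q2.csv.
q2 <- data.frame(
  X2 = rep(c("a", "b"), each = 9),
  X4 = c(1:9, 11:19)
)

# One box per level of X2.
p_box <- ggplot(q2, aes(x = X2, y = X4)) +
  geom_boxplot()

print(p_box)

# The five statistics drawn for each group.
stats_by_group <- lapply(split(q2$X4, q2$X2),
                         function(v) boxplot.stats(v)$stats)
stats_by_group
```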

E.  Construct a subset named Q2E where variable X4 only contains positive values. Then create a variable X6 which is the sum of X and X in Q2E. Visualize the distribution of X6 as in the reference figure.

Hints: please fill the area under the density curve with color ‘gold’.
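The filter, sum, and density steps might look like the sketch below, assuming ggplot2 is installed. The pair of variables being summed is an assumption: the sketch uses X4 and a hypothetical column X5 purely for illustration, on toy values.

```r
library(ggplot2)

# Toy data; X5 is a hypothetical column chosen only for illustration.
q2 <- data.frame(
  X4 = c(-2, 1, 3, 0, 5),
  X5 = c(10, 20, 30, 40, 50)
)

# Keep only rows where X4 is strictly positive.
Q2E <- subset(q2, X4 > 0)

# New variable as the sum of two columns (assumed pair: X4 and X5).
Q2E$X6 <- Q2E$X4 + Q2E$X5

# Density curve with the area underneath filled in gold.
p_density <- ggplot(Q2E, aes(x = X6)) +
  geom_density(fill = "gold")

print(p_density)
```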

Question 3: Describe Data (40 points)

Read the file Assignment1_Q3.csv into R and answer the questions below.

A.  Construct a subset named data_sub where the variable Income contains no missing values, i.e., no NA. How many observations and variables are in this subset?

Answer the following questions based on the subset data_sub.
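Dropping rows with a missing Income can be sketched with `is.na()`. The toy data frame `q3` below stands in for Assignment1_Q3.csv and its values are made up.

```r
# Toy stand-in for Assignment1_Q3.csv; values are made up.
q3 <- data.frame(
  Income            = c(50000, NA, 62000, 48000, NA),
  NumStorePurchases = c(4, 6, 8, 2, 5),
  NumWebPurchases   = c(3, 1, 7, 2, 4)
)

# Keep only the rows where Income is not NA.
data_sub <- q3[!is.na(q3$Income), ]

# Number of observations (rows) and variables (columns).
dim(data_sub)
```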

B.  What are the maximum, 75% quantile, 25% quantile, and minimum of the variable Income?
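All four values can be read off a single `quantile()` call, since the 0% and 100% quantiles are the minimum and maximum. The sketch below uses a toy vector and R's default quantile definition (type 7).

```r
# Toy income values, not taken from the assignment data.
income <- c(10, 20, 30, 40, 50)

# Minimum, 25% quantile, 75% quantile, and maximum in one call.
q <- quantile(income, probs = c(0, 0.25, 0.75, 1))
q
```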

C.  Write a function named ratio to compute the ratio of the interquartile range (the difference between the 75% and 25% quantiles) to the range (i.e., the difference between the max and min values) of a variable. Then apply this function to the three variables Income, NumStorePurchases, and NumWebPurchases in the dataset.

Hint: you may need to create a data frame with only the three variables before using the apply() function.
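A minimal sketch of such a function and its column-wise application is shown below. The data frame `vars` holds toy values, and the function assumes its input has no missing values and is not constant (otherwise the range would be zero).

```r
# Ratio of interquartile range to full range; assumes x has no NA values
# and is not constant.
ratio <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  (q[2] - q[1]) / (max(x) - min(x))
}

# Toy data frame holding only the three variables of interest.
vars <- data.frame(
  Income            = c(10, 20, 30, 40, 50),
  NumStorePurchases = c(1, 2, 3, 4, 9),
  NumWebPurchases   = c(0, 2, 4, 6, 8)
)

# MARGIN = 2 applies the function to each column.
ratios <- apply(vars, 2, ratio)
ratios
```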

D.  Can you repeat step C using an anonymous function?

Hint: you should obtain the same result as in step C.
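An anonymous function is simply a function defined inline without a name. One way this could look, on the same kind of toy data frame (values made up), is sketched below; `IQR()` and `diff(range())` are used here as a compact alternative to explicit quantile arithmetic.

```r
# Toy data frame with the three variables of interest.
vars <- data.frame(
  Income            = c(10, 20, 30, 40, 50),
  NumStorePurchases = c(1, 2, 3, 4, 9),
  NumWebPurchases   = c(0, 2, 4, 6, 8)
)

# The function is defined inline inside apply() and never given a name.
ratios_anon <- apply(vars, 2, function(x) IQR(x) / diff(range(x)))
ratios_anon
```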
