STA220H1 The Practice of Statistics I (Fall 2024)

Assignment 2 Instructions
Due Date: Nov. 22 at 11:59pm on Crowdmark

Instructions

This is an individual assignment, and you are expected to work on it independently. While you may discuss ideas and concepts, please do not share your code or written answers. All code and written work must be your own. Please note that this assignment is fairly open-ended, so the content of most of your work should not match that of your peers.

Submission Format and Instructions

Your final submission will be a PDF file, submitted on Crowdmark. There will be a different upload box for each question, so it is recommended that you place each question on a separate page or in a separate file.

Your PDF file will need to show (1) R code, (2) R output/figures, and (3) your written answers.

Here are some suggested ways you can create your final submission:

  • Use Microsoft Word to type out your answers. Screenshot your R output and place these images throughout the document. For the R code, either copy/paste as text or screenshot.
  • Use an app like Notability, OneNote, etc., where you can write/type your answers and include screenshots of your R code and output.
  • Use RMarkdown and knit to a PDF. Alternatively, you can knit to an HTML file and then save it as a PDF.

How you create the final file is up to you, as long as it is clear and organized. You don’t want the TA to be frustrated while marking your work!

Use of Built-In Functions in R

You are allowed to use built-in functions and packages in R, including functions that help with confidence intervals and hypothesis testing. However, be aware that not every built-in function we have seen in class returns the test statistic required in some questions, and built-in functions may not report all of the required intermediate calculation results.
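As one illustration of this caveat, the sketch below compares a hand-computed one-sample Z statistic with the output of prop.test(). The counts are made up for illustration and are not taken from the Airbnb data. prop.test() reports a chi-squared statistic rather than Z, and it applies a continuity correction unless correct = FALSE is set.

```r
# Hypothetical toy counts, NOT from the Airbnb dataset.
x  <- 56    # number of "successes"
n  <- 100   # sample size
p0 <- 0.5   # proportion under the null hypothesis

# Manual one-sample Z statistic; the standard error uses the null proportion.
z_manual <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)

# prop.test() returns a chi-squared statistic. With correct = FALSE,
# that statistic equals Z^2, so its square root recovers |Z| (the sign is lost).
built_in <- prop.test(x, n, p = p0, correct = FALSE)
z_from_builtin <- sqrt(unname(built_in$statistic))

print(round(z_manual, 3))
print(round(z_from_builtin, 3))
```

If a question asks for the Z-statistic, the standard error, or the critical value, it is usually simpler to compute them directly, as in the manual lines above, and to use the built-in function only as a cross-check.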

Late Penalty

As described in the course syllabus, late work will receive a deduction of 20% per day.

Data for this Assignment

In this assignment, you will work with the Airbnb dataset that you used in Assignment 1. Recall that the dataset contains comprehensive listing activity for homestays in New York City. It provides insights into the geographical location, pricing, reviews, and host details of each listing.

The following variables are provided in the data:

• Name: The name of the property listing.
• host_identity_verified: Indicates whether the host's identity has been verified.
• host.name: The name of the host.
• neighbourhood.group: The larger neighborhood group where the listing is located.
• neighbourhood: The specific neighborhood where the listing is located.
• lat: The latitude of the listing, based on the WGS84 geographic coordinate system.
• long: The longitude of the listing, based on the WGS84 geographic coordinate system.
• instant_bookable: Specifies whether the listing can be instantly booked.
• cancellation_policy: The type of cancellation policy applied to the listing.
• room.type: The type of room available for the listing (e.g., entire home, private room).
• construction.year: The year the property was constructed.
• price: The price of the listing per night, in USD.
• service.fee: The service fee per night, in USD.
• minimum.nights: The minimum number of nights required to book the listing.
• number.of.reviews: The total number of reviews received in the last 12 months.
• reviews.per.month: The average number of reviews received each month over the listing’s lifetime.
• review.rate.number: The average review rating of the listing.
• calculated.host.listings.count: The number of listings managed by the host, as calculated by Airbnb.
• availability.365: The number of days the listing is available for booking within the next 365 days.

In this assignment, you may wish to convert the ‘price’ variable to numeric format first, so that it can be used in further analysis. For example, the following code creates a new variable ‘price_numeric’ that removes the dollar sign and commas from the ‘price’ variable and converts the result to numeric format.

Airbnb_data$price_numeric <- as.numeric(gsub("[$,]", "", Airbnb_data$price))

You may also wish to remove observations that have NA (i.e., missing) for the variables of interest. For example, the following code creates a new dataset called ‘Airbnb_data_cleaned’ in which all observations with a missing value for either of the two variables ‘price_numeric’ and ‘instant_bookable’ are removed.

Airbnb_data_cleaned <- Airbnb_data
Airbnb_data_cleaned <- Airbnb_data_cleaned[!is.na(Airbnb_data_cleaned$instant_bookable), ]
Airbnb_data_cleaned <- Airbnb_data_cleaned[!is.na(Airbnb_data_cleaned$price_numeric), ]

The calculations in the following questions are based on the dataset after removing missing values (i.e., the ‘Airbnb_data_cleaned’ dataset).
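To see what the conversion and NA removal do, it can help to run the same steps on a very small data frame first. The sketch below uses a made-up data frame named toy. Its values, and the assumption that ‘instant_bookable’ is stored as TRUE/FALSE, are for illustration only and may not match the real file.

```r
# Hypothetical toy data frame that mimics the format of the two columns.
# The values are invented; the real column types may differ.
toy <- data.frame(
  price = c("$1,050", "$320", NA, "$75"),
  instant_bookable = c(TRUE, NA, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Remove "$" and "," and convert to numeric; a missing price stays NA.
toy$price_numeric <- as.numeric(gsub("[$,]", "", toy$price))

# Keep only rows where both variables are observed.
toy_cleaned <- toy[!is.na(toy$instant_bookable) & !is.na(toy$price_numeric), ]

print(toy_cleaned)
print(nrow(toy) - nrow(toy_cleaned))  # number of rows dropped
```

Comparing the row counts before and after cleaning, as in the last line, is a quick way to confirm how many observations were removed.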

Question 1 (10 marks)

We are interested in the proportion of listings that are instant bookable. Answer the following questions using the “Airbnb_data_cleaned” dataset. For calculations that you complete in R, show your code and output. (Please keep 3 decimal places for this question)

a) Test whether 50% of listings are instant bookable using α = 0.1. State the null and alternative hypotheses, calculate the Z-statistic and p-value using R, and state the relevant conclusion. (4 marks)
b) Construct a 99% confidence interval for the proportion of listings that are instant bookable using the estimated sample proportion for the standard error. Interpret the interval. Show your intermediate calculation results including the standard error and the critical values using R. (4 marks)
c) Redo part (b) with a conservative choice for the standard error and state the margin of error for this interval. (2 marks)
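For reference, the general shape of these calculations can be sketched on made-up data. The vector bookable below is hypothetical and stands in for a cleaned TRUE/FALSE column, so none of its results apply to the Airbnb data. The sketch assumes a two-sided test and the usual normal-approximation formulas; follow the conventions from class if they differ.

```r
# Hypothetical toy vector, NOT the Airbnb data.
bookable <- c(rep(TRUE, 230), rep(FALSE, 270))

n     <- length(bookable)
p_hat <- mean(bookable)   # sample proportion of TRUE values
p0    <- 0.5              # proportion under the null hypothesis

# Two-sided Z-test: the standard error uses the null proportion p0.
se_null <- sqrt(p0 * (1 - p0) / n)
z_stat  <- (p_hat - p0) / se_null
p_value <- 2 * pnorm(-abs(z_stat))

# 99% interval using the estimated sample proportion in the standard error.
z_crit <- qnorm(1 - 0.01 / 2)
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
ci_estimated <- p_hat + c(-1, 1) * z_crit * se_hat

# Conservative interval: p = 0.5 maximises p * (1 - p), so this
# standard error is never smaller than the estimated one.
se_cons  <- sqrt(0.25 / n)
moe_cons <- z_crit * se_cons          # margin of error
ci_conservative <- p_hat + c(-1, 1) * moe_cons

print(round(c(z = z_stat, p_value = p_value), 3))
print(round(c(se = se_hat, z_crit = z_crit), 3))
print(round(ci_estimated, 3))
print(round(c(margin_of_error = moe_cons), 3))
print(round(ci_conservative, 3))
```

Storing each intermediate quantity in its own variable makes it easy to show the standard error and critical value in your output.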

Question 2 (10 marks)

We are interested in the average price of a listing per night. Answer the following questions using the “Airbnb_data_cleaned” dataset. For calculations that you complete in R, show your code and output. (Please keep 3 decimal places for this question)
a) Test whether the average price of a listing per night is equal to $500 using α = 0.05. State the hypotheses, calculate the test statistic and p-value using R, and state the relevant conclusion. (4 marks)

b) Assuming the standard deviation of the price per night is known to be $200, re-do the hypothesis test in part (a). (2 marks)

c) Construct a 90% confidence interval for the average price of a listing per night and interpret the interval. Show your intermediate calculation results including the standard error and the critical values using R. (4 marks)
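The general shape of these calculations can again be sketched on made-up data. The vector price below is simulated purely for illustration, so its results say nothing about the Airbnb prices. The sketch assumes two-sided tests, a t-statistic when the standard deviation is estimated from the sample, and a Z-statistic when it is treated as known.

```r
# Hypothetical simulated prices, NOT the Airbnb data.
set.seed(220)
price <- round(rnorm(200, mean = 520, sd = 180))

n     <- length(price)
x_bar <- mean(price)
s     <- sd(price)
mu0   <- 500   # mean under the null hypothesis

# Standard deviation unknown: one-sample t-test with n - 1 degrees of freedom.
se_t   <- s / sqrt(n)
t_stat <- (x_bar - mu0) / se_t
p_t    <- 2 * pt(-abs(t_stat), df = n - 1)

# Standard deviation treated as known: Z-test.
sigma  <- 200
z_stat <- (x_bar - mu0) / (sigma / sqrt(n))
p_z    <- 2 * pnorm(-abs(z_stat))

# 90% t-interval for the mean.
t_crit <- qt(1 - 0.10 / 2, df = n - 1)
ci_90  <- x_bar + c(-1, 1) * t_crit * se_t

# Cross-check the t-test and interval against the built-in t.test().
check <- t.test(price, mu = mu0, conf.level = 0.90)

print(round(c(t = t_stat, p_t = p_t, z = z_stat, p_z = p_z), 3))
print(round(c(se = se_t, t_crit = t_crit), 3))
print(round(ci_90, 3))
```

Note that t.test() covers only the case where the standard deviation is estimated from the sample, so the known-sigma test is computed by hand here.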

Question 3 (30 marks)

In this question, you are going to write up a short analysis based on the Airbnb dataset. The analysis should target a statistical question that you pose about the dataset.

Requirements:
• A few sentences introducing the dataset. Assume the reader does not know anything about the dataset.
• A few sentences introducing the question you consider based on the data. For your question, you may describe the population of interest and the parameters of interest.
• Create at least one graph/figure related to your question. Describe the patterns you see in the graphs.
• Conduct at least one hypothesis test. For your test, state the hypotheses, the test statistic, the p-value, and the conclusion.
• Compute at least one confidence interval. For the interval you constructed, make sure you specify the level of confidence. Interpret your constructed interval.
• A few sentences summarizing and concluding the results of your analysis.
• An appendix that shows all code and R output.

Notes:
• Written text and graphs should appear in your write-up. All code and output (excluding graphs) should be included in an appendix, and should not appear in the main part of your write-up.
• Write-ups (excluding the appendix) should not exceed 500 words.
• You are welcome to filter the data before the analysis to adjust the population of interest.
• For the hypothesis tests and confidence intervals, you are expected to complete all the calculations in R.
• All writing should be in full sentences.
  o In the body of the text, use full sentences to describe your test/interval. For example, “We wish to test the null hypothesis that ____ versus the alternative hypothesis that ____. The test statistic is ___ with a p-value of ____.”
  o All calculations (R code and output) should be in the appendix.
• You are encouraged to use headings to organize your work.

Question 3 Rubric


Each criterion is marked at one of four levels: Inadequate, Fair, Good, or Excellent.

Writing Quality (10 marks)
• Inadequate (0-4 marks): Some written components are not included. Writing is unclear.
• Fair (5-6 marks): Most written components are provided. Written components contain major issues. The descriptions do not accurately describe the methods. Writing is somewhat unclear.
• Good (7-8 marks): All the written components are provided and show that the student is able to properly communicate statistical concepts. Writing is generally clear.
• Excellent (9-10 marks): All the written components are provided. Student exceeds expectations in statistical communication. Writing is clear and compelling.

Plots (5 marks)
• Inadequate (0-2 marks): Does not meet the requirement of 1+ plots.
• Fair (3 marks): Required plots are provided, but plots do not highlight the important information related to parameters of interest.
• Good (4 marks): Required plots are provided, and mostly show that the student is able to create a plot relevant for the situation. Plots are labelled properly.
• Excellent (5 marks): Required plots are provided, and a lot of thought was put into creating the plot. Plots are interesting, compelling, and communicate well to the viewer.

Hypothesis Tests and Confidence Intervals (5 marks)
• Inadequate (0-2 marks): Very few of the required hypothesis tests and confidence intervals are provided. Contains major errors.
• Fair (3 marks): Some of the required hypothesis tests and confidence intervals are provided. Errors with the set-up, calculations, and/or interpretations.
• Good (4 marks): The required hypothesis tests and confidence intervals are provided. Interpretations are provided and correct.
• Excellent (5 marks): The required hypothesis tests and confidence intervals are provided. Interpretations are provided. Conclusions are well written and provide an interesting discussion to the analysis.

Appendix, R code (5 marks)
• Inadequate (0-2 marks): R code is not shown or has many major errors.
• Fair (3 marks): R code is somewhat provided but is difficult to follow.
• Good (4 marks): R code is provided but contains errors or is hard to follow.
• Excellent (5 marks): R code is provided. Appropriate functions and/or calculations are used. Useful comments are used to make the code easy to read.

Formatting and Organization (5 marks)
• Inadequate (0-2 marks): Poorly organized and difficult to follow.
• Fair (3 marks): Sometimes difficult to follow. Code may appear in the body of the text.
• Good (4 marks): Organized and formatted well. Code does not appear in the body of the text.
• Excellent (5 marks): Very well organized and presentable. Code does not appear in the body of the text. Proper headings are used.
