Mini-Project (100 Points)
STAT 011 — Introduction to Statistical Modeling
Due Date: Friday, 3/15 at 11:59pm
Submission Instructions
• Please submit the project report (with all of the code, prose, and output done in RMarkdown) as a knitted .html or .pdf file in Canvas. Any submissions not in this format are subject to point deductions.
• No late reports will be accepted.
Background and Requirements
• The mini-project is designed to give you practical experience with statistical modeling concepts and tools, using real-world data in R. Specifically, you will choose a data set, explore and analyze parts of the data, formulate a few key questions you can answer with the data, fit the models, and evaluate the results.
• You are expected to turn in a 4-6 page report using RMarkdown, completing all of the tasks assigned.
• The report will be styled similarly to the lab assignments. As such, for each of the guidelines, you will need to have code, output, and prose. The report should be professional-looking, meaning that it should have complete sentences, correct grammar and spelling, and appropriate organization.
• Your report must be written in RMarkdown. A sample report template can be found on Canvas.
Working Time
Labs in Weeks 9 and 10 will be used as working time for the projects. Do not wait until Week 10 to get started on the mini-project. If questions arise as you are working on the mini-project, ask your TAs or instructor for clarification and guidance.
Guidelines
1. Pick a dataset. You will choose one of the following three datasets (available on Canvas) to analyze in your report.
• insurance.csv: 1,338 observations with 7 variables. Contains demographic information and medical costs for patients. You can find more information about the dataset at this link.
• mlb2023.csv: 766 observations with 31 variables. Contains hitting statistics of Major League Baseball players from the 2023 regular season. You can find more information about the dataset at this link.
• usedcars.csv: 301 observations with 9 variables. Contains general information about used vehicle sales listed on different websites. You can find more information about the dataset at this link.
2. Import the data into R. You will need to download the .csv file from Canvas and read it into R. Additionally, some of the datasets may contain missing values; any observations with missing values will need to be omitted before you continue this mini-project. You will find the following functions helpful.
• read.csv("filepath/name.csv", header = TRUE)
• na.omit(object)
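As a minimal sketch of how these two functions fit together, the example below uses R's built-in airquality data as a stand-in (an assumption made only so the code runs anywhere, since that data set happens to contain missing values). It writes the data to a temporary .csv file, reads it back, and drops the incomplete rows. The object names csv_path, raw_data, and clean_data are illustrative.

```r
# Stand-in data: the built-in airquality data set, which contains missing values.
# For the project, set csv_path to the location of the file downloaded from Canvas
# and skip the write.csv() step.
csv_path <- tempfile(fileext = ".csv")
write.csv(airquality, csv_path, row.names = FALSE)

# Read the file; header = TRUE treats the first row as variable names
raw_data <- read.csv(csv_path, header = TRUE)
dim(raw_data)                    # rows and columns before cleaning
colSums(is.na(raw_data))         # number of missing values in each variable

# Drop every row that has at least one missing value
clean_data <- na.omit(raw_data)
dim(clean_data)                  # rows and columns after cleaning
```

Comparing the two dim() results shows how many observations were lost, which is worth reporting in your prose.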
3. Explore at least four different variables. (20 pts) You will analyze the distributions and behaviors of at least four different variables. To do so, complete the following.
• Show at least three different plots. You can look at an individual variable or multiple variables in the same plot.
• Provide a five-number summary of at least two variables.
• For each of the plots and summaries, write a few sentences about the trends, behaviors, and characteristics seen in these variables.
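One possible way to produce the plots and summaries is sketched below with base R graphics. It uses the built-in mtcars data as a stand-in (an assumption; substitute your cleaned project data and your own variables). Note that fivenum() reports Tukey's hinges, so its quartiles can differ slightly from those printed by summary().

```r
# Stand-in data: built-in mtcars; replace with your cleaned project data frame
cars <- mtcars
cars$cyl <- factor(cars$cyl)     # treat number of cylinders as categorical

# Plot 1: histogram of a single quantitative variable
hist(cars$mpg, main = "Fuel efficiency", xlab = "Miles per gallon")

# Plot 2: side-by-side boxplots of a quantitative variable across groups
boxplot(mpg ~ cyl, data = cars, xlab = "Cylinders", ylab = "Miles per gallon")

# Plot 3: scatterplot of two quantitative variables
plot(cars$wt, cars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Five-number summaries: minimum, lower hinge, median, upper hinge, maximum
mpg_summary <- fivenum(cars$mpg)
wt_summary  <- fivenum(cars$wt)
mpg_summary
wt_summary
```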
4. Pick two of the following three tasks. Once you have picked them, continue to the next guideline.
a. Conduct a test for a difference of two means or a one-way ANOVA in R.
b. Perform multiple linear regression using at least three explanatory variables and one continuous response variable in R.
c. Perform logistic regression using at least two continuous explanatory variables and one binary response variable in R.
5. Analyze and discuss the results of the two tasks picked in the previous guideline. (40 pts for each task) Follow the instructions below for the two chosen tasks.
a. Difference of two means or one-way ANOVA
• Explain the problem you want to solve and the hypotheses associated with the testing.
• Check all conditions with an explanation as to whether they have been satisfied.
• Run the test with R, providing code.
• Interpret the p-value and confidence interval results. Draw any conclusions based on the R output.
• Provide context to the conclusion. What does this say about your initial problem?
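The sketch below shows one way either test could be run in R, again using the built-in mtcars data as a stand-in (an assumption; the variables and groups in your report will differ). By default t.test() runs the Welch version, which does not assume equal variances.

```r
# Stand-in data: built-in mtcars; replace with your cleaned project data frame
cars <- mtcars
cars$am  <- factor(cars$am, labels = c("automatic", "manual"))  # 0 = automatic, 1 = manual
cars$cyl <- factor(cars$cyl)

# Difference of two means: mean mpg for automatic vs. manual transmissions
tt <- t.test(mpg ~ am, data = cars)
tt$p.value       # p-value for H0: the two group means are equal
tt$conf.int      # 95% confidence interval for the difference in means

# One-way ANOVA: mean mpg across the three cylinder groups
fit_aov <- aov(mpg ~ cyl, data = cars)
summary(fit_aov)

# Quick looks at the conditions: similar spread across groups, roughly normal residuals
boxplot(mpg ~ cyl, data = cars, xlab = "Cylinders", ylab = "Miles per gallon")
qqnorm(residuals(fit_aov))
qqline(residuals(fit_aov))
```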
b. Multiple linear regression
• Explain which relationships you want to explore. Provide some context regarding why these relationships are important to investigate.
• Fit the linear regression model with R, providing the necessary code.
• Check diagnostics. Explain what each of the plots shows and whether the least-squares conditions are satisfied.
• Interpret the table of results from R. Explain what effect the explanatory variables have on the response.
• Explain which variables are significant and non-significant. Provide context to these conclusions.
• Provide predictions of the response variable for 5 new observations.
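A minimal sketch of these steps is given below, using the built-in mtcars data as a stand-in (an assumption). The five new observations in new_cars are made-up values chosen only to illustrate predict(); choose values that make sense for your own data.

```r
# Stand-in data: built-in mtcars; mpg is the continuous response
fit_lm <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Table of results: estimates, standard errors, t-statistics, p-values
summary(fit_lm)

# Diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit_lm)
par(mfrow = c(1, 1))

# Five hypothetical new observations (illustrative values only)
new_cars <- data.frame(
  wt   = c(2.2, 2.8, 3.1, 3.5, 4.0),
  hp   = c(90, 110, 150, 180, 230),
  qsec = c(19, 18.5, 17.5, 17, 16)
)

# Point predictions with 95% prediction intervals
preds <- predict(fit_lm, newdata = new_cars, interval = "prediction")
preds
```

The column names in the new data frame must match the explanatory variable names used in the model formula.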
c. Logistic regression
• Explain which relationships you want to explore. Provide some context regarding why these relationships are important to investigate.
• Fit the logistic regression model with R, providing the necessary code.
• Check diagnostics. Explain whether the model conditions are satisfied.
• Interpret the table of results from R. Explain what effect the explanatory variables have on the response.
• Explain which variables are significant and non-significant. Provide context to these conclusions.
• Provide predictions of the response variable (with the associated probabilities) for 2 new observations.
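As a rough sketch of the fitting and prediction steps, the example below uses the built-in mtcars data as a stand-in (an assumption), with the 0/1 variable am as the response. The two new observations and the 0.5 classification cutoff are illustrative choices, not requirements.

```r
# Stand-in data: built-in mtcars; am is coded 0 (automatic) / 1 (manual)
cars <- mtcars

# Logistic regression with two continuous explanatory variables
fit_glm <- glm(am ~ wt + hp, data = cars, family = binomial)

# Table of results: coefficients are on the log-odds scale
summary(fit_glm)
exp(coef(fit_glm))   # exponentiate to read the slopes as odds ratios

# Two hypothetical new observations (illustrative values only)
new_cars <- data.frame(wt = c(2.5, 3.6), hp = c(110, 175))

# type = "response" returns predicted probabilities that am = 1
pred_prob  <- predict(fit_glm, newdata = new_cars, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)   # classify using a 0.5 cutoff
data.frame(new_cars, pred_prob, pred_class)
```

Without type = "response", predict() returns values on the log-odds scale rather than probabilities.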