PHP 2511: Applied Regression Analysis, Spring 2025 Homework 1
Homework 1: Simple Linear Regression
Due: February 7 at 11:59 PM ET
– Please read each question carefully and provide complete answers including plots, tables, or sentence explanations as appropriate.
– Questions 1-3 must be completed using statistical software. You must show your code to receive full credit.
– county: County name
– state: State name
– women: Number of women residing in the county, estimated in the 2010 Census
– median_income: Median household income, estimated in the 2010 Census
– democrat_2008: Proportion of votes that went to the Democratic candidate in the 2008 presidential election
– highly_restrictive: Whether or not the state is labeled as Highly Restrictive (more details below)
– abortion_count_2010: Estimated number of abortions in the county in 2010
– dist_to_closest_facility_miles: Estimated distance to the closest facility providing abortions, in miles
– lcavol: log(cancer volume)
– lweight: log(prostate weight)
– age: age
– lbph: log(benign prostatic hyperplasia amount)
– svi: seminal vesicle invasion
– lcp: log(capsular penetration)
– gleason: Gleason score
– pgg45: percentage Gleason scores 4 or 5
– lpsa: log(prostate specific antigen)
– A single PDF file with complete answers and code for the homework questions (including all final tables and plots you used in your data analysis). The code that you used to conduct the statistical analysis should be well annotated. The code should be written in R or other statistical software. An assignment submitted without any code is incomplete.
| Question | | | | |
|---|---|---|---|---|
| 1 | Presents application of statistical software to correctly organize, summarize, and present data using graphical and tabular representations (a) to (c). Illustrates understanding of statistical tests to provide meaningful insights to assess associations, assumptions and the uncertainty of the results (a) to (d). | Explains some concepts, but shows misunderstanding for at least one part. | Does not address all concepts. Significant errors. | Incorrect. Partial credit. |
| 2a | Shows techniques using statistical software to fit regression models to data. Presents correct definition and interpretation of regression models and estimated regression line (a) to (c). Illustrates understanding of hypothesis testing to provide meaningful insights to assess associations (d). | Explains some concepts, but shows misunderstanding for at least one part. | Does not address all concepts. Significant errors. | Incorrect. Partial credit. |
| 2b | Shows techniques using statistical software to fit regression models to data. Presents correct definition and interpretation of regression models and estimated regression line (a) to (c). Illustrates understanding of hypothesis testing to provide meaningful insights to assess associations (d). | Explains some concepts, but shows misunderstanding for at least one part. | Does not address all concepts. Significant errors. | Incorrect. Partial credit. |
| 3 | Shows techniques using statistical software to present data using graphical representations. Presents correct definition and interpretation of results. | Explains some concepts, but shows misunderstanding for at least one part. | Does not address all concepts. Significant errors. | Incorrect. Partial credit. |
| 4 (BONUS) | Shows techniques using statistical software to calculate correlation coefficients, fit regression models to data, and extract summary statistics (a) to (e). | Explains some concepts, but shows misunderstanding for at least one part. | Does not address all concepts. Significant errors. | Incorrect. Partial credit. |
(c) Create a new plot with the variables abortions_per_1000_women and dist_to_closest_facility_miles.
Does there seem to be a relationship between the two variables? Write 1-2 sentences.
(d) Select statistical tests to examine the bivariate associations that you plotted in parts (b) and (c).
Assuming α = 0.05, are there associations between these variables? A complete answer should describe the statistical test you applied and the results. Write 2-4 sentences.
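The plotting and testing steps in parts (c) and (d) can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame `counties` is simulated here as a stand-in for the real county data, the simulated relationship is arbitrary, and the Pearson correlation test is one reasonable choice among several rather than the required one.

```r
# Simulated stand-in data (assumption): the real county data set is not used here.
set.seed(2511)
n <- 200
counties <- data.frame(
  dist_to_closest_facility_miles = runif(n, min = 0, max = 150)
)
# Arbitrary simulated relationship, floored at zero because a rate cannot be negative.
counties$abortions_per_1000_women <-
  pmax(0, 12 - 0.05 * counties$dist_to_closest_facility_miles + rnorm(n, sd = 3))

# Part (c): scatter plot of the two variables.
plot(abortions_per_1000_women ~ dist_to_closest_facility_miles,
     data = counties,
     xlab = "Distance to closest facility (miles)",
     ylab = "Abortions per 1,000 women")

# Part (d): Pearson correlation test of H0: correlation = 0.
assoc_test <- cor.test(counties$dist_to_closest_facility_miles,
                       counties$abortions_per_1000_women,
                       method = "pearson")
assoc_test$estimate   # sample correlation
assoc_test$p.value    # p-value to compare against alpha

# Decision at alpha = 0.05.
reject_null <- assoc_test$p.value < 0.05
```

If the scatter plot suggests a monotone but non-linear pattern, `method = "spearman"` in `cor.test()` is a rank-based alternative; whichever test is used, the write-up should name it and report its estimate and p-value.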
(b) Fit a simple linear regression model for the abortions_per_1000_women variable, using dist_to_closest_facility_miles as the only predictor variable in the model.
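A minimal sketch of such a fit in R, assuming a data frame named `counties` that holds both variables. The data below are simulated purely so the example runs on its own; the coefficients it produces say nothing about the real data.

```r
# Simulated stand-in data (assumption): replace with the real county data frame.
set.seed(2511)
n <- 200
counties <- data.frame(
  dist_to_closest_facility_miles = runif(n, min = 0, max = 150)
)
counties$abortions_per_1000_women <-
  10 - 0.04 * counties$dist_to_closest_facility_miles + rnorm(n, sd = 3)

# Simple linear regression: response ~ single predictor.
fit <- lm(abortions_per_1000_women ~ dist_to_closest_facility_miles,
          data = counties)

coef(fit)                   # estimated intercept and slope
confint(fit, level = 0.95)  # 95% confidence intervals for both coefficients
summary(fit)$coefficients   # estimates, standard errors, t statistics, p-values
```

The slope row of `summary(fit)$coefficients` carries the t-test of whether the slope differs from zero, which is the usual starting point for interpreting the estimated regression line.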
(a) Calculate the correlation between all the variables. Hint: cor().
(b) List all the variables (excluding lpsa) in descending order (from large to small) according to their correlation (R²) with lpsa. The table should include two columns: the variable name and the correlation. Hint: You may order the variables manually or using R statements.
(c) Fit eight regression models. The response (dependent) variable in all the models should be lpsa. The first model should include only the variable with the highest correlation as predictor (based on the table that you created). The second model should include the highest and the second-highest variables as predictors. The third model should include all the variables of the second model + the third-highest variable. The eighth and last model should include all the variables. Store each model object in an R variable. You do not need to show any outputs.
(d) Extract the residual standard error (RSE) and R² for each model and show these values in a table. The table should have three columns: the model number (1 to 8), RSE, and R². Create two plots. The x-axis should show the model number in both plots. The y-axis should show the RSE in the first plot and R² in the second.
(e) Show the summary of model 8. Should the gleason variable be kept in the model? Based on the table and plots, what happened to the RSE and R² when this variable was introduced?
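The workflow in parts (a) to (e) can be sketched in R as below. Several things here are assumptions made for illustration: the data frame `prostate_sim` is simulated (only the column names match the variable list), the weights in `true_weights` are arbitrary, the ranking uses the squared correlation with lpsa as one reading of "correlation (R²)", and the helper names `ranked_vars`, `models`, and `fit_table` are hypothetical.

```r
# Simulated stand-in for the prostate data (assumption): column names only.
set.seed(2511)
n <- 97
predictors <- c("lcavol", "lweight", "age", "lbph",
                "svi", "lcp", "gleason", "pgg45")
prostate_sim <- as.data.frame(
  matrix(rnorm(n * length(predictors)), nrow = n,
         dimnames = list(NULL, predictors))
)
true_weights <- c(0.8, 0.5, -0.1, 0.2, 0.4, 0.1, 0.05, 0.1)  # arbitrary
prostate_sim$lpsa <-
  as.vector(as.matrix(prostate_sim) %*% true_weights) + rnorm(n, sd = 0.7)

# (a) Correlation between all variables.
cor_matrix <- cor(prostate_sim)

# (b) Rank predictors by squared correlation with lpsa, largest first.
r2_with_lpsa <- sort(cor_matrix[predictors, "lpsa"]^2, decreasing = TRUE)
ranked_vars <- names(r2_with_lpsa)
rank_table <- data.frame(variable = ranked_vars,
                         r_squared = unname(r2_with_lpsa))

# (c) Model k uses the k top-ranked predictors; store all eight in a list.
models <- lapply(seq_along(ranked_vars), function(k) {
  lm(reformulate(ranked_vars[1:k], response = "lpsa"), data = prostate_sim)
})

# (d) Residual standard error and R-squared for each model.
fit_table <- data.frame(
  model = seq_along(models),
  RSE   = sapply(models, function(m) summary(m)$sigma),
  R2    = sapply(models, function(m) summary(m)$r.squared)
)
plot(fit_table$model, fit_table$RSE, type = "b",
     xlab = "Model number", ylab = "Residual standard error")
plot(fit_table$model, fit_table$R2, type = "b",
     xlab = "Model number", ylab = "R-squared")

# (e) Summary of the full model.
summary(models[[8]])
```

Because the models are nested, R² cannot decrease as predictors are added, whereas the RSE can rise: its denominator is the residual degrees of freedom, which shrinks by one with each added predictor. Comparing the two plots at the step where a given variable enters is therefore more informative than looking at R² alone.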