PHP 2511: Applied Regression Analysis


PHP 2511: Applied Regression Analysis, Spring 2025 Homework 1
Homework 1: Simple Linear Regression
Due: February 7 at 11:59 PM ET

Name:
Instructions:
• All homework assignments must be submitted on Canvas by the due date. The course syllabus includes details on the late policy.
• You are encouraged to collaborate with peers on assignments, but you must submit an individual assignment and all work must be your own. You must abide by the Brown University Academic Code concerning: i) use of sources, ii) creative work, iii) examinations, quizzes, and tests, and iv) laboratory work and assignments (https://college.brown.edu/design-your-education/academic-policies).
• There are 3 questions in this assignment, plus one bonus question (Question 4).
– Please read each question carefully and provide complete answers including plots, tables, or sentence explanations as appropriate.
– Questions 1-3 must be completed using statistical software. You must show your code to receive full credit.
• This assignment requires data analysis of two datasets. Please make sure that you have the data downloaded from Canvas to complete your assignment:
1. The idea of this question comes from the paper 'Association of Highly Restrictive State Abortion Policies With Abortion Rates, 2000-2014' by Brown et al. The data used for this problem is available in the file state_rep_laws.csv and comes from the American Community Survey (https://www.census.gov/programs-surveys/acs) and scraped state health department websites (source: http://www.johnstonsarchive.net/policy/abortion/). We use this data to examine how restrictions on abortion access are related to the number of abortions per 1000 women. The dataset has information on the following variables:
– county : County name
– state : State name
– women : Number of women residing in the county, estimated in the 2010 Census
– median_income : Median household income, estimated in the 2010 Census
– democrat_2008 : Proportion of votes that went to the Democratic candidate in the 2008 presidential election
– highly_restrictive : Whether or not the state is labeled as Highly Restrictive (more details below)
– abortion_count_2010 : Estimated number of abortions in the county in 2010
– dist_to_closest_facility_miles : Estimated distance to the closest facility providing abortions in miles
2. The prostate dataset comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy.
The data used for this problem is available in the file prostate.csv. The dataset has information on the following variables:
– lcavol: log(cancer volume)
– lweight: log(prostate weight)
– age: age
– lbph: log(benign prostatic hyperplasia amount)
– svi: seminal vesicle invasion
– lcp: log(capsular penetration)
– gleason: Gleason score
– pgg45: percentage Gleason scores 4 or 5
– lpsa: log(prostate specific antigen)
• Your homework submission should include:
– A single PDF file with complete answers and code to the homework questions (including all final tables and plots you used in your data analysis). The code that you used to conduct the statistical analysis should be well annotated. The code should be written in R or other statistical software. An assignment submitted without any code is incomplete.
• Files to Submit: submit one file saved as “PHP2511 Lastname Firstname HW1 Spring2025”. It should include answers to Questions 1 to 3 and your code in the Appendix.
• Grading: This assignment is worth 25 points. The grading criteria are based on the learning objectives listed below.
Concepts: basic concepts of exploratory data analysis (plots and tables), bivariable associations (t-tests, correlation analysis), simple linear regression, interpreting regression coefficients (Lectures 1-4 and Vittinghoff et al. CH 3.1-3.3, 4.1)
Learning Objectives:
1. Identify an appropriate method for visualizing data (univariate and bivariate distributions) and descriptive statistics (central tendency and dispersion measures) to provide meaningful insights and assess assumptions.
2. Understand statistical methods to assess bivariable associations (t-tests, correlation analysis).
3. Describe whether a regression model is appropriate and visually assess linear or nonlinear trends and outliers.
4. Interpret summary statistics for regression models including model coefficients (slope, intercept).
5. Understand how to draw inferences about regression coefficients.
6. Apply statistical software to organize, summarize, and present data using graphical and tabular representations.
Grading Rubric: This assignment is worth 25 points. Each question is worth 6 to 9 points. Points are given based on the following criteria: full points if student demonstrates proficiency in all learning goals, partial points if some learning goals are met, and 0 if answer is missing.

Question 1
– Presents application of statistical software to correctly organize, summarize, and present data using graphical and tabular representations (a) to (c). Illustrates understanding of statistical tests to provide meaningful insights to assess associations, assumptions and the uncertainty of the results (a) to (d).
– Explains some concepts, but shows misunderstanding for at least one part.
– Does not address all concepts. Significant errors.
– Incorrect.
– Partial credit.

Question 2a
– Shows techniques using statistical software to fit regression models to data. Presents correct definition and interpretation of regression models and estimated regression line (a) to (c). Illustrates understanding of hypothesis testing to provide meaningful insights to assess associations (d).
– Explains some concepts, but shows misunderstanding for at least one part.
– Does not address all concepts. Significant errors.
– Incorrect.
– Partial credit.

Question 2b
– Shows techniques using statistical software to fit regression models to data. Presents correct definition and interpretation of regression models and estimated regression line (a) to (c). Illustrates understanding of hypothesis testing to provide meaningful insights to assess associations (d).
– Explains some concepts, but shows misunderstanding for at least one part.
– Does not address all concepts. Significant errors.
– Incorrect.
– Partial credit.

Question 3
– Shows techniques using statistical software to present data using graphical representations. Presents correct definition and interpretation of results.
– Explains some concepts, but shows misunderstanding for at least one part.
– Does not address all concepts. Significant errors.
– Incorrect.
– Partial credit.

Question 4 (BONUS)
– Shows techniques using statistical software to calculate correlation coefficients, fit regression models to data, and extract summary statistics (a) to (e).
– Explains some concepts, but shows misunderstanding for at least one part.
– Does not address all concepts. Significant errors.
– Incorrect.
– Partial credit.

Homework 1
Question 1. Use statistical software to answer the questions below. Use the state_rep_laws.csv dataset to answer the following questions.
(a) Read the data from the external file and consider the variable state. Some states make abortion counts publicly available and others do not. Which states are missing? Given this information, do you think that the population in the data gives a good representation of the overall U.S. population? Write 1-3 sentences. Hint: there are different functions in R that you can use to answer this question. Consider the table() or unique() functions.
(b) Create a new variable named abortions_per_1000_women that is equal to the number of abortions per 1000 women in each county. Create at least two plots: 1) one plot should capture the distribution of abortions_per_1000_women and 2) the other plot should capture the distribution of abortions_per_1000_women stratified by the variable highly_restrictive. What do you observe in these plots? Write 2-4 sentences.

(c) Create a new plot with the variables abortions_per_1000_women and dist_to_closest_facility_miles. Does there seem to be a relationship between the two variables? Write 1-2 sentences.

(d) Select statistical tests to examine the bivariate associations that you plotted in parts (b) and (c). Assuming α = 0.05, are there associations between these variables? A complete answer should describe the statistical test you applied and the results. Write 2-4 sentences.
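
For orientation only, here is a minimal R sketch of the kinds of steps Question 1 describes. It runs on a small simulated data frame rather than the course file. The object names (toy, dist_miles, rate_per_1000, missing_states) and all values are made up for illustration, and the choice of a Welch t-test and a Pearson correlation test is an assumption, not a prescribed answer. The built-in vector state.name lists the 50 states and does not include the District of Columbia.

```r
# Toy stand-in for the county-level data. All names and values are
# hypothetical; replace `toy` with the data frame read from the course CSV.
set.seed(2511)
toy <- data.frame(
  state              = rep(c("Ohio", "Texas", "Maine", "Utah"), each = 25),
  women              = sample(5000:50000, 100, replace = TRUE),
  highly_restrictive = rep(c(0, 1, 0, 1), each = 25),
  dist_miles         = runif(100, min = 1, max = 150)
)
toy$abortion_count <- rpois(100, lambda = 0.01 * toy$women)

# (a) Which states appear, and which of the 50 states do not?
table(toy$state)                                    # rows per state
missing_states <- setdiff(state.name, unique(toy$state))

# (b) Rate per 1000 women, then its overall and stratified distributions.
toy$rate_per_1000 <- 1000 * toy$abortion_count / toy$women

if (!interactive()) pdf(NULL)  # script mode: draw without writing Rplots.pdf
hist(toy$rate_per_1000, main = "Rate per 1000 women", xlab = "Rate")
boxplot(rate_per_1000 ~ highly_restrictive, data = toy,
        xlab = "Highly restrictive (0 = no, 1 = yes)", ylab = "Rate")

# (c) Two continuous variables: a scatterplot.
plot(toy$dist_miles, toy$rate_per_1000,
     xlab = "Distance (miles)", ylab = "Rate")

# (d) One possible pairing of tests (an assumption, not the required choice):
# Welch two-sample t-test for the binary grouping, and a Pearson
# correlation test for the two continuous variables.
tt <- t.test(rate_per_1000 ~ highly_restrictive, data = toy)
ct <- cor.test(toy$dist_miles, toy$rate_per_1000)
c(t_test_p = tt$p.value, cor_test_p = ct$p.value)
```

Whether these particular tests suit the real data depends on the distributions seen in the plots, so the plots should be inspected before settling on a test.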

Question 2. Use statistical software to answer the questions below. Now that we have completed our exploratory data analysis (EDA), we are ready to fit models to the data. We will explore two models.
(a) Fit a simple linear regression model for the abortions_per_1000_women variable using highly_restrictive as the only predictor variable in the model.
(a) Write the regression model.
(b) Write the estimated regression line.
(c) Print the summary of the model and interpret the estimated coefficients.
(d) Apply one of the hypothesis testing approaches we learned in class. What can you conclude about the effect of highly restrictive states on abortion rates?

(b) Fit a simple linear regression model for the abortions_per_1000_women variable using dist_to_closest_facility_miles as the only predictor variable in the model.

(a) Write the regression model.
(b) Write the estimated regression line.
(c) Print the summary of the model and interpret the estimated coefficients.
(d) Apply one of the hypothesis testing approaches we learned in class. What can you conclude about the effect of distance to closest facility on abortion rates?
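
As a hedged illustration of the workflow in Question 2, the R sketch below fits a simple linear regression of the form Y = b0 + b1 * X + error to simulated data. The variable names (restrictive, rate) and the simulated values are assumptions made for this example and are unrelated to the course data. It shows where the estimated line, the coefficient table, and two common inference routes (the slope's p-value and its confidence interval) can be found in the fitted object.

```r
# Simulated data: a 0/1 predictor and a continuous response (made-up values).
set.seed(2511)
n <- 60
restrictive <- rep(c(0, 1), each = n / 2)
rate <- 12 - 3 * restrictive + rnorm(n, sd = 2)

# Fit the simple linear regression and print the usual summary table.
fit <- lm(rate ~ restrictive)
print(summary(fit))   # estimates, standard errors, t statistics, p-values

# Estimated regression line from the fitted coefficients.
b <- coef(fit)
cat(sprintf("Estimated line: rate-hat = %.2f + %.2f * restrictive\n",
            b[1], b[2]))

# With a 0/1 predictor, the intercept is the mean of group 0 and the
# slope is the difference in group means (group 1 minus group 0).
mean0 <- mean(rate[restrictive == 0])
mean1 <- mean(rate[restrictive == 1])

# Two equivalent ways to test H0: slope = 0 at the 5% level:
# the t-test p-value, or whether the 95% confidence interval excludes 0.
slope_p  <- summary(fit)$coefficients["restrictive", "Pr(>|t|)"]
slope_ci <- confint(fit, "restrictive", level = 0.95)
```

For a continuous predictor such as a distance, the same calls apply; the slope is then read as the expected change in the response per one-unit increase in the predictor.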
Question 3. Use statistical software to answer the questions below. Use the prostate.csv dataset to answer the following question.
(a) Some of the variables in the dataset were log transformed before the regression analysis was conducted. Compare the distributions of log(prostate specific antigen) and log(cancer volume) with their distributions before the transformation. Based on the histograms, do you think that the transformations were justified? Hint: Assume that the variables were transformed using the natural logarithm. Relevant R functions: hist(), exp().
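
The hint can be sketched as follows, using simulated values in place of the prostate data. The vectors log_x and raw_x are hypothetical: log_x plays the role of a variable stored on the natural-log scale, and exp() recovers the original scale so that the two histograms can be compared side by side.

```r
# Made-up values standing in for a variable stored on the natural-log scale.
set.seed(2511)
log_x <- rnorm(97, mean = 1.3, sd = 1.2)

# exp() undoes the natural logarithm and recovers the original scale.
raw_x <- exp(log_x)

if (!interactive()) pdf(NULL)  # script mode: draw without writing Rplots.pdf

# Draw the two histograms next to each other, then restore the layout.
old_par <- par(mfrow = c(1, 2))
hist(raw_x, main = "Original scale", xlab = "x")
hist(log_x, main = "Log scale", xlab = "log(x)")
par(old_par)
```

In this simulated case the original-scale histogram is right-skewed by construction; whether the same holds for the real variables is what the question asks you to judge.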
Question 4 (BONUS). Use statistical software to answer the questions below. Use the prostate.csv dataset to answer the following questions, continuing from Question 3.

(a) Calculate the correlation between all the variables. Hint: cor().

(b) List all the variables (excluding lpsa) in descending order (from large to small) according to their correlation (R²) with lpsa. The table should include two columns: the variable name and the correlation. Hint: You may order the variables manually or using R statements.

(c) Fit eight regression models. The response (dependent) variable in all the models should be lpsa. The first model should include only the variable with the highest correlation as predictor (based on the table that you created). The second model should include the highest and the second-highest variables as predictors. The third model should include all the variables of the second model plus the third-highest variable, and so on. The eighth and last model should include all the variables. Store each model object in an R variable. You do not need to show any outputs.

(d) Extract the residual standard error (RSE) and R² for each model and show these values in a table. The table should have three columns: the model number (1 to 8), RSE, and R². Create two plots. The x-axis should show the model number in both plots. The y-axis should show the RSE in the first plot and R² in the second.

(e) Show the summary of model 8. Should the gleason variable be kept in the model? Based on the table and plots, what happens to the RSE and R² when this variable is introduced?
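
One way the steps in parts (a) to (d) could be organized in R is sketched below on simulated data with three predictors instead of eight. Everything here is illustrative: the data frame toy, the columns x1 to x3 and y, and the table names are assumptions. The sketch also assumes that ranking by squared correlation is the intended reading of part (b), and it stores the models in a single list rather than in separate variables.

```r
# Simulated data: three predictors and a response (made-up values).
set.seed(2511)
n <- 97
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 0.8 * toy$x1 + 0.4 * toy$x2 + rnorm(n)

# (a) Correlation matrix of all variables.
cors <- cor(toy)

# (b) Rank the predictors by squared correlation with the response.
r_with_y <- cors["y", colnames(cors) != "y"]
ranked <- sort(r_with_y^2, decreasing = TRUE)
rank_table <- data.frame(variable  = names(ranked),
                         r_squared = unname(ranked))

# (c) Nested models: model k uses the k highest-ranked predictors.
models <- lapply(seq_along(ranked), function(k) {
  lm(reformulate(names(ranked)[1:k], response = "y"), data = toy)
})

# (d) Residual standard error and R-squared for each model.
fit_table <- data.frame(
  model = seq_along(models),
  RSE   = sapply(models, sigma),
  R2    = sapply(models, function(m) summary(m)$r.squared)
)
print(fit_table)

if (!interactive()) pdf(NULL)  # script mode: draw without writing Rplots.pdf
plot(fit_table$model, fit_table$RSE, type = "b",
     xlab = "Model number", ylab = "RSE")
plot(fit_table$model, fit_table$R2, type = "b",
     xlab = "Model number", ylab = "R-squared")
```

Because the models are nested, R² cannot decrease as predictors are added, while the RSE can move in either direction since it also accounts for the residual degrees of freedom.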
