STAT7038 REGRESSION MODELLING Assignment 2 for Semester 1, 2024

RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS

REGRESSION MODELLING

(STAT7038)

Assignment 2 for Semester 1, 2024

Due date: 3:00 pm on Friday, 17th May 2024, Canberra time

INSTRUCTIONS:

•  This assignment is worth 15% of your overall marks for this course.

• You must complete this assignment by yourself.  If you copy someone else’s work or allow your work to be copied, you will receive a mark of zero for the assignment and risk very severe academic consequences.

• Your report should be submitted to Turnitin on Wattle as a single pdf document (less than 25MB) including the following:

1.  The assignment cover sheet (available to download from Wattle).

2. Your assignment (no more than 10 pages).

3. An appendix including the R code you used. Failure to upload the R code will result in a penalty.

• Assignments should be typed. Your assignment may include some carefully edited R output (e.g. graphs, tables) showing the results of your data analysis and a discussion of these results, as well as some carefully selected code. Please  be selective about what you present and only include as many pages and as much R output as necessary to justify your solution.  Clearly label each part of your report with the part of the question that it refers to.

•  Unless otherwise advised, use a significance level of 5%. Round numeric answers to 4 decimal places (e.g., 0.0012).

•  Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly be deducted if the total report is of an unreasonable length, i.e. more than 10 pages including graphs and tables. You may include an appendix that is in addition to the above page limits; however the appendix will not be assessed. It will only be checked if there is some question about what you have actually done.

•  Name your report “Course code-Uid”, e.g., “STAT7038-u1234567”.

•  Try to submit your assignment at least 15 minutes before the deadline in case something unexpected happens, for instance an internet issue.

•  Late submissions will NOT be accepted.  Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but you must have the lecturer’s permission at least 24 hours before the deadline.

Question [100 Marks]

You decide to work as an academic staff member at a university.  Other than research ability, academic administrators pay attention to teaching quality in setting salaries. You would like to know how some ascriptive characteristics, such as beauty, affect students’ ratings of an instructor. You are given a dataset containing professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin.

The response variable is the teaching evaluation score (eval) and the predictors are a rating of the instructor’s physical appearance measured by a score (beauty), age (age), number of students who participated in the evaluation (student), number of students enrolled in the course (allstudents), whether the instructor is male or female (gender), whether the instructor is from a minority group (minority), whether the instructor is on tenure track (tenure), and whether the instructor is a native English speaker (native).

In this assignment, we would like to use some of these variables to try and build a multiple regression model with eval as the response variable. Use R to further analyse the “teach” data (available on Wattle) and answer the following questions:

(a)  [6 marks]  First identify which variables are numeric in this dataset and fit a multiple linear regression (MLR) model with eval as the response variable and all other numeric variables as predictors. Present the main residual plot of the residuals against the fitted values for this model.  Are there any obvious problems with the underlying assumptions?

(b)  [10 marks]  It is not very difficult to see that eval is always positive (ranges from 0 to 5), so it would be worth trying a transformation of this variable, such as the log transformation. Now fit an MLR model with ln(eval) as the response variable, still using all the other numeric variables (not log transformed) as explanatory variables. Again present the main residual plot of the residuals against the fitted values for this new model. Comment on this new residual plot. Then, test whether this model is significant.
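
For reference, the model and overall significance test described in part (b) can be written as below. This is a notational sketch only: the symbols p (number of predictors), n (number of observations), X_ik, SSR and SSE are assumptions introduced here and do not appear in the assignment.

```latex
\ln(\text{eval}_i) = \beta_0 + \sum_{k=1}^{p} \beta_k X_{ik} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)

H_0: \beta_1 = \dots = \beta_p = 0
\quad \text{vs} \quad
H_a: \text{at least one } \beta_k \neq 0

F^* = \frac{SSR / p}{SSE / (n - p - 1)} \sim F(p,\; n - p - 1) \ \text{under } H_0
```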

(c)  [12 marks]  What are the estimated coefficients of the MLR model in part (b) and the standard errors associated with these coefficients? Interpret the values of each of the estimated coefficients with regard to model specification. Construct 95% Bonferroni joint confidence intervals for all the slope parameters. Comment on the t-test results in the summary output.

(d)  [12 marks]  Produce both a scatterplot matrix and a correlation matrix for the predictors included in the model and comment on any important relationships between the variables. Do you see a problem with the MLR model in part (b)?  Conduct a diagnostic check quantitatively to determine the severity of this particular problem. What could be done to solve this problem?
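
As a reminder of the general form, Bonferroni joint intervals for g parameters at family level 1 − α can be sketched as follows. The symbols b_k, s{b_k}, g, n and p (number of predictors) are notation assumed here, and g = 4 is only an illustrative value. The last line shows the usual reading of a slope when the response is on the log scale.

```latex
b_k \pm t\!\left(1 - \frac{\alpha}{2g};\; n - p - 1\right) s\{b_k\},
\qquad k = 1, \dots, g

\text{e.g. } \alpha = 0.05,\ g = 4:
\quad 1 - \frac{0.05}{2 \times 4} = 0.99375

\frac{\widehat{\text{eval}}(X_k + 1)}{\widehat{\text{eval}}(X_k)} = e^{b_k}
\quad \text{(other predictors held fixed)}
```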
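
One common quantitative diagnostic for correlated predictors is the variance inflation factor (VIF), which equals the corresponding diagonal entry of the inverse of the predictor correlation matrix. The Python sketch below illustrates the calculation using only the standard library. The correlation matrix is hypothetical, not computed from the “teach” data, and the helper names `invert` and `vif_from_correlation` are made up for this illustration.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Choose the largest available pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right-hand half now holds the inverse.
    return [row[n:] for row in aug]


def vif_from_correlation(corr):
    """VIF_k is the k-th diagonal element of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[k][k] for k in range(len(corr))]


# Hypothetical correlation matrix for three predictors (illustration only):
# the first two are strongly correlated, the third is unrelated to both.
corr = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
vifs = vif_from_correlation(corr)
print([round(v, 4) for v in vifs])
```

In this made-up example the two correlated predictors share a VIF of 1 / (1 − 0.9²), while the unrelated predictor has a VIF of 1. A frequently quoted rule of thumb treats values above 10 as a sign of serious multicollinearity, although the cut-off is a convention rather than a formal test.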

(e)  [12 marks]  You have now discussed this problem with the administrators and they suggest including only age and beauty as potential predictors in the model.  However, you doubt the importance of the variable age. You are not sure what kind of marginal relationship exists between age and the response ln(eval), given that beauty is already included in the model. Generate an appropriate plot to visually check this relationship and comment on the plot.  Then conduct a partial F-test to determine whether age is a significant addition to a model that already includes beauty.
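
The general shape of a partial F-test for one added predictor is sketched below. The labels R (reduced model, beauty only) and F (full model, beauty and age), and the symbols SSE, df and n, are notation assumed here. The degrees of freedom shown assume no observations are dropped for missing values.

```latex
H_0: \beta_{\text{age}} = 0
\quad \text{vs} \quad
H_a: \beta_{\text{age}} \neq 0

F^* = \frac{\left[ SSE(R) - SSE(F) \right] / (df_R - df_F)}{SSE(F) / df_F},
\qquad df_R = n - 2,\quad df_F = n - 3

F^* \sim F(1,\; n - 3) \ \text{under } H_0,
\qquad F^* = \left( t^*_{\text{age}} \right)^2
```

The last identity holds because only one parameter is being tested, so the partial F-test agrees with the two-sided t-test for age in the full model.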

(f)  [8 marks]  The administrators remind you that a native English speaker and a non-native English speaker tend to have a different eval. Therefore, you want to know how the variable native affects the response ln(eval). Conduct a test of whether a native English speaker has a higher eval than a non-native English speaker by fitting a simple linear regression model. Then provide a 95% confidence interval on the slope coefficient and interpret this interval.

(g)  [6 marks]  Finally, given the above findings, you decide to fit an MLR model with ln(eval) as the response variable and with beauty and native as predictors. Conduct a t-test for beauty in this model.
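
A one-sided test on an indicator variable can be set out as below. This sketch assumes native is coded 1 for native English speakers and 0 otherwise; the coding in the “teach” data should be checked, since the direction of the alternative hypothesis depends on it. The symbols b_1, s{b_1} and n are notation assumed here.

```latex
\ln(\text{eval}_i) = \beta_0 + \beta_1\, \text{native}_i + \varepsilon_i

H_0: \beta_1 \le 0
\quad \text{vs} \quad
H_a: \beta_1 > 0

t^* = \frac{b_1}{s\{b_1\}} \sim t(n - 2) \ \text{when } \beta_1 = 0,
\qquad
\text{95\% CI: } b_1 \pm t(0.975;\; n - 2)\, s\{b_1\}
```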

(h)  [16 marks]  Using the model in part (g), produce a plot of externally studentized residuals against fitted values, a normal QQ plot, a leverage plot, a Cook’s distance plot and DFBETAS plots for all the slope coefficients in your model. Comment on the model assumptions and unusual points.
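
For reference, the standard definitions behind these diagnostic plots are collected below. The notation is assumed here rather than taken from the assignment: X is the design matrix, p is the number of predictors (so there are p + 1 regression parameters), e_i is the ordinary residual, and a subscript (i) means the quantity is computed with observation i deleted.

```latex
h_{ii} = \left[ X (X^\top X)^{-1} X^\top \right]_{ii},
\qquad e_i = Y_i - \hat{Y}_i

t_i = \frac{e_i}{\sqrt{MSE_{(i)} (1 - h_{ii})}}
    = e_i \sqrt{\frac{n - p - 2}{SSE (1 - h_{ii}) - e_i^2}}
    \sim t(n - p - 2)

D_i = \frac{e_i^2}{(p + 1)\, MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}

DFBETAS_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}\, c_{kk}}},
\qquad c_{kk} = \left[ (X^\top X)^{-1} \right]_{kk}
```

Commonly quoted guides flag leverage values above 2(p + 1)/n and, for large datasets, absolute DFBETAS above 2/√n. These are rules of thumb rather than formal tests.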

(i)  [8 marks]  Generate a scatter plot of eval (in its original scale) against beauty, using a different color for native and non-native speaking instructors. Use the model from part (g) to predict the expected eval for both native and non-native speaking instructors over the full range of possible beauty measurements and include these on your plot as two different curves (using different colors or line types).  Include appropriate titles, axis labels, a legend and a brief discussion of your plot.
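
Because the model in part (g) is fitted on the log scale, predictions need to be back-transformed with the exponential function before being drawn on a plot of eval. The Python sketch below shows the arithmetic only. The coefficients, the beauty range of −2 to 2, and the 0/1 coding of native are all hypothetical placeholders, not values from the “teach” data.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT estimates
# from the "teach" data.
b0, b_beauty, b_native = 1.30, 0.03, 0.06


def predict_eval(beauty, native):
    """Back-transform a log-scale fit: exp(b0 + b_beauty*beauty + b_native*native)."""
    return math.exp(b0 + b_beauty * beauty + b_native * native)


# Hypothetical grid of beauty scores from -2 to 2 in steps of 0.5.
beauty_grid = [-2.0 + 0.5 * i for i in range(9)]

# One curve per group, assuming native is coded 1 = native, 0 = non-native.
native_curve = [predict_eval(x, 1) for x in beauty_grid]
non_native_curve = [predict_eval(x, 0) for x in beauty_grid]

# Without an interaction term the two curves differ by a constant factor.
ratios = [a / b for a, b in zip(native_curve, non_native_curve)]
print(round(ratios[0], 4), round(math.exp(b_native), 4))
```

Two points follow from the arithmetic. The curves are parallel lines on the log scale but become exponential curves separated by the constant factor exp(b_native) on the original scale. Also, under normal errors the exponential of the fitted log-scale mean estimates the median of eval rather than its mean, which is worth keeping in mind when describing the curves.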

(j)  [10 marks]  With the model in part (g), we now consider adding the interaction term between beauty and native.  Before adding the interaction, generate a scatter plot of ln(eval) (in log scale) against beauty, using a different color for native and non-native speakers. Add fitted lines (using the model in part (g)) for native and non-native speakers in a different color (or a different line type). Comment on whether the plot shows a visible interaction.  Then add the interaction to the model in part (g) and test whether the interaction is significant.
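
The interaction model and the corresponding test can be written as below. This sketch assumes native is coded 1 for native speakers and 0 otherwise; the symbols b_3, s{b_3} and n are notation assumed here.

```latex
\ln(\text{eval}_i) = \beta_0 + \beta_1\, \text{beauty}_i + \beta_2\, \text{native}_i
  + \beta_3\, (\text{beauty}_i \times \text{native}_i) + \varepsilon_i

\text{slope in beauty} =
\begin{cases}
  \beta_1 & \text{native} = 0 \\
  \beta_1 + \beta_3 & \text{native} = 1
\end{cases}

H_0: \beta_3 = 0
\quad \text{vs} \quad
H_a: \beta_3 \neq 0,
\qquad
t^* = \frac{b_3}{s\{b_3\}} \sim t(n - 4) \ \text{under } H_0
```

An interaction shows up visually as non-parallel lines for the two groups on the log scale, which is why the model from part (g), having no interaction term, always produces parallel fitted lines.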
