STAT7038 REGRESSION MODELLING Assignment 2 for Semester 1, 2024

RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS

REGRESSION MODELLING

(STAT7038)

Assignment 2 for Semester 1, 2024

Due date: 3:00 pm on Friday, 17th May 2024, Canberra time

INSTRUCTIONS:

• This assignment is worth 15% of your overall marks for this course.

• You must complete this assignment by yourself. If you copy someone else’s work or allow your work to be copied, you will receive a mark of zero for the assignment and risk very severe academic consequences.

• Your report should be submitted to Turnitin on Wattle as a single pdf document (less than 25MB) including the following:

1. The assignment cover sheet (available to download from Wattle).

2. Your assignment (no more than 10 pages).

3. An appendix including the R code you used. Failure to upload the R code will result in a penalty.

• Assignments should be typed. Your assignment may include some carefully edited R output (e.g. graphs, tables) showing the results of your data analysis and a discussion of these results, as well as some carefully selected code. Please be selective about what you present and only include as many pages and as much R output as necessary to justify your solution. Clearly label each part of your report with the part of the question that it refers to.

• Unless otherwise advised, use a significance level of 5%. Round numeric answers to 4 decimal places (e.g., 0.0012).

• Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly be deducted if the total report is of an unreasonable length, i.e. more than 10 pages including graphs and tables. You may include an appendix that is in addition to the above page limits; however the appendix will not be assessed. It will only be checked if there is some question about what you have actually done.

• Name your report “Course code-Uid”, e.g., “STAT7038-u1234567”.

• Try to submit your assignment at least 15 minutes before the deadline in case something unexpected happens, for instance an internet issue.

• Late submissions will NOT be accepted. Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but require the lecturer’s permission at least 24 hours before the deadline.

Question 1 [100 Marks]

You decide to work as an academic staff member at a university. Beyond research ability, academic administrators pay attention to teaching quality when setting salaries. You would like to know how some ascriptive characteristics, such as beauty, affect students’ ratings of an instructor. You are given a dataset containing professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin.

The response variable is teaching evaluation scores (eval) and the predictors are ratings of the instructor’s physical appearance measured by a score (beauty), age (age), number of students that participated in the evaluation (student), number of students enrolled in the course (allstudents), whether the instructor is male or female (gender), whether the instructor is from a minority group (minority), whether the instructor is on tenure track (tenure), and whether the instructor is a native English speaker (native).

In this assignment, we would like to use some of these variables to try and build a multiple regression model with eval as the response variable. Use R to further analyse the “teach” data (available on Wattle) and answer the following questions:

(a) [6 marks] First identify which variables are numeric in this dataset and fit a multiple linear regression (MLR) model with eval as the response variable and all other numeric variables as predictors. Present the main residual plot of the residuals against the fitted values for this model. Are there any obvious problems with the underlying assumptions?

(b) [10 marks] It is not difficult to see that eval is always positive (it ranges from 0 to 5), so it is worth trying a transformation of the variable, such as the log transformation. Now fit an MLR model with ln(eval) as the response variable, still using all the other numeric variables (not log transformed) as explanatory variables. Again present the main residual plot of the residuals against the fitted values for this new model. Comment on this new residual plot. Then test whether this model is significant.
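As a rough illustration of the workflow in parts (a) and (b), the sketch below fits a log-response model, draws the residuals-against-fitted plot and extracts the overall F-test. The "teach" data are not reproduced here, so the data frame `sim` and all of its values are simulated assumptions, and only two numeric predictors are used.

```r
# Simulated stand-in for the "teach" data (hypothetical values, not the real dataset)
set.seed(1)
n <- 200
sim <- data.frame(beauty = rnorm(n), age = runif(n, 30, 70))
sim$eval <- exp(1.3 + 0.04 * sim$beauty - 0.001 * sim$age + rnorm(n, sd = 0.1))

# MLR with the log-transformed response
fit_log <- lm(log(eval) ~ beauty + age, data = sim)

# Residuals against fitted values
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted, ln(eval) model")
abline(h = 0, lty = 2)

# Overall F-test: H0 is that all slope coefficients are zero
fstat <- summary(fit_log)$fstatistic
p_overall <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                       lower.tail = FALSE))
p_overall
```

The same p-value is printed on the last line of `summary(fit_log)`; computing it from `fstatistic` simply makes it available to the required number of decimal places.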

(c) [12 marks] What are the estimated coefficients of the MLR model in part (b) and the standard errors associated with these coefficients? Interpret the value of each estimated coefficient with regard to the model specification. Construct 95% Bonferroni joint confidence intervals for all the slope parameters. Comment on the t-test results in the summary output.
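For g slope parameters, Bonferroni joint intervals at family level 1 − α use the t quantile at 1 − α/(2g) for each individual interval. A minimal sketch on simulated data follows; the predictors `x1`, `x2`, `x3` and the response `y` are hypothetical placeholders rather than variables from the "teach" data.

```r
# Simulated data with three hypothetical numeric predictors
set.seed(2)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.2 * sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = sim)

slopes <- names(coef(fit))[-1]  # every coefficient except the intercept
g <- length(slopes)             # number of intervals in the family
alpha <- 0.05

# Each interval is built at level 1 - alpha/g so the family level is at least 95%
bonf_ci <- confint(fit, parm = slopes, level = 1 - alpha / g)
round(bonf_ci, 4)

# The same intervals by hand: estimate +/- t(1 - alpha/(2g); n - p) * standard error
est <- coef(summary(fit))[slopes, "Estimate"]
se <- coef(summary(fit))[slopes, "Std. Error"]
tcrit <- qt(1 - alpha / (2 * g), df = df.residual(fit))
manual_ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
```

The Bonferroni intervals are wider than the individual 95% intervals returned by `confint(fit)`, which is the price of the joint coverage guarantee.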

(d) [12 marks] Produce both a scatterplot matrix and a correlation matrix for the predictors included in the model and comment on any important relationships between the variables. Do you see a problem with the MLR model from part (b)? Conduct a quantitative diagnostic check to determine the severity of this particular problem. What could be done to solve this problem?
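One common quantitative check for relationships among predictors is the variance inflation factor (VIF); whether that is the diagnostic intended here is an assumption. The sketch below computes VIFs in base R as 1 / (1 − R²) from regressing each predictor on the others. The predictor names follow the variable list above, but every value is simulated.

```r
# Simulated predictors (hypothetical values); student is built from allstudents
set.seed(3)
n <- 200
allstudents <- round(runif(n, 10, 200))
student <- round(allstudents * runif(n, 0.5, 0.9))
age <- runif(n, 30, 70)
X <- data.frame(student, allstudents, age)

pairs(X, main = "Scatterplot matrix of predictors")
round(cor(X), 4)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the rest
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 4)
```

The VIFs also equal the diagonal of the inverse correlation matrix, `diag(solve(cor(X)))`, which gives a quick cross-check.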

(e) [12 marks] You have now discussed this problem with the administrators and they suggest including only age and beauty as potential predictors in the model. However, you doubt the importance of the variable age. You are not sure what kind of marginal relationship exists between age and the response ln(eval), given that beauty is already included in the model. Generate an appropriate plot to visually check this relationship and comment on the plot. Then conduct a partial F-test to determine whether age is a significant addition to a model that already includes beauty.
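An added-variable plot is one way to view the relationship between a candidate predictor and the response after adjusting for a predictor already in the model, and `anova()` on nested fits gives the partial F-test. The sketch below assumes that is the plot intended and uses simulated data; the column `leval` is a hypothetical stand-in for ln(eval).

```r
# Simulated stand-in data (hypothetical values)
set.seed(4)
n <- 200
sim <- data.frame(beauty = rnorm(n), age = runif(n, 30, 70))
sim$leval <- 1.3 + 0.04 * sim$beauty + rnorm(n, sd = 0.1)

reduced <- lm(leval ~ beauty, data = sim)
full <- lm(leval ~ beauty + age, data = sim)

# Added-variable plot for age: both variables are adjusted for beauty first
e_y <- resid(reduced)
e_age <- resid(lm(age ~ beauty, data = sim))
av_fit <- lm(e_y ~ e_age)
plot(e_age, e_y,
     xlab = "age | beauty", ylab = "ln(eval) | beauty",
     main = "Added-variable plot for age")
abline(av_fit)

# Partial F-test: H0 is that the age coefficient is zero given beauty
partial_f <- anova(reduced, full)
partial_f
```

The slope of the line in the added-variable plot equals the age coefficient in the full model, and with a single added term the partial F statistic is the square of that coefficient's t statistic.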

(f) [8 marks] The administrators remind you that a native English speaker and a non-native English speaker tend to have a different eval. Therefore, you want to know how the variable native affects the response ln(eval). Conduct a test of whether a native English speaker has a higher eval than a non-native English speaker by fitting a simple linear regression model. Then provide a 95% confidence interval for the slope coefficient and interpret this interval.
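With a two-level predictor, the slope in a simple linear regression is the difference in group means, and a one-sided p-value can be read off the t statistic. The sketch below uses simulated data; the level labels "no" and "yes" and the column `leval` are assumptions, since the coding of native in the "teach" data is not shown here.

```r
# Simulated stand-in data (hypothetical values and level labels)
set.seed(5)
sim <- data.frame(native = factor(rep(c("no", "yes"), times = c(30, 170))))
sim$leval <- 1.3 + 0.08 * (sim$native == "yes") + rnorm(nrow(sim), sd = 0.15)

# "no" is the reference level, so the slope is mean(yes) - mean(no) on the log scale
fit <- lm(leval ~ native, data = sim)

# One-sided test, H0: slope <= 0 against H1: slope > 0
tval <- coef(summary(fit))["nativeyes", "t value"]
p_one_sided <- pt(tval, df = df.residual(fit), lower.tail = FALSE)
p_one_sided

# 95% confidence interval for the slope
ci <- confint(fit, "nativeyes", level = 0.95)
round(ci, 4)

# Exponentiating gives a multiplicative effect on the original eval scale
round(exp(ci), 4)
```

Because the response is on the log scale, the exponentiated interval describes a ratio (for example of median eval between the two groups) rather than a difference.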

(g) [6 marks] Finally, given the above findings, you decide to fit an MLR model with ln(eval) as the response variable and with beauty and native as predictors. Conduct a t-test for beauty in this model.

(h) [16 marks] Using the model in part (g), produce a plot of externally studentized residuals against fitted values, a normal QQ plot, a leverage plot, a Cook’s distance plot and a number of DFBETAs plots for all the slope coefficients in your model. Comment on the model assumptions and unusual points.
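Base R provides each of these diagnostics directly: `rstudent()`, `hatvalues()`, `cooks.distance()` and `dfbetas()`. The sketch below runs them on simulated data; the data frame, the level labels and the reference lines (2p/n for leverage, 2/sqrt(n) for standardized DFBETAS) are assumptions, the latter being common rules of thumb rather than cut-offs stated in this assignment.

```r
# Simulated stand-in data (hypothetical values)
set.seed(6)
n <- 100
sim <- data.frame(beauty = rnorm(n),
                  native = factor(rep(c("no", "yes"), times = c(15, 85))))
sim$leval <- 1.3 + 0.04 * sim$beauty + 0.08 * (sim$native == "yes") +
  rnorm(n, sd = 0.1)

fit <- lm(leval ~ beauty + native, data = sim)
p <- length(coef(fit))

r_ext <- rstudent(fit)      # externally studentized residuals
h <- hatvalues(fit)         # leverages
cd <- cooks.distance(fit)   # Cook's distances
dfb <- dfbetas(fit)         # standardized DFBETAS, one column per coefficient

op <- par(mfrow = c(2, 3))
plot(fitted(fit), r_ext, xlab = "Fitted values",
     ylab = "Externally studentized residuals")
abline(h = c(-2, 0, 2), lty = 2)
qqnorm(r_ext)
qqline(r_ext)
plot(h, type = "h", ylab = "Leverage")
abline(h = 2 * p / n, lty = 2)
plot(cd, type = "h", ylab = "Cook's distance")
for (j in colnames(dfb)[-1]) {  # slopes only, skipping the intercept
  plot(dfb[, j], type = "h", ylab = "DFBETAS", main = j)
  abline(h = c(-1, 1) * 2 / sqrt(n), lty = 2)
}
par(op)
```

Note that `dfbetas()` returns the standardized version, while `dfbeta()` returns the raw change in each coefficient; which one to report depends on the convention used in the course.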

(i) [8 marks] Generate a scatter plot of eval (in its original scale) against beauty, using different colors for native and non-native speaking instructors. Use the model from part (g) to predict the expected eval for both native and non-native speaking instructors over the full range of possible beauty measurements and include these on your plot as two different curves (using different colors or line types). Include appropriate titles, axis labels, a legend and a brief discussion of your plot.
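Predictions from a log-response model can be brought back to the original scale by exponentiating the fitted values over a grid. The sketch below does this on simulated data; the colors, level labels and grid size are arbitrary assumptions. Simple exponentiation of the predicted mean of ln(eval) estimates the median of eval rather than its arithmetic mean, so whether a further correction is expected is left open.

```r
# Simulated stand-in data (hypothetical values)
set.seed(7)
n <- 200
sim <- data.frame(beauty = rnorm(n),
                  native = factor(rep(c("no", "yes"), times = c(30, 170))))
sim$eval <- exp(1.3 + 0.04 * sim$beauty + 0.08 * (sim$native == "yes") +
                  rnorm(n, sd = 0.1))

fit <- lm(log(eval) ~ beauty + native, data = sim)

# Prediction grid over the observed beauty range for both groups
grid <- expand.grid(
  beauty = seq(min(sim$beauty), max(sim$beauty), length.out = 100),
  native = levels(sim$native)
)
grid$eval_hat <- exp(predict(fit, newdata = grid))  # back-transform to eval scale

cols <- c(no = "red", yes = "blue")
plot(sim$beauty, sim$eval, col = cols[as.character(sim$native)],
     xlab = "beauty", ylab = "eval",
     main = "eval against beauty with back-transformed fits")
for (lv in levels(sim$native)) {
  sub <- grid[grid$native == lv, ]
  lines(sub$beauty, sub$eval_hat, col = cols[lv], lwd = 2)
}
legend("topleft", legend = paste("native =", names(cols)),
       col = cols, pch = 1, lty = 1)
```

On the original scale the two curves are not parallel: they differ by the constant factor `exp(coef(fit)["nativeyes"])`.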

(j) [10 marks] With the model in part (g), we now consider adding the interaction term between beauty and native. Before adding the interaction, generate a scatter plot of ln(eval) (in log scale) against beauty, using different colors for native and non-native speakers. Add fitted lines (using the model in part (g)) for native and non-native speakers in a different color (or a different line type). Comment on whether the plot shows a visible interaction. Then add the interaction to the model in part (g) and test whether the interaction is significant.
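A sketch of the interaction check on simulated data follows: the additive model gives parallel lines on the log scale, and the interaction term is then tested with a nested-model comparison. All values, level labels and colors are assumptions, and `leval` is a hypothetical stand-in for ln(eval).

```r
# Simulated stand-in data (hypothetical values)
set.seed(8)
n <- 200
sim <- data.frame(beauty = rnorm(n),
                  native = factor(rep(c("no", "yes"), times = c(30, 170))))
sim$leval <- 1.3 + 0.04 * sim$beauty + 0.08 * (sim$native == "yes") +
  rnorm(n, sd = 0.1)

fit_add <- lm(leval ~ beauty + native, data = sim)

# Scatter plot on the log scale with the additive (parallel) fitted lines
cols <- c(no = "red", yes = "blue")
plot(sim$beauty, sim$leval, col = cols[as.character(sim$native)],
     xlab = "beauty", ylab = "ln(eval)",
     main = "ln(eval) against beauty, additive fit")
b <- coef(fit_add)
abline(a = b[["(Intercept)"]], b = b[["beauty"]], col = cols["no"], lty = 2)
abline(a = b[["(Intercept)"]] + b[["nativeyes"]], b = b[["beauty"]],
       col = cols["yes"], lty = 2)
legend("topleft", legend = paste("native =", names(cols)),
       col = cols, pch = 1, lty = 2)

# Add the interaction and test it
fit_int <- update(fit_add, . ~ . + beauty:native)
int_test <- anova(fit_add, fit_int)
int_test
coef(summary(fit_int))["beauty:nativeyes", ]
```

With a single interaction term, the F-test from `anova()` and the t-test in the summary table give the same p-value, so either can be reported.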
