Details of the Dataset
Ames House Price Dataset: These data are taken from the dataset available on the OpenIntro web site. The data were compiled by De Cock (2011) and the full dataset can be downloaded from the Ames Residential Home Sales page. The original dataset contained 82 variables, but you are required to analyse only a selection of them, some of which are derived from the original variables. The data concern residential home sales in Ames, Iowa between 2006 and 2010. The task is to build a suitable regression model to predict the Sale Price of the houses using the remaining variables. In addition, you should provide some interpretation of the model parameters.
Table 1: Description of the Data
Availability of the data
The data are available on Moodle as an R data frame, in the file AmesHP.Rdata.
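A minimal sketch of reading the file into R, assuming it has been downloaded into the working directory. The name of the data frame stored inside the file is not stated here, so the helper below (`load_first_object`, a hypothetical name) looks the object up rather than guessing what it is called.

```r
# Hypothetical helper: load an .Rdata file and return the first object in it.
# The name of the object stored inside AmesHP.Rdata is not assumed.
load_first_object <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the names it restored
  get(obj_names[1], envir = env)
}

# Usage, assuming the file sits in the working directory
if (file.exists("AmesHP.Rdata")) {
  ames <- load_first_object("AmesHP.Rdata")
  str(ames)  # variable types and the first few values of each
  dim(ames)  # number of rows and columns
}
```

Loading into a separate environment avoids overwriting any object of the same name that is already in the workspace.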
Analysis Required
Main question: how is the Sale Price of a house related to its other characteristics, and is it possible to use a regression model to predict the Sale Price reliably?
You are to conduct an analysis of this dataset in R, in the groups to which you have been assigned. These can be found on the Moodle area for this assignment. An outline of the steps you should take in your analysis is given below.
1) Begin with an exploratory analysis of the data. Using appropriate numerical, tabular or graphical summaries, describe the distribution of the variables and investigate potential relationships.
2) Use regression models to investigate the relationship between the explanatory variables and the dependent variable SalePrice. Consider whether the response or explanatory variables should be transformed, and pay attention to the possible existence of outliers. Select a clear final model and give your rationale for doing so.
3) Fully investigate the validity of the model and any other potential issues there may be with the data or the model. Clearly comment on your conclusions and the success of your model.
4) Illustrate the usefulness of the model by giving an interpretation of the parameters and by demonstrating how the Sale Price of a house may be predicted from the remaining variables. This explanation should be understandable to a non-specialist.
5) Write a joint report on your findings, describing the data, your analyses and your conclusions.
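Steps 1, 2 and 4 above could be sketched in R along the following lines. This is an illustration on simulated data, not a recommended final model: only SalePrice is a variable name taken from this brief, while LivArea and Quality are hypothetical stand-ins, and the log transformation is just one candidate to consider.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the real data (LivArea and Quality are hypothetical)
toy <- data.frame(
  LivArea = runif(n, 500, 3000),
  Quality = factor(sample(c("Low", "Mid", "High"), n, replace = TRUE),
                   levels = c("Low", "Mid", "High"))
)
toy$SalePrice <- exp(10.5 + 0.0004 * toy$LivArea +
                     0.2 * (as.integer(toy$Quality) - 1) +
                     rnorm(n, sd = 0.15))

# Step 1: numerical and graphical summaries
summary(toy)
hist(toy$SalePrice, main = "Sale price", xlab = "SalePrice")
boxplot(SalePrice ~ Quality, data = toy)

# Step 2: candidate models with and without a log-transformed response
fit_raw <- lm(SalePrice ~ LivArea + Quality, data = toy)
fit_log <- lm(log(SalePrice) ~ LivArea + Quality, data = toy)
summary(fit_log)

# Step 4: with a log response, 100 * (exp(b) - 1) is the percentage change
# in price per unit increase in the variable, holding the others fixed
pct_effect <- 100 * (exp(coef(fit_log)[-1]) - 1)
round(pct_effect, 2)

# Prediction for one hypothetical house, back-transformed to the price scale
new_house <- data.frame(LivArea = 1500,
                        Quality = factor("Mid", levels = levels(toy$Quality)))
pred <- exp(predict(fit_log, newdata = new_house, interval = "prediction"))
pred
```

When the response is modelled on the log scale, the back-transformed fitted value estimates the median rather than the mean Sale Price, which is worth stating when explaining predictions to a non-specialist.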
Further Points
Techniques and Approach
The aim of this assignment is to demonstrate what you have learnt in fitting a multiple regression model using a statistical approach. This assignment is a statistical assignment and NOT one on machine learning. You should use techniques that we have covered in the module. You are welcome to investigate further techniques that relate to regression analyses and if appropriate use those, but this is not a requirement. There is no need, for example, to split the data into training and test sets as you should not be using methods that require this (e.g. a neural network). Your assessment of the model should be based on the regression model output and the diagnostics you carry out in question 3.
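As an illustration of assessing a model from its regression output and diagnostics alone, the sketch below uses base R only. The built-in mtcars data are used purely as a stand-in for the Ames data, and the cut-offs for flagging observations are common rules of thumb, not requirements of this assignment.

```r
# Built-in mtcars data used purely as a stand-in for the Ames data
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Regression output: coefficients, standard errors, R-squared, overall F-test
summary(fit)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Numerical checks for unusual observations
std_res <- rstandard(fit)       # standardised residuals
lev     <- hatvalues(fit)       # leverage
cooks   <- cooks.distance(fit)  # influence

p <- length(coef(fit))
n <- nrow(mtcars)
# Rule-of-thumb thresholds (assumptions, to be adapted to the data)
flagged <- which(abs(std_res) > 2 | lev > 2 * p / n | cooks > 4 / n)
mtcars[flagged, c("mpg", "wt", "hp")]

# Comparing nested candidate models without a separate test set
fit_small <- lm(log(mpg) ~ wt, data = mtcars)
anova(fit_small, fit)  # F-test for the additional term
AIC(fit_small, fit)
```

A flagged observation is a prompt for investigation, not an automatic reason to delete it.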
Specific Issues for these Data
Think carefully about variables that are ordinal and whether they should be treated as categorical or interval. You may need to try different approaches here.
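One way to compare the two treatments of an ordinal variable is sketched below on simulated data. Cond is a hypothetical ordinal variable coded 1 to 5; it is not claimed to be a variable in the dataset.

```r
set.seed(2)
n <- 150

# Hypothetical ordinal variable coded 1 (worst) to 5 (best),
# with deliberately uneven steps between levels
toy <- data.frame(Cond = sample(1:5, n, replace = TRUE))
level_effect <- c(0, 5000, 25000, 30000, 60000)
toy$SalePrice <- 100000 + level_effect[toy$Cond] + rnorm(n, sd = 8000)

# Interval treatment: a single slope, equal step between adjacent levels
fit_interval <- lm(SalePrice ~ Cond, data = toy)

# Categorical treatment: a separate parameter for each level beyond the baseline
fit_categorical <- lm(SalePrice ~ factor(Cond), data = toy)

# The interval model is nested within the categorical one, so an F-test
# indicates whether the equal-step assumption loses much fit
anova(fit_interval, fit_categorical)
```

The interval treatment uses fewer parameters and is easier to explain, while the categorical treatment makes no assumption about the spacing of the levels.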
Think about how to treat the missing values. This is especially so for categorical variables; in some cases there is an obvious interpretation if the value is missing and merely deleting that observation may not be the best approach. For example, if GarageQu is “NA” it may be that there is no garage – deleting all NA’s will mean your data no longer contains any properties without a garage.
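A minimal sketch of recoding missing values as an explicit category, assuming that NA in GarageQu does mean "no garage". The quality codes and the label "None" used here are hypothetical, and the assumption should be checked against the data description before it is applied.

```r
# Toy data: GarageQu is named in the text, the quality codes are hypothetical
toy <- data.frame(
  SalePrice = c(120000, 150000, 90000, 200000, 110000),
  GarageQu  = factor(c("TA", NA, "Gd", "TA", NA))
)

# Count the missing values in each variable before deciding what to do
colSums(is.na(toy))

# Recode NA as its own level, on the assumption that it means "no garage"
toy$GarageQu <- addNA(toy$GarageQu)
levels(toy$GarageQu)[is.na(levels(toy$GarageQu))] <- "None"
table(toy$GarageQu)
```

Recoding in this way keeps the properties without a garage in the analysis, whereas `na.omit()` would drop them.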
Report Requirements
1) The report should be submitted as a PDF file and should consist of two parts:
a) The main part for the report itself (there is a maximum of 10 typeset A4 sides including graphs, tables and references; minimum body-text font size 11pt, and minimum 2cm margins all round). This document should be written for intelligent readers who do not necessarily have advanced statistical training. It should be neat and professional. Figures should be clearly labelled and referenced. There should be suitably numbered headings and sub-headings.
b) A technical appendix at the end of the report giving the commented R code that was used, so that the analysis can be reproduced.
2) The report should:
a) give only your student numbers, not names.
b) be submitted to the group assignment area on Moodle.
3) Finally, each group member should submit a completed peer review form to the “Contribution” section for this assignment on Moodle. Failure to do so may result in you losing credit, even if your fellow group members give you full credit for your contribution. You must not discuss what you have filled in on this form with the other members of your group. The peer review form can be downloaded from Moodle. It contains further details of what you need to fill in. Your individual mark will be the group mark for this assignment, potentially adjusted slightly up or down on the basis of the peer review forms from your other group members. You should aim to be honest and realistic in filling in this form.
Mark Scheme:
This assessment is worth 25% of your final mark on ST952. The assessment will be based on your understanding of the problem, the competence of your analysis and the presentation of your report. The report will be marked out of 100 and then weighted with your other marks from the second assignment and the exam. Your mark for both assignments averaged together must be above 50% to pass the coursework component of the module. You must pass both the coursework component and the exam to pass the module.
Marks for the actual analysis will be a maximum of 85, but different aspects are linked (e.g. to find an appropriate model it may be necessary to redo some types of initial plots, or to revisit the model as a result of diagnostics), so marks may be slightly higher or lower in some categories as appropriate to your approach. The marks below are therefore meant as a rough guide:
(Question 1) Initial investigation of data: 20-25 marks
(Question 2) Appropriate and well-explained statistical analysis and investigation to find the final model, including use of transformations and interpretation of tabular output: 25-30 marks
(Question 3) Residual, influence and any other diagnostics: 20-25 marks
(Question 4) Interpretation of final model: 10-15 marks
(Question 5) Report structure and presentation (including quality of tables and figures, professionalism, use of numbered headings, page numbers, contents page, figure labels etc.): 7 marks
Appropriate use of English language (including spelling and grammar, clarity, and avoidance of statistical terms in their colloquial sense, e.g. “significant”): 8 marks