ISE 364
Homework 6
Fall 2023
Due: Sunday, December 3, 2023, 11:59 pm.
This is the final homework assignment for this class. You are required to apply the machine learning methods we have studied throughout this course to predict a target variable. As part of this assignment, you will be tasked with analyzing and cleaning a dataset and determining which model you believe is the most effective. Please note that your grade may be influenced by the results of other students.
For this assignment, you will be working with the House Price Data Set. Your task is to predict the sale price of a house based on its features. The training and test datasets have been provided on CourseSite. The training set comprises 1,385 rows and 71 columns, while the test set consists of 1,459 rows and 70 columns. In the training set, the target variable you are aiming to predict is “SalePrice”. The features encompass various house-related information, descriptions, residential details, and miscellaneous data (see Table 1 below).
You will need to submit three files via CourseSite: a PDF report, a Python notebook, and a submission.csv file. The PDF report should address the questions below and should be limited to no more than 3 pages. Within the constraints of these 3 pages, your report should comprehensively explain your methodology and the rationale behind your responses. In the Python notebook, you should include the code you used to answer the questions below. For the submission.csv file, please see Question 4 below.
Questions:
1. (20 points) Remove rows with missing values in the SalePrice column from the training dataset. Set up the dataset by stacking the training and test datasets. For Questions 1 and 2, use this stacked dataset. For Questions 3 and 4, separate the training dataset from the test dataset again. Use the training dataset for Question 3 and the test dataset for Question 4. In your report, list the numerical features and the categorical features. Clean the dataset and describe how you did so, explaining the reasoning behind each choice. Here is a roadmap that you may want to consider:
(a) (Handling Missing Values) Find the features with missing values and fill those values in. Two possibilities are to fill in the missing values with the mean/median/mode of the remaining values for that feature, or to impute them from another feature (if a feature with missing values is correlated with another feature, you can use this correlation to approximate the missing values).
(b) (Numerical Features) Decide whether to bin some numerical features to create categorical features. For example, if a numerical feature has outliers or if your model is overfitting, binning might be helpful.
(c) (Categorical Features) Decide whether to bin some categorical features. Convert categorical features into numerical features, using label encoding for ordered categories (e.g., good, very good, excellent) to preserve the ordinal relationship, or one-hot encoding for unordered categories.
(d) (Correlated Features) Decide whether to combine two heavily correlated features or drop one of them if you cannot combine them.
(e) (Feature Selection) Evaluate the correlation of features with the target variable SalePrice and decide whether to drop features that show low correlation.
(f) Standardize or normalize your dataset by scaling numerical features to ensure that all features contribute equally to the model.
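As a rough illustration of the stacking step and roadmap items (a), (c), and (f), here is a minimal sketch on a tiny made-up dataset. The toy values, the median/mode imputation, and the quality ordering in `quality_order` are all assumptions chosen for illustration; they are not prescribed by the assignment, and your own choices may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real files (assumption: the real data is loaded elsewhere).
train = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 10000.0, 11250.0],
    "KitchenQual": ["Gd", "TA", "Ex", None],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500.0, 181500.0, np.nan, 223500.0],
})
test = pd.DataFrame({
    "LotArea": [9550.0, 14260.0],
    "KitchenQual": ["Gd", "TA"],
    "Street": ["Grvl", None],
})

# Drop training rows without a target, remember the split point, then stack.
train = train.dropna(subset=["SalePrice"])
n_train = len(train)
y_train = train["SalePrice"].to_numpy()
stacked = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)

# List numerical and categorical features by dtype.
num_cols = stacked.select_dtypes(include="number").columns.tolist()
cat_cols = stacked.select_dtypes(exclude="number").columns.tolist()

# (a) Impute: median for numerical features, mode for categorical ones.
for col in num_cols:
    stacked[col] = stacked[col].fillna(stacked[col].median())
for col in cat_cols:
    stacked[col] = stacked[col].fillna(stacked[col].mode()[0])

# (c) Label-encode an ordered category (this ordering is an assumption) ...
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
stacked["KitchenQual"] = stacked["KitchenQual"].map(quality_order).astype(float)
# ... and one-hot encode an unordered one.
stacked = pd.get_dummies(stacked, columns=["Street"], dtype=float)

# (f) Standardize the originally numerical columns (fit on the stacked data).
stacked[num_cols] = StandardScaler().fit_transform(stacked[num_cols])

# Separate the training rows from the test rows again for Questions 3 and 4.
X_train = stacked.iloc[:n_train].copy()
X_test = stacked.iloc[n_train:].copy()
print(X_train.shape, X_test.shape)
```

Recording `n_train` before stacking is what makes it possible to split the rows back apart without disturbing the order of the test rows.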
Please keep in mind that all the decisions you make here will affect the performance of the models in Question 3. You may want to experiment to see which choices lead to the best results before writing your report. You can also make different choices for different models.
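For roadmap items (d) and (e), one possible starting point is to inspect the correlation matrix. The sketch below uses synthetic data that only borrows three column names from Table 1; the thresholds 0.9 and 0.1 are arbitrary assumptions, not recommended values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
gr_liv = rng.normal(1500, 400, n)

# Synthetic data (assumption): only the column names come from Table 1.
df = pd.DataFrame({
    "GrLivArea": gr_liv,
    # Built to be nearly collinear with GrLivArea.
    "TotRmsAbvGrd": gr_liv / 250 + rng.normal(0, 0.3, n),
    # Generated independently of the price in this toy example.
    "MoSold": rng.integers(1, 13, n).astype(float),
})
df["SalePrice"] = 100 * df["GrLivArea"] + rng.normal(0, 20_000, n)

features = df.drop(columns="SalePrice")

# (d) Feature pairs whose absolute correlation exceeds a threshold.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9  # NaN entries compare as False
]

# (e) Absolute correlation of each feature with the target, strongest first.
target_corr = features.corrwith(df["SalePrice"]).abs().sort_values(ascending=False)
weak = target_corr[target_corr < 0.1].index.tolist()

print("Highly correlated pairs:", high_pairs)
print(target_corr)
print("Candidates to drop:", weak)
```

Pearson correlation only captures linear relationships, so a low value here is a hint to look more closely rather than proof that a feature is useless.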
2. (20 points) Use PCA for dimensionality reduction by transforming features into uncorrelated principal components and selecting a subset of them based on explained variance. For Question 3 below, you have the option to use the principal components as new features or retain the original features from the dataset.
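A minimal sketch of selecting principal components by explained variance with scikit-learn is shown below. The data is synthetic, and the 95% cutoff is an assumption chosen only for illustration; you should justify your own cutoff in the report.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", np.round(cumulative, 3))
```

Plotting `cumulative` against the component index gives the usual explained-variance curve that can support the choice of cutoff.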
3. (50 points) Compare all relevant models for regression tasks covered during the semester: Linear Regression, KNN, Decision Trees, Random Forests, Neural Networks. Since this is a regression task, assess the models using the Mean Squared Error (MSE) score. Choose the best model and describe why you think it is the best.
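One way to put the five model families on an equal footing is k-fold cross-validated MSE. The sketch below assumes scikit-learn and uses synthetic regression data; the hyperparameters and the choice of 5 folds are placeholders, not tuned or recommended values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned training data (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Scale-sensitive models are wrapped in a pipeline so that scaling is
# re-fit inside each training fold.
models = {
    "Linear Regression": LinearRegression(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that larger is better; flip the sign.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    mse[name] = -scores.mean()

for name, value in sorted(mse.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} MSE = {value:,.1f}")

best = min(mse, key=mse.get)
print("Lowest cross-validated MSE:", best)
```

Using the same `cv` object for every model means each one is scored on identical folds, which keeps the comparison fair.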
4. (10 points) The graders will evaluate your model’s prediction performance using the submission.csv file. Your model’s ranking relative to other students’ models will play a key role in determining your grade for this question. The submission.csv file should consist of a single column of SalePrice predictions for the test dataset (please do not change the order of the rows in the original test dataset). These predictions will be used to assess the accuracy of your model and establish your rank in comparison to other students’ model scores.
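A minimal sketch of writing a single-column prediction file that preserves the test row order is shown below. The data and the linear model are placeholders, and the output file name is an assumption taken from the submission instructions; adjust it if the required name differs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Placeholder data (assumption): substitute your cleaned train/test features.
X_train = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_train = 200_000 + 30_000 * X_train["f1"] + rng.normal(scale=5_000, size=50)
X_test = pd.DataFrame(rng.normal(size=(10, 3)), columns=X_train.columns)

# Placeholder model: use whichever model you selected in Question 3.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# One SalePrice column, rows in the same order as the test set, no index column.
submission = pd.DataFrame({"SalePrice": predictions})
submission.to_csv("submission.csv", index=False)
```

Avoid sorting or shuffling the test rows at any point between loading and predicting, since the file carries no identifier column to restore the order.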
Please incorporate relevant plots to illustrate your points. Keep in mind that visuals should aid comprehension, not overwhelm the reader. Be sure to describe the information conveyed by each plot. We will review your code to verify the reproducibility of the results presented in your report. Your grade for Questions 1–3 above will be determined by the breadth of your knowledge and understanding of machine learning processes along with best practices.
You are expected to work on this assignment individually. You may consult other students for clarification or discussion, but please ensure that your code is your own.
Table 1: Feature Descriptions
No. Feature Description
1 Heating Type of heating
2 HeatingQC Heating quality and condition
3 CentralAir Central air conditioning
4 Electrical Electrical system
5 1stFlrSF First Floor square feet
6 2ndFlrSF Second floor square feet
7 GrLivArea Above grade (ground) living area square feet
8 BsmtFullBath Basement full bathrooms
9 BsmtHalfBath Basement half bathrooms
10 FullBath Full bathrooms above grade
11 HalfBath Half baths above grade
12 Bedroom Number of bedrooms above basement level
13 Kitchen Number of kitchens
14 KitchenQual Kitchen quality
15 TotRmsAbvGrd Total rooms above grade (does not include bathrooms)
16 Functional Home functionality rating
17 Fireplaces Number of fireplaces
18 FireplaceQu Fireplace quality
19 GarageType Garage location
20 GarageYrBlt Year garage was built
21 GarageFinish Interior finish of the garage
22 GarageCars Size of garage in car capacity
23 GarageQual Garage quality
24 GarageCond Garage condition
25 PavedDrive Paved driveway
26 WoodDeckSF Wood deck area in square feet
27 OpenPorchSF Open porch area in square feet
28 3SsnPorch Three season porch area in square feet
29 ScreenPorch Screen porch area in square feet
30 PoolArea Pool area in square feet
31 PoolQC Pool quality
32 Fence Fence quality
33 MSZoning The general zoning classification
34 LotArea Lot size in square feet
35 Street Type of road access
36 Alley Type of alley access
37 LotShape General shape of property
38 LandContour Flatness of the property
39 Utilities Type of utilities available
40 LotConfig Lot configuration
41 Neighborhood Physical locations within Ames city limits
42 Condition1 Proximity to main road or railroad
43 Condition2 Proximity to main road or railroad (if a second is present)
44 BldgType Type of dwelling
45 HouseStyle Style of dwelling
46 OverallQual Overall material and finish quality
47 OverallCond Overall condition rating
48 YearBuilt Original construction date
49 YearRemodAdd Remodel date
50 RoofStyle Type of roof
51 Exterior1st Exterior covering on house
52 Exterior2nd Exterior covering on house (if more than one material)
53 MasVnrType Masonry veneer type
54 MasVnrArea Masonry veneer area in square feet
55 ExterQual Exterior material quality
56 ExterCond Present condition of the material on the exterior
57 Foundation Type of foundation
58 BsmtQual Height of the basement
59 BsmtCond General condition of the basement
60 BsmtExposure Walkout or garden-level basement walls
61 BsmtFinType1 Quality of basement finished area
62 BsmtFinSF1 Type 1 finished square feet
63 BsmtUnfSF Unfinished square feet of basement area
64 TotalBsmtSF Total square feet of basement area
65 MiscFeature Miscellaneous feature not covered in other categories
66 MiscVal Dollar value of miscellaneous feature
67 MoSold Month Sold
68 YrSold Year Sold
69 SaleType Type of sale
70 SaleCondition Condition of sale
71 SalePrice The property’s sale price in dollars. This is the target variable.