ISE 364 Homework 6 Fall 2023

Due: Sunday, December 3, 2023, 11:59 pm.

This is the final homework assignment for this class. You are required to apply the machine learning methods we have studied throughout this course to predict a target variable. As part of this assignment, you will be tasked with analyzing and cleaning a dataset and determining which model you believe is the most effective. Please note that your grade may be influenced by the results of other students.

For this assignment, you will be working with the House Price Data Set. Your task is to predict the sale price of a house based on its features. The training and test datasets have been provided on CourseSite. The training set comprises 1,385 rows and 71 columns, while the test set consists of 1,459 rows and 70 columns. In the training set, the target variable you are aiming to predict is “SalePrice”. The features encompass various house-related information, descriptions, residential details, and miscellaneous data (see Table 1 below).

You will need to submit three files via CourseSite: a PDF report, a Python notebook, and a submission.csv file. The PDF report should address the questions below and should be limited to no more than 3 pages. Within the constraints of these 3 pages, your report should comprehensively explain your methodology and the rationale behind your responses. In the Python notebook, you should include the code you used to answer the questions below. For the submission.csv file, please see Question 4 below.

Questions:

1. (20 points) Remove rows with missing values in the SalePrice column from the training dataset. Set up the dataset by stacking the training and test datasets. For Questions 1 and 2, use this stacked dataset. For Questions 3 and 4, separate the training dataset from the test dataset. Use the training dataset for Question 3 and the test dataset for Question 4. In your report, list the numerical features and the categorical features. Clean the dataset and describe how you did so, explaining the reasoning behind each choice. Here is a roadmap that you may want to consider:

(a) (Handling Missing Values) Find features with missing values and fill in any missing values. Two possibilities are to fill in the missing values with the mean/median/mode of the rest of the values for that feature or use another feature to impute and fill in the values (if some feature is correlated with another feature that has missing values, you can use this correlation to make a good approximation of what those values might be).

(b) (Numerical Features) Decide whether to bin some numerical features to create categorical features. For example, if a numerical feature has outliers or if your model is overfitting, binning might be helpful.

(c) (Categorical Features) Decide whether to bin some categorical features. Convert categorical features into numerical features, using label encoding for ordered categories (e.g., good, very good, excellent) to preserve the ordinal relationship, or one-hot encoding for unordered categories.

(d) (Correlated Features) Decide whether to combine two heavily correlated features or drop one of them if you cannot combine them.

(e) (Feature Selection) Evaluate the correlation of features with the target variable SalePrice and decide whether to drop features that show low correlation.

(f) Standardize or normalize your dataset by scaling numerical features to ensure that all features contribute equally to the model.

Please keep in mind that all the decisions you make here will affect the performance of the models in Question 3. You may want to experiment to see which choices lead to the best results before writing your report. You can also make different choices for different models.
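As an illustration only, the stacking and cleaning roadmap above could be sketched as follows. The tiny data frames are made-up stand-ins for the CourseSite files, the quality codes in `quality_order` are an assumed ordering, and median/mode imputation is just one of the options mentioned in (a), not a recommended answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the training and test files (values are illustrative).
train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 10000.0, np.nan],
    "KitchenQual": ["Gd", "TA", "Ex", None],
    "Heating": ["GasA", "GasA", "Wall", "GasA"],
    "SalePrice": [208500.0, 181500.0, np.nan, 223500.0],
})
test = pd.DataFrame({
    "LotArea": [9550.0, 14260.0],
    "KitchenQual": ["Gd", "Fa"],
    "Heating": ["GasW", "GasA"],
})

# Drop training rows with a missing target, then stack train and test.
train = train.dropna(subset=["SalePrice"]).reset_index(drop=True)
n_train = len(train)
y_train = train["SalePrice"]
stacked = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)

# List numerical and categorical features by dtype.
num_cols = stacked.select_dtypes(include="number").columns.tolist()
cat_cols = stacked.select_dtypes(exclude="number").columns.tolist()

# (a) Impute: median for numerical features, mode for categorical ones.
for col in num_cols:
    stacked[col] = stacked[col].fillna(stacked[col].median())
for col in cat_cols:
    stacked[col] = stacked[col].fillna(stacked[col].mode().iloc[0])

# (c) Label-encode an ordered category; one-hot encode an unordered one.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}  # assumed ordering
stacked["KitchenQual"] = stacked["KitchenQual"].map(quality_order)
stacked = pd.get_dummies(stacked, columns=["Heating"], dtype=float)

# (f) Standardize the originally numerical features.
stacked[num_cols] = StandardScaler().fit_transform(stacked[num_cols])

# Separate the stacked data back into training and test parts.
X_train = stacked.iloc[:n_train]
X_test = stacked.iloc[n_train:]
```

Keeping `n_train` before stacking makes it straightforward to split the cleaned data back apart for Questions 3 and 4 without reordering the test rows.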

2. (20 points) Use PCA for dimensionality reduction by transforming features into uncorrelated principal components and selecting a subset of them based on explained variance. For Question 3 below, you have the option to use the principal components as new features or retain the original features from the dataset.
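A minimal sketch of selecting principal components by explained variance, using scikit-learn on synthetic data. The 95% threshold and the synthetic feature matrix are assumptions made for illustration; the assignment does not prescribe a threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features: 6 observed columns driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (0.95 is an assumed threshold).
pca = PCA(n_components=0.95, svd_solver="full")
Z = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("cumulative explained variance:", np.round(cumulative, 3))
```

A plot of `cumulative` against the number of components is a common way to justify the cutoff in a report.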

3. (50 points) Compare all relevant models for regression tasks covered during the semester: Linear Regression, KNN, Decision Trees, Random Forests, and Neural Networks. Since this is a regression task, assess the models using the Mean Squared Error (MSE) score. Choose the best model and explain why you consider it the best.
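One possible way to compare the listed model families on MSE is k-fold cross-validation, sketched below with scikit-learn. The synthetic data, the hyperparameters, and the choice of 5 folds are all assumptions for illustration; they are not tuned and are not part of the assignment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the cleaned training set.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Untuned example settings; scaling is placed inside the pipeline so it is
# refit on each training fold.
models = {
    "Linear Regression": LinearRegression(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mse_by_model = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that larger is better; flip the sign.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    mse_by_model[name] = -scores.mean()

best_name = min(mse_by_model, key=mse_by_model.get)
for name, mse in sorted(mse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} MSE = {mse:,.1f}")
print("lowest cross-validated MSE:", best_name)
```

Using the same `cv` object for every model means each one is scored on identical folds, which keeps the comparison fair.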

4. (10 points) The graders will evaluate your model’s prediction performance using the submissions.csv file. Your model’s ranking relative to other students’ models will play a key role in determining your grade for this question. The submissions.csv file should consist of a single column of SalePrice predictions for the test dataset (please do not change the order of the rows in the original test dataset). These predictions will be used to assess the accuracy of your model and establish your rank in comparison to other students’ model scores.
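A short sketch of writing a single-column predictions file with pandas. The prediction values are made up, the file is written to a temporary directory for the sake of a self-contained example, and omitting the index column is an assumption about what "a single column" means here; the file name should match whatever the graders expect.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up values standing in for model.predict(X_test); the order must
# match the row order of the original test dataset.
predictions = np.array([181250.0, 143900.5, 229700.0])

submission = pd.DataFrame({"SalePrice": predictions})

# Temporary location for this example; index=False keeps the file to one column.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(out_path, index=False)

# Read the file back to confirm its shape and contents.
check = pd.read_csv(out_path)
print(check.shape)
```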

Please incorporate relevant plots to illustrate your points. Keep in mind that visuals should aid comprehension, not overwhelm the reader. Be sure to describe the information conveyed by each plot. We will review your code to verify the reproducibility of the results presented in your report. Your grade for Questions 1–3 above will be determined by the breadth of your knowledge and understanding of machine learning processes along with best practices.

You are expected to work on this assignment individually. You may consult other students for clarification and discussion, but please ensure that your code is your own.

Table 1: Feature Descriptions

No.  Feature        Description
1    Heating        Type of heating
2    HeatingQC      Heating quality and condition
3    CentralAir     Central air conditioning
4    Electrical     Electrical system
5    1stFlrSF       First floor square feet
6    2ndFlrSF       Second floor square feet
7    GrLivArea      Above grade (ground) living area square feet
8    BsmtFullBath   Basement full bathrooms
9    BsmtHalfBath   Basement half bathrooms
10   FullBath       Full bathrooms above grade
11   HalfBath       Half baths above grade
12   Bedroom        Number of bedrooms above basement level
13   Kitchen        Number of kitchens
14   KitchenQual    Kitchen quality
15   TotRmsAbvGrd   Total rooms above grade (does not include bathrooms)
16   Functional     Home functionality rating
17   Fireplaces     Number of fireplaces
18   FireplaceQu    Fireplace quality
19   GarageType     Garage location
20   GarageYrBlt    Year garage was built
21   GarageFinish   Interior finish of the garage
22   GarageCars     Size of garage in car capacity
23   GarageQual     Garage quality
24   GarageCond     Garage condition
25   PavedDrive     Paved driveway
26   WoodDeckSF     Wood deck area in square feet
27   OpenPorchSF    Open porch area in square feet
28   3SsnPorch      Three season porch area in square feet
29   ScreenPorch    Screen porch area in square feet
30   PoolArea       Pool area in square feet
31   PoolQC         Pool quality
32   Fence          Fence quality
33   MSZoning       The general zoning classification
34   LotArea        Lot size in square feet
35   Street         Type of road access
36   Alley          Type of alley access
37   LotShape       General shape of property
38   LandContour    Flatness of the property
39   Utilities      Type of utilities available
40   LotConfig      Lot configuration
41   Neighborhood   Physical locations within Ames city limits
42   Condition1     Proximity to main road or railroad
43   Condition2     Proximity to main road or railroad (if a second is present)
44   BldgType       Type of dwelling
45   HouseStyle     Style of dwelling
46   OverallQual    Overall material and finish quality
47   OverallCond    Overall condition rating
48   YearBuilt      Original construction date
49   YearRemodAdd   Remodel date
50   RoofStyle      Type of roof
51   Exterior1st    Exterior covering on house
52   Exterior2nd    Exterior covering on house (if more than one material)
53   MasVnrType     Masonry veneer type
54   MasVnrArea     Masonry veneer area in square feet
55   ExterQual      Exterior material quality
56   ExterCond      Present condition of the material on the exterior
57   Foundation     Type of foundation
58   BsmtQual       Height of the basement
59   BsmtCond       General condition of the basement
60   BsmtExposure   Walkout or garden-level basement walls
61   BsmtFinType1   Quality of basement finished area
62   BsmtFinSF1     Type 1 finished square feet
63   BsmtUnfSF      Unfinished square feet of basement area
64   TotalBsmtSF    Total square feet of basement area
65   MiscFeature    Miscellaneous feature not covered in other categories
66   MiscVal        Dollar value of miscellaneous feature
67   MoSold         Month sold
68   YrSold         Year sold
69   SaleType       Type of sale
70   SaleCondition  Condition of sale
71   SalePrice      The property’s sale price in dollars. This is the target variable.

