DSC 148: Intro to Data Mining

DSC 148: Intro to Data Mining – Data Mining Challenge

Winter 2024

Challenge Objective

Given a partial dataset of Airbnb listings in New York City, your task is to design data mining models that capture the relationship between a listing's price (dollars per day) and the other observed variables. In other words, you must predict the price of each listing from the rest of its information.

Due date: Feb 29, 11:00 PM PT

Data and Baselines

train.csv: This file contains all the training data you may use in this challenge. The first column, “id”, is the unique key that identifies each listing. The other columns hold different features/attributes of a listing, including free text, numerical features, and categorical features. Many values are missing, so please conduct some exploratory data analysis (EDA) before you start feature engineering.
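As a starting point for the EDA, it can help to measure how much of each column is missing. The following is a minimal sketch using only the Python standard library. The sample rows and every column name other than “id” and “price” are made up for illustration, and it assumes missing values appear as empty cells; check how they are actually encoded in train.csv.

```python
import csv
import io

# Tiny made-up sample standing in for train.csv. Only "id" and "price"
# are column names taken from the handout; the rest are hypothetical.
SAMPLE = """id,name,room_type,reviews_per_month,price
1,Cozy loft,Entire home/apt,0.5,150
2,,Private room,,80
3,Sunny room,,1.2,95
"""


def missing_fraction(csv_file):
    """Return {column name: fraction of rows where the cell is empty}."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in counts:
            value = row[name]
            # Assumption: a missing value is an empty (or absent) cell.
            if value is None or value.strip() == "":
                counts[name] += 1
    if n_rows == 0:
        return {name: 0.0 for name in counts}
    return {name: count / n_rows for name, count in counts.items()}


fractions = missing_fraction(io.StringIO(SAMPLE))
for column, fraction in fractions.items():
    print(f"{column}: {fraction:.0%} missing")
```

To run it on the real data, pass an opened file instead of the in-memory sample, for example `open("train.csv", newline="", encoding="utf-8")`.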

test.csv: This file contains all the listings whose prices you need to predict. Its format is the same as train.csv, except that the price column has been removed.

simple_baseline.ipynb: This notebook contains two simple baseline methods. Running it produces simple_linear_regression_baseline.csv (leaderboard score: 114.56170) and mean_value_baseline.csv (leaderboard score: 135.55633). The first is produced by a linear regression model with very simple features. The second blindly predicts the mean training price for every listing.
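To make the mean-value idea concrete, here is a minimal sketch of such a baseline. It is not the code from simple_baseline.ipynb; the function name and the prices are hypothetical and serve only to illustrate the approach.

```python
from statistics import fmean


def mean_value_baseline(train_prices, n_test):
    """Predict the mean training price for each of n_test listings."""
    mean_price = fmean(train_prices)
    return [mean_price] * n_test


# Made-up training prices, for illustration only.
train_prices = [80.0, 120.0, 100.0]
predictions = mean_value_baseline(train_prices, n_test=2)
print(predictions)
```

Because this baseline ignores every feature, it is a useful sanity check: any model that uses the listing information should do better than it.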

You can download all the datasets and baselines here: http://tinyurl.com/cse148-w24-dmc

Evaluation Metrics

Your predictions will be evaluated against the ground-truth prices using the root mean squared error (RMSE). For each test listing, we compute the squared error between the ground truth and your prediction, average these squared errors over all listings, and take the square root. For more information, please check this webpage.
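The metric described above can be sketched in a few lines of Python, which is handy for scoring a held-out part of train.csv locally before submitting. This is an illustrative implementation, not the leaderboard's own code, and the prices below are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")
    # Squared error per listing, then the mean, then the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, for illustration only.
truth = [100.0, 150.0, 200.0]
preds = [110.0, 140.0, 230.0]
print(rmse(truth, preds))
```

Note that squaring makes large errors dominate: in this example the single 30-dollar miss contributes more to the score than the two 10-dollar misses combined.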

Registering your Kaggle Team Name

We will release an assignment for submitting your Kaggle username shortly after the challenge starts. You must register your Kaggle username through Gradescope to participate in this challenge; otherwise, you will be considered an outsider and your score won’t be counted towards your overall grade. This username submission is worth 1 point of your challenge grade.

After you upload a submission, your team name appears under the “Team Name” column. Please submit that name.

Note: This is an individual competition and the team name is just the Kaggle terminology here.

Scoring

If you achieve an RMSE strictly smaller than the “simple-linear-regression” benchmark, you will receive 50% of the credit. If you achieve an RMSE smaller than 100, you will receive an additional 40% of the credit. The remaining 10% will be decided by your ranking.

Submission Format

You are asked to run your models locally and upload your final prediction file. It is a CSV file with a header row naming two columns: Id and Predicted. The first letter of each header must be capitalized. The first column corresponds to the id in the test.csv file, and the second column contains the predicted price.
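A prediction file in this format could be written as in the sketch below. The helper name, the ids, and the prices are hypothetical; only the two header names, Id and Predicted, come from the description above.

```python
import csv
import io


def write_submission(ids, predictions, out_file):
    """Write a two-column CSV with the headers Id and Predicted."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])  # headers are capitalized
    for listing_id, price in zip(ids, predictions):
        writer.writerow([listing_id, price])


# Made-up ids and prices, written to an in-memory buffer for illustration.
buffer = io.StringIO()
write_submission([101, 102, 103], [120.5, 89.0, 240.25], buffer)
print(buffer.getvalue())
```

To produce an actual file, pass a file opened with `open("submission.csv", "w", newline="")` instead of the buffer; the file name here is only an example.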

Once you submit, the system evaluates a fixed portion of the test set (30%, randomly chosen) and computes the RMSE on it. Your score is then displayed on the leaderboard. Please note that the leaderboard during the challenge is NOT final. The final leaderboard will be refreshed once the challenge ends, with a new RMSE computed on the remaining 70% of the test set, which has not been evaluated before.

You can make at most 20 submissions per day. Please start early and make sure you have enough time to tweak your models and hyperparameters. You will be able to choose 2 submissions for the final evaluation, and the system will use the better of the two scores.
