05 December to 19 December 2024
Introduction
This examination requires you to work on an end-to-end data science challenge in which you build a prediction model using a dataset of your choice.
While you have flexibility in selecting your dataset, you must follow a specific series of tasks when building your code base and conducting the analysis, and your work must meet the coding best practices covered in both courses.
The project has two main parts: the code base, which contains your analyses and modelling scripts, and the report, in which you present your results.
Where to Find Datasets
- Google Data Set Search
- U.S. Government’s Open Data
- Kaggle Datasets
- OpenML Datasets
- UCI Machine Learning Repository
Your dataset should meet the following requirements:
- Features shouldn't be preprocessed (e.g. no pre-calculated principal components) and should have a clear meaning.
- The target variable can be binary, discrete or continuous.
- Ideally, the dataset contains heterogeneous data types such as numeric and categorical features.
- We recommend datasets with more than 1000 observations to have enough data for machine learning models.
Note: Do not use any of the datasets from our problem sets (e.g. French Motor Dataset, Titanic).
The dataset you choose will be part of the assessment. However, you will be asked to complete a series of tasks for your analysis and to fulfil the best practices covered in the course regarding your code base. This means that choosing a simpler dataset but completing all tasks well can still yield a high mark.
Code base
The code base you write should follow the best practices covered in both courses, Fundamentals of Data Science and Research Computing.
As part of the assessment, we will run your code, which means that the code should run out-of-the-box and contain any necessary instructions in your Readme file.
By "run out-of-the-box" we mean that your code must be executable without modification. This requires:
- A complete environment.yml file listing all package dependencies
- Clear installation instructions in your README.md
- Your repository configured as an installable package
Beyond running out-of-the-box, your code base should demonstrate the practices covered in the course, including:
- Conda environments and installing your repo as a package
- Appropriate usage of version control with Git and GitHub
- Important: We will anonymise your Git history for blind grading.
- Modularised code
- Unit tests
- House-keeping things:
- Consistent and clean code through pre-commit hooks
- Type hints and docstrings for function documentation and testing
- Appropriate path handling
Project tasks
- Create a module called data that contains a function load_data to load your raw dataset (a minimal sketch follows this list).
- Your data can be made available in one of three ways:
  - directly in the repo's data folder
  - via a function that downloads and saves the data to the data folder
  - via instructions in the Readme on how we can download it manually
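For illustration, a minimal sketch of such a module, assuming a src/ package layout; the file name, directory depth, and download URL are placeholders:

```python
# data.py -- minimal sketch of the data module; paths and URL are placeholders
from pathlib import Path

import pandas as pd

# Resolve the repo's data folder relative to this file (assumes src/<package>/data.py)
DATA_DIR = Path(__file__).resolve().parents[2] / "data"
RAW_FILE = DATA_DIR / "raw.csv"  # placeholder file name
DATA_URL = "https://example.com/your-dataset.csv"  # placeholder URL


def load_data() -> pd.DataFrame:
    """Load the raw dataset, downloading it into the data folder if missing."""
    if not RAW_FILE.exists():
        DATA_DIR.mkdir(parents=True, exist_ok=True)
        pd.read_csv(DATA_URL).to_csv(RAW_FILE, index=False)
    return pd.read_csv(RAW_FILE)
```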
(a) Load and explore your raw dataset in a Jupyter notebook called eda_cleaning.ipynb.
- Please write any custom exploratory data analysis functions in separate modules (e.g. preprocessing, plotting) that you import into your Jupyter Notebook.
- Your exploratory analysis should touch on the following points and provide meaningful visualisations (use your favourite plotting library and make sure plots are properly labelled and legible):
- Describe your data (e.g. dtypes, descriptive statistics)
- What is the distribution of the target variable?
- Are there missing values or outliers?
- How do specific features correlate with the target variable?
- What features can we use for the specific prediction task?
- Take necessary cleaning steps based on your EDA to prepare your data for modelling.
- Save your prepared dataset to a .parquet file in your data folder. We should be able to reproduce the prepared dataset given your preparation script and/or functions.
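A sketch of what a reproducible preparation script could look like; the package name, cleaning steps, and column name are placeholders:

```python
# prepare_data.py -- sketch of a reproducible preparation step
from pathlib import Path

from yourpackage.data import load_data  # hypothetical package name

df = load_data()

# Example cleaning steps -- replace with the steps motivated by your EDA
df = df.drop_duplicates()
df = df.dropna(subset=["target"])  # "target" is a placeholder column name

# Writing parquet requires pyarrow or fastparquet in your environment
df.to_parquet(Path("data") / "prepared.parquet", index=False)
```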
(a) Load cleaned data and split it
- Load your cleaned data from above within a model_training.py script.
- Split your sample using either random splitting or ID-based splitting; a sketch of both options follows below.
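A minimal sketch of both splitting options; the file path and the "id" column are placeholders:

```python
# model_training.py -- two splitting options (sketch)
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

df = pd.read_parquet("data/prepared.parquet")

# Option 1: random splitting
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Option 2: ID-based splitting -- all rows sharing an ID land in the same split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["id"]))  # "id" is a placeholder
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```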
(b) Setup your modelling pipelines:
- As part of this: write your own simple scikit-learn transformer in a feature_engineering module. This can be a simplified re-implementation of an existing one, e.g. StandardScaler, but please avoid using the ones we have discussed in the course (Winsorizer, SquaredTransformer). You don't have to utilise the transformer in your final feature engineering pipeline.
- Write a unit test for your transformer implementation. The unit test should be parametrised to cover different cases. Make sure your unit test passes. (A sketch of both the transformer and the test follows below.)
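For illustration, a minimal sketch of such a transformer (a simplified StandardScaler re-implementation); the package import path in the test is a placeholder:

```python
# feature_engineering.py -- simplified StandardScaler re-implementation (sketch)
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class SimpleScaler(BaseEstimator, TransformerMixin):
    """Scale features to zero mean and unit variance, column by column."""

    def fit(self, X, y=None):
        X = check_array(X)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.scale_[self.scale_ == 0.0] = 1.0  # guard against constant columns
        return self

    def transform(self, X):
        check_is_fitted(self)
        return (check_array(X) - self.mean_) / self.scale_
```

And a corresponding parametrised unit test with pytest:

```python
# tests/test_feature_engineering.py -- parametrised pytest sketch
import numpy as np
import pytest

from yourpackage.feature_engineering import SimpleScaler  # placeholder import path


@pytest.mark.parametrize(
    "X",
    [
        np.array([[1.0], [2.0], [3.0]]),        # single column
        np.array([[1.0, 10.0], [3.0, 30.0]]),   # two columns
        np.array([[5.0, 1.0], [5.0, 2.0]]),     # constant-column edge case
    ],
)
def test_simple_scaler_centres_data(X):
    Xt = SimpleScaler().fit_transform(X)
    np.testing.assert_allclose(Xt.mean(axis=0), 0.0, atol=1e-12)
```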
(c) Tune the model pipelines:
- Find the right degree of regularisation (alpha and l1_ratio) for the GLM.
- Tune learning_rate, n_estimators, num_leaves and min_child_weight for the LGBM pipeline. You can use early_stopping to reduce the hyperparameter space and simply set a large enough value for n_estimators.
- Use k-fold cross-validation when tuning.
- Feel free to use your preferred tuning implementation, e.g. GridSearchCV or RandomizedSearchCV from scikit-learn, or try out the optuna package. (A tuning sketch follows below.)
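A tuning sketch using scikit-learn's search classes, assuming glm_pipeline and lgbm_pipeline are pipelines whose final step is named "model" (all placeholders):

```python
# Hyperparameter tuning with k-fold cross-validation (sketch)
from scipy.stats import randint, uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# GLM: grid over regularisation strength and L1/L2 mixing
glm_search = GridSearchCV(
    glm_pipeline,  # placeholder pipeline object
    param_grid={
        "model__alpha": [0.001, 0.01, 0.1, 1.0],
        "model__l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0],
    },
    cv=5,
)
glm_search.fit(X_train, y_train)

# LGBM: random search; n_estimators is set large once and capped via early stopping
lgbm_search = RandomizedSearchCV(
    lgbm_pipeline,  # placeholder pipeline object
    param_distributions={
        "model__learning_rate": uniform(0.01, 0.3),
        "model__num_leaves": randint(8, 128),
        "model__min_child_weight": uniform(1e-3, 10.0),
    },
    n_iter=50,
    cv=5,
    random_state=42,
)
lgbm_search.fit(X_train, y_train)
```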
(d) Evaluate the predictions of your tuned GLM and LGBM pipelines on your validation set.
- Hint: You can use the evaluation function we implemented in problem set 4.
- Create a "predicted vs. actual" plot for both models.
- What are the most relevant features in the models?
- Plot partial dependence plots for the top 5 most important features of the models. (If the feature importance between GLM and LGBM doesn’t give the same top 5 features, just take the top 5 features of the LGBM).
- Hint: You can use Dalex's Explainer class to produce most of the model diagnostics, but you can also use any other package of your choice. (A sketch follows below.)
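A diagnostics sketch using dalex, plus matplotlib for the predicted-vs-actual plot; the validation data and feature names are placeholders:

```python
# Model diagnostics (sketch): importance, partial dependence, predicted vs. actual
import dalex as dx
import matplotlib.pyplot as plt

model = lgbm_search.best_estimator_  # from the tuning sketch above
exp = dx.Explainer(model, X_valid, y_valid, label="LGBM")

# Permutation-based feature importance
exp.model_parts().plot()

# Partial dependence for the top-5 features ("feat_1" etc. are placeholders)
exp.model_profile(variables=["feat_1", "feat_2", "feat_3", "feat_4", "feat_5"]).plot()

# Predicted vs. actual with matplotlib
y_pred = model.predict(X_valid)
plt.scatter(y_valid, y_pred, alpha=0.3)
plt.axline((0, 0), slope=1, color="red")
plt.xlabel("actual")
plt.ylabel("predicted")
plt.title("Predicted vs. actual (validation set)")
plt.show()
```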
Report
The report should describe your approach and the most important insights, consisting of:
- A short motivation for the prediction task.
- Explanatory data analysis (remember the difference between exploratory and explanatory data analysis from the lectures), using effective visualisations to understand your dataset and showcase key cleaning steps.
- Please list all the cleaning steps (bullet-points are enough)
- Describe and motivate your feature selection and engineering approach.
- Describe and show how you arrived at the final model, including:
  - a description of your evaluation approach
  - a description of your hyperparameter tuning approach and the selected parameters
  - a comparison of the GLM and LGBM results
- Describe and show the final performance of your model.
- Provide an outlook on how to improve the current analysis (e.g. what kind of data would help improve your model? What else would you have liked to do with more time?).
Grading
The different parts of the exam are weighted as follows:
- (30%) Repository structure, organisation and code quality w.r.t. concepts covered in the courses
- (15%) Exploratory data analysis (including data collection and cleaning)
- (20%) Modelling
- (15%) Evaluation and interpretability
- (20%) A concise project report including effective presentation of visualisations and results
- The final report (PDF) should not exceed 2000 words (excluding plots and code).