05 December to 19 December 2024
Introduction
This examination requires you to work on an end-to-end data science challenge in which you build a prediction model using a dataset of your choice.
While you have flexibility in selecting your dataset, you must follow a specific series of tasks when building your code base and conducting the analysis, and your work must meet the coding best practices covered in both courses.
The project has two main parts: the code base, which contains your analyses and modelling scripts, and the report, in which you present your results.
Where to Find Datasets
- Google Data Set Search
- U.S. Government’s Open Data
- Kaggle Datasets
- OpenML Datasets
- UCI Machine Learning Repository
Your dataset should meet the following requirements:
- Features shouldn't be preprocessed (e.g. no pre-calculated principal components) and should have a clear meaning.
- The target variable can be binary, discrete or continuous.
- Ideally, the dataset contains heterogeneous data types such as numeric and categorical features.
- We recommend datasets with more than 1000 observations to have enough data for machine learning models.
Note: Do not use any of the datasets from our problem sets (e.g. French Motor Dataset, Titanic).
The dataset you choose will be part of the assessment. However, you will be asked to complete a series of tasks for your analysis and to fulfil the best practices covered in the course regarding your code base. This means that choosing a simpler dataset but completing all tasks well can still yield a high mark.
Code base
The code base you write should follow the best practices covered in both courses, Fundamentals of Data Science and Research Computing.
As part of the assessment, we will run your code, which means that the code should run out-of-the-box and contain any necessary instructions in your Readme file.
By "run out-of-the-box" we mean that your code must be executable without modification. This requires:
- A complete environment.yml file listing all package dependencies
- Clear installation instructions in your README.md
- Your repository configured as an installable package
Beyond running out-of-the-box, your code base should demonstrate the practices covered in the course, including:
- Conda environments and installing your repo as a package
- Appropriate usage of version control with Git and GitHub
- Important: We will anonymise your Git history for blind grading.
- Modularised code
- Unit tests
- House-keeping things:
- Consistent and clean code through pre-commit hooks
- Type hints and docstrings for function documentation and testing
- Appropriate path handling
Project tasks
- Create a module called data that contains a function load_data to load your raw dataset (a minimal sketch follows this list).
- Your data can be made available in one of three ways:
  - directly in the repo's data folder
  - via a function that downloads and saves the data to the data folder
  - via instructions in the Readme on how we can download it manually
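For illustration, a minimal sketch of such a module, assuming a src/ package layout; the file name, directory depth, and download URL are placeholders:

```python
# data.py -- minimal sketch of the data module; paths and URL are placeholders
from pathlib import Path

import pandas as pd

# Resolve the repo's data folder relative to this file (assumes src/<package>/data.py)
DATA_DIR = Path(__file__).resolve().parents[2] / "data"
RAW_FILE = DATA_DIR / "raw.csv"  # placeholder file name
DATA_URL = "https://example.com/your-dataset.csv"  # placeholder URL


def load_data() -> pd.DataFrame:
    """Load the raw dataset, downloading it into the data folder if missing."""
    if not RAW_FILE.exists():
        DATA_DIR.mkdir(parents=True, exist_ok=True)
        pd.read_csv(DATA_URL).to_csv(RAW_FILE, index=False)
    return pd.read_csv(RAW_FILE)
```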
(a) Load and explore your raw dataset in a Jupyter notebook called eda_cleaning.ipynb.
- Please write any custom exploratory data analysis functions in separate modules (e.g. preprocessing, plotting) that you import into your Jupyter Notebook.
- Your exploratory analysis should touch on the following points and provide meaningful visualisations (use your favourite plotting library and make sure plots are properly labelled and legible):
- Describe your data (e.g. dtypes, descriptive statistics)
- What is the distribution of the target variable?
- Are there missing values or outliers?
- How do specific features correlate with the target variable?
- What features can we use for the specific prediction task?
- Take necessary cleaning steps based on your EDA to prepare your data for modelling.
- Save your prepared dataset to a .parquet file in your data folder. We should be able to reproduce the prepared dataset given your preparation script and/or functions.
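A sketch of what a reproducible preparation script could look like; the package name, cleaning steps, and column name are placeholders:

```python
# prepare_data.py -- sketch of a reproducible preparation step
from pathlib import Path

from yourpackage.data import load_data  # hypothetical package name

df = load_data()

# Example cleaning steps -- replace with the steps motivated by your EDA
df = df.drop_duplicates()
df = df.dropna(subset=["target"])  # "target" is a placeholder column name

# Writing parquet requires pyarrow or fastparquet in your environment
df.to_parquet(Path("data") / "prepared.parquet", index=False)
```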
(a) Load cleaned data and split it
- Load your cleaned data from above within a model_training.py script.
- Split your sample using either random splitting or ID-based splitting; a sketch of both options follows below.
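A minimal sketch of both splitting options; the file path and the "id" column are placeholders:

```python
# model_training.py -- two splitting options (sketch)
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

df = pd.read_parquet("data/prepared.parquet")

# Option 1: random splitting
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Option 2: ID-based splitting -- all rows sharing an ID land in the same split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["id"]))  # "id" is a placeholder
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```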
(b) Setup your modelling pipelines:
- As part of this: write your own simple scikit-learn transformer in a feature_engineering module. This can be a simplified re-implementation of an existing one, e.g. StandardScaler, but please avoid using the ones we have discussed in the course (Winsorizer, SquaredTransformer). You don't have to utilise the transformer in your final feature engineering pipeline.
- Write a unit test for your transformer implementation. The unit test should be parametrised to cover different cases. Make sure your unit test passes. (A sketch of both the transformer and the test follows below.)
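For illustration, a minimal sketch of such a transformer (a simplified StandardScaler re-implementation); the package import path in the test is a placeholder:

```python
# feature_engineering.py -- simplified StandardScaler re-implementation (sketch)
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class SimpleScaler(BaseEstimator, TransformerMixin):
    """Scale features to zero mean and unit variance, column by column."""

    def fit(self, X, y=None):
        X = check_array(X)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.scale_[self.scale_ == 0.0] = 1.0  # guard against constant columns
        return self

    def transform(self, X):
        check_is_fitted(self)
        return (check_array(X) - self.mean_) / self.scale_
```

And a corresponding parametrised unit test with pytest:

```python
# tests/test_feature_engineering.py -- parametrised pytest sketch
import numpy as np
import pytest

from yourpackage.feature_engineering import SimpleScaler  # placeholder import path


@pytest.mark.parametrize(
    "X",
    [
        np.array([[1.0], [2.0], [3.0]]),        # single column
        np.array([[1.0, 10.0], [3.0, 30.0]]),   # two columns
        np.array([[5.0, 1.0], [5.0, 2.0]]),     # constant-column edge case
    ],
)
def test_simple_scaler_centres_data(X):
    Xt = SimpleScaler().fit_transform(X)
    np.testing.assert_allclose(Xt.mean(axis=0), 0.0, atol=1e-12)
```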
(c) Tune the model pipelines:
- Find the right degree of regularisation (alpha and l1_ratio) for the GLM.
- Tune learning_rate, n_estimators, num_leaves and min_child_weight for the LGBM pipeline. You can use early_stopping to reduce the hyperparameter space and simply set a large enough value for n_estimators.
- Use k-fold cross-validation when tuning.
- Feel free to use your preferred tuning implementation, e.g. GridSearchCV or RandomizedSearchCV from scikit-learn, or try out the optuna package. (A tuning sketch follows below.)
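A tuning sketch using scikit-learn's search classes, assuming glm_pipeline and lgbm_pipeline are pipelines whose final step is named "model" (all placeholders):

```python
# Hyperparameter tuning with k-fold cross-validation (sketch)
from scipy.stats import randint, uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# GLM: grid over regularisation strength and L1/L2 mixing
glm_search = GridSearchCV(
    glm_pipeline,  # placeholder pipeline object
    param_grid={
        "model__alpha": [0.001, 0.01, 0.1, 1.0],
        "model__l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0],
    },
    cv=5,
)
glm_search.fit(X_train, y_train)

# LGBM: random search; n_estimators is set large once and capped via early stopping
lgbm_search = RandomizedSearchCV(
    lgbm_pipeline,  # placeholder pipeline object
    param_distributions={
        "model__learning_rate": uniform(0.01, 0.3),
        "model__num_leaves": randint(8, 128),
        "model__min_child_weight": uniform(1e-3, 10.0),
    },
    n_iter=50,
    cv=5,
    random_state=42,
)
lgbm_search.fit(X_train, y_train)
```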
(d) Evaluate the predictions of your tuned GLM and LGBM pipelines on your validation set.
- Hint: You can use the evaluation function we implemented in problem set 4.
- Create a "predicted vs. actual" plot for both models.
- What are the most relevant features in the models?
- Plot partial dependence plots for the top 5 most important features of the models. (If the feature importance between GLM and LGBM doesn’t give the same top 5 features, just take the top 5 features of the LGBM).
- Hint: You can use Dalex's Explainer class to produce most of the model diagnostics, but you can also use any other package of your choice. (A sketch follows below.)
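A diagnostics sketch using dalex, plus matplotlib for the predicted-vs-actual plot; the validation data and feature names are placeholders:

```python
# Model diagnostics (sketch): importance, partial dependence, predicted vs. actual
import dalex as dx
import matplotlib.pyplot as plt

model = lgbm_search.best_estimator_  # from the tuning sketch above
exp = dx.Explainer(model, X_valid, y_valid, label="LGBM")

# Permutation-based feature importance
exp.model_parts().plot()

# Partial dependence for the top-5 features ("feat_1" etc. are placeholders)
exp.model_profile(variables=["feat_1", "feat_2", "feat_3", "feat_4", "feat_5"]).plot()

# Predicted vs. actual with matplotlib
y_pred = model.predict(X_valid)
plt.scatter(y_valid, y_pred, alpha=0.3)
plt.axline((0, 0), slope=1, color="red")
plt.xlabel("actual")
plt.ylabel("predicted")
plt.title("Predicted vs. actual (validation set)")
plt.show()
```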
Report
The report should describe your approach and the most important insights, consisting of:
- A short motivation for the prediction task.
- Explanatory data analysis (remember the difference between exploratory and explanatory data analysis from the lectures), using effective visualisations to understand your dataset and showcase key cleaning steps.
- Please list all the cleaning steps (bullet-points are enough)
- Describe and motivate your feature selection and engineering approach.
- Describe and show how you arrived at the final model, including:
  - a description of your evaluation approach
  - a description of your hyperparameter tuning approach and the selected parameters
  - a comparison of the GLM and LGBM results
- Describe and show the final performance of your model.
- Provide an outlook on how to improve the current analysis (e.g. what kind of data would help improve your model? What else would you have liked to do with more time?).
Grading
The different parts of the exam are weighted as follows:
- (30%) Repository structure, organisation and code quality w.r.t. concepts covered in the courses
- (15%) Exploratory data analysis (including data collection and cleaning)
- (20%) Modelling
- (15%) Evaluation and interpretability
- (20%) A concise project report including effective presentation of visualisations and results
- The final report (PDF) should not exceed 2000 words (excluding plots and code).