machine learning models

In this coursework, you are required to apply, evaluate and compare polynomial features and Boruta feature selection, together with SHAP values for model explainability, across eight optimized machine learning models on a UCI ML Repository dataset, using the Python programming language. You will implement each feature selection technique on the UCI ML Repository dataset provided to you, apply the results to eight optimized machine learning models, discuss all your findings and reach a reasonable conclusion. Note that you are not expected to use the default ML classifiers from scikit-learn; instead, the coursework requires you to use the grid search technique (scikit-learn's GridSearchCV), as explained in class, to find the optimized hyperparameters for each ML model. Your work will involve documenting all stages, including the collection of the dataset, data preprocessing, data visualization, handling of missing data and data imbalance, data normalization, feature selection, the optimized grid search configuration of the eight machine learning models, training and testing of the optimized models, and the evaluation metrics. The aim is to assess your ability to apply theoretical concepts to real-life applications and to critically analyse model performance and findings.

Learning Outcomes

1. Evaluate and articulate the issues and challenges in machine learning, including feature selection, model selection, and the model decision-making process for real-life applications.

2. Demonstrate a working knowledge of the variety of mathematical techniques normally adopted for machine learning problems, and of their application to creating effective solutions.

3. Critically evaluate the performance, limitations and future findings of a proposed solution to a machine learning problem.

4. Create solutions to machine learning problems using appropriate tools.

Report Writing Structure (Total 100%)

Follow the provided report template, ensuring your report is well-organized and includes the following sections:

Abstract (maximum of 250 words): (4%)

Summarize the key aspects of your coursework, including the problem, the feature selection technique, models used, the dataset, and the main findings.

1.0 Introduction (4%)

Introduce the problem of feature selection in machine learning, particularly for your given dataset, and explain its importance and the coursework objectives.

1.1 Literature Review (4%)

Discuss previous work related to these feature selection techniques (or other related selection techniques) and machine learning models, with a focus on your given dataset, highlighting contributions and limitations.

2.0 Methodology (26%)

Detail the dataset, preprocessing steps, optimized feature selection techniques, ML models, and hyperparameter tuning and the model architecture. Provide visualizations of the model architecture and design pipeline.

2.1 Data Collection and Data Preprocessing (5%)

· Explain the data collected from the UCI ML Repository and ensure it is well-structured, with diverse and consistent entries. Also present data visualizations and use them to discuss the data distribution, using plots such as scatter plots, violin plots, ridge plots, kernel density plots, boxen plots, etc.

· Perform and explain the essential preprocessing tasks, including handling missing data, normalizing features, and addressing class imbalances to prepare the dataset for model training and testing.
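The preprocessing tasks listed above could be sketched as follows. This is a minimal illustration, not a prescribed method: the toy array stands in for your assigned dataset, and median imputation, z-score scaling and random oversampling are assumed choices among several reasonable ones.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Toy data standing in for the assigned dataset (assumption):
# two features with missing values and an imbalanced binary label.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0],
              [5.0, 50.0], [np.nan, 60.0], [7.0, 70.0], [8.0, 80.0]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Median imputation followed by z-score normalization.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_clean = preprocess.fit_transform(X)

# Random oversampling: draw minority rows with replacement
# until both classes have the same number of samples.
majority = X_clean[y == 0]
minority = X_clean[y == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

X_balanced = np.vstack([majority, minority_up])
y_balanced = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])
```

In a full experiment, the imputer, scaler and resampling would normally be fitted on the training portion only, so that no information from the test set leaks into training.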

2.2 Feature Selection techniques (6%) 

Implement the optimized Boruta and polynomial feature techniques. Document the process of each feature selection technique and give the mathematical formulas of the optimized features that contribute to model performance.

2.3 ML model Selection and Optimization (6%)

Select eight different machine learning models. These are the models already explained in class: support vector machine (SVM), logistic regression (LR), K-nearest neighbour (KNN), decision tree (DT), adaptive boosting (ADA), bagging, stacking, and voting classifiers. For each model, apply hyperparameter tuning to enhance its performance. You must document the rationale behind the selection of models and the tuning process.
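A grid search for one of the eight models might look like the sketch below. It is only an example under stated assumptions: scikit-learn's bundled breast cancer data stands in for your assigned dataset, and the parameter grid for the SVM is illustrative rather than recommended.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset (assumption): replace with the dataset assigned to you.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is re-fitted on each CV training fold.
pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Illustrative grid; step-name prefixes ("svm__") route parameters to the SVC step.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
    "svm__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)

# GridSearchCV refits the best configuration on the full training set by default.
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The same pattern applies to the other seven models by swapping the final pipeline step and its grid.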

2.4 Data Partitioning and Environment Setup (3%)

Explain the two data split methods (train-test split and cross-validation split), choose the split ratio and the number of cross-validation folds based on the dataset size, and describe the environment setup.
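The two split methods can be sketched as below. The 80/20 ratio, the five folds and the synthetic arrays are assumptions chosen for illustration; stratification is used so that each split keeps the class proportions of the full data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (assumption): 100 samples, 70 of class 0 and 30 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# Method 1: a single stratified train-test split (80% train, 20% test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Method 2: stratified 5-fold cross-validation; every sample is used
# for validation exactly once across the five folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in skf.split(X, y)]

print("Train/test sizes:", len(X_train), len(X_test))
print("Fold sizes (train, validation):", fold_sizes)
```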

2.5 Explain and provide visualizations of the model architecture and design pipeline. (2%)

2.6 Explain the model explainability approach (2%)

2.7 Present the nine evaluation metrics for the model with their respective mathematical formulas (2%) 
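For the metrics derived from a binary confusion matrix, the standard formulas can be written directly in code. The sketch below covers only accuracy, sensitivity, specificity, precision and F1-score; the counts passed in are made-up numbers, and the helper name classification_metrics is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from binary confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives. No zero-division handling is included.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # also called recall or true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```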

3.0 Results and Discussion (25%)

· Train each of the models using the features selected by each feature selection technique. Also train each model on the entire feature set (without any feature selection). Monitor and document the training progress.

· Test the trained models on the testing set.

· Evaluate and present the performance results under the two data split methods (train-test split and cross-validation split) using the evaluation metrics: accuracy, sensitivity, specificity, precision, F1-score, ROC-AUC, confusion matrix and time. In your report, tabulate accuracy, sensitivity, specificity, precision, F1-score, ROC-AUC and time for every ML model, both with each feature selection technique and without feature selection.

· Also, document the learning curves, the confusion matrices and ROC-AUCs of the best ML model with the best feature selection technique.

· Discuss the impact of each feature selection technique on model performance, compared with the case where no feature selection is applied. Present and analyse the experimental results using tables, figures, and plots at high image resolution, and discuss how effective each feature selection technique is.
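One possible way to organize the comparison described above is a loop over feature configurations and models that records each metric and the training time in one table. The sketch is deliberately reduced: it assumes scikit-learn's bundled breast cancer data as a stand-in dataset, uses only two untuned models and two feature configurations, and reports a subset of the metrics.

```python
import time

from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): replace with the dataset assigned to you.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Feature configurations to compare (illustrative subset).
feature_sets = {
    "no selection": [StandardScaler()],
    "polynomial (degree 2)": [StandardScaler(),
                              PolynomialFeatures(degree=2, include_bias=False)],
}
# Untuned models for brevity; in the coursework these would be the tuned estimators.
models = {
    "LR": LogisticRegression(max_iter=5000),
    "DT": DecisionTreeClassifier(random_state=42),
}

rows = []
for fs_name, steps in feature_sets.items():
    for model_name, model in models.items():
        # clone() gives every run fresh, unfitted copies of each step.
        clf = make_pipeline(*[clone(step) for step in steps], clone(model))
        start = time.perf_counter()
        clf.fit(X_train, y_train)
        train_seconds = time.perf_counter() - start
        y_pred = clf.predict(X_test)
        y_score = clf.predict_proba(X_test)[:, 1]
        rows.append({
            "features": fs_name,
            "model": model_name,
            "accuracy": accuracy_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
            "roc_auc": roc_auc_score(y_test, y_score),
            "train_seconds": train_seconds,
        })

for row in rows:
    print(row)
```

The list of dictionaries in rows can be passed to a table library or written to CSV to produce the results table for the report.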

4.0 Model Explainability (20%)

· Machine learning models are often seen as ‘black boxes’. Provide a model interpretation using only the result that combines the best feature selection technique with the best ML model. Discuss how SHAP values help attribute a model’s prediction to its features, supporting a comprehensive, fair and trusted AI decision-making process, using the following plots: waterfall, dependency, force, summary and dot plots. Ensure all images are saved at 600 dpi in your code.
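For reference, the attribution that SHAP builds on is the Shapley value from cooperative game theory. The notation below is one common way of writing it and is an assumption of this note: F is the set of all features, S is a subset that excludes feature i, f_x(S) is the expected model output when only the features in S are known for instance x, and φ_0 is the base value (the expected output over the background data).

```latex
\phi_i(f, x) \;=\; \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \,\bigl[\, f_x(S \cup \{i\}) - f_x(S) \,\bigr],
\qquad
f(x) \;=\; \phi_0 + \sum_{i \in F} \phi_i(f, x)
```

The second identity (additivity) is what a waterfall or force plot visualizes: starting from the base value, the per-feature contributions sum to the model's prediction for that instance.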

5.0 Conclusion, Limitations, and Future Work (8%)

Summarize your findings, discuss any limitations encountered, and suggest areas for future work or improvements. Also briefly compare the performance results of your ML coursework with those of the other researchers' work you reviewed in Table 1 of the ML Coursework Template.

6.0 References (5%)

Cite all sources used, adhering to IEEE citation standards. Include a minimum of 12 relevant references.

The three main parts of a reference are as follows:

1. Author’s name listed as first initial of first name, then full last.

2. Title of article, patent, conference paper, etc., in quotation marks.

3. Title of journal or book in italics.

Each reference number should be enclosed in square brackets on the same line as the text, before any punctuation, with a space before the bracket.

Examples of in-text citation:

 “. . .end of the line for my report [1].”

 “The theory was first put forward in 1987 [2].”

“Scholtz [3] has argued. . . .” “For example, see [4].”

 “Several recent studies [3, 4, 15, 22] have suggested that. . . .”

Reference

[1] S. Bhanndahar. ECE 4321. Class Lecture, Topic: “Bluetooth can’t help you.” School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, Jan. 9, 2008.

Technical Writing (4%)

Adherence to the provided report template, clarity and coherence in writing, and proper formatting of figures, tables, and diagrams. The quality of plots (600 dpi) and readability of labels (font size 20-25 pt) will also be evaluated, as will effective communication of ideas and critical thinking that shows a solid grasp of the subject matter.
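The plot requirements above (600 dpi, label font size of 20-25 pt) could be met with Matplotlib settings along these lines. This is a sketch under assumptions: the output folder, file name and curve points are made up for illustration.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt

# Set font sizes within the 20-25 pt range for all text, axis labels and titles.
plt.rcParams.update({"font.size": 20, "axes.labelsize": 22, "axes.titlesize": 24})

# Hypothetical output folder; in the coursework this would be your figures folder.
output_dir = os.path.join(tempfile.gettempdir(), "figures")
os.makedirs(output_dir, exist_ok=True)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 0.2, 1], [0, 0.8, 1], label="example curve (made-up points)")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance level")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("Example figure")
ax.legend()

# dpi=600 sets the saved resolution; bbox_inches="tight" trims surrounding whitespace.
figure_path = os.path.join(output_dir, "roc_example.png")
fig.savefig(figure_path, dpi=600, bbox_inches="tight")
plt.close(fig)
```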

Assignment length

The length of the assignment should not be less than 2,500 words for the coursework to contribute towards the development of writing skills and critical thinking. Therefore, it is required of you to complete your assignments within the coursework specification as written in the assignment brief.  

The specified word count refers to the main body of the report, including headings and in-text citations. The word count does not include the front cover, title page, contents page, abstract, tables, reference list, bibliography, appendices, equations or diagrams. Remember to save all images in a dedicated folder at a high image resolution of 600 dpi. All tables and figures must also be clearly labelled in your report. Font style: Times New Roman, Font Size: 11, Line spacing: 1.0

Appendices themselves will not be marked. However, inappropriate use of appendices will be taken into consideration when awarding the final mark. 

Student Number: (Insert your student number – make sure it is correct)

Word Count: (Insert your total word count, excluding cover page, contents pages, reference list and appendices)

AI Declaration:

Delete as appropriate.

I have utilised / have not utilised AI tool(s) in this assessment.

I have used the following AI tool(s): please provide the name of the AI tool(s) you have used and the exact prompt(s) you provided in the box below.

For example:

AI Tool: ChatGPT – Prompt: Find information on what are the impacts of utilizing AI tools for academic purposes and career prospects?

Baidu Translate: I have written task 1, task 2, task 3 and task 4 in Chinese and used Baidu Translate to convert these tasks to English.

If the declaration has not been made and your tutors suspect the use of AI, you will be called in for a viva voce, and failing the viva voce will be considered academic misconduct. The same applies to translation software, the use of which you are also required to declare.

Full disclosure will not result in an academic penalty or a lower score, so make sure you are honest and fill in the declaration when submitting your coursework.
