DATA2060 Final Project Rubric
Overview
- You’ll explain the math and the numerical algorithms behind your ML model in one or more markdown cells using references and citations.
- You’ll implement the ML algorithm from scratch using object oriented programming (a class and corresponding methods). Make sure there are __init__, train, predict, loss, etc. methods as required by your algorithm. You can only use python and numpy in this section.
- You will develop (unit) tests to make sure:
- each method of your algorithm works correctly in isolation,
- edge cases are handled appropriately,
- your implementation can correctly reproduce results obtained from sklearn, textbooks, or peer-reviewed journal articles. Previous work needs to be referenced.
- You can use pandas, matplotlib, sklearn, pytest, and other packages as needed in this section.
Your team’s final report will be the pdf version of the jupyter notebook submitted to Gradescope, and each team will also give a final presentation to describe their algorithm and tests/results to the rest of the class. The slides of the final presentation will also be submitted to Gradescope.
Timeline
Each team will have a mentor TA assigned. Your mentor TA will check in with your team three times during the term and answer any questions you might have. You can also come to the instructor’s office hours for help. Here are the expectations for each of those meetings:
- Week of October 28th: You should complete the markdown section of the report and make sure everyone understands the math and numerical methods behind the algorithm.
- Week of November 18th: You should have some unit tests and at least the train and loss methods completed by this point.
- Week of December 2nd, the week before the final presentations: All methods and tests are completed and the report is ready for submission.
The final presentations will take place during the week of December 9th. A signup sheet will be posted on the course forum a couple of weeks ahead of time.
The final report needs to be submitted to Gradescope by December 15th. The final deadline will be posted on the course forum a couple of weeks ahead of time. If you receive any feedback and comments during your final presentation, your team is expected to address those in the final report.
Example ML algorithms
1. Gaussian Naive Bayes
We cover the Naive Bayes algorithm for categorical (binary) features (Chapter 24.0 and 24.1 in the textbook). Gaussian Naive Bayes is an extension of the method to continuous features. You can read more about this algorithm here.
2. CART - Classification And Regression Tree for regression
We cover the ID3 tree algorithm in class (Chapter 18). CART is the algorithm implemented in sklearn, which is slightly different from ID3. If you choose this project, implement it for a regression problem. Check out the references here.
3. CART - Classification And Regression Tree for classification
Same as above but used for classification. Check out the references here.
4. Boosting
AdaBoost is an algorithm we cover in class (Lecture 9) but it is not implemented in any of the HW assignments. If you liked the algorithm, this is your chance to implement it! The details of the algorithm are described in Chapter 10 of the textbook.
Implement the one-vs-all and the all-pairs algorithms we covered in Lecture 5 using any of the binary classification models you implemented in the homeworks. Compare and contrast the results. If you choose this method, describe in the signup sheet which binary classification algorithm you plan to use.
Format requirements for the Final Report (75 points)
1. Overview of [the name of your ML algorithm] (20 points)
- Give an overview of the algorithm and describe its advantages and disadvantages.
- Representation: describe how the feature values are converted into a single number prediction.
- Loss: describe the metric used to measure the difference between the model’s prediction and the target variable.
- Optimizer: describe the numerical algorithm used to find the model parameters that minimize the loss given a training set.
Use markdown in the jupyter notebook, add equations to explain math, and use pseudo-code to explain how numerical algorithms work. Use citations and references. Use at least 500 words (excluding equations and pseudo-code).
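As an illustration of the level of detail expected, the three components could be written as follows for least-squares linear regression. This is only a sketch: linear regression is not one of the listed project algorithms, and the symbols $w$, $b$, $\eta$, and $n$ are notation assumed for this example.

```latex
\begin{align}
\text{Representation:}\quad & \hat{y}_i = w^\top x_i + b \\
\text{Loss:}\quad & L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \\
\text{Optimizer (gradient descent):}\quad & w \leftarrow w - \eta \, \nabla_w L(w, b),
  \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
\end{align}
```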
This section is one code cell which contains the class of your ML algorithm and any other helper functions. Add docstrings to each method and function and explain what they do and what the inputs and outputs are. Please add comments to the code as well as needed! You can only use python and numpy in this section.
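A minimal sketch of what such a code cell could look like is shown below. It uses least-squares linear regression with gradient descent purely as a stand-in; the class name `LinearRegressionGD` and the hyperparameters `learning_rate` and `n_iters` are hypothetical, and your own algorithm may need different methods.

```python
import numpy as np


class LinearRegressionGD:
    """Least-squares linear regression trained with batch gradient descent.

    Hypothetical example class; names and defaults are illustrative only.
    """

    def __init__(self, learning_rate=0.1, n_iters=1000):
        """Store the hyperparameters; the weights are created in train()."""
        self.learning_rate = learning_rate
        self.n_iters = n_iters
        self.w = None  # weight vector, shape (n_features,)
        self.b = 0.0   # intercept

    def predict(self, X):
        """Return predictions of shape (n_samples,) for X of shape (n_samples, n_features)."""
        if self.w is None:
            raise RuntimeError("call train() before predict()")
        return np.asarray(X, dtype=float) @ self.w + self.b

    def loss(self, X, y):
        """Return the mean squared error between the predictions for X and the targets y."""
        residual = self.predict(X) - np.asarray(y, dtype=float)
        return float(np.mean(residual ** 2))

    def train(self, X, y):
        """Fit w and b by gradient descent on the mean squared error; returns self."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        if X.ndim != 2 or X.shape[0] == 0 or X.shape[0] != y.shape[0]:
            raise ValueError("X must be (n_samples, n_features) and match y")
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_iters):
            residual = X @ self.w + self.b - y
            # Gradients of the mean squared error with respect to w and b
            grad_w = 2.0 * X.T @ residual / n_samples
            grad_b = 2.0 * np.mean(residual)
            self.w -= self.learning_rate * grad_w
            self.b -= self.learning_rate * grad_b
        return self


# Usage on a tiny noise-free dataset generated from y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo[:, 0] + 1.0
demo_model = LinearRegressionGD(learning_rate=0.1, n_iters=2000).train(X_demo, y_demo)
demo_loss = demo_model.loss(X_demo, y_demo)
```

Keeping `train`, `predict`, and `loss` as separate methods makes each one testable in isolation in the Check model section.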
3. Check model (20 points)
This section is a collection of code and markdown cells that contain the unit tests and a demonstration that your implementation can reproduce previous results.
There should be several unit tests. Create at least two or three unit tests per method and make sure that edge cases are properly handled. Explain either as comments or in a markdown cell what the goal of each test is and/or what edge case it tests for.
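The sketch below shows one possible shape for such tests. `MajorityClassifier` is a hypothetical stand-in model defined only so the example runs on its own; replace it with your own class. The tests use plain `assert` statements so they depend only on numpy, although functions named `test_*` like these can also be collected by pytest.

```python
import numpy as np


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most frequent training label."""

    def __init__(self):
        self.label = None

    def train(self, X, y):
        y = np.asarray(y)
        if y.size == 0:
            raise ValueError("cannot train on an empty dataset")
        values, counts = np.unique(y, return_counts=True)
        self.label = values[np.argmax(counts)]
        return self

    def predict(self, X):
        if self.label is None:
            raise RuntimeError("call train() before predict()")
        return np.full(len(X), self.label)


def test_train_stores_majority_label():
    # Goal: train() works in isolation and picks the most frequent label.
    model = MajorityClassifier().train(np.zeros((5, 2)), [0, 1, 1, 1, 0])
    assert model.label == 1


def test_predict_returns_one_label_per_row():
    # Goal: predict() returns an array with one entry per input row.
    model = MajorityClassifier().train(np.zeros((3, 2)), [2, 2, 0])
    assert model.predict(np.zeros((4, 2))).shape == (4,)


def test_empty_training_set_raises():
    # Edge case: an empty training set should fail loudly.
    try:
        MajorityClassifier().train(np.zeros((0, 2)), [])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty training data")


def test_predict_before_train_raises():
    # Edge case: calling predict() on an untrained model is an error.
    try:
        MajorityClassifier().predict(np.zeros((1, 2)))
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError before training")


# Run every test directly, e.g. inside a notebook cell
for test in (
    test_train_stores_majority_label,
    test_predict_returns_one_label_per_row,
    test_empty_training_set_raises,
    test_predict_before_train_raises,
):
    test()
```

Each test checks a single behavior and states its goal in a comment, which matches the requirement to explain what each test is for.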
Find at least one previous work where your ML algorithm is applied on a public dataset. You can use, for example, sklearn, a textbook, or a peer-reviewed publication. Include a markdown cell and describe the previous work. Demonstrate in a code cell that your implementation can successfully reproduce the previous work. You can use pandas, matplotlib, sklearn, pytest, and other packages in this section. Use citations and references.
Create a public GitHub repo and add the link to the report under the title. You can add the link to the repo to your resume as a class project, so make sure it is professional. Here are the minimum requirements:
The first thing people inspect in your repo is the readme file. The readme file should give an overview of the project, state what python version and package versions were used to develop the code so others can run it locally and reproduce your results (add a yaml file for ease of use), and include a list of authors with contact info. There should be a license file to let people know what they can and cannot do with your code. GitHub offers a couple of license options; check those out and decide what’s best for you. The repository should have the following directory structure:
All data files are in /data, and your jupyter notebook should be in /src. Feel free to add other folders like /figures and /results as required. Additionally, you might want to add the pdf versions of your report and the slides as well. Check out one of my repos as an example.
Collect all citations you used in the Overview and the Check Model sections. Use the Harvard citation format.
Final presentation (15 points)
- a title slide (1 point),
- introduce the math behind your ML algorithm, show equations (4 points),
- describe the numerical techniques, use pseudo-code, do not show actual code (4 points),
- walk us through the previous work you reproduced using your code (4 points),
- add a summary slide and let the audience know what was particularly interesting about your ML algorithm and what you found challenging as you implemented it (2 points).
There is no need to describe the unit tests during the final presentation.