COMP4318/5318 - Machine Learning and Data Mining
Assignment 2
You are required to work in groups of two or three students. You must register a group on Canvas (People → A2-Group).
1. Summary
The code for this assignment should be written in Python in the Jupyter Notebook environment. Your implementation of the algorithms should use the same suite of libraries that we have used in the tutorials, such as Keras, Scikit-learn, Transformers, NumPy, and Pandas. In exceptional cases, you may use alternative libraries such as PyTorch, but you must justify your choice. Other libraries may be used for minor functionality such as pre-processing and plotting; however, please specify any dependencies at the beginning of your code submission.
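For example, one minimal (optional) way to record these dependencies at the top of the notebook is to print the library versions, along the lines of the sketch below; the exact set of libraries will depend on your chosen dataset and models.

```python
# Sketch only: record the library versions used, so markers can reproduce the environment.
# Adjust the list of libraries to match what your notebook actually uses.
import sys
import numpy as np
import pandas as pd
import sklearn
import tensorflow as tf

print("Python       :", sys.version.split()[0])
print("NumPy        :", np.__version__)
print("Pandas       :", pd.__version__)
print("scikit-learn :", sklearn.__version__)
print("TensorFlow   :", tf.__version__)
```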
The total score for this assignment is allocated as follows:
1. Code: max 40 points
2. Report: max 60 points
Detailed assignment specifications and scoring criteria can be found on the assignment page on Canvas (Assignments → Assignment 2 - Specification). The sections below provide comprehensive information on the assignment tasks and guidelines for submission.
2. Instructions
2.1 Dataset catalogue
1. EMNIST handwritten character dataset:
Original dataset and description: https://www.nist.gov/itl/products-and-services/emnist-dataset
For this assignment, we provide a smaller subset of the EMNIST-ByClass dataset: https://drive.google.com/file/d/1qEoWitIzaRUWRFJsy7BKWZdvKecuck4X/view?usp=sharing
2. Sentiment140:
Original dataset and description: https://www.kaggle.com/datasets/kazanova/sentiment140
For this assignment, we provide a smaller subset: https://drive.google.com/file/d/1w9vysV8MIl_6LF26XFZlfCbyG2MbDj--/view?usp=sharing
3. Wiki Face Dataset:
Original dataset and description: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
For this assignment, we provide a smaller subset: https://drive.google.com/file/d/1Hx5-D8ZgdOFCSKeUtpFYg379xYmR1hwN/view?usp=sharing
4. Electricity Consumption:
Original dataset and description (UCI ML repository): https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
For this assignment, we provide a small and clean version based on the raw data: https://drive.google.com/file/d/1GGN2EivWIcRJ4UTm8bPz819bddkGuxxO/view?usp=sharing
5. Sales Transactions:
Original dataset and description (UCI ML repository): https://archive.ics.uci.edu/ml/datasets/Sales_Transactions_Dataset_Weekly
2.2 Assignment Task
➢ Choose methods that are suitable for the dataset and its associated problem.
➢ Choose appropriate pre-processing techniques for the dataset (e.g., data normalisation, dimensionality reduction, etc.)
➢ Choose appropriate sets of hyperparameters to search (if applicable to your chosen method)
➢ Evaluate the models with metrics appropriate to the problem type (a brief sketch of computing these metrics with scikit-learn follows below):
- Classification: using accuracy, precision, recall, and the confusion matrix.
- Regression: using Mean Squared Error (MSE).
- Clustering: using appropriate clustering evaluation methods such as the silhouette coefficient.
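As an illustration only, the metrics above could be computed with scikit-learn roughly as follows; the arrays here are dummy placeholders purely so the cell runs on its own, and you would use only the part relevant to your problem type.

```python
# Sketch only: standard evaluation metrics with scikit-learn.
# The arrays below are dummy placeholders; replace them with your model outputs and data.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, mean_squared_error, silhouette_score)

y_test = np.array([0, 1, 1, 0, 1])   # placeholder ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1])   # placeholder predictions

# Classification
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# Regression (in practice the targets and predictions would be continuous values)
print("MSE:", mean_squared_error(y_test, y_pred))

# Clustering: silhouette_score takes the feature matrix and the cluster assignments
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.2]])
cluster_labels = np.array([0, 0, 1, 1, 0])
print("Silhouette coefficient:", silhouette_score(X, cluster_labels))
```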
3. Requirements
3.1 Code
➢ Setup: Install and import the necessary packages and libraries used in your coding environment. It is recommended to specify their versions to ensure reproducibility. Define any necessary utility or helper functions (e.g., for plotting, optimization, etc.) if applicable.
➢ Data loading, pre-processing, and exploration: Begin by loading the dataset, then conduct some exploration of the data to better understand its properties and the pre-processing that may be appropriate. You may wish to investigate the dataset to uncover patterns, outliers, and relationships before building models. Include anything you feel is relevant in this section. Based on your insights from the data exploration and/or with reference to other sources, apply appropriate pre-processing techniques. Different pre-processing techniques might be applied to the different algorithms (a loading and pre-processing sketch is given after this list).
➢ Model Implementation: You will be required to design and implement at least three different models covered in the course. These models must represent different architectures to effectively explore their strengths and weaknesses on the selected dataset. For instance, a combination of DNN, LSTM, and Transformer would be appropriate, whereas variations like a 1-layer DNN, 3-layer DNN, and 5-layer DNN are not acceptable, as they do not differ sufficiently in architecture. Implement an instance of each model before tuning hyperparameters, and set up any functions you may require for tuning hyperparameters in the next section (see the model sketch after this list).
➢ Hyperparameter tuning: Conduct a search over relevant hyperparameters for each model using a chosen strategy (e.g., grid search, random search); a minimal sketch is given after this list. Please preserve the output of these cells in your submission, and keep the hyperparameter search cells independent from the other cells of your notebook, as markers will not be able to run this code (it will take too long): ensure the later cells can be run even if these search cells are skipped.
➢ Evaluation and Comparison: After selecting the best hyperparameters, train each model using these settings in separate cells, independent of the tuning process. Use these models to evaluate and compare their performance on the test set with appropriate evaluation metrics. Produce visualizations to effectively present and analyse the experimental results.
➢ Final models: Conclude which model performs best based on the comparative analysis.
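The following sketches are illustrative only and do not prescribe a notebook structure. First, for the data loading and pre-processing step, assuming a CSV version of the chosen dataset in a local data/ folder with a hypothetical "label" column, loading, a quick exploration, and a simple normalisation step might look like this:

```python
# Sketch only: load a hypothetical CSV version of the dataset from a local data/ folder,
# explore it briefly, then apply a simple normalisation step.
# The file name and the "label" column are placeholders; adapt them to your chosen dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/dataset.csv")
print(df.shape)           # number of samples and columns
print(df.describe())      # basic statistics, useful for spotting scale differences and outliers
print(df.isna().sum())    # missing values per column

X = df.drop(columns=["label"]).values
y = df["label"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)   # fit the scaler on the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```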
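Next, for the model implementation step, each architecture could be wrapped in a build function. The sketch below shows one of the (at least) three required models as a small Keras CNN, under the assumption of 28×28 single-channel inputs and 62 classes (hypothetical values; adapt them to your dataset):

```python
# Sketch only: one of the (at least) three required architectures, here a small CNN in Keras.
# The input shape and number of classes are hypothetical; adapt them to your chosen dataset.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(learning_rate=1e-3, n_filters=32, n_classes=62):
    model = keras.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(n_filters, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(n_filters * 2, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Wrapping each architecture in such a build function makes it straightforward to reuse in the hyperparameter search.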
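Finally, a self-contained hyperparameter search cell might follow the pattern below: a small grid over hypothetical values, printing the best configuration so that later cells can hard-code it and run even when this cell is skipped.

```python
# Sketch only: a small grid search kept in its own cell; later cells should hard-code
# the best values printed here rather than depend on variables defined in this cell.
# build_cnn is the build function from the previous sketch.
import itertools
import numpy as np

# Dummy placeholder data purely so this cell runs on its own; replace with your real training set.
X_train = np.random.rand(256, 28, 28, 1).astype("float32")
y_train = np.random.randint(0, 62, size=256)

param_grid = {"learning_rate": [1e-2, 1e-3], "n_filters": [32, 64]}  # hypothetical search values
best_acc, best_params = 0.0, None

for lr, nf in itertools.product(param_grid["learning_rate"], param_grid["n_filters"]):
    model = build_cnn(learning_rate=lr, n_filters=nf)
    history = model.fit(X_train, y_train, epochs=2, batch_size=64,
                        validation_split=0.2, verbose=0)
    val_acc = history.history["val_accuracy"][-1]     # validation accuracy of the last epoch
    if val_acc > best_acc:
        best_acc, best_params = val_acc, {"learning_rate": lr, "n_filters": nf}

print("Best validation accuracy:", best_acc, "with", best_params)
```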
3.2 Report
➢ Abstract: include a self-contained, short summary of your work.
➢ Introduction: introduce the dataset and the problem you have chosen, discuss its relevance to real-world applications, outline the previous techniques and approaches that have been used to solve the problem, and provide an overview of the methods you used and the results you obtained.
➢ Data: describe the dataset and pre-processing, including:
- Data description and exploration: Describe the data, including all its important characteristics, such as the number of samples, classes, dimensions, and the original source of the images. Discuss your data exploration, including characteristics/difficulties as described in the relevant section above and anything you consider relevant. Where relevant, you may wish to include some sample images to aid this discussion.
- Pre-processing: Justify your choice of pre-processing, either through your insights from the data exploration or with reference to other sources. Explain how the pre-processing techniques work, their effect/purpose, and any choices made in their application. If you have not performed pre-processing, or have intentionally omitted possible pre-processing techniques after consideration, justify these decisions.
➢ Methodology: describe the models that you employed in this assignment, including:
- Theory: For each model, explain the main theoretical ideas (this will be useful as a framework for comparing them in the rest of the report). Explain why you chose those models.
- Strengths and weaknesses: Describe the relative strengths and weaknesses of the models from a theory perspective. Consider factors such as performance, overfitting, runtime, number of parameters, and interpretability. Explain the reasons, e.g. don’t simply state that CNNs perform better on images but explain why this is the case.
- Architecture and hyperparameters: State and explain the chosen architectures or other relevant design choices you made in your implementation (e.g. this is the place to discuss your neural network configurations). Describe the hyperparameters you will tune over, the values included in the search, and outline your search method. Briefly explain what each hyperparameter controls and its expected effect on the algorithm. For example, consider the effects of changing the learning rate, or changing the stride of a convolutional layer. Justify why you have made these choices.
➢ Experimental Results: present the experimental setting (e.g., the details of the dataset, models, hardware, and software specifications of the computer used for performance evaluations). Provide the experimental results obtained from the algorithms you implemented in an intuitive way. Discuss and compare the performance of your models, considering factors such as accuracy, runtime, number of hyperparameters, and interpretability. Employ high-quality plots, figures, and tables to visually support and enhance the discussion of these results. Please do not include screenshots of raw code outputs when presenting your results.
➢ Conclusion: summarize your main findings, mention any limitations of your methods and results, and suggest potential directions for future work. Write one or two paragraphs describing the most important thing that you learned while completing the assignment.
➢ References: include the references cited in your report in a consistent format. You may choose any appropriate academic referencing style, such as IEEE.
4. Submission
4.1 Group Registration:
4.2 Proceed to Canvas and upload all files separately, as follows:
Important:
- Only ONE group member needs to submit the assignment on behalf of the group.
- Do NOT submit the dataset or zip files to Canvas. We will copy the data folder to the same directory as your .ipynb file to run your code. Please make sure your code is able to read the dataset from this folder.
- BOTH CODE AND REPORT will be checked for plagiarism.
4.3 File Naming Conventions
- a2_groupID_SID1_SID2_SID3.ipynb (code)
- a2_groupID_SID1_SID2_SID3.pdf (pdf version of the code)
- a2_groupID_SID1_SID2_SID3_report.pdf (report)
In both the Jupyter Notebook and report, include only your SIDs and not your name. The marking is anonymous.
4.4 Late Submission Penalties
The maximum delay for assignment submission is 3 (three) days, after which the assignment will not be accepted. You should upload your assignment at least half a day to one day before the submission deadline to avoid network congestion.
Canvas may not be able to handle a large number of submissions happening at the same time. If you submit your assignment close to the deadline, a submission error may occur, causing your submission to be considered late. The late penalty will be applied regardless of such issues.
4.5 Marking Rubric
5. Academic honesty
https://sydney.edu.au/students/academic-integrity.html
Plagiarism (copying from another student, website or other sources), making your work available to another student to copy, engaging another person to complete the assignments instead of you (for payment or not) are all examples of academic dishonesty. Note that when there is copying between students, both students are penalised – the student who copies and the student who makes his/her work available for copying. The University penalties are severe and include:
* a permanent record of academic dishonesty on your student file,
* mark deduction, ranging from 0 for the assignment to Fail for the course,
* expulsion from the University and cancellation of your student visa.
In addition, the Australian Government passed new legislation last year (the Prohibiting Academic Cheating Services Bill) that makes it a criminal offence to provide or advertise academic cheating services, i.e. the provision or undertaking of work for students which forms a substantial part of a student's assessment task. Do not confuse legitimate co-operation with cheating!