machine learning

2025-04-07 Admin


Project 2

Deadline:	Hand-in by midnight Sunday, 21 April 2025
Evaluation:	25% of your final course grade.
Late Submission:	See Course Guide
Work:	This assignment is to be done individually.
Purpose:	Implement the entire data science/analytics workflow. Learn to correctly apply and reason about using different machine learning techniques to solve real-world problems. Gain skills in extracting data from the web using APIs and web scraping. Build on the data wrangling, data visualization and introductory data analysis skills gained up to this point, as well as problem formulation and presentation of findings. This project covers learning outcomes 1 - 5 from the course outline.

Project outline:

This project requires that you apply the machine learning techniques taught so far to build predictive regression models on current, topical and original data from your chosen domain. You are expected to carry out an entire data science/analytics workflow by: (1) acquiring data from multiple sources, (2) performing data wrangling, (3) integrating the data, (4) conducting analysis to answer some key research questions and finally (5) performing predictive modelling.

The data should primarily come from sources such as web APIs and/or scraped web pages. This data can also be combined with static datasets found in various repositories if needed, or with some of the datasets you used in Project 1. The important point is that you are predicting continuous-valued outputs; otherwise you are entirely free to choose a domain or a combination of domains that interests you.
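
As an illustration of combining an API response with a static dataset, a minimal sketch follows. The records, the column names (`city`, `median_rent`, `population`) and the shape of the payload are all made-up placeholders rather than output from any real API; in practice the list of records would come from parsing a JSON response.

```python
import pandas as pd

# Made-up records standing in for a parsed JSON response from a web API.
api_records = [
    {"city": "city_a", "stats": {"median_rent": 650}},
    {"city": "city_b", "stats": {"median_rent": 620}},
    {"city": "city_c", "stats": {"median_rent": 480}},
]

# Made-up static table standing in for a CSV downloaded from a repository.
static_df = pd.DataFrame(
    {"city": ["city_a", "city_b", "city_d"],
     "population": [1000, 2000, 3000]}
)

# Flatten the nested JSON into flat columns, then give them tidy names.
api_df = pd.json_normalize(api_records).rename(
    columns={"stats.median_rent": "median_rent"}
)

# An outer merge keeps unmatched rows from both sides. indicator=True adds a
# "_merge" column recording each row's origin, and validate raises an error
# if the key is unexpectedly duplicated on either side.
merged = api_df.merge(
    static_df, on="city", how="outer", validate="one_to_one", indicator=True
)

# Only rows found in both sources have every column filled in.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(merged)
```

Inspecting the `_merge` column before dropping unmatched rows shows how much data is lost during integration, which is worth reporting in the notebook.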

Project Requirements:

Project details:

- Each student should aim to create a unique and distinctive data-problem to work on that is made original by combinations of different data sources. The goal of the project is to perform prediction analysis.

Questions to consider in your experiments and tasks to perform once you have chosen your domain:

- Build multiple regression and kNN models and compare their outputs.

- Experiment with models using different features. Which features are most effective? Why?

- Experiment with kNN using different distance metrics and different values of k, and compare the outputs. Which values of k are most robust for the size of your dataset and your problem domain? Do differences in scale between your variables affect the algorithm’s accuracy? How have you tried to overcome this?

- Experiment with simple linear, multiple linear and polynomial regression models and compare them. At what point does a regression model become too complex and no longer capture the true relationships in the data?

- How reliable are your prediction models? What do the confidence intervals and prediction bands tell you? Could you recommend this predictive model to a client? Would you expect this model to preserve its accuracy on data beyond the range it was built on?

- Is your evaluation approach robust enough to be able to draw conclusions about the utility of your models?
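
To make the kNN questions above concrete, here is one possible way to set up such a comparison. It is a sketch on synthetic data from scikit-learn's `make_regression`, not a prescribed method: the choice of k values, the two metrics, the 5-fold split and the helper name `mean_cv_rmse` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for project data. One feature is inflated on purpose so
# that it dominates any distance computed on the unscaled features.
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X[:, 0] *= 1000.0

cv = KFold(n_splits=5, shuffle=True, random_state=0)

def mean_cv_rmse(model):
    """Mean root-mean-squared error over the cross-validation folds."""
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()

results = {"linear": mean_cv_rmse(LinearRegression())}
for k in (1, 5, 15):
    for metric in ("euclidean", "manhattan"):
        raw = KNeighborsRegressor(n_neighbors=k, metric=metric)
        # Putting the scaler inside the pipeline means it is refitted on the
        # training part of every fold, so no information leaks from the
        # validation part.
        scaled = make_pipeline(
            StandardScaler(), KNeighborsRegressor(n_neighbors=k, metric=metric)
        )
        results[f"knn k={k} {metric} raw"] = mean_cv_rmse(raw)
        results[f"knn k={k} {metric} scaled"] = mean_cv_rmse(scaled)

for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:32s} RMSE = {score:8.2f}")
```

Comparing the raw and scaled rows for the same k and metric shows how much of kNN's error comes from feature scale rather than from the choice of k.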

Submit a single Jupyter Notebook that contains the most integral parts of your analysis, together with a thorough description of findings. Make sure you interpret all model outputs and figures. The Python code in the notebook must be entirely self-contained and all the experiments and the graphs must be replicable.

Do not use absolute paths, but instead use relative paths if you need to. Consider hiding away some of your Python code in your notebook by putting it into .py files that you can import. This might help the readability of your final notebook by removing unnecessary Python code that can clutter and distract from your actual findings and discussions.
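
A helper file of this kind might look like the sketch below. The file name `project_helpers.py`, the `data` folder and the function `load_rows` are hypothetical names chosen for illustration, not part of any provided template.

```python
# Contents of a hypothetical helper module, e.g. project_helpers.py,
# saved in the same folder as the notebook.
import csv
from pathlib import Path

# A relative path is resolved against the folder the notebook is run from,
# so the project still works after being unzipped on another machine.
DATA_DIR = Path("data")

def load_rows(filename, data_dir=DATA_DIR):
    """Read a CSV file from the data folder into a list of dictionaries."""
    with open(Path(data_dir) / filename, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

In the notebook, `from project_helpers import load_rows` followed by a call such as `load_rows("listings.csv")` (again a hypothetical file name) would then replace the file-handling code in the notebook cells.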

You may install and use any additional Python packages you wish that will help you with this project. When submitting your project, include a README file that lists the additional Python packages you have installed, so that I can install them if needed and reproduce your project on my computer.

Follow the general structure of the Project Notebook Template provided. Make your notebook professional and tidy (avoid large data dumps) and run your text through an IPython Notebook spell-checker extension. You can also pretend that you are a consultant performing an analysis for a client.

NOTE: The topics of web scraping, web APIs and kNN algorithms will be covered in weeks 5 and 6. Therefore, begin your assignment as soon as you can using the concepts covered thus far. Once the material in week 6 is all covered, you will be able to complete all remaining components of this assignment in the week that it is due.

Marking criteria:

Marks will be awarded for different components of the project using the following rubric:

Component	Marks	Requirements and expectations
Data Acquisition	15	diversity of sources (data from a web API and/or data scraped from a website should be included to get maximum marks); appropriate use of merging and concatenation; ethical data collection (make sure that the terms and conditions of use permit you to collect the data, and state clearly in the notebook that you have complied with them)
Data Wrangling	10	thoroughness in data cleaning; visualisations; handling of missing values and outliers
Data Analysis	20	quality of your exploratory data analysis; presentation of the characteristics of the data; discussion of any assumptions being made; formulation of the problem as a machine learning problem; diversity of techniques used to achieve this; presentation of findings
Predictive Modelling	40	diversity of experiments; quality of the evaluations and testing using hold-out data; comparisons, presentation and interpretation of results
Originality and Rigour	15	discussion of how your academic readings have informed your analyses; originality of the datasets; quality of research questions; difficulty of the problem; degree to which the problem domain is original, challenging, topical and presented in an interesting way
Reading Log	PASS	the compiled reading logs up to the current period; the peer discussion summaries for each week; any relevant connections between your readings and your analytical work in the notebook (if a research paper influenced how you approached an implementation, mention it)
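
The hold-out testing expected under Predictive Modelling can be sketched as follows. This is only an illustration on synthetic data with an assumed quadratic trend; the 75/25 split, the candidate degrees and the helper name `rmse` are arbitrary choices, and `numpy.polynomial.Polynomial.fit` stands in for whichever regression library you prefer.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic stand-in data: a quadratic trend plus Gaussian noise.
x = rng.uniform(-3, 3, size=120)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 1.0, size=120)

# Hold out 25% of the rows; they are never used for fitting.
idx = rng.permutation(len(x))
test_idx, train_idx = idx[:30], idx[30:]

def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

errors = {}
for degree in (1, 2, 5, 12):
    # Fit on the training rows only, then score on both subsets.
    model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
    errors[degree] = (
        rmse(y[train_idx], model(x[train_idx])),
        rmse(y[test_idx], model(x[test_idx])),
    )

for degree, (train_rmse, test_rmse) in errors.items():
    print(f"degree {degree:2d}: train RMSE {train_rmse:.2f}, "
          f"hold-out RMSE {test_rmse:.2f}")
```

Training error can only fall as the degree rises, so it says little about model quality on its own; a widening gap between training and hold-out error is the usual sign that a model has become too complex.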

Hand-in: Make sure that the notebook you submit has all the outputs embedded. Also, export your notebook into HTML. Zip up your notebook (.ipynb and .html) and the dataset(s) you have chosen, as well as any other .py files you might have written, into a single file and submit through Stream. Include your reading log in the zipped file as well. Do not email your submission to the lecturer unless there are problems with the submission site.

If you have any questions or concerns about this assignment, please ask the lecturer sooner rather than closer to the submission deadline.

Use of Generative AI in This Assignment

In industry, AI and online resources are commonly used to improve efficiency and productivity. However, at university, the primary goal is to develop your understanding, analytical skills, and ability to work through problems independently. Mastering these skills first will allow you to use AI tools more effectively and critically in the future. While AI can be a helpful tool for learning, relying on it to generate answers directly will short-circuit your learning and development.

For this project, you are required to independently select, wrangle, analyze, and interpret datasets from your chosen domain. You will also maintain a reading log, where critical engagement with academic sources is expected and integrated into your analyses where relevant. The use of generative AI is restricted to planning, explanation, and concept development, as outlined below.

Allowed Uses of AI for assignment 2

You may use AI for conceptual understanding, guidance, and general problem-solving strategies, but NOT for directly completing any part of your assignment. Specifically, AI can be used to:

• Understand background knowledge relevant to data science, regression analysis, and kNN, as well as other models.

o Example: "How does kNN differ from linear regression in terms of assumptions and use cases?"
o Example: "What are common challenges when performing web scraping at scale?"

• Seek feedback on your problem formulation and methodology without directly generating code or statistical analysis.

o Example: "I plan to predict housing prices using data from a real estate API. Does this make sense?"
o Example: "What are some potential pitfalls in merging datasets from different sources?"

• Clarify technical concepts or debugging hints, provided you write the code yourself.

o Example: "Why might my web scraping code be returning an empty dataset?"
o Example: "How does feature scaling affect kNN classification?"

• Explore different methods for data visualization, but without directly copying AI-generated visualizations.

o Example: "What are effective ways to visualize feature importance in regression models?"
o Example: "How can I compare multiple regression models visually?"

• Enhance critical engagement with research articles by summarizing complex concepts or suggesting alternative interpretations.

o Example: "What are some alternative methods for assessing regression model reliability?"

Prohibited Uses of AI for assignment 2

You must NOT:

• Copy AI-generated code directly into your submission.

• Input the assignment questions directly into AI and use its responses as your own.

• Paraphrase AI-generated explanations/code and present them as original work.

• Ask AI to write step-by-step solutions to any of the assignment tasks.
