Deadline: Hand-in by midnight Sunday, 21 April 2025
Evaluation: 25% of your final course grade.
Late Submission: See Course Guide
Work: This assignment is to be done individually.
Purpose: Implement the entire data science/analytics workflow. Learn to correctly apply and reason about using different machine learning techniques to solve real-world problems. Gain skills in extracting data from the web using APIs and web scraping. Build on the data wrangling, data visualization and introductory data analysis skills gained up to this point as well as problem formulation and presentation of findings. Learning outcomes 1 - 5 from the course outline.
This project requires that you apply the machine learning techniques taught so far to build predictive regression models on current, topical and original data from your chosen domain. You are expected to carry out an entire data science/analytics workflow by: (1) acquiring data from multiple sources, (2) performing data wrangling, (3) integrating the data, (4) conducting analysis to answer some key research questions and finally (5) performing predictive modelling.
- Each student should aim to create a unique and distinctive data problem, made original by combining different data sources. The goal of the project is to perform predictive analysis.
- Build multiple regression and kNN models and compare their outputs.
- Experiment with models using different features. Which features are most effective? Why?
- Experiment with kNN using different distance metrics and different values of k, and compare the outputs. Which values of k are most robust for the size of your dataset and your problem domain? Do variables on different scales affect the algorithm’s accuracy? How have you tried to overcome this?
- Experiment with linear, multiple linear and polynomial regression models and compare them. At what point does a regression model become too complex and no longer capture the true relationships in the data?
- How reliable are your prediction models? What do the confidence intervals and prediction bands tell you? Could you recommend this predictive model to a client? Would you expect this model to preserve its accuracy on data beyond the range it was built on?
- Is your evaluation approach robust enough to draw conclusions about the utility of your models?
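To make the questions about feature scale, distance metrics and k concrete, here is a minimal sketch of kNN regression written in plain Python so that each step is visible. The data is made up for illustration: the first feature is small-scale and determines the target, the second is large-scale and uninformative. All function names are hypothetical, and this is one possible way to set up such an experiment, not a prescribed method.

```python
import math
import statistics


def standardise(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k, distance):
    """Average the targets of the k training rows closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    return statistics.fmean([target for _, target in ranked[:k]])


def loo_rmse(X, y, k, distance):
    """Leave-one-out RMSE: predict each row from all the other rows."""
    squared_errors = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        prediction = knn_predict(rest_X, rest_y, X[i], k, distance)
        squared_errors.append((prediction - y[i]) ** 2)
    return math.sqrt(statistics.fmean(squared_errors))


# Made-up data: feature 0 (values 0/1) drives the target,
# feature 1 (values in the thousands) carries no information.
X = [[0, 0], [1, 1000], [0, 2000], [1, 3000],
     [0, 4000], [1, 5000], [0, 6000], [1, 7000]]
y = [10 * row[0] for row in X]

# For brevity the scaling statistics use all rows; in a real evaluation
# they would be computed from the training rows only.
for label, features in (("raw", X), ("standardised", standardise(X))):
    for k in (1, 3):
        for metric in (euclidean, manhattan):
            score = loo_rmse(features, y, k, metric)
            print(f"{label:>12}  k={k}  {metric.__name__:<9}  RMSE={score:.2f}")
```

On this toy data, with k = 1 the leave-one-out RMSE is 10 on the raw features, because the large-scale feature dominates every distance, and 0 after standardising. The same loop structure can be reused to compare other values of k and other metrics on real data.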
Do not use absolute paths; use relative paths where you need them. Consider moving some of the Python code out of your notebook into .py files that you can import. This can make your final notebook more readable by removing code that clutters the page and distracts from your actual findings and discussion.
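As a small illustration of both points, the sketch below shows a helper that builds relative paths. The folder name `data`, the file name `listings.csv` and the function `data_path` are hypothetical; the code assumes the notebook is run from the project root.

```python
from pathlib import Path

# Hypothetical layout: the notebook sits in the project root and the
# data files live in a "data" folder next to it.
DATA_DIR = Path("data")


def data_path(filename):
    """Return a path relative to the project root, never an absolute one."""
    return DATA_DIR / filename


csv_file = data_path("listings.csv")
print(csv_file.as_posix())      # forward slashes on every platform
print(csv_file.is_absolute())   # a relative path is not absolute
```

If this were saved as, say, `helpers.py` (again a hypothetical name) beside the notebook, a notebook cell could use it with `from helpers import data_path`.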
You may install and use any additional Python packages that will help you with this project. When submitting your project, include a README file that lists the additional Python packages you have installed, so that the project is repeatable on my computer should I need to install extra modules.
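One possible way to collect package names and versions for the README is the standard library's `importlib.metadata` (Python 3.8 or later). This is a sketch only; the function name `installed_versions` is hypothetical and the package names are placeholders for whatever you actually installed.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return one 'name==version' line per package, flagging missing ones."""
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name} (not installed)")
    return lines


# Placeholder package names: replace with the ones your project uses.
for line in installed_versions(["requests", "beautifulsoup4"]):
    print(line)
```

The printed lines can be pasted into the README as they are.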
Follow the general structure of the Project Notebook Template provided. Make your notebook professional and tidy (avoid large data dumps) and run your text through an IPython Notebook spell-checker extension. You can also pretend that you are a consultant performing an analysis for a client.
Marks will be awarded for different components of the project using the following rubric:
Data Acquisition (15 marks)
- diversity of sources: data from a web API and/or data scraped from a web site should be included to get maximum marks
- appropriate use of merging and concatenation
- ethical data collection (make sure that the terms and conditions of use permit you to collect the data), with a clear statement in the notebook that you have complied with this

Data Wrangling (10 marks)
- thoroughness in data cleaning
- visualisations
- handling of missing values and outliers

Data Analysis (20 marks)
- quality of your exploratory data analysis
- presentation of the characteristics of the data
- discussion of assumptions being made, if any
- formulation of the problem as a machine learning problem
- diversity of techniques used to achieve this
- presentation of findings

Predictive Modelling (40 marks)
- diversity of experiments
- quality of the evaluations and testing using hold-out data
- comparisons, presentation and interpretation of results

Originality and Rigour (15 marks)
- discussion of how your academic readings have informed your analyses
- originality of the datasets
- quality of research questions
- difficulty of the problem
- degree to which the problem domain is original, challenging, topical and presented in an interesting way

Reading Log (PASS)
- The compiled reading logs up to the current period.
- The peer discussion summaries for each week.
- Any relevant connections between your readings and your analytical work in the notebook. If a research paper influenced how you approached an implementation, mention it.
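The rubric's requirement to test on hold-out data can be sketched as follows. This is a minimal example under stated assumptions: the data is made up, the 80/20 split and the seed are arbitrary choices, and the function names are hypothetical. A simple least-squares line is fitted on the training rows only and then compared on the held-out rows against a baseline that always predicts the training mean.

```python
import math
import random


def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and hold out the last test_fraction of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up data: y = 2 + 3x plus a small deterministic wobble.
data = [(x, 2 + 3 * x + ((x * 7) % 5 - 2) * 0.1) for x in range(30)]

# The model only ever sees the training rows.
train, test = train_test_split(data, test_fraction=0.2, seed=42)
intercept, slope = fit_line(train)

test_y = [y for _, y in test]
model_rmse = rmse(test_y, [intercept + slope * x for x, _ in test])

# Baseline: always predict the mean of the training targets.
train_mean = sum(y for _, y in train) / len(train)
baseline_rmse = rmse(test_y, [train_mean] * len(test))

print(f"hold-out RMSE: model={model_rmse:.3f}, mean baseline={baseline_rmse:.3f}")
```

Reporting the model next to a naive baseline on the same held-out rows makes it easier to judge whether the model adds anything; a single split is still only one estimate, so repeating it with different seeds or using cross-validation gives a more robust picture.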
If you have any questions or concerns about this assignment, please ask the lecturer sooner rather than closer to the submission deadline.
In industry, AI and online resources are commonly used to improve efficiency and productivity. However, at university, the primary goal is to develop your understanding, analytical skills, and ability to work through problems independently. Mastering these skills first will allow you to use AI tools more effectively and critically in the future. While AI can be a helpful tool for learning, relying on it to generate answers directly will short-circuit your learning and development.
For this project, you are required to independently select, wrangle, analyze, and interpret datasets from your chosen domain. You will also maintain a reading log, where critical engagement with academic sources is expected and integrated into your analyses where relevant. The use of generative AI is restricted to planning, explanation, and concept development, as outlined below.
You may use AI for conceptual understanding, guidance, and general problem-solving strategies, but NOT for directly completing any part of your assignment. Specifically, AI can be used to:
- Understand background knowledge relevant to data science, regression analysis, and kNN – as well as other models.
- Example: "How does kNN differ from linear regression in terms of assumptions and use cases?"
- Example: "What are common challenges when performing web scraping at scale?"
- Seek feedback on your problem formulation and methodology without directly generating code or statistical analysis.
- Example: "I plan to predict housing prices using data from a real estate API. Does this make sense?"
- Example: "What are some potential pitfalls in merging datasets from different sources?"
- Clarify technical concepts or debugging hints, provided you write the code yourself.
- Example: "Why might my web scraping code be returning an empty dataset?"
- Example: "How does feature scaling affect kNN classification?"
- Explore different methods for data visualization, but without directly copying AI-generated visualizations.
- Example: "What are effective ways to visualize feature importance in regression models?"
- Example: "How can I compare multiple regression models visually?"
- Enhance critical engagement with research articles by summarizing complex concepts or suggesting alternative interpretations.
- Example: "What are some alternative methods for assessing regression model reliability?"
You must NOT: