Project outline:
This project requires you to clean a real-world dataset, perform exploratory data analysis (EDA) on it, and uncover insights from it. You are required to present your work in a Jupyter Notebook. The notebook is expected to have the general structure of a report, with all Python scripts embedded in it, together with descriptions of the steps you took in your analysis and data cleaning.
After you have cleaned the data and prepared it for analysis, your task is to gain an understanding of the problem domain, which will enable you to formulate some assumptions as well as key questions that will drive your research. The research objectives are open-ended. It is your task to find correlations, interesting trends and innovative ideas on how to best use the data in the dataset.
You will need to transform data into different formats where necessary. Be creative and generate new columns as derivatives from others where useful. Make justifiable decisions on how to handle missing values depending on your research goals. Look for erroneous values and restore the integrity of the data where needed. Be critical.
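A minimal sketch of the cleaning steps described above, using pandas. The sample frame and every column name in it (`median_income`, `population`, `unemployed`) are hypothetical stand-ins, not columns from the actual dataset, and linear interpolation is only one defensible way to treat gaps in an annual series.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real dataset's column names and values will differ.
raw = pd.DataFrame({
    "year": ["1990", "1991", "1992", "1993", "1994"],
    "median_income": ["31,000", "32000", "n/a", "-1", "34500"],
    "population": [3_400_000, 3_450_000, 3_500_000, 3_550_000, 3_600_000],
    "unemployed": [120_000, 150_000, 160_000, 155_000, 140_000],
})

df = raw.copy()

# Transform formats: years stored as text become integers.
df["year"] = df["year"].astype(int)

# Strip thousands separators, then coerce unparseable entries ("n/a") to NaN.
df["median_income"] = pd.to_numeric(
    df["median_income"].str.replace(",", "", regex=False), errors="coerce"
)

# Erroneous values: a negative income is impossible, so treat it as missing.
df.loc[df["median_income"] < 0, "median_income"] = np.nan

# Record which rows were missing before filling, so the decision stays visible.
df["income_was_missing"] = df["median_income"].isna()

# One option for an annual series: interpolate linearly between neighbours.
df["median_income"] = df["median_income"].interpolate(method="linear")

# Derived column: unemployed people per 1,000 residents.
df["unemployment_per_1000"] = df["unemployed"] / df["population"] * 1000

print(df)
```

Keeping a flag column such as `income_was_missing` makes it easy to check later whether a finding depends on filled-in values.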
Utilise a variety of exploratory data analysis techniques to make sense of the data, which will then guide you to dig deeper and open new avenues of investigation. Use visualisations to communicate your insights and messages to the reader. Be effective in how you construct your graphs, and preserve accuracy and integrity.
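As a starting point, the numeric summaries that usually precede any chart can be sketched as below. The indicator names and values are hypothetical placeholders, and the three techniques shown (descriptive statistics, a Pearson correlation matrix, and year-on-year percentage change) are only a small sample of what could be applied.

```python
import pandas as pd

# Hypothetical annual indicators, indexed by year.
df = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004, 2005],
    "median_income": [40000, 41000, 42500, 43000, 45000, 46500],
    "unemployment_rate": [6.0, 5.4, 5.2, 4.8, 4.0, 3.8],
    "house_price_index": [100, 104, 112, 130, 150, 170],
}).set_index("year")

# Descriptive statistics, one row per indicator.
summary = df.describe().T[["mean", "min", "max"]]

# Pairwise Pearson correlations between indicators.
corr = df.corr(method="pearson")

# Year-on-year percentage change, to compare trends on a common scale.
yoy_pct = df.pct_change().mul(100).round(1)

print(summary)
print(corr.round(2))
print(yoy_pct)
```

With only a handful of annual observations, a strong correlation is a prompt for further investigation rather than evidence of a relationship.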
Finally, you may install and use any additional Python packages you wish that will help you with this project.
Dataset Domain:
The dataset covers socio-economic data on New Zealand, stretching back to the early 1980s. The data covers a range of topics: income and wealth distribution, poverty and deprivation levels, health measures, education outcomes, safety and security, housing, and employment. The data is captured by various government agencies as well as some private sector entities.
There are approximately 100 columns in the dataset. The columns vary widely in their completeness and coverage. A document is provided that briefly explains what each column means and where it originated.
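Because completeness varies so much between columns, a per-column coverage table is a useful first check. The sketch below assumes the data is indexed by year; the column names `col_a`, `col_b` and `col_c` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by year, with uneven coverage across columns.
df = pd.DataFrame({
    "year": [1985, 1990, 1995, 2000, 2005],
    "col_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col_b": [np.nan, np.nan, 3.0, 4.0, np.nan],
    "col_c": [np.nan, np.nan, np.nan, np.nan, 5.0],
}).set_index("year")

# One row per column: how much data it has and which years it spans.
coverage = pd.DataFrame({
    "non_null": df.notna().sum(),
    "pct_complete": df.notna().mean().mul(100).round(1),
    "first_year": df.apply(lambda s: s.first_valid_index()),
    "last_year": df.apply(lambda s: s.last_valid_index()),
}).sort_values("pct_complete")

print(coverage)
```

Sorting by `pct_complete` puts the sparsest columns first, which helps when deciding which variables can support a given research question.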
The dataset has been intentionally tampered with to give you sufficient practice in data wrangling and cleaning. Cleaning the dataset accounts for a significant share of the marks in the assignment.
Dataset Usage Conditions:
Bonus Marks:
Additional marks are offered to students who are prepared to go beyond the specified requirements. Bonus marks will be granted for the meaningful integration of additional data into the main dataset. One example of additional data you can explore is the NZ General Social Survey. Data from the 2008, 2010, 2012, 2014 and 2016 surveys is included, but you are asked to also incorporate the 2018, 2021 and 2023 data found here: https://datainfoplus.stats.govt.nz/Item/nz.govt.stats/2ed50ad6-8ab8-47df-883d-210a51b50043 .
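One way to integrate an additional table is to join it to the main dataset on a shared key such as year. The sketch below assumes both tables have one row per year; the frames, the column names `main_measure` and `survey_measure`, and all values are made-up placeholders, not real survey results.

```python
import pandas as pd

# Hypothetical main dataset and additional survey table, both keyed by year.
main = pd.DataFrame({
    "year": [2014, 2016, 2018, 2021],
    "main_measure": [10.0, 11.0, 12.0, 13.0],
})
survey = pd.DataFrame({
    "year": [2016, 2018, 2021, 2023],
    "survey_measure": [7.1, 7.2, 7.3, 7.4],
})

# Outer join keeps years present in either table.
# validate= raises an error if a year is duplicated on either side.
# indicator= adds a _merge column showing where each row came from.
merged = main.merge(
    survey, on="year", how="outer", validate="one_to_one", indicator=True
).sort_values("year").reset_index(drop=True)

print(merged)
print(merged["_merge"].value_counts())
```

Inspecting the `_merge` column before dropping it shows how many years actually overlap, which indicates how much the added data can contribute.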
In order to make this project as interesting as possible and unique to your interests or prior knowledge, you are also invited to combine the main dataset with other relevant data from Statistics NZ. You are asked to explore the datasets available there and choose data that complements your topic. This could range from economic indicators, population demographics, health statistics, to environmental data. You can find various data from the following Stats NZ archives:
c. Or navigate to specific files via https://www.stats.govt.nz/
Some of the variables can also be updated with more recent values. You will be awarded additional marks if you take the effort to acquire these datapoints.
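Updating a variable with more recent values could look like the sketch below. It assumes both series are indexed by year and that the newer release should win wherever the two overlap; the series name `indicator` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical original series and a newer release of the same variable.
original = pd.Series([4.1, 4.3, np.nan], index=[2019, 2020, 2021], name="indicator")
recent = pd.Series([4.4, 4.6, 4.9, 5.0], index=[2020, 2021, 2022, 2023], name="indicator")

# Prefer the newer release where it has a value; otherwise keep the original.
updated = recent.combine_first(original).sort_index()

# List years where a non-missing original value was revised, for the write-up.
overlap = original.dropna().index.intersection(recent.index)
revised_years = [y for y in overlap if original.loc[y] != recent.loc[y]]

print(updated)
print("Revised years:", revised_years)
```

Recording `revised_years` documents where the update changed existing figures rather than only extending the series.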
Marking criteria:
| Component | Marks | Requirements and expectations |
| Data Wrangling | 30 | Thoroughness of the data cleaning using Python. |
| EDA/Visualisation | 30 | Quality of investigation into potential erroneous values, decision making process on how to handle missing data and potential interpolation options. Stating assumptions and justifying them. Variety of exploratory research and inquiry into different aspects of the dataset, use of broad and appropriate range of visualisations and their effective communication. |
| Data Analysis | 30 | Depth, sophistication and difficulty of analysis being performed. Diversity of techniques used to answer the research questions and communicate the findings to the reader. |
| Report Presentation | 10 | Structure of the report and use of headers and formatting. Clear sections and logical flow. Well-articulated research questions and goals. Suitable introduction and conclusion. Tidy code sections and their explanations where needed. Not cluttering the notebooks with too many dataframe data dumps. |
| BONUS MARKS | | |
| Integration of Additional Datasets | 5 | Meaningful integration and augmentation of insights with the NZ General Social Survey data or other data from Stats NZ. |
| Updating of variables | 5 | Updating of variables with more recent values where possible. |
Jupyter Notebook Template
A notebook template has been created for you, and you must use it. Make sure that the introduction section has all the necessary parts filled out that are relevant to your project. The template file is called ‘Jupyter Project Report Template.ipynb’.
Group Work:
Hand-in:
Use of Generative AI in This Assignment
In industry, AI and online resources are commonly used to improve efficiency and productivity. However, at university, the primary goal is to develop your understanding, analytical skills, and ability to work through problems independently.
Mastering these skills first will allow you to use AI tools more effectively and critically in the future. While AI can be a helpful tool for learning, relying on it to generate answers directly will short-circuit your learning and development.
For this project, you are required to independently select, wrangle, analyse, and interpret a range of datasets. The use of generative AI is restricted to planning, explanation, and concept development, as outlined below.