Data Mastery: Scripting, Databases and Data Privacy

2026-03-05 Admin

Project 1

Deadline: Hand in by midnight on April 19, 2026.

Evaluation: 20% of your final course grade.

Work: This assignment is expected to be completed individually. See below.

Purpose: Gain experience in performing data wrangling, data visualization and introductory data analysis using Python with suitable libraries. Begin developing skills in formulating a problem from data in a given domain, asking questions of the data, and extracting insights from a real-world dataset. This project addresses learning outcomes 1, 2 and 4 from the course outline.

Project outline:

This project requires that you perform data cleaning and exploratory data analysis (EDA), and that you uncover insights from a real-world dataset. You are required to present your work in a Jupyter Notebook. The notebook is expected to have the general structure of a report, together with all the Python scripts embedded in it and descriptions of the steps you took in your analysis and the data cleaning processes.

After you have cleaned the data and prepared it for analysis, your task is to gain an understanding of the problem domain, which will enable you to formulate some assumptions as well as key questions that will drive your research. The research objectives are open-ended. It is your task to find correlations, interesting trends and innovative ideas on how to best use the data in the dataset.

You will need to transform data into different formats where necessary. Be creative and generate new columns as derivatives from others where useful. Make justifiable decisions on how to handle missing values depending on your research goals. Look for erroneous values and restore the integrity of the data where needed. Be critical.
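As a rough illustration of the four wrangling steps named above (transforming formats, flagging erroneous values, handling missing values, deriving new columns), here is a minimal pandas sketch. The toy data and all column names (`median_income`, `population`, `unemployed`) are invented for illustration and are not taken from the real dataset. Linear interpolation is only one possible choice for filling gaps; whether it is justified depends on your research goals.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for this sketch.
raw = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004],
    "median_income": ["41,000", "42500", None, "-1", "45000"],
    "population": [3_860_000, 3_880_000, 3_950_000, 4_030_000, 4_090_000],
    "unemployed": [120_000, 105_000, 102_000, 95_000, 82_000],
})

clean = raw.copy()

# Transform: strip thousands separators, then convert text to numbers.
# Anything that still cannot be parsed becomes NaN instead of raising.
clean["median_income"] = pd.to_numeric(
    clean["median_income"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Erroneous values: a negative income is impossible, so mark it as missing.
clean.loc[clean["median_income"] < 0, "median_income"] = np.nan

# Missing values: linear interpolation between neighbouring years
# (an assumption that suits a slowly changing annual series).
clean["median_income"] = clean["median_income"].interpolate(method="linear")

# Derived column: unemployed people per 1,000 residents.
clean["unemployed_per_1000"] = clean["unemployed"] / clean["population"] * 1000

print(clean)
```

Working on a copy keeps the raw data available, so each cleaning decision can be compared against the original and justified in the report.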

Utilise a variety of exploratory data analysis techniques to make sense of the data, which will then guide you to dig deeper and drive new avenues of investigation. Use visualisations to communicate your insights and messages to the reader. Be effective with how you construct your graphs, and preserve accuracy and integrity.
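Two simple starting points for exploration are a per-column missingness summary and a correlation matrix. The sketch below assumes a small, made-up data frame; the column names `unemployment_rate` and `median_rent` are hypothetical. Note that `DataFrame.corr` drops incomplete pairs, and that a correlation found this way is a prompt for further investigation, not evidence of a causal link.

```python
import pandas as pd

# Hypothetical toy data: column names and values are made up for this sketch.
df = pd.DataFrame({
    "year": [2010, 2011, 2012, 2013, 2014, 2015],
    "unemployment_rate": [6.1, 6.0, 6.4, 5.8, 5.4, 5.4],
    "median_rent": [300, 310, None, 335, 350, 370],
})

# Share of missing values per column, most incomplete first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Pairwise Pearson correlation; rows with a missing value in either
# column are excluded from that pair's calculation.
corr = df[["unemployment_rate", "median_rent"]].corr(method="pearson")
print(corr)
```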

Finally, you may install and use any additional Python packages you wish that will help you with this project.

Dataset Domain:

The dataset covers socio-economic data on New Zealand, stretching back to the early 1980s. The data covers a range of topics: income and wealth distribution, poverty and deprivation levels, health measures, education outcomes, safety and security, housing, and employment. The data is captured by various government agencies as well as some private sector entities.

There are approximately 100 columns in the dataset. The columns range widely in their completeness and coverage. A document is provided which briefly explains what each column means and where it originated.

The dataset has been intentionally tampered with in order to provide you with a sufficient amount of practice in data wrangling and cleaning. Cleaning the dataset accounts for a significant share of the marks in the assignment.

Once the dataset is ready for analysis, consider how to create a data product from your insights that helps inform public discourse on these socio-economic matters.

Dataset Usage Conditions:

The dataset was collated by a group of researchers belonging to the Knowledge Exchange Hub at Massey University. The dataset values are obtained from a mixture of publicly available sources as well as confidential private sources. It also contains a number of derived values. The dataset has intentionally not been updated, as this creates a good learning opportunity for students to track down the data sources where possible and to update both the raw values and the analysis that was originally conducted.

Bonus Marks:

Additional marks are offered to students who are prepared to go beyond the specified requirements. Bonus marks will be granted for the meaningful integration of additional data into the main dataset. One example of additional data you can explore is the NZ General Social Survey. Data from the 2008, 2010, 2012, 2014 and 2016 surveys is included, but you are asked to also incorporate the 2018, 2021 and 2023 data found here: https://datainfoplus.stats.govt.nz/Item/nz.govt.stats/2ed50ad6-8ab8-47df-883d-210a51b50043 .

In order to make this project as interesting as possible and unique to your interests or prior knowledge, you are also invited to combine the main dataset with other relevant data from Statistics NZ. You are asked to explore the datasets available there and choose data that complements your topic. This could range from economic indicators, population demographics, health statistics, to environmental data. You can find various data from the following Stats NZ archives:

a. https://infoshare.stats.govt.nz/

b. https://explore.data.stats.govt.nz/

c. Or navigate to specific files via https://www.stats.govt.nz/

Some of the variables can also be updated with more recent values. You will be awarded additional marks if you make the effort to acquire these data points.
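One way to integrate an additional source is to join it to the main dataset on a shared key and then check which rows actually matched. The sketch below assumes both tables have a `year` column; the frames, column names and values are invented placeholders, not real survey figures. The `indicator=True` option adds a `_merge` column showing where each row came from, and `validate="one_to_one"` raises an error if a year is unexpectedly duplicated in either table.

```python
import pandas as pd

# Hypothetical stand-ins for the main dataset and an additional source.
# All values are placeholders invented for this sketch.
main = pd.DataFrame({
    "year": [2014, 2016, 2018],
    "median_income": [44000, 46500, 49000],
})
extra = pd.DataFrame({
    "year": [2016, 2018, 2021],
    "life_satisfaction": [7.8, 7.7, 7.9],
})

merged = (
    main.merge(
        extra,
        on="year",
        how="outer",            # keep years found in either table
        indicator=True,         # adds a "_merge" provenance column
        validate="one_to_one",  # fail loudly on duplicate years
    )
    .sort_values("year")
    .reset_index(drop=True)
)

print(merged)
```

Inspecting `_merge` before analysis shows how much of the new data overlaps with the main dataset, which helps when deciding whether the integration is meaningful.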

Marking criteria:

Marks will be awarded for different components of the project using the following rubric:

Component	Marks	Requirements and expectations
Data Wrangling	30	Thoroughness of the data cleaning using Python.
EDA/Visualisation	30	Quality of investigation into potentially erroneous values, decision-making process on how to handle missing data and potential interpolation options. Stating assumptions and justifying them. Variety of exploratory research and inquiry into different aspects of the dataset, use of a broad and appropriate range of visualisations and their effective communication.
Data Analysis	30	Depth, sophistication and difficulty of analysis being performed. Diversity of techniques used to answer the research questions and communicate the findings to the reader.
Report Presentation	10	Structure of the report and use of headers and formatting. Clear sections and logical flow. Well-articulated research questions and goals. Suitable introduction and conclusion. Tidy code sections and their explanations where needed. Not cluttering the notebooks with too many dataframe data dumps.
BONUS MARKS
Integration of Additional Datasets	5	Meaningful integration and augmentation of insights with the NZ General Social Survey data or other data from Stats NZ.
Updating of variables	5	Updating of variables with more recent values where possible.

Jupyter Notebook Template

A notebook template has been created for you, and you must use it. Make sure that all parts of the introduction section that are relevant to your project are filled out. The template file is called ‘Jupyter Project Report Template.ipynb’.

Group Work:

This assignment is expected to be completed individually. However, students wishing to complete it in pairs may be given permission, on the condition that their final mark is capped at 80%. Completing the bonus component would raise their maximum score to 90%.

Hand-in:

Submit ONLY ONE Jupyter notebook file via the Stream assignment submission link. In addition, please export an HTML page from your notebook and submit this too, in case there are errors in your notebook and we cannot open it. Also include your AI use statement. Please do not email your submission to the teaching staff.

Use of Generative AI in This Assignment

In industry, AI and online resources are commonly used to improve efficiency and productivity. However, at university, the primary goal is to develop your understanding, analytical skills, and ability to work through problems independently.

Mastering these skills first will allow you to use AI tools more effectively and critically in the future. While AI can be a helpful tool for learning, relying on it to generate answers directly will short-circuit your learning and development.

For this project, you are required to independently select, wrangle, analyze, and interpret a range of datasets. The use of generative AI is restricted to planning, explanation, and concept development, as outlined below.

Allowed Uses of AI for assignment 1

You may use AI along the lines of the following prompts to:

- Understand background knowledge related to data sources, economic indicators, demographics, environmental trends, and other relevant themes.

• Example: "Explain how GDP is typically measured and what factors influence it."

• Example: "What are the key challenges in analyzing time-series economic data?"

- Seek feedback on your problem-solving approach without directly generating code or statistical analysis.

• Example: "I plan to analyze population trends by merging multiple datasets from Stats NZ. Does this approach make sense?"

• Example: "What are common pitfalls in data cleaning when working with government datasets?"

- Clarify error messages or debugging hints, as long as you are the one writing the code.

• Example: "I am trying to use groupby in pandas, but my output is not as expected. What might be wrong?"

• Example: "Why is my bar chart missing some categories when I use Matplotlib?"

- Explore different methods for visualizing data, but not for directly copying generated visualizations.

• Example: "What are the best ways to visualize trends in census data over time?"

• Example: "How can I use geospatial visualizations to display regional economic data?"

Prohibited Uses of AI for assignment 1

You must NOT:

• Copy AI-generated code directly into your submission.

• Input the assignment questions directly into AI and use its responses as your own.

• Ask AI to interpret your figures and raw findings for you.

• Paraphrase AI-generated explanations/code and present them as original work.

• Ask AI to write step-by-step solutions to any of the assignment tasks.
