首页 » 数据科学与大数据技术 » Data exploration and preparation

Data exploration and preparation

2024-11-11 Admin 写评论

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

Coursework 1: Student examples

The following examples are extracts from previous students' coursework to provide a guidance as to the standards expected. Full coursework examples are not provided to avoid potential issues with copying (plagiarism); and also as the coursework changes each year so there is no single coursework that matches the current.

These are examples of actual student work and not 'templates' to copy.

The coursework specifications vary each year, do not assume that the examples comply with the guidance in this year's specification. Always refer to the current specification.

Start work early on your coursework and use the tutorials to gain formative feedback.

The 'good' examples are drawn from 2:1 or distinction responses, but marks are not provided.

The examples are accompanied by a brief comment to explain what was considered either "good" or "could be improved" about the given example.

Past students have published their code with 'public' visibility on GitHub; you must not copy their code. Copying from other students, past or present, is not appropriate even when correctly cited.

Section 1: Data exploration and preparation

Please note that last year this section was not separated into understanding and preparation in the same way as it is this year. The following examples will give you an understanding of standard, but do not exactly match what you are asked to do this year.

General guidance

Describe and explain the steps you took, your findings, and any decisions you made.

For example, if you identify code quality issues then explain how you addressed these; or if you chose not to address them, then explain the reason for not addressing them.

Do not focus on interpreting the data as if for a particular audience. For example, comments such as "I created _X_ chart to explore the range of

values of _Y_ variable and found that there were _Z_ outliers" are relevant in the context of this coursework; comments such as "The data shows

that more people migrated from in London in 2023 than in 2022." is not relevant in the context of this coursework.

Example 1: Boundary between High pass/Merit

Feedback: "The code shows some understanding of the use of pandas, though you could have done more to describe the data using the pandas functions to show size, data types, ranges of values, etc. There is some attempt at adding structure in functions though it's a little jumbled. Try and separate out the functions from the 'main' where you then call the functions."

It wasn't clear why the student commented out the code to create the charts.

The text supported the written code and evidenced that the student has gained a good understanding through applying the code, however the code and explanation combined was not sufficient to attain a higher mark.

Student's code (parts removed to reduce the length of this page)

import pandas as pd

import matplotlib.pyplot as plt

if __name__ == '__main__':

df = pd.read_csv('dataset.csv')

# first, lets translate all the data from german to english

def ger_to_eng(dataframe):

column_rename_map = {

"StichtagDatJahr": "ReferenceYear",

"DatenstandCd": "Status",

...removed...

dataframe.rename(columns = column_rename_map, inplace = True)

word_mapping = {

'männlich': 'male',

'weiblich': 'female',

'10- bis 19-Jährige': '10-19 years old',

...removed...

# add more translations as needed

}

dataframe.replace(word_mapping, inplace=True)

# second, cleaning up the data

def del_redundant_cols(dataframe):

# i'll be removing columns which i deem redundant

del_columns = ['AgeGroupCode', 'DogAgeGroupLong', #'DogAgeGroupSort',

'GenderCode', 'DogGenderCode', 'BreedCode']

# delete specified columns

for col in del_columns:

del dataframe[col]

# next let's examine the data a bit

def dog_age_check(dataframe):

age_tally = dataframe['DogAgeGroupCode'].value_counts().sort_index()

age_tally = age_tally.drop(999, errors='ignore') # accounts for the unknown entries, which default to 999, removing them

# plotting a bar chart

plt.bar(age_tally.index, age_tally.values, color='skyblue')

plt.xlabel('Age')

plt.ylabel('Amount of Dogs')

plt.title('Age of Dogs Tabulated')

plt.show()

'''def owned_dog_count(dataframe):

# groupby OwnerID and sum up every number of dogs tied to that owner

dogs_per_person = dataframe.groupby('OwnerID')['NumberOfDogs'].sum().reset_index()

# Display the result

plt.figure(figsize=(10, 6))

dogs_per_person.plot(kind='bar', color='skyblue')

plt.title(f'')

plt.xlabel('Number of owned dogs')

plt.ylabel('Frequency')

plt.show()

plt.bar(dogs_per_person.index, dogs_per_person.values, color='skyblue')

plt.xlabel('Number of owned dogs')

plt.ylabel('Frequency')

plt.title('Number of dogs owned by a single person')

plt.show()

plt.figure(figsize=(10, 6))

plt.bar(dogs_per_person.index, dogs_per_person, color='skyblue')

plt.yscale('log') # Set y-axis to logarithmic scale

plt.title(f'Frequency Bar Graph for num_dog')

plt.xlabel("num of dogs")

plt.ylabel('Logarithmic Frequency')

plt.show()

def create_dog_bar_chart(dataframe):

owner_counts = dataframe.groupby('OwnerID')['NumberOfDogs'].nunique().reset_index()

# Plotting the bar chart

plt.bar(owner_counts['NumberOfDogs'], owner_counts['OwnerID'])

plt.xlabel('Number of Dogs Owned')

plt.ylabel('Number of Owners')

plt.title('Number of Owners for Each Number of Dogs Owned')

plt.show()'''

ger_to_eng(df)

del_redundant_cols(df)

dog_age_check(df)

df.to_csv('dataset_prepared.csv')

Example 2: Distinction

Feedback: "Great use of functions, comments and docstrings. Some great work on cleaning the data, especially considering how you would

detect outliers. Great visualisation of the data, what does this mean for your product? "

The code from the functions below has been removed, however the student went beyond what was taught in the course.

Code:

import math

import matplotlib.pyplot as plt

import pandas as pd

import warnings

from datetime import timedelta

warnings.simplefilter(action="ignore", category=FutureWarning)

# ignore the waring from df.approve

def print_general_statistics(df):

"""

print the general information about the dataframe;

print first 5 rows and all the columns of the data frame;

demonstrate number of row and column of the data frame;

print the data types; general statics information of the data frame

Args:

df: The data frame imported.

"""

pd.set_option(

"display.max_columns", None

) # set all the columns visible in the terminal printing

pd.set_option("display.width", None)

print("\nthe first 5 rows of dataframe :\n")

print(df.head(12))

print("\nThe Rows and Columns number:\n")

print("\nRow Number :" + str(df.shape[0]))

print("\nColumn Number :" + str(df.shape[1]))

print("\nColumn data types:\n")

print(df.dtypes)

print("\nStatistics:\n")

print(df.describe()) # Add your code inside the brackets

def null_data_detection(df):

def time_stamp_format_convert(df):

def breaking_point_detection(df):

def interpolation(df):

def timestamp_delete(df):

def outlier_detection(df, window_size=5, threshold=3):

def different_activity_frame_division(df):

def statics_histogram(df, name):

def statistics_boxplot(df, name):

def smoothing(df):

def smoothing_all(df):

if __name__ == '__main__':

# read dataframe from csv

df_raw = pd.read_csv("dataset.csv")

# detect whether there is null data from the dataframe

# print the general statistics

print_general_statistics(df_raw)

# detect null values

df_after_null_preprocess = null_data_detection(df_raw)

# convert timestamp to datetype and allow the after calculation

df_after_time_stamp_convert = time_stamp_format_convert(df_after_null_preprocess)

print_general_statistics(df_after_time_stamp_convert)

# detect whether index =20928 has varied or not

print(df_after_time_stamp_convert.loc[20928, "timestamp"])

# delete same timestamp in the dataframe

df_after_delete = timestamp_delete(df_after_time_stamp_convert)

# interpolate lost data between the datapoints

df_interpolation = interpolation(df_after_delete)

# divide dataframe according to Activity

df_activity_0, df_activity_1 = different_activity_frame_division(df_interpolation)

# detect the outliers

outlier_detection(df_activity_0)

outlier_detection(df_activity_1)

# draw the figures

statics_histogram(df_activity_0, "Activity = 0")

statics_histogram(df_activity_1, "Activity = 1")

statistics_boxplot(df_activity_0, "Activity = 0")

statistics_boxplot(df_activity_1, "Activity = 1")

# smoothing the data frame

df_final = smoothing_all(df_interpolation)

df_final = df_final.drop(columns="timestamp_datetype")

df_final.to_csv("output_file.csv", index=True)

Extract from the report:

This is just the first part so you can see the difference in the quality of the explanations. The student went on to explain the actions taken in each of their functions, though some of the text was unnecessary as it also described the code which wasn't necessary.

Section 2: Database design and preparation

Please note that this section is substantially different in 2024.

Database design was included as part of the application design in coursework 2, and not coursework 1. The design of the database covered requirements for their app, and not the data set so is different to what you are asked to do.

Students were not taught the first, second and third normal forms and so the expectations of their coursework was lower and therefore the grades awarded higher than would be given for the same this year. I have tried to comment on the implications.

Preparation of the database is new to the coursework this year so there are no student examples from COMP0035. Students created the database in COMP0034 and used different libraries. The students who achieved a higher grade showed originality in their code, tackled databases with multiple tables, and provided well structured and documented code; those who achieved a high pass tended to just copy the tutor's example code from the tutorial's and make minor changes to adapt it to their data.

Example 1: High pass

Feedback: "The ERD is a little confusing, why are you storing the account data in two places? The account data is enough. If you plan to create the activity chart then you probably need a new table with attributes such as user_id, date/time, route they accessed. I assume the Visualisation Chart is really the demograophic data so the table probably just needs a more meaningful name. A good attempt at each aspect of the design with a few areas that could be improved."

Example 2: High Merit / Distinction boundary

Feedback given: "The ERD is well drawn and shows an understanding of normalisation. It is consistent with other aspects of the design."

Note that last year students were not taught 1NF, 2NF and 3NF, so this coursework evidenced that the student had carried out some independent research to understand normalisation, though this is a copy and paste of the normalisation criteria rather than an explanation of how these were applied in the context of this design.

Extract from student's PDF:

Example 3: Excellent (>70)

Feedback: "The development of the ERD similarly shows a clear grasp of the concepts of database design and normalisation and the resulting design is well presented and approprate. This is an excellent coursework that not only evidences mastery of the techniques taught but also clearly evidences an excellent grasp of the implications of the concepts and extensive additional reading."

This student provided a detailed explanation of the steps and decisions they made at each stage of normalisation, discussing the implications of the choices they made. Their work evidenced that they understood and carefully applied the concepts. This far exceeded what was taught within the course last year.

Student's ERD diagram only:

Section 3: Tools

Linting was not included in coursework 1 but was included in coursework 2. Source code control was assessed by looking at the commit history and messages in their repository so cannot be included here. Including the environment files without seeing the students environment will not give you a meaningful example. Since there is no meaningful way to provide student examples, the following is the feedback given to students only.

Students achieving this highest marks in this section also provided evidence of the use of tools that went beyond the required tools. I will not list these as this would then not be exceptional; this is an opportunity for students achieving the higher marks to research tools that support code quality and development and apply something that is not covered in the course.

Source code control

High pass: "Some use of source code control over a period though you appear to mostly upload files rather than synchronise files between a local and remote repository."

High merit: "Regular use of source code control with clear and meaningful commit messages."

Distinction: "Regular use of source code control with unique commit referencing. Evidence of effective use of branches and pull requests."

Linting

High pass: The code itself appeared free of issues that would typically be flagged by a linter so some assumption could be made as to effective linting, but there was little or no evidence provided by the student to explain how they used linter tools to achieve this.

Merit: Provided evidence of using a linter, and the code appeared free of issues. Low merit: may have provided evidence of using the linter but not then used this to improve the code.

Distinction: Provided evidence of using a linter at different stages; discussed actions taken and in cases where the issues could not be resolved, gave an appropriate explanation for this. Some used more than one linter and compared the results.

Environment management

High pass: One or more of the required files was missing and/or in general the files were missing some details that prevented them from being fully usable.

Merit: All files provided and were appropriate to allow the environment to be recreated and their code run. May have had minor issues e.g. a package missing from requirements.txt, a detail within pyproject.toml missing or incorrect.

Distinction: Some students showed use of different techniques for creating and managing environments; and/or used the files with very specific detail beyond the basics; and gave very clear guidance in the readme.md that led to the marker being able to successfully create an environment and run the students code.

Last modified: Saturday, 28 September 2024, 6:36 PM

发表评论

电子邮件地址不会被公开。必填项已用*标注

姓名 *

电子邮件 *

验证码 *