Assessment Proforma 2024-25 - Autumn
Key Information
| Module Code | CMT309 |
| Module Title | Computational Data Science |
| Assessment Title | Data Science Portfolio |
| Assessment Number | 2 |
| Assessment Weighting | 70% |
| Assessment Limits | This is an individual assignment that accounts for 70% of your total grade. It consists of four questions with different weightings. There is no strict requirement on the documentation or quality of the functions you create, but it is expected that the functions include only essential information. |
The Assessment Calendar can be found under ‘Assessment & Feedback’ in the COMSC-ORG-SCHOOL organisation on Learning Central. This is the single point of truth for (a) the hand out date and time, (b) the hand in date and time, and (c) the feedback return date for all assessments.
Learning Outcomes
The learning outcomes for this assessment are as follows:
1.) Carry out data analysis and statistical testing using code
2.) Critically analyse and discuss methods of data collection, management and storage
3.) Extract textual and numeric data from a range of sources, including online
4.) Reflect upon the legal, ethical and social issues relating to data science and its applications
Submission Instructions
Start by downloading all questions.ipynb from Learning Central, then answer the following questions. You can use any Python expression or package that was used in the lectures and practical sessions. Additional packages are not allowed unless instructed in the question. You can answer the questions by filling in the appropriate sections in the given example Jupyter Notebook.
All files should be submitted via Learning Central. The submission page can be found under ‘Assessment & Feedback’ in the CMT309 module on Learning Central. Your submission should consist of multiple files:
| Description | Compulsory/Optional | Type | Name |
|:---|:---|:---|:---|
| Coversheet | Compulsory | One PDF (.pdf) file | Coversheet.pdf |
| Your solution to Q1 | Compulsory | Word (.docx) file | student no Q1.docx |
| Your solution to Q2 | Compulsory | One Notebook (.ipynb) file | student no Q2.ipynb |
| Your solution to Q3 | Compulsory | One Notebook (.ipynb) file | student no Q3.ipynb |
| Your solution to Q4 | Compulsory | One Notebook (.ipynb) file | student no Q4.ipynb |
Any deviation from the submission instructions (including the number and types of files submitted) may result in a reduction of marks for the assessment or question part.
You can submit multiple times on Learning Central. ONLY files contained in the last attempt will be marked, so make sure that you upload final files in the last attempt.
Staff reserve the right to invite students to a meeting to discuss the Coursework submissions.
If you are unable to submit your work due to technical difficulties, please submit your work via e-mail to [email protected] and notify the module leader.
Assessment Description
Rules
(1) You have to upload the files mentioned in the Submission Instructions section above.
(2) Failing to follow the required file names and file types (e.g. naming your file q1.py instead of Q1.py) will incur a penalty of 10 points from your total mark.
(3) The coursework includes different datasets, which are automatically downloaded. Since these files are already with the markers, students do not need to submit these files back.
(4) Do not change the txt file names. Code developed with changed file names will cause errors during marking, since the markers will use a Python marking script written for the original file names.
(5) You can use any Python expression or package that was used in the lectures and practical sessions. Additional packages are not allowed unless instructed in the question. Failing to follow this rule might cause you to lose all marks for that specific part of the question(s).
(6) You are free to use any Python environment or version to develop your code. However, you should complete and test your notebook in Google Colab, since the testing and marking process will be done via Google Colab.
(7) If any submitted code for any sub-question fails to run in Google Colab, that part of the code will be marked as 0 without testing the code in Jupyter, or any other environment.
(8) It is not allowed to use the input() function to ask the user to enter values.
(9) If a function is asked to be developed, the name and input arguments of that function should be the same as instructed in the paper.
Testing Your Codes
You are given at least one test case for each question, together with its desired outcome. These cases give you the chance to check your implementations. Use the test cases to make sure that:
• Your function does not crash, that is, there are no Python errors when trying to run the function.
• Your results match the results of the test cases. Expected/desired results are given at the end of each case.
Please note that returning the same outputs in the test codes does not assure that you will get full marks. We will use additional test cases (not disclosed) to test your functions.
IMPORTANT: You must make sure that your file executes and does not crash before submitting it to Learning Central. Any function that crashes or does not execute will receive 0 marks on the respective (sub)question. Note that the test codes are only provided for your convenience.
Q1) Ethics (8 marks)
Camden Council published a data charter. Because Camden Council uses data to improve residents’ lives, it considered it important to document how it does so and how it collects and processes residents’ data ethically. In this question, you must first demonstrate that you are able to understand the charter by aligning it to the UK Data Ethics Framework. To this end, you must use 3 resources:
• The Camden Data Charter: [link here]
• The UK Data Ethics Framework: [link here]
• The YouTube video showing how the charter was developed: [link here]
Your task is to identify one principle in the Principles section in Camden’s Data Charter and discuss how it aligns with the UK Data Ethics framework. Specifically, you must discuss whether it aligns or not with each of the principles in the framework (transparency, accountability and fairness), and what actions from the framework are most prominent. You should also refer to the video for evidence of this alignment, being specific about what part of the video you are referring to, who speaks, and how what they say aligns with the UK Framework. You must not discuss a principle in the charter for which you didn’t find evidence in the video.
For example, you could refer to 1. Build trust through transparency in the Camden Charter, which is clearly related to the Transparency principle in the framework, and which is discussed by Mohamed between seconds 41 and 49 in the video.
For distinction, you should also reflect critically on some part of the video that shows a limitation or with which you do not necessarily agree, and discuss why. For example, at 1:32, one of the posters on the wall reads communicate to people that that data will not be used for profiling. However, you might believe that profiling could be beneficial for you as an adult, e.g., by knowing more about your health, the council might direct you to specific vaccination/immunization campaigns. On the other hand, you might feel more protective about children. To receive full credit for this part, you must provide the actual timestamp from the video.
Q2) Web Scraping (8 Marks)
Create a function oscars_scraper(url, start, end, PerfQuery) that takes a Wikipedia page URL (url) as input and performs web scraping. The page stores information about Academy Award for Best Actress category winners and nominees between 1927 and 2023.
Your task is to use the BeautifulSoup module to scrape the corresponding page (in HTML format), find the target table rows for dates between start and end, and create a data frame from them.
The function oscars_scraper(url, start, end, PerfQuery) will then perform a pandas query for a given 'Actress' (variable PerfQuery) finding the number of times nominated and winning the award. Then, a string should be printed as given below.
An example test case can be given as
with df is returned (first 5 lines as an example)
and the statement below has been printed.
Between the years 1975 and 2023, Kate Winslet was nominated for the Academy Awards for Best Actress 4 times . Among those nominations, Kate Winslet won the award 1 times
WARNING: All the information should be scraped from the HTML page. Manual entries will be discarded and cannot be marked!
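The counting step of the query can be illustrated on a few made-up records. The sketch below is plain Python rather than pandas, uses placeholder performer names instead of scraped data, and assumes that start and end are both inclusive; the helper name count_nominations is hypothetical and not part of the brief. The same filter could be expressed as a boolean mask on a data frame.

```python
# Hypothetical records standing in for rows of the scraped table.
records = [
    {"Year": 1999, "Actress": "Performer A", "Winner": False},
    {"Year": 2005, "Actress": "Performer A", "Winner": True},
    {"Year": 2005, "Actress": "Performer B", "Winner": False},
    {"Year": 2012, "Actress": "Performer A", "Winner": False},
]


def count_nominations(records, start, end, perf_query):
    """Return (times nominated, times won) for perf_query between start and end.

    Assumption: both start and end are inclusive.
    """
    matches = [
        r for r in records
        if start <= r["Year"] <= end and r["Actress"] == perf_query
    ]
    nominated = len(matches)
    won = sum(1 for r in matches if r["Winner"])
    return nominated, won


nominated, won = count_nominations(records, 2000, 2023, "Performer A")
print(nominated, won)
```

Every nomination row counts towards the first number, and only rows flagged as winners count towards the second.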
Some Suggestions
• As the first step, find table rows by using HTML tags.
• In the tables at the given URL, all column names are in HTML header cells, and year cells are also header cells that span several rows at a time.
• Since all tables share the same column headers ('Year', 'Actress', 'Role(s)', 'Film', 'Ref.'), finding the header row once, at its first occurrence, and extracting the data frame column names from it is the easiest approach.
• Year columns also include the edition of the award ceremony (e.g. "2023 (96th)"). You need to split this information into two columns in your table 'Year', 'Edition'.
• Some cells span more than one row. Please combine these rows carefully so that your final pandas dataframe is built correctly.
• Winner actress names include a special character "‡" (e.g. "Emma Stone ‡"). You need to remove this character from actress names.
• Lastly, you should add a Boolean column called "Winner" that contains True for winners and False for nominees who did not win.
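The cleaning steps suggested above can be sketched on a few illustrative rows. This is a minimal sketch in plain Python, not the expected solution: the row layout (a year cell present only on the first row of each group because it spans several rows), the regular expression, and the helper names split_year_edition and clean_rows are all assumptions made for illustration. In the real task the cell text would come from BeautifulSoup and the records would be turned into a pandas data frame.

```python
import re

# Illustrative rows, as they might look after extracting cell text.
# The year cell spans several rows, so only the first row of a group has it.
raw_rows = [
    ["2023 (96th)", "Emma Stone ‡", "Bella Baxter", "Poor Things"],
    ["Annette Bening", "Diana Nyad", "Nyad"],
    ["Lily Gladstone", "Mollie Burkhart", "Killers of the Flower Moon"],
]

# Matches a four-digit year followed by the edition in brackets.
YEAR_PATTERN = re.compile(r"^(\d{4})\S*\s*\((\w+)\)")


def split_year_edition(cell):
    """Split '2023 (96th)' into (2023, '96th'); return None for other cells."""
    match = YEAR_PATTERN.match(cell.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def clean_rows(rows):
    """Carry the year forward, strip the winner mark and add a Winner flag."""
    records = []
    year = edition = None
    for cells in rows:
        parsed = split_year_edition(cells[0])
        if parsed is not None:
            # New group: remember its year and edition, then drop that cell.
            year, edition = parsed
            cells = cells[1:]
        actress = cells[0]
        winner = "‡" in actress
        records.append({
            "Year": year,
            "Edition": edition,
            "Actress": actress.replace("‡", "").strip(),
            "Role(s)": cells[1],
            "Film": cells[2],
            "Winner": winner,
        })
    return records


records = clean_rows(raw_rows)
for record in records:
    print(record)
```

Carrying the last seen year forward is one simple way to handle the row-spanning year cells; other approaches, such as reading the rowspan attribute directly, would work as well.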
Marking Details
• (1 mark) Correct function argument usage
• (1 mark) BeautifulSoup usage
• (2 marks) Correct function returns in terms of quantity and types
• (4 marks) Correct dataframe content (this will be checked via 4 pandas queries (1 given, 3 hidden) each is worth 1 mark)
Q3) Regression Analysis & Visualisation (30 marks)
You are given a data set (nba.csv) which is a record of statistics of several rookie NBA players throughout their first season.
| | name|games_played |minutes_played |points_scored |... |5yrs |
|---:|:----------------|---------------:|-----------------:|----------------:|...:|-------:|
| 0 | Brandon Ingram | 36 | 27.4 | 7.4 |... | 0 |
| 1 | Andrew Harrison | 35 | 26.9 | 7.2 |... | 0 |
| 2 | JaKarr Sampson | 74 | 15.3 | 5.2 |... | 0 |
| 3 | Malik Sealy | 58 | 11.6 | 5.7 |... | 1 |
| 4 | Matt Geiger | 48 | 11.5 | 4.5 |... | 1 |
Given these sets of features, your task is to
1. select significant features via a combination of different approaches,
2. perform prediction using Logistic regression with your selected features.
• The prediction task will be to check whether the player would still be playing in the NBA 5 years later (column '5yrs') (1: indicating he will play or 0: indicating he won’t play).
3. compare your designed logistic regression model with
• Model 1: Linear regression with 'games_played' column as a predictive feature
• Model 2: Logistic regression with all columns as predictive features
• Model 3: Random Forest Regression with your chosen features.
4. create a function that calculates the Precision-Recall values for any given model.
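As a reminder of the definitions behind the last task, precision is TP / (TP + FP) and recall is TP / (TP + FN), where a prediction counts as positive when the model output reaches a decision threshold. The sketch below computes both from these definitions in plain Python. The function name precision_recall, its arguments, the toy labels and scores, and the convention of returning precision 1.0 when nothing is predicted positive are all assumptions for illustration; the function name and signature required by the brief take precedence.

```python
def precision_recall(y_true, y_score, threshold=0.5):
    """Compute (precision, recall) from their definitions at one threshold."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1  # true positive
        elif predicted == 1 and truth == 0:
            fp += 1  # false positive
        elif predicted == 0 and truth == 1:
            fn += 1  # false negative
    # Assumed convention: precision is 1.0 when there are no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and model outputs, made up for illustration.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.6, 0.4, 0.8, 0.2]

precision, recall = precision_recall(y_true, y_score, threshold=0.5)
print(precision, recall)
```

Because the function only needs labels and continuous scores, it can be applied to the outputs of either regression or classification models.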
Expectations and Marks
• Write a code that utilises two feature selection techniques of your preference. Combine the findings of your two chosen feature selectors in a logical way (open to your imagination!). Depending on your choice of features, create a figure to justify your choice!
• (2 x 2 Marks) Correct usage of feature selectors
• (8 Marks) Combination of methods and justification figure! You must use the plotly module. Your figure should be self-explanatory with their titles, labels, legends, hover text, annotations, etc. where the reader should understand your justification of feature selection without the need for additional text.
• Perform prediction using your Logistic regression model with selected features.
• (1 Mark) Correct implementation of model
• Create your comparison models shared above with proper predictions.
• (3 x 1 Marks) Correct implementation of comparison models
• (4 Marks) Create a function _PR() that calculates precision and recall values for any given model. You cannot use any ready-to-use functions here and you must calculate precision recall values yourself from their definitions.
• Since the regression models provide continuous outputs and the prediction problem here is a binary classification, you need to decide on a decision threshold for each regression model.
• (4 Marks) Obtain Precision-Recall curves for all four models above, plot them in a single figure and show their optimal decision thresholds on this figure. You can use any visualisation module taught in the lectures.
• (2 Marks) Correct threshold selection for models.
• Performance analysis code and table
• (4 Marks) Calculate % Accuracy (1 mark) and AUC (1 mark) metrics for all the models and create a dataframe (2 marks) showing Model names, Accuracy and AUC in different columns.
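The threshold and metric steps listed above can be sketched on toy data. This is a minimal illustration in plain Python under several assumptions that the brief does not fix: the optimal threshold is taken as the one maximising F1 (other criteria are possible), AUC is read as the area under the ROC curve and computed through its pairwise-ranking formulation, and all function names and data are hypothetical.

```python
def confusion(y_true, y_score, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are predicted as 1."""
    tp = fp = fn = tn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pr_curve(y_true, y_score):
    """Return (threshold, precision, recall) for every distinct score."""
    points = []
    for threshold in sorted(set(y_score)):
        tp, fp, fn, _ = confusion(y_true, y_score, threshold)
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((threshold, precision, recall))
    return points


def best_threshold_by_f1(points):
    """Pick the threshold with the highest F1 (one possible criterion)."""
    def f1(point):
        _, p, r = point
        return 2 * p * r / (p + r) if (p + r) else 0.0
    return max(points, key=f1)[0]


def accuracy_percent(y_true, y_score, threshold):
    """Share of correct predictions, as a percentage."""
    tp, _, _, tn = confusion(y_true, y_score, threshold)
    return 100.0 * (tp + tn) / len(y_true)


def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and model outputs, made up for illustration.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.6, 0.4, 0.8, 0.2]

best = best_threshold_by_f1(pr_curve(y_true, y_score))
acc = accuracy_percent(y_true, y_score, best)
auc = roc_auc(y_true, y_score)
print(best, acc, auc)
```

Running the same functions once per model would give one row per model for a summary data frame with Model, Accuracy and AUC columns.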