COMP42315 Assignment – Web Scraping, Data Analysis, and Visualization
Module/Lecture Course: Programming for Data Science
Deadline for submission: 9 August 2024 at 2:00pm UK time
Deadline for marks and feedback to be returned to students: Week beginning 9 September 2024
Submission instructions: Submit all files via JupyterHub on NCC.
Format: You should provide your code and written answers (i.e., report) within the same Jupyter notebook file. Do not put your name on your work, just your username.
Contribution: The coursework assignment contributes 100% to the final mark for the module.
In accordance with University procedures, submissions that are up to 5 working days late will be subject to a cap of the module pass mark, and later submissions will receive a mark of zero.
Content and skills covered by the assignment
• Understand advanced concepts of programming in Python.
• Have a critical appreciation of the main strengths and weaknesses of a range of Python packages and understand how to use them.
• Have a critical appreciation of how to acquire and clean datasets for analysis.
• Understand how to manipulate potentially large datasets efficiently.
• Be able to write computer programs in Python using industry-standard packages.
• Be able to select appropriate data structures for modelling various data science scenarios.
• Be able to select the appropriate algorithm and programming package for a given problem.
• Be able to write a computer program in Python to collect or read data from available sources, and clean these datasets using the appropriate packages.
• Effective written communication.
• Planning, organising, and time-management.
• Problem solving and analysis.
Requirements – Please read the following instructions carefully, as they have been updated for this assignment
Students are expected to work on the coursework individually.
In this assignment, you are asked to scrape data from a website and perform data analysis and visualisation. You will implement the programming solution with a Jupyter Notebook file containing your code and written answers (i.e., report) that explain the implementation and justify the design.
What the examiners expect from program implementation:
• Your program must be runnable on the Durham NCC server for this module (https://ncc1.clients.dur.ac.uk/COMP42315). A program that partially works or does not run at all will receive no mark.
• You are asked to use Python and the Python libraries taught in this module to complete this part. If you wish to use other libraries, you should verify that they are installed on the NCC server. If they are not, you should either avoid using them or contact your tutor to ensure they are installed on the NCC server.
• Your source code should be documented with comments so that it is as easy to follow as possible.
• Apart from performing the requested functionality, your design should aim at a clear programming logic. Your proposed solution should also be as robust as possible, such that it works in different situations, and would hopefully work in the future when the site owner updates the webpage (i.e., as future-proof as possible).
What the examiners expect from the report:
• You are asked to answer each of the questions and submit a single Jupyter Notebook (.IPYNB) file, which should include the source code, the results after running the source code, and written answers (i.e., in the separate report part provided in the submission system) that explain the implementation and justify the design.
• You should include the results of your solution in a proper presentation (e.g., tables, figures) in the Jupyter Notebook.
• If there are any features that you wish to highlight, you are encouraged to do so, so that your examiner can pay attention to them.
• You can use visualizations, figures, tables, organization structures, etc., to help explain your design ideas and showcase the results.
• You should also provide support and justification for your design.
Questions
For Questions 1 and 2, you are asked to perform the following tasks based on the following target website, which contains artificial content designed for this assignment:
https://sitescrape.awh.durham.ac.uk/comp42315_resit/
1. Please design and implement a solution to crawl the publication title, year and author list of every unique publication record on the target website. Then, using a Pandas DataFrame, create and display a table that contains these unique records. The table should consist of five columns: the row number in the table, publication title, author list, year, and the number of authors (hint: you will need to develop an algorithm to work this out). The records in the table should be sorted first by number of authors in descending order, then by year in descending order, and finally by title from A to Z. Include the full table in your Jupyter Notebook.
[Explain your design and highlight any features in this question’s report part of your Jupyter Notebook in no more than 300 words. (35%)]
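A minimal sketch of the table-building step in Question 1, assuming the scraped records are already held as a list of dictionaries. The sample records, the column names, and the helpers `count_authors` and `build_table` are hypothetical, as is the assumption that authors are separated by commas and/or the word "and"; the target website may format author lists differently.

```python
import re

import pandas as pd

# Hypothetical records standing in for scraped data (not taken from the target site).
records = [
    {"title": "B paper", "authors": "A. Smith, B. Jones", "year": 2020},
    {"title": "A paper", "authors": "A. Smith, B. Jones", "year": 2020},
    {"title": "C paper", "authors": "C. Lee", "year": 2023},
    {"title": "D paper", "authors": "A. Smith, B. Jones and C. Lee", "year": 2019},
    {"title": "A paper", "authors": "A. Smith, B. Jones", "year": 2020},  # duplicate record
]


def count_authors(author_list):
    """Count names, assuming commas and/or the word 'and' separate authors."""
    parts = re.split(r",|\band\b", author_list)
    return sum(1 for part in parts if part.strip())


def build_table(records):
    """Deduplicate, add the author count, sort, and number the rows from 1."""
    df = (
        pd.DataFrame(records)
        .drop_duplicates(subset=["title", "authors", "year"])
        .assign(num_authors=lambda d: d["authors"].map(count_authors))
        # Authors descending, then year descending, then title A to Z.
        .sort_values(by=["num_authors", "year", "title"],
                     ascending=[False, False, True])
        .reset_index(drop=True)
    )
    df.insert(0, "row", list(range(1, len(df) + 1)))
    return df


table = build_table(records)
print(table.to_string(index=False))
```

Passing a list to `ascending` lets each sort key have its own direction, which is what the three-level ordering in the question calls for.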
2. You will use the scraping website to gather information related to research publications from Dr. Shum’s research group. Avoid scraping the same URL repeatedly by storing the retrieved information in a DataFrame for further processing.
a) Present Table 1 containing the headings: publication year, research publication title, impact factor (IF), citation count, and Similar Research items count (SRic) for the most highly-cited research publication in each year between and including 2006 and 2023. (5/30)
b) Determine the proportion of document types represented in Table 1, and legibly present this information in Table 2 with headings proportion and document type, taking care with the normalisation of proportion and including all document types. (5/30)
c) For each publication in Table 1, compute the impact factor mean and standard deviation of all publications with at least one shared topic with that publication, excluding the publication itself from the computation. Present the topic(s), the publication title, the mean impact factor of publications in shared topics, and the standard deviation in Table 3 with headings topic(s), research publication title, mean impact factor, and std. dev. impact factor. (10/30)
d) Finally, in Figure 1 plot the Similar Research items count (SRic) on the x-axis against the impact factor on the y-axis as a scatter plot including all research publications, coloring by document type, and using a different marker shape for those listed in Table 1. (10/30)
[Explain your design and highlight any features in this question’s report part of your Jupyter Notebook in no more than 300 words. (30%)]
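A minimal sketch of the data-handling steps in Question 2 (a) to (c), assuming the publication details have already been scraped into one DataFrame. All values, the column names (`citations`, `impact_factor`, `doc_type`, `topics`), and the helper names are hypothetical. The sketch keeps fetched pages in a dictionary and converts it to a DataFrame afterwards, which is only one possible way to avoid repeated requests; it also assumes titles are unique when excluding a publication from its own statistics.

```python
import pandas as pd

# Hypothetical records; every value and column name is an illustrative assumption.
pubs = pd.DataFrame({
    "year": [2006, 2006, 2007, 2007, 2008],
    "title": ["P1", "P2", "P3", "P4", "P5"],
    "citations": [10, 25, 7, 3, 12],
    "impact_factor": [1.0, 2.0, 3.0, 4.0, 5.0],
    "doc_type": ["Journal", "Conference", "Journal", "Journal", "Conference"],
    "topics": [["motion"], ["motion", "graphics"], ["graphics"], ["vision"], ["motion"]],
})

_page_cache = {}


def fetch_once(url, fetcher):
    """Call fetcher(url) only the first time a URL is requested."""
    if url not in _page_cache:
        _page_cache[url] = fetcher(url)
    return _page_cache[url]


def cache_as_dataframe():
    """Expose the cached pages as a DataFrame for further processing."""
    return pd.DataFrame(list(_page_cache.items()), columns=["url", "html"])


def most_cited_per_year(df, first_year, last_year):
    """Keep the most-cited row per year; ties resolve to the first row found."""
    in_range = df[df["year"].between(first_year, last_year)]  # inclusive bounds
    top_idx = in_range.groupby("year")["citations"].idxmax()
    return in_range.loc[top_idx].sort_values("year").reset_index(drop=True)


def doc_type_proportions(table, all_types):
    """Normalised proportions, with 0.0 for types absent from the table."""
    prop = table["doc_type"].value_counts(normalize=True)
    prop = prop.reindex(all_types, fill_value=0.0)
    out = prop.rename_axis("document type").reset_index(name="proportion")
    return out[["proportion", "document type"]]


def shared_topic_stats(table, all_pubs):
    """Mean and sample std. dev. of impact factor over topic-sharing publications."""
    rows = []
    for _, pub in table.iterrows():
        topics = set(pub["topics"])
        shares = all_pubs["topics"].apply(lambda t: bool(topics & set(t)))
        # Exclude the publication itself (assumes unique titles).
        others = all_pubs[shares & (all_pubs["title"] != pub["title"])]
        rows.append({
            "topic(s)": ", ".join(sorted(topics)),
            "research publication title": pub["title"],
            "mean impact factor": others["impact_factor"].mean(),
            "std. dev. impact factor": others["impact_factor"].std(),
        })
    return pd.DataFrame(rows)


table1 = most_cited_per_year(pubs, 2006, 2023)
table2 = doc_type_proportions(table1, sorted(pubs["doc_type"].unique()))
table3 = shared_topic_stats(table1, pubs)
```

Note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default, so a publication with only one topic-sharing neighbour yields `NaN`; whether the sample or population form is wanted is a design decision to justify in the report.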
For Question 3, you are asked to perform the task based on the clinical dataset (covid.csv), which you can download separately on Blackboard Ultra. The dataset contains artificial content designed for this assignment.
3. The clinical dataset includes 16 features (including 'id') related to vital parameters and patient-reported symptoms collected from individuals who underwent COVID-19 testing. The target variable is 'level', which depicts the severity of the disease on a scale from 1 to 6. You are required to perform the following tasks on the clinical dataset.
a) Extract a subset that includes the 'defined_features' and the 'target variable' (the subset will contain 10 features in total, including the target variable). The 'defined_features' are as indicated below:
defined_features = ['headache', 'lossOfSmell', 'musclePain', 'cough', 'soreThroat', 'fever', 'diarrhea', 'fatigue', 'shortnessOfBreath'] (5/35)
b) Perform exploratory data analysis on the clinical dataset, highlight the features that are statistically important and highly related to the 'target variable', and visualise them legibly using an appropriate visual method. Save the statistically important features as a subset 'selected_features' and compare them with the 'defined_features'. Highlight any differences and report your findings. (10/35)
c) Design and implement a solution to perform data analysis on the clinical dataset to identify the complex probabilistic relationship between the 'defined_features' or 'selected_features' and the 'target variable', and validate the prediction of the severity stage and its determinants. Justify the design choice and showcase the findings using an appropriate visualisation tool. [NOTE: You can opt for either 'defined_features' or 'selected_features' to answer this sub-task] (20/35)
[Explain your design and highlight any features in this question's report part of your Jupyter Notebook in no more than 400 words. (35%)]
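A minimal sketch of the subset extraction in Question 3 (a) and one possible way to shortlist features for (b). The small DataFrame below is synthetic and stands in for covid.csv; the helper names, the use of Spearman rank correlation, and the 0.5 threshold are assumptions made for illustration, not the method prescribed by the assignment.

```python
import pandas as pd

defined_features = ['headache', 'lossOfSmell', 'musclePain', 'cough', 'soreThroat',
                    'fever', 'diarrhea', 'fatigue', 'shortnessOfBreath']

# Synthetic stand-in for covid.csv; the values are invented for illustration only.
data = {"id": list(range(1, 7))}
data.update({name: [0, 1, 0, 1, 0, 1] for name in defined_features})
data["fever"] = [0, 0, 0, 1, 1, 1]  # made to track severity in this toy example
data["level"] = [1, 2, 3, 4, 5, 6]
clinical = pd.DataFrame(data)


def extract_subset(df, features, target="level"):
    """Return a copy holding only the requested features plus the target."""
    wanted = list(features) + [target]
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"Missing columns: {missing}")
    return df[wanted].copy()


def rank_by_association(subset, target="level"):
    """Rank features by absolute Spearman correlation with the target."""
    corr = subset.corr(method="spearman")[target].drop(target)
    return corr.abs().sort_values(ascending=False)


subset = extract_subset(clinical, defined_features)
ranking = rank_by_association(subset)

# The 0.5 cut-off is arbitrary; a significance test could replace it.
selected_features = ranking[ranking >= 0.5].index.tolist()
only_in_defined = sorted(set(defined_features) - set(selected_features))
print(ranking.round(3))
print("In defined_features but not selected:", only_in_defined)
```

With the real data, `clinical` would instead come from `pd.read_csv("covid.csv")`, and the ranking would normally be run over all candidate columns rather than only the defined ones so that the two feature sets can genuinely differ.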
Word Limit policy
Examiners will stop reading once the word limit has been reached, and work beyond this point will not be assessed. Checks of word counts may be carried out on submitted work. Checks may take place manually and/or with the aid of the word count provided via electronic submission.
The word count (total 1000 words), as mentioned in individual questions, will:
• Exclude diagrams, tables (including tables/lists of contents and figures), equations, executive summary/abstract, acknowledgements, declaration, bibliography/list of references, and appendices. However, it is not appropriate to use diagrams or tables merely as a way of circumventing the word limit. If a student uses a table or figure as a means of presenting their own words, then this is included in the word count.