STATS 3DA3 Homework Assignment 2
Pratheepa Jeganathan
02/05/2024
Instruction
• Due before 10:00 PM on Tuesday, February 13, 2024.
• Submit a PDF copy of your solution to Avenue to Learn.
• Late penalty for assignments: 15% will be deducted from assignments each day after the due date (rounding up).
• Assignments will not be accepted more than 48 hours after the due date.
Assignment Standards
Your assignment must conform to the Assignment Standards listed below.
• Write your name and student number on the title page. We will not grade assignments without the title page.
• You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
• LaTeX is strongly recommended but not strictly required.
• Eleven-point font (Times or similar) must be used with 1.5 line spacing and margins of at least 1 inch all around.
• Use \newpage so that the solution to each question (1, 2, 3) starts on a new page.
• No screenshots are accepted for any reason.
• The writing and referencing should be appropriate to the undergraduate level.
• Various tools, including publicly available internet tools, may be used by the instructor to check the originality of submitted work.
• Assignment policy on the use of generative AI:
– Students are not permitted to use generative AI in this assignment. In alignment with McMaster academic integrity policy, it “shall be an offence knowingly to … submit academic work for assessment that was purchased or acquired from another source”.
This includes work created by generative AI tools. Also stated in the policy is the following: “Contract Cheating is the act of ‘outsourcing of student work to third parties’ (Lancaster & Clarke, 2016, p. 639) with or without payment.” Using generative AI tools is a form of contract cheating. Charges of academic dishonesty will be brought forward to the Office of Academic Integrity.
Question 1
Download the paper Data Science at the Singularity by David Donoho (2024) at paper. Follow the steps to find the most frequently used words and create a word cloud.
• (1) Reference where you obtained the original PDF document.
• (2) Read all PDF document pages and separate each line by \n.
• (3) Split the lines by \n.
• (4) Remove the lines before the Abstract. You can print the first few lines to find the number of lines to remove.
• (5) Create a data frame with lines.
• (6) Tokenize each line and convert each word to a row.
• (7) Convert each word to lowercase.
• (8) Remove stopwords.
• (9) Remove any other words that are not suitable for the word cloud. For example, single-letter words, symbols such as [ . , ), and abbreviations.
• (10) Create a term-frequency data frame.
• (11) Produce a word cloud. You can decide how many of the most frequently used words to include in the word cloud, for example, a word cloud of the ten most frequently used words.
• (12) Write a summary paragraph (at least two statements) about your word cloud. The summary should be cast in the context of your chosen text document.
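As a rough illustration of steps (3) to (10), the sketch below runs the same pipeline on a toy string rather than on the paper itself. It is not a model solution: the text, the tiny stopword list `STOPWORDS`, and the regular-expression tokenizer are all assumptions made for the example, and plain Python lists stand in for the data frame that step (5) asks for.

```python
import re
from collections import Counter

# Toy stand-in for the text extracted from the PDF (NOT the real paper).
raw_text = (
    "Title line\n"
    "Author line\n"
    "Abstract\n"
    "Data science is the science of data.\n"
    "The data are shared (openly)."
)

# Hypothetical, deliberately tiny stopword list; a real analysis would
# use a fuller list from a text-processing library.
STOPWORDS = {"the", "is", "of", "are", "a", "an", "and"}

# Step 3: split the text into lines.
lines = raw_text.split("\n")

# Step 4: drop every line before the one that starts the abstract.
start = next(
    i for i, line in enumerate(lines)
    if line.strip().lower().startswith("abstract")
)
lines = lines[start:]

# Steps 6-7: tokenize each line into runs of letters and lowercase them.
# Punctuation and digits are discarded by the pattern.
tokens = [
    tok.lower()
    for line in lines
    for tok in re.findall(r"[A-Za-z]+", line)
]

# Steps 8-9: remove stopwords and single-letter tokens.
words = [t for t in tokens if t not in STOPWORDS and len(t) > 1]

# Step 10: term frequencies, most frequent first.
term_freq = Counter(words)
top_terms = term_freq.most_common(3)
```

Here `top_terms` holds the three most frequent terms with their counts. In the actual assignment the same counts would live in a term-frequency data frame and feed the word cloud in step (11).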
Question 2
Question 2 uses the Johns Hopkins GitHub data on COVID-19 vaccine doses administered globally to develop a Shiny app.
Visit the website https://github.com/govex/COVID-19/tree/master/data_tables/vaccine_data/global_data and read the description (readme.md).
This question leads to a Shiny app in which users choose a date range to investigate the number of COVID-19 vaccine doses administered and the number of people who have received at least one dose.
• (1) Read the CSV file at https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/vaccine_data/global_data/time_series_covid19_vaccine_global.csv into Python. Read the data dictionary at https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/global_data/data_dictionary.csv.
• (2) Each row is uniquely defined by country and date in the data frame. What is the dimension of the data?
• (3) Look at the data dictionary. Describe the Doses_admin and People_at_least_one_dose variables.
• (4) Identify the data frame column representing the countries. Then, select the rows in the data frame for Canada.
• (5) Use only the Canada vaccine data to answer the rest of the questions. Plot the time series of Doses_admin and People_at_least_one_dose in the same graph. Label the two lines “Doses Administered” and “People at least one dose administered”, respectively. Convert the y-axis to the log scale. Rotate the x-axis ticks by 45 degrees.
Hint:
1. Convert ‘Date’ column to datetime format.
2. Use matplotlib.pyplot.plot.
• (6) Describe the plot in the context of data.
• (7) Create the Shiny app as follows. The user input is a start date and an end date. The dates may range from 2020-12-29 to 2023-03-09. The output is the time series plot of the logarithm of the doses administered and of the people with at least one dose administered in Canada for the date range the user chooses. You can use the following template to create the Shiny app.
• (8) Deploy your Shiny app at https://www.shinyapps.io/. Then, provide the link to the app, for example, https://pratheepaj.shinyapps.io/my_app/.

from shiny import App, render, ui
# import required libraries

app_ui = ui.page_fluid(
    ui.input_date_range(
        "daterange",
        "Date range",
        start="2020-12-29",
        end='2023-03-09'
    ),
    ui.output_plot('myplot'),
)

def server(input, output, session):
    @output
    @render.plot
    def myplot():
        # Read the data
        # Select the data for Canada
        # If you call the data frame `df`, then the
        # following code selects the rows in the
        # user-selected date range
        df = df[df['Date'] > pd.Timestamp(input.daterange()[0])]
        df = df[df['Date'] < pd.Timestamp(input.daterange()[1])]
        # Create the plot using `df`

app = App(app_ui, server)
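The country selection, date conversion, date-range filter, and log transform described in parts (4), (5), and (7) can be tried out on a few rows before wiring them into the app. The sketch below is only an illustration using the standard library instead of pandas. The rows in `SAMPLE_CSV` are made up and are not values from the real file, `select_rows` is a hypothetical helper name, and the country column name `Country_Region` is an assumption that should be checked against the data dictionary.

```python
import csv
import io
import math
from datetime import date

# Made-up rows for illustration only; NOT values from the real file.
# The column name "Country_Region" is an assumption.
SAMPLE_CSV = """Country_Region,Date,Doses_admin,People_at_least_one_dose
Canada,2021-01-01,100,100
Canada,2021-01-02,1000,900
Canada,2021-01-03,10000,8000
Brazil,2021-01-02,500,500
"""


def select_rows(csv_text, country, start, end):
    """Return (date, log10 doses, log10 people) for one country,
    keeping only rows with start <= date <= end."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Convert the 'Date' string (YYYY-MM-DD) to a date object.
        day = date.fromisoformat(row["Date"])
        if row["Country_Region"] == country and start <= day <= end:
            selected.append((
                day,
                math.log10(float(row["Doses_admin"])),
                math.log10(float(row["People_at_least_one_dose"])),
            ))
    return selected


# The two dates play the role of the values chosen in the date-range input.
rows = select_rows(SAMPLE_CSV, "Canada", date(2021, 1, 2), date(2021, 1, 3))
```

This sketch keeps both endpoints of the chosen range, whereas the template's strict `>` and `<` comparisons drop them. It also does not handle missing or zero counts, which the real file may contain and for which `float` or `math.log10` would raise an error.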
3. Helper’s name.
After attempting homework problems individually, students may discuss a homework assignment with their classmates. However, students must write up their solutions individually and explicitly indicate from whom (if anyone) or from which resources they received help. Write your helper’s name (only one helper’s name is accepted).
Grading scheme
1. (1) Link to the document [1]
   (2) Codes to read all the pages [1]
   (3) Codes [1]
   (4) Codes [1]
   (5) Codes [1]
   (6) Codes [2]
   (7) Codes [1]
   (8) Codes [1]
   (9) Codes [1]
   (10) Codes [1]
   (11) Codes, word cloud for the most frequently used words [2]
   (12) Two statements [2]
2. (1) Codes [1]
   (2) Codes and answer [1]
   (3) Description [2]
   (4) Identify the column and code [2]
   (5) Plot variable 1, plot variable 2 in the same plot, label both time series, y-axis scale, x-axis ticks [5]
   (6) At least one statement [1]
   (7) Importing libraries, complete the codes for creating the plot, app works locally [3]
   (8) Deploying the app, link to the app [2]
The maximum number of points for this assignment is 32. We will convert this to a percentage out of 100.