DATA7201 Data Analytics at Scale


Postgraduate coursework

DATA7201 Data Analytics at Scale (2024)

Project Report – Report on Dataset Analytics (Coursework)

1. Introduction

This assessment for “DATA7201 Data Analytics at Scale” consists of a piece of individual coursework. Given a dataset (see Section 2), you should use big data analytics techniques to explore the data and draw conclusions that inform decision makers. You will also need to select the most appropriate techniques and justify your choices using supporting evidence from the academic literature.

You should write a 1,500-word structured report (see Section 3) that describes the approach you have taken to analyse the chosen dataset using big data analytics techniques. The report should focus on summarising your approach to the chosen dataset and presenting your main findings. You should pay particular attention to clearly communicating the results of your analysis and to helping the reader interpret your findings. Charts, tables, and appendices are not included in the word count.

This assessment is worth 45% of the overall course mark for DATA7201. Submission deadline: 4pm Monday 20th May 2024 (Week 13) via Turnitin.

2. Given dataset: Facebook Ad Library API

The dataset to be used in this assessment is a collection of sponsored political posts on Facebook targeted at Australian users over four years (03/2020–02/2024). This includes the period preceding the latest Australian Federal election in May 2022 and the Voice referendum in October 2023. A description of the data structure is available starting from: https://www.facebook.com/ads/library/api/ (note that some fields have changed in the API and the collected data over the years). The data is provided by Facebook as JSON files. Each file is the result of a request for active ad campaigns performed every 12 hours (or more frequently) during the period; as a result, many ad campaigns are duplicated across files (i.e., whenever a campaign runs for more than 12 hours) and should be properly handled during pre-processing. Given the limited size of this dataset, it is expected that projects will analyse most of the available data. You can find the data on the DATA7201 cluster HDFS under /data/ProjectDatasetFacebookAU.
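As a minimal illustration of the deduplication step described above (plain Python rather than a cluster tool, and assuming each ad record carries a unique `id` field as in the Ad Library API), repeated records can be collapsed by keeping one entry per ad id across snapshot files:

```python
from typing import Dict, Iterable, List

def deduplicate_ads(snapshots: Iterable[List[dict]]) -> Dict[str, dict]:
    """Collapse repeated ad records across 12-hourly snapshot files,
    keeping one record per ad id (the most recently seen copy wins)."""
    ads: Dict[str, dict] = {}
    for snapshot in snapshots:          # one parsed JSON file per snapshot
        for record in snapshot:
            ads[record["id"]] = record  # later files overwrite earlier ones
    return ads

# Two snapshots taken 12 hours apart; ad "a1" is still active in both.
morning = [{"id": "a1", "spend": "100-199"}, {"id": "a2", "spend": "0-99"}]
evening = [{"id": "a1", "spend": "200-299"}, {"id": "a3", "spend": "0-99"}]

unique = deduplicate_ads([morning, evening])
```

On the actual dataset the same idea would be expressed in a distributed tool on the cluster (e.g., a `GROUP BY` on the ad id in Pig or SQL); the record shape and `spend` values here are hypothetical.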

You can integrate the dataset with external data if you want (e.g., with weather data via time information and mentioned locations), although this is not mandatory. The emphasis of this coursework assignment is on how you engage with big data analytics techniques, select appropriate big data analytics technologies, and on how well you communicate your analysis and findings. You are allowed to use any other data analytics tool (e.g., for producing visualisations or data summaries) as long as you also use, in some steps of your analysis (e.g., to pre-process the entire dataset to select a relevant sample of the data), the cluster where the data lies (e.g., Pig, Python, SQL, etc.).

Examples of possible analysis include, but are not restricted to, the following:

• Look at ad volume over time for a certain topic.

• Focus on certain accounts (e.g., Facebook pages supporting a certain party) and see which demographic segments they target most.

• Look at URLs included in ads to understand which internet domains are most popular during the campaign.

• Look at a specific event or hashtag and look at who is talking about it.

• Look at spend per demographic group during an election campaign.

• Look at the duration of ad campaigns over topics and political alignment.
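The first example above (ad volume over time for a topic) can be sketched as a simple monthly count. The field names `ad_creative_bodies` and `ad_delivery_start_time` follow the current Ad Library API documentation, but since some fields have changed over the years, they may differ in older portions of the dataset:

```python
from collections import Counter

def monthly_ad_volume(ads, keyword):
    """Count ads whose creative text mentions `keyword`, bucketed by the
    month (YYYY-MM) of the ad's delivery start date."""
    counts = Counter()
    for ad in ads:
        bodies = ad.get("ad_creative_bodies") or []
        if any(keyword.lower() in text.lower() for text in bodies):
            counts[ad["ad_delivery_start_time"][:7]] += 1
    return counts

# Tiny hand-made sample in the shape of Ad Library records.
sample = [
    {"ad_delivery_start_time": "2022-04-11",
     "ad_creative_bodies": ["Vote for lower energy prices"]},
    {"ad_delivery_start_time": "2022-04-28",
     "ad_creative_bodies": ["Energy policy that works"]},
    {"ad_delivery_start_time": "2022-05-02",
     "ad_creative_bodies": ["Healthcare for all"]},
]

volume = monthly_ad_volume(sample, "energy")
```

At full-dataset scale, the same aggregation would run on the cluster; this local sketch only shows the shape of the computation, and the sample records are invented for illustration.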

You should investigate the dataset using tools on the DATA7201 cluster and write up your findings into a report also providing the code/scripts/queries (if any) you used as an appendix. You will be evaluated according to the learning objectives of the module as specified in the report structure (Section 3).

3. Report structure

You are required to produce a structured report that includes all the sections detailed in Table 1. You can structure sub-sections as you prefer. Overall, 90 marks will be awarded based on the content of your report. In addition, 10 marks will be awarded based on the presentation of the report and how well you communicate your findings. You must state the word count somewhere in the report. As there is a word count limit, you should aim to make your writing as concise and informative as possible. Note also that your work will be assessed taking into account the word limit; therefore, we are not expecting multiple detailed analyses in the report. Rather, the emphasis should be on clarity, accuracy, and quality in communicating your findings.

Table 1: Required content of the structured report.

Section: Structured abstract
Description: This should provide a summary of your report in a structured manner. This is not included in the word count.
Maximum allocated marks: Required, but 0 marks
Learning objective: —

Section: Table of contents
Description: This should include section titles and page numbers. This is not included in the word count.
Maximum allocated marks: Required, but 0 marks
Learning objective: —

Section: Introduction
Description: This section should briefly describe the general area of big data analytics and motivate the need for distributed system solutions, with practical examples of why these solutions are needed.
Maximum allocated marks: 15 marks
Learning objective: 1. Solve challenges and leverage opportunities in dealing with Big Data.

Section: Dataset Analytics
Description: This section should provide a brief description of the dataset used in your report and the pre-processing steps you took (e.g., a focus on ads about a certain topic). You should also list any additional datasets you used (e.g., weather data), if any. Describe all steps performed to analyse the data and present the results of your analysis. You can select how to analyse your data (e.g., Pig, Python, SQL, etc.) using the DATA7201 cluster, which specific dimensions to look at, and which questions to investigate. You should use at least one of the tools available on the cluster, and you can use additional external tools, if desired.
Maximum allocated marks: 50 marks
Learning objectives: 3. Apply data analytics infrastructures to best support data science practices for non-technical stakeholders (e.g., executives). 5. Judge in which situations Big Data analytics solutions are more or less appropriate. 6. Design the most appropriate Big Data infrastructure solution given a use case where to deploy Big Data solutions.

Section: Discussion and conclusions of the analysis
Description: In this section, you should summarise and discuss the main findings of your analysis and lessons learned. You should state the main message the reader should come away with from your data analysis.
Maximum allocated marks: 25 marks
Learning objective: 3. Apply data analytics infrastructures to best support data science practices for non-technical stakeholders (e.g., executives).

Section: Appendix
Description: Include the code/scripts/queries you used as an appendix. The code quality will not be assessed.
Maximum allocated marks: Optional, 0 marks
Learning objective: —
