CS 579: Online Social Network Analysis
Project 1 – Social Media Data Analysis
Deliverable 1 due on January 25, 2024 at 11:59pm (Google Form)
Short progress report due on February 6, 2024 at 11:59pm (Google Form)
Final report due on February 20, 2024 at 11:59pm (Blackboard)
This project is to be completed in 2-person teams. The team members should work together on each aspect of the project, including the writeup of the report. Each deliverable should be submitted once per team.
Project Objectives
You will learn how to crawl social media data, consider privacy and data usage implications, and process, model and analyze the data. You will write a detailed written report and give a short oral presentation summarizing your results.
Project Outline
1. Data Collection
2. Data Visualization
3. Network Measures Calculation
Guidelines
Data Collection – Your initial task is to choose a social media platform to collect data from. Some example platforms include Instagram, DBLP, Reddit, arXiv, ResearchGate, Stack Overflow, Stack Exchange, Wikipedia, etc. Figure out how you can crawl data from these websites. Some of these platforms provide an API for collecting data. Make sure you have the credentials needed for collecting the data (e.g., an API key).
You should collect enough data to create a social network with 100-500 nodes. Some representative network types are described as follows.
• Friendship Network. A user’s friendship network can be represented as a graph in which the nodes are users and an edge indicates a friendship relationship between two users. Example: Users and connections in LinkedIn.
• Co-authorship Network. The nodes are scientists, and two scientists are connected if they have co-authored a paper. Example: A co-authorship network in the Computer Science category of papers in arXiv.
• Diffusion Network. A node represents an entity that can publish, receive and propagate information. A directed edge between nodes represents the direction of information propagation. Example: Fake news propagation, where the nodes are users and the edges are re-tweets/replies/likes.
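As an illustration of the co-authorship case above, the following minimal sketch turns a list of papers into weighted, undirected edges. The sample `papers` data and the function name `coauthorship_edges` are hypothetical and only show the idea; real records from your chosen platform will need their own parsing and name cleaning.

```python
from collections import Counter
from itertools import combinations


def coauthorship_edges(papers):
    """Map each undirected author pair to the number of papers they share."""
    edges = Counter()
    for authors in papers:
        # Deduplicate and sort so every pair has one canonical orientation.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical sample data: each inner list is the author list of one paper.
papers = [
    ["Ada", "Bo", "Cy"],
    ["Ada", "Bo"],
    ["Dee"],
]

edges = coauthorship_edges(papers)
# Authors with no co-authors (here "Dee") are still nodes of the network.
nodes = {name for authors in papers for name in authors}
```

The edge weights (number of shared papers) are optional; dropping them leaves a plain unweighted edge list.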
Your report will include a description of how you crawled your chosen platform to collect the data. Please also describe any challenges you faced, how you overcame them and how they affected the data that you were ultimately able to collect. Your report should also include your chosen platform’s user privacy policy and data usage policy. If you cannot find these policies, please describe where you looked for them.
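One common way to stay within a node budget such as 100-500 is a breadth-first ("snowball") crawl from a seed account. The sketch below is only an outline under stated assumptions: `fetch_neighbors` is a hypothetical function that you would write yourself around your platform's API or pages (including authentication and rate limiting), and the toy dictionary stands in for real responses.

```python
from collections import deque


def snowball_crawl(seed, fetch_neighbors, max_nodes=500):
    """Breadth-first crawl from `seed`, keeping at most `max_nodes` nodes."""
    seen = {seed}
    edges = set()
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for other in fetch_neighbors(user):
            if other == user:
                continue  # ignore self-loops
            if other not in seen:
                if len(seen) >= max_nodes:
                    continue  # node budget exhausted: skip new nodes
                seen.add(other)
                queue.append(other)
            # Store undirected edges as frozensets so (u, v) equals (v, u).
            edges.add(frozenset((user, other)))
    return seen, edges


# Toy stand-in for a platform: maps each user to the users they link to.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
nodes, edges = snowball_crawl("a", lambda u: toy.get(u, []), max_nodes=4)
```

With the budget of 4, the crawl keeps "a", "b", "c" and "d" and never adds "e". Note that a cap like this biases the sample toward nodes near the seed, which is the kind of data limitation worth discussing in the report.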
Data Visualization – Once the data is collected, the next step is to use a graph analysis software package to visualize your network as a graph. There are many software packages available, including networkx [link], snap [link], Gephi [link], NodeXL [link] and graph-tool [link]. Choose one and read the instructions to determine how to input and visualize your graph. Each package may require a particular format (e.g., adjacency matrix, adjacency list, edge list) for input of the graph data.
Your report will include a short description of the graph analysis software that you used, your reasoning for choosing the software and the format of the data input file. You will include a screenshot of your visualized graph along with any information needed for the reader to understand the visualization.
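The three input formats mentioned above carry the same information and can be converted into one another. The following sketch, using a small made-up undirected edge list and hypothetical helper names, shows the relationship; the exact file layout your chosen package expects may differ, so check its documentation.

```python
def to_adjacency_list(edges):
    """Build {node: set of neighbors} from undirected (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def to_adjacency_matrix(edges):
    """Return (sorted node labels, symmetric 0/1 matrix as nested lists)."""
    nodes = sorted({n for edge in edges for n in edge})
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return nodes, matrix


# Made-up example graph: a triangle a-b-c with a pendant node d.
edge_list = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

adj = to_adjacency_list(edge_list)
nodes, matrix = to_adjacency_matrix(edge_list)

# One "u v" pair per line is a common plain-text edge-list layout.
edge_list_text = "\n".join(f"{u} {v}" for u, v in edge_list)
```

For sparse social networks, an edge list or adjacency list is usually much smaller than an adjacency matrix, which grows with the square of the node count.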
Network Measures – You will learn different network measures in class (Degree Distribution, Clustering Coefficient, PageRank, Diameter, Closeness, Betweenness, etc.). Use your chosen graph analysis software to obtain the degree distribution and plot it as a histogram. In addition, choose any two other network measures from those that we’ve learned about and report on them in an appropriate format.
Your report will include a description of how you used the graph analysis software to get each of the three measures, along with the measures and corresponding visualizations as appropriate.
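To make two of these measures concrete, here is a minimal pure-Python sketch of the degree distribution and the local clustering coefficient on a small made-up graph. It is meant only to show what the numbers mean; the function names are hypothetical, and your graph analysis software will provide its own routines for these measures.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(adj):
    """Map each degree k to the number of nodes that have degree k."""
    return Counter(len(neighbors) for neighbors in adj.values())


def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to close
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


# Made-up undirected graph: a triangle a-b-c with a pendant node d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

dist = degree_distribution(adj)
average_clustering = sum(local_clustering(adj, n) for n in adj) / len(adj)

# Text histogram of the degree distribution, one row per degree.
for k in sorted(dist):
    print(f"{k:>2} | {'#' * dist[k]}")
```

In this example node "c" has three neighbors, and only one of the three neighbor pairs is connected, so its local clustering coefficient is 1/3.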
Discussion of Results – Your report will include a discussion of the results of the data visualization and network measures. What insights do these results provide? What further questions do these results raise? What would your next step to investigate further be?
Reference – Your report will cite all tutorials, packages, software and libraries you used in your data collection and analysis.
Video – Each team will submit a video (no longer than 4 minutes) where each team member talks about the most significant challenge they faced working on the project.
Submission
We will run your code to see if it works for all of the steps. You should put all of your files, including your raw data, your cleaned data, source code files, a report in PDF format and your short video, into a .zip folder named LASTNAME1_LASTNAME2_PJ1 (replace LASTNAME1 and LASTNAME2 with the last name of each team member). Submit your zip folder to Blackboard. One submission per team.
Academic Integrity
You must develop your own code for data scraping. It is NOT okay to use a publicly available dataset.
You can refer to others’ code and use libraries, software and packages, but it is not okay to copy existing code from others. Be sure to cite any sources you use. Failure to cite sources will be considered plagiarism.