COP3530 Project 2: Simplified PageRank Algorithm

Project 2: Simplified PageRank Algorithm

 FAQs

(https://docs.google.com/document/d/1a9hR1Ep2IYKMnsXwl2VxyotO1Yd-XCC8NWUkdp24JM/edit#heading=h.ijfkblnjamg7)

The course staff will maintain an active FAQ as a Google document to answer your questions. Post your questions in Slack, but we will answer them in this document and send you the link.

This assignment handoutin PDF form: Project2_SimplifiedPageRank.pdf (https://ufl.instructure.com/courses/538770/files/97127669?wrap=1)

Problem Statement

In the late 90s, as the number of webpages on the internet was growing exponentially, different search engines were trying different approaches to rank the webpages. At Stanford, two computer science Ph.D. students, Sergey Brin and Larry Page, were working on the following questions: How can we trust information? Why are some webpages more important than others? Their research led to the formation of the Google search engine.

In this project, you are required to implement a simplified version of the original PageRank algorithm on which Google was built by representing the web as a graph and implementing this graph using an Adjacency List or an equivalent data structure. The PageRank algorithm is an algorithm that is used to order or rank different webpages on the internet.

Representing the Web as a Graph

The entire internet consists of different webpages that can be represented as a graph. Each node represents a webpage, and each edge represents a link between two webpages. This graph can be implemented as an Adjacency Matrix or an Adjacency List. In this assignment, you are supposed to implement the Adjacency List representation. If you use an Adjacency Matrix representation, you will be penalized by 25% of your implementation points and will also fail the large test cases.

Adjacency Matrix Representation

Now, for the sake of simplicity, we are explaining the project in the form of an Adjacency Matrix, M. We represent the internet in the form of |V|x|V| matrix where |V| is the total number of vertices in this graph or the total number of webpages on the internet. Each vertex, V , is a webpage on the entire internet. In the below graph, we have five vertices or webpages. Within our graph, if there is an edge from V to V (the from_page points to_page), we have the value in our adjacency matrix M = 1 and 0 otherwise.

Each webpage is thus a node in the directed graph and has incoming edges and outgoing edges. Each node has a rank, r. According to PageRank, this rank, r, is equally split among the node's outgoing links. In the below figure, the rank of node i is denoted by it, and this rank is similarly divided among node i's three outgoing edges.


Rank(i) = Rank(j)/out_degree(j) + Rank(k)/out_degree(k)

According to PageRank, this rank, r, is equal to the sum of the incoming ranks. In the above figure, the rank of node i, i = k /out_degree(k) + j /out_degree(j); Thus, the rank is based on the indegree (the number of nodes pointing to it) and the importance of an incoming node. This is important considering, let's say, you create your webpage and have a million links to other pages of importance, then you might artificially inflate your webpage's ranking if this is not the case. If, for calculating the rank, we used our links, we could have easily duped the algorithm. Therefore, the rank is only based on in-links.

Core Idea of PageRank

Important webpages will point to other important webpages.

Each page will have a score, and the search results will be based on the page score (called page rank).

Goal

In this assignment, you need to compute the rank of the webpages using a Simplified PageRank Algorithm, which is explained in the example below. You are supposed to implement an Adjacency List data structure to represent the graph.

Input

Line 1 contains the number of lines (n) that will follow and the number of power iterations (p) you need to perform. Each line from 2 to n+1 will contain two URLs — from_page to_page separated by a single space. This means that the from_page points to the URL to_page.

Output

Print the PageRank of all pages after p powerIterations in ascending alphabetical order of the webpage. Also, round off the rank of the page to two decimal places.

Constraints

1 <= n <= 10,000
1 <= Unique webpages or |V| <= 10000

Explanation of PageRank Through an Adjacency Matrix (Example)

Gradescope Test Case 1 Explanation (p=2):
Note: Here, p = 2, n = 7, |V| = 5
Input
7 2
google.com
google.com
facebook.com
ufl.edu
ufl.edu
maps.com
gmail.com
gmail.com
maps.com
ufl.edu
google.com
gmail.com
facebook.com
maps.com
Output
facebook.com 0.20
gmail.com 0.20
google.com 0.10
maps.com 0.30
ufl.edu 0.20
6/30/25, 10:11 PM
Project 2: Simplified PageRank Algorithm
https://ufl.instructure.com/courses/538770/assignments/6582501
5/14Algorithm
Graph Representation ofthe Above Input
Data Structure 1
1
google.com
2
gmail.com
3
facebook.com
4
maps.com
5
ufl.edu
Step 1
Step 2
Step 3

Mapping for Simplicity

(Optional, but you will need a mechanism to store unique vertices.)

Use a map/associative array to map the URLs with a unique ID:

1 google.com
2 gmail.com
3 facebook.com
4 maps.com
5 ufl.edu
Graph Representation and Page Rank
In PageRank, the equation to calculate the rank for your graph is given as follows:
Rank of a Page, r = M.R where,
M is the matrix with values given by the following:
M = 1/d if there is an edge from V to V (D is the outdegree of node j)
0 otherwise
For our graph, the adjacency matrix, M will look like:

Power Iteration, r(t+1) = M.r(t)

This means that the rank of the webpage at time t+1 is equal to the rank of that page at time t multiplied by matrix, M. To achieve this, we create our matrix M based on input. Next, we initialize r(t), which is a matrix of size |V|x1 and consists of the ranks of every webpage. We initialize r(t) to 1/|V|. Next, we compute power_iterations based on our input, p. There is mathematical proof that the matrix r converges, for example, r(t+1) = r(t), at which point the algorithm stops. However, this is difficult to test, and we, therefore, will be testing your algorithm on a fixed power iteration value, p.

Example r and M matrices for our input:
Gradescope Test Case 2 Explanation (p=3):
r(t+1) = r(2) = M.r(1) =

Note: In our input case, the number of power_iterations, p, is 2. Therefore, we print r(1) of the URLs sorted in alphabetical order. If p was 1, then return the initializing rank matrix or r(0). If p>2, the process repeats where you multiply the matrix, M, with the new rank matrix r(t+1) at the next iteration.

Optional Template

You can create a template, but make sure your code passes the sample test cases. An example template is in the section below. Click the arrow to see the example.

This template is also provided in the P2 Catch Template (https://github.com/COP3530/P2-Catch-Template) . Even if you choose to make your own code structure, the template will likely still be useful to you as it sets up Catch testing for you.

Example Template
class AdjacencyList {
private:
//Think about what member variables you need to initialize
public:
//Think about what helper functions you will need in the algorithm
};
void AdjacencyList::PageRank(int n){ }
// prints the PageRank of all pages after p powerIterations in
ascending alphabetical order of webpagesand rounding rank to two
decimal places
// This class and method are optional. To accept the input, you can use this method:
int main()
{
int no_of_lines, power_iterations;
std::string from, to;
std::cin >> no_of_lines;
std::cin >> power_iterations;
for(int i = 0;I < no_of_lines; i++)
{
std::cin >> from;
std::cin >> to;
// Do Something
}
//Create a graph object
Created_Graph.PageRank(power_iterations);

发表评论

电子邮件地址不会被公开。 必填项已用*标注