CS221: Project 4 - Ranking
Coding Tasks
- Implement TF-IDF ranking (Cosine) (10 points)
- Implement PageRank and run with ICS websites (5 points)
Testing Tasks
- Write at least 1 test case for a task (2 points)
- Review the test cases of two teams (2 points)
Total: 19 points
Overview
In this project, you'll implement two ranking methods: TF-IDF (Cosine) and PageRank.
Task 1: Implement TF-IDF ranking.
Modify both your InvertedIndex from Project 2 and PositionalIndex from Project 3 to include all necessary information to calculate TF-IDF scores. Implement the following searchTfIdf function that returns the top-K documents ranked by TF-IDF scores.
First, we'll go over an example to calculate TF-IDF Cosine scores using the vector space model.
Segment 1:
  d1: "new york city"
  d2: "los angeles city"
Segment 2:
  d1: "new york post"
Query q: "new new city" (we use two copies of "new" on purpose)

Number of documents in the system: 3.

Document frequency (DF) of the query keywords:
  new  = 2  {s1d1, s2d1}
  city = 2  {s1d1, s1d2}

IDF (using log base 10):
  IDF(new)  = log(3/2) ≈ 0.18
  IDF(city) = log(3/2) ≈ 0.18

TF (term frequency):
  new  = {s1d1: 1, s2d1: 1, query: 2}
  city = {s1d1: 1, s1d2: 1, query: 1}

TF-IDF vectors of the documents and the query, over the dimensions [new, city]:
  s1d1  = [1 * IDF(new), 1 * IDF(city)] = [0.18, 0.18]
  s1d2  = [0,            1 * IDF(city)] = [0,    0.18]
  s2d1  = [1 * IDF(new), 0            ] = [0.18, 0   ]
  query = [2 * IDF(new), 1 * IDF(city)] = [0.36, 0.18]

Cosine similarity between the query and each document (we normalize by the document length only; the query length is the same for every document, so omitting it does not change the ranking):
  sim(s1d1, q) = s1d1 · query / ‖s1d1‖
               = (0.18 * 0.36 + 0.18 * 0.18) / sqrt(0.18^2 + 0.18^2) ≈ 0.38
  sim(s1d2, q) = (0 * 0.36 + 0.18 * 0.18) / sqrt(0 + 0.18^2) = 0.18
  sim(s2d1, q) = (0.18 * 0.36 + 0 * 0.18) / sqrt(0.18^2 + 0) = 0.36

So the ranked order of the documents is [s1d1, s2d1, s1d2].
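To make the arithmetic concrete, here is a small, self-contained Java sketch (not part of the required API) that reproduces the scores above. It uses the rounded IDF value 0.18 so the printed numbers match the example:

public class TfIdfExample {
    public static void main(String[] args) {
        double idf = 0.18; // log10(3/2) rounded, as in the example above

        // TF-IDF vectors over the dimensions [new, city]
        double[] s1d1  = {1 * idf, 1 * idf};
        double[] s1d2  = {0,       1 * idf};
        double[] s2d1  = {1 * idf, 0};
        double[] query = {2 * idf, 1 * idf}; // "new new city"

        System.out.printf("sim(s1d1, q) = %.2f%n", sim(s1d1, query)); // 0.38
        System.out.printf("sim(s1d2, q) = %.2f%n", sim(s1d2, query)); // 0.18
        System.out.printf("sim(s2d1, q) = %.2f%n", sim(s2d1, query)); // 0.36
    }

    // Dot product normalized by the document length only; the query length is
    // identical for every document, so dropping it does not change the ranking.
    static double sim(double[] doc, double[] query) {
        double dot = 0, norm = 0;
        for (int i = 0; i < doc.length; i++) {
            dot += doc[i] * query[i];
            norm += doc[i] * doc[i];
        }
        return dot / Math.sqrt(norm);
    }
}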
Next we'll discuss a two-pass method to calculate these scores. Each query keyword has inverted lists in multiple LSM segments, so computing its global IDF requires knowing the keyword's document frequency in each segment. In the first pass, we access each segment to retrieve the document frequency of each query keyword, as well as the number of documents within the segment, and use these values to calculate the IDFs of the query keywords.
In the second pass, within each segment, we traverse the inverted list of each query keyword, and use the (local) TF of the keyword in each document and the (global) IDF of the keyword to compute the TF-IDF value for this keyword and this document. We accumulate the product of this value and the corresponding TF-IDF value of the query keyword in order to compute the dot product of the document and the query. Furthermore, since we need to normalize each document's TF-IDF vector by its length, we also accumulate the squares of the TF-IDF values of the query keywords in this document. The following is the pseudocode:
Map<DocID, Double> dotProductAccumulator;    // DocID is <SegmentID, LocalDocID>
Map<DocID, Double> vectorLengthAccumulator;

for each segment:
    for each query token w:
        for each docID on the postingList of w:
            tfidf = TF(w, docID) * IDF(w);
            dotProductAccumulator[docID] += tfidf * queryTfIdf[w];
            vectorLengthAccumulator[docID] += tfidf ^ 2;
    for each docID in this segment:
        score(docID) = dotProductAccumulator[docID] / sqrt(vectorLengthAccumulator[docID]);
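Translated into Java, the second pass could look like the following sketch. The DocID class and the getNumSegments, getPostingList, and getTf helpers, as well as the idfMap produced by the first pass, are all hypothetical placeholders, not part of the required API. For brevity, this version accumulates across all segments and computes the scores at the end; as noted below, you would normally score and release each segment's entries as you go:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Second pass: accumulate dot products and vector lengths (a sketch). */
static Map<DocID, Double> computeScores(List<String> queryTokens,
                                        Map<String, Double> idfMap,       // global IDFs from the first pass
                                        Map<String, Double> queryTfIdf) { // TF-IDF vector of the query
    Map<DocID, Double> dot = new HashMap<>();
    Map<DocID, Double> len = new HashMap<>();
    for (int seg = 0; seg < getNumSegments(); seg++) {
        for (String w : queryTokens) {
            for (int local : getPostingList(seg, w)) {                // hypothetical posting-list accessor
                DocID docId = new DocID(seg, local);
                double tfidf = getTf(seg, local, w) * idfMap.get(w);  // local TF * global IDF
                dot.merge(docId, tfidf * queryTfIdf.get(w), Double::sum);
                len.merge(docId, tfidf * tfidf, Double::sum);
            }
        }
    }
    // Final score: dot product normalized by the document vector length.
    Map<DocID, Double> score = new HashMap<>();
    dot.forEach((d, v) -> score.put(d, v / Math.sqrt(len.get(d))));
    return score;
}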
In this method, we need to store one double (the IDF) for each query keyword. In addition, for each segment, for each of its matching documents, we need to store two doubles (the two accumulator values).
Notice that since we want to return the top-K documents, we need to maintain a global priority queue of K docIDs. After visiting each segment, we only need to add the top-K docIDs of this segment into this priority queue, and we can then free the space used for the docIDs of this segment.
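One common realization of this is a bounded min-heap: keep at most K entries with the lowest-scoring entry at the head, so each new candidate either displaces the current weakest entry or is discarded. A minimal sketch, reusing the hypothetical DocID key from the pseudocode above:

import java.util.Map;
import java.util.PriorityQueue;

/** Merges one segment's scores into a global top-K min-heap (a sketch). */
static void mergeTopK(PriorityQueue<Map.Entry<DocID, Double>> topK, int k,
                      Map<DocID, Double> segmentScores) {
    for (Map.Entry<DocID, Double> cand : segmentScores.entrySet()) {
        if (topK.size() < k) {
            topK.offer(cand);
        } else if (cand.getValue() > topK.peek().getValue()) {
            topK.poll();       // evict the current weakest entry
            topK.offer(cand);
        }
    }
}
// Create the heap once with: new PriorityQueue<>(Map.Entry.comparingByValue());
// after the last segment, polling it yields the top K in ascending score order.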
To implement the method, you need the following API functions for each segment:
public int getNumDocuments(int segmentNum);

public int getDocumentFrequency(int segmentNum, String token);

/**
 * Performs top-K ranked search using TF-IDF.
 * Returns an iterator over the top K documents with the highest TF-IDF scores.
 *
 * Each element is a pair of <Document, Double (TF-IDF score)>.
 *
 * If the parameter `topK` is null, then all matching documents are returned.
 *
 * Unlike Boolean query and phrase query, where the order of the documents doesn't matter,
 * for ranked search the order of the documents returned by the iterator matters.
 *
 * @param keywords a list of keywords in the query
 * @param topK number of top documents weighted by TF-IDF; all documents if topK is null
 * @return an iterator of top-K ordered documents matching the query
 */
public Iterator<Pair<Document, Double>> searchTfIdf(List<String> keywords, Integer topK);
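For example, a ranked search for the query from the earlier example might be issued like this (a usage sketch; the invertedIndexManager variable and the Pair accessor names are placeholders that depend on your own classes):

import java.util.Arrays;
import java.util.Iterator;

Iterator<Pair<Document, Double>> results =
        invertedIndexManager.searchTfIdf(Arrays.asList("new", "new", "city"), 10);
while (results.hasNext()) {
    Pair<Document, Double> hit = results.next();
    // getLeft()/getRight() stand in for your Pair's accessors
    System.out.println(hit.getRight() + "\t" + hit.getLeft());
}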
Task 2: Implement PageRank.
Note: the description of this task could have minor changes.
In this task, you'll run the PageRank algorithm on the collection of ICS websites we provided, index all the ICS website documents, and combine PageRank with the TF-IDF weighted search results.
Download the provided webpages zip file. Here is a description of what's in the zip file:
- The "url.tsv" file contains the mapping of a webpage ID to the original URL of the webpage.
- The "id-graph.tsv" file contains the graph of the webpages. Each line is an edge in the graph represented as a pair of two document IDs. The first is the "from" documentID and the second is the "to" document ID.
- The "cleaned" folder contains the web page documents with its documentID as the file name. All web page files are already cleaned - all HTML tags are removed and only plain text are kept. In each file, the first line is the documentID, the second line is the original URL, and the third line is the text of the HTML document.
Implement the following IcsSearchEngine class.
public class IcsSearchEngine {

    /**
     * Initializes an IcsSearchEngine from the directory containing the documents
     * and the InvertedIndexManager that stores the index.
     */
    public static IcsSearchEngine createSearchEngine(Path documentDirectory, InvertedIndexManager indexManager) {
        return new IcsSearchEngine(documentDirectory, indexManager);
    }

    private IcsSearchEngine(Path documentDirectory, InvertedIndexManager indexManager) {
    }

    /**
     * Writes all ICS web page documents in the document directory to the inverted index.
     */
    public void writeIndex() {
        throw new UnsupportedOperationException();
    }

    /**
     * Computes the PageRank scores from the id-graph file in the document directory.
     * The results of the computation can be saved in a class variable and will be
     * later retrieved by `getPageRankScores`.
     */
    public void computePageRank() {
        throw new UnsupportedOperationException();
    }

    /**
     * Gets the PageRank scores of all documents previously computed. Must be called after `computePageRank`.
     * Returns a list of <DocumentID, Score> pairs sorted by score in descending order (highest scores first).
     */
    public List<Pair<Integer, Double>> getPageRankScores() {
        throw new UnsupportedOperationException();
    }

    /**
     * Searches the ICS document corpus and returns the top K documents ranked by
     * combining TF-IDF and PageRank.
     *
     * The search process should first retrieve ALL the top documents from the
     * InvertedIndex by TF-IDF rank, by calling `searchTfIdf(query, null)`.
     *
     * Then the corresponding PageRank score of each document should be retrieved
     * (`computePageRank` will have been called beforehand).
     * For each document, the combined score is tfIdfScore + pageRankWeight * pageRankScore.
     *
     * Finally, the top K documents by combined score are returned. Each element is
     * a pair of <Document, combinedScore>.
     *
     * Note: we can get the document ID by reading the first line of the document.
     * This is a workaround because our project doesn't support multiple fields;
     * we cannot keep the document ID in a separate column.
     */
    public Iterator<Pair<Document, Double>> searchQuery(List<String> query, int topK, double pageRankWeight) {
        throw new UnsupportedOperationException();
    }
}
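A typical end-to-end run might look like this (a usage sketch; the directory path, query terms, and pageRankWeight value are illustrative only):

import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Iterator;

IcsSearchEngine engine =
        IcsSearchEngine.createSearchEngine(Paths.get("webpages"), indexManager);
engine.writeIndex();       // index all ICS web pages
engine.computePageRank();  // must be called before getPageRankScores/searchQuery

Iterator<Pair<Document, Double>> results =
        engine.searchQuery(Arrays.asList("graduate", "admissions"), 10, 1.0);
while (results.hasNext()) {
    Pair<Document, Double> hit = results.next();
    System.out.println(hit.getRight() + "\t" + hit.getLeft()); // combined score, then document
}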
Test cases
Please follow the same general guidelines and procedures as in previous projects. Here are the test case assignments. Note that for this project, you only need to write at least 1 test case. You should put your test case in the package edu.uci.ics.cs221.index.ranking.