首页 » 计算机科学(Computing Science) » CS 221: Information Retrieval Project 2

CS 221: Information Retrieval Project 2

2024-08-03 Admin 写评论

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

CS221: Project 2 - II: Inverted index, boolean search

Codebase

github repo

Coding Tasks

Implement LSM-like disk-based inverted index that supports insertions. (6 points)
Implement merge of inverted index segments. (4 points)
Implement keyword search, boolean AND search, and boolean OR search. (5 points)
(Optional Extra Credit): Implement deletions. (2 points)

Testing Tasks

Write at least 2 test cases for a task (2 points)
Review the test cases of two teams (2 points)

Total: 19 points (+ 2 extra credits)

Overview

In this project, you'll be implementing a disk-based inverted index and the search operations.

At a high level, inverted index stores a mapping from keywords to the ids of documents they appear in. A simple in-memory structure could be Map<String, List<Integer>>, where each key is a keyword token (also called a "term"), and each value is a list of (often sorted) document IDs (also called "postings").

In this project, the disk-based index structure is based on the idea of LSM (Log-Structured Merge tree). Its main idea is the following:

The inverted index consists of multiple index segments, where each segment is initially created in memory. Once a segment is written to disk, it becomes immutable and is never changed.

Each index segment is a fully searchable inverted index. It contains a posting list structure as well as a document store, which is a mapping from a docID to the corresponding document. The document IDs within each segment are local in the segment and are invisible to the user. These in-disk segments are periodically merged to bigger segments.

When users search a keyword, all segments are searched, and the result documents from each segment are combined.

Example:

Add documents Doc{"cat dog"} and Doc{"cat elephant"}, then flush to Segment0.
Segment0:
----------
PostingList: {"cat": [0, 1], "dog": [0], "elephant": [1]}
DocStore: {0: "cat dog", 1: "cat elephant"}
----------

Then add documents Doc{"cat dog"} and Doc{"wolf dog"}, and flush to Segment1.
Segment1:
----------
PostingList: {"cat": [0], "dog": [0, 1], "wolf": [1]}
DocStore: {0: "cat dog", 1: "wolf dog"}
----------

When searching the word "cat", we first search Segment0 and get [Doc{"cat dog"}, Doc{"cat elephant"}].
Then we search Segment1 and get [Doc{"cat dog"}]. We combine these results and get [Doc{"cat dog"}, Doc{"cat elephant"}, Doc{"cat dog"}]

Task 1: Implement LSM-like disk-based inverted index that supports insertions only.

In this task, you'll implement the disk file structure of a single segment. When a document is added via addDocument(), it should be first stored in an in-memory buffer. You need to design the data structure for the in-memory segment.

Whenever the number of documents reaches a parameter default_flush_threshold, or function flush() is called, you should flush the segment to disk.

The specific format of the disk posting lists should follow what we cover in lectures. You also have freedom to improve the format to make it more efficient.

The following are specific functions to implement:

/**
 * Adds a document to the inverted index.
 * Document should live in an in-memory buffer until flush() is called to write the segment to disk.
 * @param document
 */
public void addDocument(Document document)

/**
 * Flushes all the documents in the in-memory buffer to disk. If the buffer is empty, it should not do anything.
 * flush() writes the segment to disk containing the posting list and the corresponding document store.
 */
public void flush()

/**
 * Iterates through all the documents in all the disk segments.
 */
public Iterator<Document> documentIterator() {
    throw new UnsupportedOperationException();
}

/**
 * Gets the total number of segments in the inverted index.
 * This function is used for checking correctness in test cases.
 *
 * @return number of index segments.
 */
public int getNumSegments()

/**
 * Reads a disk segment into memory based on segmentNum.
 * This function is mainly used for checking correctness in test cases.
 *
 * @param segmentNum n-th segment in the inverted index (starting from 0).
 * @return in-memory data structure with all the contents in the index segment, null if 
 * the segment of segmentNum doesn't exist.
 */
public InvertedIndexSegmentForTest getIndexSegment(int segmentNum)

Task 2: Implement merge of disk segments.

In this task, you'll implement the merging of disk segments. We cannot let the number of segments grow indefinitely because otherwise searching a keyword needs to go through a lot of segments.

In general, there are many merging policies. In this task, we want to implement a particular policy. Whenever the number of segments has reached a parameter default_merge_threshold, or mergeAllSegments() is called, you need to merge all the disk segments pair-wise. For example, suppose there are 10 disk segments. After the merge, we should have 5 disk segments. You could assume merging only happens when you have an even number of segments.

When merging two segments into one, since each segment has its own local document IDs, you need to generate new document IDs for the merged segment. When merging two segments, you need to merge both the inverted index, as well as the documents store of the two segments. For simplicity, you can assume we have enough memory to hold the keywords of both segments. BUT, you cannot assume we have enough memory to store the posting lists and documents of both segments.

The following is specific function to implement:

/**
 * Merges all the disk segments of the inverted index pair-wise.
 */
public void mergeAllSegments()

Task 3: Implement keyword search, boolean AND search, and boolean OR search.

In this task, you'll implement searching using the inverted index. You could assume all documents are flushed to disk segments when doing a search.

Here we make the same assumption as in the merge case regarding what can be stored in memory.

For every query keyword, you need to first analyze it using the provided analyzer before using it to access the inverted index. You can assume the analyzer will not convert one keyword to multiple keywords. If the keyword is empty, searching should not return any results.

The following are specific functions to implement:

/**
 * Performs a single-keyword search on the inverted index.
 * You could assume the analyzer won't convert the keyword into multiple tokens.
 * If the keyword is empty, it should not return anything.
 *
 * @param keyword keyword, cannot be null.
 * @return a iterator of documents matching the query
 */
public Iterator<Document> searchQuery(String keyword)

/**
 * Performs an AND boolean search on the inverted index.
 *
 * @param keywords a list of keywords in the AND query
 * @return a iterator of documents matching the query
 */
public Iterator<Document> searchAndQuery(List<String> keywords)
}

/**
 * Performs an OR boolean search on the inverted index.
 *
 * @param keywords a list of keywords in the OR query
 * @return a iterator of documents matching the query
 */
public Iterator<Document> searchOrQuery(List<String> keywords)

Task 4 (Optional Extra Credit): Implement deletions.

In the LSM-like index structure, deletion could be implemented by maintaining a list of deleted document IDs per segment. Each deleted document is not physically deleted in the inverted index nor document store. When reading or searching, each docID is checked to see if it has been logically deleted. If so, we will not return it to the user.

Those deleted documents within a segment should be physically deleted when we merge it with another segment.

The following is the specific function to implement:

/**
 * Deletes all documents in all disk segments of the inverted index that match the keyword.
 * @param keyword 
 */
public void deleteDocuments(String keyword)

Test cases

Please follow the similar general guideline and procedure as in project 1. Here is test task assignment. There are some guidelines and tips for project 2 test cases:

Put the index and document files under your own folder. Specifically, you should use folder index/YourTestName/, for example index/Team0StressTest.
Clean up and delete all files after each test. You should use Junit @After to delete all written files.
For testing tasks 1 and 2, you could change default_flush_threshold or default_merge_threshold, or directly call flush() and mergeAllSegments() to control when to flush or when to merge. If you changed the variables default_flush_threshold or default_merge_threshold, be sure to change them back to their original value after your tests.
For stress test, you should collect or generate a large amount of text data to test the performance and stability. If you rely on external data sets, please don't commit the large data directly in git. Instead, use a source URL to download the data.
For all test tasks, you should also check the read/write counter values in the PageFileChannel class to make sure the IO number is within a reasonable range.

发表评论

电子邮件地址不会被公开。必填项已用*标注

姓名 *

电子邮件 *

验证码 *