Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
CS221: Project 3 - PI: Positional index, phrase search, compression
Coding Tasks
- Implement LSM-like Positional Index (insertion and merge). (6 points)
- Implement Phrase Search. (3 points)
- Implement Compression. (6 points)
Testing Tasks
- Write at least 2 test cases for a task (2 points)
- Review the test cases of two teams (2 points)
Total: 19 points
Overview
In this project, you'll be implementing a disk-based positional index and phrase search operations. Compared to the inverted index in Project 2, a positional index stores the positions of all the occurrences of a keyword in a document. In addition, you'll compress both the inverted index and the positional index.
The positional index supports phrase search, which means searching a sequence of keywords in the exact order. For example, the query "new york" will only match documents with "new" immediately followed by "york". A document like "new ... other tokens ... york" won't match.
The "position" of a word in the positional index can have different meanings. It could be either a character position or a token position. Token positions can include or exclude stop words. Different kinds of "position" definitions also have different implications on phrase search. For example, for the document "University of California, Irvine", if we search the phrase "University California", the handling of the stop word "of" can affect the search result.
In this project, we store token positions in the positional index. In particular, we use tokens generated from the analyzer. If the analyzer removes stop words, then stop words are also excluded from the positional index. For example, the token positions of "cat and dog" are "{cat: 0}, {dog: 1}".
Task 1: Implement an LSM-based positional index.
The on-disk index structure of the positional index is similar to the LSM structure in Project 2. The behavior for addDocument(), flush(), and merge() operations should be the same as Project 2. You should continue using the InvertedIndexManager class from Project 2.
The index format of the positional index should be on top of the inverted index format in Project 2. Recall in Project 2, we have 3 files per segment: docStore, dictionary, and invertedLists. The dictionary stores the pointer from a token to the inverted list of the token. In this project, we need another "positional" file to store the list of positions of a token in a document. The invertedLists needs to store an additional pointer to its position list per documentID.
You need to implement the following functions newly added to InvertedIndexManager
/**
* Creates a positional index with a given folder, analyzer, and compressor.
* Compressor must be used to compress the inverted lists and position lists.
*
*/
public static InvertedIndexManager createOrOpenPositional(String indexFolder, Analyzer analyzer, Compressor compressor);
/**
* Reads a disk segment of a positional index into memory based on segmentNum.
* This function is mainly used for checking correctness in test cases.
*
* Throws UnsupportedOperationException if the inverted index is not a positional index.
*
* @param segmentNum n-th segment in the inverted index (start from 0).
* @return in-memory data structure with all contents in the index segment, null if segmentNum don't exist.
*/
public PositionalIndexSegmentForTest getIndexSegmentPositional(int segmentNum);
/**
* An in-memory representation of a positional index segment, used *only* for testing purposes.
* Only for this testing class, you could load everything into memory.
*/
public class PositionalIndexSegmentForTest {
private final Map<String, List<Integer>> invertedLists;
private final Map<Integer, Document> documents;
private final Table<String, Integer, List<Integer>> positions;
}
The positional index should support ALL operations from the inverted index in Project 2 (searchQuery(), searchAndQuery(), searchOrQuery(), documentIterator(), getIndexSegment()). When doing these operations, you could ignore the positional part of the index. We would run all project 2 test cases with your positional index and they should still pass.
When designing and implementing this task, you may find limitations of your Project 2 implementation. It is recommended to take this chance to fix those problems to make the development of Project 3 easier.
Task 2: Implement phrase search.
In this task, you'll implement phrase search using the positional index. You should make same assumptions about IO and tokenization as in project 2.
The keywords in the list variable phrase represents a consecutive sequence of keywords. A document is a result only if it matches the sequence in its exact order.
Implement the following function:
/** * Performs a phrase search on a positional index. * Phrase search means the document must contain the consecutive sequence of keywords in the exact order. * * You could assume that the analyzer won't convert each keyword into multiple tokens. * Throws UnsupportedOperationException if the inverted index is not a positional index. * * @param phrase, a consecutive sequence of keywords * @return a iterator of documents matching the query */ public Iterator<Document> searchPhraseQuery(List<String> phrase);
Task 3: Implement compression.
In this task, you'll implement a compressor based on delta encoding and variable-length encoding. Implement the encode and decode functions for DeltaVarLenCompressor.
Delta encoding means storing the difference (delta) between adjacent integers rather than the actual numbers. For example, given a list [3, 5, 8, 13, 20], the delta encoding stores [3, 2, 3, 5, 7].
Variable-length encoding of an integer works similarly to UTF-8. We use a variable number of bytes instead of a constant number (4) of bytes to represent an integer. Specifically, in each byte, the first bit indicates if there are more bytes to read - 1 means reading the next byte and 0 means termination, and the rest 7 bits are used to encode the integer value.
For example, the integer value 13 is encoded as 00001101. The first bit 0 means no more bytes to read, and the rest 7 bits 0001101 represent the number 13. The integer value 128 (2^7) is encoded as 10000001 00000000. The first bit in the first byte indicates the inclusion of the next byte. The integer value 16,384 (2^14) is encoded as 10000001 10000000 00000000.
/** * Encodes a list of integers to a byte array. */ byte[] encode(List<Integer> integers); /** * Decodes part of a byte array to a list of integers. * * @param bytes bytes to decode * @param startOffset starting position to decode * @param length number of bytes to decode from start position */ List<Integer> decode(byte[] bytes, int startOffset, int length);
The compressor should be used to compress the inverted lists and the positional lists. The compressor interface has a reference implementation NaiveCompressor, which does normal integer encoding without any compression. In test cases, two positional indexes will be constructed with NaiveCompressor and DeltaVarLenCompressor and the read/write counters will be compared to determine the compression ratio.
Test cases
Please follow the similar general guideline and procedure as in Project 1 and 2. Here are the test case assignments.
Guidelines for p3:
- Put all test cases in the package edu.uci.ics.cs221.index.positional with the naming convention in the spreadsheet.
- The grade will be also based the quality of the test cases.
Detail description of test cases:
- Task1: Flush Positional Segments / Task1: Merge Positional Segments. Test the correctness of positional index flushing and merging operations using getNumSegments and getIndexSegmentPositional. Verify the content of the positional index is correct.
- Task2: Phrase Search. Construct various phrase search queries and check the result the query is correct.
- Task3: Compressor Correctness. Test the correctness of encode and decode functions of the DeltaVarLenCompressor. Verify the result byte array matches delta encoding and variable length encoding. Don't test the compressor with invalid inputs.
- Task3: Compare Index with Compression. Test the effect of the compressor to the positional index. Build the positional index using the no compression NaiveCompressor using non-trivial amount of data and record the IO counter values. Then Build the positional index using same data with DeltaVarLenCompressor. Compare the IO counter values to verify the compression ratio.
- Overall: Large Data Stress Test. Test the positional index with a large amount of data. Using same dataset with project 2 stress test is prohibited.