Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
MCIT Online 553 Project 2
A DHT-based Search Engine
Directions
1 Overview
You will start using standard OLSR as your routing protocol. The use of OLSR can be turned on using a command line flag --routing=NS3 to simulator-main.cc . You are then responsible for first developing Chord as an overlay network layered on top of your routing protocol, followed by building the search engine application that uses Chord. As Extra Credit you can use your routing protocols from project 1 and run project 2 over project 1 network layer --routing=LS/DV to To help you get started, files related to project 2 include:
- simulator-main.cc: In addition to what you learnt in project 1, it has SEARCH LOG() and CHORD LOG() functions to log all messages relevant to the search engine and Chord overlay, respectively. It also includes modules for CHORD and PENNSEARCH.
- scenarios/pennsearch.sce: An example scenario file that contains a simple 3 node Chord network, the publishing of document keywords by two nodes, and three example queries.
- keys/metadata0.keys, keys/metadata1.keys: These keys are located inside the contrib/upenncis553/ directory and not in the scenario directory. Each file contains the meta-data for a set of documents, where each row “DOC T1 T2 T3...” is a set of keywords (in this case, T1, T2, and T3) that can be used to search for a document with identifier DOC. Each document identifier can be a web URL or a library catalog number, and for the purpose of this project, they are simply a string of bytes that uniquely represent each document. In practice, these keywords are generated by reading/parsing web content or other documents to extract keywords. However, since parsing web pages is not the focus of this project, we have skipped this step, and supply these document keywords to you.
- penn-chord-*.[h/cc]: Skeleton code for your Chord implementation.
- penn-search-*.[h/cc]: Skeleton code for your PennSearch implementation.
- grader-logs.[h/cc]: These files contains the functions you need to invoke for autograder to cor rectly parse your submission, please read function briefs in header file.
The command to compile and run project 2 is the same as Section 2.3 and 2.4 in project 1 code doc umentation. Please do not use -result-check flag in the command and please comment out calls to checkNeigbhorTableEntry() and checkRoutingTableEntry() if you use your LS and DV implementation.
You can also opt to use ns-3’s OLSR implementation instead of your own LS and DV.
2 PennChord
Correct Chord Behavior
- Correct lookup. All lookups routed via Chord has to be proven correct (i.e. reach the correct nodevia the right intermediate nodes. You need to support the basic Chord API, which is IPAddrlookup(k), as described in lecture. This API is not exposed to the scenario file explicitly, but usedby PennSearch to locate nodes responsible for storing a given keyword. Note that lookup(k)is not a function, when a lookup is made, chord is going to take some time passing messagesbetween nodes until key owner IPAddr is found, at that time a response message is sent to thelookup requester with the IP owning the key, see callback hints section.
- Consistent routing. All nodes agree on lookup(k).
- Well-formed ring. For each node n, it’s successor’s predecessor is itself. See Ring State describedin autograder section.
- Correct storage. Every item K is stored at the correct node (i.e. lookup(k))
- Performance. For a given network size, you need to compute and output the average hop countrequired by Chord lookups that occur during the duration of your simulation. In other words,you should implement the code that will capture the number of hops required by each lookup and output using CHORD LOG the average hop count across all lookups when the simulation ends. The average hop count must exclude lookups that are generated during periodic finger fixing. After finger fixing is correctly implemented, the average hop count should be log(N) instead of N, where N is the number of nodes in the Chord overlay network.
- Stabilization protocol. Enough debugging messages (not too many!) to show us that periodic stabilization is happening correctly. Since stabilization generates large numbers of messages, you should provide mechanisms to turn on/off such debugging in your code.
Summary of Commands
- Start landmark node: Designate a node as your landmark (i.e. node 0). E.g. “0 PENNSEARCH CHORD JOIN 0” will designate node 0 as the landmark node, since the source and landmark nodes are the same.
- Nodes join: A PennChord node joins the overlay via the initial landmark node. E.g. “1 PENNSEARCH CHORD JOIN 0” will allow node 1 to join via the landmark node 0. Once a node has joined the network, items stored at the successor must be redistributed to the new node according to the Chord protocol. For simplicity, you can assume all joins and leaves are sequential, i.e. space all your join events far apart such that the successors and predecessors are updated before the next join occurs.
- Voluntary node departure: A PennChord node leave the Chord network by informing its suc cessor and predecessor of its departure. E.g. “1 PENNSEARCH CHORD LEAVE” that will result in node 1 leaving the PennChord network. All data items stored should be redistributed to neighboring nodes accordingly. For simplicity, you can assume all joins and leaves are sequential.
Cryptographic Hashes
We provide a helper class to generate 32-bit cryptographic hashes from IP address and strings. It depends on the OpenSSL libcryotp library.
To generate digest for a message, use:
createShaKey(ipAddress) or createShaKey(string)
3 PennSearch
Basics of Information Retrieval
Doc1 is searchable by keywords T1 or T2. Doc2 is searchable by T1, T2, T3, and T4, and Doc3 is searchable by T3, T4, and T5. Typically, these searchable keywords are extracted from the actual documents with identifiers Doc1, Doc2, and Doc3.
Based on these keywords, the inverted lists are {Doc1, Doc2} for T1, {Doc1, Doc2, Doc3} for T2, {Doc2, Doc3} for T3, {Doc2, Doc3} for T4, and {Doc3} for T5. Each inverted list for a given keyword essentially stores the set of documents that can be searched using the keyword. In a DHT-based search engine, for each inverted list Tn, we store each list at the node whose Chord node is responsible for thekey range that includes hash(Tn).
A query for keywords “T1 AND T2” will return the document identifiers “Doc1” and “Doc 2”, and the results are obtained by intersecting the sets {Doc1, Doc2}, and {Doc1, Doc2, Doc3}, which are the inverted lists of T1 and T2 respectively. We only deal with AND queries in PennSearch, so you can ignore queries such as “T1 OR T2”.
Note that the query result is not the actual content of the documents, but rather the document identi- fiers that represent documents that include both T1 and T2. In typical search engine, an extra document retrieval phase occurs at this point to fetch the actual documents. We consider the actual documentcontent retrieval step out of scope of this project.
Summary of Commands
- Inverted list publishing: “2 PENNSEARCH PUBLISH metadata0.keys” means that node 2 reads the document metadata file named metadata0.keys. Node 2 then reads each line, which is of the form “Doc0 T1 T2 T3 . . . ”, which means that the Doc0 is searchable by T1, T2, or T3. After reading the metadata0.keys file, node 2 constructs an inverted list for each keyword it encounters, and then publishes the respective inverted indices for each keyword into the PennChord overlay. For instance, if the inverted list for “T2” is “Doc1, Doc 2”, the command publishes the inverted list “Doc1, Doc2” to the node that T2 is hashed to. This node can be determined via a Chordlookup on hash(T2). As a simplification, the inverted lists are append-only, i.e. new DocIDs areadded into a set of existing document identifiers for a given keyword, but never deleted from aninverted list.
- Search query: “1 PENNSEARCH SEARCH 4 T1 T2” will initiate the search query from node 1, and take the following steps via node 4:
(a) Node 1 contacts node 4 with query ”T1 AND T2”;
(b) Node 4 issues a Chord lookup to find the node that stores the inverted list of T1, i.e., the node that T1 is hashed to (e.g., Node T1), and sends the query ”T1 AND T2” to Node T1;
(c) Node T1 retrieves inverted list for T1 from its local store, issues a Chord lookup to find the node that T2 is hashed to (e.g., Node T2), and sends the retrieved inverted list for T1 together with query ”T2” to Node T2;
(d) Node T2 send the intersection of the inverted lists of T1 and T2 as the final results back either directly back to node 1, or to node 4 (which forwards the results to node 1). If there are no matching documents, a “no result” is returned to node 1.
For simplicity, you can assume there is no search optimization, so inverted lists are intersected based on left-to-right ordering of search terms. Note that the search may contain arbitrary number of search terms, e.g. “1 PENNSEARCH SEARCH 4 T1 T2 T3”.
NOTES
- You can assume that each document identifier appears in only one metadataX.keys file. For instance, if node 0 publishes metadata0.keys, and node 1 publishes metadata1.keys, you can assume that both files do not contain overlapping document identifiers. This emulates the fact that in practice, each node will publish inverted indexes for documents that it owns, and one can make the assumption that each node owns a unique set of documents. On the other hand, each searchable keyword may return multiple document identifiers. For instance, there are two documents Doc2 and Doc3 that can be searched using the keyword T3.
- You can assume that only nodes that participate in the PennSearch overlay can have permission to read document keywords and publish inverted indexes. Nodes, which are outside the Chord network, may initiate SEARCH by contacting any node, which is part of the PennSearch overlay. For instance, in our example command above, “1 PENNSEARCH SEARCH 4 T1 T2” means that node 1 (which may be a node outside the PennSearch overlay) can issue a search query for “T1 and T2” via node 4, which is already part of the PennSearch overlay.
4 Milestones
- Milestone 1: (15%) (Autograded)
We expect a basic Chord implementation where the ring stabilization protocol works correctly. At this stage, finger table implementation is optional for this milestone. We require only the ringstate output to make sure your ring is formed correctly and maintained as nodes enter/leave the Chord overlay.
- Milestone 2a (39%), Milestone 2b (46%) (Both autograded)
Complete working implementation of PennChord and PennSearch. Milestone is split into two parts for easier submission.Criteria for Milestone 2A:
– Node to join the ring (3 points)
* event : node 0 - node 19 join the ring* result : should see lookupissue, lookuprequest, lookupresult log
– Well formed ring (3 points)
* event : 3 PENNSEARCH CHORD RINGSTATE* result : check well formed ring, can also check above request path
– P keyword metadata (8 points, 2 points each)
* events :2 PENNSEARCH PUBLISH metadata2.keys3 PENNSEARCH PUBLISH metadata3.keys4 PENNSEARCH PUBLISH metadata4.keys5 PENNSEARCH PUBLISH metadata5.keys* result : should see publish and store log
– Search correctness (9 points, 3 points each)
* event : 1 PENNSEARCH SEARCH 4 Johnny-Depp* result : Pirates-of-the-Caribbean Alice-in-Wonderland* event : 3 PENNSEARCH SEARCH 10 Johnny-Depp Keira-Knightley* result : Pirates-of-the-Caribbean* event : 8 PENNSEARCH SEARCH 17 George-Clooney Brad-Pitt Matt-Damon* result : Ocean’s-Eleven Ocean’s-Twelve
– Search consistency (4 points)
* events :2 PENNSEARCH SEARCH 12 Brad-Pitt3 PENNSEARCH SEARCH 13 Brad-Pitt4 PENNSEARCH SEARCH 14 Brad-Pitt* result : should all return Mr-Mrs-Smith Ocean’s-Eleven Ocean’s-Twelve Fight-Club
– Multiple searches from one node at same time (9 points, 3 points each)
* event : 15 PENNSEARCH SEARCH 15 Tom-Hardy* result : The-Dark-Knight-Rises Mad-Max* event : 15 PENNSEARCH SEARCH 3 Emilia-Clarke* result : Game-of-Thrones* event : 15 PENNSEARCH SEARCH 12 Chadwick-Boseemann* result : No such file– Non-chord node issue search (3 points)* event : 25 PENNSEARCH SEARCH 9 Tom-Hanks* result : Forrest-Gump Toy-Story* event : 21 PENNSEARCH SEARCH 16 Jeremy-Renner* result : Arrival
Criteria for Milestone 2B: Milestone 2B will test correctness for more advanced cases. It is currently designed as an unseen test - but the autograder will give information for any wrong implementations and hints on how to fix them.
Important: in this milestone (2b), you will need to change your Ring State implementation to indicate the all ring state messages for the current ring have been printed to stdout, by printing a End of Ring State message after the last node in the ring printed its ring state message.
Autograder
For your submissions you will use your GitHub repo, you are required to merge into main and submit main branch only.
On the following logs "(...)" means to use the corresponding arguments for the function, please take a moment to read the briefs in grader-logs.h.
- Ring state (MS1): At any node X, a “X PENNSEARCH CHORD RINGSTATE” command will initiate a ring output message. The node receiving this command should call the following function to log its ringstate, after that it should pass a message to its successor which should also call the following function, and so on until X is reached again.
GraderLogs::RingState(...)
Ring StateCurr<Node currNode#, currIPAddress, currHash>Pred<Node predNode#, predIPAddress, predHash>Succ<Node succNode#, succIPAddress, succHash>
Note that if CHORD LOG is turned on, this means that between each Publish and Store outputmessage, we should see a series of lookup output messages. Also, note that when a node leavesthe ring, this should trigger its keys to be transferred to another node. This transfer should result in new store messages.
Hints
This is the callback setter in chord:
Extra credits
Similar to project 1, extra credit will be submitted separately from regular credit on Gradescope. You can demo your extra credit to your assigned TA after the deadline. Make sure to document your work, design choices, and test cases.
- Handling big numbers (5%): If you use the entire 160 bits from SHA1 as the Chord ID, you will get 5% extra credit. This will require you to design a mechanism to store, add, mod, compare,and display big integers that cannot fit in primitive data types.
- Demonstrate using your LS/DV from project 1 (5%): If you use your own LS or DV instead of the default ns-3 routing for your demo, you will get 5% extra credit. This may require some mod ifications to your project 1 code to get this to work properly, in particularly implement routeInput and routeOutput functionality which is not tested by our autograder.
- Bandwidth efficient search engine (10%): The basic PennSearch can be enhanced to save band width as follows: perform the intersection of keywords starting from the one with smallest inverted list, and use of bloom filters. Implement these features and show that your implementation results in lower bandwidth utilization compared to the regular implementation.
- Enhanced search features (5% for each bullet point): Enhanced your search engine with at least the following features:
- Generating the document keywords (e.g. metadata0.keys, metadata1.keys) with actual web pages (>20) that you have downloaded, and replacement docID with actual URLs. After re turning the search results, fetch the actual documents themselves. Do the last step sparingly so as not to overload existing web servers with your http requests.
- Rank search results based on a reasonable search metric, such as TF-IDF.
- Support inverted lists that are larger than MTU size of 1500 bytes. For instance, you can use a fragmentation/defragmentation scheme that breaks up an inverted list into smaller size lists in executing the search query.
- Chord file system (20%). The Chord File System (CFS) is a fairly sophisticated application that uses Chord. Hence, this has more extra credit than other applications. You need to demonstrate tous that you can take a few fairly large files, store it in CFS, and retrieve them correctly. To earnthe entire 20%, you need to implement all the functionalities described in the CFS paper. You cansearch online for the MIT CFS paper.
- Chord performance enhancements (20%): Chord has O(log N) hop performance in the virtual ring. However, the adjacent nodes in the ring may actually be far away in terms of networkdistance. How can one take advantage of network distance information (gathered via the link/path costs from ns-3 or your project 1 routing implementation) to select better Chord neighbors? Do some research online to find techniques explored by others to solve this problem and implement one such solution.
- Churn handling and failure detection (20%): Add support to emulate failures of PennChord nodes, mechanisms to detect failures of PennChord nodes, and repair the routing state in Penn Chord accordingly when failures occur. To ensure robustness, each node needs to maintain mul tiple successors. All churn events (node joins, leaves, or failures) can happen concurrently, and a node that fails can rejoins the overlay later. You can however not concern yourself with failures at the network layer, i.e .all failures occur at the application-layer node.
- Application-layer multicast (15%): Using Chord as a building block, build and maintain a single source multicast tree as described in the lecture, where JOIN requests are routed to the root note (for each multicast group), and forwarding is set up on the reverse path. To get full credit, you need to construct the tree, show actual content being disseminated via the tree, and make sure you handle the case where nodes enter/leave the ring. You can assume graceful exits and not sudden failures of nodes.
- PennSearch on mobile devices (20%). Using your Android phones, demonstrate a small (2-3) PennSearch network running on mobile devices. This requires taking the ns-3 code and compiling it on Android. If you do not have an Android, you can use an Android emulator.