首页 » 计算机科学(Computing Science) » MCIT Online 553 Project 2 A DHT-based Search Engine

MCIT Online 553 Project 2 A DHT-based Search Engine

2025-03-31 Admin 写评论

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

MCIT Online 553 Project 2
A DHT-based Search Engine

Directions

You must work in groups of 4 for this project. Please regularly check Ed through out this course for project specification clarifications and FAQ update. You are required to version control your code, but please only use the GitHub repository created for you by the 553 staff, you should have a different repo for project 2 vs the one used in project 1 even if your team members stayed the same. Do not work in public GitHub repositories! Please avoid publishing this project at any time, even post-submission, to observe course integrity policies. If you are caught using code from other groups or any sources from public code repositories, your entire group will receive ZERO for this assignment, and will be sent to the Office of Student Conduct where there will be additional sanctions imposed by the university.

1 Overview

In this project, you will implement a peer-to-peer search engine (PennSearch) that runs over your implementation of the Chord Distributed Hash Table (PennChord). You are expected to read the entire Chord paper carefully, and clarify any doubts with your TAs, instructor, or post questions on Ed.

You will start using standard OLSR as your routing protocol. The use of OLSR can be turned on using a command line flag --routing=NS3 to simulator-main.cc . You are then responsible for first developing Chord as an overlay network layered on top of your routing protocol, followed by building the search engine application that uses Chord. As Extra Credit you can use your routing protocols from project 1 and run project 2 over project 1 network layer --routing=LS/DV to To help you get started, files related to project 2 include:

simulator-main.cc: In addition to what you learnt in project 1, it has SEARCH LOG() and CHORD LOG() functions to log all messages relevant to the search engine and Chord overlay, respectively. It also includes modules for CHORD and PENNSEARCH.
scenarios/pennsearch.sce: An example scenario file that contains a simple 3 node Chord network, the publishing of document keywords by two nodes, and three example queries.
keys/metadata0.keys, keys/metadata1.keys: These keys are located inside the contrib/upenncis553/ directory and not in the scenario directory. Each file contains the meta-data for a set of documents, where each row “DOC T1 T2 T3...” is a set of keywords (in this case, T1, T2, and T3) that can be used to search for a document with identifier DOC. Each document identifier can be a web URL or a library catalog number, and for the purpose of this project, they are simply a string of bytes that uniquely represent each document. In practice, these keywords are generated by reading/parsing web content or other documents to extract keywords. However, since parsing web pages is not the focus of this project, we have skipped this step, and supply these document keywords to you.
penn-chord-*.[h/cc]: Skeleton code for your Chord implementation.
penn-search-*.[h/cc]: Skeleton code for your PennSearch implementation.
grader-logs.[h/cc]: These files contains the functions you need to invoke for autograder to cor rectly parse your submission, please read function briefs in header file.

The command to compile and run project 2 is the same as Section 2.3 and 2.4 in project 1 code doc umentation. Please do not use -result-check flag in the command and please comment out calls to checkNeigbhorTableEntry() and checkRoutingTableEntry() if you use your LS and DV implementation.

You can also opt to use ns-3’s OLSR implementation instead of your own LS and DV.

Note that our skeleton code is a starting point, and you are allowed to be creative and structure your code based on your own design. For regular credits, you are however not allowed to make any code changes outside of upenn-cis553 directory, nor are you allowed to change simulator-main.cc. In addition to our sample test script, we expect you to design additional test cases that demonstrate the correctness of your PennSearch implementation. We encourage you to get started early. As a result, we have included a milestone 1 where you should have a basic Chord implementation.

2 PennChord

Chord nodes will execute as an overlay network on top of your existing routing protocol, i.e., all messages between any two Chord nodes 1 and 2 will traverse the shortest path computed by the underlyingrouting protocol. The PennChord overlay network should be started only after the routing protocol hasconverged (i.e. finish computing all the routing tables). You can assume that the links used in the underlying routing protocol computation does not changed while PennChord is executed. Also, not all nodesneed to participate in the Chord overlay, i.e. the number of nodes in your underlying network topologymay be larger than the number of nodes in the Chord overlay. Your PennChord implementation shouldinclude finger tables and the use of 32-bit cryptographic hashes.

Correct Chord Behavior

We will also be using our own test cases to ensure correct Chord behavior. For simplicity, assumeeach node only maintains one successor (i.e. the closest node in the clockwise direction). You haveto implement the actual stabilization functionality described in the Chord paper. In particular, we will check for the following features:

Correct lookup. All lookups routed via Chord has to be proven correct (i.e. reach the correct nodevia the right intermediate nodes. You need to support the basic Chord API, which is IPAddrlookup(k), as described in lecture. This API is not exposed to the scenario file explicitly, but usedby PennSearch to locate nodes responsible for storing a given keyword. Note that lookup(k)is not a function, when a lookup is made, chord is going to take some time passing messagesbetween nodes until key owner IPAddr is found, at that time a response message is sent to thelookup requester with the IP owning the key, see callback hints section.
Consistent routing. All nodes agree on lookup(k).
Well-formed ring. For each node n, it’s successor’s predecessor is itself. See Ring State describedin autograder section.
Correct storage. Every item K is stored at the correct node (i.e. lookup(k))
Performance. For a given network size, you need to compute and output the average hop countrequired by Chord lookups that occur during the duration of your simulation. In other words,you should implement the code that will capture the number of hops required by each lookup and output using CHORD LOG the average hop count across all lookups when the simulation ends. The average hop count must exclude lookups that are generated during periodic finger fixing. After finger fixing is correctly implemented, the average hop count should be log(N) instead of N, where N is the number of nodes in the Chord overlay network.
Stabilization protocol. Enough debugging messages (not too many!) to show us that periodic stabilization is happening correctly. Since stabilization generates large numbers of messages, you should provide mechanisms to turn on/off such debugging in your code.

Summary of Commands

Start landmark node: Designate a node as your landmark (i.e. node 0). E.g. “0 PENNSEARCH CHORD JOIN 0” will designate node 0 as the landmark node, since the source and landmark nodes are the same.
Nodes join: A PennChord node joins the overlay via the initial landmark node. E.g. “1 PENNSEARCH CHORD JOIN 0” will allow node 1 to join via the landmark node 0. Once a node has joined the network, items stored at the successor must be redistributed to the new node according to the Chord protocol. For simplicity, you can assume all joins and leaves are sequential, i.e. space all your join events far apart such that the successors and predecessors are updated before the next join occurs.
Voluntary node departure: A PennChord node leave the Chord network by informing its suc cessor and predecessor of its departure. E.g. “1 PENNSEARCH CHORD LEAVE” that will result in node 1 leaving the PennChord network. All data items stored should be redistributed to neighboring nodes accordingly. For simplicity, you can assume all joins and leaves are sequential.

Cryptographic Hashes

We provide a helper class to generate 32-bit cryptographic hashes from IP address and strings. It depends on the OpenSSL libcryotp library.

To generate digest for a message, use:

createShaKey(ipAddress) or createShaKey(string)

There is an opportunity to earn extra credit when using a 160 bit as Chord ID, but our provided helper class does not help with this. To generate a 160-bit hash, please look at lybcrypto’s SHA1 function.

3 PennSearch

To test your Chord implementation, we will require you to write PennSearch, which is a simple keyword based search engine.

Basics of Information Retrieval

We first provide some basic knowledge that you would need to understand keyword-based information retrieval. We consider the following three sets of document keywords, one for each of Doc1, Doc2, and Doc3:

Doc1 is searchable by keywords T1 or T2. Doc2 is searchable by T1, T2, T3, and T4, and Doc3 is searchable by T3, T4, and T5. Typically, these searchable keywords are extracted from the actual documents with identifiers Doc1, Doc2, and Doc3.

Doc1 T1 T2

Doc2 T1 T2 T3 T4

Doc3 T2 T3 T4 T5

Based on these keywords, the inverted lists are {Doc1, Doc2} for T1, {Doc1, Doc2, Doc3} for T2, {Doc2, Doc3} for T3, {Doc2, Doc3} for T4, and {Doc3} for T5. Each inverted list for a given keyword essentially stores the set of documents that can be searched using the keyword. In a DHT-based search engine, for each inverted list Tn, we store each list at the node whose Chord node is responsible for thekey range that includes hash(Tn).

A query for keywords “T1 AND T2” will return the document identifiers “Doc1” and “Doc 2”, and the results are obtained by intersecting the sets {Doc1, Doc2}, and {Doc1, Doc2, Doc3}, which are the inverted lists of T1 and T2 respectively. We only deal with AND queries in PennSearch, so you can ignore queries such as “T1 OR T2”.

Note that the query result is not the actual content of the documents, but rather the document identi- fiers that represent documents that include both T1 and T2. In typical search engine, an extra document retrieval phase occurs at this point to fetch the actual documents. We consider the actual documentcontent retrieval step out of scope of this project.

Summary of Commands

Inverted list publishing: “2 PENNSEARCH PUBLISH metadata0.keys” means that node 2 reads the document metadata file named metadata0.keys. Node 2 then reads each line, which is of the form “Doc0 T1 T2 T3 . . . ”, which means that the Doc0 is searchable by T1, T2, or T3. After reading the metadata0.keys file, node 2 constructs an inverted list for each keyword it encounters, and then publishes the respective inverted indices for each keyword into the PennChord overlay. For instance, if the inverted list for “T2” is “Doc1, Doc 2”, the command publishes the inverted list “Doc1, Doc2” to the node that T2 is hashed to. This node can be determined via a Chordlookup on hash(T2). As a simplification, the inverted lists are append-only, i.e. new DocIDs areadded into a set of existing document identifiers for a given keyword, but never deleted from aninverted list.
Search query: “1 PENNSEARCH SEARCH 4 T1 T2” will initiate the search query from node 1, and take the following steps via node 4:

(a) Node 1 contacts node 4 with query ”T1 AND T2”;
(b) Node 4 issues a Chord lookup to find the node that stores the inverted list of T1, i.e., the node that T1 is hashed to (e.g., Node T1), and sends the query ”T1 AND T2” to Node T1;
(c) Node T1 retrieves inverted list for T1 from its local store, issues a Chord lookup to find the node that T2 is hashed to (e.g., Node T2), and sends the retrieved inverted list for T1 together with query ”T2” to Node T2;
(d) Node T2 send the intersection of the inverted lists of T1 and T2 as the final results back either directly back to node 1, or to node 4 (which forwards the results to node 1). If there are no matching documents, a “no result” is returned to node 1.

For simplicity, you can assume there is no search optimization, so inverted lists are intersected based on left-to-right ordering of search terms. Note that the search may contain arbitrary number of search terms, e.g. “1 PENNSEARCH SEARCH 4 T1 T2 T3”.

NOTES

You can assume that each document identifier appears in only one metadataX.keys file. For instance, if node 0 publishes metadata0.keys, and node 1 publishes metadata1.keys, you can assume that both files do not contain overlapping document identifiers. This emulates the fact that in practice, each node will publish inverted indexes for documents that it owns, and one can make the assumption that each node owns a unique set of documents. On the other hand, each searchable keyword may return multiple document identifiers. For instance, there are two documents Doc2 and Doc3 that can be searched using the keyword T3.
You can assume that only nodes that participate in the PennSearch overlay can have permission to read document keywords and publish inverted indexes. Nodes, which are outside the Chord network, may initiate SEARCH by contacting any node, which is part of the PennSearch overlay. For instance, in our example command above, “1 PENNSEARCH SEARCH 4 T1 T2” means that node 1 (which may be a node outside the PennSearch overlay) can issue a search query for “T1 and T2” via node 4, which is already part of the PennSearch overlay.

4 Milestones

Milestone 1: (15%) (Autograded)

We expect a basic Chord implementation where the ring stabilization protocol works correctly. At this stage, finger table implementation is optional for this milestone. We require only the ringstate output to make sure your ring is formed correctly and maintained as nodes enter/leave the Chord overlay.

Milestone 2a (39%), Milestone 2b (46%) (Both autograded)

Complete working implementation of PennChord and PennSearch. Milestone is split into two parts for easier submission.

Criteria for Milestone 2A:

– Node to join the ring (3 points)

* event : node 0 - node 19 join the ring

* result : should see lookupissue, lookuprequest, lookupresult log

– Well formed ring (3 points)

* event : 3 PENNSEARCH CHORD RINGSTATE

* result : check well formed ring, can also check above request path

– P keyword metadata (8 points, 2 points each)

* events :

2 PENNSEARCH PUBLISH metadata2.keys

3 PENNSEARCH PUBLISH metadata3.keys

4 PENNSEARCH PUBLISH metadata4.keys

5 PENNSEARCH PUBLISH metadata5.keys

* result : should see publish and store log

– Search correctness (9 points, 3 points each)

* event : 1 PENNSEARCH SEARCH 4 Johnny-Depp

* result : Pirates-of-the-Caribbean Alice-in-Wonderland

* event : 3 PENNSEARCH SEARCH 10 Johnny-Depp Keira-Knightley

* result : Pirates-of-the-Caribbean

* event : 8 PENNSEARCH SEARCH 17 George-Clooney Brad-Pitt Matt-Damon

* result : Ocean’s-Eleven Ocean’s-Twelve

– Search consistency (4 points)

* events :

2 PENNSEARCH SEARCH 12 Brad-Pitt

3 PENNSEARCH SEARCH 13 Brad-Pitt

4 PENNSEARCH SEARCH 14 Brad-Pitt

* result : should all return Mr-Mrs-Smith Ocean’s-Eleven Ocean’s-Twelve Fight-Club

– Multiple searches from one node at same time (9 points, 3 points each)

* event : 15 PENNSEARCH SEARCH 15 Tom-Hardy

* result : The-Dark-Knight-Rises Mad-Max

* event : 15 PENNSEARCH SEARCH 3 Emilia-Clarke

* result : Game-of-Thrones

* event : 15 PENNSEARCH SEARCH 12 Chadwick-Boseemann

* result : No such file

– Non-chord node issue search (3 points)

* event : 25 PENNSEARCH SEARCH 9 Tom-Hanks

* result : Forrest-Gump Toy-Story

* event : 21 PENNSEARCH SEARCH 16 Jeremy-Renner

* result : Arrival

Criteria for Milestone 2B: Milestone 2B will test correctness for more advanced cases. It is currently designed as an unseen test - but the autograder will give information for any wrong implementations and hints on how to fix them.

Important: in this milestone (2b), you will need to change your Ring State implementation to indicate the all ring state messages for the current ring have been printed to stdout, by printing a End of Ring State message after the last node in the ring printed its ring state message.

Autograder

Autograder parses specific logs from your submission, please make sure you have all of your own printouts commented out before making your submission, having your own printouts interleaved with the logs autograder expects is going to cause it to crash.

For your submissions you will use your GitHub repo, you are required to merge into main and submit main branch only.

On the following logs "(...)" means to use the corresponding arguments for the function, please take a moment to read the briefs in grader-logs.h.

Ring state (MS1): At any node X, a “X PENNSEARCH CHORD RINGSTATE” command will initiate a ring output message. The node receiving this command should call the following function to log its ringstate, after that it should pass a message to its successor which should also call the following function, and so on until X is reached again.

GraderLogs::RingState(...)

You should see something like this for every node in the ring, note that if your implementation is correct ID’s, IP’s, and Hashes should each form a double linked-list.

Ring State

Curr<Node currNode#, currIPAddress, currHash>

Pred<Node predNode#, predIPAddress, predHash>

Succ<Node succNode#, succIPAddress, succHash>

The last node in ringstate (X predecessor) should call the following function after invoking RingState.

GraderLogs::EndOfRingState()

• Lookup issue: Every time a node issues a lookup request, the following should be called:

CHORD_LOG(GraderLogs::GetLookupIssueLogStr(...))

• Lookup forwarding: Every time a node forwards a lookup request, the following should be called:

CHORD_LOG(GraderLogs::GetLookupForwardingLogStr(...))

• Lookup results: Every time a node returns a result in response to a lookup request back to the node that originated the initial lookup request, the following should be called:

CHORD_LOG(GraderLogs::GetLookupResultLogStr(...))

• Inverted list publish: Whenever a node publishes a new inverted list entry (an entry is a key,value pair), the following should be called:

SEARCH_LOG(GraderLogs::GetPublishLogStr(...))

• Inverted list storage : Whenever a node (that the keyword is hashed to) receives a new inverted list entry to be stored, the following should be called:

SEARCH_LOG(GraderLogs::GetStoreLogStr(...))

Note that if CHORD LOG is turned on, this means that between each Publish and Store outputmessage, we should see a series of lookup output messages. Also, note that when a node leavesthe ring, this should trigger its keys to be transferred to another node. This transfer should result in new store messages.

• Search: Whenever a node issues a search query with terms T1, T2,. . . ,Tn, the following shouldbe called:

SEARCH_LOG(GraderLogs::GetSearchLogStr(...))

• Inverted list shipping: For each inverted list <docIDList> being shipped in the process of thesearch, the following should be called:

SEARCH_LOG(GraderLogs::GetInvertedListShipLogStr(...))

• Search results: At the end of intersecting all keywords (T1, T2, . . . , Tn), output the final document list <docIDList> that is being sent back to the initial query node:

SEARCH_LOG(GraderLogs::GetSearchResultsLogStr(...))

• Average Hop Count: The following function should be called on the destructor of each nodechord layer, note that only hops related to searches are counted here (i.e. search layer, not chord layer) for MS1 you should see O(N) and for MS2 you should see O(log(N)) hops. Example: If node 8 gets a lookup related to search, then it forwards to node 10 (hop 1), node 10 forwards tonode 15 (hop 2), and node 15 sends the response to node 8 (no hop here, it is a response, not alookup anymore!).

GraderLogs::AverageHopCount(...)

Hints

• Own Address: Use the following API to obtain current node IP in the application layer:

Ipv4Address GetLocalAddress();

• Node Hash key: Use the following API to obtain node hash:

uint32_t CreateShaKey(const Ipv4Address &ip)

• Callbacks: For MS2, make sure you understand callbacks implementation, check how pings are implemented in chord layer (these are different pings from project1, those were network layer pings, here we implemented application layer pings).

This is the callback setter in chord:

void PennChord::SetPingSuccessCallback(

Callback<void, Ipv4Address, std::string> pingSuccessFn)

{

m_pingSuccessFn = pingSuccessFn;

}

This is Penn-search using the setter to register a callback (i.e. ”Hey Chord when you have a

PingSuccess please execute my function HandleChordPingSuccess”)

// Configure Callbacks with Chord

m_chord->SetPingSuccessCallback(MakeCallback(

&PennSearch::HandleChordPingSuccess, this));

This is the actual function in Penn-Search

8void PennSearch::HandleChordPingSuccess(

Ipv4Address destAddress, std::string message)

{

SEARCH_LOG("Chord Ping Success! ....");

// Send ping via search layer

SendPennSearchPing(destAddress, message);

}

This is Chord executing the callback

m_pingSuccessFn(sourceAddress, message.GetPingRsp().pingMessage);

Extra credits

We have suggested several avenues for extra credit, to enable students to experiment with the challenges faced in designing the Chord protocol to work 24x7 over the wide-area. Note that doing extra credit is entirely optional. We offer extra-credit problems as a way of providing challenges for those students with both the time and interest to pursue certain issues in more depth.

Similar to project 1, extra credit will be submitted separately from regular credit on Gradescope. You can demo your extra credit to your assigned TA after the deadline. Make sure to document your work, design choices, and test cases.

Handling big numbers (5%): If you use the entire 160 bits from SHA1 as the Chord ID, you will get 5% extra credit. This will require you to design a mechanism to store, add, mod, compare,and display big integers that cannot fit in primitive data types.
Demonstrate using your LS/DV from project 1 (5%): If you use your own LS or DV instead of the default ns-3 routing for your demo, you will get 5% extra credit. This may require some mod ifications to your project 1 code to get this to work properly, in particularly implement routeInput and routeOutput functionality which is not tested by our autograder.
Bandwidth efficient search engine (10%): The basic PennSearch can be enhanced to save band width as follows: perform the intersection of keywords starting from the one with smallest inverted list, and use of bloom filters. Implement these features and show that your implementation results in lower bandwidth utilization compared to the regular implementation.
Enhanced search features (5% for each bullet point): Enhanced your search engine with at least the following features:

Generating the document keywords (e.g. metadata0.keys, metadata1.keys) with actual web pages (>20) that you have downloaded, and replacement docID with actual URLs. After re turning the search results, fetch the actual documents themselves. Do the last step sparingly so as not to overload existing web servers with your http requests.
Rank search results based on a reasonable search metric, such as TF-IDF.
Support inverted lists that are larger than MTU size of 1500 bytes. For instance, you can use a fragmentation/defragmentation scheme that breaks up an inverted list into smaller size lists in executing the search query.

Chord file system (20%). The Chord File System (CFS) is a fairly sophisticated application that uses Chord. Hence, this has more extra credit than other applications. You need to demonstrate tous that you can take a few fairly large files, store it in CFS, and retrieve them correctly. To earnthe entire 20%, you need to implement all the functionalities described in the CFS paper. You cansearch online for the MIT CFS paper.
Chord performance enhancements (20%): Chord has O(log N) hop performance in the virtual ring. However, the adjacent nodes in the ring may actually be far away in terms of networkdistance. How can one take advantage of network distance information (gathered via the link/path costs from ns-3 or your project 1 routing implementation) to select better Chord neighbors? Do some research online to find techniques explored by others to solve this problem and implement one such solution.
Churn handling and failure detection (20%): Add support to emulate failures of PennChord nodes, mechanisms to detect failures of PennChord nodes, and repair the routing state in Penn Chord accordingly when failures occur. To ensure robustness, each node needs to maintain mul tiple successors. All churn events (node joins, leaves, or failures) can happen concurrently, and a node that fails can rejoins the overlay later. You can however not concern yourself with failures at the network layer, i.e .all failures occur at the application-layer node.
Application-layer multicast (15%): Using Chord as a building block, build and maintain a single source multicast tree as described in the lecture, where JOIN requests are routed to the root note (for each multicast group), and forwarding is set up on the reverse path. To get full credit, you need to construct the tree, show actual content being disseminated via the tree, and make sure you handle the case where nodes enter/leave the ring. You can assume graceful exits and not sudden failures of nodes.
PennSearch on mobile devices (20%). Using your Android phones, demonstrate a small (2-3) PennSearch network running on mobile devices. This requires taking the ns-3 code and compiling it on Android. If you do not have an Android, you can use an Android emulator.

发表评论

电子邮件地址不会被公开。必填项已用*标注

姓名 *

电子邮件 *

验证码 *