CS 5246 - Text Processing and the Web Homework #1

CS 5246 - Text Processing and the Web Homework #1 - Blog Search

In this assignment, you will be developing a search engine for blogs. As with inputs in many other domains, the input is largely free text, but semi-structured hints can be recovered from it with some effort. Also, as in other real-world problems, the input is quite noisy: the input files are HTML files that need to be post-processed to find the appropriate content.

Your assignment is to create an advanced search engine that will retrieve relevant English blog posts and comments on particular topics. To make the assignment more closed in nature, we are restricting the possible blog posts to ones published in 2007 using WordPress, a popular blogging platform.

To do this assignment, you are to utilize Yahoo!'s Build your Own Search Service (BOSS). Note: If you have access to other search engine APIs (Google or Microsoft), you are not to use these APIs -- use only the BOSS API. This is to guarantee that your system does not retrieve "better" results merely because it is connected to a different search engine.

You may work in teams of two or individually for this assignment. Scores will not be adjusted according to whether the assignment is done in a team or individually.

BOSS is a very simple system that allows you to gain programmatic access to Yahoo!'s search engine via a simple HTTP GET. Please read the documentation for more details (see "links" below). For example, to search for blogs containing "Google" from WordPress blog posts in 2007, we write a URL like:

http://boss.yahooapis.com/ysearch/web/v1/Google%20inurl:2007%20%22powered%20by%20wordpress%22?appid=appid

A few things to note about the above. First, spaces and quotation marks are URL escaped (to "%20" and "%22" respectively). If you include other special characters, you will have to escape them as well. Secondly, I've added inurl:2007 and "powered by wordpress" to the query to limit it to HTML documents that contain the phrase on the page and have "2007" as part of the URL. These restrictions limit the returns to blog entries in 2007 that are created by WordPress (yes, this doesn't include all such blogs; some sites that use WordPress eliminate the "powered by wordpress" tagline). Finally, the value of the appid attribute must be replaced by a valid appid. To get a valid appid, you need to follow the instructions to obtain a BOSS API key, as discussed in class. The output (shown here in XML format) will contain results as would be shown on Yahoo!, which include the snippet/abstract, date, size, title and various forms of the URL (see the BOSS documentation).
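The escaping rules above can be sketched in a few lines of Python. This is only an illustration, not a required design: the function name build_boss_url and the constant names are made up here, and only the endpoint and the two modifiers come from the example URL.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the example URL above.
BOSS_WEB_ENDPOINT = "http://boss.yahooapis.com/ysearch/web/v1/"

# Restrictions that the assignment requires on every query.
REQUIRED_MODIFIERS = 'inurl:2007 "powered by wordpress"'


def build_boss_url(terms, appid):
    """Return a BOSS web-search URL for `terms` plus the required modifiers.

    `build_boss_url` is a hypothetical helper name, not part of BOSS.
    """
    query = f"{terms} {REQUIRED_MODIFIERS}"
    # quote() turns spaces into %20 and double quotes into %22. The colon is
    # kept literal (safe=":") so that inurl:2007 looks like the example URL.
    escaped = quote(query, safe=":")
    return f"{BOSS_WEB_ENDPOINT}{escaped}?{urlencode({'appid': appid})}"


# "appid" is the same placeholder used in the example; substitute a real key.
print(build_boss_url("Google", "appid"))
```

Other special characters in the search terms (for example "&" or "?") are percent-escaped by the same quote() call, so they cannot be confused with the start of the appid parameter.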

To do this assignment, you will have to come up with a programmatic solution that 1) creates suitable queries given statements of information need, 2) caters to the semi-structured and noisy nature of blogs.

To assess your submissions, we will use statements of information need, which your system should automatically convert into queries to find relevant documents. Your retrieval results will be assessed against answers compiled by the class. Each student or team will be assigned two needs and must find relevant blog posts for them in the corpus. The answer to each need will be a list of documents that the student or team compiles. The combined query answers will be used to grade system performance. Below are the 15 information needs that will be used to test all systems; another 2 have been withheld for private testing (to be made public only after the submission time). You can download these needs as a zip archive (v2.0; revised on Sun Sep 28 16:31:10 SGT 2008). Needs will be provided to your system as a 2-line input text file to be read from standard input (not as a command-line argument), in which the first line is the title of the query (e.g., the bolded part) and the second line is the description. Since you should keep your BOSS API key to yourself, your program must read the BOSS API key from a specific file, named "boss.key", in the top level of your submit directory.

  1. Virginia Tech Shootings: Relevant documents will express opinions about the shootings and/or hypothesize on the motives of the killer. Factual reports of the shootings are not considered relevant.
  2. Traditional Chinese Medicine: Relevant documents will discuss opinions or examples of where any form of traditional Chinese medicine works or fails.
  3. Technologies for Solving Global Warming: Documents that name and discuss different technologies or products that may contribute to the reduction of global warming are considered relevant.
  4. Who Wants To Be a Millionaire Hosts: Documents that give an opinion about the accent, appearance or attitude of the host of the program are considered relevant. Note that this show is syndicated and is hosted by different hosts in different countries.
  5. UCAS Application Process: Documents that give an opinion about the UK higher education admissions process are relevant. Posts that discuss how the system may be improved are also relevant. Stories about individuals' experiences are also relevant. Factual pages that discuss the process from administrative posts are not considered relevant.
  6. Water purification: Documents that discuss different methods of water purification and treatment at the industrial level (not consumer level) are considered relevant.
  7. Currency Exchange Rates: Documents that discuss whether exchange rates between any two currencies will rise or fall are relevant. Posts that discuss only historical fluctuations are not relevant.
  8. Best Games for the Nintendo Wii: Documents that give the authors' or commenters' opinions of their favorite games for the Wii are considered relevant. Posts that give factual information on sales rankings or rankings on a particular web site are not relevant.
  9. Surfing sites in Australia: Documents that discuss opinions on different sites for surfing in Australia are considered relevant. Sites for other sea sports such as diving and sailing are not considered relevant.
  10. Saddam Hussein: Documents that discuss the former Iraqi president's role in the fate of his country and countrymen are relevant. Documents that discuss opinions on his execution are also relevant.
  11. Halo 3: Documents that discuss the game or its beta version are considered relevant. Documents that primarily discuss previous installments of this game are irrelevant.
  12. Hawaii Sights: Documents that discuss different tourists' opinions of any of the Hawai'ian islands' sights and sounds are considered relevant. Hotel and restaurant recommendations by tourists by themselves are not relevant.
  13. Airline Frequent Flier Programmes: Documents that discuss the different benefits and restrictions of different airline companies' frequent flier programs are relevant. Documents where the poster just states that the poster belongs to specific programme(s) are not relevant.
  14. Phones for SMS: Documents that discuss which mobile phones are best for sending short messages are considered relevant. Documents that just describe other aspects of a mobile phone are considered irrelevant.
  15. Republican Nominations: Documents that discuss the chances of possible candidates for the Republican nomination for the US presidential race are relevant. Documents that discuss congressional candidates or Democratic candidates are not relevant.
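As a rough sketch of the input handling described before the list, the following Python reads a two-line need and the API key. The function names are hypothetical, and the sketch assumes that boss.key holds nothing but the key itself, which the assignment does not spell out.

```python
import io
import sys
from pathlib import Path


def read_need(stream):
    """Read a two-line need: the title on line 1, the description on line 2."""
    title = stream.readline().strip()
    description = stream.readline().strip()
    if not title or not description:
        raise ValueError("expected a title line followed by a description line")
    return title, description


def read_api_key(path="boss.key"):
    """Return the key stored in `path`.

    Assumption: the file contains only the key, possibly followed by a newline.
    """
    return Path(path).read_text(encoding="utf-8").strip()


def main():
    """Entry point sketch: the need comes from standard input, not from argv."""
    title, description = read_need(sys.stdin)
    appid = read_api_key()
    return title, description, appid


# Demonstration with need 11 from the list above, using an in-memory stream
# in place of standard input.
demo_input = io.StringIO(
    "Halo 3\n"
    "Documents that discuss the game or its beta version are considered relevant.\n"
)
print(read_need(demo_input))
```

In a real run, main() would be called from the runHT0000000 executable with the need file redirected to standard input.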

Note that since this assignment comprises at least 25% of your grade, I expect the level of effort to be commensurate. You have five weeks to do this assignment. You should start immediately by finishing your judgments of which documents are relevant to which information needs. Hopefully this will give you an idea of how to code your search engine, which you can then follow through on to complete the assignment.

What to turn in

You will upload an HT0000000.zip archive (where HT0000000 is your matric ID, with all letters in uppercase) by the due date, consisting of the following four sets of items. Please use a ZIP (not RAR, BZ2 or TAR) utility to construct your submission. Do not include a subdirectory for the submission to extract to (e.g., unzipping X.zip should give files like X.sum, not X/X.sum or submission/X.sum). Matric numbers should start with U, NT, HT or HD for all students in this class. Your cooperation with the submission format will allow me to grade the assignment in a timely manner. Note that I do not want to know who you are when grading assignments, so it is important that you try not to reveal your identity in your submission. Please follow the instructions below to the letter.

  1. A summary file in plain text (not MS Word, not OpenOffice) that describes your submission and the architecture for retrieval. You should include your matric number and your NUS (u|g) prefixed email address as the only form of ID. In this file you also need to describe how your source code can be built and executed on sf3/sunfire. If your submission cannot be run on sunfire, you'll need to demonstrate it to me soon after the submission date (by downloading your submission file and running it on your system). The link to the demonstration sign up is here; demonstrations will be from 5-8pm on 7 Oct. You should include notes about the development of your submission, and special features that you developed to handle the structure of the queries and documents (filename: ReadmeHT0000000.txt, where HT0000000 is your matric ID). Warning! If you use any lexicons, resources, code or algorithmic descriptions that are beyond the references on this page, you need to give proper credit and acknowledge the contribution of others. Please cite or acknowledge work that helped you that you did not do on your own. I will deduct the credit accordingly, if applicable. Failure to acknowledge your sources constitutes plagiarism and will be punished accordingly.
  2. Two gold-standard lists of relevant documents, one for each of the two needs you were assigned. You should assess relevance only on the basis of the HTML file. Each list should give the information need ID on the first line, and the relevance judgement (+ or -) and URL of each judged document on the subsequent lines. The judgement and URL should be separated by a space; see this example file. You should list at least fifty documents, and annotate more relevant documents if possible. These files should be named nX-gold.txt, where X should be replaced by the need ID. Documents that are not in English should be judged as irrelevant.
  3. Fifteen files for the retrieval results for all 15 public queries. These should be in a similar form to the gold-standard files: the need ID on the first line and the URLs of relevant documents (in relevance order) on the subsequent lines. These files should be named nX.txt, where X should be replaced by the need ID. A sample file is here. Each list should have fifty results. I will generate the final two files for the test queries during testing or have you generate them on the fly if a demo is necessary.
  4. Your source code tree. This should be relatively well documented so that I can follow the logic of your code with the help of the ReadmeHT0000000.txt file. Typing "make" or "ant" should build the appropriate code, such as an executable, if needed. In your assignment submission, please do not assume that any environment variables (e.g., PATH and CLASSPATH) are necessarily correctly set. The executable file to run your system should be named runHT0000000 (where HT0000000 is to be replaced by your matric number, as above) and be set as executable (by you or by your buildfile if it is compiled). In retrieving candidate blog posts for your system to filter or rerank, you are required to add the inurl:2007 and "powered by wordpress" modifiers to your Yahoo! BOSS queries.
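The two file formats in items 2 and 3 can be illustrated with a small Python sketch. It assumes that the need ID is written as a bare number on the first line; the linked sample files are the authority on that detail. The function names and the example.com URLs are placeholders, not real judgments.

```python
def format_results(need_id, ranked_urls, limit=50):
    """Render the text of an nX.txt file: need ID, then URLs in relevance order."""
    lines = [str(need_id)] + list(ranked_urls)[:limit]
    return "\n".join(lines) + "\n"


def parse_gold(text):
    """Parse the text of an nX-gold.txt file into (need_id, [(is_relevant, url)])."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty gold file")
    need_id, judgments = lines[0], []
    for line in lines[1:]:
        # Each judgment line is "+ URL" or "- URL", separated by one space.
        mark, _, url = line.partition(" ")
        if mark not in ("+", "-") or not url:
            raise ValueError(f"malformed judgment line: {line!r}")
        judgments.append((mark == "+", url.strip()))
    return need_id, judgments


# Placeholder data only; these URLs are not real blog posts.
gold_text = "11\n+ http://example.com/2007/05/halo-3-beta\n- http://example.com/2007/01/halo-2\n"
print(parse_gold(gold_text))
print(format_results(11, ["http://example.com/2007/05/halo-3-beta"]), end="")
```

Parsing your own gold files with a checker like this before submission is a cheap way to catch a missing space or a stray blank judgment.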

Grading scheme

Your grade will take into account 1) features used, 2) retrieval accuracy, 3) peer annotation, 4) documentation and 5) time efficiency. These factors are listed in order of importance/weighting to your final grade for the assignment. Warning -- I will be reading your code, so please make sure it is tidy and well documented.

  • [36 percent] Features used. This will be judged on the basis of your code and your summary file: what features you use, whether you take advantage of the semi-structure in the input, and how you modify the ranking score to get the final results.
  • [32 percent] Retrieval accuracy. This will be judged based on the pooled relevance judgments that all students turn in (the nX-gold.txt files in your submission). I will also include some additional test queries that you will not know ahead of time.
  • [20 percent] Peer Annotation. To judge #2 (retrieval accuracy) I will be looking at your annotated results to check for completeness and good manual retrieval. Note that our corpus is only a tiny fraction of all blogs on the web, so there will be lots of relevant posts not found by using our criteria (Yahoo! BOSS, inurl:2007 and "powered by wordpress"); these you do not have to worry about.
  • [10 percent] Documentation. How well the summary file and source code are documented. This will include how easy it is for me to run your software and the state of your code (is it readable, and is the workflow well partitioned?).
  • [2 percent] Time efficiency of the system. As long as the system takes no longer than 5 minutes to produce a result for a need, it will be considered satisfactory.
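One way to stay inside the five-minute limit when downloading pages is to combine a per-request timeout with an overall time budget. The sketch below is an assumption about how you might structure this, not a required design; the function names and the 10-second and 240-second defaults are arbitrary choices.

```python
import http.client
import time
import urllib.request


def fetch_page(url, timeout_seconds=10.0):
    """Download one page.

    The timeout applies to each blocking socket operation (connect, read),
    not to the download as a whole.
    """
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()


def fetch_within_budget(urls, budget_seconds=240.0, fetch=fetch_page,
                        clock=time.monotonic):
    """Fetch pages in order until the time budget is spent, skipping failures."""
    deadline = clock() + budget_seconds
    pages = {}
    for url in urls:
        if clock() >= deadline:
            break  # out of time: rank with whatever has been downloaded so far
        try:
            pages[url] = fetch(url)
        except (OSError, http.client.HTTPException):
            continue  # timeout or network error: drop this page and move on
    return pages
```

The budget is only checked between downloads, so it should be set comfortably below five minutes to leave room for the last request and for ranking. The fetch and clock parameters exist so that the loop can be tested without network access.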

Due date and late policy

According to the syllabus, this homework is due on 2 Oct at 11:59 pm SGT. Submit your zip file to the IVLE workbin by this time. The late policy for submissions applies as per the policy set forth on the "Grading" page.

References

  • The BOSS homepage. Probably not as useful as the forum or the PDF documentation.
  • WordPress - the free blogging platform, which we are targeting in our search.
  • wget - an open-source command-line URL fetching utility. Also already installed on sunfire. Recommended for interacting with BOSS.
  • The Sentiment AI Yahoo! Group, a group of researchers that look at identifying statements of opinion.
  • A fairly recent opinion lexicon that you might use in your assignment.
  • You might use the Apache Lucene IR engine to process and retrieve locally downloaded documents.

Hints

  1. The bulk of this assignment is to think about how to best utilize statements of information need and how to process them. You'll need to figure out how to decide what parts of the statements to keep and which to throw away or weight negatively. You may want to combine the results of several searches together using your own weighting, or incorporate external knowledge from lexicons that you've created yourself or mined from other resources.
  2. The assignment is also difficult technically as you have to deal with XML or JSON output formats. Do plan to spend a bit of time learning how to interpret this output format programmatically using your preferred programming language. Note that your programs have to run on sunfire; otherwise you have to demonstrate that your programs run on a laptop that only uses open-source software (private proprietary libraries are prohibited for assignments).
  3. You can use Yahoo! BOSS to access lots of different types of searches from Yahoo!, including spelling correction, news and the general web. You can use such searches to glean auxiliary, supplemental information which can be used to help you in ranking candidates or in expanding your search.
  4. Yahoo! BOSS accepts all of the query syntax in Yahoo!'s web search. Since most of you count yourselves as savvy web searchers, you should be able to figure out how to use some of the more esoteric searches to help you. If you're not so sure, check here.
  5. You can use external sources in RPNLPIR (such as lexica like WordNet or statistics like IDF statistics over the WebBase corpus) to assist your programs. If you do plan to use external resources, please be aware that they take time to compile and preprocess into a useable form for you to take advantage of.
  6. You may find it helpful to download the documents yourself and process them. If you do download documents, given the five-minute deadline for each query, please make sure that your program doesn't hang if faced with a recalcitrant page download.
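Hint 1 mentions combining the results of several searches with your own weighting. One simple possibility, shown here purely as an illustration, is weighted reciprocal-rank scoring: each list contributes weight / rank to every URL it contains. This technique is not prescribed by the assignment, the function name is made up, and the sketch assumes each input list contains no duplicate URLs.

```python
from collections import defaultdict


def merge_ranked_lists(ranked_lists, weights, limit=50):
    """Combine several ranked URL lists using weighted reciprocal ranks.

    A negative weight demotes URLs returned by that search, which is one way
    to act on parts of a need that describe what is NOT relevant.
    """
    scores = defaultdict(float)
    for urls, weight in zip(ranked_lists, weights):
        for rank, url in enumerate(urls, start=1):
            scores[url] += weight / rank
    # Highest score first; ties are broken by URL so the output is stable.
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return ordered[:limit]


# Placeholder identifiers stand in for URLs returned by two different queries.
opinion_query = ["a", "b", "c"]
factual_query = ["b", "d"]
print(merge_ranked_lists([opinion_query, factual_query], [1.0, 1.0]))
print(merge_ranked_lists([opinion_query, factual_query], [1.0, -1.0]))
```

With equal weights, a URL that appears in both lists rises to the top; with a negative weight on the second list, the same URL is pushed down instead.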

Disclaimer

I'm not affiliated with Yahoo!, WordPress or other search engine companies nor am I advocating their products. However, as blogs are a current interest in IR and Yahoo! has an easy-to-use, non-limited API, I have chosen to use these tools for our assignment.
