CS 221: Information Retrieval Project 1

2024-08-03 Admin

CS221: Project 1 - TA: Text Analyzers (stemming and tokenization)

Overview

Setups

Setup Development Environment
Clone the CS221 repository
Join Github Organization and create a repo for your team

Coding Tasks

Implement a simple tokenizer based on punctuation and whitespace. (3 points)
Implement a Dynamic-Programming-based Word-Break Tokenizer. (7 points)
Incorporate a Porter stemmer. (2 points)
(Optional) Extra Credit: Implement a dynamic-programming-based Chinese or Japanese tokenizer. (3 points)

Testing Tasks

Write at least 2 test cases for a task (3 points)
Review the test cases of two teams (2 points)

Total: 17 points (+ 3 extra credits)

Setup Environments

Task	Guide
Install Java 8	We require Java 8
Install Maven	Windows: tutorial, Mac: brew install maven, Ubuntu: sudo apt-get install maven
Set up IntelliJ	IntelliJ IDEA (strongly recommended)

Clone CS 221 repository

Go to spring19-cs221-project and follow the README instructions to import the CS221 project into your IntelliJ. IntelliJ is needed because a library we use is written in Kotlin, which has built-in support in IntelliJ.

Join Github Organization and create a repo for your team

This course uses GitHub for version control and for submitting final code, test cases, and reviews. Students are expected to use GitHub as explained below.

Create an account on Github if you don't have one. Then provide us your username in the Google Spreadsheet.
Wait for the invitation from our staff to join the UCI-Chenli-teaching organization. The invitation might take a few days to be sent out based on our schedule.
One member from each team needs to create a private repository. The repository name should be of the form "cs221-spring19-team-x", where "x" is your assigned team number, e.g., "cs221-spring19-team-1". Other members can then be added to the repository as collaborators (by following steps given here).
You must wait for our invitation and create the private repository inside the UCI-Chenli-teaching organization. Do not create the repository under your own account. 1) Go to the UCI-Chenli-teaching organization. 2) Click the New button to create a repository. 3) Type in the name and choose Private.

We have a wiki to get you started with GitHub for this course. If you are new to Git or GitHub, go through one of these online tutorials first.

Coding Tasks

Task 1: Implement a simple tokenizer based on punctuation and whitespace (3 points)

Implement this tokenizer in analysis/PunctuationTokenizer.java

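For example, the text "I am Happy Today!" should be tokenized to ["happy", "today"]: "i" and "am" are stop words and are dropped, the punctuation is removed, and the remaining tokens are lowercased.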

Requirements:

Whitespace (space, tab, newline, etc.) and the punctuation characters provided by us should be used as delimiters to tokenize the text.
Whitespace and punctuation should be removed from the result tokens.
All tokens should be converted to lower case.
Stop words should be filtered out. Use the stop word list provided in StopWords.java.
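
The requirements above can be sketched roughly as follows. This is only a minimal illustration, not the reference solution: the class name PunctuationTokenizerSketch, the PUNCTUATION string, and the tiny STOP_WORDS set are placeholders, and the real project supplies its own punctuation set and the list in StopWords.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

final class PunctuationTokenizerSketch {
    // Assumption: stand-in for the punctuation set supplied with the project.
    private static final String PUNCTUATION = ",.;?!";
    // Assumption: tiny stand-in for the list in StopWords.java.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("i", "am", "a", "the", "is"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Whitespace and punctuation act as delimiters and are not returned.
        StringTokenizer st = new StringTokenizer(text, " \t\n\r\f" + PUNCTUATION);
        while (st.hasMoreTokens()) {
            String token = st.nextToken().toLowerCase(Locale.ROOT);
            // Stop words are compared after lowercasing.
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I am Happy Today!")); // prints [happy, today]
    }
}
```

Lowercasing before the stop-word check means "I" and "i" are treated the same, which matches the example in this task.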

Task 2: Implement a Dynamic-Programming based Word-Break Tokenizer (7 points)

Word break is the problem of determining, given a dictionary and a string (text with all whitespace removed), how to break the string into a sequence of dictionary words. Implement this tokenizer in analysis/WordBreakTokenizer.java. As an example, the input string "catdog" should be broken into the tokens ["cat", "dog"].

Use frequency statistics to choose the best segmentation when there is more than one way to break a string. For example:

Input string: "ai"
Dictionary with probabilities: "a": 0.1, "i": 0.1, "ai": 0.05

Alternative 1: ["a", "i"], with probability p("a") * p("i") = 0.01
Alternative 2: ["ai"], with probability p("ai") = 0.05
["ai"] is chosen as the result because it has the higher probability (0.05 > 0.01).

We provide an English dictionary corpus with frequency information in "resources/cs221_frequency_dictionary_en.txt".

Requirements:

Use Dynamic Programming for efficiency purposes.
Use the given dictionary corpus and frequency statistics to determine an optimal alternative.

The probability is calculated as the product of each token's probability, assuming the tokens are independent.

A match in the dictionary is case insensitive. Output tokens should all be in lower case.
Stop words should be removed.
If there's no possible way to break the string, throw an exception.
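
One possible shape for the dynamic program is sketched below. It is an illustration under assumptions, not the required design: the class name WordBreakSketch is hypothetical, the dictionary is passed in as a map from lower-case words to probabilities (the provided corpus file would first have to be loaded and its frequencies normalized), and stop-word removal is left out. It adds log-probabilities instead of multiplying probabilities; both rank segmentations identically, and the sums avoid floating-point underflow on long strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class WordBreakSketch {
    // prob: lower-case word -> probability (assumed already normalized).
    static List<String> tokenize(String text, Map<String, Double> prob) {
        String s = text.toLowerCase(Locale.ROOT);
        int n = s.length();
        // best[i]: highest log-probability of any segmentation of s[0, i).
        double[] best = new double[n + 1];
        // prev[i]: start index of the last word in that best segmentation.
        int[] prev = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0; // the empty prefix has probability 1
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Double.NEGATIVE_INFINITY) continue; // prefix unbreakable
                Double p = prob.get(s.substring(start, end));
                if (p == null) continue; // not a dictionary word
                double score = best[start] + Math.log(p);
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }
        if (best[n] == Double.NEGATIVE_INFINITY) {
            throw new IllegalArgumentException("no valid word break for: " + text);
        }
        // Walk the back-pointers from the end, then reverse into reading order.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(s.substring(prev[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Double> prob = new HashMap<>();
        prob.put("a", 0.1);
        prob.put("i", 0.1);
        prob.put("ai", 0.05);
        System.out.println(tokenize("ai", prob)); // prints [ai]
    }
}
```

Each prefix is solved once and reused, so the table has n + 1 entries and the double loop examines O(n^2) substrings, instead of enumerating every possible segmentation.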

Task 3: Incorporate a Porter stemmer (2 points)

Stemming is the process of reducing a word into its "stem" ("root") form.

Porter stemming is a classic and popular algorithm that uses a set of rules and steps to process a token. We ask you to incorporate the following existing Porter stemmer implementation into this project:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/PorterStemmer.java
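
One way to wire a stemmer into the tokenizer output is to apply a string-to-string stemming function to each token. The sketch below shows only that wiring. The class name StemmingSketch and the method stemAll are hypothetical, and the toy function in main is a stand-in that is NOT the Porter algorithm; in the project, the function passed in would delegate to the incorporated PorterStemmer class, assuming it is called through a method that maps a word to its stem.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class StemmingSketch {
    // Applies the given stemming function to every token, preserving order.
    static List<String> stemAll(List<String> tokens, UnaryOperator<String> stemmer) {
        return tokens.stream().map(stemmer).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy stand-in, NOT the Porter algorithm: strips one trailing "s".
        UnaryOperator<String> toy =
                w -> w.length() > 1 && w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        System.out.println(stemAll(Arrays.asList("cats", "dog"), toy)); // prints [cat, dog]
    }
}
```

Keeping the stemmer behind a function argument lets the same code be tested with a trivial stand-in and run with the real stemmer.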

Task 4: Implement a dynamic-programming based Chinese or Japanese tokenizer (Optional Extra Credit, 3 points)

Tokenizing Chinese or Japanese text is challenging because there are no explicit spaces between words. It is very similar to the word-break problem in task 2.

Use the same dictionary-frequency, dynamic-programming-based algorithm as in Task 2 to implement a Chinese or Japanese tokenizer. For fairness, you must choose a language that is NOT your native language.

You need to find a Chinese or Japanese dictionary corpus with frequency information on your own, and write at least 3 test cases to test the correctness of your tokenizer.

Testing Tasks

Task: Submitting Test Cases (3 points)

For this project, we require each team to submit at least 2 test cases. We expect you to write high-quality test cases; grades will be based on their correctness, quality, and documentation.

Each team will be assigned to write test cases for 1 specific task (Punctuation-based Tokenizer, or Word Break Tokenizer, or Porter Stemmer). Check the Google Spreadsheet to see on which task your team should write tests.

The test cases should follow these general guidelines:

Create a new class under the corresponding package in test/java/edu.uci.ics.cs221/.... The class name should follow the naming convention Team#TaskNameTest.
Write tests using the JUnit testing framework. Use the Assert functions from JUnit.
Test cases should be independent of one another. JUnit runs test cases in an arbitrary order.
Each test case should have informative comments.

The test cases should be submitted via GitHub pull requests. To submit a test case:

Fork the spring19-cs221-project into your own GitHub account. Clone your fork to your local machine.
In your own fork repo, go to "settings -> Collaborators", add the TA "zuozhiw AT gmail DOT com" as a collaborator.
Add the test cases, commit, and push to your own fork's master branch.
In your own fork repo, click "Pull request" and open a pull request to merge into the original repo UCI-Chenli-teaching/spring19-cs221-project.
Follow the title and content format of the template pull request.

Task: Peer Review Test Cases (2 points)

For this project, we require each team to review 2 other teams' test cases after the test cases are submitted. The reviewers need to leave comments under the Github Pull Requests to discuss problems or suggestions with the authors, and approve the pull request.
