
COMP24011 Lab 4:
BM25 for Retrieval-Augmented Question Answering

Introduction

In this exercise, you will develop your own implementation of the BM25 scoring algorithm, one of the most popular methods for information retrieval. Apart from their traditional uses in search engines and document ranking, information retrieval methods have recently been employed to enhance the question answering (QA) capabilities of generative large language models (LLMs). Such models, e.g. ChatGPT, can answer questions based on knowledge learned during their training on large amounts of textual data. However, they suffer from well-known limitations, including a tendency to hallucinate (i.e., make up answers that are factually wrong) and biases learned from their training data.

A workaround to these issues is to integrate an information retrieval module into the question answering pipeline, so that the LLM can access factual information stored in relevant documents and use it when producing its output.

If you follow this manual all the way to the end, you will have the opportunity to observe how BM25 enables an LLM to provide more accurate answers to questions. Your main task for this exercise, however, is to implement pre-processing techniques, compute the BM25 score of each (pre-processed) document in relation to a (pre-processed) question, and return the topmost relevant documents based on the scores.
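
For reference, a commonly used formulation of BM25 scores a document D against a question Q as

    score(D, Q) = \sum_{t \in Q} IDF(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}

where f(t, D) is the frequency of term t in D, |D| is the length of D in terms, avgdl is the average document length in the corpus, and k_1 and b are free parameters. A common choice of inverse document frequency is

    IDF(t) = \ln\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)

where N is the number of documents in the corpus and n(t) is the number of documents containing t. Several variants of these formulas exist, so follow the exact definitions and parameter values given in the task descriptions rather than treating the above as definitive.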

For this exercise, you are provided with the following text files as resources:

transport_inventions.txt

The content of this file was drawn from Wikipedia’s timeline of transportation technology. We will consider this file as a corpus, i.e., a collection of documents, whereby each line corresponds to one document. Given that there are 10 lines in the file, this corpus consists of 10 documents.

music_inventions.txt

The content of this file was drawn from Wikipedia’s timeline of music technology. We will consider this file as another corpus. As in the first corpus, each line corresponds to one document. Given that there are 10 lines in the file, this corpus consists of 10 documents. 

stopwords_en.txt

This file contains a stop word list taken from the Natural Language Toolkit (NLTK). This is a list of words that are commonly used in the English language and yet do not bear meaning on their own.

Every line in the file is a stop word.

If you make changes to the contents of these files, this will change the expected behaviour of the lab code that you're developing, and you won't be able to compare its results to the examples in this manual. But you can always use git to revert these resources to their original state.
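
For illustration only (as explained in the Assignment section, the lab code loads these files for you), reading a corpus or the stop word list simply amounts to treating each line of the file as one item:

# Minimal sketch: one document per line of a corpus file,
# one stop word per line of the stop word list.
with open('transport_inventions.txt', encoding='utf-8') as f:
    corpus = [line.strip() for line in f if line.strip()]

with open('stopwords_en.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f if line.strip()]

print(len(corpus))     # 10 documents
print(len(stopwords))  # number of stop words in the NLTK list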

To complete this lab you will need a third-party stemming tool called PyStemmer. You can install it by issuing the following command

$ pip install pystemmer
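
Once installed, PyStemmer is imported as the Stemmer module. A quick way to check that the installation works (the example outputs match the stemmed forms you will see later in this manual):

import Stemmer

stemmer = Stemmer.Stemmer('english')
print(stemmer.stemWord('airplane'))                       # airplan
print(stemmer.stemWords(['laboratories', 'commercial']))  # ['laboratori', 'commerci']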

The BM25 Retrieval System

Once you refresh the lab4 branch of your GitLab repo you will find the following Python files.

run_bm25.py

This is the command-line tool that runs each separate NLP task according to the subcommand (and the parameters) provided by the user. It contains the RunNLP class.

nlp_tasks_base.py

This module contains the NLPTasksBase “abstract” class that specifies the signatures of four methods you need to implement, and implements the interface used in RunNLP.

nlp_tasks.py

This is the module that you need to complete for this exercise. It contains the NLPTasks class that is derived from NLPTasksBase, and must implement its abstract methods in order to complete the BM25-based retrieval of documents relevant to a given question.

In order to successfully complete this lab you will need to understand both nlp_tasks_base.py and nlp_tasks.py but you do not need to know the details of how run_bm25.py is coded.
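
To give a rough idea of the structure, your solution is a subclass that fills in the abstract methods. The sketch below shows the shape only; the actual method names and signatures you must implement are the ones declared in nlp_tasks_base.py:

from nlp_tasks_base import NLPTasksBase

class NLPTasks(NLPTasksBase):
    # Implement the four abstract methods declared in NLPTasksBase here:
    # text pre-processing, IDF calculation, BM25 scoring and retrieval
    # of the top-scoring documents.
    ...
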
Once you complete this exercise, the BM25 tool will be able to obtain the documents most relevant to a given question. This BM25 retrieval system provides comprehensive help messages. To get started run the command
$ ./run_bm25.py -h
usage: run_bm25.py [-h] -c CORPUS [-w STOPWORDS] [-s]
                   {preprocess_question,preprocess_corpus,IDF,BM25_score,top_matches}
                   ...

options:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        path to corpus text file (option required except for
                        the preprocess_question command)
  -w STOPWORDS, --stopwords STOPWORDS
                        path to stopwords text file (option required unless
                        stopwords are located at ./stopwords_en.txt)
  -s, --stemming        enable stemming

subcommands:
  select which NLP command to run

  {preprocess_question,preprocess_corpus,IDF,BM25_score,top_matches}
    preprocess_question
                        get preprocessed question
    preprocess_corpus   get preprocessed corpus
    IDF                 calculate IDF for term in corpus
    BM25_score          calculate BM25 score for question in corpus document
    top_matches         find top scoring documents in corpus for question

Notice that for most subcommands you need to specify which corpus to work with, choosing between the two corpora described in the Introduction: transport_inventions.txt or music_inventions.txt.

On the other hand, unless you move the stopwords list to another directory, you should not need to give its location.
The tool has a boolean flag that controls whether stemming should be applied when pre-processing text. By default it is set to False, but you can set it to True using the -s/--stemming option. This will affect the way your text pre-processing code for Task 1 below should work.

The BM25 tool supports five subcommands: preprocess_question, preprocess_corpus, IDF, BM25_score and top_matches. The first two will call your text pre-processing implementation, the others will call the corresponding functions that you’ll develop in Tasks 2 to 4 below. Each of these subcommands has its own help message which you can access with commands like
$ ./run_bm25.py top_matches -h
usage: run_bm25.py top_matches [-h] question n

positional arguments:
  question    question string
  n           number of documents to find

options:
  -h, --help  show this help message and exit
The BM25 tool will load the stopwords list and corpus as required for the task. For example,
running the command
$ ./run_bm25.py preprocess_question "Who flew the first motor-driven airplane?"
nlp params: (None, './stopwords_en.txt', False)
debug run: preprocess_question('Who flew the first motor-driven airplane?',)
ret value: flew first motor driven airplane
ret count: 32
will not load the corpus, as text pre-processing is only applied to the given question string. Note that text pre-processing should, in general, return a different value if stemming is enabled. In fact, for the same question as in the previous example you can expect
$ ./run_bm25.py -s preprocess_question "Who flew the first motor-driven airplane?"
nlp params: (None, './stopwords_en.txt', True)
debug run: preprocess_question('Who flew the first motor-driven airplane?',)
ret value: flew first motor driven airplan
ret count: 31
To pre-process the text of a whole corpus you should use the preprocess_corpus subcommand.
For example, once you’ve finished Task 1 you should get
$ ./run_bm25.py -s -c music_inventions.txt preprocess_corpus
nlp params: ('music_inventions.txt', './stopwords_en.txt', True)
debug run: preprocess_corpus()
ret value: [
'1940 karl wagner earli develop voic synthes precursor vocod',
'1941 commerci fm broadcast begin us',
'1948 bell laboratori reveal first transistor',
'1958 first commerci stereo disk record produc audio fidel',
'1959 wurlitz manufactur sideman first commerci electro mechan drum machin',
'1963 phillip introduc compact cassett tape format',
'1968 king tubbi pioneer dub music earli form popular electron music',
'1982 soni philip introduc compact disc',
'1983 introduct midi unveil roland ikutaro kakehashi sequenti circuit dave smith',
'1986 first digit consol appear']
ret count: 10
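
To illustrate how a pre-processed corpus feeds into the scoring stage, the following standalone sketch computes BM25 scores using the common formulation shown in the Introduction. The parameter values k1=1.5 and b=0.75 are only placeholder defaults; your actual solution must implement this logic inside the NLPTasks methods, following whatever formula variant and parameters the task descriptions prescribe:

import math

def bm25_scores(question, corpus, k1=1.5, b=0.75):
    """Score every pre-processed document in corpus against a pre-processed question."""
    docs = [doc.split() for doc in corpus]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N

    def idf(term):
        n_t = sum(1 for d in docs if term in d)             # documents containing the term
        return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # one common IDF variant

    scores = []
    for d in docs:
        score = 0.0
        for term in question.split():
            f = d.count(term)                               # term frequency in this document
            score += idf(term) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

# Ranking then reduces to sorting document indices by score, e.g.
# scores = bm25_scores(preprocessed_question, preprocessed_corpus)
# top_n = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:n]
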
Assignment

For this lab exercise, the only Python file that you need to modify is nlp_tasks.py. You will develop your own version of this script, henceforth referred to as “your solution” in this document.

Before you get started with developing this script, it might be useful for you to familiarise yourself with how the NLPTasksBase “abstract” class will initialise your NLPTasks objects:

• The documents in the specified corpus are loaded into a list of strings; this list becomes the value of the field self.original_corpus
• The stop words in the specified stop word list file are loaded into a list of strings, which becomes the value of the field self.stopwords_list
• If stemming is enabled, an instance of the third-party Stemmer class is created and assigned to the field self.stemmer

In addition, the pre-processing of the corpus and of the question strings is done automatically in the NLPTasksBase abstract class. The pre-processed text for these become available as the fields self.preprocessed_corpus and self.preprocessed_question, respectively.
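
As a sketch of how these fields might be used inside your pre-processing code (the exact tokenisation and normalisation rules you must follow are the ones given in Task 1, so treat this only as an illustration consistent with the examples above; it also assumes self.stemmer is only set when stemming is enabled):

import re

def _preprocess(self, text):
    # Illustrative only: lowercase, split on non-alphanumeric characters,
    # drop stop words, then stem the remaining tokens if stemming is enabled.
    tokens = re.split(r'[^a-z0-9]+', text.lower())
    tokens = [t for t in tokens if t and t not in self.stopwords_list]
    if getattr(self, 'stemmer', None) is not None:
        tokens = self.stemmer.stemWords(tokens)
    return ' '.join(tokens)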

