COMP24011 Lab 4:
BM25 for Retrieval-Augmented Question Answering
Introduction
In this exercise, you will develop your own implementation of the BM25 scoring algorithm, one of the most popular methods for information retrieval. Apart from their traditional uses in search engines and document ranking, information retrieval methods have recently been employed to enhance the question answering (QA) capabilities of generative large language models (LLMs). Such models, like ChatGPT, can answer questions based on knowledge learned during their training on large amounts of textual data. However, they suffer from well-known limitations, including their tendency to hallucinate (i.e., make up answers that are factually wrong), as well as biases learned from their training data.
A workaround to these issues is to integrate an information retrieval module into the question answering pipeline, enabling the LLM to access factual information stored in relevant documents and to draw on it when producing its output.
If you follow this manual all the way to the end, you will have the opportunity to observe how BM25 enables an LLM to provide more accurate answers to questions. Your main task for this exercise, however, is to implement pre-processing techniques, compute the BM25 score of each (pre-processed) document in relation to a (pre-processed) question, and return the topmost relevant documents based on those scores.
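For reference, a widely used formulation of the Okapi BM25 score of a document D for a question Q = q_1, ..., q_n is shown below; the exact variant and parameter values you should use are the ones specified in the task descriptions later in this manual.

\[
\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathit{avgdl}}\right)}
\]

Here f(q_i, D) is the frequency of term q_i in D, |D| is the length of D in tokens, avgdl is the average document length in the corpus, and k_1 and b are free parameters (commonly k_1 between 1.2 and 2.0, and b = 0.75). IDF(q_i) is an inverse document frequency weight, often computed as ln(((N - n(q_i) + 0.5) / (n(q_i) + 0.5)) + 1), where N is the number of documents in the corpus and n(q_i) is the number of documents containing q_i.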
For this exercise, you are provided with the following text files as resources:
transport_inventions.txt
The content of this file was drawn from Wikipedia’s timeline of transportation technology. We will consider this file as a corpus, i.e., a collection of documents, whereby each line corresponds to one document. Given that there are 10 lines in the file, this corpus consists of 10 documents.
music_inventions.txt
The content of this file was drawn from Wikipedia’s timeline of music technology. We will consider this file as another corpus. As in the first corpus, each line corresponds to one document. Given that there are 10 lines in the file, this corpus consists of 10 documents.
stopwords_en.txt
Every line in the file is a stop word.
If you make changes to the contents of these files, this will change the expected behaviour of the lab code that you're developing, and you won't be able to compare its results to the examples in this manual. But you can always use git to revert these resources to their original state.
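For example, assuming the resource files sit at the top level of your lab4 working copy and you have not committed your changes, a command along these lines restores them to their last committed state:

    git checkout -- transport_inventions.txt music_inventions.txt stopwords_en.txt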
To complete this lab you will need a third-party stemming tool called PyStemmer. You can install it by issuing the following command:
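    pip install PyStemmer

This installs the PyStemmer package from PyPI; depending on your Python setup, you may need pip3, python3 -m pip, or a --user install instead.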
The BM25 Retrieval System
Once you refresh the lab4 branch of your GitLab repo, you will find the following Python files.
This is the command-line tool that runs each separate NLP task according to the subcommand (and the parameters) provided by the user. It contains the RunNLP class.
This module contains the NLPTasksBase “abstract” class that specifies the signatures of four methods you need to implement, and implements the interface used in RunNLP.
This is the module that you need to complete for this exercise. It contains the NLPTasks class that is derived from NLPTasksBase, and must implement its abstract methods in order to complete the BM25-based retrieval of documents relevant to a given question.
BM25_score    calculate the BM25 score for a question in a corpus document
top_matches   find the top-scoring documents in the corpus for a question
Notice that for most subcommands you need to specify which corpus to work with, choosing between the two described in the Introduction: transport_inventions.txt or music_inventions.txt.
By default the stemming option is set to False, but you can set it to True. This will affect the way your text pre-processing code for Task 1 below should work.
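To illustrate what stemming does to tokens, the snippet below uses PyStemmer's English stemmer (the module it installs is called Stemmer). Where exactly stemming fits into your pre-processing pipeline is up to your Task 1 implementation:

    import Stemmer  # provided by the PyStemmer package

    stemmer = Stemmer.Stemmer('english')
    print(stemmer.stemWord('inventions'))                # e.g. 'invent'
    print(stemmer.stemWords(['engines', 'recording']))   # e.g. ['engin', 'record']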
For this lab exercise, the only Python file that you need to modify is nlp_tasks.py. You will develop your own version of this script, henceforth referred to as “your solution” in this document.
Before you get started with developing this script, it might be useful to familiarise yourself with how the NLPTasksBase "abstract" class initialises your NLPTasks objects.
In addition, the pre-processing of the corpus and of the question strings is done automatically in the NLPTasksBase abstract class. The pre-processed text for these becomes available as the fields self.preprocessed_corpus and self.preprocessed_question, respectively.
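To make the scoring concrete, here is a minimal, self-contained sketch of BM25 over a tokenised corpus (a list of token lists) and a tokenised question. It is deliberately independent of the NLPTasks class, the function names are illustrative only, and it uses common parameter values (k1 = 1.5, b = 0.75); the formula variant, parameters, and method signatures you must implement are those specified in the tasks below.

    import math
    from collections import Counter

    def bm25_scores(corpus_tokens, question_tokens, k1=1.5, b=0.75):
        """Return one BM25 score per document for the given question tokens."""
        N = len(corpus_tokens)
        avgdl = sum(len(doc) for doc in corpus_tokens) / N
        # document frequency of each question term (how many documents contain it)
        df = {term: sum(1 for doc in corpus_tokens if term in doc)
              for term in set(question_tokens)}
        scores = []
        for doc in corpus_tokens:
            tf = Counter(doc)
            score = 0.0
            for term in question_tokens:
                freq = tf[term]
                if freq == 0:
                    continue  # term absent from this document contributes nothing
                idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
                score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
            scores.append(score)
        return scores

    def top_matches_sketch(corpus_tokens, question_tokens, n=3):
        """Indices of the n highest-scoring documents, best first."""
        scores = bm25_scores(corpus_tokens, question_tokens)
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

Note that the sketch assumes each document is already a list of tokens; the actual structure of self.preprocessed_corpus and self.preprocessed_question depends on the pre-processing you implement in Task 1.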