INFS7410 Project - Part 1
version 1.4
Assignment Preamble
Due Date
30 August 2024, 16:00 Eastern Australia Standard Time
Weight
This assignment (Project - Part 1) constitutes 20% of the overall mark for INFS7410.
(Part 1 + Part 2 = 40% of the total course grade)
Completion Requirements
- You should complete it individually.
- You can check the detailed marking sheet provided alongside this notebook.
- see INFS7410-project-part-1-marking-sheet.pdf
Prerequisites Checker
You should have already worked through the activities from the week 1-5 pracs, including:
- Indexing corpus (prac-week1)
- Implementing retrieval functions (prac-week3)
- Implementing rank fusion methods (prac-week4)
- Implementing query expansion and reduction based on pseudo-relevance feedback (prac-week5)
- Performing evaluation, visualisation, and statistical significance testing (prac-week2)
Tips
- Start early to allow ample time for completion.
- Proceed step-by-step through the assignment tasks.
- Most of the assignment relies on knowledge and code from your computer practicals. However, be prepared for challenges that may require additional time and effort to solve.
Aims
Project Aim:
The aim of the entire project is to implement a number of representative information retrieval methods, and to evaluate and compare them in the context of real use cases.
Part 1 Aim
The aim of Part 1 is to:
- Familiarise yourself with the basic retrieval workflow.
- Set up the infrastructure for indexing the corpus and evaluating with queries.
- Implement classical information retrieval methods covered in the pracs and lectures.
- Tune your retrieval methods to improve their effectiveness.
The Information Retrieval Tasks: Fact Checking and Bio-Medical Retrieval
In this project, we will consider two tasks in IR:
- Fact Checking verifies a claim against a large collection of evidence. Here, we focus on the scientific domain, which ranges from basic science to clinical medicine. We verify scientific claims by retrieving evidence from a corpus of research literature containing scientific paper abstracts.
- Bio-Medical Retrieval involves searching for relevant scientific documents, such as research papers or blogs, in response to a specific query within the biomedical domain.
For these tasks, we will use selected datasets from the BEIR benchmark, specifically SciFact (fact checking), NFCorpus (biomedical), and TREC-COVID (biomedical).
What we give you:
Files from Previous Practicals
You can freely re-use all the materials from prac-week1 to prac-week5, e.g., your implementations and code.
Files for This Project
- infs7410_project_collections.zip (74.1 MB)
  - Click here to download and unzip.
- INFS7410-project-part-1.ipynb (This notebook)
- INFS7410-project-part-1-marking-sheet.pdf
We provide the following collections for the project:
- NF Corpus (training + test):
  - ./nfcorpus_corpus.jsonl
  - ./nfcorpus/nfcorpus_train_queries.tsv
  - ./nfcorpus/nfcorpus_test_queries.tsv
  - ./nfcorpus/nfcorpus_train_qrels.txt
  - ./nfcorpus/nfcorpus_test_qrels.txt
- SciFact (training + test):
  - ./scifact_corpus.jsonl
  - ./scifact/scifact_train_queries.tsv
  - ./scifact/scifact_test_queries.tsv
  - ./scifact/scifact_train_qrels.txt
  - ./scifact/scifact_test_qrels.txt
- TREC-COVID (test only):
  - ./trec-covid_corpus.jsonl
  - ./trec-covid/trec-covid_test_queries.tsv
  - ./trec-covid/trec-covid_test_qrels.txt
Generally, each collection contains the following (a minimal loading sketch follows this list):
- corpus.jsonl: Containing the texts to be retrieved for each query.
- queries.tsv: Listing queries used for retrieval, each line containing a topic id and the query text.
- qrels.txt: Containing relevance judgements in TREC qrels format (<qid, 0, doc_id, relevance>), used to evaluate your runs.
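Before indexing, it can help to sanity-check these files. Below is a minimal loading sketch, assuming the queries are tab-separated (qid, text) and the qrels follow the standard TREC qrels layout described above; the example paths come from the NF Corpus listing and should be adjusted to whichever collection you are inspecting.

```python
# Minimal loading sketch; adjust paths and parsing after inspecting the actual files.
def load_queries(path):
    """Read a TSV queries file: one 'qid<TAB>query text' pair per line (assumed format)."""
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, text = line.rstrip("\n").split("\t", 1)
            queries[qid] = text
    return queries

def load_qrels(path):
    """Read TREC-format qrels: 'qid 0 doc_id relevance' per line (assumed format)."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, doc_id, rel = line.split()
            qrels.setdefault(qid, {})[doc_id] = int(rel)
    return qrels

# Example (paths from the NF Corpus listing above):
nfcorpus_train_queries = load_queries("./nfcorpus/nfcorpus_train_queries.tsv")
nfcorpus_train_qrels = load_qrels("./nfcorpus/nfcorpus_train_qrels.txt")
print(len(nfcorpus_train_queries), "train queries,", len(nfcorpus_train_qrels), "judged queries")
```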
Additionally,
This Jupyter notebook is the workspace where you will implement your solutions and document your findings.
Put this notebook and the provided files under the same directory.
Overview of the IR Workflow
To conduct an experiment for the IR tasks in this project, we generally follow three key stages: Indexing -> Retrieval -> Evaluation. Each stage involves a corresponding portion of data from the collection. A collection typically comprises a corpus, queries, and qrels. These are illustrated below:
What you need to do:
You are expected to deliver the following:
- Correct implementations and evaluations of the methods required by this project specification.
- A write-up about the retrieval methods used, including:
  - the formula that represents each method you implemented.
  - the code that corresponds to the formula.
  - the evaluation settings, with an accompanying explanation.
  - a discussion of the findings.
Both the implementations and write-ups must be documented within this Jupyter notebook.
Required Methods to Implement
- Indexing
  - Pyserini index command: You first need to index the new datasets introduced in this project. Each dataset should be made into a separate index. Check how we used the command to build indexes for the given collections in weeks 1-3.
- Ranking functions
  - BM25: Implemented by yourself, not using the one from Pyserini. Check week 3 (a minimal scoring sketch appears after this list).
- Query reformulation methods
  - Pseudo-Relevance Feedback using BM25 for Query Expansion: Implemented by yourself. Check week 5.
  - IDF-r Query Reduction: Implemented by yourself. Check week 5 (a reduction sketch appears after this list).
- Rank fusion methods (a fusion sketch appears after this list)
  - Borda: Implemented by yourself. Check week 4.
  - CombSUM: Implemented by yourself. Check week 4.
  - CombMNZ: Implemented by yourself. Check week 4.
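As a reference for the BM25 item above, here is a minimal, self-contained sketch of the scoring formula covered in the pracs. It only shows the per-term weight as a pure function of collection statistics; how you obtain those statistics (e.g., from your Pyserini index inside the provided search function) and the exact IDF variant should follow your week 3 prac implementation.

```python
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Contribution of one query term to a document's BM25 score.

    tf          -- term frequency in the document
    df          -- document frequency of the term in the collection
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection
    """
    # Lucene-style IDF; swap in the variant used in the week 3 prac if it differs.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A document's score for a query is the sum of bm25_term_weight over the query
# terms that occur in it, accumulated while iterating over the posting lists.
```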
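For the query reformulation item, one piece that is easy to illustrate is IDF-r query reduction: rank the (analysed) query terms by IDF and keep only the highest-IDF fraction. The sketch below assumes you already have each term's document frequency (the `df_by_term` input is a hypothetical helper); check the week 5 prac for the exact variant, and for the pseudo-relevance feedback expansion, which is not shown here.

```python
import math

def idf_r_reduce(query_terms, df_by_term, num_docs, r=0.5):
    """Keep the fraction r of query terms with the highest IDF.

    query_terms -- list of analysed query terms
    df_by_term  -- dict mapping each term to its document frequency (hypothetical input)
    num_docs    -- number of documents in the collection
    r           -- fraction of terms to keep (a tuning parameter)
    """
    def idf(term):
        df = df_by_term.get(term, 0)
        return math.log(1 + (num_docs - df + 0.5) / (df + 0.5))

    keep = max(1, round(r * len(query_terms)))
    return sorted(query_terms, key=idf, reverse=True)[:keep]
```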
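For the rank fusion item, the sketch below shows one common reading of Borda, CombSUM, and CombMNZ operating on per-query runs represented as `doc_id -> score` dictionaries. Details such as score normalisation before CombSUM/CombMNZ and the treatment of documents missing from a run should follow the week 4 prac.

```python
from collections import defaultdict

def combsum(runs):
    """CombSUM: sum each document's (ideally normalised) scores across runs."""
    fused = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            fused[doc_id] += score
    return dict(fused)

def combmnz(runs):
    """CombMNZ: CombSUM score multiplied by the number of runs retrieving the document."""
    summed = combsum(runs)
    hits = defaultdict(int)
    for run in runs:
        for doc_id in run:
            hits[doc_id] += 1
    return {doc_id: score * hits[doc_id] for doc_id, score in summed.items()}

def borda(runs):
    """Borda count: a document at rank i (0-based) in a run of length n gets n - i points."""
    fused = defaultdict(float)
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        n = len(ranked)
        for i, doc_id in enumerate(ranked):
            fused[doc_id] += n - i
    return dict(fused)
```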
Parameter Tuning:
N.B. ONLY TUNE WITH TRAINING QUERIES.
- Tune the parameters in your BM25, Query Expansion, and Query Reduction implementations. Conduct a parameter search over at least 5 carefully selected values for each method, and 5 pairs when a method involves two parameters (a grid-search sketch follows this list).
- For rank fusion methods, focus on fusing the highest-performing tuned run from each of the BM25, Query Expansion, and Query Reduction implementations.
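A hedged sketch of the tuning loop for BM25 is shown below; `run_bm25` and `evaluate` are hypothetical placeholders for your own retrieval code and your nDCG@10 evaluation, and the value grids are illustrative only. The same pattern applies to the expansion and reduction parameters.

```python
import itertools

# Illustrative parameter grids; choose and justify your own values.
k1_values = [0.9, 1.2, 1.5, 1.8, 2.1]
b_values = [0.3, 0.45, 0.6, 0.75, 0.9]

best_setting, best_score = None, -1.0
# itertools.product explores all combinations; use zip(k1_values, b_values)
# instead if you only want 5 selected (k1, b) pairs.
for k1, b in itertools.product(k1_values, b_values):
    run = run_bm25(train_queries, k1=k1, b=b)          # hypothetical: your BM25 over the training queries
    score = evaluate(run, train_qrels, "ndcg_cut_10")  # hypothetical: mean nDCG@10 against the training qrels
    if score > best_score:
        best_setting, best_score = (k1, b), score

print("best (k1, b):", best_setting, "mean nDCG@10:", round(best_score, 4))
```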
Required Evaluations to Perform
In this project, we provide three datasets with queries sampled from the original BEIR versions. Your first task is to examine these datasets and become familiar with their size, content, and format, and to consider how to process them. Pay particular attention to the differences between these datasets and the MSMARCO collection we used in the pracs.
In Part 1 of the project, you are required to perform the following evaluations:
- Run your BM25 with k1=1.2, b=0.75 on the test_queries of SciFact, NFCorpus, and TREC-COVID as the baselines.
- Tune the parameters for BM25, Query Expansion, and Query Reduction with the train_queries of SciFact and NFCorpus. Refer to the Parameter Tuning section outlined above.
- Report the results of each method from tuning in a table. Perform statistical significance analysis across the results of the methods and report it in the table (e.g., comparing Method_A with parameter setting a on dataset_1 against the baseline on the same dataset).
- Select the best parameter setting of each method from SciFact and NFCorpus separately, and run it on the test_queries of TREC-COVID. Report the results in the table, following the same requirements listed above.
- Create a gain-loss plot comparing BM25 vs. Pseudo-Relevance Feedback Query Expansion using BM25, as well as plots comparing BM25 vs. each rank fusion method, on TREC-COVID. Use the baseline BM25 and the two best parameter settings for Query Expansion obtained from SciFact and NFCorpus (a plotting sketch follows this list).
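A minimal plotting sketch for the gain-loss plots is given below, assuming you have per-query nDCG@10 scores for the baseline and for a method as dictionaries keyed by query id (hypothetical inputs produced by your evaluation code).

```python
import matplotlib.pyplot as plt

def gain_loss_plot(baseline_scores, method_scores, title=""):
    """Bar chart of per-query nDCG@10 differences (method minus baseline).

    baseline_scores, method_scores -- hypothetical dicts mapping qid -> nDCG@10
    """
    qids = sorted(baseline_scores, key=lambda q: method_scores.get(q, 0.0) - baseline_scores[q])
    deltas = [method_scores.get(q, 0.0) - baseline_scores[q] for q in qids]
    colors = ["tab:green" if d > 0 else "tab:red" for d in deltas]

    plt.figure(figsize=(10, 3))
    plt.bar(range(len(qids)), deltas, color=colors)
    plt.axhline(0, color="black", linewidth=0.8)
    plt.xlabel("queries (sorted by gain over BM25)")
    plt.ylabel("delta nDCG@10")
    plt.title(title)
    plt.tight_layout()
    plt.show()
```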
Discussions
- Comment on trends and differences observed in your results. Do the methods that work well on SciFact and NFCorpus also generalise to TREC-COVID? Is there a method that consistently outperforms the others on all the datasets?
- Provide insights into whether rank fusion works or not, considering factors like runs included in the fusion process and query characteristics.
- Discuss the results obtained from SciFact and NFCorpus compared to those on TREC-COVID. Do ranking methods perform well? Why or why not? Include the statistical significance analysis results comparing the baseline with the tuned methods, and each method against another (including fusion, query expansion and reduction).
Evaluation Measures
Evaluate the retrieval methods using the following measures:
- nDCG at 10 (ndcg_cut_10): use this as the primary measure for tuning.
All gain-loss plots should be produced with respect to nDCG at 10.
For statistical significance analysis, use the paired t-test and distinguish between p < 0.05 and p < 0.01.
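A small sketch of the significance test is shown below, assuming per-query nDCG@10 scores for two systems as dictionaries keyed by query id (hypothetical inputs); it uses scipy.stats.ttest_rel and distinguishes the two significance levels required above.

```python
from scipy import stats

def paired_ttest(scores_a, scores_b):
    """Paired t-test over per-query nDCG@10 scores of two systems (paired by query id).

    scores_a, scores_b -- hypothetical dicts mapping qid -> nDCG@10
    """
    qids = sorted(set(scores_a) & set(scores_b))
    a = [scores_a[q] for q in qids]
    b = [scores_b[q] for q in qids]
    t_stat, p_value = stats.ttest_rel(a, b)
    if p_value < 0.01:
        label = "significant at p < 0.01 (**)"
    elif p_value < 0.05:
        label = "significant at p < 0.05 (*)"
    else:
        label = "not significant"
    return t_stat, p_value, label
```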
How to submit
Submit one .zip file containing:
- this notebook in .ipynb format
- this notebook saved as a .pdf: navigate to the menu, select File -> Save and Export Notebook As -> HTML, then open the exported HTML file and print it to PDF with your browser.
N.B.
- Ensure that the code is executable. Include all your discussions and analysis within this notebook, not as a separate file.
- Do not include any runs or indexes in the zip file! (Otherwise the file will be too big and you will encounter errors when submitting.)
- Submit the file via the link on the INFS7410 course site on BlackBoard by 30 August 2024, 16:00 Eastern Australia Standard Time, unless you have received an extension according to UQ policy, which must be requested before the assignment due date.
Check datasets and perform indexing
Note: Try out different indexes to find the most effective setup for your needs.
First, have a look at the datasets. Then use the indexing command from the pracs; remember to add -storeDocvectors.
You may want to store your indexes under ./indexes/.
You can check the pyserini guidance.
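As a starting point, the cell below shows roughly what the Pyserini indexing command looks like for one collection. The input/output paths are hypothetical; older Pyserini versions use `python -m pyserini.index` with single-dash flags (e.g., `-storeDocvectors`, as mentioned above), and the input directory must contain JSON/JSONL in the `{"id": ..., "contents": ...}` format Pyserini expects, so convert the provided corpus files first if needed.

```python
# Hypothetical paths; repeat for each of the three collections (one index per dataset).
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input ./pyserini_corpus/nfcorpus \
  --index ./indexes/nfcorpus \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
```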
Run the following cell to load and cache some useful packages and statistics that you will use later.
Read and run the following cell to define the search function.
NOTE: This search function differs from the one used in the week3 prac. When implementing methods, ensure that you use this search function, which involves iterating over posting lists, rather than the week3 function, which is solely for re-ranking BM25 results.
After this line, feel free to edit this notebook whatever you like. You can use the following cells as a template to guide you in completing this project.
Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., BM25
When you have described and provided implementations for each method, include a table with statistical analysis here.
For convenience, you can use a tool such as https://www.tablesgenerator.com/markdown_tables, or, if you are using pandas, convert dataframes to markdown with https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html
Then you can edit this cell to describe the gain-loss plots you will create below.
Then you can edit this cell to provide some analysis of whether there is a method that consistently outperforms the others on all the datasets.
Then you can edit this cell to provide insights into whether rank fusion works or not.