INFS7410 Information Retrieval and Web Search


INFS7410 Project - Part 1

version 1.4

Assignment Preamble

Due Date

30 August 2024, 16:00 Eastern Australia Standard Time

Weight

This assignment (Project - Part 1) constitutes 20% of the overall mark for INFS7410.
(Part 1 + Part 2 = 40% of the total course grade)

Completion Requirements
  • You should complete it individually.
  • You can check the detailed marking sheet provided alongside this notebook.
    • see INFS7410-project-part-1-marking-sheet.pdf
Prerequisites Checklist

You should have already tackled the activities from pracs week1-5, including:

  • Indexing corpus (prac-week1)
  • Implementing retrieval functions (prac-week3)
  • Implementing rank fusion methods (prac-week4)
  • Implementing query expansion and reduction from pseudo relevance feedback (prac-week5)
  • Performing evaluation, visualisation, and statistical significance tests (prac-week2)
Tips
  • Start early to allow ample time for completion.
  • Proceed step-by-step through the assignment tasks.
  • Most of the assignment relies on knowledge and code from your computer practicals. However, be prepared for challenges that may require additional time and effort to solve.

Aims

Project Aim:

The aim of the entire project is to implement a number of representative information retrieval methods, and to evaluate and compare them in the context of real use cases.

Part 1 Aim

The aim of Part 1 is to:

  • Familiarise yourself with the basic retrieval workflow.
  • Set up the infrastructure for indexing the corpus and evaluating with queries.
  • Implement classical information retrieval methods covered in the pracs and lectures.
  • Tune your retrieval methods to improve their effectiveness.

The Information Retrieval Tasks: Fact Checking and Bio-Medical Retrieval

In this project, we will consider two tasks in IR:

  • Fact Checking verifies a claim against a large collection of evidence. Here, we focus on the scientific domain, which ranges from basic science to clinical medicine. We verify scientific claims by retrieving evidence from a corpus of research literature containing scientific paper abstracts.
  • Bio-Medical Retrieval involves searching for relevant scientific documents, such as research papers or blogs, in response to a specific query within the biomedical domain.

For these tasks, we will use selected datasets from the BEIR benchmark, specifically SciFact (fact checking), NFCorpus (biomedical), and TREC-COVID (biomedical).

What we give you:

Files from Previous Practicals

You can freely re-use all the materials from prac-week1 to prac-week5, e.g., your implementations/code.

Files for This Project

  • infs7410_project_collections.zip (74.1 MB)
    • Click here to download and unzip.
  • INFS7410-project-part-1.ipynb (This notebook)
  • INFS7410-project-part-1-marking-sheet.pdf

We provide the following collections for the project:

  • NF Corpus: (training + test)
    • ./nfcorpus_corpus.jsonl
    • ./nfcorpus/nfcorpus_train_queries.tsv
    • ./nfcorpus/nfcorpus_test_queries.tsv
    • ./nfcorpus/nfcorpus_train_qrels.txt
    • ./nfcorpus/nfcorpus_test_qrels.txt
  • SciFact: (training + test)
    • ./scifact_corpus.jsonl
    • ./scifact/scifact_train_queries.tsv
    • ./scifact/scifact_test_queries.tsv
    • ./scifact/scifact_train_qrels.txt
    • ./scifact/scifact_test_qrels.txt
  • TREC-COVID: (test only)
    • ./trec-covid_corpus.jsonl
    • ./trec-covid/trec-covid_test_qrels.txt
    • ./trec-covid/trec-covid_test_queries.tsv

Generally, each collection contains:

  • corpus.jsonl: Containing the texts to be retrieved for each query.
  • queries.tsv: Listing queries used for retrieval, each line containing a topic id and the query text.
  • qrels.txt: Containing relevance judgements in TREC qrels format <qid, Q0, doc_id, relevance>, used to evaluate your runs.
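
If it helps, the following is a minimal sketch for loading the queries and qrels into Python dictionaries. It assumes the queries file has one tab-separated query id and query text per line, and that the qrels file is whitespace-separated with the query id in the first column, the document id in the third, and the relevance label in the last; double-check these assumptions against the actual files.

def load_queries(path):
    # qid -> query text
    queries = {}
    with open(path) as f:
        for line in f:
            qid, text = line.rstrip('\n').split('\t', 1)
            queries[qid] = text
    return queries

def load_qrels(path):
    # qid -> {docid: relevance}
    qrels = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            qid, docid, rel = parts[0], parts[2], parts[-1]
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

# For example (using the paths listed above):
# train_queries = load_queries('./nfcorpus/nfcorpus_train_queries.tsv')
# train_qrels = load_qrels('./nfcorpus/nfcorpus_train_qrels.txt')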

Additionally,

This Jupyter notebook is the workspace where you will implement your solutions and document your findings.

Put this notebook and the provided files under the same directory.

Overview of the IR Workflow

To conduct an experiment for the IR tasks in this project, we generally follow three key stages: Indexing -> Retrieval -> Evaluation. Each stage involves a corresponding portion of data from the collection. A collection typically comprises a corpus, queries, and qrels. These are illustrated below:

What you need to do:

You are expected to deliver the following:

  • Correct implementations and evaluations of the methods required by this project specification.
  • A write-up about the retrieval methods used, including:
    • the formula that represents each method you implemented.
    • the code that corresponds to the formula.
    • the evaluation settings, with an explanation.
    • a discussion of the findings.

Both the implementations and write-ups must be documented within this Jupyter notebook.

Required Methods to Implement

  • Indexing
    • Pyserini index command: You first need to index the new datasets introduced in this project. Each dataset should be made into a separate index. Check how we used the command to build indexes for the given collections in pracs week1-week3.
  • Ranking functions
    • BM25: Implemented by yourself, not using the one from Pyserini. Check week3.
  • Query reformulation methods
    • Pseudo-Relevance Feedback using BM25 for Query Expansion: Implemented by yourself. Check week5.
    • IDF-r Query Reduction: Implemented by yourself. Check week5.
  • Rank fusion methods
    • Borda: Implemented by yourself. Check week4.
    • CombSUM: Implemented by yourself. Check week4.
    • CombMNZ: Implemented by yourself. Check week4.

Parameter Tuning:

N.B. ONLY TUNE WITH TRAINING QUERIES.

  • Tune the parameters of your BM25, Query Expansion, and Query Reduction implementations. Conduct a parameter search over at least 5 carefully selected values for each method, and at least 5 value pairs when the method involves two parameters (a minimal grid-search sketch is given after this list).

  • For Rank fusion methods, focus on fusing the highest-performing tuned run from each of the BM25, Query Expansion, and Query Reduction implementations.
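
For illustration only, the sketch below shows one possible shape of such a parameter search for BM25 on training queries. The names make_bm25_scorer, evaluate_run, train_queries, and train_qrels are placeholders for code you will write yourself, and search refers to the search function defined later in this notebook.

# Hypothetical grid search over BM25 parameters; make_bm25_scorer, evaluate_run,
# train_queries and train_qrels are placeholders for your own code.
param_grid = [(0.9, 0.4), (1.2, 0.75), (1.5, 0.75), (1.2, 0.9), (2.0, 0.6)]  # >= 5 pairs

best_setting, best_ndcg = None, -1.0
for k1, b in param_grid:
    run = {qid: search(query, k=1000, scorer=make_bm25_scorer(k1=k1, b=b))
           for qid, query in train_queries.items()}
    ndcg10 = evaluate_run(run, train_qrels)   # mean ndcg_cut_10 over training queries
    print(f"k1={k1}, b={b}: nDCG@10={ndcg10:.4f}")
    if ndcg10 > best_ndcg:
        best_setting, best_ndcg = (k1, b), ndcg10

print("Best (k1, b) on the training queries:", best_setting, best_ndcg)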

Required Evaluations to Perform

In this project, we provide three datasets with queries sampled from the original BEIR versions. Your first task is to inspect these datasets, get familiar with their size, content, and format, and consider how to process them. Pay particular attention to the differences between these datasets and the MSMARCO collection we used in the pracs.

In Part 1 of the project, you are required to perform the following evaluations:

  1. Run your BM25 with k1=1.2, b=0.75 on the test_queries of SciFact, NFCorpus, and TREC-COVID as the baselines (a sketch for writing your runs in TREC format is given after this list).

  2. Tune the parameters for BM25, Query Expansion, and Query Reduction with the train_queries of SciFact and NFCorpus. Refer to the Parameter Tuning section outlined above.

  3. Report the results of each method from tuning in a table. Perform statistical significance analysis across the results of the methods and report it in the table (e.g., comparing Method_A with para-setting_a on dataset_1 against the baseline on the same dataset).

  4. Select the best parameter settings of the methods tuned on SciFact and NFCorpus separately, and run them on the test_queries of TREC-COVID. Report the results in a table, following the same requirements listed above.

  5. Create gain-loss plots comparing BM25 vs. Pseudo-Relevance Feedback Query Expansion using BM25, as well as plots comparing BM25 vs. each rank fusion method, on TREC-COVID. Use the baseline BM25 and, for Query Expansion, the two best parameter settings found on SciFact and NFCorpus.
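
As referenced in step 1, the following is a minimal sketch for writing a run file in TREC run format so it can be evaluated against the qrels. It assumes your run is a dict mapping each query id to a ranked list of (docid, score) tuples, as returned by the search function defined later in this notebook.

def write_trec_run(run, path, tag='bm25_baseline'):
    # run: dict mapping qid -> ranked list of (docid, score), highest score first
    with open(path, 'w') as f:
        for qid, ranking in run.items():
            for rank, (docid, score) in enumerate(ranking, start=1):
                f.write(f"{qid} Q0 {docid} {rank} {score} {tag}\n")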

Discussions

  1. Comment on trends and differences observed in your results. Do the methods that work well on SciFact and NFCorpus also generalise to TREC-COVID? Is there a method that consistently outperforms the others on all the datasets?
  2. Provide insights into whether rank fusion works or not, considering factors like runs included in the fusion process and query characteristics.
  3. Discuss the results obtained on SciFact and NFCorpus compared to those on TREC-COVID. Do the ranking methods perform well? Why or why not? Include the statistical significance analysis results comparing the baseline with the tuned methods, and each method against the others (including fusion, query expansion, and reduction).

Evaluation Measures

Evaluate the retrieval methods using the following measures:

  • nDCG at 10 (ndcg_cut_10): Use this as the primary measure for tuning

All gain-loss plots should be produced with respect to nDCG at 10.

For statistical significance analysis, use the paired t-test and distinguish between p < 0.05 and p < 0.01.
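
For illustration, the sketch below applies the paired t-test with scipy, assuming you have already computed per-query nDCG@10 scores for the baseline and for one tuned method (the dictionaries baseline_ndcg and method_ndcg are placeholders, keyed by query id).

from scipy import stats

# baseline_ndcg and method_ndcg are placeholder dicts: qid -> nDCG@10
common_qids = sorted(set(baseline_ndcg) & set(method_ndcg))
baseline_scores = [baseline_ndcg[q] for q in common_qids]
method_scores = [method_ndcg[q] for q in common_qids]

t_stat, p_value = stats.ttest_rel(method_scores, baseline_scores)
if p_value < 0.01:
    marker = '**'   # significant at p < 0.01
elif p_value < 0.05:
    marker = '*'    # significant at p < 0.05
else:
    marker = ''     # not significant
print(f"t = {t_stat:.3f}, p = {p_value:.4f} {marker}")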

How to submit

Submit one .zip file containing:

  1. this notebook in .ipynb format
  2. this notebook saved as a .pdf: from the menu, select File -> Save and Export Notebook As -> HTML, then open the exported HTML file in your browser and print it to PDF.

N.B.

  • Ensure that the code is executable. Include all your discussions and analysis within this notebook, not as a separate file.

  • Don't include any runs or indexes in the zip file! (If you do, the file will be too big and you will encounter errors when submitting.)

  • Submit the file via the link on the INFS7410 course site on BlackBoard by 30 August 2024, 16:00 Eastern Australia Standard Time, unless you have received an extension according to UQ policy, which must be requested before the assignment due date.

Check datasets and perform indexing


Note: Try out different indexes to find the most effective setup for your needs.

First, have a look at the datasets. Then, try to use the indexing command from pracs. Remember to add -storeDocvectors.

You may want to store your indexes under ./indexes/.

You can check the pyserini guidance.

In [ ]:
# Run your commands for indexing the corpus of each dataset
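# A possible shape for this command is sketched below (commented out). This is only a
# sketch: the exact flag style depends on your Pyserini version (newer versions use
# `python -m pyserini.index.lucene` with `--`-prefixed options), and it assumes you
# have placed each *_corpus.jsonl file in its own directory (e.g., ./corpora/nfcorpus/),
# since JsonCollection reads a directory of .jsonl files.
#
# !python -m pyserini.index -collection JsonCollection \
#     -generator DefaultLuceneDocumentGenerator \
#     -threads 1 \
#     -input ./corpora/nfcorpus/ \
#     -index ./indexes/lucene-index-nfcorpus/ \
#     -storePositions -storeDocvectors -storeRaw
#
# Build a separate index for scifact and trec-covid in the same way.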

Initialise packages and functions

You also need to decide, in the following cell, which stemming algorithm to use and whether to keep or remove stopwords.

In [ ]:
stemming = None  # None or 'porter' or others
stopwords = False  # False or True
index = 'indexes/lucene-index-[name_of_dataset]-[tag]/'  # Load the index of the dataset you selected

Run the following cell to load and cache some useful packages and statistics that you will use later.

In [ ]:
from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.index import IndexReader
from tqdm import tqdm

lucene_analyzer = get_lucene_analyzer(stemming=stemming, stopwords=stopwords)
analyzer = Analyzer(lucene_analyzer)

searcher = SimpleSearcher(index)
searcher.set_analyzer(lucene_analyzer)
index_reader = IndexReader(index)

# Get the total number of documents in the collection
total_doc_num = index_reader.stats()['documents']

# Get all document IDs of the collection
doc_ids = [index_reader.convert_internal_docid_to_collection_docid(i) for i in tqdm(range(total_doc_num))]

# Cache document vectors: dict{'doc_id': {'term_1': term_freq, ...}}
doc_vec_dict = {}
for docid in tqdm(doc_ids, desc="Caching doc vec:"):
    doc_vec_dict[docid] = index_reader.get_document_vector(docid)

# Cache document lengths for each document
doc_len_dict = {}
for docid in tqdm(doc_ids, desc="Caching doc len:"):
    doc_len_dict[docid] = sum(doc_vec_dict[docid].values())

Read and run the following cell to define the search function.

NOTE: This search function differs from the one used in the week3 prac. When implementing methods, ensure that you use this search function, which involves iterating over posting lists, rather than the week3 function, which is solely for re-ranking BM25 results.

In [ ]:
def search(query: str, k: int = 1000, scorer=None):
    """
    Inputs:
        query (str): the query string to perform the search.
        k (int): the number of documents to be returned.
        scorer: your implemented scoring function, such as bm25.
    Output:
        results (list): the sorted result list, a list of tuples.
                        The first element in each tuple is the docid,
                        the second is the doc score.
    """
    assert scorer is not None
    print("-----------------------------------------------------")
    print("Current query:", query)

    # Get the analyzed term list
    q_terms = analyzer.analyze(query)
    doc_scores = {}
    for term in q_terms:
        # Get the posting list for the current term
        postings_list = index_reader.get_postings_list(term, analyzer=None)
        # Get the document frequency of the current term
        df = index_reader.get_term_counts(term, analyzer=None)[0]
        if postings_list is not None:
            # Iterate the posting list
            for posting in tqdm(postings_list, desc=f"Iterate posting for term '{term}'"):
                internal_id = posting.docid
                # Convert pyserini internal docid to the actual docid
                docid = index_reader.convert_internal_docid_to_collection_docid(internal_id)
                tf = posting.tf
                # Use the cached dictionary.
                doc_len = doc_len_dict[docid]
                # Call the scoring function (you will implement these below).
                score = scorer(tf, df, doc_len)
                if docid in doc_scores:
                    doc_scores[docid] += score
                else:
                    doc_scores[docid] = score

    # Sort the results by the score.
    results = [(docid, doc_score) for docid, doc_score in doc_scores.items()]
    results = sorted(results, key=lambda x: x[1], reverse=True)[:k]
    print("-----------------------------------------------------")
    return results

After this line, feel free to edit this notebook whatever you like. You can use the following cells as a template to guide you in completing this project.


In [ ]:
# Import all your python libraries and put setup code here.

Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., BM25

In [ ]:
# Put your implementation of BM25 here, including parameter tuning.
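# Below is a minimal, non-authoritative sketch of a BM25 scorer that is compatible
# with the search() function above, i.e. scorer(tf, df, doc_len). It assumes the
# cached total_doc_num and doc_len_dict from the earlier cells; check the exact
# BM25 formulation against the lectures/pracs and tune k1 and b yourself.
import math

avg_doc_len = sum(doc_len_dict.values()) / len(doc_len_dict)

def make_bm25_scorer(k1=1.2, b=0.75):
    def bm25(tf, df, doc_len):
        # idf component (one common BM25 variant; adjust if your prac used another)
        idf = math.log((total_doc_num - df + 0.5) / (df + 0.5) + 1)
        # term-frequency saturation with document-length normalisation
        tf_component = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return idf * tf_component
    return bm25

# Hypothetical usage with the search() function:
# results = search("example query text", k=1000, scorer=make_bm25_scorer(k1=1.2, b=0.75))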

When you have described and provided implementations for each method, include a table with statistical analysis here.

For convenience, you can use a tool such as https://www.tablesgenerator.com/markdown_tables, or, if you are using pandas, convert dataframes to markdown with https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html
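
For example, a results table can be assembled as a small dataframe and rendered to markdown along the following lines (the method names and zero values are placeholders, not real results):

import pandas as pd

results_table = pd.DataFrame(
    {
        "SciFact nDCG@10": [0.0, 0.0],
        "NFCorpus nDCG@10": [0.0, 0.0],
    },
    index=["BM25 (baseline)", "BM25 (tuned)"],
)
print(results_table.to_markdown())   # to_markdown() requires the `tabulate` package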

Then you can edit this cell to describe the gain-loss plots you will create below.

In [ ]:
# Put your implementations for the gain-loss plots here.
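# A minimal sketch of a gain-loss plot. It assumes baseline_ndcg and method_ndcg are
# placeholder dicts mapping each query id to its nDCG@10 (e.g., extracted from
# per-query evaluation output). Bars above zero are gains over BM25, bars below zero are losses.
import matplotlib.pyplot as plt

qids = sorted(set(baseline_ndcg) & set(method_ndcg))
diffs = [method_ndcg[q] - baseline_ndcg[q] for q in qids]
# Sort queries by the size of the difference so gains and losses are easy to read.
diffs.sort(reverse=True)

plt.figure(figsize=(10, 4))
plt.bar(range(len(diffs)), diffs)
plt.axhline(0, linewidth=0.8, color='black')
plt.xlabel("Queries (sorted by difference)")
plt.ylabel("nDCG@10 difference vs. BM25 baseline")
plt.title("Gain-loss plot (placeholder method name)")
plt.show()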

Then you can edit this cell to provide some analysis of whether there is a method that consistently outperforms the others on all the datasets.

Then you can edit this cell to provide insights into whether rank fusion works or not.

In [ ]:
