Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Last update: Oct 26, 2021


Official code repository for the paper "Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation" (SDP@NAACL 2021): https://aclanthology.org/2021.sdp-1.2/

Abstract

One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically similar. While recent work has proposed to expand the queries or documents by enriching their representations with additional relevant terms to address this challenge, they usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework with a pre-trained language model, which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. For generating sentences, we further stochastically perturb their embeddings to generate more diverse sentences for document expansion. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.

Dependencies

Python 3.7.9
PyTorch 1.7.0
Transformers 4.3

Run

1. Installing Anserini

We use Anserini, an open-source information retrieval toolkit.

# install maven
sudo apt-get install maven

# cloning / installing anserini
git clone https://github.com/castorini/anserini.git --recurse-submodules
cd anserini/
# edit pom.xml first: change the jacoco version from 0.8.2 to 0.8.3 so that the build succeeds
mvn clean package appassembler:assemble

# compile evaluation tools and other scripts
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

2. Data Preprocessing

python 0_0_extract_text.py
python 0_1_convert_qrels_to_binary.py
python 0_2_convert_qrels_to_ndcg_scale.py
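
The qrels conversion can be pictured with a minimal sketch. It assumes the standard four-column TREC qrels layout (query id, iteration, document id, relevance grade) and treats any grade at or above a threshold as relevant. The function name `binarize_qrels` and the default threshold are illustrative assumptions, not taken from 0_1_convert_qrels_to_binary.py.

```python
def binarize_qrels(lines, threshold=1):
    """Map graded TREC qrels lines to binary labels (1 if grade >= threshold).

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    converted = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, iteration, doc_id, grade = parts
        label = 1 if int(grade) >= threshold else 0
        converted.append(f"{query_id} {iteration} {doc_id} {label}")
    return converted


graded = ["q1 0 d1 2", "q1 0 d2 0", "q2 0 d3 1"]
binary = binarize_qrels(graded)
```

Binary labels suit measures such as MAP, while the graded labels are kept separately for nDCG-style measures.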

3. Data Tokenization

python 1_convert_text_to_tokenized.py

4. Abstractive Generation with Stochastic Text Generation

python 2_abstract_summary_multi.py

We provide the output of this abstractive, stochastic generation step in this repository (test_pegasus_xsum_4mc.tar.gz).
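
To give an intuition for what "stochastically perturbing embeddings" can mean, here is a toy sketch that applies a dropout-style random mask to a small embedding matrix several times, producing a different perturbed copy on each pass. The choice of dropout as the perturbation, the function names, and the parameter values are all assumptions made for illustration. This is not the code in 2_abstract_summary_multi.py, and it uses plain Python lists rather than a real language model.

```python
import random


def perturb_embeddings(embeddings, drop_prob, rng):
    """Inverted dropout: zero each value with probability drop_prob and
    rescale the surviving values by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_prob)
    return [
        [0.0 if rng.random() < drop_prob else value * scale for value in row]
        for row in embeddings
    ]


def sample_perturbations(embeddings, num_samples, drop_prob=0.1, seed=0):
    """Draw several independently perturbed copies of the same embeddings."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [perturb_embeddings(embeddings, drop_prob, rng) for _ in range(num_samples)]


# Toy "token embeddings": two tokens, three dimensions each (made-up values).
token_embeddings = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
samples = sample_perturbations(token_embeddings, num_samples=4, drop_prob=0.5)
```

In a full pipeline, each perturbed copy would be passed to the decoder to generate one candidate sentence, so that several passes over the same document yield several different expansions.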

5. Convert to JSON format

This step follows the repository https://github.com/nyu-dl/dl4ir-doc2query.

python 3_concat_collection_summary_to_json.py
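
As a rough sketch of this step: each document is concatenated with its generated sentences and written as a JSON object with "id" and "contents" fields, which is the layout Anserini's JSON collection reader expects. The helper names and the sample texts are hypothetical. The actual script may order or join the fields differently.

```python
import json


def expand_document(doc_id, text, generated_sentences):
    """Append generated sentences to the original text (hypothetical helper)."""
    pieces = [text.strip()] + [s.strip() for s in generated_sentences if s.strip()]
    return {"id": doc_id, "contents": " ".join(pieces)}


def to_jsonl(records):
    """Serialize records as JSON Lines, one document per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


record = expand_document(
    "doc1",
    "Original abstract text.",
    ["A generated summary.", "Another one."],
)
jsonl = to_jsonl([record])
```

Writing `jsonl` to a file inside the collection directory gives the indexing step a document whose text includes the generated expansion terms.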

6. Indexing, Retrieval, Evaluation

This step follows the repository https://github.com/boudinfl/ir-using-kg#data.

sh 4_create_indexes.sh
sh 5_retrieve.sh
sh 6_evaluate.sh
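
For readers unfamiliar with the evaluation measures, the sketch below computes average precision for a single query from a ranked list and a set of relevant document ids; its mean over queries is the MAP measure that trec_eval reports. It is an illustration only: 6_evaluate.sh calls trec_eval rather than code like this, and the function name and sample ids are made up. It assumes the ranking contains no duplicate ids.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of precision@k over the ranks k at which a relevant document appears,
    divided by the total number of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Relevant documents d1 and d2 appear at ranks 2 and 4.
ap = average_precision(["d3", "d1", "d7", "d2"], ["d1", "d2"])
```

Here the precision is 1/2 at rank 2 and 2/4 at rank 4, so the average precision is 0.5.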

Cite

@inproceedings{jeong-etal-2021-unsupervised,
    title = "Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation",
    author = "Jeong, Soyeong  and
      Baek, Jinheon  and
      Park, ChaeHun  and
      Park, Jong",
    booktitle = "Proceedings of the Second Workshop on Scholarly Document Processing",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.sdp-1.2",
    doi = "10.18653/v1/2021.sdp-1.2",
    pages = "7--17"
}

Owner

NLP*CL Laboratory
