OpenMatch
An Open-Source Package for Information Retrieval.
😃 What's New
- Top Spot on TREC-COVID Challenge (May 2020, Round2)
The twin goals of the challenge are to evaluate search algorithms and systems for helping scientists, clinicians, policy makers, and others manage the existing and rapidly growing corpus of scientific literature related to COVID-19, and to discover methods that will assist with managing scientific information in future global biomedical crises.
>> Reproduce Our Submission >> About COVID-19 Dataset >> Our Paper
Overview
OpenMatch integrates excellent neural methods and technologies to provide a complete solution for deep text matching and understanding. The documentation and tutorial for OpenMatch are available here.
1/ Document Retrieval
Document Retrieval refers to extracting a set of related documents from large-scale document-level data based on user queries.
* Sparse Retrieval
A Sparse Retriever is a sparse bag-of-words retrieval model, such as BM25.
* Dense Retrieval
A Dense Retriever performs retrieval by encoding documents and queries into dense low-dimensional vectors and selecting the documents that have the highest inner product with the query.
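The two retrieval styles above can be illustrated with a minimal, pure-Python sketch. It is not OpenMatch's implementation: the function names `bm25_scores` and `dense_top_k`, the BM25 parameters `k1` and `b`, the whitespace tokenization, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def dense_top_k(query_vec, doc_vecs, k=1):
    """Return the indices of the k documents with the highest inner product."""
    products = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: products[i], reverse=True)
    return order[:k]


docs = [
    "covid treatment with antiviral drugs",
    "neural ranking models for search",
    "lymphocyte count in covid patients",
]

# Sparse retrieval: exact term overlap drives the score.
sparse = bm25_scores("covid treatment", docs)
best_sparse = max(range(len(docs)), key=lambda i: sparse[i])

# Dense retrieval: made-up embeddings stand in for a learned encoder.
doc_vecs = [[0.9, 0.1], [0.1, 0.8], [0.6, 0.3]]
best_dense = dense_top_k([1.0, 0.0], doc_vecs, k=2)
```

In practice a dense retriever replaces the exhaustive inner-product loop with an approximate nearest neighbor index so that search scales to large collections.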
2/ Document Reranking
Document reranking further matches the user query against the documents retrieved in the previous step to produce a ranked list of relevant documents.
* Neural Ranker
Neural Ranker uses a neural network as the ranker to reorder documents.
* Feature Ensemble
Feature Ensemble fuses the neural features learned by the neural ranker with features from non-neural methods to obtain more robust performance.
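As a rough illustration of reranking with a feature ensemble, the sketch below combines a non-neural feature and a neural feature linearly and tunes the weights with a simplified, grid-based coordinate ascent. This is a toy under stated assumptions, not the procedure used by OpenMatch: the helper names (`ensemble_score`, `rerank`, `coordinate_ascent`), the weight grid, the feature values, and the use of reciprocal rank as the objective are all hypothetical.

```python
def ensemble_score(features, weights):
    """Linear combination of a document's features."""
    return sum(w * f for w, f in zip(weights, features))


def rerank(candidates, weights):
    """candidates: list of (doc_id, features). Returns doc ids, best first."""
    ordered = sorted(
        candidates,
        key=lambda c: ensemble_score(c[1], weights),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ordered]


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for pos, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / pos
    return 0.0


def coordinate_ascent(candidates, relevant, n_features,
                      grid=(1.0, 0.75, 0.5, 0.25, 0.0), sweeps=2):
    """Tune one weight at a time, keeping a change only if the metric improves."""
    weights = [1.0] * n_features
    best = reciprocal_rank(rerank(candidates, weights), relevant)
    for _ in range(sweeps):
        for i in range(n_features):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                score = reciprocal_rank(rerank(candidates, trial), relevant)
                if score > best:
                    best, weights = score, trial
    return weights, best


# Each candidate carries [bm25_score, neural_score]; the values are made up.
candidates = [
    ("d1", [0.9, 0.2]),
    ("d2", [0.3, 0.7]),
    ("d3", [0.7, 0.1]),
]
relevant = {"d2"}

initial_ranking = rerank(candidates, [1.0, 1.0])
weights, best = coordinate_ascent(candidates, relevant, n_features=2)
final_ranking = rerank(candidates, weights)
```

With equal weights the relevant document `d2` sits in second place; after tuning, the BM25 weight is reduced and `d2` moves to the top. A real setup would optimize a metric averaged over many training queries rather than a single one.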
3/ Domain Transfer Learning
Domain Transfer Learning leverages external knowledge graphs or weak supervision data to guide the ranker and help it overcome data scarcity.
* Knowledge Enhancement
Knowledge Enhancement incorporates entity semantics from external knowledge graphs to enhance the neural ranker.
* Data Augmentation
Data Augmentation leverages weak supervision data to improve ranking accuracy in areas that lack large-scale relevance labels.
Stage | Model | Paper / Tool |
---|---|---|
1/ Sparse Retrieval | BM25 | Best Match25 ~Tool |
1/ Dense Retrieval | ANN | Approximate nearest neighbor ~Tool |
2/ Neural Ranker | K-NRM | End-to-End Neural Ad-hoc Ranking with Kernel Pooling ~Paper |
2/ Neural Ranker | Conv-KNRM | Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search ~Paper |
2/ Neural Ranker | TK | Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking ~Paper |
2/ Neural Ranker | BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding ~Paper |
2/ Feature Ensemble | Coordinate Ascent | Linear feature-based models for information retrieval. Information Retrieval ~Paper |
3/ Knowledge Enhancement | EDRM | Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval ~Paper |
3/ Data Augmentation | ReInfoSelect | Selective Weak Supervision for Neural Information Retrieval ~Paper |
Note that the BERT model follows Hugging Face's transformers implementation, so other BERT-like models, e.g. ELECTRA and SciBERT, are also available in our toolkit.
Installation
* Using pip (installs from the GitHub repository)
pip install git+https://github.com/thunlp/OpenMatch.git
* From Source
git clone https://github.com/thunlp/OpenMatch.git
cd OpenMatch
python setup.py install
* From Docker
To build an OpenMatch docker image from Dockerfile
docker build -t <image_name> .
To run your docker image just built above as a container
docker run --gpus all --name=<container_name> -it -v /:/all/ --rm <image_name>:<TAG>
Quick Start
* Detailed examples are available here.
import torch
import OpenMatch as om
query = "Classification treatment COVID-19"
doc = "By retrospectively tracking the dynamic changes of LYM% in death cases and cured cases, this study suggests that lymphocyte count is an effective and reliable indicator for disease classification and prognosis in COVID-19 patients."
* For bert-like models:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
input_ids = tokenizer.encode(query, doc)
model = om.models.Bert("allenai/scibert_scivocab_uncased")
ranking_score, ranking_features = model(torch.tensor(input_ids).unsqueeze(0))
* For other models:
tokenizer = om.data.tokenizers.WordTokenizer(pretrained="./data/glove.6B.300d.txt")
query_ids, query_masks = tokenizer.process(query, max_len=16)
doc_ids, doc_masks = tokenizer.process(doc, max_len=128)
model = om.models.KNRM(vocab_size=tokenizer.get_vocab_size(),
                       embed_dim=tokenizer.get_embed_dim(),
                       embed_matrix=tokenizer.get_embed_matrix())
ranking_score, ranking_features = model(torch.tensor(query_ids).unsqueeze(0),
                                        torch.tensor(query_masks).unsqueeze(0),
                                        torch.tensor(doc_ids).unsqueeze(0),
                                        torch.tensor(doc_masks).unsqueeze(0))
* The GloVe can be downloaded using:
wget http://nlp.stanford.edu/data/glove.6B.zip -P ./data
unzip ./data/glove.6B.zip -d ./data
* Evaluation
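metric = om.Metric()
# qrels (relevance judgments) and ranking_list (the ranking to evaluate) must be defined by the caller
ndcg = metric.get_metric(qrels, ranking_list, 'ndcg_cut_20')
mrr = metric.get_mrr(qrels, ranking_list, 'mrr_cut_10')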
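To make the two metric names above concrete, here is a minimal sketch of how NDCG@k and MRR@k can be computed for a single query. It is an independent illustration, not the code behind `om.Metric`: the function names, the linear gain in the DCG, and the toy relevance labels are assumptions.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mrr_at_k(ranked_doc_ids, relevance, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


# Toy relevance labels for one query: doc id -> graded relevance.
relevance = {"d1": 2, "d2": 0, "d3": 1}
ranking = ["d2", "d1", "d3"]

ndcg = ndcg_at_k(ranking, relevance, k=20)
mrr = mrr_at_k(ranking, relevance, k=10)
perfect = ndcg_at_k(["d1", "d3", "d2"], relevance, k=20)
```

Here the first relevant document appears at rank 2, so the reciprocal rank is 0.5. Reported scores are normally averaged over all queries in the evaluation set.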
Experiments
Retriever | Reranker | Coor-Ascent | ClueWeb09 | Robust04 | ClueWeb12 |
---|---|---|---|---|---|
SDM | KNRM | - | 0.1880 | 0.3016 | 0.0968 |
SDM | Conv-KNRM | - | 0.1894 | 0.2907 | 0.0896 |
SDM | EDRM | - | 0.2015 | 0.2993 | 0.0937 |
SDM | TK | - | 0.2306 | 0.2822 | 0.0966 |
SDM | BERT Base | - | 0.2701 | 0.4168 | 0.1183 |
SDM | ELECTRA Base | - | 0.2861 | 0.4668 | 0.1078 |
Retriever | Reranker | Coor-Ascent | dev | eval |
---|---|---|---|---|
BM25 | BERT Base | - | 0.349 | 0.345 |
BM25 | ELECTRA Base | - | 0.352 | 0.344 |
BM25 | RoBERTa Large | - | 0.386 | 0.375 |
BM25 | ELECTRA Large | - | 0.388 | 0.376 |
Retriever | Reranker | Coor-Ascent | dev | eval |
---|---|---|---|---|
ANCE FirstP | - | - | 0.373 | 0.334 |
ANCE MaxP | - | - | 0.383 | 0.342 |
ANCE FirstP+BM25 | BERT Base FirstP | + | 0.431 | 0.380 |
ANCE MaxP | BERT Base MaxP | + | 0.432 | 0.391 |
Methods | ClueWeb09-B | ClueWeb09-B | Robust04 | Robust04 | TREC-COVID | TREC-COVID |
---|---|---|---|---|---|---|
 | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] |
BM25 (Anserini) | 0.2773 | 0.1426 | 0.4129 | 0.1117 | 0.6979 | 0.7670 |
RankSVM (Dai et al.) | 0.289 | n.a. | 0.420 | n.a. | n.a. | n.a. |
RankSVM (OpenMatch) | 0.2825 | 0.1476 | 0.4309 | 0.1173 | 0.6995 | 0.7570 |
Coor-Ascent (Dai et al.) | 0.295 | n.a. | 0.427 | n.a. | n.a. | n.a. |
Coor-Ascent (OpenMatch) | 0.2969 | 0.1581 | 0.4340 | 0.1171 | 0.7041 | 0.7770 |
Contribution
Thanks to all the people who contributed to OpenMatch!
Kaitao Zhang, Si Sun, Zhenghao Liu, Aowei Lu
Project Organizers
- Zhiyuan Liu
  - Tsinghua University
  - Homepage
- Chenyan Xiong
  - Microsoft Research AI
  - Homepage
- Maosong Sun
  - Tsinghua University
  - Homepage
Citation
@inproceedings{openmatch,
author = {Liu, Zhenghao and Zhang, Kaitao and Xiong, Chenyan and Liu, Zhiyuan and Sun, Maosong},
title = {OpenMatch: An Open Source Library for Neu-IR Research},
booktitle = {Proceedings of SIGIR},
year = {2021},
url = {https://doi.org/10.1145/3404835.3462789},
pages = {2531--2535}
}