This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' published at ECIR'22.

Related tags

Deep Learningparm
Overview

Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval

This repository contains the code for the paper PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval and is partly based on the DPR Github repository. PARM is a Paragraph Aggregation Retrieval Model for dense document-to-document retrieval tasks, which liberates dense passage retrieval models from their limited input lenght and does retrieval on the paragraph-level.

We focus on the task of legal case retrieval and train and evaluate our models on the COLIEE 2021 data and evaluate our models on the CaseLaw collection.

The dense retrieval models are trained on the COLIEE data and can be found here. For training the dense retrieval model we utilize the DPR Github repository.

PARM Workflow

If you use our models or code, please cite our work:

@inproceedings{althammer2022parm,
      title={Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval}, 
      author={Althammer, Sophia and Hofstätter, Sebastian and Sertkan, Mete and Verberne, Suzan and Hanbury, Allan},
      year={2022},
      booktitle={Advances in Information Retrieval, 44rd European Conference on IR Research, ECIR 2022},
}

Training the dense retrieval model

The dense retrieval models need to be trained, either on the paragraph-level data of COLIEE Task2 or additionally on the document-level data of COLIEE Task1

  • ./DPR/train_dense_encoder.py: trains the dense bi-encoder (Step1)
python -m torch.distributed.launch --nproc_per_node=2 train_dense_encoder.py 
--max_grad_norm 2.0 
--encoder_model_type hf_bert 
--checkpoint_file_name --insert path to pretrained encoder checkpoint here if available-- 
--model_file  --insert path to pretrained chechpoint here if available-- 
--seed 12345 
--sequence_length 256 
--warmup_steps 1237 
--batch_size 22 
--do_lower_case 
--train_file --path to json train file-- 
--dev_file --path to json val file-- 
--output_dir --path to output directory--
--learning_rate 1e-05
--num_train_epochs 70
--dev_batch_size 22
--val_av_rank_start_epoch 60
--eval_per_epoch 1
--global_loss_buf_sz 250000

Generate dense embeddings index with trained DPR model

  • ./DPR/generate_dense_embeddings.py: encodes the corpus in the dense index (Step2)
python generate_dense_embeddings.py
--model_file --insert path to pretrained checkpoint here from Step1--
--pretrained_file  --insert path to pretrained chechpoint here from Step1--
--ctx_file --insert path to tsv file with documents in the corpus--
--out_file --insert path to output index--
--batch_size 750

Search in the dense index

  • ./DPR/dense_retriever.py: searches in the dense index the top n-docs (Step3)
python dense_retriever.py 
--model_file --insert path to pretrained checkpoint here from Step1--
--ctx_file --insert path to tsv file with documents in the corpus--
--qa_file --insert path to csv file with the queries--
--encoded_ctx_file --path to the dense index (.pkl format) from Step2--
--out_file --path to .json output file for search results--
--n-docs 1000

Poolout dense vectors for aggregation step

First you need to get the dense embeddings for the query paragraphs:

  • ./DPR/get_question_tensors.py: encodes the query paragraphs with the dense encoder checkpoint and stores the embeddings in the output file (Step4)
python get_question_tensors.py
--model_file --insert path to pretrained checkpoint here from Step1--
--qa_file --insert path to csv file with the queries--
--out_file --path to output file for output index--

Once you have the dense embeddings of the paragraphs in the index and of the questions, you do the vector-based aggregation step in PARM with VRRF (alternatively with Min, Max, Avg, Sum, VScores, VRanks) and evaluate the aggregated results

  • ./representation_aggregation.py: aggregates the run, stores and evaluates the aggregated run (Step5)
python representation_aggregation.py
--encoded_ctx_file --path to the encoded index (.pkl format) from Step2--
--encoded_qa_file  --path to the encoded queries (.pkl format) from Step4--
--output_top1000s --path to the top-n file (.json format) from Step3--
--label_file  --path to the label file (.json format)--
--aggregation_mode --choose from vrrf/vscores/vranks/sum/max/min/avg
--candidate_mode p_from_retrieved_list
--output_dir --path to output directory--
--output_file_name  --output file name--

Preprocessing

Preprocess COLIEE Task 1 data for dense retrieval

  • ./preprocessing/preprocess_coliee_2021_task1.py: preprocess the COLIEE Task 1 dataset by removing non-English text, removing non-informative summaries, removing tabs etc

Preprocess CaseLaw collection

  • ./preprocessing/caselaw_stat_corpus.py: preprocess the CaseLaw collection

Preprocess data for training the dense retrieval model

In order to train the dense retrieval models, the data needs to be preprocessed. For training and retrieval we split up the documents into their paragraphs.

  • ./preprocessing/preprocess_finetune_data_dpr_task1.py: preprocess the COLIEE Task 1 document-level labels for training the DPR model

  • ./preprocessing/preprocess_finetune_data_dpr.py: preprocess the COLIEE Task 2 paragraph-level labels for training the DPR model

Owner
Sophia Althammer
PhD student @TuVienna Interested in IR and NLP https://sophiaalthammer.github.io/ Currently working on the dossier project to https://dossier-project.eu/
Sophia Althammer
PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

StyleSpeech - PyTorch Implementation PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. Status (2021.06.13

Keon Lee 140 Dec 21, 2022
Code for the paper "Benchmarking and Analyzing Point Cloud Classification under Corruptions"

ModelNet-C Code for the paper "Benchmarking and Analyzing Point Cloud Classification under Corruptions". For the latest updates, see: sites.google.com

Jiawei Ren 45 Dec 28, 2022
Jetson Nano-based smart camera system that measures crowd face mask usage in real-time.

MaskCam MaskCam is a prototype reference design for a Jetson Nano-based smart camera system that measures crowd face mask usage in real-time, with all

BDTI 212 Dec 29, 2022
kapre: Keras Audio Preprocessors

Kapre Keras Audio Preprocessors - compute STFT, ISTFT, Melspectrogram, and others on GPU real-time. Tested on Python 3.6 and 3.7 Why Kapre? vs. Pre-co

Keunwoo Choi 867 Dec 29, 2022
Part-Aware Data Augmentation for 3D Object Detection in Point Cloud

Part-Aware Data Augmentation for 3D Object Detection in Point Cloud This repository contains a reference implementation of our Part-Aware Data Augment

Jaeseok Choi 62 Jan 03, 2023
This is the second place solution for : UmojaHack Africa 2022: African Snake Antivenom Binding Challenge

UmojaHack-Africa-2022-African-Snake-Antivenom-Binding-Challenge This is the second place solution for : UmojaHack Africa 2022: African Snake Antivenom

Mami Mokhtar 10 Dec 03, 2022
This repository is based on Ultralytics/yolov5, with adjustments to enable rotate prediction boxes.

Rotate-Yolov5 This repository is based on Ultralytics/yolov5, with adjustments to enable rotate prediction boxes. Section I. Description The codes are

xinzelee 90 Dec 13, 2022
A different spin on dataclasses.

dataklasses Dataklasses is a library that allows you to quickly define data classes using Python type hints. Here's an example of how you use it: from

David Beazley 752 Nov 18, 2022
This repository contains FEDOT - an open-source framework for automated modeling and machine learning (AutoML)

package tests docs license stats support This repository contains FEDOT - an open-source framework for automated modeling and machine learning (AutoML

National Center for Cognitive Research of ITMO University 482 Dec 26, 2022
Text-to-Music Retrieval using Pre-defined/Data-driven Emotion Embeddings

Text2Music Emotion Embedding Text-to-Music Retrieval using Pre-defined/Data-driven Emotion Embeddings Reference Emotion Embedding Spaces for Matching

Minz Won 50 Dec 05, 2022
Tensors and neural networks in Haskell

Hasktorch Hasktorch is a library for tensors and neural networks in Haskell. It is an independent open source community project which leverages the co

hasktorch 920 Jan 04, 2023
TensorFlow 2 implementation of the Yahoo Open-NSFW model

TensorFlow 2 implementation of the Yahoo Open-NSFW model

Bosco Yung 101 Jan 01, 2023
Official code of Team Yao at Multi-Modal-Fact-Verification-2022

Official code of Team Yao at Multi-Modal-Fact-Verification-2022 A Multi-Modal Fact Verification dataset released as part of the De-Factify workshop in

Wei-Yao Wang 11 Nov 15, 2022
A tensorflow model that predicts if the image is of a cat or of a dog.

Quick intro Hello and thank you for your interest in my project! This is the backend part of a two-repo application. The other part can be found here

Tudor Matei 0 Mar 08, 2022
HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022

HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022 [Project page | Video] Getting sta

51 Nov 29, 2022
Dataset and Code for ICCV 2021 paper "Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme"

Dataset and Code for RealVSR Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme Xi Yang, Wangmeng Xiang,

Xi Yang 92 Jan 04, 2023
A multilingual version of MS MARCO passage ranking dataset

mMARCO A multilingual version of MS MARCO passage ranking dataset This repository presents a neural machine translation-based method for translating t

75 Dec 27, 2022
Official Pytorch implementation for video neural representation (NeRV)

NeRV: Neural Representations for Videos (NeurIPS 2021) Project Page | Paper | UVG Data Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, Abhinav S

hao 214 Dec 28, 2022
Run object detection model on the Raspberry Pi

Using TensorFlow Lite with Python is great for embedded devices based on Linux, such as Raspberry Pi.

Dimitri Yanovsky 6 Oct 08, 2022
Self-training for Few-shot Transfer Across Extreme Task Differences

Self-training for Few-shot Transfer Across Extreme Task Differences (STARTUP) Introduction This repo contains the official implementation of the follo

Cheng Perng Phoo 33 Oct 31, 2022