MetaAdaptRank

Code and data for the ACL 2021 paper: Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision.

Overview

This repository provides the implementation of meta-learning to reweight synthetic weak supervision data, as described in the paper.

CONTACT

For any questions, please contact Si Sun by email at [email protected] (email gets the quickest response); we will do our best to help :)

QUICKSTART

Python 3.7
PyTorch 1.5.0

0/ Data Preparation

First, download and prepare the required data into the data folder.

1 Contrastive Supervision Synthesis

1.1 Source-domain NLG training

  • We train two query generators (QG & ContrastQG) with the MS MARCO dataset using train_nlg.sh in the run_shells folder:

    bash train_nlg.sh
    
  • Optional arguments:

    --generator_mode            choices=['qg', 'contrastqg']
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --train_file                The path to the source-domain nlg training dataset
    --save_dir                  The path to save the checkpoints data; default: ../results
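
    For intuition, here is a minimal sketch of one QG training step, assuming the Hugging Face transformers library (train_nlg.sh wraps the full batched loop configured by the arguments above). The ContrastQG generator is trained analogously but conditions on a document pair (see 1.2.5).

    # Minimal sketch of one QG training step; the repo's trainer adds
    # batching, learning-rate scheduling, and checkpointing.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # Toy (document, query) pair standing in for an MS MARCO example.
    doc = "The Manhattan Project was a research effort during World War II."
    query = "what was the manhattan project"

    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(query, return_tensors="pt").input_ids

    loss = model(**inputs, labels=labels).loss  # teacher-forced seq2seq loss
    loss.backward()
    optimizer.step()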
    

1.2 Target-domain NLG inference

  • The whole NLG inference pipeline consists of five steps:

    • 1.2.1/ Data preprocess
    • 1.2.2/ Seed query generation
    • 1.2.3/ BM25 subset retrieval
    • 1.2.4/ Contrastive doc pairs sampling
    • 1.2.5/ Contrastive query generation
  • 1.2.1/ Data preprocess. Convert target-domain documents into the NLG input format using prepro_nlg_dataset.sh in the preprocess folder:

    bash prepro_nlg_dataset.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --input_path            The path to the target dataset
    --output_path           The path to save the preprocess data; default: ../data/prepro_target_data
    
  • 1.2.2/ Seed query generation. Generate seed queries for each target document with the trained QG model, using nlg_inference.sh in the run_shells folder:

    bash nlg_inference.sh
    
  • Optional arguments:

    --generator_mode            choices='qg'
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --target_dataset_name       choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_load_dir        The path to the pretrained QG checkpoints.
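
    As a rough sketch (nlg_inference.sh runs the real batched pipeline), seed query generation amounts to decoding, e.g. with beam search, from the trained QG model over each target document:

    # Sketch of seed query generation; load the trained QG checkpoint in
    # place of the base "t5-small" weights used here for illustration.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    model.eval()

    doc = "Coronaviruses are a group of RNA viruses that cause disease."
    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
    ids = model.generate(**inputs, max_length=32, num_beams=4)
    print(tokenizer.decode(ids[0], skip_special_tokens=True))  # seed query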
    
  • 1.2.3/ BM25 subset retrieval. Use BM25 to retrieve a document subset for each seed query, using do_subset_retrieve.sh in the bm25_retriever folder:

    bash do_subset_retrieve.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_folder      choices=['t5-small', 't5-base']
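
    Conceptually this step is plain BM25 ranking over the target corpus; a toy illustration with the rank_bm25 package (the repo's bm25_retriever does this at corpus scale):

    # Toy BM25 subset retrieval using the rank_bm25 package.
    from rank_bm25 import BM25Okapi

    docs = [
        "coronavirus vaccine trials began in 2020",
        "bm25 is a bag-of-words retrieval function",
        "transformers dominate text ranking benchmarks",
    ]
    bm25 = BM25Okapi([d.split() for d in docs])

    seed_query = "coronavirus vaccine"
    scores = bm25.get_scores(seed_query.split())
    subset = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]
    print([docs[i] for i in subset])  # retrieved subset for this seed query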
    
  • 1.2.4/ Contrastive doc pairs sampling. Sample contrastive doc pairs from the BM25-retrieved subset using sample_contrast_pairs.sh in the preprocess folder:

    bash sample_contrast_pairs.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_folder      choices=['t5-small', 't5-base']
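
    One plausible reading of this step, as a sketch (the exact sampling strategy follows the paper): pair documents that BM25 ranks higher with documents it ranks lower within the retrieved subset.

    # Hypothetical contrastive pair sampling from a BM25-ranked subset:
    # each pair puts a higher-ranked doc before a lower-ranked one.
    import itertools

    ranked_subset = ["doc_a", "doc_b", "doc_c", "doc_d"]  # best-to-worst by BM25
    pairs = list(itertools.combinations(ranked_subset, 2))
    print(pairs[:3])  # [('doc_a', 'doc_b'), ('doc_a', 'doc_c'), ('doc_a', 'doc_d')]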
    
  • 1.2.5/ Contrastive query generation. Use the trained ContrastQG model to generate new queries conditioned on the contrastive document pairs, using nlg_inference.sh in the run_shells folder:

    bash nlg_inference.sh
    
  • Optional arguments:

    --generator_mode            choices='contrastqg'
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --target_dataset_name       choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_load_dir        The path to the pretrained ContrastQG checkpoints.
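
    A sketch of ContrastQG inference, assuming the doc pair is concatenated into a single conditioning input (the real input template lives in the repo's code):

    # Sketch of contrastive query generation; the pair-concatenation
    # template and base weights here are assumptions for illustration.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")  # ContrastQG ckpt
    model.eval()

    d_pos = "mRNA vaccines teach cells to make a harmless spike protein."
    d_neg = "Influenza vaccines are reformulated every flu season."
    inputs = tokenizer(d_pos + " </s> " + d_neg, return_tensors="pt",
                       truncation=True, max_length=512)
    ids = model.generate(**inputs, max_length=32, num_beams=4)
    print(tokenizer.decode(ids[0], skip_special_tokens=True))  # new query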
    

2 Meta Learning to Reweight

2.1 Data Preprocess

  • Prepare the contrastive synthetic supervision data (CTSyncSup) into the data/synthetic_data folder.

    • CTSyncSup_clueweb09
    • CTSyncSup_robust04
    • CTSyncSup_trec-covid

    >> example data format (a hypothetical reading sketch appears at the end of this subsection)

  • Preprocess the target-domain datasets into the 5-fold cross-validation format using run_cv_preprocess.sh in the preprocess folder:

    bash run_cv_preprocess.sh
    
  • Optional arguments:

    --dataset_class         choices=['clueweb09', 'robust04', 'trec-covid']
    --input_path            The path to the target dataset
    --output_path           The path to save the preprocess data; default: ../data/prepro_target_data
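
    For orientation only, here is a hypothetical reader for the CTSyncSup data prepared above, assuming a JSONL file of (query, positive doc, negative doc) triples; the field names are made up, and the linked example data format is authoritative.

    # Hypothetical CTSyncSup reader; field names are assumptions, defer
    # to the repo's example data format.
    import json

    def read_ctsyncsup(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                ex = json.loads(line)
                yield ex["query"], ex["pos_doc"], ex["neg_doc"]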
    

2.2 Train and Test Models

  • The whole process of training and testing MetaAdaptRank consists of five steps:

    • 2.2.1/ Meta-pretraining. The model is trained on synthetic weak supervision data, where the synthetic examples are reweighted using meta-learning. The training fold of the target dataset serves as the target data that guides meta-reweighting.

    • 2.2.2/ Fine-tuning. The meta-pretrained model is further fine-tuned on the training folds of the target dataset.

    • 2.2.3/ Testing. The fine-tuned model is tested on the target dataset, and its last representation layer is dumped as LeToR features.

    • 2.2.4/ Ensemble. The LeToR features of the five fold models are combined and converted to coor-ascent format.

    • 2.2.5/ Coor-Ascent. Coordinate Ascent combines the LeToR features with the retrieval scores from the base retriever to produce the final ranking.

  • 2.2.1/ Meta-pretraining using train_meta_bert.sh in the run_shells folder:

    bash train_meta_bert.sh
    

    Optional arguments for meta-pretraining:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --train_dir             The path to the synthetic weak supervision data
    --target_dir            The path to the target dataset
    --save_dir              The path to save the output files and checkpoints; default: ../results
    

    Complete optional arguments can be seen in config.py in the scripts folder.
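
    To make the mechanics concrete, here is a self-contained toy of one meta learning-to-reweight step (a linear scorer with a pointwise loss stands in for the BERT ranker; the repo's version operates on the full model): take a virtual gradient step under per-example weights, measure the target-fold loss at the virtual parameters, and set the weights from the gradient of that loss.

    # Toy meta learning-to-reweight step (Ren et al.-style); a linear
    # scorer stands in for the BERT ranker.
    import torch

    torch.manual_seed(0)
    dim, lr = 8, 0.1
    w = torch.zeros(dim, requires_grad=True)             # ranker parameters
    optimizer = torch.optim.SGD([w], lr=lr)

    def per_example_loss(params, x, y):
        return torch.nn.functional.binary_cross_entropy_with_logits(
            x @ params, y, reduction="none")

    x_syn, y_syn = torch.randn(16, dim), torch.randint(0, 2, (16,)).float()
    x_tgt, y_tgt = torch.randn(4, dim), torch.randint(0, 2, (4,)).float()

    # 1) virtual SGD step on the weighted synthetic loss
    eps = torch.zeros(16, requires_grad=True)            # per-example weights
    syn_loss = (eps * per_example_loss(w, x_syn, y_syn)).sum()
    grad_w = torch.autograd.grad(syn_loss, w, create_graph=True)[0]
    w_virtual = w - lr * grad_w

    # 2) target loss at the virtual parameters, differentiated w.r.t. eps
    tgt_loss = per_example_loss(w_virtual, x_tgt, y_tgt).mean()
    grad_eps = torch.autograd.grad(tgt_loss, eps)[0]

    # 3) keep examples whose upweighting lowers the target loss; normalize
    weights = torch.clamp(-grad_eps, min=0.0)
    if weights.sum() > 0:
        weights = weights / weights.sum()

    # 4) real update on the reweighted synthetic loss
    optimizer.zero_grad()
    (weights.detach() * per_example_loss(w, x_syn, y_syn)).sum().backward()
    optimizer.step()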

  • 2.2.2/ Fine-tuning using train_metafine_bert.sh in the run_shells folder:

    bash train_metafine_bert.sh
    

    Optional arguments for fine-tuning:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --train_dir             The path to the target dataset
    --checkpoint_folder     The path to the checkpoint of the meta-pretrained model
    --save_dir              The path to save output files and checkpoint; default: ../results
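
    As a sketch of the fine-tuning objective (the pairwise margin loss here is an assumption; the repo's trainer defines the real one), one step on a target-domain triple looks like:

    # Sketch of one pairwise fine-tuning step; load the meta-pretrained
    # checkpoint in place of the base weights used here for illustration.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    query = "treatment for covid-19"
    pos = "Remdesivir received emergency authorization for COVID-19 treatment."
    neg = "The 2018 flu season saw elevated hospitalization rates."

    s_pos = model(**tok(query, pos, return_tensors="pt", truncation=True)).logits
    s_neg = model(**tok(query, neg, return_tensors="pt", truncation=True)).logits

    loss = torch.nn.functional.margin_ranking_loss(
        s_pos.view(-1), s_neg.view(-1), torch.ones(1), margin=1.0)
    loss.backward()
    optimizer.step()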
    
  • 2.2.3/ Testing. Test the fine-tuned model to collect LeToR features, using test.sh in the run_shells folder:

    bash test.sh
    

    Optional arguments for testing:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --target_dir            The path to the target evaluation dataset
    --checkpoint_folder     The path to the checkpoint of the fine-tuned model
    --save_dir              The path to save output files and the features file; default: ../results
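
    A sketch of what feature collection can look like, assuming the LeToR features are the last-layer [CLS] representation plus the relevance score (the exact feature definition follows the repo):

    # Sketch of collecting LeToR features for one query-document pair.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=1, output_hidden_states=True)
    model.eval()

    with torch.no_grad():
        out = model(**tok("query text", "candidate document", return_tensors="pt"))

    cls_vec = out.hidden_states[-1][0, 0]     # last-layer [CLS] representation
    score = out.logits.item()                 # relevance score
    features = torch.cat([cls_vec, torch.tensor([score])])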
    
  • 2.2.4/ Ensemble. Train and test one model per fold of the target dataset (5-fold cross-validation), then ensemble their output features and convert them to coor-ascent format using combine_features.sh in the ensemble folder:

    bash combine_features.sh
    

    Optional arguments for ensemble:

    --qrel_path             The path to the qrels of the target dataset
    --result_fold_1         The path to the testing result folder of the first fold model
    --result_fold_2         The path to the testing result folder of the second fold model
    --result_fold_3         The path to the testing result folder of the third fold model
    --result_fold_4         The path to the testing result folder of the fourth fold model
    --result_fold_5         The path to the testing result folder of the fifth fold model
    --save_dir              The path to save the ensembled `features.txt` file; default: ../combined_features
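
    The combined file follows RankLib's SVMlight-style feature format; a sketch of one row (feature values hypothetical):

    # One row of RankLib's feature-file format:
    #   <label> qid:<qid> 1:<v1> 2:<v2> ... # <docid>
    def ranklib_row(label, qid, values, docid):
        feats = " ".join(f"{i + 1}:{v:.6f}" for i, v in enumerate(values))
        return f"{label} qid:{qid} {feats} # {docid}"

    print(ranklib_row(1, "301", [0.82, 0.13, 7.5], "clueweb09-en0000-00-00000"))
    # -> 1 qid:301 1:0.820000 2:0.130000 3:7.500000 # clueweb09-en0000-00-00000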
    
  • 2.2.5/ Coor-Ascent. Run coordinate ascent using run_ranklib.sh in the ensemble folder:

    bash run_ranklib.sh
    

    Optional arguments for coor-ascent:

    --qrel_path             The path to the qrels of the target dataset
    --ranklib_path          The path to the ensembled features.
    

    The final evaluation results are written under the ranklib_path directory.
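
    run_ranklib.sh presumably wraps RankLib, where Coordinate Ascent is ranker type 4; a rough manual equivalent (paths hypothetical):

    # Rough manual equivalent of run_ranklib.sh; paths are hypothetical.
    import subprocess

    subprocess.run([
        "java", "-jar", "RankLib.jar",
        "-train", "combined_features/features.txt",
        "-ranker", "4",                      # 4 = Coordinate Ascent
        "-metric2t", "NDCG@20",
        "-save", "combined_features/ca.model",
    ], check=True)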

Results

All TREC run files reported in the paper can be found on Tsinghua Cloud.

Owner
THUNLP
Natural Language Processing Lab at Tsinghua University