[ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links

Overview

LinkBERT: A Knowledgeable Language Model Pretrained with Document Links

PRs Welcome arXiv PWC PWC

This repo provides the model, code & data of our paper: LinkBERT: Pretraining Language Models with Document Links (ACL 2022). [PDF] [HuggingFace Models]

Overview

LinkBERT is a new pretrained language model (improvement of BERT) that captures document links such as hyperlinks and citation links to include knowledge that spans across multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, besides using a single document as in BERT.

LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for knowledge-intensive tasks (e.g. question answering) and cross-document tasks (e.g. reading comprehension, document retrieval).

1. Pretrained Models

We release the pretrained LinkBERT (-base and -large sizes) for both the general domain and biomedical domain. These models have the same format as the HuggingFace BERT models, and you can easily switch them with LinkBERT models.

Model Size Domain Pretraining Corpus Download Link ( πŸ€— HuggingFace)
LinkBERT-base 110M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-base
LinkBERT-large 340M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-large
BioLinkBERT-base 110M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-base
BioLinkBERT-large 340M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-large

To use these models in πŸ€— Transformers:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

To fine-tune the models, see Section 2 & 3 below. When fine-tuned on downstream tasks, LinkBERT achieves the following results.
General benchmarks (MRQA and GLUE):

HotpotQA TriviaQA SearchQA NaturalQ NewsQA SQuAD GLUE
F1 F1 F1 F1 F1 F1 Avg score
BERT-base 76.0 70.3 74.2 76.5 65.7 88.7 79.2
LinkBERT-base 78.2 73.9 76.8 78.3 69.3 90.1 79.6
BERT-large 78.1 73.7 78.3 79.0 70.9 91.1 80.7
LinkBERT-large 80.8 78.2 80.5 81.0 72.6 92.7 81.1

Biomedical benchmarks (BLURB, MedQA, MMLU, etc): BioLinkBERT attains new state-of-the-art 😊

BLURB score PubMedQA BioASQ MedQA-USMLE
PubmedBERT-base 81.10 55.8 87.5 38.1
BioLinkBERT-base 83.39 70.2 91.4 40.0
BioLinkBERT-large 84.30 72.2 94.8 44.6
MMLU-professional medicine
GPT-3 (175 params) 38.7
UnifiedQA (11B params) 43.2
BioLinkBERT-large (340M params) 50.7

2. Set up environment and data

Environment

Run the following commands to create a conda environment:

conda create -n linkbert python=3.8
source activate linkbert
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 datasets==1.11.0 fairscale==0.4.0 wandb sklearn seqeval

Data

You can download the preprocessed datasets on which we evaluated LinkBERT from [here]. Simply download this zip file and unzip it. This includes:

  • MRQA question answering datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD)
  • BLURB biomedical NLP datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.)
  • MedQA-USMLE biomedical reasoning dataset.
  • MMLU-professional medicine reasoning dataset.

They are all preprocessed in the HuggingFace dataset format.

If you would like to preprocess the raw data from scratch, you can take the following steps:

  • First download the raw datasets from the original sources by following instructions in scripts/download_raw_data.sh
  • Then run the preprocessing scripts scripts/preprocess_{mrqa,blurb,medqa,mmlu}.py.

3. Fine-tune LinkBERT

Change the working directory to src/, and follow the instructions below for each dataset.

MRQA

To fine-tune for the MRQA datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD), run commands listed in run_examples_mrqa_linkbert-{base,large}.sh.

BLURB

To fine-tune for the BLURB biomedial datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.), run commands listed in run_examples_blurb_biolinkbert-{base,large}.sh.

MedQA & MMLU

To fine-tune for the MedQA-USMLE dataset, run commands listed in run_examples_medqa_biolinkbert-{base,large}.sh.

To evaluate the fine-tuned model additionally on MMLU-professional medicine, run the commands listed at the bottom of run_examples_medqa_biolinkbert-large.sh.

Reproducibility

We also provide Codalab worksheet, on which we record our experiments. You may find it useful for replicating the experiments using the same model, code, data, and environment.

Citation

If you find our work helpful, please cite the following:

@InProceedings{yasunaga2022linkbert,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LinkBERT: Pretraining Language Models with Document Links},
  year =    {2022},  
  booktitle = {Association for Computational Linguistics (ACL)},  
}
Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
(CVPR 2022) A minimalistic mapless end-to-end stack for joint perception, prediction, planning and control for self driving.

LAV Learning from All Vehicles Dian Chen, Philipp KrΓ€henbΓΌhl CVPR 2022 (also arXiV 2203.11934) This repo contains code for paper Learning from all veh

Dian Chen 300 Dec 15, 2022
[ICLR2021] Unlearnable Examples: Making Personal Data Unexploitable

Unlearnable Examples Code for ICLR2021 Spotlight Paper "Unlearnable Examples: Making Personal Data Unexploitable " by Hanxun Huang, Xingjun Ma, Sarah

Hanxun Huang 98 Dec 07, 2022
PyTorch implementation of Neural View Synthesis and Matching for Semi-Supervised Few-Shot Learning of 3D Pose

Neural View Synthesis and Matching for Semi-Supervised Few-Shot Learning of 3D Pose Release Notes The official PyTorch implementation of Neural View S

Angtian Wang 20 Oct 09, 2022
This is a template for the Non-autoregressive Deep Learning-Based TTS model (in PyTorch).

Non-autoregressive Deep Learning-Based TTS Template This is a template for the Non-autoregressive TTS model. It contains Data Preprocessing Pipeline D

Keon Lee 13 Dec 05, 2022
Generative vs Discriminative: Rethinking The Meta-Continual Learning (NeurIPS 2021)

Generative vs Discriminative: Rethinking The Meta-Continual Learning (NeurIPS 2021) In this repository we provide PyTorch implementations for GeMCL; a

4 Apr 15, 2022
Deep Two-View Structure-from-Motion Revisited

Deep Two-View Structure-from-Motion Revisited This repository provides the code for our CVPR 2021 paper Deep Two-View Structure-from-Motion Revisited.

Jianyuan Wang 145 Jan 06, 2023
Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection

Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection

61 Jan 07, 2023
Music Source Separation; Train & Eval & Inference piplines and pretrained models we used for 2021 ISMIR MDX Challenge.

Introduction 1. Usage (For MSS) 1.1 Prepare running environment 1.2 Use pretrained model 1.3 Train new MSS models from scratch 1.3.1 How to train 1.3.

Leo 100 Dec 25, 2022
Kaggle G2Net Gravitational Wave Detection : 2nd place solution

Kaggle G2Net Gravitational Wave Detection : 2nd place solution

Hiroshechka Y 33 Dec 26, 2022
Progressive Coordinate Transforms for Monocular 3D Object Detection

Progressive Coordinate Transforms for Monocular 3D Object Detection This repository is the official implementation of PCT. Introduction In this paper,

58 Nov 06, 2022
Code of TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

TVT Code of TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation Datasets: Digit: MNIST, SVHN, USPS Object: Office, Office-Home, Vi

37 Dec 15, 2022
Linear Variational State Space Filters

Linear Variational State Space Filters To set up the environment, use the provided scripts in the docker/ folder to build and run the codebase inside

0 Dec 13, 2021
Deep Learning and Reinforcement Learning Library for Scientists and Engineers πŸ”₯

TensorLayer is a novel TensorFlow-based deep learning and reinforcement learning library designed for researchers and engineers. It provides an extens

TensorLayer Community 7.1k Dec 29, 2022
πŸ’› Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 51 Jan 06, 2023
Weighted QMIX: Expanding Monotonic Value Function Factorisation

This repo contains the cleaned-up code that was used in "Weighted QMIX: Expanding Monotonic Value Function Factorisation"

whirl 82 Dec 29, 2022
Sign Language Translation with Transformers (COLING'2020, ECCV'20 SLRTP Workshop)

transformer-slt This repository gathers data and code supporting the experiments in the paper Better Sign Language Translation with STMC-Transformer.

Kayo Yin 107 Dec 27, 2022
This repository gives an example on how to preprocess the data of the HECKTOR challenge

HECKTOR 2021 challenge This repository gives an example on how to preprocess the data of the HECKTOR challenge. Any other preprocessing is welcomed an

56 Dec 01, 2022
Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP"

DiLBERT Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP" Pretrained Model The pretrained model presented in the paper is

Kevin Roitero 2 Dec 15, 2022