[ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links

Overview

LinkBERT: A Knowledgeable Language Model Pretrained with Document Links

PRs Welcome arXiv PWC PWC

This repo provides the model, code & data of our paper: LinkBERT: Pretraining Language Models with Document Links (ACL 2022). [PDF] [HuggingFace Models]

Overview

LinkBERT is a new pretrained language model (improvement of BERT) that captures document links such as hyperlinks and citation links to include knowledge that spans across multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, besides using a single document as in BERT.

LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for knowledge-intensive tasks (e.g. question answering) and cross-document tasks (e.g. reading comprehension, document retrieval).

1. Pretrained Models

We release the pretrained LinkBERT (-base and -large sizes) for both the general domain and biomedical domain. These models have the same format as the HuggingFace BERT models, and you can easily switch them with LinkBERT models.

Model Size Domain Pretraining Corpus Download Link ( πŸ€— HuggingFace)
LinkBERT-base 110M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-base
LinkBERT-large 340M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-large
BioLinkBERT-base 110M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-base
BioLinkBERT-large 340M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-large

To use these models in πŸ€— Transformers:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

To fine-tune the models, see Section 2 & 3 below. When fine-tuned on downstream tasks, LinkBERT achieves the following results.
General benchmarks (MRQA and GLUE):

HotpotQA TriviaQA SearchQA NaturalQ NewsQA SQuAD GLUE
F1 F1 F1 F1 F1 F1 Avg score
BERT-base 76.0 70.3 74.2 76.5 65.7 88.7 79.2
LinkBERT-base 78.2 73.9 76.8 78.3 69.3 90.1 79.6
BERT-large 78.1 73.7 78.3 79.0 70.9 91.1 80.7
LinkBERT-large 80.8 78.2 80.5 81.0 72.6 92.7 81.1

Biomedical benchmarks (BLURB, MedQA, MMLU, etc): BioLinkBERT attains new state-of-the-art 😊

BLURB score PubMedQA BioASQ MedQA-USMLE
PubmedBERT-base 81.10 55.8 87.5 38.1
BioLinkBERT-base 83.39 70.2 91.4 40.0
BioLinkBERT-large 84.30 72.2 94.8 44.6
MMLU-professional medicine
GPT-3 (175 params) 38.7
UnifiedQA (11B params) 43.2
BioLinkBERT-large (340M params) 50.7

2. Set up environment and data

Environment

Run the following commands to create a conda environment:

conda create -n linkbert python=3.8
source activate linkbert
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 datasets==1.11.0 fairscale==0.4.0 wandb sklearn seqeval

Data

You can download the preprocessed datasets on which we evaluated LinkBERT from [here]. Simply download this zip file and unzip it. This includes:

  • MRQA question answering datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD)
  • BLURB biomedical NLP datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.)
  • MedQA-USMLE biomedical reasoning dataset.
  • MMLU-professional medicine reasoning dataset.

They are all preprocessed in the HuggingFace dataset format.

If you would like to preprocess the raw data from scratch, you can take the following steps:

  • First download the raw datasets from the original sources by following instructions in scripts/download_raw_data.sh
  • Then run the preprocessing scripts scripts/preprocess_{mrqa,blurb,medqa,mmlu}.py.

3. Fine-tune LinkBERT

Change the working directory to src/, and follow the instructions below for each dataset.

MRQA

To fine-tune for the MRQA datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD), run commands listed in run_examples_mrqa_linkbert-{base,large}.sh.

BLURB

To fine-tune for the BLURB biomedial datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.), run commands listed in run_examples_blurb_biolinkbert-{base,large}.sh.

MedQA & MMLU

To fine-tune for the MedQA-USMLE dataset, run commands listed in run_examples_medqa_biolinkbert-{base,large}.sh.

To evaluate the fine-tuned model additionally on MMLU-professional medicine, run the commands listed at the bottom of run_examples_medqa_biolinkbert-large.sh.

Reproducibility

We also provide Codalab worksheet, on which we record our experiments. You may find it useful for replicating the experiments using the same model, code, data, and environment.

Citation

If you find our work helpful, please cite the following:

@InProceedings{yasunaga2022linkbert,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LinkBERT: Pretraining Language Models with Document Links},
  year =    {2022},  
  booktitle = {Association for Computational Linguistics (ACL)},  
}
Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
Pytorch version of VidLanKD: Improving Language Understanding viaVideo-Distilled Knowledge Transfer

VidLanKD Implementation of VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer by Zineng Tang, Jaemin Cho, Hao Tan, Mohi

Zineng Tang 54 Dec 20, 2022
Graph InfoClust: Leveraging cluster-level node information for unsupervised graph representation learning

Graph-InfoClust-GIC [PAKDD 2021] PAKDD'21 version Graph InfoClust: Maximizing Coarse-Grain Mutual Information in Graphs Preprint version Graph InfoClu

Costas Mavromatis 21 Dec 03, 2022
Official PyTorch implementation of MAAD: A Model and Dataset for Attended Awareness

MAAD: A Model for Attended Awareness in Driving Install // Datasets // Training // Experiments // Analysis // License Official PyTorch implementation

7 Oct 16, 2022
Neural network pruning for finding a sparse computational model for controlling a biological motor task.

MothPruning Scientific Overview Originally inspired by biological nervous systems, deep neural networks (DNNs) are powerful computational tools for mo

Olivia Thomas 0 Dec 14, 2022
Vehicle Detection Using Deep Learning and YOLO Algorithm

VehicleDetection Vehicle Detection Using Deep Learning and YOLO Algorithm Dataset take or find vehicle images for create a special dataset for fine-tu

Maryam Boneh 96 Jan 05, 2023
A GridMixup augmentation, inspired by GridMask and CutMix

GridMixup A GridMixup augmentation, inspired by GridMask and CutMix Easy install pip install git+https://github.com/IlyaDobrynin/GridMixup.git Overvie

IlyaDo 42 Dec 28, 2022
Sarus implementation of classical ML models. The models are implemented using the Keras API of tensorflow 2. Vizualization are implemented and can be seen in tensorboard.

Sarus published models Sarus implementation of classical ML models. The models are implemented using the Keras API of tensorflow 2. Vizualization are

Sarus Technologies 39 Aug 19, 2022
This project is a loose implementation of paper "Algorithmic Financial Trading with Deep Convolutional Neural Networks: Time Series to Image Conversion Approach"

Stock Market Buy/Sell/Hold prediction Using convolutional Neural Network This repo is an attempt to implement the research paper titled "Algorithmic F

Asutosh Nayak 136 Dec 28, 2022
[3DV 2021] A Dataset-Dispersion Perspective on Reconstruction Versus Recognition in Single-View 3D Reconstruction Networks

dispersion-score Official implementation of 3DV 2021 Paper A Dataset-dispersion Perspective on Reconstruction versus Recognition in Single-view 3D Rec

Yefan 7 May 28, 2022
Reinforcement Learning via Supervised Learning

Reinforcement Learning via Supervised Learning Installation Run pip install -e . in an environment with Python = 3.7.0, 3.9. The code depends on MuJ

Scott Emmons 49 Nov 28, 2022
PyTorch 1.5 implementation for paper DECOR-GAN: 3D Shape Detailization by Conditional Refinement.

DECOR-GAN PyTorch 1.5 implementation for paper DECOR-GAN: 3D Shape Detailization by Conditional Refinement, Zhiqin Chen, Vladimir G. Kim, Matthew Fish

Zhiqin Chen 72 Dec 31, 2022
A new benchmark for Icon Question Answering (IconQA) and a large-scale icon dataset Icon645.

IconQA About IconQA is a new diverse abstract visual question answering dataset that highlights the importance of abstract diagram understanding and c

Pan Lu 24 Dec 30, 2022
γ€ŠDual-Resolution Correspondence Network》(NeurIPS 2020)

Dual-Resolution Correspondence Network Dual-Resolution Correspondence Network, NeurIPS 2020 Dependency All dependencies are included in asset/dualrcne

Active Vision Laboratory 45 Nov 21, 2022
A list of all papers and resoureces on Semantic Segmentation

Semantic-Segmentation A list of all papers and resoureces on Semantic Segmentation. Dataset importance SemanticSegmentation_DL Some implementation of

Alan Tang 1.1k Dec 12, 2022
Get a Grip! - A robotic system for remote clinical environments.

Get a Grip! Within clinical environments, sterilization is an essential procedure for disinfecting surgical and medical instruments. For our engineeri

Jay Sharma 1 Jan 05, 2022
Self-supervised Label Augmentation via Input Transformations (ICML 2020)

Self-supervised Label Augmentation via Input Transformations Authors: Hankook Lee, Sung Ju Hwang, Jinwoo Shin (KAIST) Accepted to ICML 2020 Install de

hankook 96 Dec 29, 2022
TLDR: Twin Learning for Dimensionality Reduction

TLDR (Twin Learning for Dimensionality Reduction) is an unsupervised dimensionality reduction method that combines neighborhood embedding learning with the simplicity and effectiveness of recent self

NAVER 105 Dec 28, 2022
A voice recognition assistant similar to amazon alexa, siri and google assistant.

kenyan-Siri Build an Artificial Assistant Full tutorial (video) To watch the tutorial, click on the image below Installation For windows users (run th

Alison Parker 3 Aug 19, 2022
SoK: Vehicle Orientation Representations for Deep Rotation Estimation

SoK: Vehicle Orientation Representations for Deep Rotation Estimation Raymond H. Tu, Siyuan Peng, Valdimir Leung, Richard Gao, Jerry Lan This is the o

FIRE Capital One Machine Learning of the University of Maryland 12 Oct 07, 2022
Pytorch Implementation of Continual Learning With Filter Atom Swapping (ICLR'22 Spolight) Paper

Continual Learning With Filter Atom Swapping Pytorch Implementation of Continual Learning With Filter Atom Swapping (ICLR'22 Spolight) Paper If find t

11 Aug 29, 2022