[ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links

Overview

LinkBERT: A Knowledgeable Language Model Pretrained with Document Links

PRs Welcome arXiv PWC PWC

This repo provides the model, code & data of our paper: LinkBERT: Pretraining Language Models with Document Links (ACL 2022). [PDF] [HuggingFace Models]

Overview

LinkBERT is a new pretrained language model (improvement of BERT) that captures document links such as hyperlinks and citation links to include knowledge that spans across multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, besides using a single document as in BERT.

LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for knowledge-intensive tasks (e.g. question answering) and cross-document tasks (e.g. reading comprehension, document retrieval).

1. Pretrained Models

We release the pretrained LinkBERT (-base and -large sizes) for both the general domain and biomedical domain. These models have the same format as the HuggingFace BERT models, and you can easily switch them with LinkBERT models.

Model Size Domain Pretraining Corpus Download Link ( 🤗 HuggingFace)
LinkBERT-base 110M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-base
LinkBERT-large 340M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-large
BioLinkBERT-base 110M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-base
BioLinkBERT-large 340M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-large

To use these models in 🤗 Transformers:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

To fine-tune the models, see Section 2 & 3 below. When fine-tuned on downstream tasks, LinkBERT achieves the following results.
General benchmarks (MRQA and GLUE):

HotpotQA TriviaQA SearchQA NaturalQ NewsQA SQuAD GLUE
F1 F1 F1 F1 F1 F1 Avg score
BERT-base 76.0 70.3 74.2 76.5 65.7 88.7 79.2
LinkBERT-base 78.2 73.9 76.8 78.3 69.3 90.1 79.6
BERT-large 78.1 73.7 78.3 79.0 70.9 91.1 80.7
LinkBERT-large 80.8 78.2 80.5 81.0 72.6 92.7 81.1

Biomedical benchmarks (BLURB, MedQA, MMLU, etc): BioLinkBERT attains new state-of-the-art 😊

BLURB score PubMedQA BioASQ MedQA-USMLE
PubmedBERT-base 81.10 55.8 87.5 38.1
BioLinkBERT-base 83.39 70.2 91.4 40.0
BioLinkBERT-large 84.30 72.2 94.8 44.6
MMLU-professional medicine
GPT-3 (175 params) 38.7
UnifiedQA (11B params) 43.2
BioLinkBERT-large (340M params) 50.7

2. Set up environment and data

Environment

Run the following commands to create a conda environment:

conda create -n linkbert python=3.8
source activate linkbert
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 datasets==1.11.0 fairscale==0.4.0 wandb sklearn seqeval

Data

You can download the preprocessed datasets on which we evaluated LinkBERT from [here]. Simply download this zip file and unzip it. This includes:

  • MRQA question answering datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD)
  • BLURB biomedical NLP datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.)
  • MedQA-USMLE biomedical reasoning dataset.
  • MMLU-professional medicine reasoning dataset.

They are all preprocessed in the HuggingFace dataset format.

If you would like to preprocess the raw data from scratch, you can take the following steps:

  • First download the raw datasets from the original sources by following instructions in scripts/download_raw_data.sh
  • Then run the preprocessing scripts scripts/preprocess_{mrqa,blurb,medqa,mmlu}.py.

3. Fine-tune LinkBERT

Change the working directory to src/, and follow the instructions below for each dataset.

MRQA

To fine-tune for the MRQA datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD), run commands listed in run_examples_mrqa_linkbert-{base,large}.sh.

BLURB

To fine-tune for the BLURB biomedial datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.), run commands listed in run_examples_blurb_biolinkbert-{base,large}.sh.

MedQA & MMLU

To fine-tune for the MedQA-USMLE dataset, run commands listed in run_examples_medqa_biolinkbert-{base,large}.sh.

To evaluate the fine-tuned model additionally on MMLU-professional medicine, run the commands listed at the bottom of run_examples_medqa_biolinkbert-large.sh.

Reproducibility

We also provide Codalab worksheet, on which we record our experiments. You may find it useful for replicating the experiments using the same model, code, data, and environment.

Citation

If you find our work helpful, please cite the following:

@InProceedings{yasunaga2022linkbert,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LinkBERT: Pretraining Language Models with Document Links},
  year =    {2022},  
  booktitle = {Association for Computational Linguistics (ACL)},  
}
Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
This repository contains the code and models necessary to replicate the results of paper: How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective

Black-Box-Defense This repository contains the code and models necessary to replicate the results of our recent paper: How to Robustify Black-Box ML M

OPTML Group 2 Oct 05, 2022
This is the official github repository of the Met dataset

The Met dataset This is the official github repository of the Met dataset. The official webpage of the dataset can be found here. What is it? This cod

Nikolaos-Antonios Ypsilantis 35 Dec 17, 2022
Official Implementation of 'UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers' ICLR 2021(spotlight)

UPDeT Official Implementation of UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers (ICLR 2021 spotlight) The

hhhusiyi 96 Dec 22, 2022
This is a collection of all challenges in HKCERT CTF 2021

香港網絡保安新生代奪旗挑戰賽 2021 (HKCERT CTF 2021) This is a collection of all challenges (and writeups) in HKCERT CTF 2021 Challenges ID Chinese name Name Score S

10 Jan 27, 2022
Modified prey-predator system - Modified prey–predator model describes the rate of change for each species by adding coupling terms.

Modified prey-predator system We aim to study the behaviors of the modified prey–predator model and establish the effects of several parameters that p

Seoyoung Oh 1 Jan 02, 2022
A tiny, friendly, strong baseline code for Person-reID (based on pytorch).

Pytorch ReID Strong, Small, Friendly A tiny, friendly, strong baseline code for Person-reID (based on pytorch). Strong. It is consistent with the new

Zhedong Zheng 3.5k Jan 08, 2023
Decensoring Hentai with Deep Neural Networks. Formerly named DeepMindBreak.

DeepCreamPy Decensoring Hentai with Deep Neural Networks. Formerly named DeepMindBreak. A deep learning-based tool to automatically replace censored a

616 Jan 06, 2023
Learning Super-Features for Image Retrieval

Learning Super-Features for Image Retrieval This repository contains the code for running our FIRe model presented in our ICLR'22 paper: @inproceeding

NAVER 101 Dec 28, 2022
This is the official PyTorch implementation for "Mesa: A Memory-saving Training Framework for Transformers".

A Memory-saving Training Framework for Transformers This is the official PyTorch implementation for Mesa: A Memory-saving Training Framework for Trans

Zhuang AI Group 105 Dec 06, 2022
ROS Basics and TurtleSim

Waypoint Follower Anna Garverick This package draws given waypoints, then waits for a service call with a start position to send the turtle to each wa

Anna Garverick 1 Dec 13, 2021
CKD - Collaborative Knowledge Distillation for Heterogeneous Information Network Embedding

Collaborative Knowledge Distillation for Heterogeneous Information Network Embed

zhousheng 9 Dec 05, 2022
Tensorflow implementation for "Improved Transformer for High-Resolution GANs" (NeurIPS 2021).

HiT-GAN Official TensorFlow Implementation HiT-GAN presents a Transformer-based generator that is trained based on Generative Adversarial Networks (GA

Google Research 78 Oct 31, 2022
Pytorch implementation of DeepMind's differentiable neural computer paper.

DNC pytorch This is a Pytorch implementation of DeepMind's Differentiable Neural Computer (DNC) architecture introduced in their recent Nature paper:

Yuanpu Xie 91 Nov 21, 2022
Image Super-Resolution by Neural Texture Transfer

SRNTT: Image Super-Resolution by Neural Texture Transfer Tensorflow implementation of the paper Image Super-Resolution by Neural Texture Transfer acce

Zhifei Zhang 413 Nov 30, 2022
EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos.

EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos. In this project, we provide the basic code for fitt

ZJU3DV 2.2k Jan 05, 2023
SpecAugmentPyTorch - A Pytorch (support batch and channel) implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

SpecAugment An implementation of SpecAugment for Pytorch How to use Install pytorch, version=1.9.0 (new feature (torch.Tensor.take_along_dim) is used

IMLHF 3 Oct 11, 2022
GitHub repository for the ICLR Computational Geometry & Topology Challenge 2021

ICLR Computational Geometry & Topology Challenge 2022 Welcome to the ICLR 2022 Computational Geometry & Topology challenge 2022 --- by the ICLR 2022 W

42 Dec 13, 2022
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain Mingchen Zhuge*, Dehong Gao*, Deng-Ping Fan#, Linbo Jin, Ben Chen, Haoming Zhou, Minghui

248 Dec 04, 2022
An original implementation of "MetaICL Learning to Learn In Context" by Sewon Min, Mike Lewis, Luke Zettlemoyer and Hannaneh Hajishirzi

MetaICL: Learning to Learn In Context This includes an original implementation of "MetaICL: Learning to Learn In Context" by Sewon Min, Mike Lewis, Lu

Meta Research 141 Jan 07, 2023