[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

Last update: Dec 07, 2022

Overview

SapBERT: Self-alignment pretraining for BERT

This repo holds code for the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations [arxiv]; and our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [PDF].

Huggingface Models

[SapBERT]

Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. Use [CLS] (before pooler) as the representation of the input.

[SapBERT-XLMR]

Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base as the base model. Use [CLS] (before pooler) as the representation of the input.

[SapBERT-mean-token]

Same as the standard SapBERT but trained with mean-pooling instead of [CLS] representations.

Environment

The code is tested with python 3.8, torch 1.7.0 and huggingface transformers 4.4.2. Please view requirements.txt for more details.

Train SapBERT

Prepare training data as insrtructed in data/generate_pretraining_data.ipynb.

Run:

cd umls_pretraining
./pretrain.sh 0,1

where 0,1 specifies the GPU devices.

Evaluate SapBERT

Please view evaluation/README.md for details.

Citations

@article{liu2021self,
	title={Self-Alignment Pretraining for Biomedical Entity Representations},
	author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
	journal={arXiv preprint arXiv:2010.11784},
	year={2020}
}

Acknowledgement

Parts of the code are modified from BioSyn. We appreciate the authors for making BioSyn open-sourced.

License

SapBERT is MIT licensed. See the LICENSE file for details.

[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

Related tags

Overview

SapBERT: Self-alignment pretraining for BERT

Huggingface Models

[SapBERT]

[SapBERT-XLMR]

[SapBERT-mean-token]

Environment

Train SapBERT

Evaluate SapBERT

Citations

Acknowledgement

License

Owner

Cambridge Language Technology Lab

PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Tooling for the Common Objects In 3D dataset.

Code for CMaskTrack R-CNN (proposed in Occluded Video Instance Segmentation)

Code for GNMR in ICDE 2021

For visualizing the dair-v2x-i dataset

Official implementation of Rich Semantics Improve Few-Shot Learning (BMVC, 2021)

Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph".

A Nim frontend for pytorch, aiming to be mostly auto-generated and internally using ATen.

Hyperbolic Image Segmentation, CVPR 2022

Saliency - Framework-agnostic implementation for state-of-the-art saliency methods (XRAI, BlurIG, SmoothGrad, and more).

Research on Tabular Deep Learning (Python package & papers)

Unimodal Face Classification with Multimodal Training

4D Human Body Capture from Egocentric Video via 3D Scene Grounding

An onlinel learning to rank python codebase.

Project for music generation system based on object tracking and CGAN

This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

Selective Wavelet Attention Learning for Single Image Deraining

Exploring Classification Equilibrium in Long-Tailed Object Detection, ICCV2021

Harmonious Textual Layout Generation over Natural Images via Deep Aesthetics Learning

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English