UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

Related tags

Deep LearningUmlsBERT
Overview

UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

General info

This is the code that was used of the paper : UmlsBERT: Augmenting Contextual Embeddings with a Clinical Metathesaurus (NAACL 2021).

In this work, we introduced UmlsBERT, a contextual embedding model capable of integrating domain knowledge during pre-training. It was trained on biomedical corpora and uses the Unified Medical Language System (UMLS) clinical metathesaurus in two ways:

  • We proposed a new multi-label loss function for the pre-training of the Masked Language Modelling (Masked LM) task of UmlsBERT that considers the connections between medical words using the CUI attribute of UMLS.

  • We introduced a semantic group embedding that enriches the input embeddings process of UmlsBERT by forcing the model to take into consideration the association of the words that are part of the same semantic group.

Technologies

This project was created with python 3.7 and PyTorch 0.4.1 and it is based on the transformer github repo of the huggingface team

Setup

We recommend installing and running the code from within a virtual environment.

Creating a Conda Virtual Environment

First, download Anaconda from this link

Second, create a conda environment with python 3.7.

$ conda create -n umlsbert python=3.7

Upon restarting your terminal session, you can activate the conda environment:

$ conda activate umlsbert 

Install the required python packages

In the project root directory, run the following to install the required packages.

pip3 install -r requirements.txt

Install from a VM

If you start a VM, please run the following command sequentially before install the required python packages. The following code example is for a vast.ai Virtual Machine.

apt-get update
apt install git-all
apt install python3-pip
apt-get install jupyter

Dowload pre-trained UmlsBERT model

In order to use pre-trained UmlsBERT model for the word embeddings (or the semantic embeddings), you need to dowload it into the folder examples/checkpoint/ from the link:

 wget -O umlsbert.tar.xz https://www.dropbox.com/s/kziiuyhv9ile00s/umlsbert.tar.xz?dl=0

into the folder examples/checkpoint/ and unzip it with the following command:

tar -xvf umlsbert.tar.xz

Reproduce UmlsBERT

Pretraining

  • The UmlsBERT was pretrained on the MIMIC data. Unfortunately, we cannot provide the text of the MIMIC III dataset as training course is mandatory in order to access the particular dataset.

  • The MIMIC III dataset can be downloaded from the following link

  • The pretraining an UmlsBERT model depends on data from NLTK so you'll have to download them. Run the Python interpreter (python3) and type the commands:

>>> import nltk
>>> nltk.download('punkt')
  • After downloading the NOTEEVENTS table in the examples/language-modeling/ folder, run the following python code that we provide in the examples/language-modeling/ folder to create the mimic_string.txt on the folder examples/language-modeling/:
python3 mimic.py

you can pre-trained a UmlsBERT model by running the following command on the examples/language-modeling/:

Example for pretraining Bio_clinicalBert:

python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1  --model_name_or_path  emilyalsentzer/Bio_ClinicalBERT  --mlm     --do_train     --learning_rate 5e-5     --max_steps 150000   --block_size 128   --save_steps 1000     --per_gpu_train_batch_size 32     --seed 42     --line_by_line      --train_data_file mimic_string.txt  --umls --config_name  config.json --med_document ./voc/vocab_updated.txt

Downstream Tasks

MedNLi task

  • MedNLI is available through the MIMIC-III derived data repository. Any individual certified to access MIMIC-III can access MedNLI through the following link

    • Converting into an appropriate format: After downloading and unzipping the MedNLi dataset (mednli-a-natural-language-inference-dataset-for-the-clinical-domain-1.0.0.zip) on the folder examples/text-classification/dataset/mednli/, run the following python code in the examples/text-classification/dataset/mednli/ folder that we provide in order to convert the dataset into a format that is appropriate for the UmlsBERT model
python3  mednli.py
  • This python code will create the files: train.tsv,dev_matched.tsv and test_matched.tsv in the text-classification/dataset/mednli/mednli folder
  • We provide an example-notebook under the folder experiements/:

or directly run UmlsBert on the text-classification/ folder:

python3 run_glue.py --output_dir ./models/medicalBert-v1 --model_name_or_path  ../checkpoint/umlsbert   --data_dir  dataset/mednli/mednli  --num_train_epochs 3 --per_device_train_batch_size 32  --learning_rate 1e-4   --do_train --do_eval  --do_predict  --task_name mnli --umls --med_document ./voc/vocab_updated.txt

NER task

  • Due to the copyright issue of i2b2 datasets, in order to download them follow the link.

    • Converting into an appropriate format: Since we wanted to directly compare with the Bio_clinical_Bert we used their code in order to convert the i2b2 dataset to a format which is appropriate for the BERT architecture which can be found in the following link: link

    We provide the code for converting the i2b2 dataset with the following instruction for each dataset:

  • i2b2 2006:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2006_deid unzip the deid_surrogate_test_all_groundtruth_version2.zip and deid_surrogate_train_all_version2.zip
    • run the create.sh scrip with the command ./create.sh
    • The script will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2006 folder
  • i2b2 2010:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2010_relations unzip the test_data.tar.gz, concept_assertion_relation_training_data.tar.gz and reference_standard_for_test_data.tar.gz
    • Run the jupyter notebook Reformat.ipynb
    • The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2010 folder
  • i2b2 2012:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2012 unzip the 2012-07-15.original-annotation.release.tar.gz and 2012-08-08.test-data.event-timex-groundtruth.tar.gz
    • Run the jupyter notebook Reformat.ipynb
    • The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2012 folder
  • i2b2 2014:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2014_deid_hf_risk unzip the 2014_training-PHI-Gold-Set1.tar.gz,training-PHI-Gold-Set2.tar.gz and testing-PHI-Gold-fixed.tar.gz
    • Run the jupyter notebook Reformat.ipynb
    • The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2014 folder
  • We provide an example-notebook under the folder experiements/:

or directly run UmlsBert on the token-classification/ folder:

python3 run_ner.py --output_dir ./models/medicalBert-v1 --model_name_or_path  ../checkpoint/umlsbert    --labels dataset/NER/2006/label.txt --data_dir  dataset/NER/2006 --do_train --num_train_epochs 20 --per_device_train_batch_size 32  --learning_rate 1e-4  --do_predict --do_eval --umls --med_document ./voc/vocab_updated.txt

If you find our work useful, can cite our paper using:

@misc{michalopoulos2020umlsbert,
      title={UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus}, 
      author={George Michalopoulos and Yuanxin Wang and Hussam Kaka and Helen Chen and Alex Wong},
      year={2020},
      eprint={2010.10391},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Meta graph convolutional neural network-assisted resilient swarm communications

Resilient UAV Swarm Communications with Graph Convolutional Neural Network This repository contains the source codes of Resilient UAV Swarm Communicat

62 Dec 06, 2022
A playable implementation of Fully Convolutional Networks with Keras.

keras-fcn A re-implementation of Fully Convolutional Networks with Keras Installation Dependencies keras tensorflow Install with pip $ pip install git

JihongJu 202 Sep 07, 2022
An official implementation of "Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation" (ICCV 2021) in PyTorch.

Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation This is an official implementation of the paper "Exploiting a Joint

CV Lab @ Yonsei University 35 Oct 26, 2022
本步态识别系统主要基于GaitSet模型进行实现

本步态识别系统主要基于GaitSet模型进行实现。在尝试部署本系统之前,建立理解GaitSet模型的网络结构、训练和推理方法。 系统的实现效果如视频所示: 演示视频 由于模型较大,部分模型文件存储在百度云盘。 链接提取码:33mb 具体部署过程 1.下载代码 2.安装requirements.txt

16 Oct 22, 2022
Pytorch implementation of our paper accepted by NeurIPS 2021 -- Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme

Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme (NeurIPS2021) (Link) Overview Prerequisites Linu

Shaojie Li 34 Mar 31, 2022
A TensorFlow implementation of SOFA, the Simulator for OFfline LeArning and evaluation.

SOFA This repository is the implementation of SOFA, the Simulator for OFfline leArning and evaluation. Keeping Dataset Biases out of the Simulation: A

22 Nov 23, 2022
DeepAL: Deep Active Learning in Python

DeepAL: Deep Active Learning in Python Python implementations of the following active learning algorithms: Random Sampling Least Confidence [1] Margin

Kuan-Hao Huang 583 Jan 03, 2023
Repositorio oficial del curso IIC2233 Programación Avanzada 🚀✨

IIC2233 - Programación Avanzada Evaluación Las evaluaciones serán efectuadas por medio de actividades prácticas en clases y tareas. Se calculará la no

IIC2233 @ UC 47 Sep 06, 2022
The code for paper "Learning Implicit Fields for Generative Shape Modeling".

implicit-decoder The tensorflow code for paper "Learning Implicit Fields for Generative Shape Modeling", Zhiqin Chen, Hao (Richard) Zhang. Project pag

Zhiqin Chen 353 Dec 30, 2022
A Pytorch Implementation of [Source data‐free domain adaptation of object detector through domain

A Pytorch Implementation of Source data‐free domain adaptation of object detector through domain‐specific perturbation Please follow Faster R-CNN and

1 Dec 25, 2021
A Review of Deep Learning Techniques for Markerless Human Motion on Synthetic Datasets

HOW TO USE THIS PROJECT A Review of Deep Learning Techniques for Markerless Human Motion on Synthetic Datasets Based on DeepLabCut toolbox, we run wit

1 Jan 10, 2022
[Link]mareteutral - pars tradg wth M []

pairs-trading-with-ML Jonathan Larkin, August 2017 One popular strategy classification is Pairs Trading. Though this category of strategies can exhibi

Jonathan Larkin 134 Jan 06, 2023
Official PyTorch implementation of the paper "Graph-based Generative Face Anonymisation with Pose Preservation" in ICIAP 2021

Contents AnonyGAN Installation Dataset Preparation Generating Images Using Pretrained Model Train and Test New Models Evaluation Acknowledgments Citat

Nicola Dall'Asen 10 May 24, 2022
Official PyTorch Implementation for InfoSwap: Information Bottleneck Disentanglement for Identity Swapping

InfoSwap: Information Bottleneck Disentanglement for Identity Swapping Code usage Please check out the user manual page. Paper Gege Gao, Huaibo Huang,

Grace Hešeri 56 Dec 20, 2022
Keras Implementation of The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation by (Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio)

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation: Work In Progress, Results can't be replicated yet with the m

Yad Konrad 196 Aug 30, 2022
Deep Learning for Morphological Profiling

Deep Learning for Morphological Profiling An end-to-end implementation of a ML System for morphological profiling using self-supervised learning to di

Danielh Carranza 0 Jan 20, 2022
Learning Confidence for Out-of-Distribution Detection in Neural Networks

Learning Confidence Estimates for Neural Networks This repository contains the code for the paper Learning Confidence for Out-of-Distribution Detectio

235 Jan 05, 2023
PyTorch implementation of Rethinking Positional Encoding in Language Pre-training

TUPE PyTorch implementation of Rethinking Positional Encoding in Language Pre-training. Quickstart Clone this repository. git clone https://github.com

Jake Tae 5 Jan 27, 2022
A list of awesome PyTorch scholarship articles, guides, blogs, courses and other resources.

Awesome PyTorch Scholarship Resources A collection of awesome PyTorch and Python learning resources. Contributions are always welcome! Course Informat

Arnas Gečas 302 Dec 03, 2022
Official pytorch implementation of Rainbow Memory (CVPR 2021)

Rainbow Memory: Continual Learning with a Memory of Diverse Samples

Clova AI Research 91 Dec 17, 2022