UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

Related tags

Deep LearningUmlsBERT
Overview

UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

General info

This is the code that was used of the paper : UmlsBERT: Augmenting Contextual Embeddings with a Clinical Metathesaurus (NAACL 2021).

In this work, we introduced UmlsBERT, a contextual embedding model capable of integrating domain knowledge during pre-training. It was trained on biomedical corpora and uses the Unified Medical Language System (UMLS) clinical metathesaurus in two ways:

  • We proposed a new multi-label loss function for the pre-training of the Masked Language Modelling (Masked LM) task of UmlsBERT that considers the connections between medical words using the CUI attribute of UMLS.

  • We introduced a semantic group embedding that enriches the input embeddings process of UmlsBERT by forcing the model to take into consideration the association of the words that are part of the same semantic group.

Technologies

This project was created with python 3.7 and PyTorch 0.4.1 and it is based on the transformer github repo of the huggingface team

Setup

We recommend installing and running the code from within a virtual environment.

Creating a Conda Virtual Environment

First, download Anaconda from this link

Second, create a conda environment with python 3.7.

$ conda create -n umlsbert python=3.7

Upon restarting your terminal session, you can activate the conda environment:

$ conda activate umlsbert 

Install the required python packages

In the project root directory, run the following to install the required packages.

pip3 install -r requirements.txt

Install from a VM

If you start a VM, please run the following command sequentially before install the required python packages. The following code example is for a vast.ai Virtual Machine.

apt-get update
apt install git-all
apt install python3-pip
apt-get install jupyter

Dowload pre-trained UmlsBERT model

In order to use pre-trained UmlsBERT model for the word embeddings (or the semantic embeddings), you need to dowload it into the folder examples/checkpoint/ from the link:

 wget -O umlsbert.tar.xz https://www.dropbox.com/s/kziiuyhv9ile00s/umlsbert.tar.xz?dl=0

into the folder examples/checkpoint/ and unzip it with the following command:

tar -xvf umlsbert.tar.xz

Reproduce UmlsBERT

Pretraining

  • The UmlsBERT was pretrained on the MIMIC data. Unfortunately, we cannot provide the text of the MIMIC III dataset as training course is mandatory in order to access the particular dataset.

  • The MIMIC III dataset can be downloaded from the following link

  • The pretraining an UmlsBERT model depends on data from NLTK so you'll have to download them. Run the Python interpreter (python3) and type the commands:

>>> import nltk
>>> nltk.download('punkt')
  • After downloading the NOTEEVENTS table in the examples/language-modeling/ folder, run the following python code that we provide in the examples/language-modeling/ folder to create the mimic_string.txt on the folder examples/language-modeling/:
python3 mimic.py

you can pre-trained a UmlsBERT model by running the following command on the examples/language-modeling/:

Example for pretraining Bio_clinicalBert:

python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1  --model_name_or_path  emilyalsentzer/Bio_ClinicalBERT  --mlm     --do_train     --learning_rate 5e-5     --max_steps 150000   --block_size 128   --save_steps 1000     --per_gpu_train_batch_size 32     --seed 42     --line_by_line      --train_data_file mimic_string.txt  --umls --config_name  config.json --med_document ./voc/vocab_updated.txt

Downstream Tasks

MedNLi task

  • MedNLI is available through the MIMIC-III derived data repository. Any individual certified to access MIMIC-III can access MedNLI through the following link

    • Converting into an appropriate format: After downloading and unzipping the MedNLi dataset (mednli-a-natural-language-inference-dataset-for-the-clinical-domain-1.0.0.zip) on the folder examples/text-classification/dataset/mednli/, run the following python code in the examples/text-classification/dataset/mednli/ folder that we provide in order to convert the dataset into a format that is appropriate for the UmlsBERT model
python3  mednli.py
  • This python code will create the files: train.tsv,dev_matched.tsv and test_matched.tsv in the text-classification/dataset/mednli/mednli folder
  • We provide an example-notebook under the folder experiements/:

or directly run UmlsBert on the text-classification/ folder:

python3 run_glue.py --output_dir ./models/medicalBert-v1 --model_name_or_path  ../checkpoint/umlsbert   --data_dir  dataset/mednli/mednli  --num_train_epochs 3 --per_device_train_batch_size 32  --learning_rate 1e-4   --do_train --do_eval  --do_predict  --task_name mnli --umls --med_document ./voc/vocab_updated.txt

NER task

  • Due to the copyright issue of i2b2 datasets, in order to download them follow the link.

    • Converting into an appropriate format: Since we wanted to directly compare with the Bio_clinical_Bert we used their code in order to convert the i2b2 dataset to a format which is appropriate for the BERT architecture which can be found in the following link: link

    We provide the code for converting the i2b2 dataset with the following instruction for each dataset:

  • i2b2 2006:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2006_deid unzip the deid_surrogate_test_all_groundtruth_version2.zip and deid_surrogate_train_all_version2.zip
    • run the create.sh scrip with the command ./create.sh
    • The script will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2006 folder
  • i2b2 2010:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2010_relations unzip the test_data.tar.gz, concept_assertion_relation_training_data.tar.gz and reference_standard_for_test_data.tar.gz
    • Run the jupyter notebook Reformat.ipynb
    • The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2010 folder
  • i2b2 2012:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2012 unzip the 2012-07-15.original-annotation.release.tar.gz and 2012-08-08.test-data.event-timex-groundtruth.tar.gz
    • Run the jupyter notebook Reformat.ipynb
    • The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2012 folder
  • i2b2 2014:

    • In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2014_deid_hf_risk unzip the 2014_training-PHI-Gold-Set1.tar.gz,training-PHI-Gold-Set2.tar.gz and testing-PHI-Gold-fixed.tar.gz
    • Run the jupyter notebook Reformat.ipynb
    • The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2014 folder
  • We provide an example-notebook under the folder experiements/:

or directly run UmlsBert on the token-classification/ folder:

python3 run_ner.py --output_dir ./models/medicalBert-v1 --model_name_or_path  ../checkpoint/umlsbert    --labels dataset/NER/2006/label.txt --data_dir  dataset/NER/2006 --do_train --num_train_epochs 20 --per_device_train_batch_size 32  --learning_rate 1e-4  --do_predict --do_eval --umls --med_document ./voc/vocab_updated.txt

If you find our work useful, can cite our paper using:

@misc{michalopoulos2020umlsbert,
      title={UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus}, 
      author={George Michalopoulos and Yuanxin Wang and Hussam Kaka and Helen Chen and Alex Wong},
      year={2020},
      eprint={2010.10391},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
ELSED: Enhanced Line SEgment Drawing

ELSED: Enhanced Line SEgment Drawing This repository contains the source code of ELSED: Enhanced Line SEgment Drawing the fastest line segment detecto

Iago Suárez 125 Dec 31, 2022
Official implementation of "Can You Spot the Chameleon? Adversarially Camouflaging Images from Co-Salient Object Detection" in CVPR 2022.

Jadena Official implementation of "Can You Spot the Chameleon? Adversarially Camouflaging Images from Co-Salient Object Detection" in CVPR 2022. arXiv

Qing Guo 13 Nov 29, 2022
Differential fuzzing for the masses!

NEZHA NEZHA is an efficient and domain-independent differential fuzzer developed at Columbia University. NEZHA exploits the behavioral asymmetries bet

147 Dec 05, 2022
Open-AI's DALL-E for large scale training in mesh-tensorflow.

DALL-E in Mesh-Tensorflow [WIP] Open-AI's DALL-E in Mesh-Tensorflow. If this is similarly efficient to GPT-Neo, this repo should be able to train mode

EleutherAI 432 Dec 16, 2022
Best Practices on Recommendation Systems

Recommenders What's New (February 4, 2021) We have a new relase Recommenders 2021.2! It comes with lots of bug fixes, optimizations and 3 new algorith

Microsoft 14.8k Jan 03, 2023
DeepLab-ResNet rebuilt in TensorFlow

DeepLab-ResNet-TensorFlow This is an (re-)implementation of DeepLab-ResNet in TensorFlow for semantic image segmentation on the PASCAL VOC dataset. Fr

Vladimir 1.2k Nov 04, 2022
A modular, open and non-proprietary toolkit for core robotic functionalities by harnessing deep learning

A modular, open and non-proprietary toolkit for core robotic functionalities by harnessing deep learning Website • About • Installation • Using OpenDR

OpenDR 304 Dec 28, 2022
Official pytorch implementation of "DSPoint: Dual-scale Point Cloud Recognition with High-frequency Fusion"

DSPoint Official pytorch implementation of "DSPoint: Dual-scale Point Cloud Recognition with High-frequency Fusion" Coming soon, as soon as I finish a

Ziyao Zeng 14 Feb 26, 2022
PyTorch implementation of the YOLO (You Only Look Once) v2

PyTorch implementation of the YOLO (You Only Look Once) v2 The YOLOv2 is one of the most popular one-stage object detector. This project adopts PyTorc

申瑞珉 (Ruimin Shen) 433 Nov 24, 2022
Official PyTorch implementation and pretrained models of the paper Self-Supervised Classification Network

Self-Classifier: Self-Supervised Classification Network Official PyTorch implementation and pretrained models of the paper Self-Supervised Classificat

Elad Amrani 24 Dec 21, 2022
Learning to Draw: Emergent Communication through Sketching

Learning to Draw: Emergent Communication through Sketching This is the official code for the paper "Learning to Draw: Emergent Communication through S

19 Jul 22, 2022
Pytorch reimplementation of the Mixer (MLP-Mixer: An all-MLP Architecture for Vision)

MLP-Mixer Pytorch reimplementation of Google's repository for the MLP-Mixer (Not yet updated on the master branch) that was released with the paper ML

Eunkwang Jeon 18 Dec 08, 2022
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-t

Facebook Research 5.1k Jan 04, 2023
Simple embedding based text classifier inspired by fastText, implemented in tensorflow

FastText in Tensorflow This project is based on the ideas in Facebook's FastText but implemented in Tensorflow. However, it is not an exact replica of

Alan Patterson 306 Dec 02, 2022
Python scripts for performing lane detection using the LSTR model in ONNX

ONNX LSTR Lane Detection Python scripts for performing lane detection using the Lane Shape Prediction with Transformers (LSTR) model in ONNX. Requirem

Ibai Gorordo 29 Aug 30, 2022
🕵 Artificial Intelligence for social control of public administration

Non-tech crash course into Operação Serenata de Amor Tech crash course into Operação Serenata de Amor Contributing with code and tech skills Supportin

Open Knowledge Brasil - Rede pelo Conhecimento Livre 4.4k Dec 31, 2022
A simple code to convert image format and channel as well as resizing and renaming multiple images.

Rename-Resize-and-convert-multiple-images A simple code to convert image format and channel as well as resizing and renaming multiple images. This cod

Happy N. Monday 3 Feb 15, 2022
This repo is to be freely used by ML devs to check the GAN performances without coding from scratch.

GANs for Fun Created because I can! GOAL The goal of this repo is to be freely used by ML devs to check the GAN performances without coding from scrat

Sagnik Roy 13 Jan 26, 2022
Memory Defense: More Robust Classificationvia a Memory-Masking Autoencoder

Memory Defense: More Robust Classificationvia a Memory-Masking Autoencoder Authors: - Eashan Adhikarla - Dan Luo - Dr. Brian D. Davison Abstract Many

Eashan Adhikarla 4 Dec 25, 2022
Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Transformers for variable misuse, function naming and code completion tasks The official PyTorch implementation of: Empirical Study of Transformers fo

Bayesian Methods Research Group 56 Nov 15, 2022