A project for developing transformer-based models for clinical relation extraction

Overview

Clinical Relation Extraction with Transformers

Aim

This package is developed so that researchers can easily use state-of-the-art transformer models for extracting relations from clinical notes. No prior knowledge of transformers is required. We handle the whole process, from data preprocessing to training to prediction.

Dependency

The package is built on top of the Transformers library developed by HuggingFace. The packages required to run this project are specified in requirements.txt.

Background

Our training strategy is inspired by the paper https://arxiv.org/abs/1906.03158. We only support the train-dev mode, but you can still perform 5-fold cross-validation on top of it.

Available models

  • BERT
  • XLNet
  • RoBERTa
  • ALBERT
  • DeBERTa
  • Longformer

We will keep adding new models.

Usage and examples

  • data format

See the sample_data directory (train.tsv and test.tsv) for the training and test data format; a minimal loading sketch is shown after the notes below.

The sample data is a small subset of the data prepared from the 2018 UMass MADE 1.0 challenge corpus.

# data format: tsv file with 8 columns:
1. relation_type: adverse
2. sentence_1: ALLERGIES : [s1] Penicillin [e1] .
3. sentence_2: [s2] ALLERGIES [e2] : Penicillin .
4. entity_type_1: Drug
5. entity_type_2: ADE
6. entity_id_1: T1
7. entity_id_2: T2
8. file_id: 13_10

note: 
1) the entity between [s1] and [e1] is the first entity in a relation; the second entity in the relation is between [s2] and [e2]
2) even if the two entities appear in the same sentence, we still require them to be provided separately as sentence_1 and sentence_2
3) in test.tsv, you can set all labels to neg, no_relation, or any placeholder, because the labels are not used during prediction
4) we recommend evaluating the test performance in a separate process based on the predictions (see post-processing)
5) we recommend using the official evaluation scripts for evaluation to make sure the reported results are reliable
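
The following is a minimal loading sketch (illustration only, not part of the package), showing how a file in this format could be read with pandas; it assumes the TSV has no header row and uses the column names from the description above.

# sketch only: load a TSV in the format described above (assumes no header row)
import pandas as pd

COLUMNS = [
    "relation_type", "sentence_1", "sentence_2",
    "entity_type_1", "entity_type_2",
    "entity_id_1", "entity_id_2", "file_id",
]

def load_relation_tsv(path):
    # quoting=3 (csv.QUOTE_NONE) avoids mangling quotation marks inside clinical text
    return pd.read_csv(path, sep="\t", header=None, names=COLUMNS, quoting=3)

# example:
# train = load_relation_tsv("./sample_data/train.tsv")
# print(train["relation_type"].value_counts())
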
  • preprocess data (see the preprocess.ipynb notebook for more details on usage)

We do not provide a standalone script for generating the training and test data.

We provide a Jupyter notebook that preprocesses the 2018 n2c2 data as an example.

You can follow this example to generate your own dataset.

  • special tags

We use four special tags to mark the two entities in a relation.

# the default tags defined in the repo are

EN1_START = "[s1]"
EN1_END = "[e1]"
EN2_START = "[s2]"
EN2_END = "[e2]"

If you need to customize these tags, you can change them in
config.py
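
To illustrate how these tags are used to build sentence_1 and sentence_2 (a sketch only, not the package's own preprocessing code), an entity pair can be wrapped like this:

# sketch only: wrap two entity mentions with the special tags described above
EN1_START, EN1_END = "[s1]", "[e1]"
EN2_START, EN2_END = "[s2]", "[e2]"

def tag_entity(tokens, start, end, tag_start, tag_end):
    # insert tag_start before tokens[start] and tag_end after tokens[end - 1]
    return tokens[:start] + [tag_start] + tokens[start:end] + [tag_end] + tokens[end:]

tokens = "ALLERGIES : Penicillin .".split()
# first entity: "Penicillin" (token 2); second entity: "ALLERGIES" (token 0)
sentence_1 = " ".join(tag_entity(tokens, 2, 3, EN1_START, EN1_END))
sentence_2 = " ".join(tag_entity(tokens, 0, 1, EN2_START, EN2_END))
# sentence_1 -> "ALLERGIES : [s1] Penicillin [e1] ."
# sentence_2 -> "[s2] ALLERGIES [e2] : Penicillin ."
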
  • training

Please refer to the wiki page for full details of all parameter flags.

export CUDA_VISIBLE_DEVICES=1
data_dir=./sample_data
nmd=./new_model
pof=./predictions.txt
log=./log.txt

# NOTE: we have more options available, you can check our wiki for more information
python ./src/relation_extraction.py \
		--model_type bert \
		--data_format_mode 0 \
		--classification_scheme 1 \
		--pretrained_model bert-base-uncased \
		--data_dir $data_dir \
		--new_model_dir $nmd \
		--predict_output_file $pof \
		--overwrite_model_dir \
		--seed 13 \
		--max_seq_length 256 \
		--cache_data \
		--do_train \
		--do_lower_case \
		--train_batch_size 4 \
		--eval_batch_size 4 \
		--learning_rate 1e-5 \
		--num_train_epochs 3 \
		--gradient_accumulation_steps 1 \
		--do_warmup \
		--warmup_ratio 0.1 \
		--weight_decay 0 \
		--max_num_checkpoints 1 \
		--log_file $log
  • prediction
export CUDA_VISIBLE_DEVICES=1
data_dir=./sample_data
nmd=./new_model
pof=./predictions.txt
log=./log.txt

# you have to set data_dir, new_model_dir, model_type, log_file, eval_batch_size, and data_format_mode
python ./src/relation_extraction.py \
		--model_type bert \
		--data_format_mode 0 \
		--classification_scheme 1 \
		--pretrained_model bert-base-uncased \
		--data_dir $data_dir \
		--new_model_dir $nmd \
		--predict_output_file $pof \
		--overwrite_model_dir \
		--seed 13 \
		--max_seq_length 256 \
		--cache_data \
		--do_predict \
		--do_lower_case \
		--eval_batch_size 4 \
		--log_file $log
  • post-processing (we only support conversion to the brat format)
# see --help for more information
data_dir=./sample_data
pof=./predictions.txt

python src/data_processing/post_processing.py \
		--mode mul \
		--predict_result_file $pof \
		--entity_data_dir ./test_data_entity_only \
		--test_data_file ${data_dir}/test.tsv \
		--brat_result_output_dir ./brat_output

Using a JSON file for experiment configuration instead of the command line

  • to simplify using the package, we support using a JSON file for configuration
  • with a JSON config, you can define all parameters in a separate file instead of passing them on the command line
  • config_experiment_sample.json is a sample file you can follow to develop your own (an illustrative sketch is also shown below)
  • to run an experiment with a JSON config, follow run_json.sh
export CUDA_VISIBLE_DEVICES=1

python ./src/relation_extraction_json.py \
		--config_json "./config_experiment_sample.json"
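
The sketch below (illustration only; the field names are assumptions mirroring the command-line flags above, and config_experiment_sample.json remains the authoritative format) shows one way such a config file could be generated:

# sketch only: write a JSON config mirroring the command-line flags shown above
# the actual schema is defined by config_experiment_sample.json; treat these keys as assumptions
import json

config = {
    "model_type": "bert",
    "data_format_mode": 0,
    "classification_scheme": 1,
    "pretrained_model": "bert-base-uncased",
    "data_dir": "./sample_data",
    "new_model_dir": "./new_model",
    "predict_output_file": "./predictions.txt",
    "max_seq_length": 256,
    "do_train": True,
    "do_lower_case": True,
    "train_batch_size": 4,
    "eval_batch_size": 4,
    "learning_rate": 1e-5,
    "num_train_epochs": 3,
    "log_file": "./log.txt",
}

with open("./my_experiment_config.json", "w") as f:
    json.dump(config, f, indent=2)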

Baseline (baseline directory)

  • We also implemented baselines for relation extraction using traditional machine learning approaches
  • the baseline is for comparison only
  • the baseline is based on SVM
  • the extracted features may not be optimal for every dataset (they cover the most commonly used lexical and semantic features)
  • see baseline/run.sh for an example (an illustrative sketch also follows below)
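
For intuition only, the sketch below shows what a minimal SVM baseline could look like in scikit-learn; the actual baseline in the baseline directory uses richer lexical and semantic features, so this is an illustration, not the repository's implementation.

# sketch only: a bag-of-words SVM baseline for relation classification
# the repository's baseline (see baseline/run.sh) uses richer features; this is illustrative
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_svm_baseline(sentences, labels):
    # sentences: tagged sentences (e.g., the sentence_1 column); labels: relation types
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(sentences, labels)
    return clf

# example:
# clf = train_svm_baseline(train["sentence_1"], train["relation_type"])
# predictions = clf.predict(test["sentence_1"])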

Issues

Please raise an issue if you have problems.

Citation

Please cite our paper. We have a preprint at:

https://arxiv.org/abs/2107.08957

Clinical Pre-trained Transformer Models

We have a series of transformer models pre-trained on MIMIC-III. You can find them here:

Comments
  • prediction on large corpus

    The package will have issues dealing with prediction on a large corpus (e.g., thousands of notes). We need to develop a batch process to avoid OOM issues, and possibly add parallelism to speed it up.

    enhancement 
    opened by bugface 2
  • Not able to get the prediction for Test.csv

    Hi

    I am just trying to run the code to get the predictions for test.csv. I am trying with the pre-trained model at https://transformer-models.s3.amazonaws.com/mimiciii_bert_10e_128b.zip.

    While running the code I am getting an error: AttributeError: 'BertConfig' object has no attribute 'tags'

    A screenshot of my screen is attached to the original issue (image not reproduced here).

    opened by vikasgoel2000 1
  • Binary classification with BCELoss or Focal Loss

    For binary mode, we currently still use CrossEntropyLoss, but BCELoss is designed for binary classification. We need to add an option to use BCELoss or Focal Loss in binary mode.

    enhancement 
    opened by bugface 1
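
    As an illustration of the enhancement proposed above (a sketch only, not code from the repository or this issue thread), binary-mode losses could look like:

    # sketch only: possible loss options for a binary relation head
    import torch
    import torch.nn.functional as F

    def binary_loss(logits, labels, mode="bce", gamma=2.0):
        # assumed shapes: logits (batch, 1) raw scores, labels (batch,) with values 0/1
        logits = logits.view(-1)
        labels = labels.float().view(-1)
        if mode == "bce":
            return F.binary_cross_entropy_with_logits(logits, labels)
        # simple focal loss built on top of BCE
        bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
        p_t = torch.exp(-bce)  # probability assigned to the true class
        return ((1 - p_t) ** gamma * bce).mean()
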
  • Confused on usage

    The input to the prediction model is a .tsv file where the first column is the relation type. So it is unclear to me why we need the model to predict the relation type again.

    Am I misunderstanding? For predicting relations for new data, will the first column be autofilled with NonRel?

    opened by jiwonjoung 1
  • roberta question

    Thank you for providing and actively maintaining this repository. I'm trying to run RoBERTa on the sample data, but I'm encountering an error (I have tested BERT and DeBERTa, and both worked well without any error).

    Here is the code I ran

    export CUDA_VISIBLE_DEVICES=1
    data_dir=./sample_data
    nmd=./roberta_re_model
    pof=./roberta_re_predictions.txt
    log=./roberta_re_log.txt
    
    python ./src/relation_extraction.py \
    		--model_type roberta \
    		--data_format_mode 0 \
    		--classification_scheme 2 \
    		--pretrained_model roberta-base \
    		--data_dir $data_dir \
    		--new_model_dir $nmd \
    		--predict_output_file $pof \
    		--overwrite_model_dir \
    		--seed 13 \
    		--max_seq_length 256 \
    		--cache_data \
    		--do_train \
    		--do_lower_case \
                    --do_predict \
    		--train_batch_size 4 \
    		--eval_batch_size 4 \
    		--learning_rate 1e-5 \
    		--num_train_epochs 3 \
    		--gradient_accumulation_steps 1 \
    		--do_warmup \
    		--warmup_ratio 0.1 \
    		--weight_decay 0 \
    		--max_num_checkpoints 1 \
    		--log_file $log \
    

    but I ran into this error:

    2022-05-12 06:07:50 - Transformer_Relation_Extraction - ERROR - Training error:
    Traceback (most recent call last):
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/relation_extraction.py", line 59, in app
        task_runner.train()
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/task.py", line 100, in train
        batch_output = self.model(**batch_input)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/models.py", line 159, in forward
        output_hidden_states=output_hidden_states
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 849, in forward
        past_key_values_length=past_key_values_length,
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 133, in forward
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
        self.norm_type, self.scale_grad_by_freq, self.sparse)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2183, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/relation_extraction.py", line 181, in <module>
        app(args)
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/relation_extraction.py", line 63, in app
        raise RuntimeError()
    RuntimeError
    

    Any help would be much appreciated. Thanks for your project!

    opened by jeonge1 4
  • save trained model as a RE model and a core model with only transformer layers

    We need to separately save the whole RE model and a core model with only the transformer layers, so that the core model can be reused for other training tasks.

    enhancement 
    opened by bugface 0
  • ELECTRA and GPT2 support

    Hi,

    I'm wondering how to add ELECTRA and GPT2 support to this module.

    Neither ELECTRA nor GPT2 has a pooled output, unlike BERT/RoBERTa-based models.

    I noticed that in models.py the model is implemented as follows:

            outputs = self.roberta(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                position_ids=position_ids,
                head_mask=head_mask,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states
            )
    
            pooled_output = outputs[1]
            seq_output = outputs[0]
            logits = self.output2logits(pooled_output, seq_output, input_ids)
    
            return self.calc_loss(logits, outputs, labels)
    

    There is no pooled_output for ELECTRA/GPT2 sequence classification models; only seq_output is present in the outputs variable.

    How to get around this limitation and get a working version of ELECTRA/GPT2? Thank you!

    opened by Stochastic-Adventure 2
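
    A common workaround (a sketch under assumptions, not an official fix from the maintainers) is to derive a pooled vector from the sequence output yourself, e.g., the first-token hidden state or masked mean pooling, and feed it to the existing classification head:

    # sketch only: build a "pooled" vector for models without a pooler (e.g., ELECTRA/GPT2)
    # assumes seq_output has shape (batch, seq_len, hidden), as in outputs[0] above
    def pool_sequence_output(seq_output, attention_mask=None, strategy="first"):
        if strategy == "first" or attention_mask is None:
            # first-token representation (ELECTRA-style [CLS] position)
            return seq_output[:, 0, :]
        # mean pooling over non-padding tokens (often better suited to GPT2-style models)
        mask = attention_mask.unsqueeze(-1).type_as(seq_output)
        return (seq_output * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    # hypothetical use inside a forward pass:
    # seq_output = outputs[0]
    # pooled_output = pool_sequence_output(seq_output, attention_mask)
    # logits = self.output2logits(pooled_output, seq_output, input_ids)
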
Releases: v1.0.0
Owner: uf-hobi-informatics-lab (codebase for the HOBI informatics lab)