Resources related to our paper "CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain"

CLIN-X

(CLIN-X-ES) & (CLIN-X-EN)

This repository holds the companion code for the system reported in the paper:

"CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain" by Lukas Lange, Heike Adel, Jannik Strötgen and Dietrich Klakow.

The paper can be found here. The code allows users to reproduce and extend the results reported in the paper. Please cite the above paper when reporting, reproducing, or extending the results.

@inproceedings{lange-etal-2021-clin-x,
      author    = {Lukas Lange and
                   Heike Adel and
                   Jannik Str{\"{o}}tgen and
                   Dietrich Klakow},
      title     = {{CLIN-X}: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain},
      year      = {2021},
      url       = {https://arxiv.org/abs/2112.08754}
}

In case of questions, please contact the authors as listed on the paper.

Purpose of the project

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

The CLIN-X language models

As part of this work, two XLM-R models were adapted to the clinical domain. The models can be found here:

  • CLIN-X ES: Spanish clinical XLM-R (link)
  • CLIN-X EN: English clinical XLM-R (link)

The CLIN-X models are open-sourced under the CC-BY 4.0 license. See the LICENSE_models file for details.
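
Both models can be used directly with the Hugging Face transformers library. Below is a minimal sketch for obtaining contextual embeddings; the hub identifier is an assumption, so replace it with the actual name from the links above or with a local checkpoint path.

from transformers import AutoModel, AutoTokenizer

# The hub identifier is an assumption; replace it with the actual model name
# from the links above, or with a local checkpoint path.
model_name = "llange/xlm-roberta-large-spanish-clinical"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Contextual embeddings for a Spanish clinical sentence.
inputs = tokenizer("El paciente presenta fiebre alta.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 1024)

The same pattern applies to CLIN-X EN with the English model.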

Prepare the conda environment

The code requires several Python libraries. Create a conda environment and install them as follows:

conda create -n clin-x python==3.8.5
conda activate clin-x
pip install flair==0.8 transformers==4.6.1 torch==1.8.1 scikit-learn==0.23.1 scipy==1.6.3 numpy==1.20.3 nltk tqdm seaborn matplotlib

Masked-Language-Modeling training

The models were trained using the Hugging Face MLM script that can be found here. The script was called as follows:

python -m torch.distributed.launch --nproc_per_node 8 run_mlm.py  \
--model_name_or_path xlm-roberta-large  \
--train_file data/spanisch_clinical_train.txt  \
--validation_file data/spanisch_clinical_valid.txt  \
--do_train   --do_eval  \
--output_dir models/xlm-roberta-large-spanisch-clinical-domain/  \
--fp16  \
--per_device_train_batch_size 4 --per_device_eval_batch_size 4  \
--save_strategy steps --save_steps 10000

Using the CLIN-X model with our proposed model architecture (as reported in Table 7)

The following sections describe our different scripts for reproducing the results. See each script file for detailed information on its input arguments.

Tokenize and split the data

python tokenize_files.py --input_path path/to/input/files/ --output_path /path/to/bio_files/
python create_data_splits.py --train_files /path/to/bio_files/ --method random --output_dir /path/to/split_files/
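
The tokenized files are stored in CoNLL-style BIO format: one token per line together with its label, and a blank line between sentences. The snippet below is purely illustrative; the actual label set depends on the corpus being processed.

El        O
paciente  O
presenta  O
fiebre    B-SYMPTOM
alta      I-SYMPTOM
.         O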

Train the model (using random data splits)

The following command trains a model on four splits (1, 2, 3, 4) and uses the remaining split (5) for validation. For different split combinations, adjust the --train_files and --dev_file arguments accordingly (a sketch that iterates over all five rotations follows the command).

python train_our_model_architecture.py   \
--data_path /path/to/split_files/  \
--train_files random_split_1.txt,random_split_2.txt,random_split_3.txt,random_split_4.txt  \
--dev_file random_split_5.txt  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models
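
To run the full five-fold rotation, the split combinations can also be generated programmatically. This is a minimal sketch, assuming the random_split_<i>.txt naming from above; the fold-specific model names are illustrative.

import subprocess

# Rotate over the five random splits: each split serves once as the dev set.
splits = [f"random_split_{i}.txt" for i in range(1, 6)]

for fold, dev_file in enumerate(splits, start=1):
    train_files = [s for s in splits if s != dev_file]
    subprocess.run([
        "python", "train_our_model_architecture.py",
        "--data_path", "/path/to/split_files/",
        "--train_files", ",".join(train_files),
        "--dev_file", dev_file,
        "--model", "xlm-roberta-large-spanish-clinical",
        "--name", f"model_name_fold{fold}",  # illustrative naming scheme
        "--storage_path", "models",
    ], check=True)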

Get ensemble predictions

For all models, get the predictions on the test set as follows:

python get_test_predictions.py --name models/model_name --conll_path /path/to/bio_files/ --out_path predictions/model_name/

Then, combine the different models into one ensemble. The script takes the output path followed by the list of model prediction directories:

python create_ensemble_data.py predictions/ensemble1 predictions/model_name/ predictions/model_name_2/ ...
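
The exact combination strategy is implemented in create_ensemble_data.py; as an illustration of the general idea, a simple token-level majority vote over aligned per-model BIO predictions could look like the following sketch (the labels are made up, and the real script may combine predictions differently).

from collections import Counter

def majority_vote(label_sequences):
    # label_sequences: one list of BIO tags per model, all aligned
    # to the same tokenization of the same document.
    ensemble = []
    for token_labels in zip(*label_sequences):
        ensemble.append(Counter(token_labels).most_common(1)[0][0])
    return ensemble

# Example: three models voting on a five-token sentence.
predictions = [
    ["O", "B-SYMPTOM", "I-SYMPTOM", "O", "O"],
    ["O", "B-SYMPTOM", "O",         "O", "O"],
    ["O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM"],
]
print(majority_vote(predictions))  # ['O', 'B-SYMPTOM', 'I-SYMPTOM', 'O', 'O']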

Using the CLIN-X model (as reported in Table 3)

While we recommend the usage of our model architecture, the CLIN-X models can be used in many other architectures. In the paper, we compare to the standard transformer sequence labeling models as proposed by Devlin et al. For this, we provide the train_standard_model_architecture.py script:

python train_standard_model_architecture.py  \
--data_path /path/to/bio_files/  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models
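
For reference, this standard architecture is a transformer encoder with a token classification head, as provided by transformers. The sketch below shows the general setup; the model identifier and the label set are illustrative assumptions, and the classification head is randomly initialized until fine-tuned.

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Both the model identifier and the label set are illustrative assumptions.
model_name = "llange/xlm-roberta-large-spanish-clinical"
labels = ["O", "B-SYMPTOM", "I-SYMPTOM"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),  # head weights are new and need fine-tuning
)

inputs = tokenizer("El paciente presenta fiebre alta.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)
predicted = logits.argmax(-1)[0].tolist()
print([labels[i] for i in predicted])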

License

The CLIN-X code is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.

For a list of other open source components included in CLIN-X, see the file 3rd-party-licenses.txt.
