Resources related to our paper "CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain"

Related tags

Deep Learningclin_x
Overview

CLIN-X

(CLIN-X-ES) & (CLIN-X-EN)

This repository holds the companion code for the system reported in the paper:

"CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain" by Lukas Lange, Heike Adel, Jannik Strötgen and Dietrich Klakow.

The paper wcan be found here. The code allows the users to reproduce and extend the results reported in the paper. Please cite the above paper when reporting, reproducing or extending the results.

@inproceedings{lange-etal-2021-clin-x,
      author    = {Lukas Lange and
                   Heike Adel and
                   Jannik Str{\"{o}}tgen and
                   Dietrich Klakow},
      title     = {"CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain},
      year={2021},
      url={https://arxiv.org/abs/2112.08754}
}

In case of questions, please contact the authors as listed on the paper.

Purpose of the project

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

The CLIN-X language models

As part of this work, two XLM-R were adapted to the clinical domain The models can be found here:

  • CLIN-X ES: Spanish clinical XLM-R (link)
  • CLIN-X EN: English clinical XLM-R (link)

The CLIN-X models are open-sourced under the CC-BY 4.0 license. See the LICENSE_models file for details.

Prepare the conda environment

The code requires some python libraries to work:

conda create -n clin-x python==3.8.5
pip install flair==0.8 transformers==4.6.1 torch==1.8.1 scikit-learn==0.23.1 scipy==1.6.3 numpy==1.20.3 nltk tqdm seaborn matplotlib

Masked-Language-Modeling training

The models were trained using the huggingface MLM script that can be found here. The script was called as follows:

python -m torch.distributed.launch --nproc_per_node 8 run_mlm.py  \
--model_name_or_path xlm-roberta-large  \
--train_file data/spanisch_clinical_train.txt  \
--validation_file data/spanisch_clinical_valid.txt  \
--do_train   --do_eval  \
--output_dir models/xlm-roberta-large-spanisch-clinical-domain/  \
--fp16  \
--per_device_train_batch_size 4 --per_device_eval_batch_size 4  \
--save_strategy steps --save_steps 10000

Using the CLIN-X model with our propose model architecture (as reported in Table 7)

The following will describe our different scripts to reproduce the results. See each of the script files for detailed information on the input arguments.

Tokenize and split the data

python tokenize_files.py --input_path path/to/input/files/ --output_path /path/to/bio_files/
python create_data_splits.py --train_files /path/to/bio_files/ --method random --output_dir /path/to/split_files/

Train the model (using random data splits)

The following command trains on model on four splits (1,2,3,4) and uses the remaining split (5) for validation. For different split combinations adjust the list of --training_files and the --dev_file arguments accordingly.

python train_our_model_architecture.py   \
--data_path /path/to/split_files/  \
--train_files random_split_1.txt,random_split_2.txt,random_split_3.txt,random_split_4.txt  \
--dev_file random_split_5.txt  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models

Get ensemble predictions

For all models, get the predictions on the test set as following:

python get_test_predictions.py --name models/model_name --conll_path /path/to/bio_files/ --out_path predictions/model_name/

Then, combine different models into one ensemble. Arguments: Output path + List of model predictions

python create_ensemble_data.py predictions/ensemble1 predictions/model_name/ predictions/model_name_2/ ...

Using the CLIN-X model (as reported in Table 3)

While we recommand the usage of our model architecture, the CLIN-X models can be used in many other architectures. In the paper, we compare to the standard transformer sequnece labeling models as proposed by Devlin et al. For this, we provide the train_standard_model_architecture.py script

python train_standard_model_architecture.py  \
--data_path /path/to/bio_files/  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models

License

The CLIN-X code is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.

For a list of other open source components included in CLIN-X, see the file 3rd-party-licenses.txt.

Owner
Bosch Research
Bosch Research
Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis in JAX

SYMPAIS: Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis Overview | Installation | Documentation | Examples | Notebo

Yicheng Luo 4 Sep 13, 2022
Binary classification for arrythmia detection with ECG datasets.

HEART DISEASE AI DATATHON 2021 [Eng] / [Kor] #English This is an AI diagnosis modeling contest that uses the heart disease echocardiography and electr

HY_Kim 3 Jul 14, 2022
HyperDict - Self linked dictionary in Python

Hyper Dictionary Advanced python dictionary(hash-table), which can link it-self

8 Feb 06, 2022
S-attack library. Official implementation of two papers "Are socially-aware trajectory prediction models really socially-aware?" and "Vehicle trajectory prediction works, but not everywhere".

S-attack library: A library for evaluating trajectory prediction models This library contains two research projects to assess the trajectory predictio

VITA lab at EPFL 71 Jan 04, 2023
Transfer Reinforcement Learning for Differing Action Spaces via Q-Network Representations

Transfer-Learning-in-Reinforcement-Learning Transfer Reinforcement Learning for Differing Action Spaces via Q-Network Representations Final Report Tra

Trung Hieu Tran 4 Oct 17, 2022
A tensorflow=1.13 implementation of Deconvolutional Networks on Graph Data (NeurIPS 2021)

GDN A tensorflow=1.13 implementation of Deconvolutional Networks on Graph Data (NeurIPS 2021) Abstract In this paper, we consider an inverse problem i

4 Sep 13, 2022
IsoGCN code for ICLR2021

IsoGCN The official implementation of IsoGCN, presented in the ICLR2021 paper Isometric Transformation Invariant and Equivariant Graph Convolutional N

horiem 39 Nov 25, 2022
FaceQgen: Semi-Supervised Deep Learning for Face Image Quality Assessment

FaceQgen FaceQgen: Semi-Supervised Deep Learning for Face Image Quality Assessment This repository is based on the paper: "FaceQgen: Semi-Supervised D

Javier Hernandez-Ortega 3 Aug 04, 2022
Awesome Deep Graph Clustering is a collection of SOTA, novel deep graph clustering methods

ADGC: Awesome Deep Graph Clustering ADGC is a collection of state-of-the-art (SOTA), novel deep graph clustering methods (papers, codes and datasets).

yueliu1999 297 Dec 27, 2022
Learning from graph data using Keras

Steps to run = Download the cora dataset from this link : https://linqs.soe.ucsc.edu/data unzip the files in the folder input/cora cd code python eda

Mansar Youness 64 Nov 16, 2022
A flexible tool for creating, organizing, and sharing visualizations of live, rich data. Supports Torch and Numpy.

Visdom A flexible tool for creating, organizing, and sharing visualizations of live, rich data. Supports Python. Overview Concepts Setup Usage API To

FOSSASIA 9.4k Jan 07, 2023
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Annoy Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given quer

Spotify 10.6k Jan 04, 2023
Pyramid addon for OpenAPI3 validation of requests and responses.

Validate Pyramid views against an OpenAPI 3.0 document Peace of Mind The reason this package exists is to give you peace of mind when providing a REST

Pylons Project 79 Dec 30, 2022
Object Detection using YOLO from PyImageSearch

Object Detection using YOLO from PyImageSearch By applying object detection, you’ll not only be able to determine what is in an image, but also where

Mohamed NIANG 1 Feb 09, 2022
Single Red Blood Cell Hydrodynamic Traps Via the Generative Design

Rbc-traps-generative-design - The generative design for single red clood cell hydrodynamic traps using GEFEST framework

Natural Systems Simulation Lab 4 Jun 16, 2022
Python implementation of Project Fluent

Project Fluent This is a collection of Python packages to use the Fluent localization system. python-fluent consists of these packages: fluent.syntax

Project Fluent 155 Dec 28, 2022
YOLOv5 detection interface - PyQt5 implementation

所有代码已上传,直接clone后,运行yolo_win.py即可开启界面。 2021/9/29:加入置信度选择 界面是在ultralytics的yolov5基础上建立的,界面使用pyqt5实现,内容较简单,娱乐而已。 功能: 模型选择 本地文件选择(视频图片均可) 开关摄像头

487 Dec 27, 2022
Implementation of ResMLP, an all MLP solution to image classification, in Pytorch

ResMLP - Pytorch Implementation of ResMLP, an all MLP solution to image classification out of Facebook AI, in Pytorch Install $ pip install res-mlp-py

Phil Wang 178 Dec 02, 2022
This repository contains the code for: RerrFact model for SciVer shared task

RerrFact This repository contains the code for: RerrFact model for SciVer shared task. Setup for Inference 1. Download SciFact database Download the S

Ashish Rana 1 May 22, 2022
Dashboard for the COVID19 spread

COVID-19 Data Explorer App A streamlit Dashboard for the COVID-19 spread. The app is live at: [https://covid19.cwerner.ai]. New data is queried from G

Christian Werner 22 Sep 29, 2022