Overview

Randomised controlled trial abstract result tabulator

RCT-ART is an NLP pipeline built with spaCy for converting clinical trial result sentences into tables by jointly extracting intervention, outcome and outcome measure entities and their relations. The system is currently constrained to result sentences that report specific measures of an outcome for a specific intervention, and it does not extract comparative relationships (e.g. a relative decrease between the study intervention and placebo).
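To make the tabulation idea concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: it assumes that entities arrive as (text, label) pairs and that relations link each outcome measure to the intervention or outcome it reports on. The label names INTV, OC and MEAS, the function name tabulate and the example sentence are illustrative assumptions.

```python
from collections import defaultdict


def tabulate(entities, relations):
    """Build table rows from entities and measure-to-entity relations.

    entities: list of (text, label) tuples, where label is one of the
        assumed names "INTV", "OC" or "MEAS".
    relations: list of (measure_index, target_index) pairs linking a MEAS
        entity to the INTV or OC entity it reports on.
    Returns one dict per measure with keys intervention, outcome, measure.
    """
    linked = defaultdict(dict)
    for measure_index, target_index in relations:
        text, label = entities[target_index]
        if label == "INTV":
            linked[measure_index]["intervention"] = text
        elif label == "OC":
            linked[measure_index]["outcome"] = text

    rows = []
    for measure_index in sorted(linked):
        row = {
            "intervention": "",
            "outcome": "",
            "measure": entities[measure_index][0],
        }
        row.update(linked[measure_index])
        rows.append(row)
    return rows


# Hypothetical result sentence:
# "Mean IOP was 15.2 mmHg with timolol and 17.8 mmHg with placebo."
entities = [
    ("Mean IOP", "OC"),
    ("15.2 mmHg", "MEAS"),
    ("timolol", "INTV"),
    ("17.8 mmHg", "MEAS"),
    ("placebo", "INTV"),
]
relations = [(1, 0), (1, 2), (3, 0), (3, 4)]

table = tabulate(entities, relations)
for row in table:
    print(row)
```

Each measure becomes one row, so a sentence reporting the same outcome for two interventions yields two rows. A measure with no predicted relation produces no row in this sketch.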

This repository contains custom pipes and models developed, trained and run using the spaCy library. These are defined and initiated through configs and custom scripts.

In addition, we include every stage of our datasets, from the raw format through gold-standard annotations and pre-processed spaCy docs to the output tables of the system, as well as the evaluation results of the system for its different NLP tasks across each pre-trained model.

Running the system from Python

After cloning this repository and pip installing its dependencies from requirements.txt, the system can be run in two steps:

1. Download and extract the trained models

In the primary study of RCT-ART, we explored a number of BERT-based models in the development of the system. Here, we make available the BioBERT-based named entity recognition (NER) and relation extraction (RE) models:

Download models from here.

The train_models folder of the compressed file should be extracted into the root of the cloned directory so that the system scripts can access the models.

2a. Demo the system NLP tasks

Once the model folder has been extracted, a Streamlit demo of the system's NER, RE and tabulation tasks can be run locally in your browser with the following command:

streamlit run scripts/demo.py

2b. Process multiple RCT result sentences

Alternatively, multiple result sentences can be processed by the system using tabulate.py in the scripts directory. Input sentences should be in the spaCy Doc format; the sentences from the study are available in this format within datasets/preprocessed.

Training new models for the system

The NER and RE models employed by RCT-ART were both trained using spaCy config files, where we defined their architectures and training hyper-parameters. These are included in the configs directory, with a config for each model type and for each of the BERT-based language representation models we explored in the development of the system. The simplest way to initiate spaCy model training is with the library's inbuilt commands (https://spacy.io/usage/training), passing in the paths of the config file, training set and development set. Below are the commands we used to train the models made available with this repository:

spaCy cmd for training BioBERT-based NER model on all-domains dataset

python -m spacy train configs/ner_biobert.cfg --output ../trained_models/biobert/ner/all_domains --paths.train ../datasets/preprocessed/all_domains/results_only/train.spacy --paths.dev ../datasets/preprocessed/all_domains/results_only/dev.spacy -c ../scripts/custom_functions.py --gpu-id 0

spaCy cmd for training BioBERT-based RE model on all-domains dataset

python -m spacy train configs/rel_biobert.cfg --output ../trained_models/biobert/rel/all_domains --paths.train ../datasets/preprocessed/all_domains/results_only/train.spacy --paths.dev ../datasets/preprocessed/all_domains/results_only/dev.spacy -c ../scripts/custom_functions.py --gpu-id 0

Repository contents breakdown

The following is a brief description of the assets available in this repository.

configs

Includes the spaCy config files for training the NER and RE models of the RCT-ART system. These files define the model architectures, including the BERT-based language representations. Three language representation models were tested for each model in the main study of this system: BioBERT, SciBERT and RoBERTa.

datasets

Includes all stages of the data used to train and test the RCT-ART models from raw to split gold-standard files in spaCy doc format.

Before filtering and result sentence extraction, abstracts were sourced from the EBM-NLP corpus and the annotated corpus from the Trenta et al. study, which explored automated information extraction from RCTs, and was a key reference for our study.

evaluation_results

Output txt files from the evaluate.py script, giving precision, recall and F1 scores for each of the system tasks across the various dataset cuts.
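For readers unfamiliar with these metrics, the sketch below shows how precision, recall and F1 can be computed from sets of predicted and gold-standard items. It is an illustration under stated assumptions, not the code in evaluate.py: it assumes exact matching of (text, label) pairs, and the function name and example data are hypothetical. The repository's script may use different matching criteria.

```python
def precision_recall_f1(predicted, gold):
    """Compute exact-match precision, recall and F1 for two collections."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted entities for one sentence.
gold = {("timolol", "INTV"), ("Mean IOP", "OC"), ("15.2 mmHg", "MEAS")}
predicted = {("timolol", "INTV"), ("15.2 mmHg", "MEAS"), ("placebo", "OC")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The same function applies to relations if each relation is represented as a hashable tuple, for example (head text, tail text, relation label).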

output_tables

Output csv files from the tabulate.py script, containing the predicted tables output by our system for each test result sentence.

scripts

Below is a contents list of the repository scripts with brief descriptions. Full descriptions can be found at the head of each script.

custom_functions.py -- helper functions supporting key modules of the system.

data_collection.py -- classes and functions for filtering the EBM-NLP corpus and result sentence preprocessing.

demo.py -- a browser-based demo of the RCT-ART system developed with spaCy and Streamlit (see above).

entity_ruler.py -- a script for rules-based entity recognition. Unused in final system, but made available for further development.

evaluate.py -- a set of functions for evaluating the system across the NLP tasks: NER, RE, joint NER + RE and tabulation.

preprocessing.py -- a set of functions for further data preprocessing after data collection and for splitting data into train, test and dev sets.

rel_model.py -- defines the relation extraction model.

rel_pipe.py -- integrates the relation extraction model as a spaCy pipeline component.

tabulate.py -- runs the full system by loading the NER and RE models and passing their outputs through a tabulation function. Can be used on batches of RCT sentences to output batches of CSV files.

train_multiple_models.py -- iterates through spaCy train commands with different input parameters allowing batches of models to be trained.
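As an illustration of how batched training could be driven from Python, the sketch below builds spaCy train commands with the same flags and paths as the two commands listed in the training section. It is not the content of train_multiple_models.py. The function names are hypothetical, and the config naming pattern is inferred only from ner_biobert.cfg and rel_biobert.cfg; config file names for other language models are assumptions to check against the configs directory.

```python
import subprocess
import sys


def build_train_command(task, model, domain="all_domains", gpu_id=0):
    """Return the argument list for one `spacy train` run.

    task: "ner" or "rel"; model: e.g. "biobert". The config file name
    pattern <task>_<model>.cfg is an assumption based on the two
    BioBERT commands in this README.
    """
    data_dir = f"../datasets/preprocessed/{domain}/results_only"
    return [
        sys.executable, "-m", "spacy", "train",
        f"configs/{task}_{model}.cfg",
        "--output", f"../trained_models/{model}/{task}/{domain}",
        "--paths.train", f"{data_dir}/train.spacy",
        "--paths.dev", f"{data_dir}/dev.spacy",
        "-c", "../scripts/custom_functions.py",
        "--gpu-id", str(gpu_id),
    ]


def train_all(tasks=("ner", "rel"), models=("biobert",), dry_run=True):
    """Build one command per task and model, then print or run each."""
    commands = [build_train_command(t, m) for m in models for t in tasks]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the batch at the first failed training run.
            subprocess.run(command, check=True)
    return commands


commands = train_all()  # dry run: prints the commands without training
```

Calling train_all(dry_run=False) would launch the runs one after another, so it is worth inspecting the dry-run output first.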

Common issues

The transformer models of this system need a GPU with suitable video RAM -- in the primary study, they were trained and run on a GeForce RTX 3080 10GB.

There can be issues with the transformer library dependencies, CUDA and PyTorch. If an issue occurs, ensure CUDA 11.1 is installed on your system, and try reinstalling PyTorch with the following command:

pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio===0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
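Before or after reinstalling, it can help to confirm what PyTorch actually sees. The following is a small diagnostic sketch, not part of the repository; the function name describe_torch_environment is hypothetical. It reports the installed PyTorch version, the CUDA version the build targets, and the name and memory of the first GPU if one is visible.

```python
def describe_torch_environment():
    """Collect basic facts about the local PyTorch and CUDA setup."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built for; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["gpu_name"] = props.name
        info["gpu_memory_gb"] = round(props.total_memory / 1024**3, 1)
    return info


report = describe_torch_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

If cuda_available is False while a GPU is installed, a mismatch between the CUDA build of the wheel and the driver on the machine is a likely cause.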

References

  1. The relation extraction component was adapted from the following spaCy project tutorial.

  2. The EBM-NLP corpus is accessible from here and its publication can be found here.

  3. The glaucoma corpus can be found in the Trenta et al. study.
