Implementation of BI-RADS-BERT & The Advantages of Section Tokenization.

Overview

BI-RADS BERT


This implementation can also be used on other in-house radiology corpora. To label your own data, follow the format of the reports and dataframes in './mock_data'.

Conda Environment setup

This project was developed using conda environments. To build the conda environment, run the following command from the command line:

conda create --name NLPenv --file requirements.txt --channel default --channel conda-forge --channel huggingface --channel pytorch

Dataset Organization

Two datasets are needed to build the BERT embeddings and the fine-tuned field extractors: (1) a dataframe of SQL data and (2) labeled data for field extraction.

Dataframe of SQL data: see the example file './mock_data/sql_dataframe.csv'. This file was made by exporting all entries in the SQL table to a spreadsheet and saving it as a CSV file. Each line of a report must be a separate row with a SequenceNumber column, so that the lines can be combined in order into full reports. Then continue to the 'How to Run BERT Pretraining' section.

Labeled data for field extraction: see the example files in './mock_data/labeled_data'. Each txt file is a saved dict object with the fields:

example = {
    'original_report': original text report unprocessed from the exam_dataframe.csv,
    'sectionized': dict example of the report in sections, ex. {'Title': '...', 'Hx': '...', ...},
    'PID': patient identification number,
    'date': date of the exam,
    'field_name1': name of a field you wish to classify, value is the label,
    'field_name2': more labeled fields are an option,
    ...
}
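
As a minimal sketch of how such a labeled example might be consumed, the snippet below builds one dict in the form above and pulls out the text and label for a given field. The report text, the 'Modality' label value, and the helper `get_training_pair` are all hypothetical, and the repository scripts may read these files differently.

```python
# Hypothetical labeled example in the form described above (all values invented).
example = {
    'original_report': 'BILATERAL DIAGNOSTIC MAMMOGRAM. History: routine follow up.',
    'sectionized': {
        'Title': 'BILATERAL DIAGNOSTIC MAMMOGRAM.',
        'Hx': 'History: routine follow up.',
    },
    'PID': '000001',
    'date': '2020-01-01',
    'Modality': 'MG',  # hypothetical field name and label
}


def get_training_pair(example, field_name, report_section=None):
    """Return (text, label) for one field.

    With report_section=None the whole report is used; otherwise only the
    named section is used, which is the idea behind section tokenization.
    """
    label = example[field_name]
    if report_section is None:
        return example['original_report'], label
    return example['sectionized'][report_section], label


whole_report_pair = get_training_pair(example, 'Modality')
title_only_pair = get_training_pair(example, 'Modality', 'Title')
```

Restricting the input to a single section gives the classifier a shorter, more focused text for the same label.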

How to Run BERT Pretraining

Step 1: SQLtoDataFrame.py

This script can be run to convert SQL data from a hospital records system into a dataframe of all exams. Hospital records keep each individual report line as a separate SQL entry, so 'SequenceNumber' is used to assemble the lines in order.

python ./examples/SQLtoDataFrame.py \
    --input_sql ./mock_data/sql_dataframe.csv \
    --save_name /folder/to/save/exam_dataframe/save_file.csv

This will output an 'exam_dataframe.csv' file that can be used in the next step.
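
The assembly by 'SequenceNumber' described in this step can be sketched with the standard library alone. This is only an illustration, not the logic of SQLtoDataFrame.py itself: 'SequenceNumber' comes from the description above, while the 'ExamID' and 'ReportText' column names and the sample rows are assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical SQL export: one row per report line. Only 'SequenceNumber'
# comes from the README; 'ExamID' and 'ReportText' are assumed column names.
sql_csv = """ExamID,SequenceNumber,ReportText
A1,2,No suspicious findings.
A1,1,BILATERAL SCREENING MAMMOGRAM
B7,1,LEFT BREAST ULTRASOUND
"""


def assemble_reports(csv_text):
    """Group report lines by exam and join them in SequenceNumber order."""
    lines_by_exam = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines_by_exam[row['ExamID']].append(
            (int(row['SequenceNumber']), row['ReportText'])
        )
    return {
        exam_id: '\n'.join(text for _, text in sorted(lines))
        for exam_id, lines in lines_by_exam.items()
    }


reports = assemble_reports(sql_csv)
```

Converting 'SequenceNumber' to an integer before sorting matters: sorted as strings, line 10 would come before line 2.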

Step 2: TextPreProcessingBERTModel.py

This script is run to convert the exam_dataframe.csv file into pre-training text files for training and validation, with a vocabulary size. An example of the output can be found in './mock_data/pre_training_data'.

python ./examples/TextPreProcessingBERTModel.py \
    --dfolder /folder/that/contains/exam_dataframe \
    --ft_folder ./mock_data/labeled_data

Step 3: MLM_Training_transformers.py

This script runs the BERT pre-training with masked language modeling. The output directory (--output_dir) must be empty; otherwise, the parser argument --overwrite_output_dir is required to overwrite the files in the output directory.

python ./examples/MLM_Training_transformers.py \
    --train_data_file ./mock_data/pre_training_data/VocabOf39_PreTraining_training.txt \
    --output_dir /folder/to/save/bert/model \
    --do_eval \
    --eval_data_file ./mock_data/pre_training_data/PreTraining_validation.txt
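
For readers unfamiliar with masked language modeling, the idea can be sketched in a few lines. This is a simplified illustration and not what MLM_Training_transformers.py does internally: the helper `mask_tokens`, the whitespace tokenization, and the sample sentence are assumptions, and the full BERT scheme also replaces some selected tokens with random tokens or leaves them unchanged.

```python
import random

MASK = '[MASK]'


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns (masked_tokens, labels). labels holds the original token at
    masked positions and None elsewhere, so that a training loss would be
    computed only on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Hypothetical report sentence, split on whitespace for simplicity.
tokens = 'no suspicious mass or calcification is seen in either breast'.split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
```

During pre-training the model is asked to predict the original token at each masked position from its surrounding context.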

How to Run BERT Fine Tuning

--pre_trained_model is a parsed argument that can be used with all of the following scripts to load a pre-trained embedding. The default is bert-base-uncased. To use BioClinical BERT, pass --pre_trained_model emilyalsentzer/Bio_ClinicalBERT.

Step 4: BERTFineTuningSectionTokenization.py

This script will run fine tuning to train a section tokenizer with the option of using auxiliary data.

python ./examples/BERTFineTuningSectionTokenization.py \
    --dfolder ./mock_data/labeled_data \
    --sfolder /folder/to/save/section_tokenizer

Optional parser arguments:

--aux_data If used, the section tokenizer is trained with the auxiliary data.

--k_fold If used, the experiment is run with 5-fold cross-validation.
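
The 5-fold cross-validation enabled by --k_fold can be illustrated with a minimal standard-library sketch. The helper `k_fold_splits`, the shuffling seed, and the sample count are assumptions for illustration; the scripts may build their folds differently (for example, with stratification).

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 23 labeled reports.
splits = list(k_fold_splits(n_samples=23, k=5))
```

Each report appears in the test set of exactly one fold, so every labeled example contributes to the reported test performance once.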

Step 5: BERTFineTuningFieldExtractionWoutSectionization.py

This script will run fine tuning training of field extraction without section tokenization.

python ./examples/BERTFineTuningFieldExtractionWoutSectionization.py \
    --dfolder ./mock_data/labeled_data \
    --sfolder /folder/to/save/field_extractor_WoutST \
    --field_name Modality

--field_name is a required parsed argument.

Optional parser arguments:

--k_fold If used, the experiment is run with 5-fold cross-validation.

Step 6: BERTFineTuningFieldExtraction.py

This script will run fine tuning training of field extraction with section tokenization.

python ./examples/BERTFineTuningFieldExtraction.py \
    --dfolder ./mock_data/labeled_data \
    --sfolder /folder/to/save/field_extractor \
    --field_name Modality \
    --report_section Title

--field_name and --report_section are required parsed arguments.

Optional parser arguments:

--k_fold If used, the experiment is run with 5-fold cross-validation.

Additional Codes

post_ExperimentSummary.py

This script can be used to run statistical analysis on the test results produced by the BERTFineTuning scripts.

To determine the best final model, we performed statistical significance testing at the 95% confidence level. We used the Mann-Whitney U test to compare the medians of different section tokenizers, because the distributions of accuracy and G.F1 performance are skewed to the left (medians close to 100%). For the field extraction classifiers, we used the McNemar test to compare the agreement between two classifiers. The McNemar test was chosen because it has been shown to have an acceptable probability of Type I errors (detecting a difference between two classifiers when there is none). After evaluating both configurations of field extraction explored in this paper, we performed another McNemar test to assist in choosing the best technique. All statistical tests were performed with p-value adjustments for multiple comparisons using the Bonferroni correction.

Note: the input folder must contain 2 or more .xlsx files of experimental results to perform a statistical test.

python ./examples/post_ExperimentSummary.py --folder /folder/where/xlsx/files/are/located --stat_test MannWhitney

--stat_test options: 'MannWhitney' and 'McNemar'.

'MannWhitney': Mann-Whitney U test. This test was used for the Section Tokenizer experimental results comparing the results from different models. https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

'McNemar' : McNemar's test. This test was used for the Field Extraction experimental results comparing the results from different models. https://en.wikipedia.org/wiki/McNemar%27s_test
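
To make the McNemar test and the Bonferroni correction concrete, here is a small worked sketch using only the standard library. It uses the exact (binomial) form of the McNemar test; whether post_ExperimentSummary.py uses the exact or the chi-squared form is not stated here, and the helper names and discordant counts below are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar test.

    b: samples classifier A got right and classifier B got wrong.
    c: samples classifier A got wrong and classifier B got right.
    Under the null hypothesis the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Adjust p-values by multiplying by the number of comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical discordant counts for three pairwise classifier comparisons.
raw = [mcnemar_exact(1, 9), mcnemar_exact(4, 6), mcnemar_exact(0, 10)]
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
```

In this example the first comparison has a raw p-value below 0.05 but is no longer significant after adjusting for three comparisons, which is the conservative behaviour the Bonferroni correction is meant to provide.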

Contact

Please post a GitHub issue if you have any questions.
