Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022

Overview

cim-misspelling

PyTorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.

This model (CIM) corrects misspellings by combining a character-based language model with a corruption model based on edit distance. The model is pre-trained and evaluated on a clinical corpus and clinical misspelling datasets. Please see the paper for a more detailed explanation.
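
To make the combination of the two components concrete, here is a minimal sketch of noisy-channel style scoring: each candidate word is scored by a language-model term plus a corruption term that penalises edit distance from the typo. This is not the repository's implementation. The function names, the fixed per-edit penalty, and the toy probabilities are all assumptions made for illustration; the actual model uses a neural character LM and the corruption model described in the paper.

```python
import math
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def corruption_log_prob(typo: str, candidate: str, penalty: float = 2.0) -> float:
    """Hypothetical corruption model: every edit costs a fixed log-penalty."""
    return -penalty * edit_distance(typo, candidate)


def correct(typo: str, lm_log_probs: Dict[str, float]) -> str:
    """Pick the candidate maximising log P(candidate | context) + log P(typo | candidate)."""
    return max(lm_log_probs,
               key=lambda w: lm_log_probs[w] + corruption_log_prob(typo, w))


# Toy context "patient denies chest ___"; these values are made-up stand-ins for LM output.
toy_lm = {
    "pain": math.log(0.6),
    "pressure": math.log(0.3),
    "pan": math.log(0.05),
    "paint": math.log(0.01),
}
print(correct("paim", toy_lm))  # prints "pain"
```

The context term keeps an implausible but close word ("pan") from winning, while the edit-distance term keeps a plausible but distant word ("pressure") from winning.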

Requirements

How to Run

Clone the repo

$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git

Data preparation

  1. Download the MIMIC-III dataset from PhysioNet, in particular NOTEEVENTS.csv, and put it under data/mimic3.

  2. Download LRWD and prevariants of the SPECIALIST Lexicon from the LSG website (2018AB version) and put them under data/umls.

  3. Download the English dictionary english.txt from here (commit 7cb484d) and put it under data/english_words.

  4. Run scripts/build_vocab_corpus.ipynb to build the dictionary and split the MIMIC-III notes into files.

  5. Run the Jupyter notebook for the dataset that you want to download/pre-process:

    • MIMIC-III misspelling dataset, or ClinSpell (Fivez et al., 2017): scripts/preprocess_clinspell.ipynb
    • CSpell dataset (Lu et al., 2019): scripts/preprocess_cspell.ipynb
    • Synthetic misspelling dataset from the MIMIC-III: scripts/synthetic_dataset.ipynb

  6. Download the BlueBERT model from here and put it under bert/ncbi_bert_{base|large}.

    • For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III"
    • For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III"
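
Before running the notebooks, it can help to confirm that the downloads landed where the steps above expect them. The following is a small sketch that only checks paths named in this README; the helper name `missing_paths` is hypothetical and not part of the repository. The SPECIALIST Lexicon files are checked at directory level only, since their exact file names are not given here.

```python
from pathlib import Path

# Paths taken from the steps above. Files without a documented name
# are checked as directories only.
EXPECTED = [
    "data/mimic3/NOTEEVENTS.csv",
    "data/umls",
    "data/english_words/english.txt",
]


def missing_paths(root: str, model_size: str = "base") -> list:
    """Return the expected paths that do not exist under `root`.

    `model_size` is "base" or "large" and selects bert/ncbi_bert_{base|large}.
    """
    expected = EXPECTED + [f"bert/ncbi_bert_{model_size}"]
    return [p for p in expected if not (Path(root) / p).exists()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when everything is in place.
    for path in missing_paths("."):
        print(f"missing: {path}")
```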

Pre-training the char-based LM on MIMIC-III

Please run pretrain_cim_base.sh (CIM-Base) or pretrain_cim_large.sh (CIM-Large) to pre-train the character language model of CIM. Pre-training periodically evaluates the LM by correcting synthetic misspellings generated from the MIMIC-III data. You may need 2 to 4 GPUs (XXGB+ of GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with a batch size of 256. There are several options you may want to configure:

  • num_gpus: number of GPUs
  • batch_size: batch size
  • training_step: total number of steps to train
  • init_ckpt/init_step: the checkpoint file and step from which to resume pre-training
  • num_beams: beam search width for evaluation
  • mimic_csv_dir: directory of the MIMIC-III csv splits
  • bert_dir: directory of the BlueBERT files
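
The periodic evaluation mentioned above relies on synthetic misspellings. As a rough illustration of what such data can look like, the sketch below corrupts words with random character edits. It is an assumption-laden toy, not the procedure in scripts/synthetic_dataset.ipynb: the function name `corrupt`, the uniform choice among the four edit types, and the lowercase-only alphabet are all hypothetical.

```python
import random
import string


def corrupt(word: str, rng: random.Random, n_edits: int = 1) -> str:
    """Apply up to `n_edits` random character edits to `word`.

    Edit types: insert, delete, substitute, transpose adjacent characters.
    An edit can be a no-op (for example, substituting a letter with itself,
    or choosing delete on a one-letter word), so the output may equal the input.
    """
    chars = list(word)
    for _ in range(n_edits):
        op = rng.choice(["insert", "delete", "substitute", "transpose"])
        if op == "insert":
            chars.insert(rng.randrange(len(chars) + 1),
                         rng.choice(string.ascii_lowercase))
        elif op == "delete" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        elif op == "substitute" and chars:
            chars[rng.randrange(len(chars))] = rng.choice(string.ascii_lowercase)
        elif op == "transpose" and len(chars) > 1:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(0)  # fixed seed so repeated runs give the same corruptions
for word in ["hypertension", "metoprolol", "pneumonia"]:
    print(word, "->", corrupt(word, rng))
```

Pairing each corrupted word with its original gives (misspelling, correction) examples against which a corrector can be scored.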

You can also download the pre-trained LMs and put them under model/.

Misspelling Correction with CIM

Please specify the dataset directory and the file to evaluate in the evaluation script (eval_cim_base.sh or eval_cim_large.sh), then run the script. You may want to set init_step to choose which checkpoint to load.

Cite this work

@InProceedings{juyong2022context,
  title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
  author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages = {234--247},
  year = {2022},
  volume = {174},
  series = {Proceedings of Machine Learning Research},
  month = {07--08 Apr},
  publisher = {PMLR}
}