Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Overview

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

arXiv

This is the code base for weakly supervised NER.

We provide a three stage framework:

  • Stage I: Domain continual pre-training;
  • Stage II: Noise-aware weakly supervised pre-training;
  • Stage III: Fine-tuning.

In this code base, we actually provide basic building blocks which allow arbitrary combination of different stages. We also provide examples scripts for reproducing our results in BioMedical NER.

See details in arXiv.

Performance Benchmark

BioMedical NER

Method (F1) BC5CDR-chem BC5CDR-disease NCBI-disease
BERT 89.99 79.92 85.87
bioBERT 92.85 84.70 89.13
PubMedBERT 93.33 85.62 87.82
Ours 94.17 90.69 92.28

See more in bio_script/README.md

Dependency

pytorch==1.6.0
transformers==3.3.1
allennlp==1.1.0
flashtool==0.0.10
ray==0.8.7

Install requirements

pip install -r requirements.txt

(If the allennlp and transformers are incompatible, install allennlp first and then update transformers. Since we only use some small functions of allennlp, it should works fine. )

File Structure:

├── bert-ner          #  Python Code for Training NER models
│   └── ...
└── bio_script        #  Shell Scripts for Training BioMedical NER models
    └── ...

Usage

See examples in bio_script

Hyperparameter Explaination

Here we explain hyperparameters used the scripts in ./bio_script.

Training Scripts:

Scripts

  • roberta_mlm_pretrain.sh
  • weak_weighted_selftrain.sh
  • finetune.sh

Hyperparameter

  • GPUID: Choose the GPU for training. It can also be specified by xxx.sh 0,1,2,3.
  • MASTER_PORT: automatically constructed (avoid conflicts) for distributed training.
  • DISTRIBUTE_GPU: use distributed training or not
  • PROJECT_ROOT: automatically detected, the root path of the project folder.
  • DATA_DIR: Directory of the training data, where it contains train.txt test.txt dev.txt labels.txt weak_train.txt (weak data) aug_train.txt (optional).
  • USE_DA: if augment training data by augmentation, i.e., combine train.txt + aug_train.txt in DATA_DIR for training.
  • BERT_MODEL: the model backbone, e.g., roberta-large. See transformers for details.
  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • LOSSFUNC: nll the normal loss function, corrected_nll noise-aware risk (i.e., add weighted log-unlikelihood regularization: wei*nll + (1-wei)*null ).
  • MAX_WEIGHT: The maximum weight of a sample in the loss.
  • MAX_LENGTH: max sentence length.
  • BATCH_SIZE: batch size per GPU.
  • NUM_EPOCHS: number of training epoches.
  • LR: learning rate.
  • WARMUP: learning rate warmup steps.
  • SAVE_STEPS: the frequency of saving models.
  • EVAL_STEPS: the frequency of testing on validation.
  • SEED: radnom seed.
  • OUTPUT_DIR: the directory for saving model and code. Some parameters will be automatically appended to the path.
    • roberta_mlm_pretrain.sh: It's better to manually check where you want to save the model.]
    • finetune.sh: It will be save in ${BERT_MODEL_PATH}/finetune_xxxx.
    • weak_weighted_selftrain.sh: It will be save in ${BERT_MODEL_PATH}/selftrain/${FBA_RULE}_xxxx (see FBA_RULE below)

There are some addition parameters need to be set for weakly supervised learning (weak_weighted_selftrain.sh).

Profiling Script

Scripts

  • profile.sh

Profiling scripts also use the same entry as the training script: bert-ner/run_ner.py but only do evaluation.

Hyperparameter Basically the same as training script.

  • PROFILE_FILE: can be train,dev,test or a specific path to a txt data. E.g., using Weak by

    PROFILE_FILE=weak_train_100.txt PROFILE_FILE=$DATA_DIR/$PROFILE_FILE

  • OUTPUT_DIR: It will be saved in OUTPUT_DIR=${BERT_MODEL_PATH}/predict/profile

Weakly Supervised Data Refinement Script

Scripts

  • profile2refinedweakdata.sh

Hyperparameter

  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • WEI_RULE: rule for generating weight for each weak sample.
    • uni: all are 1
    • avgaccu: confidence estimate for new labels generated by all_overwrite
    • avgaccu_weak_non_O_promote: confidence estimate for new labels generated by non_O_overwrite
  • PRED_RULE: rule for generating new weak labels.
    • non_O_overwrite: non-entity ('O') is overwrited by prediction
    • all_overwrite: all use prediction, i.e., self-training
    • no: use original weak labels
    • non_O_overwrite_all_overwrite_over_accu_xx: non_O_overwrite + if confidence is higher than xx all tokens use prediction as new labels

The generated data will be saved in ${BERT_MODEL_PATH}/predict/weak_${PRED_RULE}-WEI_${WEI_RULE} WEAK_RULE specified in weak_weighted_selftrain.sh is essential the name of folder weak_${PRED_RULE}-WEI_${WEI_RULE}.

More Rounds of Training, Try Different Combination

  1. To do training with weakly supervised data from any model checkpoint directory:
  • i) Set BERT_CKP appropriately;
  • ii) Create profile data, e.g., run ./bio_script/profile.sh for dev set and weak set
  • iii) Generate data with weak labels from profile data, e.g., run ./bio_script/profile2refinedweakdata.sh. You can use different rules to generate weights for each sample (WEI_RULE) and different rules to refine weak labels (PRED_RULE). See more details in ./ber-ner/profile2refinedweakdata.py
  • iv) Do training with ./bio_script/weak_weighted_selftrain.sh.
  1. To do fine-tuning with human labeled data from any model checkpoint directory:
  • i) Set BERT_CKP appropriately;
  • ii) Run ./bio_script/finetune.sh.

Reference

@inproceedings{Jiang2021NamedER,
  title={Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data},
  author={Haoming Jiang and Danqing Zhang and Tianyue Cao and Bing Yin and T. Zhao},
  booktitle={ACL/IJCNLP},
  year={2021}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Owner
Amazon
Amazon
CryptoFrog - My First Strategy for freqtrade

cryptofrog-strategies CryptoFrog - My First Strategy for freqtrade NB: (2021-04-20) You'll need the latest freqtrade develop branch otherwise you migh

Robert Davey 137 Jan 01, 2023
Source code for the Paper: CombOptNet: Fit the Right NP-Hard Problem by Learning Integer Programming Constraints}

CombOptNet: Fit the Right NP-Hard Problem by Learning Integer Programming Constraints Installation Run pipenv install (at your own risk with --skip-lo

Autonomous Learning Group 65 Dec 27, 2022
Notification Triggers for Python

Notipyer Notification triggers for Python Send async email notifications via Python. Get updates/crashlogs from your scripts with ease. Installation p

Chirag Jain 17 May 16, 2022
ByteTrack超详细教程!训练自己的数据集&&摄像头实时检测跟踪

ByteTrack超详细教程!训练自己的数据集&&摄像头实时检测跟踪

Double-zh 45 Dec 19, 2022
Recursive Bayesian Networks

Recursive Bayesian Networks This repository contains the code to reproduce the results from the NeurIPS 2021 paper Lieck R, Rohrmeier M (2021) Recursi

Robert Lieck 11 Oct 18, 2022
Official code of Team Yao at Multi-Modal-Fact-Verification-2022

Official code of Team Yao at Multi-Modal-Fact-Verification-2022 A Multi-Modal Fact Verification dataset released as part of the De-Factify workshop in

Wei-Yao Wang 11 Nov 15, 2022
Source code for NAACL 2021 paper "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference"

TR-BERT Source code and dataset for "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference". The code is based on huggaface's transformers.

THUNLP 37 Oct 30, 2022
A 3D sparse LBM solver implemented using Taichi

taichi_LBM3D Background Taichi_LBM3D is a 3D lattice Boltzmann solver with Multi-Relaxation-Time collision scheme and sparse storage structure impleme

Jianhui Yang 121 Jan 06, 2023
Official repo for our 3DV 2021 paper "Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements".

Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements Yu Rong, Jingbo Wang, Ziwei Liu, Chen Change Loy Paper. Pr

Yu Rong 41 Dec 13, 2022
CrossNorm and SelfNorm for Generalization under Distribution Shifts (ICCV 2021)

CrossNorm (CN) and SelfNorm (SN) (Accepted at ICCV 2021) This is the official PyTorch implementation of our CNSN paper, in which we propose CrossNorm

100 Dec 28, 2022
SOLO and SOLOv2 for instance segmentation, ECCV 2020 & NeurIPS 2020.

SOLO: Segmenting Objects by Locations This project hosts the code for implementing the SOLO algorithms for instance segmentation. SOLO: Segmenting Obj

Xinlong Wang 1.5k Dec 31, 2022
Dataset and Code for the paper "DepthTrack: Unveiling the Power of RGBD Tracking" (ICCV2021), and "Depth-only Object Tracking" (BMVC2021)

DeT and DOT Code and datasets for "DepthTrack: Unveiling the Power of RGBD Tracking" (ICCV2021) "Depth-only Object Tracking" (BMVC2021) @InProceedings

Yan Song 55 Dec 15, 2022
🥇 LG-AI-Challenge 2022 1위 솔루션 입니다.

LG-AI-Challenge-for-Plant-Classification Dacon에서 진행된 농업 환경 변화에 따른 작물 병해 진단 AI 경진대회 에 대한 코드입니다. (colab directory에 코드가 잘 정리 되어있습니다.) Requirements python

siwooyong 10 Jun 30, 2022
[NeurIPS 2020] Blind Video Temporal Consistency via Deep Video Prior

pytorch-deep-video-prior (DVP) Official PyTorch implementation for NeurIPS 2020 paper: Blind Video Temporal Consistency via Deep Video Prior TensorFlo

Yazhou XING 90 Oct 19, 2022
Locally Most Powerful Bayesian Test for Out-of-Distribution Detection using Deep Generative Models

LMPBT Supplementary code for the Paper entitled ``Locally Most Powerful Bayesian Test for Out-of-Distribution Detection using Deep Generative Models"

1 Sep 29, 2022
PPO Lagrangian in JAX

PPO Lagrangian in JAX This repository implements PPO in JAX. Implementation is tested on the safety-gym benchmark. Usage Install dependencies using th

Karush Suri 2 Sep 14, 2022
JAX code for the paper "Control-Oriented Model-Based Reinforcement Learning with Implicit Differentiation"

Optimal Model Design for Reinforcement Learning This repository contains JAX code for the paper Control-Oriented Model-Based Reinforcement Learning wi

Evgenii Nikishin 43 Sep 28, 2022
Lane follower: Lane-detector (OpenCV) + Object-detector (YOLO5) + CAN-bus

Lane Follower This code is for the lane follower, including perception and control, as shown below. Environment Hardware Industrial Camera Intel-NUC(1

Siqi Fan 3 Jul 07, 2022
Pathdreamer: A World Model for Indoor Navigation

Pathdreamer: A World Model for Indoor Navigation This repository hosts the open source code for Pathdreamer, to be presented at ICCV 2021. Paper | Pro

Google Research 122 Jan 04, 2023
Code for ACL'2021 paper WARP 🌀 Word-level Adversarial ReProgramming

Code for ACL'2021 paper WARP 🌀 Word-level Adversarial ReProgramming. Outperforming `GPT-3` on SuperGLUE Few-Shot text classification.

YerevaNN 75 Nov 06, 2022