Source code for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"

LADA

Overview

This repo contains the code for the following paper:

Jiaao Chen*, Zhenghui Wang*, Ran Tian, Zichao Yang, Diyi Yang: Local Additivity Based Data Augmentation for Semi-supervised NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).

If you would like to refer to it, please cite the paper mentioned above.
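
If you use BibTeX, an entry along the following lines should work (reconstructed from the reference above; the citation key is a placeholder and page numbers are omitted):

@inproceedings{chen2020lada,
  title     = {Local Additivity Based Data Augmentation for Semi-supervised {NER}},
  author    = {Chen, Jiaao and Wang, Zhenghui and Tian, Ran and Yang, Zichao and Yang, Diyi},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2020}
}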

Getting Started

These instructions will get you running the code of LADA.

Requirements

  • Python 3.6 or higher
  • Pytorch >= 1.4.0
  • Pytorch_transformers (also known as transformers)
  • Pandas, Numpy, Pickle, faiss, sentence-transformers
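
Apart from Pickle, which ships with the Python standard library, the dependencies can be installed with pip. The faiss package name varies by platform; the CPU build is assumed here:

pip install "torch>=1.4.0" transformers pandas numpy faiss-cpu sentence-transformers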

Code Structure

├── code/
│   ├── BERT/
│   │   ├── back_translate.ipynb --> Jupyter Notebook for back-translating the dataset
│   │   ├── bert_models.py --> Code for LADA-based BERT models
│   │   ├── eval_utils.py --> Code for evaluation
│   │   ├── knn.ipynb --> Jupyter Notebook for building the kNN index file
│   │   ├── read_data.py --> Code for data pre-processing
│   │   ├── train.py --> Code for training the BERT model
│   │   └── ...
│   ├── flair/
│   │   ├── train.py --> Code for training the flair model
│   │   ├── knn.ipynb --> Jupyter Notebook for building the kNN index file
│   │   ├── flair/ --> the flair library
│   │   │   └── ...
│   │   ├── resources/
│   │   │   ├── docs/ --> flair library docs
│   │   │   ├── taggers/ --> saved evaluation results for the flair model
│   │   │   └── tasks/
│   │   │       └── conll_03/
│   │   │           ├── sent_id_knn_749.pkl --> kNN index file
│   │   │           └── ... --> CoNLL-2003 dataset
│   │   └── ...
├── data/
│   └── conll2003/
│       ├── de.pkl --> Back-translated training dataset with German as the middle language
│       ├── labels.txt --> label index file
│       ├── sent_id_knn_700.pkl
│       └── ... --> CoNLL-2003 dataset
├── eval/
│   └── conll2003/ --> saved evaluation results for the BERT model
└── README.md

BERT models

Downloading the data

Please download the CoNLL-2003 dataset and save it under ./data/conll2003/ as train.txt, dev.txt, and test.txt.
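
Each of these files follows the standard CoNLL-2003 format: one token per line with its POS tag, syntactic chunk tag, and NER tag separated by spaces, a blank line between sentences, and -DOCSTART- lines between documents. For example:

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O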

Pre-processing the data

We utilize Fairseq to perform back translation on the training dataset. Please refer to ./code/BERT/back_translate.ipynb for details.

We have put one example of back-translated data, de.pkl, in ./data/conll2003/. You can use it directly for CoNLL-2003 or generate your own back-translated data following ./code/BERT/back_translate.ipynb.
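
If you want the gist of the notebook without running it, the following is a minimal sketch of back translation, assuming the pre-trained WMT'19 English-German models distributed through torch.hub (the notebook's exact models, sampling settings, and the on-disk structure of de.pkl may differ):

import pickle
import torch

# Assumption: fairseq's pre-trained WMT'19 single models from torch.hub
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                       tokenizer='moses', bpe='fastbpe')
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en.single_model',
                       tokenizer='moses', bpe='fastbpe')
en2de.eval()
de2en.eval()

def back_translate(sentence):
    # English -> German -> English; sampling yields more diverse paraphrases
    german = en2de.translate(sentence, sampling=True, sampling_topk=10)
    return de2en.translate(german, sampling=True, sampling_topk=10)

train_sentences = ["EU rejects German call to boycott British lamb ."]
paraphrases = [back_translate(s) for s in train_sentences]

with open('data/conll2003/de.pkl', 'wb') as f:
    pickle.dump(paraphrases, f)  # de.pkl's real structure may differ; see the notebook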

We also provide the kNN index file for the first 700 training sentences (5%) at ./data/conll2003/sent_id_knn_700.pkl. You can use it directly for CoNLL-2003 or generate your own kNN index file following ./code/BERT/knn.ipynb.
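
As a rough Python sketch of what the notebook does, sentence embeddings from sentence-transformers can be indexed with faiss to find each training sentence's nearest neighbors (the embedding model, the cosine metric, and the pickled dictionary layout are assumptions; the notebook is authoritative):

import pickle
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy stand-in for the first 700 training sentences
sentences = [
    "EU rejects German call to boycott British lamb .",
    "Peter Blackburn",
    "BRUSSELS 1996-08-22",
]

# Assumption: any sentence-transformers model works here
model = SentenceTransformer('bert-base-nli-mean-tokens')
embeddings = np.asarray(model.encode(sentences), dtype='float32')
faiss.normalize_L2(embeddings)  # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

k = 2  # the training commands below use --num-knn-k 5; this toy corpus is too small
_, neighbor_ids = index.search(embeddings, k + 1)  # first hit is the query itself
sent_id_knn = {i: ids[1:].tolist() for i, ids in enumerate(neighbor_ids)}

with open('data/conll2003/sent_id_knn_700.pkl', 'wb') as f:
    pickle.dump(sent_id_knn, f)  # the real file's layout may differ; see knn.ipynb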

Training models

This section contains instructions for training models on CoNLL-2003 using 5% of the training data.

Training BERT+Intra-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1 

Training BERT+Inter-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1  

Training BERT+Semi-Intra-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1 \
--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \
--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4  --warmup-semi --num-semi-iter 1 \
--semi-loss-method 'origin' 

Training BERT+Semi-Inter-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1 \
--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \
--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4  --warmup-semi --num-semi-iter 1 \
--semi-loss-method 'origin' 

flair models

flair is a BiLSTM-CRF sequence labeling model; we provide code for flair+Inter-LADA.

Downloading the data

Please download the CoNLL-2003 dataset and save it under ./code/flair/resources/tasks/conll_03/ as eng.train, eng.testa (dev), and eng.testb (test).

Pre-processing the data

We also provide the kNN index file for the first 749 training sentences (5%, including the -DOCSTART- separator) at ./code/flair/resources/tasks/conll_03/sent_id_knn_749.pkl. You can use it directly for CoNLL-2003 or generate your own kNN index file following ./code/flair/knn.ipynb.

Training models

This section contains instructions for training models on CoNLL-2003 using 5% of the training data.

Training flair+Inter-LADA model

CUDA_VISIBLE_DEVICES=1 python ./code/flair/train.py --use-knn-train-data --num-knn-k 5 \
--knn-mix-ratio 0.6 --train-examples 749 --mix-layer 2  --mix-option --alpha 60 --beta 1.5 \
--exp-save-name 'mix'  --mini-batch-size 64  --patience 10 --use-crf 