Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Last update: Jan 03, 2023

Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview

We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.
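
To make the idea of a compositional policy concrete, here is a minimal sketch. It is not the repository's implementation. It assumes a policy is an ordered list of operations, each with an application probability and a magnitude. The names `Operation`, `apply_policy`, `random_swap`, and `random_delete` are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Toy token-level edits (assumed stand-ins for real augmentation operations).
def random_swap(tokens, magnitude, rng):
    """Swap roughly magnitude * len(tokens) random pairs of tokens."""
    tokens = list(tokens)
    n_swaps = int(round(magnitude * len(tokens)))
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_delete(tokens, magnitude, rng):
    """Drop each token with probability `magnitude`, keeping at least one."""
    kept = [t for t in tokens if rng.random() >= magnitude]
    return kept or list(tokens[:1])

@dataclass
class Operation:
    fn: Callable          # the edit to apply
    prob: float           # probability of applying the edit
    magnitude: float      # strength of the edit

    def __call__(self, tokens, rng):
        if rng.random() < self.prob:
            return self.fn(tokens, self.magnitude, rng)
        return list(tokens)

def apply_policy(tokens: List[str], policy: List[Operation], rng) -> List[str]:
    """Compose the operations in order; each one sees the previous output."""
    for op in policy:
        tokens = op(tokens, rng)
    return tokens

# Hypothetical policy: the probabilities and magnitudes are made up.
policy = [Operation(random_swap, 0.5, 0.2), Operation(random_delete, 0.3, 0.1)]
rng = random.Random(0)
augmented = apply_policy("the movie was surprisingly good".split(), policy, rng)
print(" ".join(augmented))
```

In this sketch, a policy search would tune the `prob` and `magnitude` values and the choice of operations instead of fixing them by hand.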

Getting Started

Prepare environment

```
conda create -n taa python=3.6
conda activate taa
conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"
```

Modify the dataroot parameter in confs/*.yaml and the abspath parameter in script/*.sh:
- e.g., change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb in confs/bert_imdb.yaml to dataroot: path-to-your-TextAutoAugment/data/aclImdb
- change --abspath '/home/renshuhuai/TextAutoAugment' in script/imdb_lowresource.sh to --abspath 'path-to-your-TextAutoAugment'
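
If many files need the same change, the edit can be scripted. The helper below is a hypothetical sketch, not part of the repository. It assumes every config and script uses the same original prefix, /home/renshuhuai/TextAutoAugment, as in the two examples above. It defaults to a dry run so you can review the affected files first.

```python
from pathlib import Path

# Assumed original prefix, taken from the two examples in this README.
OLD_ROOT = "/home/renshuhuai/TextAutoAugment"

def rewrite_paths(repo_root, new_root, old_root=OLD_ROOT, dry_run=True):
    """Replace old_root with new_root in confs/*.yaml and script/*.sh.

    Returns the list of files that contain old_root. Files are only
    rewritten when dry_run is False.
    """
    repo = Path(repo_root)
    candidates = sorted(repo.glob("confs/*.yaml")) + sorted(repo.glob("script/*.sh"))
    changed = []
    for path in candidates:
        text = path.read_text(encoding="utf-8")
        if old_root not in text:
            continue
        changed.append(path)
        if not dry_run:
            path.write_text(text.replace(old_root, new_root), encoding="utf-8")
    return changed
```

For example, calling `rewrite_paths(".", str(Path.cwd()))` from the repository root lists the files that would change, and adding `dry_run=False` applies the replacement.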

Search for the best augmentation policy, e.g., in the low-resource regime for IMDB:
```
sh script/imdb_lowresource.sh
```
Scripts for policy search in the low-resource and class-imbalanced regimes for all datasets are provided in the script/ folder.

Train a model with the pre-searched policy in archive.py, e.g., in the low-resource regime for IMDB:
```
python train.py -c confs/bert_imdb.yaml 
```
Train a model on the full IMDB dataset:
```
python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  
```
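
The README does not spell out what the `npc` flags mean. The sketch below assumes they set the number of examples kept per class and that -1 keeps everything, which matches the full-dataset command above. The function `sample_per_class` is hypothetical and is not the repository's code.

```python
import random
from collections import defaultdict

def sample_per_class(examples, n_per_class, seed=0):
    """Keep at most n_per_class (text, label) pairs per label.

    A negative n_per_class (e.g. -1) is treated as "use the full dataset".
    """
    if n_per_class < 0:
        return list(examples)
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):      # fixed label order for reproducibility
        items = by_label[label]
        rng.shuffle(items)              # random choice within each class
        subset.extend(items[:n_per_class])
    return subset

# Toy data: 10 examples alternating between two labels.
data = [(f"review {i}", i % 2) for i in range(10)]
print(len(sample_per_class(data, 3)))   # 3 per class, 2 classes
print(len(sample_per_class(data, -1)))  # all examples
```

Under this reading, a low-resource run corresponds to a small positive value per class, and the class-imbalanced regime would use different values for different classes.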

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

This code draws on fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren and Jinchao Zhang and Lei Li and Xu Sun and Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Owner

LancoPKU
