Code for text augmentation method leveraging large-scale language models

Overview

HyperMix

Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation.

Getting Started

Installing Packages

The main depedencies can be installed via pip install -r requirements.txt.

Usage

The main code is run through main.py. Check out --help for full list of commands.

python main.py --help

The code will automatically use the first GPU device, if detected.

A typical command to run BERT-base 10 times on the 1% subsample set of the SST-2 dataset and computing the average of all run is as follows.

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter none \
    --save-dir out

The script will create a directory named out in the current working directory and save the script log as out/run.log. It will also save any augmentations created during the experiments (if any augmentation is enabled).

To test GPT3Mix, prepare an OpenAI API key as described at the bottom of this README file, then use the following command:

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter gpt3-mix \
    --save-dir out

Managing Seeds

In the command above, the script will automatically generate seeds for sampling data and optimizing models. The seed used to generate each individual seed is called "master seed" and can be set using --master-data-seed and --master-exp-seed options. As evident from the option names, they are responsible for sampling data and optimizing a freshly initialized models respectively.

Sometimes, we need to manually set the seeds and not rely on automatically generated seeds from the master seeds. Manually seeding can be achieved via --data-seeds option. If this option is given, the master data seed will be ignored. We only support manualy data seeding for now.

OpenAI Key

Store OpenAI API Key under the current working directory as a file named openai-key. When running the main script, it will automatically detect the api key.

API keys can be provided to the script by --api-key option (not recommended) or from a file named openai-key in the current working directory.

Other Notes

At the moment we only support data augmentation leveraging OpenAI GPT-3 (GPT3Mix), but we will release an update that supports HyperCLOVA as soon as it becomes available to the public (HyperMix).

Citation

To cite our code or work, please use the following bibtex:

@inproceedings{yoo2021gpt3mix,
	title = "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation",
	author = "Yoo, Kang Min  and
	  Park, Dongju  and
	  Kang, Jaewook  and
	  Lee, Sang-Woo  and
	  Park, Woomyoung",
	booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
	month = nov,
	year = "2021",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2021.findings-emnlp.192",
	pages = "2225--2239",
}
Owner
NAVER AI
Official account of NAVER AI, Korea No.1 Industrial AI Research Group
NAVER AI
NLP tool to extract emotional phrase from tweets 🤩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022
A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

Rule-Based-Classification-in-a-Banking-Case. A CRM department in a local bank works on classify their lost customers with their past datas. So they wa

ÖMER YILDIZ 4 Mar 20, 2022
This is a project built for FALLABOUT2021 event under SRMMIC, This project deals with NLP poetry generation.

FALLABOUT-SRMMIC 21 POETRY-GENERATION HINGLISH DESCRIPTION We have developed a NLP(natural language processing) model which automatically generates a

7 Sep 28, 2021
AI and Machine Learning workflows on Anthos Bare Metal.

Hybrid and Sovereign AI on Anthos Bare Metal Table of Contents Overview Terraform as IaC Substrate ABM Cluster on GCE using Terraform TensorFlow ResNe

Google Cloud Platform 8 Nov 26, 2022
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

18 Nov 28, 2022
Yomichad - a Japanese pop-up dictionary that can display readings and English definitions of Japanese words

Yomichad is a Japanese pop-up dictionary that can display readings and English definitions of Japanese words, kanji, and optionally named entities. It is similar to yomichan, 10ten, and rikaikun in s

Jonas Belouadi 7 Nov 07, 2022
An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

Khalid Saifullah 37 Sep 05, 2022
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

fastNLP fastNLP是一款轻量级的自然语言处理(NLP)工具包,目标是快速实现NLP任务以及构建复杂模型。 fastNLP具有如下的特性: 统一的Tabular式数据容器,简化数据预处理过程; 内置多种数据集的Loader和Pipe,省去预处理代码; 各种方便的NLP工具,例如Embedd

fastNLP 2.8k Jan 01, 2023
DLO8012: Natural Language Processing & CSL804: Computational Lab - II

NATURAL-LANGUAGE-PROCESSING-AND-COMPUTATIONAL-LAB-II DLO8012: NLP & CSL804: CL-II [SEMESTER VIII] Syllabus NLP - Reference Books THE WALL MEGA SATISH

AMEY THAKUR 7 Apr 28, 2022
Fully featured implementation of Routing Transformer

Routing Transformer A fully featured implementation of Routing Transformer. The paper proposes using k-means to route similar queries / keys into the

Phil Wang 246 Jan 02, 2023
BERT Attention Analysis

BERT Attention Analysis This repository contains code for What Does BERT Look At? An Analysis of BERT's Attention. It includes code for getting attent

Kevin Clark 401 Dec 11, 2022
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
Search msDS-AllowedToActOnBehalfOfOtherIdentity

前言 现在进行RBCD的攻击手段主要是搜索mS-DS-CreatorSID,如果机器的创建者是我们可控的话,那就可以修改对应机器的msDS-AllowedToActOnBehalfOfOtherIdentity,利用工具SharpAllowedToAct-Modify 那我们索性也试试搜索所有计算机

Jumbo 26 Dec 05, 2022
Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS)

TOPSIS implementation in Python Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) CHING-LAI Hwang and Yoon introduced TOPSIS

Hamed Baziyad 8 Dec 10, 2022
Learning to Rewrite for Non-Autoregressive Neural Machine Translation

RewriteNAT This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressiv

Xinwei Geng 20 Dec 25, 2022
The code from the whylogs workshop in DataTalks.Club on 29 March 2022

whylogs Workshop The code from the whylogs workshop in DataTalks.Club on 29 March 2022 whylogs - The open source standard for data logging (Don't forg

DataTalksClub 12 Sep 05, 2022
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
auto_code_complete is a auto word-completetion program which allows you to customize it on your need

auto_code_complete v1.3 purpose and usage auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the m

RUO 2 Feb 22, 2022