[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Overview

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

This repo provides the source code & data of our paper: LM-Critic: Language Models for Unsupervised Grammatical Error Correction (EMNLP 2021).

@InProceedings{yasunaga2021language,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LM-Critic: Language Models for Unsupervised Grammatical Error Correction},
  year =    {2021},  
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},  
}

Overview

We developed a new method to use a pretrained language model (e.g. GPT2) to predict if a sentence is grammatical, which we call LM-Critic. You can play with this LM-Critic as described in Section 1. below. The idea is to deem a sentence to be grammatical if the language model assigns it a higher probability than candidates in its local neighborhood.

We then use the LM-Critic to generate training data for grammatical error correction (GEC) from unlabeled raw text, using the BIFI algorithm. This allows us to train GEC models in an unsupervised way. See Section 2. below.

How LM-Critic works

LM-Critic for GEC: We use LM-Critic to learn GEC models

0. Dependencies

Run the following commands to create a conda environment (assuming CUDA10.1):

conda create -n lm-critic python=3.8
conda activate lm-critic
pip install torch==1.6.0 torchvision==0.7.0
pip install transformers==4.3.3 datasets==1.3.0 absl-py rouge-score
pip install nltk wandb editdistance spacy==3.0.5
python3 -m nltk.downloader punkt

To use the ERRANT scorer for GEC evaluation, create another conda environment separately, as follows:

conda create -n errant200 python=3.6
conda activate errant200
pip3 install errant==2.0.0
python3 -m spacy download en

1. Use LM-Critic

The LM-Critic is defined in critic/critic.py. To play with it, you can run:

CUDA_VISIBLE_DEVICES=0 python3 critic/critic.py

This will prompt you for a sentence input, and returns the judgment (Good: grammatical, Bad: ungrammatical) along with the probability score of the input sentence. For example,

Enter a sentence: I like apple.
Bad! Your sentence log(p) = -22.333
Neighbor sentence with highest log(p): I like apples. (= -19.570)

Enter a sentence: I like apples.
Good! Your sentence log(p) = -19.570

To run intrinsic evaluation of LM-Critic on a test suite, run:

CUDA_VISIBLE_DEVICES=0 python3 eval_critic/eval_critic.py

You can import the LM-Critic function (from critic.critic import gpt2_critic) for your own code as done in this script.

2. Train/run grammatical error correction models

Change the working directory to gec/. First, download all the data (GEC benchmarks and training data) by running ./download_data.sh.

Round 0

Here we train an initial fixer on synthetic GEC data. Run the commands in src/run-round0.sh.

  • This corresponds to the "Transformer" baseline in the paper Table 4.
  • The original synthetic data was dowloaded from here, and our processed data is available at data/round0__synthetic/synthetic_paired_data_9M.json

Round 1

Here we use the BIFI algorithm and unlabeled text data to train an improved fixer. Run the commands in src/run-round1.sh.

  • Specifically, we perform the following four steps: (a) apply the current fixer (from Round 0) to unlabeled sentences and keep outputs that LM-Critic judges as good; (b) train a breaker on the paired data generated in Step (a); (c) apply the trained breaker on unlabeled sentences and keep outputs that LM-Critic judges as bad; (d) train the fixer on the paired data generated so far (Step (a) + Step (c) + synthetic data from Round0).
  • This corresponds to the "+ BIFI" in the paper Table 4.
  • The original unlabeled text data was downloaded from Yahoo! Answer dataset and Wikipedia revision dataset (we take sentences pre revision). Our processed paired data used in Step (d) is available at data/round1__BIFI/BIFI_paired_data_9M.json

For evaluation, we use ERRANT and M^2Scorer. ERRANT is set up in the conda environment described above (errant200) and M^2Scorer is set up in the download script.

Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
An evaluation toolkit for voice conversion models.

Voice-conversion-evaluation An evaluation toolkit for voice conversion models. Sample test pair Generate the metadata for evaluating models. The direc

30 Aug 29, 2022
Leon is an open-source personal assistant who can live on your server.

Leon Your open-source personal assistant. Website :: Documentation :: Roadmap :: Contributing :: Story 👋 Introduction Leon is an open-source personal

Leon AI 11.7k Dec 30, 2022
AudioCLIP Extending CLIP to Image, Text and Audio

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 02, 2023
A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

LineFlow: Framework-Agnostic NLP Data Loader in Python LineFlow is a simple text dataset loader for NLP deep learning tasks. LineFlow was designed to

TofuNLP 177 Jan 04, 2023
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

spaCyOpenTapioca A spaCy wrapper of OpenTapioca for named entity linking on Wikidata. Table of contents Installation How to use Local OpenTapioca Vizu

Universitätsbibliothek Mannheim 80 Jan 03, 2023
Conversational-AI-ChatBot - Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users!

Conversational AI ChatBot Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users! In this project? Thi

Rajkumar Lakshmanamoorthy 6 Nov 30, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
Code for Findings at EMNLP 2021 paper: "Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning"

Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning This repo is for Findings at EMNLP 2021 paper: Learn Cont

INK Lab @ USC 6 Sep 02, 2022
A collection of models for image - text generation in ACM MM 2021.

Bi-directional Image and Text Generation UMT-BITG (image & text generator) Unifying Multimodal Transformer for Bi-directional Image and Text Generatio

Multimedia Research 63 Oct 30, 2022
Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

sentello Sentello is a python script that simulates the anti-evasion and anti-analysis techniques used by malware. For techniques that are difficult t

Malwation 62 Oct 02, 2022
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022
MRC approach for Aspect-based Sentiment Analysis (ABSA)

B-MRC MRC approach for Aspect-based Sentiment Analysis (ABSA) Paper: Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extracti

Phuc Phan 1 Apr 05, 2022
Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Lau 1 Dec 17, 2021
Simple python code to fix your combo list by removing any text after a separator or removing duplicate combos

Combo List Fixer A simple python code to fix your combo list by removing any text after a separator or removing duplicate combos Removing any text aft

Hamidreza Dehghan 3 Dec 05, 2022
Experiments in converting wikidata to ftm

FollowTheMoney / Wikidata mappings This repo will contain tools for converting Wikidata entities into FtM schema. Prefixes: https://www.mediawiki.org/

Friedrich Lindenberg 2 Nov 12, 2021
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Project Page] [Paper] [Video] Wenlong Huang1, Pieter Abbee

Wenlong Huang 114 Dec 29, 2022
Label data using HuggingFace's transformers and automatically get a prediction service

Label Studio for Hugging Face's Transformers Website • Docs • Twitter • Join Slack Community Transfer learning for NLP models by annotating your textu

Heartex 135 Dec 29, 2022
Code and data accompanying Natural Language Processing with PyTorch

Natural Language Processing with PyTorch Build Intelligent Language Applications Using Deep Learning By Delip Rao and Brian McMahan Welcome. This is a

Joostware 1.8k Jan 01, 2023