Extracting and filtering paraphrases by bridging natural language inference and paraphrasing

Last update: Mar 09, 2022

nli2paraphrases

Source code repository accompanying the preprint "Extracting and filtering paraphrases by bridging natural language inference and paraphrasing". The idea presented in the paper is to reuse NLI datasets for paraphrasing by extracting sentence pairs for which entailment holds in both directions (bidirectional entailment).

Setup

# Make sure to run this from the root of the project (top-level directory)
$ pip3 install -r requirements.txt
$ python3 setup.py install

Project Organization

├── README.md          
├── experiments        <- Experiment scripts, through which training and extraction are done
├── models             <- Intended for storing fine-tuned models and configs
├── requirements.txt   
├── setup.py           
├── src                <- Core source code for this project
│   ├── __init__.py    
│   ├── data           <- data loading scripts
│   ├── models         <- general scripts for training/using an NLI model
│   └── visualization  <- visualization scripts for obtaining a nicer view of extracted paraphrases

Getting started

As an example, let us extract paraphrases from SNLI.

The training and extraction process is largely the same for the other datasets, with some flags added or removed. Run the scripts with the --help flag to see the specifics.

In this example, we first fine-tune a roberta-base NLI model on SNLI sequence pairs (s1, s2). We then use the fine-tuned model to predict the relation in the reverse direction (s2, s1) for the examples labeled as entailment, and keep only those pairs for which entailment holds in both directions. The extracted paraphrases are stored in the extract-argmax directory.
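
The bidirectional entailment filter described above can be sketched in a few lines. This is a minimal illustration and not the repository's implementation: the label order in LABELS, the function names, and toy_predict_proba (a token-overlap stand-in for the fine-tuned model) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed label order for the illustration; the real model may order labels differently.
LABELS = ("entailment", "neutral", "contradiction")


def argmax_label(probs: Sequence[float]) -> str:
    """Return the label with the highest predicted probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best]


def extract_paraphrases(
    examples: List[Tuple[str, str, str]],
    predict_proba: Callable[[str, str], Sequence[float]],
) -> List[Tuple[str, str]]:
    """Keep (s1, s2) pairs whose gold label is entailment and for which
    the model also predicts entailment in the reverse direction (s2 -> s1)."""
    paraphrases = []
    for s1, s2, gold_label in examples:
        if gold_label != "entailment":  # forward direction: trust the dataset label
            continue
        if argmax_label(predict_proba(s2, s1)) == "entailment":  # reverse direction
            paraphrases.append((s1, s2))
    return paraphrases


def toy_predict_proba(premise: str, hypothesis: str) -> Sequence[float]:
    """Hypothetical stand-in for a fine-tuned NLI model: predicts entailment
    only if every hypothesis token also occurs in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if hypothesis_tokens <= premise_tokens:
        return [0.90, 0.05, 0.05]
    return [0.20, 0.70, 0.10]


examples = [
    ("the kids play outside", "outside the kids play", "entailment"),
    ("a man in a red shirt plays guitar", "a man plays guitar", "entailment"),
    ("a cat sleeps", "a cat sleeps", "neutral"),
]
pairs = extract_paraphrases(examples, toy_predict_proba)
print(pairs)  # [('the kids play outside', 'outside the kids play')]
```

The second example is dropped because the shorter sentence does not entail the longer one, and the third is skipped because its dataset label is not entailment.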

This example assumes that you have access to a GPU. If you do not, you can force the scripts to use the CPU by passing --use_cpu, although the whole process will be much slower.

# Assuming the current position is in the root directory of the project
$ cd experiments/SNLI_NLI

# Training takes ~1 h 30 min on a Colab GPU (K80)
$ python3 train_model.py \
--experiment_dir="../models/SNLI_NLI/snli-roberta-base-maxlen42-2e-5" \
--pretrained_name_or_path="roberta-base" \
--model_type="roberta" \
--num_epochs=10 \
--max_seq_len=42 \
--batch_size=256 \
--learning_rate=2e-5 \
--early_stopping_rounds=5 \
--validate_every_n_examples=5000

# Extraction takes ~15 min on a Colab GPU (K80)
$ python3 extract_paraphrases.py \
--experiment_dir="extract-argmax" \
--pretrained_name_or_path="../models/SNLI_NLI/snli-roberta-base-maxlen42-2e-5" \
--model_type="roberta" \
--max_seq_len=42 \
--batch_size=1024 \
--l2r_strategy="ground_truth" \
--r2l_strategy="argmax"
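
The training command above passes --early_stopping_rounds=5 and --validate_every_n_examples=5000. One plausible reading, which is an assumption here and not confirmed by this README, is that the model is validated after every 5000 training examples and that training stops once 5 consecutive validation rounds bring no improvement. The sketch below illustrates that generic patience-based logic on a list of made-up validation scores; run_early_stopping is a hypothetical helper, not a function from the repository.

```python
from typing import Iterable, Tuple


def run_early_stopping(val_scores: Iterable[float], patience: int) -> Tuple[int, float, int]:
    """Simulate patience-based early stopping.

    val_scores: one validation score per validation round (higher is better).
    patience:   number of consecutive non-improving rounds tolerated before stopping.
    Returns (index of best round, best score, number of rounds actually run).
    """
    best_score = float("-inf")
    best_round = -1
    stale_rounds = 0  # consecutive rounds without improvement
    rounds_run = 0

    for round_idx, score in enumerate(val_scores):
        rounds_run += 1
        if score > best_score:
            best_score, best_round = score, round_idx
            stale_rounds = 0  # improvement resets the counter
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                break  # stop: no improvement for `patience` rounds in a row

    return best_round, best_score, rounds_run


# Made-up validation accuracies, for illustration only.
scores = [0.80, 0.85, 0.84, 0.86, 0.85, 0.85, 0.84, 0.83, 0.82, 0.90]
print(run_early_stopping(scores, patience=5))  # (3, 0.86, 9)
```

In this run the last score (0.90) is never seen, because five non-improving rounds follow the best score at round 3. A larger patience trades longer training for a lower risk of stopping too early.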

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner

Matej Klemen
