Extracting and filtering paraphrases by bridging natural language inference and paraphrasing

Overview

nli2paraphrases

Source code repository accompanying the preprint "Extracting and filtering paraphrases by bridging natural language inference and paraphrasing". The idea presented in the paper is to reuse NLI datasets for paraphrasing by extracting sentence pairs for which entailment holds in both directions (bidirectional entailment).

Setup

# Make sure to run this from the root of the project (top-level directory)
$ pip3 install -r requirements.txt
$ python3 setup.py install

Project Organization

├── README.md          
├── experiments        <- Experiment scripts, through which training and extraction is done
├── models             <- Intended for storing fine-tuned models and configs
├── requirements.txt   
├── setup.py           
├── src                <- Core source code for this project
│   ├── __init__.py    
│   ├── data           <- data loading scripts
│   ├── models         <- general scripts for training/using a NLI model
│   └── visualization  <- visualization scripts for obtaining a nicer view of extracted paraphrases

Getting started

As an example, let us extract paraphrases from SNLI.

The training and extraction process is largely the same for other datasets, although some flags are added or removed. Run a script with the --help flag to see its specific options.

In the example, we first fine-tune a roberta-base NLI model on SNLI sequence pairs (s1, s2).
Then, we use the fine-tuned model to predict the reverse relation (s2, s1) for the entailment examples, and keep only those examples for which entailment holds in both directions. The extracted paraphrases are stored in extract-argmax.
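
The filtering step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the repository's implementation: `extract_paraphrases` and `toy_predict_label` are hypothetical names, and the toy predictor (a simple word-overlap check) is an assumed stand-in for the fine-tuned NLI model. The left-to-right relation is taken from the dataset label, and the right-to-left relation is predicted.

```python
from typing import Callable, Iterable, List, Tuple

ENTAILMENT = "entailment"


def extract_paraphrases(
    examples: Iterable[Tuple[str, str, str]],
    predict_label: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Keep pairs (s1, s2) for which entailment holds in both directions.

    examples: (s1, s2, gold_label) triples from an NLI dataset.
    predict_label: hypothetical callable mapping (premise, hypothesis) to a label.
    """
    paraphrases = []
    for s1, s2, gold_label in examples:
        # Left-to-right: trust the dataset annotation.
        if gold_label != ENTAILMENT:
            continue
        # Right-to-left: ask the model whether s2 also entails s1.
        if predict_label(s2, s1) == ENTAILMENT:
            paraphrases.append((s1, s2))
    return paraphrases


def toy_predict_label(premise: str, hypothesis: str) -> str:
    # Stand-in for a fine-tuned model (assumption for this sketch only):
    # predicts entailment when every hypothesis word occurs in the premise.
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return ENTAILMENT if hypothesis_words <= premise_words else "neutral"


examples = [
    ("two kids play outside", "outside two kids play", "entailment"),
    ("two kids play outside in the snow", "two kids play outside", "entailment"),
    ("a dog runs", "a cat sleeps", "contradiction"),
]
result = extract_paraphrases(examples, toy_predict_label)
print(result)
```

In this toy run only the first pair survives: the second pair is entailment in one direction only (the reverse direction would have to add "in the snow"), and the third is not an entailment example at all.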

This example assumes that you have access to a GPU. If not, you can force the scripts to use CPU by setting --use_cpu, although the whole process will be much slower.

# Assuming the current position is in the root directory of the project
$ cd experiments/SNLI_NLI

# Training takes ~1hr30mins on Colab GPU (K80)
$ python3 train_model.py \
--experiment_dir="../models/SNLI_NLI/snli-roberta-base-maxlen42-2e-5" \
--pretrained_name_or_path="roberta-base" \
--model_type="roberta" \
--num_epochs=10 \
--max_seq_len=42 \
--batch_size=256 \
--learning_rate=2e-5 \
--early_stopping_rounds=5 \
--validate_every_n_examples=5000

# Extraction takes ~15mins on Colab GPU (K80)
$ python3 extract_paraphrases.py \
--experiment_dir="extract-argmax" \
--pretrained_name_or_path="../models/SNLI_NLI/snli-roberta-base-maxlen42-2e-5" \
--model_type="roberta" \
--max_seq_len=42 \
--batch_size=1024 \
--l2r_strategy="ground_truth" \
--r2l_strategy="argmax"

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner
Matej Klemen
MSc student at Faculty of Computer and Information Science (University of Ljubljana). Mainly into data science.