Code for the paper "Flexible Generation of Natural Language Deductions"

Overview

Flexible Generation of Natural Language Deductions

a.k.a. ParaPattern

https://arxiv.org/abs/2104.08825

Kaj Bostrom, Lucy Zhao, Swarat Chaudhuri, and Greg Durrett

This repository contains all the code needed to replicate the experiments from the paper, and additionally provides a set of tools to put together new natural language deduction operations from scratch.

In the data/ folder, you'll find all the data used to train and evaluate our models, already preprocessed and ready to go, with the exception of the MNLI dataset due to its size - if you want to replicate our MNLI-BART baseline, you'll need to download a copy of MNLI and run data/mnli/filter.py for yourself. The data folder also contains several generic conversion scripts, which you may find useful for processing operation training examples, as well as paraphrase.py, which does automatic paraphrase generation if you pass it a path to a suitable sequence-to-sequence paraphrasing model checkpoint, e.g. https://huggingface.co/tuner007/pegasus_paraphrase

In the modeling/ folder, you'll find the fine-tuning code needed to train operation models, as well as scripts to run all the evaluations described in the paper. Just make sure you're on transformers version 4.2.1, not the latest version, since several of the scripts are carefully built around bugs that have since been patched out of the library.

If you have access to multiple GPUs, you can change the --nproc_per_node argument in finetune.sh from 1 to whatever number of GPUs you want to use for training.

In the dep_search/ folder, you'll find tools to perform bulk dependency parsing using spaCy, as well as scripts to index the resulting stream of dependency trees and scrape them using dependency patterns. For reference, the templates used in the paper live in dep_search/templates/. If you want to write your own templates, a good place to start is playing around with the dependency pattern DSL using dep_search.struct_query.parse_query - if you're wondering how to express a given syntactic pattern, you can start by calling dep_search.struct_query.Head.from_spacy on a spaCy token; this will construct a syntactic pattern without any slots from that token's dependency subtree. Printing patterns this way is a great way to familiarize yourself with dependency structure if you need to brush up on that stuff (I can never remember what POS tag/arc label conventions spaCy uses so I was printing out a lot of these trees while I was developing the templates we used in the paper).

Unfortunately, I never got around to optimizing the syntactic search process all that well, so for large free-text corpora (~=100M sentences or more) it can take a day or two to do a full run of parsing and indexing using dep_search/scrape.py. I find a good way to iterate on a pattern is to start by casting a really broad net, and then narrow down your pattern on a subset of those results so that you don't have to re-index your whole original corpus each time you make a small change to a template.

Owner
Kaj Bostrom
PhD student at UT Austin Computer Science. Studying NLP (reading comprehension/language understanding in particular)
Kaj Bostrom
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
Translation to python of Chris Sims' optimization function

pycsminwel This is a locol minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

Gustavo Amarante 1 Mar 21, 2022
Open-World Entity Segmentation

Open-World Entity Segmentation Project Website Lu Qi*, Jason Kuen*, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Zhe Lin, Philip Torr, Jiaya Jia This projec

DV Lab 408 Dec 29, 2022
spaCy plugin for Transformers , Udify, ELmo, etc.

Camphr - spaCy plugin for Transformers, Udify, Elmo, etc. Camphr is a Natural Language Processing library that helps in seamless integration for a wid

342 Nov 21, 2022
Chinese Pre-Trained Language Models (CPM-LM) Version-I

CPM-Generate 为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告] 若您想使用CPM-1进行推理,我们建议使用高效推理工具BMI

Tsinghua AI 1.4k Jan 03, 2023
Search for documents in a domain through Google. The objective is to extract metadata

MetaFinder - Metadata search through Google _____ __ ___________ .__ .___ / \

Josué Encinar 85 Dec 16, 2022
MPNet: Masked and Permuted Pre-training for Language Understanding

MPNet MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, is a novel pre-tr

Microsoft 228 Nov 21, 2022
Ελληνικά νέα (Python script) / Greek News Feed (Python script)

Ελληνικά νέα (Python script) / Greek News Feed (Python script) Ελληνικά English Το 2017 είχα υλοποιήσει ένα Python script για να εμφανίζει τα τωρινά ν

Loren Kociko 1 Jun 14, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023
🌸 fastText + Bloom embeddings for compact, full-coverage vectors with spaCy

floret: fastText + Bloom embeddings for compact, full-coverage vectors with spaCy floret is an extended version of fastText that can produce word repr

Explosion 222 Dec 16, 2022
Code examples for my Write Better Python Code series on YouTube.

Write Better Python Code This repository contains the code examples used in my Write Better Python Code series published on YouTube: https:/

858 Dec 29, 2022
Data manipulation and transformation for audio signal processing, powered by PyTorch

torchaudio: an audio library for PyTorch The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the

1.9k Jan 08, 2023
Automatic privilege escalation for misconfigured capabilities, sudo and suid binaries

GTFONow Automatic privilege escalation for misconfigured capabilities, sudo and suid binaries. Features Automatically escalate privileges using miscon

101 Jan 03, 2023
A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: 古文自然语言处理模型合集, 收录互联网上的古文相关模型及资源. A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Ethan 66 Dec 26, 2022
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022
Code for PED: DETR For (Crowd) Pedestrian Detection

Code for PED: DETR For (Crowd) Pedestrian Detection

36 Sep 13, 2022
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022
DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa: Decoding-enhanced BERT with Disentangled Attention This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Dis

Microsoft 1.2k Jan 03, 2023
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022