GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Last update: Jan 09, 2023

Overview

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Code and model from our AAAI 2021 paper

Updates

[2020/02/05] Support to run the model on own databases and queries. Check out the notebook.

Abstract

Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues of existing general-purpose language models when they are applied to text-to-SQL semantic parsers: fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.

Setup

conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

Download the dataset

pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider

Build dataset directory

mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database

Download the library

mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/

Start the Stanford library

pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd

Download the checkpoint

mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin

Alternatively, you can download them here if you don't have awscli: gap-finetuned-checkpoint and pretrained-checkpoint

curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin

Preprocess dataset

python run.py preprocess experiments/spider-configs/gap-run.jsonnet

Inference

python run.py eval experiments/spider-configs/gap-run.jsonnet

You then get the inference results and evaluation results in the paths:ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval.

Training

python run.py train experiments/spider-configs/gap-run.jsonnet

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Related tags

Overview

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Updates

Abstract

Setup

Download the dataset

Build dataset directory

Download the library

Start the Stanford library

Download the checkpoint

Preprocess dataset

Inference

Training

Security

License

Owner

Amazon Web Services - Labs

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI

Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

A Persian Image Captioning model based on Vision Encoder Decoder Models of the transformers🤗.

jiant is an NLP toolkit

Ray-based parallel data preprocessing for NLP and ML.

A minimal code for fairseq vq-wav2vec model inference.

Synthetic data for the people.

A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Unsupervised Language Modeling at scale for robust sentiment classification

NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

Translate U is capable of translating the text present in an image from one language to the other.

Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

LSTM model - IMDB review sentiment analysis

Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

Learning Spatio-Temporal Transformer for Visual Tracking

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Official PyTorch implementation of "Dual Path Learning for Domain Adaptation of Semantic Segmentation".

simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.