Weakly Supervised Text-to-SQL Parsing through Question Decomposition

Overview

Weakly Supervised Text-to-SQL Parsing through Question Decomposition

The official repository for the paper "Weakly Supervised Text-to-SQL Parsing through Question Decomposition" by Tomer Wolfson, Daniel Deutch and Jonathan Berant, accepted to the Finings of NAACL 2022.

This repository contains the code and data used in our paper:

  1. Code for automatically synthesizing SQL queries from question decompositions + answers
  2. Code for the models used in our paper mapping text-to-SQL and text-to-QDMR

Setup 🙌🏼

  1. Create the virtual environment
conda create -n [ENV_NAME] python=3.8
conda activate [ENV_NAME]
  1. Clone the repository
git clone https://github.com/tomerwolgithub/question-decomposition-to-sql
cd question-decomposition-to-sql
  1. Install the relevant requirements
pip install -r requirements.txt 
python -m spacy download en_core_web_lg
  1. To train the QDMR parser model please setup a separate environment (due to different Hugginface versions):
conda create -n qdmr_parser_env python=3.8
conda activate qdmr_parser_env
pip install -r requirements_qdmr_parser.txt 
python -m spacy download en_core_web_lg

Download Resources 🗝️

1. QDMR Parsing Datasets:

2. Text-to-SQL Datasets:

3. Databases (schema & contents):

Convert the MySQL databases of Academic, IMDB, Yelp and GeoQuery to sqlite format using the tool of Jean-Luc Lacroix:

./mysql2sqlite academic_mysql.sql | sqlite3 academic_sqlite.db

Data Generation 🔨

Our SQL synthesis is given examples of <QDMR, database, answer> and automatically generates a SQL that executes to the correct answer. The QDMR decompositions are either manually annotated or automatically predicted by a trained QDMR parser.

Begin by copying all relevant sqlite databases to the data_generation directory.

mkdir data_generation/data
mkdir data_generation/data/spider_databases # copy Spider databases here
mkdir data_generation/data/other_databases # copy Academic, IMDB, Yelp and Geo databases here
  1. The SQL synthesis expects a formatted csv file, see example. Note that the SQL query in these files is only used to compute the answer.
  2. This may take several hours, as multiple candidate SQL are being executed on their respective database.
  3. To synthesize SQL from the <QDMR, database, answer> examples run:
python data_generation/main.py \
--input_file input_qdmr_examples.csv \
--output_file qdmr_grounded_sql.csv \
--json_steps True

Synthesized Data

The SQL synthesized using QDMR + answer supervision is available for each dataset in the data/sql_synthesis_results/ directory.

  • data/sql_synthesis_results/gold_qdmr_supervision: contains SQL synthesized using gold QDMRs that are manually annotated
  • data/sql_synthesis_results/predicted_qdmr_supervision: contains SQL synthesized using QDMRs predicted by a trained parser

Models 🗂️

QDMR Parser

The QDMR parser is a T5-large sequence-to-sequence model that is finetuned to map questions to their QDMR. The model expects as input two csv files as its train and dev sets. Use the files from the downloaded Break dataset to train the parser. Make sure that you are in the relevant python environment (requirements_qdmr_parser.txt).

To train the QDMR parser configure the following parameters in train.py:

  • data_dir: the path to the directory containing the NL to QDMR datasets
  • training_set_file: name of the train set csv (e.g. break_train.csv)
  • dev_set_file: name of the dev set csv (e.g. break_dev.csv)
  • output_dir: the directory to store the trained model

After configuration, train the model as follows:

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python src/qdmr_parser/train.py

To test a trained model and store its predictions, configure the following parameters in test.py:

  • checkpoint_path: path to the trained QDMR parser model to be evaluated
  • dev_set_file: name of the dev set csv to generate predictions for
  • predictions_output_file: the output file to store the parser's generated predictions

And run the following command:

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python src/qdmr_parser/test.py

Text-to-SQL

The text-to-SQL models are T5-large sequence-to-sequence models, finetuned to map questions to executable SQL queries. We compare the models trained on gold SQL queries, annotated by experts, to our synthesized SQL from QDMR and answer supervision.

1. Setup directory

Setup the data for the text-to-SQL experiments as follows:

data
├── tables.json			# Spider tables.json
└── databases
│   └── academic			
│       └── academic.sqlite	# Sqlite version of the populated Academic database (see downloads)
│   └── geo			
│       └── geo.sqlite		# Sqlite version of the populated Geo database (see downloads)
│   └── imdb			
│       └── imdb.sqlite		# Sqlite version of the populated IMDB database (see downloads)
│   └── spider_databases 	# Spider databases directory
│       └── activity_1
│           └── activity_1.sqlite
│       └── ...   
│   └── yelp			
│       └── yelp.sqlite		# Sqlite version of the populated Yelp database (see downloads)
└── queries
    └── geo	# See experiments data
        ├── geo_qdmr_train.json
	└── geo_qdmr_predicted_train.json
	└── geo_gold_train.json
	└── geo_gold_dev.json
	└── geo_gold_test.json
	└── geo_gold_train.sql
	└── geo_gold_dev.sql
	└── geo_gold_test.sql
    └── spider
        ├── spider_qdmr_train.json		# See experiments data
	└── spider_qdmr_predicted_train.json 	# See experiments data
	└── spider_gold_train.json 	# Spider training set
	└── spider_gold_dev.json 	# Spider dev set
	└── spider_gold_train.sql 	# Spider training set SQL queries
	└── spider_gold_dev.sql 	# Spider dev set SQL queries

Database files are described in the downloads section. See the experiments section for the exact train and test files.

2. Train model

To train the text-to-SQL model configure its following parameters in train.py:

  • dataset: either spider or geo
  • target_encoding: sql for gold sql and either qdmr_formula or qdmr_sql for the QDMR experiments
  • data_dir: path to the directory containing the experiments data
  • output_dir: the directory to store the trained model
  • db_dir: the directory to store the trained model
  • training_set_file: training set file in the data directory e.g. spider/spider_gold_train.json
  • dev_set_file: dev set file in the data directory e.g. spider/spider_gold_dev.json
  • dev_set_sql: dev set SQL queries in the data directory e.g. spider/spider_gold_dev.sql

Following configuration, to train the model run:

CUDA_VISIBLE_DEVICES=0 python train.py 

3. Test model

To test the text-to-SQL model first configure the relevant parameters and checkpoint_path in test.py. Following the configuration, generate the trained model predictions using:

CUDA_VISIBLE_DEVICES=0 python test.py 

Experiments ⚗️

Data

Gold SQL:

For the Spider experiments we use its original train and dev json and sql files. For Geo880, Academic, IMDB and Yelp we format the original datasets in json files available here.

QDMR Synthesized SQL:

The QDMR text-to-SQL models are not trained directly on the synthesized SQL. Instead, we train on an encoded QDMR representation with its phrase-DB linking (from the SQL synthesis). This representation is automatically mapped to SQL to evaluate the models execution accuracy. To generate these grounded QDMRs we use the output of the data generation phase. The function encoded_grounded_qdmr in src/data_generation/write_encoding.py recieves the json file containing the synthesized SQL examples. It then encodes them as lisp style formulas of QDMR steps and their relevant phrase-DB linking.

For convenience, you can download the encoded QDMR training sets used in our experiments here. These include:

  • qdmr_ground_enc_spider_train.json: 5,349 examples, synthesized using gold QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_few_shot: 5,075 examples, synthesized examples using 700 gold QDMRs, predicted QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_30_db.json: 1,129 examples, synthesized using predicted QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_40_db.json: 1,440 examples, synthesized using predicted QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_40_db_V2.json: 1,552 examples, synthesized using predicted QDMR + answer supervision
  • qdmr_ground_enc_geo880_train.json: 454 examples, synthesized using gold QDMR + answer supervision
  • qdmr_ground_enc_predicted_geo_train_zero_shot.json: 432 examples, synthesized using predicted QDMR + answer supervision

Configurations

The configurations for training the text-to-SQL models on Spider. Other parameters are fixed in train.py.

SQL Gold (Spider):

{'dataset': 'spider',
'target_encoding': 'sql',
'db_dir': 'databases/spider_databases',
'training_set_file': 'queries/spider/spider_gold_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}

QDMR Gold (Spider):

{'dataset': 'spider',
'target_encoding': 'qdmr_formula',
'db_dir': 'databases/spider_databases',
'training_set_file': 'queries/spider/spider_qdmr_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}

SQL Predicted (Spider):

{'dataset': 'spider',
'target_encoding': 'qdmr_formula',
'db_dir': `databases/spider_databases',
'training_set_file': 'queries/spider/spider_qdmr_predicted_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}

The configurations for training the text-to-SQL models on Geo880.

SQL Gold (Geo):

{'dataset': 'geo',
'target_encoding': 'sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_gold_train.json',
'dev_set_file': 'queries/spider/geo_gold_dev.json',
'dev_set_sql': 'queries/spider/geo_gold_dev.sql'}

QDMR Gold (Geo):

{'dataset': 'geo',
'target_encoding': 'qdmr_sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_qdmr_train.json',
'dev_set_file': 'queries/spider/geo_gold_dev.json',
'dev_set_sql': 'queries/spider/geo_gold_dev.sql'}

QDMR Predicted (Geo):

{'dataset': 'geo',
'target_encoding': 'qdmr_sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_qdmr_predicted_train.json',
'dev_set_file': 'queries/spider/geo_gold_dev.json',
'dev_set_sql': 'queries/spider/geo_gold_dev.sql'}

Evaluation

Text-to-SQL model performance is evaluated using SQL execution accuracy in src/text_to_sql/eval_spider.py. The script automatically converts encoded QDMR predictions to SQL before executing them on the target database.

Citation ✍🏽

bibtex
@inproceedings{wolfson-etal-2022-weakly,
    title={"Weakly Supervised Text-to-SQL Parsing through Question Decomposition"},
    author={"Wolfson, Tomer and Deutch, Daniel and Berant, Jonathan"},
    booktitle = {"Findings of the Association for Computational Linguistics: NAACL 2022"},
    year={"2022"},
}

License

This repository and its data is released under the MIT license.

For the licensing of all external datasets and databases used throughout our experiments:

Discover hidden deepweb pages

DeepWeb Scapper Att: Demo version An simple script to scrappe deepweb to find pages. Will return if any of those exists and will save on a file. You s

Héber Júlio 77 Oct 02, 2022
PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

WaveGrad2 - PyTorch Implementation PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. Status (202

Keon Lee 59 Dec 06, 2022
3DMV jointly combines RGB color and geometric information to perform 3D semantic segmentation of RGB-D scans.

3DMV 3DMV jointly combines RGB color and geometric information to perform 3D semantic segmentation of RGB-D scans. This work is based on our ECCV'18 p

Владислав Молодцов 0 Feb 06, 2022
ICNet and PSPNet-50 in Tensorflow for real-time semantic segmentation

Real-Time Semantic Segmentation in TensorFlow Perform pixel-wise semantic segmentation on high-resolution images in real-time with Image Cascade Netwo

Oles Andrienko 219 Nov 21, 2022
codes for Self-paced Deep Regression Forests with Consideration on Ranking Fairness

Self-paced Deep Regression Forests with Consideration on Ranking Fairness This is official codes for paper Self-paced Deep Regression Forests with Con

Learning in Vision 4 Sep 11, 2022
A novel framework to automatically learn high-quality scanning of non-planar, complex anisotropic appearance.

appearance-scanner About This repository is an implementation of the neural network proposed in Free-form Scanning of Non-planar Appearance with Neura

Xiaohe Ma 14 Oct 18, 2022
A simple AI that will give you si ple task and this is made with python

Crystal-AI A simple AI that will give you si ple task and this is made with python Prerequsites: Python3.6.2 pyttsx3 pip install pyttsx3 pyaudio pip i

CrystalAnd 1 Dec 25, 2021
Streamlit app demonstrating an image browser for the Udacity self-driving-car dataset with realtime object detection using YOLO.

Streamlit Demo: The Udacity Self-driving Car Image Browser This project demonstrates the Udacity self-driving-car dataset and YOLO object detection in

Streamlit 992 Jan 04, 2023
object recognition with machine learning on Respberry pi

Respberrypi_object-recognition object recognition with machine learning on Respberry pi line.py 建立一支與樹梅派連線的 linebot 使用此 linebot 遠端控制樹梅派拍照 config.ini l

1 Dec 11, 2021
NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

NL-Augmenter 🦎 → 🐍 The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformat

684 Jan 09, 2023
Efficient Sharpness-aware Minimization for Improved Training of Neural Networks

Efficient Sharpness-aware Minimization for Improved Training of Neural Networks Code for “Efficient Sharpness-aware Minimization for Improved Training

Angusdu 32 Oct 18, 2022
TDmatch is a Python library developed to perform matching tasks in three categories:

TDmatch TDmatch is a Python library developed to perform matching tasks in three categories: Text to Data which matches tuples of a table to text docu

Naser Ahmadi 5 Aug 11, 2022
Code and training data for our ECCV 2016 paper on Unsupervised Learning

Shuffle and Learn (Shuffle Tuple) Created by Ishan Misra Based on the ECCV 2016 Paper - "Shuffle and Learn: Unsupervised Learning using Temporal Order

Ishan Misra 44 Dec 08, 2021
git《Tangent Space Backpropogation for 3D Transformation Groups》(CVPR 2021) GitHub:1]

LieTorch: Tangent Space Backpropagation Introduction The LieTorch library generalizes PyTorch to 3D transformation groups. Just as torch.Tensor is a m

Princeton Vision & Learning Lab 482 Jan 06, 2023
Official Pytorch implementation for video neural representation (NeRV)

NeRV: Neural Representations for Videos (NeurIPS 2021) Project Page | Paper | UVG Data Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, Abhinav S

hao 214 Dec 28, 2022
FluidNet re-written with ATen tensor lib

fluidnet_cxx: Accelerating Fluid Simulation with Convolutional Neural Networks. A PyTorch/ATen Implementation. This repository is based on the paper,

JoliBrain 50 Jun 07, 2022
RCT-ART is an NLP pipeline built with spaCy for converting clinical trial result sentences into tables through jointly extracting intervention, outcome and outcome measure entities and their relations.

Randomised controlled trial abstract result tabulator RCT-ART is an NLP pipeline built with spaCy for converting clinical trial result sentences into

2 Sep 16, 2022
PaddleBoBo是基于PaddlePaddle和PaddleSpeech、PaddleGAN等开发套件的虚拟主播快速生成项目

PaddleBoBo - 元宇宙时代,你也可以动手做一个虚拟主播。 PaddleBoBo是基于飞桨PaddlePaddle深度学习框架和PaddleSpeech、PaddleGAN等开发套件的虚拟主播快速生成项目。PaddleBoBo致力于简单高效、可复用性强,只需要一张带人像的图片和一段文字,就能

502 Jan 08, 2023
MultiMix: Sparingly Supervised, Extreme Multitask Learning From Medical Images (ISBI 2021, MELBA 2021)

MultiMix This repository contains the implementation of MultiMix. Our publications for this project are listed below: "MultiMix: Sparingly Supervised,

Ayaan Haque 27 Dec 22, 2022
ivadomed is an integrated framework for medical image analysis with deep learning.

Repository on the collaborative IVADO medical imaging project between the Mila and NeuroPoly labs.

144 Dec 19, 2022