The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question IntentionClassification Benchmark for Text-to-SQL"

Last update: Nov 09, 2022

Overview

TriageSQL

The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text-to-SQL"

Dataset Download

Due to the size limitation, please download the dataset from Google Drive.

Citations

If you want to use TriageSQL in your work, please cite as follows:

@article{zhang2020did,
  title={Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text-to-SQL},
  author={Zhang, Yusen and Dong, Xiangyu and Chang, Shuaichen and Yu, Tao and Shi, Peng and Zhang, Rui},
  journal={arXiv preprint arXiv:2010.12634},
  year={2020}
}

Dataset

In each json file of the dataset, one can find a field called type, which includes 5 different values, including small talk, answerable, ambiguous, lack data, and unanswerable by sql, corresponding to 5 different types described in our paper. Here is the summary of our dataset and the corresponding experiment results:

Type	Trainset	Devset	Testset	Type Alias	Reported F1
small talk	31160	7790	500	Improper	0.88
ambiguous	48592	9564	500	Ambiguous	0.43
lack data	90375	19566	500	ExtKnow	0.56
unanswerable by sql	124225	26330	500	Non-SQL	0.90
answerable	139884	32892	500	Answerable	0.53
overall	434236	194037	2500	TriageSQL	0.66

The folder src contains all the source files used to construct the proposed TriageSQL. In addition, some part of files contains more details about the dataset, such as databaseid which is the id of the schema in the original dataset, e.g. "flight_2" in CoSQL, while question_datasetid indicates the original dataset name of the questions, e.g. "quac". Some of the samples do not contain these fields because they are either human-annotated or edited.

Model

We also include the source code for RoBERTa baseline in our project in /model. It is a multi-classifer with 5 classes where '0' represents answerable, '1'-'4' represent distinct types of unanswerable questions. Given the dataset from Google Drive, you may need to conduct some preprocessing to obtain train/dev/test set. You can directly download from here or make your own dataset using the following instructions:

Constructing input file for the RoBERTa model

The same as /testset/test.json, our input file is a json list with shape (num_of_question, 3) containing 3 lists: query, schema, and label.

query: containing strings of questions
schema: contianing strings of schema for each question, i.e., "table_name.column_name1 | table_name.column_name2 | ... " for multi-table questions, and column_name1 | column_name2 for single-table questions.
labels of questions, see config.label_dict for the mapping, leave arbitary value if testing is not needed or true labels are not given.

when preprocessing, please use lower case for all data, and remove the meaningless table names as well, such as T10023-1242. Also, we sample 10k from each type to form the large input dataset

Running

After adjusting the parameters in config.py, one can simply run python train.py or python eval.py to train or evaluate the model.

Explanation of other files

config.py: hyper parameters
train.py: training and evaluation of the model
utils.py: loading the dataset and tokenization
model.py: the RoBERTa classification model we used
test.json: sample of test input

The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question IntentionClassification Benchmark for Text-to-SQL"

Related tags

Overview

TriageSQL

Dataset Download

Citations

Dataset

Model

Constructing input file for the RoBERTa model

Running

Explanation of other files

Owner

Yusen Zhang

Simulating Sycamore quantum circuits classically using tensor network algorithm.

Finetune the base 64 px GLIDE-text2im model from OpenAI on your own image-text dataset

Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data - Official PyTorch Implementation (CVPR 2022)

Official PyTorch Implementation for InfoSwap: Information Bottleneck Disentanglement for Identity Swapping

Image Captioning using CNN and Transformers

Spatial Sparse Convolution Library

Official implementation for the paper "SAPE: Spatially-Adaptive Progressive Encoding for Neural Optimization".

A clean implementation based on AlphaZero for any game in any framework + tutorial + Othello/Gobang/TicTacToe/Connect4 and more

A PyTorch Implementation of the paper - Choi, Woosung, et al. "Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation." 21th International Society for Music Information Retrieval Conference, ISMIR. 2020.

Towards Long-Form Video Understanding

Studying Python release adoptions by looking at PyPI downloads

This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [2021]

H&M Fashion Image similarity search with Weaviate and DocArray

Multi-angle c(q)uestion answering

Camera calibration & 3D pose estimation tools for AcinoSet

This repository contains all source code, pre-trained models related to the paper "An Empirical Study on GANs with Margin Cosine Loss and Relativistic Discriminator"

EMNLP 2021 Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections

Torchyolo - Yolov3 ve Yolov4 modellerin Pytorch uygulamasıdır

Unofficial pytorch implementation of 'Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization'