The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question IntentionClassification Benchmark for Text-to-SQL"

Overview

TriageSQL

The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text-to-SQL"

Dataset Download

Due to the size limitation, please download the dataset from Google Drive.

Citations

If you want to use TriageSQL in your work, please cite as follows:

@article{zhang2020did,
  title={Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text-to-SQL},
  author={Zhang, Yusen and Dong, Xiangyu and Chang, Shuaichen and Yu, Tao and Shi, Peng and Zhang, Rui},
  journal={arXiv preprint arXiv:2010.12634},
  year={2020}
}

Dataset

In each json file of the dataset, one can find a field called type, which includes 5 different values, including small talk, answerable, ambiguous, lack data, and unanswerable by sql, corresponding to 5 different types described in our paper. Here is the summary of our dataset and the corresponding experiment results:

Type Trainset Devset Testset Type Alias Reported F1
small talk 31160 7790 500 Improper 0.88
ambiguous 48592 9564 500 Ambiguous 0.43
lack data 90375 19566 500 ExtKnow 0.56
unanswerable by sql 124225 26330 500 Non-SQL 0.90
answerable 139884 32892 500 Answerable 0.53
overall 434236 194037 2500 TriageSQL 0.66

The folder src contains all the source files used to construct the proposed TriageSQL. In addition, some part of files contains more details about the dataset, such as databaseid which is the id of the schema in the original dataset, e.g. "flight_2" in CoSQL, while question_datasetid indicates the original dataset name of the questions, e.g. "quac". Some of the samples do not contain these fields because they are either human-annotated or edited.

Model

We also include the source code for RoBERTa baseline in our project in /model. It is a multi-classifer with 5 classes where '0' represents answerable, '1'-'4' represent distinct types of unanswerable questions. Given the dataset from Google Drive, you may need to conduct some preprocessing to obtain train/dev/test set. You can directly download from here or make your own dataset using the following instructions:

Constructing input file for the RoBERTa model

The same as /testset/test.json, our input file is a json list with shape (num_of_question, 3) containing 3 lists: query, schema, and label.

  • query: containing strings of questions
  • schema: contianing strings of schema for each question, i.e., "table_name.column_name1 | table_name.column_name2 | ... " for multi-table questions, and column_name1 | column_name2 for single-table questions.
  • labels of questions, see config.label_dict for the mapping, leave arbitary value if testing is not needed or true labels are not given.

when preprocessing, please use lower case for all data, and remove the meaningless table names as well, such as T10023-1242. Also, we sample 10k from each type to form the large input dataset

Running

After adjusting the parameters in config.py, one can simply run python train.py or python eval.py to train or evaluate the model.

Explanation of other files

  • config.py: hyper parameters
  • train.py: training and evaluation of the model
  • utils.py: loading the dataset and tokenization
  • model.py: the RoBERTa classification model we used
  • test.json: sample of test input
Owner
Yusen Zhang
Yusen Zhang
Simulating Sycamore quantum circuits classically using tensor network algorithm.

Simulating the Sycamore quantum supremacy circuit This repo contains data we have obtained in simulating the Sycamore quantum supremacy circuits with

Feng Pan 46 Nov 17, 2022
Finetune the base 64 px GLIDE-text2im model from OpenAI on your own image-text dataset

Finetune the base 64 px GLIDE-text2im model from OpenAI on your own image-text dataset

Clay Mullis 82 Oct 13, 2022
Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data - Official PyTorch Implementation (CVPR 2022)

Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data (CVPR 2022) Potentials of primitive shapes f

31 Sep 27, 2022
Official PyTorch Implementation for InfoSwap: Information Bottleneck Disentanglement for Identity Swapping

InfoSwap: Information Bottleneck Disentanglement for Identity Swapping Code usage Please check out the user manual page. Paper Gege Gao, Huaibo Huang,

Grace Hešeri 56 Dec 20, 2022
Image Captioning using CNN and Transformers

Image-Captioning Keras/Tensorflow Image Captioning application using CNN and Transformer as encoder/decoder. In particulary, the architecture consists

24 Dec 28, 2022
Spatial Sparse Convolution Library

SpConv: Spatially Sparse Convolution Library PyPI Install Downloads CPU (Linux Only) pip install spconv CUDA 10.2 pip install spconv-cu102 CUDA 11.1 p

Yan Yan 1.2k Jan 07, 2023
Official implementation for the paper "SAPE: Spatially-Adaptive Progressive Encoding for Neural Optimization".

SAPE Project page Paper Official implementation for the paper "SAPE: Spatially-Adaptive Progressive Encoding for Neural Optimization". Environment Cre

36 Dec 09, 2022
A clean implementation based on AlphaZero for any game in any framework + tutorial + Othello/Gobang/TicTacToe/Connect4 and more

Alpha Zero General (any game, any framework!) A simplified, highly flexible, commented and (hopefully) easy to understand implementation of self-play

Surag Nair 3.1k Jan 05, 2023
Woosung Choi 63 Nov 14, 2022
Towards Long-Form Video Understanding

Towards Long-Form Video Understanding Chao-Yuan Wu, Philipp Krähenbühl, CVPR 2021 [Paper] [Project Page] [Dataset] Citation @inproceedings{lvu2021,

Chao-Yuan Wu 69 Dec 26, 2022
Studying Python release adoptions by looking at PyPI downloads

Analysis of version adoptions on PyPI We get PyPI download statistics via Google's BigQuery using the pypinfo tool. Usage First you need to get an acc

Julien Palard 9 Nov 04, 2022
This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"

Differentiable Volumetric Rendering Paper | Supplementary | Spotlight Video | Blog Entry | Presentation | Interactive Slides | Project Page This repos

697 Jan 06, 2023
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [2021]

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations This repo contains the Pytorch implementation of our paper: Revisit

Wouter Van Gansbeke 80 Nov 20, 2022
H&M Fashion Image similarity search with Weaviate and DocArray

H&M Fashion Image similarity search with Weaviate and DocArray This example shows how to do image similarity search using DocArray and Weaviate as Doc

Laura Ham 18 Aug 11, 2022
Multi-angle c(q)uestion answering

Macaw Introduction Macaw (Multi-angle c(q)uestion answering) is a ready-to-use model capable of general question answering, showing robustness outside

AI2 430 Jan 04, 2023
Camera calibration & 3D pose estimation tools for AcinoSet

AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild Daniel Joska, Liam Clark, Naoya Muramatsu, Ricardo Jericevich, Fre

African Robotics Unit 42 Nov 16, 2022
This repository contains all source code, pre-trained models related to the paper "An Empirical Study on GANs with Margin Cosine Loss and Relativistic Discriminator"

An Empirical Study on GANs with Margin Cosine Loss and Relativistic Discriminator This is a Pytorch implementation for the paper "An Empirical Study o

Cuong Nguyen 3 Nov 15, 2021
EMNLP 2021 Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections

Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections Ruiqi Zhong, Kristy Lee*, Zheng Zhang*, Dan Klein EMN

Ruiqi Zhong 42 Nov 03, 2022
Torchyolo - Yolov3 ve Yolov4 modellerin Pytorch uygulamasıdır

TORCHYOLO : Yolo Modellerin Pytorch Uygulaması Yapılacaklar: Yolov3 model.py ve

Kadir Nar 3 Aug 22, 2022
Unofficial pytorch implementation of 'Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization'

pytorch-AdaIN This is an unofficial pytorch implementation of a paper, Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization [Hua

Naoto Inoue 873 Jan 06, 2023