An extension package of 🤗 Datasets that provides support for executing arbitrary SQL queries on HF datasets

Overview

datasets_sql

A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on HF datasets. It uses DuckDB as a SQL engine and follows its query syntax.

Installation

pip install datasets_sql

Quick Start

from datasets import load_dataset, Dataset
from datasets_sql import query

imdb_dset = load_dataset("imdb", split="train")

# Remove the rows where the `text` field has less than 1000 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 1000")

# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")

# Remove duplicated rows
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")

# Get the average length of the `text` field
imdb_query_dset4 = query("SELECT AVG(length(text)) as avg_text_length FROM imdb_dset")

order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})

customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})

# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
You might also like...
SQL for Humans™
SQL for Humans™

Records: SQL for Humans™ Records is a very simple, but powerful, library for making raw SQL queries to most relational databases. Just write SQL. No b

SQL for Humans™
SQL for Humans™

Records: SQL for Humans™ Records is a very simple, but powerful, library for making raw SQL queries to most relational databases. Just write SQL. No b

Anomaly detection on SQL data warehouses and databases
Anomaly detection on SQL data warehouses and databases

With CueObserve, you can run anomaly detection on data in your SQL data warehouses and databases. Getting Started Install via Docker docker run -p 300

Simple DDL Parser to parse SQL (HQL, TSQL, AWS Redshift, Snowflake and other dialects) ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc.

Simple DDL Parser Build with ply (lex & yacc in python). A lot of samples in 'tests/. Is it Stable? Yes, library already has about 5000+ usage per day

PyRemoteSQL is a python SQL client that allows you to connect to your remote server with phpMyAdmin installed.

PyRemoteSQL Python MySQL remote client Basically this is a python SQL client that allows you to connect to your remote server with phpMyAdmin installe

edaSQL is a library to link SQL to Exploratory Data Analysis and further more in the Data Engineering.
edaSQL is a library to link SQL to Exploratory Data Analysis and further more in the Data Engineering.

edaSQL is a python library to bridge the SQL with Exploratory Data Analysis where you can connect to the Database and insert the queries. The query results can be passed to the EDA tool which can give greater insights to the user.

Python script to clone SQL dashboard from one workspace to another

Databricks dashboard clone Unofficial project to allow Databricks SQL dashboard copy from one workspace to another. Resource clone Setup: Create a fil

Some scripts for microsoft SQL server in old version.
Some scripts for microsoft SQL server in old version.

MSSQL_Stuff Some scripts for microsoft SQL server which is in old version. Table of content Overview Usage References Overview These script works when

Making it easy to query APIs via SQL

Shillelagh Shillelagh (ʃɪˈleɪlɪ) is an implementation of the Python DB API 2.0 based on SQLite (using the APSW library): from shillelagh.backends.apsw

Comments
  • How to use query function if dataset is a class attribute

    How to use query function if dataset is a class attribute

    Awesome library!

    This is probably a generic duckdb question but figured I'd ask here first. If I store a reference to a dataset in a class attribute, how do I get query to find my dataset?

    Repro:

    class DatasetQuery:
        
        def __init__(self, dataset_name, split="train"):
            ds = datasets.load_dataset(dataset_name, split="train")
            self.dataset = ds
        
        def query(self, query_str):
            return query(query_str)
    
    dq = DatasetQuery("huggingnft/boredapeyachtclub")
    dq.query("select * from ?? limit 10;")
    

    What do I put in the from clause? I tried ds and self.dataset but neither work. I get ValueError: The datasetdsnot found in the namespace.

    opened by freddyaboulton 4
  • The readme demos are broken

    The readme demos are broken

    I tried running an example from the repo but the code is broken:

    imdb_dset = load_dataset("imdb", split="train")
    dataset = query(
        "SELECT text FROM imdb_dset"
    )
    

    results in AttributeError: 'duckdb.DuckDBPyConnection' object has no attribute 'fetch_arrow_chunk'

    I am using datasets_sql version 0.1.1 and datasets version 2.5.2

    opened by mo6zes 1
  • Be able to stream the results of query

    Be able to stream the results of query

    I'd like to query a large remote dataset (on the hub or elsewhere) and then stream the results of the query so that I don't have to download the entire dataset to my machine.

    For example, you could query diffusiondb for images generated with prompts containing the word "ceo" to visualize biases:

    SELECT * from poloclub/diffusiondb
    WHERE contains('prompt', 'ceo')
    

    This combined with https://github.com/huggingface/datasets-server/issues/398 would open the door for a lot of cool applications of gradio + datasets where users could interactively explore datasets that don't fit on their machines and create spaces without having to download/store large datasets.

    I see that data can be streamed from duckdb with pyarrow: https://duckdb.org/2021/12/03/duck-arrow.html . I wonder if this can be leveraged for this use case.

    opened by freddyaboulton 5
Releases(0.3.0)
Owner
Mario Šaško
SWE at Hugging Face
Mario Šaško
Confluent's Kafka Python Client

Confluent's Python Client for Apache KafkaTM confluent-kafka-python provides a high-level Producer, Consumer and AdminClient compatible with all Apach

Confluent Inc. 3.1k Jan 05, 2023
Pure-python PostgreSQL driver

pg-purepy pg-purepy is a pure-Python PostgreSQL wrapper based on the anyio library. A lot of this library was inspired by the pg8000 library. Credits

Lura Skye 11 May 23, 2022
aioodbc - is a library for accessing a ODBC databases from the asyncio

aioodbc aioodbc is a Python 3.5+ module that makes it possible to access ODBC databases with asyncio. It relies on the awesome pyodbc library and pres

aio-libs 253 Dec 31, 2022
Makes it easier to write raw SQL in Python.

CoolSQL Makes it easier to write raw SQL in Python. Usage Quick Start from coolsql import Field name = Field("name") age = Field("age") condition =

Aber 7 Aug 21, 2022
Py2neo is a comprehensive toolkit for working with Neo4j from within Python applications or from the command line.

Py2neo v3 Py2neo is a client library and toolkit for working with Neo4j from within Python applications and from the command line. The core library ha

64 Oct 14, 2022
a small, expressive orm -- supports postgresql, mysql and sqlite

peewee Peewee is a simple and small ORM. It has few (but expressive) concepts, making it easy to learn and intuitive to use. a small, expressive ORM p

Charles Leifer 9.7k Dec 30, 2022
Sample scripts to show extracting details directly from the AIQUM database

Sample scripts to show extracting details directly from the AIQUM database

1 Nov 19, 2021
Asynchronous Python client for InfluxDB

aioinflux Asynchronous Python client for InfluxDB. Built on top of aiohttp and asyncio. Aioinflux is an alternative to the official InfluxDB Python cl

Gustavo Bezerra 159 Dec 27, 2022
Python PostgreSQL database performance insights. Locks, index usage, buffer cache hit ratios, vacuum stats and more.

Python PG Extras Python port of Heroku PG Extras with several additions and improvements. The goal of this project is to provide powerful insights int

Paweł Urbanek 35 Nov 01, 2022
asyncio (PEP 3156) Redis support

aioredis asyncio (PEP 3156) Redis client library. Features hiredis parser Yes Pure-python parser Yes Low-level & High-level APIs Yes Connections Pool

aio-libs 2.2k Jan 04, 2023
Python ODBC bridge

pyodbc pyodbc is an open source Python module that makes accessing ODBC databases simple. It implements the DB API 2.0 specification but is packed wit

Michael Kleehammer 2.6k Dec 27, 2022
A Python Object-Document-Mapper for working with MongoDB

MongoEngine Info: MongoEngine is an ORM-like layer on top of PyMongo. Repository: https://github.com/MongoEngine/mongoengine Author: Harry Marr (http:

MongoEngine 3.9k Jan 08, 2023
Import entity definition document into SQLie3. Manage the entity. Also, create a "Create Table SQL file".

EntityDocumentMaker Version 1.00 After importing the entity definition (Excel file), store the data in sqlite3. エンティティ定義(Excelファイル)をインポートした後、データをsqlit

G-jon FujiYama 1 Jan 09, 2022
Sample code to extract data directly from the NetApp AIQUM MySQL Database

This sample code shows how to connect to the AIQUM Database and pull user quota details from it. AIQUM Requirements: 1. AIQUM 9.7 or higher. 2. An

1 Nov 08, 2021
Python cluster client for the official redis cluster. Redis 3.0+.

redis-py-cluster This client provides a client for redis cluster that was added in redis 3.0. This project is a port of redis-rb-cluster by antirez, w

Grokzen 1.1k Jan 05, 2023
A simple Python tool to transfer data from MySQL to SQLite 3.

MySQL to SQLite3 A simple Python tool to transfer data from MySQL to SQLite 3. This is the long overdue complimentary tool to my SQLite3 to MySQL. It

Klemen Tusar 126 Jan 03, 2023
SQL for Humans™

Records: SQL for Humans™ Records is a very simple, but powerful, library for making raw SQL queries to most relational databases. Just write SQL. No b

Ken Reitz 6.9k Jan 03, 2023
Toolkit for storing files and attachments in web applications

DEPOT - File Storage Made Easy DEPOT is a framework for easily storing and serving files in web applications on Python2.6+ and Python3.2+. DEPOT suppo

Alessandro Molina 139 Dec 25, 2022
Official Python low-level client for Elasticsearch

Python Elasticsearch Client Official low-level client for Elasticsearch. Its goal is to provide common ground for all Elasticsearch-related code in Py

elastic 3.8k Jan 01, 2023
Async ORM based on PyPika

PyPika-ORM - ORM for PyPika SQL Query Builder The package gives you ORM for PyPika with asycio support for a range of databases (SQLite, PostgreSQL, M

Kirill Klenov 7 Jun 04, 2022