An easy-to-use feature store

Overview

An easy-to-use feature store.

💾 What is a feature store?

A feature store is a data storage system for data science and machine learning. It can store raw data as well as transformed features, which can be fed straight into an ML model or training script.

Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.

The ByteHub Feature Store is designed to:

  • Be simple to use, with a Pandas-like API;
  • Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
  • Be optimised for timeseries operations, making it well suited to applications in finance, energy, and forecasting; and
  • Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

🦉 Features

  • Searchable feature information and metadata can be stored locally using SQLite or in a remote database.
  • Timeseries data is saved in Parquet format using Dask, making it readable from a wide range of other tools. Data can reside either on a local filesystem or in a cloud storage service, e.g. AWS S3.
  • Supports timeseries joins, along with filtering and resampling operations to make it easy to load and prepare datasets for ML training.
  • Feature engineering steps can be implemented as transforms. These are saved within the feature store, allowing for simple, reusable preparation of raw data.
  • Time travel can retrieve feature values based on when they were created, which can be useful for forecasting applications.
  • Simple APIs to retrieve timeseries dataframes for training, or a dictionary of the most recent feature values for inference (see the sketch after this list).
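
As a concrete illustration of the query-side features above (time travel and last-value retrieval), here is a minimal sketch. The time_travel argument and the fs.last() helper are assumptions based on the ByteHub documentation; check the docs for the version you have installed.

# Minimal sketch of the query-side features described above.
# The time_travel argument and fs.last() are assumptions based on the
# ByteHub documentation; check the docs for your installed version.
import bytehub as bh

# Metadata lives in local SQLite by default; a SQLAlchemy-style connection
# string (e.g. a Postgres URL) could be passed for a remote database.
fs = bh.FeatureStore()

# Time travel: load values as they were known one day before each timestamp.
df = fs.load_dataframe(
    'tutorial/numbers',
    from_date='2021-01-01', to_date='2021-01-31',
    time_travel='-1d',  # assumed pandas-style offset string
)

# Most recent value of each feature, returned as a dictionary for inference.
latest = fs.last(['tutorial/numbers'])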

Also available as ☁️ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.

📖 Documentation and tutorials

See the ByteHub documentation and notebook tutorials to learn more and get started.

🚀 Quick-start

Install using pip:

pip install bytehub

Create a local SQLite feature store by running:

import bytehub as bh
import pandas as pd

fs = bh.FeatureStore()

Data lives inside namespaces within each feature store. They can be used to separate projects or environments. Create a namespace as follows:

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)

Create a feature inside this namespace which will be used to store a timeseries of pre-prepared data:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now save some data into the feature store:

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')

The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine-learning models.

We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # Receives a dataframe of the input feature(s) and returns the transformed values
    return df ** 2  # Square the input

Now both features are saved in the feature store, and can be queried using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
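
Because the store is optimised for timeseries, queries can also be resampled on load. A short sketch, assuming load_dataframe accepts a pandas-style freq argument as described in the ByteHub docs:

# Resample both features to daily frequency while loading.
# The freq argument name is an assumption; check the ByteHub docs.
df_daily = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31',
    freq='1d',
)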

To connect to ByteHub Cloud, first register for an account, then use:

fs = bh.FeatureStore("https://api.bytehub.ai")

This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.

🐾 Roadmap

  • Tasks to automate updates to features using orchestration tools like Airflow
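
Built-in tasks are not available yet; in the meantime, feature updates can be scheduled from any orchestrator. The sketch below is a hypothetical Airflow DAG, not part of ByteHub, that simply calls the same save_dataframe API shown in the quick-start once a day.

# Hypothetical example: schedule a daily feature update with Airflow.
# Nothing here is provided by ByteHub; it just wraps the quick-start
# save_dataframe call in a scheduled task.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

import bytehub as bh


def update_numbers():
    fs = bh.FeatureStore()  # connect to the metadata store
    df = pd.DataFrame({'time': [pd.Timestamp.utcnow()], 'value': [0]})
    fs.save_dataframe(df, 'tutorial/numbers')


with DAG(
    'bytehub_feature_update',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    PythonOperator(task_id='update_numbers', python_callable=update_numbers)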