A library to generate synthetic time series data by easy-to-use factors and generator

Overview

timeseries-generator

This repository consists of a python packages that generates synthetic time series dataset in a generic way (under /timeseries_generator) and demo notebooks on how to generate synthetic timeseries data (under /examples). The goal here is to have non-sensitive data available to demo solutions and test the effectiveness of those solutions and/or algorithms. In order to test your algorithm, you want to have time series available containing different kinds of trends. The python package should help create different kinds of time series while still being maintainable.

timeseries_generator package

For this package, it is assumed that a time series is composed of a base value multiplied by many factors.

ts = base_value * factor1 * factor2 * ... * factorN + Noiser

Diagram

These factors can be anything, random noise, linear trends, to seasonality. The factors can affect different features. For example, some features in your time series may have a seasonal component, while others do not.

Different factors are represented in different classes, which inherit from the BaseFactor class. Factor classes are input for the Generator class, which creates a dataframe containing the features, base value, all the different factors working on the base value and and the final factor and value.

Core concept

  • Generator: a python class to generate the time series. A generator contains a list of factors and noiser. By overlaying the factors and noiser, generator can produce a customized time series
  • Factor: a python class to generate the trend, seasonality, holiday factors, etc. Factors take effect by multiplying on the base value of the generator.
  • Noised: a python class to generate time series noise data. Noiser take effect by summing on top of "factorized" time series. This formula describes the concepts we talk above

Built-in Factors

  • LinearTrend: give a linear trend based on the input slope and intercept
  • CountryYearlyTrend: give a yearly-based market cap factor based on the GDP per - capita.
  • EUEcoTrendComponents: give a monthly changed factor based on EU industry product public data
  • HolidayTrendComponents: simulate the holiday sale peak. It adapts the holiday days - differently in different country
  • BlackFridaySaleComponents: simulate the BlackFriday sale event
  • WeekendTrendComponents: more sales at weekends than on weekdays
  • FeatureRandFactorComponents: set up different sale amount for different stores and different product
  • ProductSeasonTrendComponents: simulate season-sensitive product sales. In this example code, we have 3 different types of product:
    • winter jacket: inverse-proportional to the temperature, more sales in winter
    • basketball top: proportional to the temperature, more sales in summer
    • Yoga Mat: temperature insensitive

Installation

pip install timeseries-generator

Usage

from timeseries_generator import LinearTrend, Generator, WhiteNoise, RandomFeatureFactor
import pandas as pd

# setting up a linear tren
lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")
g = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-20-2020"))
g.generate()
g.plot()

# update by adding some white noise to the generator
wn = WhiteNoise(stdev_factor=0.05)
g.update_factor(wn)
g.generate()
g.plot()

Example Notebooks

We currently have 2 example notebooks available:

  1. generate_stationary_process: Good for introducing the basics of the timeseries_generator. Shows how to apply simple linear trends and how to introduce features and labels, as well as random noise.
  2. use_external_factors: Goes more into detail and shows how to use the external_factors submodule. Shows how to create seasonal trends.

Web based prototyping UI

We also use Streamlit to build a web-based UI to demonstrate how to use this package to generate synthesis time series data in an interactive web UI.

streamlit run examples/streamlit/app.py

Web UI

License

This package is released under the Apache License, Version 2.0

You might also like...
A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Prophet: Automatic Forecasting Procedure Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends ar

A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

Visualize classified time series data with interactive Sankey plots in Google Earth Engine
Visualize classified time series data with interactive Sankey plots in Google Earth Engine

sankee Visualize changes in classified time series data with interactive Sankey plots in Google Earth Engine Contents Description Installation Using P

PyPOTS - A Python Toolbox for Data Mining on Partially-Observed Time Series

A python toolbox/library for data mining on partially-observed time series, supporting tasks of forecasting/imputation/classification/clustering on incomplete multivariate time series with missing values.

A collection of Scikit-Learn compatible time series transformers and tools.
A collection of Scikit-Learn compatible time series transformers and tools.

tsfeast A collection of Scikit-Learn compatible time series transformers and tools. Installation Create a virtual environment and install: From PyPi p

Automatic extraction of relevant features from time series:
Automatic extraction of relevant features from time series:

tsfresh This repository contains the TSFRESH python package. The abbreviation stands for "Time Series Feature extraction based on scalable hypothesis

A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

Probabilistic time series modeling in Python
Probabilistic time series modeling in Python

GluonTS - Probabilistic Time Series Modeling in Python GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet (

Comments
  • Time series data augmentation

    Time series data augmentation

    There is a code example that gives to increase the amount of series data by adding slightly modified copies of already existing time series data or newly created synthetic series data from existing data?

    opened by YAYAYru 0
  • KeyError: 'country'

    KeyError: 'country'

    From the following code,

    from timeseries_generator import HolidayFactor, LinearTrend, Generator
    
    lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")
    
    g: Generator = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-01-2021"))
    
    holiday_factor = HolidayFactor(
        country_feature_name="country",
    )
    g.add_factor(holiday_factor)
    g.generate()
    

    I get the error. I am not sure this is expected behavior.

    File /usr/local/Caskroom/miniconda/base/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py:10083, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    ...
    -> 1849     raise KeyError(key)
       1851 # Check for duplicates
       1852 if values.ndim > 1:
    
    KeyError: 'country'
    
    opened by twobitunicorn 0
  • [Feature request] Customizable feature combinations

    [Feature request] Customizable feature combinations

    Hi team, Thanks for the useful library! I wonder if you'd be open to this idea:

    I would like to be able to:

    • Set up categorizing features (let's say, for illustration, CATEGORY=[footwear, t-shirts, socks], SIZE=[S, M, L, US-Mens-8, US-Womens-6) and define Factors on them
    • Generate time-series with more restricted feature combinations than the outer product (again for illustration, "t-shirt sizes for t-shirts, shoe sizes for footwear")

    Today, it seems like Generator.generate() hard-codes the assumption that time-series should be generated for the product of all provided feature values.

    It'd be helpful if, instead, we could have the option of customizing this join to limit down generated combinations?

    Some options I can think of:

    1. Leave the library as-is: Users generate full outer product and limit down what they want in post-processing
      • This seems possible already, but very RAM-intensive if your desired combinations are sparse?
    2. Accept an optional dataframe of factor combinations as parameter to the generate() method
      • Gives full flexibility over which combinations are kept / ignored, without assuming any particular rigid hierarchies between features
      • ...But might need to do a bit of validation to protect against user errors? May not be super easy to use without some documented examples / functions to generate the dataframe
    3. Some more complex API for feature configuration that accommodates specifying valid/invalid feature combinations
      • Might be nicer for usability, but difficult to make general: E.g. a straightforward hierarchy could be represented as a nested dict, but in practice many applications have multiple intersecting views of product category information e.g. brand, type, target segment, etc.
    opened by athewsey 1
  • Generate hourly data

    Generate hourly data

    First of all, thank you for making this repository public! I enjoy its ease of use and the built-in factors.

    Problem description

    I'm currently trying to generate revenue data for a bar/restaurant on an hourly basis. As far as I can see, the timeseries-generator only supports generating one data point per day, not per hour.

    I tried to generate hourly data like g = Generator(factors={lt}, features=None, date_range=pd.date_range(start='15/9/2021', end='30/9/2021', freq='h')) which didn't work.

    Potential solution

    Add the possibility to generate hourly data too. If this is a promising idea in your opinion, I'm willing to contribute to the implementation.

    Thank you in advance!

    opened by nileger 1
Releases(v0.1.0)
  • v0.1.0(Jul 20, 2021)

    • first release of time series generators, including:
      • base factor
      • linear trend factor
      • sinusoidal factor
      • white noise factor
      • random factor
      • holiday factor
      • weekday factor
      • country GDP factor
      • EU industry index factor
    • Examples
      • notebooks which includes some simple examples
      • streamlit dashboard
    Source code(tar.gz)
    Source code(zip)
Owner
Nike Inc.
Nike Inc.
Fundamentals of Machine Learning

Fundamentals-of-Machine-Learning This repository introduces the basics of machine learning algorithms for preprocessing, regression and classification

Happy N. Monday 3 Feb 15, 2022
A Python implementation of the Robotics Toolbox for MATLAB

Robotics Toolbox for Python A Python implementation of the Robotics Toolbox for MATLAB® GitHub repository Documentation Wiki (examples and details) Sy

Peter Corke 1.2k Jan 07, 2023
A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

2.3k Dec 29, 2022
Lseng-iseng eksplor Machine Learning dengan menggunakan library Scikit-Learn

Kalo dengar istilah ML, biasanya rada ambigu. Soalnya punya beberapa kepanjangan, seperti Mobile Legend, Makan Lontong, Ma**ng L*v* dan lain-lain. Tapi pada repo ini membahas Machine Learning :)

Alfiyanto Kondolele 1 Apr 06, 2022
Deploy AutoML as a service using Flask

AutoML Service Deploy automated machine learning (AutoML) as a service using Flask, for both pipeline training and pipeline serving. The framework imp

Chris Rawles 221 Nov 04, 2022
MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data

MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use

Pachter Lab 26 Nov 29, 2022
WAGMA-SGD is a decentralized asynchronous SGD for distributed deep learning training based on model averaging.

WAGMA-SGD is a decentralized asynchronous SGD based on wait-avoiding group model averaging. The synchronization is relaxed by making the collectives externally-triggerable, namely, a collective can b

Shigang Li 6 Jun 18, 2022
A data preprocessing package for time series data. Design for machine learning and deep learning.

A data preprocessing package for time series data. Design for machine learning and deep learning.

Allen Chiang 152 Jan 07, 2023
Implementation of K-Nearest Neighbors Algorithm Using PySpark

KNN With Spark Implementation of KNN using PySpark. The KNN was used on two separate datasets (https://archive.ics.uci.edu/ml/datasets/iris and https:

Zachary Petroff 4 Dec 30, 2022
Management of exclusive GPU access for distributed machine learning workloads

TensorHive is an open source tool for managing computing resources used by multiple users across distributed hosts. It focuses on granting

Paweł Rościszewski 131 Dec 12, 2022
This project has Classification and Clustering done Via kNN and K-Means respectfully

This project has Classification and Clustering done Via kNN and K-Means respectfully. It later tests its efficiency via F1/accuracy/recall/precision for kNN and Davies-Bouldin Index for Clustering. T

Mohammad Ali Mustafa 0 Jan 20, 2022
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 09, 2023
To design and implement the Identification of Iris Flower species using machine learning using Python and the tool Scikit-Learn.

To design and implement the Identification of Iris Flower species using machine learning using Python and the tool Scikit-Learn.

Astitva Veer Garg 1 Jan 11, 2022
Binary Classification Problem with Machine Learning

Binary Classification Problem with Machine Learning Solving Approach: 1) Ultimate Goal of the Assignment: This assignment is about solving a binary cl

Dinesh Mali 0 Jan 20, 2022
A Python package for time series classification

pyts: a Python package for time series classification pyts is a Python package for time series classification. It aims to make time series classificat

Johann Faouzi 1.4k Jan 01, 2023
Esse é o meu primeiro repo tratando de fim a fim, uma pipeline de dados abertos do governo brasileiro relacionado a compras de contrato e cronogramas anuais com spark, em pyspark e SQL!

Olá! Esse é o meu primeiro repo tratando de fim a fim, uma pipeline de dados abertos do governo brasileiro relacionado a compras de contrato e cronogr

Henrique de Paula 10 Apr 04, 2022
Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Neighbourhood Retrieval with Distance Correlation Assign Pseudo class labels to datapoints in the latent space. NNDC is a slim wrapper around FAISS. N

The Learning Machines 1 Jan 16, 2022
ETNA – time series forecasting framework

ETNA Time Series Library Predict your time series the easiest way Homepage | Documentation | Tutorials | Contribution Guide | Release Notes ETNA is an

Tinkoff.AI 675 Jan 08, 2023
inding a method to objectively quantify skill versus chance in games, using reinforcement learning

Skill-vs-chance-games-analysis - Finding a method to objectively quantify skill versus chance in games, using reinforcement learning

Marcus Chiam 4 Nov 19, 2022
Lightweight Machine Learning Experiment Logging 📖

Simple logging of statistics, model checkpoints, plots and other objects for your Machine Learning Experiments (MLE). Furthermore, the MLELogger comes with smooth multi-seed result aggregation and co

Robert Lange 65 Dec 08, 2022