A data preprocessing package for time series data, designed for machine learning and deep learning.

Overview

Time Series Transformer

Documentation: https://allen-chiang.github.io/Time-Series-Transformer/


import pandas as pd
import numpy as np
from time_series_transform.sklearn import *
import time_series_transform as tst

Introduction

This package provides tools for time series data preprocessing. It has two main components: Time_Series_Transformer and Stock_Transformer. Time_Series_Transformer is a general class for all types of time series data, while Stock_Transformer is a subclass of Time_Series_Transformer. Time_Series_Transformer provides functions for data manipulation, IO transformation, and simple plots. This tutorial takes a quick look at the data manipulation and basic IO functions. The plot functions are covered in a separate tutorial.

Time_Series_Transformer

Since all time series data has a time component, Time_Series_Transformer requires a time index to be specified. The simplest time series data has no special category. In many cases, however, time series data is associated with categories. For example, inventory data is usually associated with product names or stores, and stock data has different ticker names or brokers. To address this, Time_Series_Transformer lets you specify a main category index. Given the main category index, the data can be manipulated in parallel, category by category.

Here is a simple example that creates a Time_Series_Transformer without specifying a category.

data = {
    'time':[1,2,3,4,5],
    'data1':[1,2,3,4,5],
    'data2':[6,7,8,9,10]
}
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans
data column
-----------
time
data1
data2
time length: 5
category: None

There are two ways to manipulate the data. The first is to use the pre-made functions, and the second is to use the transform function with your own custom function. The pre-made functions include make_lag, make_lead, make_lag_sequence, make_lead_sequence, and make_stack_sequence. The following demonstration covers the lag and lead functions.

Pre-made functions

make_lag and make_lead create lag or lead data for the input columns. This type of manipulation can be useful for machine learning.

trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lag(
    inputLabels = ['data1','data2'],
    lagNum = 1,
    suffix = '_lag_',
    fillMissing = np.nan
)
print(trans.to_pandas())
   time  data1  data2  data1_lag_1  data2_lag_1
0     1      1      6          NaN          NaN
1     2      2      7          1.0          6.0
2     3      3      8          2.0          7.0
3     4      4      9          3.0          8.0
4     5      5     10          4.0          9.0
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lead(
    inputLabels = ['data1','data2'],
    leadNum = 1,
    suffix = '_lead_',
    fillMissing = np.nan
)
print(trans.to_pandas())
   time  data1  data2  data1_lead_1  data2_lead_1
0     1      1      6           2.0           7.0
1     2      2      7           3.0           8.0
2     3      3      8           4.0           9.0
3     4      4      9           5.0          10.0
4     5      5     10           NaN           NaN
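For readers who know pandas, the lag and lead columns shown above look the same as what DataFrame.shift produces. The following is a minimal sketch under that assumption. It uses plain pandas only and is not the package's own implementation.

```python
import pandas as pd

# Same toy data as in the examples above.
df = pd.DataFrame({
    'time': [1, 2, 3, 4, 5],
    'data1': [1, 2, 3, 4, 5],
    'data2': [6, 7, 8, 9, 10],
})

for col in ['data1', 'data2']:
    # A lag of 1 moves each value one row down; the first row becomes NaN.
    df[f'{col}_lag_1'] = df[col].shift(1)
    # A lead of 1 moves each value one row up; the last row becomes NaN.
    df[f'{col}_lead_1'] = df[col].shift(-1)

print(df)
```

The NaN introduced at the edges is also why the integer columns turn into floats in the printed output.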

make_lag_sequence and make_lead_sequence create a sequence for a given window length and lag or lead number. This manipulation can be useful for deep learning.

trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lag_sequence(
    inputLabels = ['data1','data2'],
    windowSize = 2,
    lagNum =1,
    suffix = '_lag_seq_'
)
print(trans.to_pandas())
   time  data1  data2 data1_lag_seq_2 data2_lag_seq_2
0     1      1      6      [nan, nan]      [nan, nan]
1     2      2      7      [nan, 1.0]      [nan, 6.0]
2     3      3      8      [1.0, 2.0]      [6.0, 7.0]
3     4      4      9      [2.0, 3.0]      [7.0, 8.0]
4     5      5     10      [3.0, 4.0]      [8.0, 9.0]
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lead_sequence(
    inputLabels = ['data1','data2'],
    windowSize = 2,
    leadNum =1,
    suffix = '_lead_seq_'
)
print(trans.to_pandas())
   time  data1  data2 data1_lead_seq_2 data2_lead_seq_2
0     1      1      6       [2.0, 3.0]       [7.0, 8.0]
1     2      2      7       [3.0, 4.0]       [8.0, 9.0]
2     3      3      8       [4.0, 5.0]      [9.0, 10.0]
3     4      4      9       [nan, nan]       [nan, nan]
4     5      5     10       [nan, nan]       [nan, nan]
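To make the window layout concrete, here is a small pure-Python sketch that reproduces the lag sequence output shown above. The helper name lag_sequence is hypothetical and the logic is inferred from the printed table, so treat it as an illustration rather than the package's code.

```python
import math

def lag_sequence(values, window_size, lag_num):
    """Hypothetical helper: for each position t, collect the window_size
    values that end lag_num steps before t, padding with nan."""
    out = []
    for t in range(len(values)):
        window = []
        # Walk from the oldest offset in the window to the newest.
        for offset in range(lag_num + window_size - 1, lag_num - 1, -1):
            idx = t - offset
            window.append(float(values[idx]) if idx >= 0 else math.nan)
        out.append(window)
    return out

seqs = lag_sequence([1, 2, 3, 4, 5], window_size=2, lag_num=1)
for row in seqs:
    print(row)
```

Note that in the lead sequence output above, rows whose window runs past the end of the data are filled entirely with nan, whereas the lag sequence keeps partially filled windows.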

Custom Functions

To use the transform function, you have to write a custom function. The input data is passed as a dict of lists, and the output should be a pandas DataFrame, pandas Series, numpy ndarray, or list. Note that the output length must be consistent with the original data length.

For example, this function takes the input dictionary and sums its columns element-wise. The final output is a list.

import copy
def list_output (dataDict):
    res = []
    for i in dataDict:
        if len(res) == 0:
            res = copy.deepcopy(dataDict[i])
            continue
        for ix,v in enumerate(dataDict[i]):
            res[ix] += v
    return res
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.transform(
    inputLabels = ['data1','data2'],
    newName = 'sumCol',
    func = list_output
)
print(trans.to_pandas())
   time  data1  data2  sumCol
0     1      1      6       7
1     2      2      7       9
2     3      3      8      11
3     4      4      9      13
4     5      5     10      15

The following example outputs a pandas DataFrame and also takes an additional parameter. Note: since a pandas DataFrame already has column names, the new name is automatically prepended to each of them as a prefix.

def pandas_output(dataDict, pandasColName):
    res = []
    for i in dataDict:
        if len(res) == 0:
            res = copy.deepcopy(dataDict[i])
            continue
        for ix,v in enumerate(dataDict[i]):
            res[ix] += v
    return pd.DataFrame({pandasColName:res})
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.transform(
    inputLabels = ['data1','data2'],
    newName = 'sumCol',
    func = pandas_output,
    pandasColName = "pandasName"
)
print(trans.to_pandas())
   time  data1  data2  sumCol_pandasName
0     1      1      6                  7
1     2      2      7                  9
2     3      3      8                 11
3     4      4      9                 13
4     5      5     10                 15
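The description above also allows a numpy ndarray as the return type. Below is a minimal sketch of such a function. The name row_mean and its scale parameter are hypothetical, and the function is called directly on a dict of lists here so that it runs without the package. Passing it as func to transform is assumed to work the same way as in the examples above.

```python
import numpy as np

def row_mean(dataDict, scale=1.0):
    """Hypothetical custom function: element-wise mean of all input
    columns, returned as a numpy array with the original length."""
    # Stack the columns into an array of shape (n_columns, n_rows).
    columns = np.array([dataDict[key] for key in dataDict], dtype=float)
    return columns.mean(axis=0) * scale

sample = {'data1': [1, 2, 3, 4, 5], 'data2': [6, 7, 8, 9, 10]}
result = row_mean(sample)
print(result)
```

Because the result has one value per input row, it satisfies the requirement that the output length match the original data length.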

Data with Category

Since time series data can be associated with different categories, Time_Series_Transformer accepts a mainCategoryCol parameter that identifies the main category. The class supports only one main category column because multiple dimensions can be aggregated into a single new column that serves as the main category.

The following example has one category column with two values, a and b. Their timestamps overlap only partly: category b has no record at time 2.

data = {
    "time":[1,2,3,4,5,1,3,4,5],
    'data':[1,2,3,4,5,1,2,3,4],
    "category":['a','a','a','a','a','b','b','b','b']
}
trans = tst.Time_Series_Transformer(data,'time','category')
trans
data column
-----------
time
data
time length: 5
category: a

data column
-----------
time
data
time length: 4
category: b

main category column: category

Since the main category column is specified, data manipulation functions can use n_jobs to run in parallel. Parallel execution is implemented with joblib (https://joblib.readthedocs.io/en/latest/).

trans = trans.make_lag(
    inputLabels = ['data'],
    lagNum = 1,
    suffix = '_lag_',
    fillMissing = np.nan,
    n_jobs = 2,
    verbose = 10        
)
print(trans.to_pandas())
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


   time  data  data_lag_1 category
0     1     1         NaN        a
1     2     2         1.0        a
2     3     3         2.0        a
3     4     4         3.0        a
4     5     5         4.0        a
5     1     1         NaN        b
6     3     2         1.0        b
7     4     3         2.0        b
8     5     4         3.0        b


[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    3.6s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    3.6s finished
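The per-category lag above can be mimicked in plain pandas with groupby and shift. This is a sketch under that assumption. It runs serially and does not use joblib, so it only illustrates the result, not the parallel execution.

```python
import pandas as pd

df = pd.DataFrame({
    'time': [1, 2, 3, 4, 5, 1, 3, 4, 5],
    'data': [1, 2, 3, 4, 5, 1, 2, 3, 4],
    'category': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
})

# Shift within each category so that values never leak from a into b.
df['data_lag_1'] = df.groupby('category')['data'].shift(1)
print(df)
```

The first row of each category is NaN because the lag is computed independently per category.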

To further support categories, there are two functions for handling categories with different time lengths: pad_different_category_time and remove_different_category_time. The first pads the categories to the same length, while the second removes timestamps that are not shared by all categories.

trans = tst.Time_Series_Transformer(data,'time','category')
trans = trans.pad_different_category_time(fillMissing = np.nan)
print(trans.to_pandas())
   time  data category
0     1   1.0        a
1     2   2.0        a
2     3   3.0        a
3     4   4.0        a
4     5   5.0        a
5     1   1.0        b
6     2   NaN        b
7     3   2.0        b
8     4   3.0        b
9     5   4.0        b
trans = tst.Time_Series_Transformer(data,'time','category')
trans = trans.remove_different_category_time()
print(trans.to_pandas())
   time  data category
0     1     1        a
1     3     3        a
2     4     4        a
3     5     5        a
4     1     1        b
5     3     2        b
6     4     3        b
7     5     4        b
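The two behaviours can be reproduced in plain pandas, which may help when checking results. The sketch below assumes padding means "every category gets every timestamp" and removal means "keep only timestamps present in every category", as the outputs above suggest. It is not the package's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    'time': [1, 2, 3, 4, 5, 1, 3, 4, 5],
    'data': [1, 2, 3, 4, 5, 1, 2, 3, 4],
    'category': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
})

# Pad: reindex onto the full (category, time) grid; missing cells become NaN.
times = sorted(df['time'].unique())
cats = df['category'].unique()
full = pd.MultiIndex.from_product([cats, times], names=['category', 'time'])
padded = df.set_index(['category', 'time']).reindex(full).reset_index()

# Remove: keep only timestamps that occur in every category.
counts = df.groupby('time')['category'].nunique()
common = counts[counts == df['category'].nunique()].index
trimmed = df[df['time'].isin(common)].reset_index(drop=True)

print(padded)
print(trimmed)
```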

IO

IO is a major component of this package. The current version supports pandas DataFrame, numpy ndarray, Apache Arrow Table, Apache Feather, and Apache Parquet. Each of these formats can specify whether to expand category or time on export. This demo shows numpy and pandas. The transformer can also combine the make_label function with the sepLabel parameter of the export functions to separate data and labels.

pandas

data = {
    "time":[1,2,3,4,5,1,3,4,5],
    'data':[1,2,3,4,5,1,2,3,4],
    "category":['a','a','a','a','a','b','b','b','b']
}
df = pd.DataFrame(data)
trans = tst.Time_Series_Transformer.from_pandas(
    pandasFrame = df,
    timeSeriesCol = 'time',
    mainCategoryCol= 'category'
)
trans
data column
-----------
time
data
time length: 5
category: a

data column
-----------
time
data
time length: 4
category: b

main category column: category

To expand the data, all categories must have consistent time indices. Besides the pad and remove functions, the preprocessType parameter can be used to achieve that.

print(trans.to_pandas(
    expandCategory = True,
    expandTime = False,
    preprocessType = 'pad'
))
   time  data_a  data_b
0     1       1     1.0
1     2       2     NaN
2     3       3     2.0
3     4       4     3.0
4     5       5     4.0
print(trans.to_pandas(
    expandCategory = False,
    expandTime = True,
    preprocessType = 'pad'
))
   data_1  data_2  data_3  data_4  data_5 category
0       1     2.0       3       4       5        a
1       1     NaN       2       3       4        b
print(trans.to_pandas(
    expandCategory = True,
    expandTime = True,
    preprocessType = 'pad'
))
   data_a_1  data_b_1  data_a_2  data_b_2  data_a_3  data_b_3  data_a_4  \
0         1       1.0         2       NaN         3       2.0         4   

   data_b_4  data_a_5  data_b_5  
0       3.0         5       4.0  
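The expandCategory layout above resembles a pandas pivot from long to wide format. Here is a minimal sketch under that assumption, using plain pandas and the column naming seen in the output (data_a, data_b).

```python
import pandas as pd

df = pd.DataFrame({
    'time': [1, 2, 3, 4, 5, 1, 3, 4, 5],
    'data': [1, 2, 3, 4, 5, 1, 2, 3, 4],
    'category': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
})

# One row per timestamp, one column per category.
# The missing (b, time=2) cell is filled with NaN, similar to 'pad'.
wide = df.pivot(index='time', columns='category', values='data')
wide.columns = [f'data_{c}' for c in wide.columns]
wide = wide.reset_index()
print(wide)
```

Expanding time instead of category would pivot the other way, with one row per category and one column per timestamp.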

The make_label function can be used together with the sepLabel parameter to separate X and y for machine learning.

trans = trans.make_lead('data',leadNum = 1,suffix = '_lead_')
trans = trans.make_label("data_lead_1")
data, label = trans.to_pandas(
    expandCategory = False,
    expandTime = False,
    preprocessType = 'pad',
    sepLabel = True
)
print(data)
   time  data category
0     1   1.0        a
1     2   2.0        a
2     3   3.0        a
3     4   4.0        a
4     5   5.0        a
5     1   1.0        b
6     2   NaN        b
7     3   2.0        b
8     4   3.0        b
9     5   4.0        b
print(label)
   data_lead_1
0          2.0
1          3.0
2          4.0
3          5.0
4          NaN
5          2.0
6          NaN
7          3.0
8          4.0
9          NaN
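As a rough plain-pandas counterpart, the label separation can be pictured as computing the lead column per category and then splitting it off from the features. This sketch skips the padding step, so it has nine rows instead of the ten shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'time': [1, 2, 3, 4, 5, 1, 3, 4, 5],
    'data': [1, 2, 3, 4, 5, 1, 2, 3, 4],
    'category': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
})

# The next value within each category becomes the label.
df['data_lead_1'] = df.groupby('category')['data'].shift(-1)

# Split into features (X) and label (y).
X = df.drop(columns=['data_lead_1'])
y = df[['data_lead_1']]
print(X)
print(y)
```

Rows whose label is NaN (the last row of each category) usually have to be dropped before training.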

numpy

Since a numpy array has no column names, columns must be specified by index number.

data = {
    "time":[1,2,3,4,5,1,3,4,5],
    'data':[1,2,3,4,5,1,2,3,4],
    "category":['a','a','a','a','a','b','b','b','b']
}
npArray = pd.DataFrame(data).values
trans = tst.Time_Series_Transformer.from_numpy(
    numpyData= npArray,
    timeSeriesCol = 0,
    mainCategoryCol = 2)
trans
data column
-----------
0
1
time length: 5
category: a

data column
-----------
0
1
time length: 4
category: b

main category column: 2
trans = trans.make_lead(1,leadNum = 1,suffix = '_lead_')
trans = trans.make_label("1_lead_1")
X,y = trans.to_pandas(
    expandCategory = False,
    expandTime = False,
    preprocessType = 'pad',
    sepLabel = True
)
print(X)
   0    1  2
0  1  1.0  a
1  2  2.0  a
2  3  3.0  a
3  4  4.0  a
4  5  5.0  a
5  1  1.0  b
6  2  NaN  b
7  3  2.0  b
8  4  3.0  b
9  5  4.0  b
print(y)
   1_lead_1
0       2.0
1       3.0
2       4.0
3       5.0
4       NaN
5       2.0
6       NaN
7       3.0
8       4.0
9       NaN
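The index numbers used above (0 for time, 1 for data, 2 for category) follow the column order of the DataFrame that the array was created from. A short sketch in plain pandas and numpy shows how to check that mapping before calling from_numpy.

```python
import pandas as pd

data = {
    'time': [1, 2, 3, 4, 5, 1, 3, 4, 5],
    'data': [1, 2, 3, 4, 5, 1, 2, 3, 4],
    'category': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
}
frame = pd.DataFrame(data)
arr = frame.values  # mixed int and str columns give a generic object array

# Map each column name to its positional index in the array.
positions = {name: i for i, name in enumerate(frame.columns)}
print(positions)
print(arr.shape, arr.dtype)

# The category column is reachable only by position.
categories = sorted(set(arr[:, positions['category']]))
print(categories)
```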

Stock_Transformer

Stock_Transformer is a subclass of Time_Series_Transformer, so all the functions demonstrated for Time_Series_Transformer can be used with Stock_Transformer. The difference is that Stock_Transformer requires the High, Low, Open, Close, and Volume columns to be specified. In addition, it implements pandas-ta strategies for creating technical indicators (https://github.com/twopirllc/pandas-ta). Moreover, the IO class for Stock_Transformer supports yfinance and investpy, so data can be extracted directly from these APIs.

Create technical indicators

stock = tst.Stock_Transformer.from_stock_engine_period(
    symbols = 'GOOGL',period ='1y', engine = 'yahoo'
)
stock
data column
-----------
Date
Open
High
Low
Close
Volume
Dividends
Stock Splits
time length: 253
category: None
import pandas_ta as ta
MyStrategy = ta.Strategy(
    name="DCSMA10",
    ta=[
        {"kind": "ohlc4"},
        {"kind": "sma", "length": 10},
        {"kind": "donchian", "lower_length": 10, "upper_length": 15},
        {"kind": "ema", "close": "OHLC4", "length": 10, "suffix": "OHLC4"},
    ]
)
stock = stock.get_technial_indicator(MyStrategy)
print(stock.to_pandas().head())
         Date         Open         High          Low        Close   Volume  \
0  2020-01-06  1351.630005  1398.319946  1351.000000  1397.810059  2338400   
1  2020-01-07  1400.459961  1403.500000  1391.560059  1395.109985  1716500   
2  2020-01-08  1394.819946  1411.849976  1392.630005  1405.040039  1765700   
3  2020-01-09  1421.930054  1428.680054  1410.209961  1419.790039  1660000   
4  2020-01-10  1429.469971  1434.939941  1419.599976  1428.959961  1312900   

   Dividends  Stock Splits        OHLC4  SMA_10  DCL_10_15  DCM_10_15  \
0          0             0  1374.690002     NaN        NaN        NaN   
1          0             0  1397.657501     NaN        NaN        NaN   
2          0             0  1401.084991     NaN        NaN        NaN   
3          0             0  1420.152527     NaN        NaN        NaN   
4          0             0  1428.242462     NaN        NaN        NaN   

   DCU_10_15  EMA_10_OHLC4  
0        NaN           NaN  
1        NaN           NaN  
2        NaN           NaN  
3        NaN           NaN  
4        NaN           NaN  
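The NaN values in the first rows are expected for windowed indicators. The sketch below recomputes two of the columns in plain pandas from the five rows printed above. It assumes OHLC4 is the average of open, high, low, and close, and that SMA_10 is a 10-period simple moving average of the close. It does not call pandas-ta.

```python
import pandas as pd

# The first five rows from the output above.
prices = pd.DataFrame({
    'Open':  [1351.630005, 1400.459961, 1394.819946, 1421.930054, 1429.469971],
    'High':  [1398.319946, 1403.500000, 1411.849976, 1428.680054, 1434.939941],
    'Low':   [1351.000000, 1391.560059, 1392.630005, 1410.209961, 1419.599976],
    'Close': [1397.810059, 1395.109985, 1405.040039, 1419.790039, 1428.959961],
})

# OHLC4: the average of the four price columns for each row.
prices['OHLC4'] = prices[['Open', 'High', 'Low', 'Close']].mean(axis=1)

# A 10-period moving average needs 10 rows before it yields a value,
# so with only five rows every entry is NaN.
prices['SMA_10'] = prices['Close'].rolling(window=10).mean()
print(prices[['OHLC4', 'SMA_10']])
```

Calling head() on the full result therefore shows NaN for the windowed indicators, while later rows hold values.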

For more usage examples, please visit our gallery.
