Lightweight, fast, and robust columnar dataframe for data analytics with online updates

Last update: May 19, 2022

Overview

streamdf

Streamdf is a lightweight dataframe library built on top of a dictionary of NumPy arrays, developed for Kaggle's time-series code competitions.

Key Features

Fast and robust insertion
- Rows can be inserted in amortized constant time (much faster than np.append)
- Automatically falls back to the default value when an abnormal value is inserted
Time-travel
- Get a past state of the data as a slice of the original dataframe, without copying
Null/empty-safe aggregations
- Provides a set of aggregation methods that can be called safely when a column contains NaN or is empty
Columnar layout
- Internal data is stored in a simple columnar format, which is easier to use for analysis than NumPy's structured arrays
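The insertion and aggregation features above can be illustrated with plain NumPy. The following is only a minimal sketch of the general techniques (capacity doubling, fallback to a default, NaN-aware aggregation), not streamdf's actual implementation; `GrowableColumn` and `safe_mean` are hypothetical names introduced here for illustration.

```python
import numpy as np


class GrowableColumn:
    """Hypothetical append-only column (not part of streamdf's API)."""

    def __init__(self, dtype, default, capacity=4):
        self._buf = np.empty(capacity, dtype=dtype)
        self._default = default
        self._size = 0

    def append(self, value):
        # Double the buffer when it is full, so n appends cost O(n) in total.
        # np.append, by contrast, copies the whole array on every call.
        if self._size == len(self._buf):
            grown = np.empty(2 * len(self._buf), dtype=self._buf.dtype)
            grown[:self._size] = self._buf[:self._size]
            self._buf = grown
        try:
            self._buf[self._size] = value
        except (TypeError, ValueError, OverflowError):
            # Fall back to the default when the value cannot be stored.
            self._buf[self._size] = self._default
        self._size += 1

    @property
    def values(self):
        # Basic slicing returns a view, so no data is copied here.
        return self._buf[:self._size]


def safe_mean(values, default=np.nan):
    """Mean that ignores NaN and returns `default` for empty input."""
    values = np.asarray(values, dtype=float)
    valid = values[~np.isnan(values)]
    return float(valid.mean()) if valid.size else default


col = GrowableColumn(np.int64, default=-1, capacity=2)
for v in [1, None, 3]:  # None cannot be stored in an int64 column
    col.append(v)
print(col.values)
print(safe_mean([1.0, float("nan"), 3.0]))
```

Keeping one such buffer per column in a dictionary gives the columnar layout described above: each column is a contiguous array that can be passed directly to NumPy functions.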

Example

import pandas as pd
from streamdf import StreamDf

df = pd.read_csv('test.csv')
sdf = StreamDf.from_pandas(df)

# extend: append one row, given as a column-to-value dict
sdf.extend({
    'x': 1,
    'y': 2
})

assert len(sdf) == len(df) + 1

# access a column by name
print(sdf['x'])

# aggregate: fetch the last value of column 'x'
sdf.last_value('x')

import numpy as np
from streamdf import StreamDf

# create an empty StreamDf from a column-to-dtype mapping
sdf = StreamDf.empty({'x': np.int32, 'time': 'datetime64[D]'}, 'time')

sdf.extend({'x': 1, 'time': np.datetime64('2018-01-01')})
sdf.extend({'x': 5, 'time': np.datetime64('2018-02-01')})
sdf.extend({'x': 3, 'time': np.datetime64('2018-02-03')})

assert len(sdf) == 3

# Time travel (zero copy)
sliced = sdf.slice_until(np.datetime64('2018-02-02'))

assert len(sliced) == 2
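One way a zero-copy time slice like the one above could work is a binary search on a sorted time column followed by basic slicing, which yields views rather than copies. The sketch below assumes the time column is sorted in ascending order and treats the bound as exclusive; `slice_until_sketch` is a hypothetical helper, and the source does not state how streamdf's `slice_until` handles the boundary internally.

```python
import numpy as np


def slice_until_sketch(columns, time_col, until):
    """Return views of all rows whose time is strictly before `until`.

    Hypothetical helper for illustration; assumes columns[time_col]
    is sorted in ascending order.
    """
    # Binary search for the first row with time >= until.
    end = int(np.searchsorted(columns[time_col], until, side="left"))
    # Basic slicing produces views that share memory with the originals.
    return {name: arr[:end] for name, arr in columns.items()}


columns = {
    "x": np.array([1, 5, 3], dtype=np.int32),
    "time": np.array(["2018-01-01", "2018-02-01", "2018-02-03"],
                     dtype="datetime64[D]"),
}
past = slice_until_sketch(columns, "time", np.datetime64("2018-02-02"))
assert len(past["x"]) == 2
assert np.shares_memory(past["x"], columns["x"])
```

Because the result shares memory with the original arrays, writing to a slice would also change the source data, so such slices are best treated as read-only.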
