Universal 1d/2d data containers with Transformers functionality for data analysis.

Last update: Mar 14, 2022

Overview

XPandas (extended Pandas) implements 1D and 2D data containers for storing type-heterogeneous tabular data of any type, and encapsulates feature extraction and transformation modelling in an sklearn-compatible transformer interface.
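
To make the idea of a transformer attached to a column concrete, here is a minimal sketch in plain pandas. It does not use the XPandas API: the class `SeriesMeanTransformer` and the example frame are hypothetical, and the class only mimics the scikit-learn `fit`/`transform` convention by hand.

```python
import pandas as pd


class SeriesMeanTransformer:
    """Hypothetical transformer following the fit/transform convention."""

    def fit(self, X, y=None):
        # Stateless feature extractor: there is nothing to learn.
        return self

    def transform(self, X):
        # X is a pandas Series whose cells are sequences of numbers;
        # each cell is reduced to its arithmetic mean.
        return X.apply(lambda cell: sum(cell) / len(cell))

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# A frame mixing a plain string column with a column of variable-length sequences.
frame = pd.DataFrame({
    "label": ["a", "b"],
    "signal": [[1.0, 2.0, 3.0], [10.0, 20.0]],
})

features = SeriesMeanTransformer().fit_transform(frame["signal"])
print(features.tolist())
```

The container holds the heterogeneous cells, and the transformer turns one such column into an ordinary numeric feature column.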

Quickstart

Install the latest version

$ pip install xpandas

and run the example jupyter notebook

$ jupyter notebook examples/ExampleUsage.ipynb

Documentation

The full documentation is available at https://alan-turing-institute.github.io/xpandas/.

Acknowledgements

Bernd Bischl (@berndbischl), who mentioned the idea of a general data container with transformers attached to columns in personal discussion with Franz Kiraly during a London visit in 2016.
Franz Kiraly (@fkiraly), who initiated and funded the project up to release, and who substantially contributed to the API design.
Haoran Xue (@HaoranXue), who, under the supervision of Franz Kiraly, earlier completed a thesis for a degree at UCL on the topic, and who wrote a similar package as part of it. No code was re-used in the creation of the XPandas package.

List of developers and contributors

Comments

Acknowledgments

Before I forget: somewhere prominently the following should be acknowledged in some form, not necessarily in the below:

Bernd Bischl, who mentioned the idea of a general data container with transformers attached to columns in personal discussion during a London visit in 2016. Myself, having (in my opinion) substantially contributed through the API design (?). Haoran Xue, who completed a thesis on the topic earlier. While no code was transferred, lessons that were learnt may have been transferred.

opened by fkiraly 2
Improved documentation

This pull request improves the readability of the documentation.

While going through your codebase, I realised that there's a lot of redundancy in the module naming, e.g. /transformers/transformers/series_transformers/series_transformer.py instead of /transformers/series/series_transformer.py. Is there any specific reason for that? If not, I'd suggest refactoring the modules into a more straightforward naming structure.

opened by frthjf 1
sensible default for transformation: column replacement

currently it adds the transformer output while retaining the original column

for retaining original column: use identityTransformer (to be implemented)
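
The two behaviours discussed here can be sketched in plain pandas. This is only an illustration under stated assumptions, not XPandas code: the helper `apply_transformer`, its `replace` flag, and the `_transformed` suffix are all hypothetical names.

```python
import pandas as pd


def text_length(value):
    # Toy transformation: map a string to its length.
    return len(value)


def apply_transformer(frame, column, func, replace=True):
    """Hypothetical helper: transform one column, replacing or keeping the original."""
    out = frame.copy()
    transformed = out[column].apply(func)
    if replace:
        # Proposed default: the output takes the place of the input column.
        out[column] = transformed
    else:
        # Current behaviour as described: the output is added next to the original.
        out[column + "_transformed"] = transformed
    return out


frame = pd.DataFrame({"text": ["ab", "cde"], "value": [1, 2]})
replaced = apply_transformer(frame, "text", text_length)
retained = apply_transformer(frame, "text", text_length, replace=False)

print(list(replaced.columns))
print(list(retained.columns))
```

With replacement as the default, keeping the original column becomes an explicit choice, which is the role the proposed identity transformer would play.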

opened by fkiraly 0
tutorial: separate data container from transformer tutorial

Structure should be changed to: (1) data container (XSeries and XDataFrame) (2) transformer functionality

since the user should be made aware that (1) is a separate interface concept, on top of which (2) may be invoked, but the two aren't necessarily tied together

opened by fkiraly 0
Bump numpy from 1.15.2 to 1.22.0
Bumps numpy from 1.15.2 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.

A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.

NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.

New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.

A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)
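
For code that still relied on the removed functions, the replacements named in the note can be sketched as follows. The array contents and the inline CSV text are made up for illustration; `pickle.loads` and `numpy.genfromtxt` with `usemask` are the calls the release note recommends.

```python
import io
import pickle

import numpy as np

# Replacement for the removed numpy.loads: use pickle directly.
original = np.arange(6).reshape(2, 3)
restored = pickle.loads(pickle.dumps(original))

# Replacement for the removed ndfromtxt / mafromtxt: genfromtxt, with
# usemask=True to get a masked array when values are missing.
text = io.StringIO("1,2,3\n4,,6")
masked = np.genfromtxt(text, delimiter=",", usemask=True)

print(np.array_equal(original, restored))
print(masked.mask.tolist())
```

The empty field in the second row ends up masked rather than silently filled, so reductions such as `masked.mean()` skip it.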

... (truncated)

Commits

4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release

fd66547 REL: Prepare for the NumPy 1.22.0 release.

125304b wip

c283859 Merge pull request #20682 from charris/backport-20416

5399c03 Merge pull request #20681 from charris/backport-20954

f9c45f8 Merge pull request #20680 from charris/backport-20663

794b36f Update armccompiler.py

d93b14e Update test_public_api.py

7662c07 Update init.py

311ab52 Update armccompiler.py

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 0
Bump ipython from 7.0.1 to 7.16.3
Bumps ipython from 7.0.1 to 7.16.3.

Commits

d43c7c7 release 7.16.3

5fa1e40 Merge pull request from GHSA-pq7m-3gw7-gq5x

8df8971 back to dev

9f477b7 release 7.16.2

138f266 bring back release helper from master branch

5aa3634 Merge pull request #13341 from meeseeksmachine/auto-backport-of-pr-13335-on-7...

bcae8e0 Backport PR #13335: What's new 7.16.2

8fcdcd3 Pin Jedi to <0.17.2.

2486838 release 7.16.1

20bdc6f fix conda build

Additional commits viewable in compare view


dependencies
opened by dependabot[bot] 0
will it work for multivariate time series prediction both regression and classification
Great code, thanks. Could you clarify: will it work for multivariate time series prediction, both regression and classification?

1. Where all values are continuous:

   weight  height  age  target
1  56      160     34   1.2
2  77      170     54   3.5
3  87      167     43   0.7
4  55      198     72   0.5
5  88      176     32   2.3

2. Or even where the values are a mixture of continuous and categorical, for example 2 dimensions with continuous values and 3 dimensions with categorical values:

   color   weight  gender  height  age  target
1  black   56      m       160     34   yes
2  white   77      f       170     54   no
3  yellow  87      m       167     43   yes
4  white   55      m       198     72   no
5  white   88      f       176     32   yes
opened by Sandy4321 0

Many standard methods do not work (properly) on XDataFrame with hierarchical data

# loading some time-series data
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
from xpandas.data_container import XSeries, XDataFrame
import numpy as np
import pandas as pd

def read_data(file):
    data = file.readlines()
    rows = [row.decode('utf-8').strip().split('  ') for row in data]
    X = pd.DataFrame(rows, dtype=float)
    y = X.pop(0)
    ts = XSeries([row for _, row in X.iterrows()])
    X = XDataFrame({'ts1': ts, 'ts2': ts})
    return X, y

url = 'http://www.timeseriesclassification.com/Downloads/GunPoint.zip'
url = urlopen(url)
zipfile = ZipFile(BytesIO(url.read()))
file = zipfile.open('GunPoint_TRAIN.txt')
X, y = read_data(file)

X.mean() # returns an empty series rather than the mean of each series; the same happens for many other methods such as .std(), .median(), etc.

X.apply(np.mean) # breaks

X['ts1'].mean() # breaks

X['ts1'].apply(np.mean) # works

X['ts1'].apply(np.percentile, args=(25,)) # breaks, does not pass on args
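
For comparison, this is how the same per-cell operations behave on a plain pandas frame whose cells hold sequences. It is a sketch of the expected results, not a fix for XDataFrame, and the small `nested` series is made-up data standing in for the GunPoint rows.

```python
import numpy as np
import pandas as pd

# Each cell holds one short time series (made-up values).
nested = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
frame = pd.DataFrame({"ts1": nested, "ts2": nested})

# Per-cell mean of one column.
per_cell_mean = frame["ts1"].apply(np.mean)

# Extra positional arguments are forwarded through `args`.
per_cell_q25 = frame["ts1"].apply(np.percentile, args=(25,))

# Frame-level equivalent: apply the per-cell function column by column.
frame_means = frame.apply(lambda column: column.apply(np.mean))

print(per_cell_mean.tolist())
print(per_cell_q25.tolist())
```

Applying the function cell by cell, column by column, is one way to get frame-level statistics when the built-in reductions do not understand nested cells.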

opened by mloning 0

Slicing single row of XDataFrame does not work

Slicing a single row of an XDataFrame does not work, probably because the lookup tries to return the row as a series, which fails because the column types are heterogeneous. Instead, one may want to return an XDataFrame with a single row.

import pandas as pd
from xpandas.data_container import XDataFrame

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.iloc[0] # works

irisx = XDataFrame(iris) 
irisx.iloc[0] # breaks

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-b46ce52f9af0> in <module>
----> 1 irisx.iloc[0]

~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1476 
   1477             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1478             return self._getitem_axis(maybe_callable, axis=axis)
   1479 
   1480     def _is_scalar_access(self, key):

~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   2102             self._validate_integer(key, axis)
   2103 
-> 2104             return self._get_loc(key, axis=axis)
   2105 
   2106     def _convert_to_indexer(self, obj, axis=None, is_setter=False):

~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/indexing.py in _get_loc(self, key, axis)
    143         if axis is None:
    144             axis = self.axis
--> 145         return self.obj._ixs(key, axis=axis)
    146 
    147     def _slice(self, obj, axis=None, kind=None):

~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/frame.py in _ixs(self, i, axis)
   2624                                                       index=self.columns,
   2625                                                       name=self.index[i],
-> 2626                                                       dtype=new_values.dtype)
   2627                 result._set_is_copy(self, copy=copy)
   2628                 return result

~/.conda/envs/sktime/lib/python3.7/site-packages/xpandas/data_container/data_container.py in __init__(self, *args, **kwargs)
     71         check_result, data_type = _check_all_elements_have_the_same_property(data, type)
     72         if not check_result:
---> 73             raise ValueError('Not all elements the same type')
     74 
     75         if data_type is not None:

ValueError: Not all elements the same type
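
In plain pandas, indexing with a one-element list already returns a one-row frame instead of a series, which matches the behaviour suggested above. The sketch below uses a small made-up frame and plain `pandas.DataFrame`; whether the same call succeeds on an XDataFrame is an assumption that has not been checked here.

```python
import pandas as pd

# Small stand-in for the iris data, with mixed column types.
frame = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "species": ["setosa", "setosa"],
})

# Scalar position: the row is collapsed into a single Series of mixed values.
row_as_series = frame.iloc[0]

# List of positions: the result stays a DataFrame and keeps per-column dtypes.
row_as_frame = frame.iloc[[0]]

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(row_as_frame.shape)
```

Keeping the row as a frame avoids forcing heterogeneous values into a single typed series.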

opened by mloning 0

Releases (1.0.2)

1.0.2 (Oct 23, 2017)

Please refer to documentation and tutorial.

Owner

The Alan Turing Institute

The UK's national institute for data science and artificial intelligence.

GitHub Repository https://alan-turing-institute.github.io/xpandas/
