Data exploration done quick.

Overview

Pandas Tab

Implementation of Stata's tabulate command in Pandas for extremely easy to type one-way and two-way tabulations.

Support:

  • Python 3.7 and 3.8: Pandas >=0.23.x
  • Python 3.9: Pandas >=1.0.x

Background & Purpose

As someone who made the move from Stata to Python, one thing I noticed is that I end up doing fewer tabulations of my data when working in Pandas. I believe that the reason for this has a lot to do with API differences that make it slightly less convenient to run tabulations extremely quickly.

For example, if you want to look at values counts in column "foo", in Stata it's merely tab foo. In Pandas, it's df["foo"].value_counts(). This is over twice the amount of typing.

It's not just a brevity issue. If you want to add one more column and to go from one-way to two-way tabulation (e.g. look at "foo" and "bar" together), this isn't as simple as adding one more column:

  • df[["foo", "bar"]].value_counts().unstack() requires one additional transformation to move away from a multi-indexed series.
  • pd.crosstab(df["foo"], df["bar"]) is a totally different interface from the one-way tabulation.

Pandas Tab attempts to solve these issues by creating an interface more similar to Stata: df.tab("foo") and df.tab("foo", "bar") give you, respectively, your one-way and two-way tabulations.

Example

# using IPython integration:
# ! pip install pandas-tab[full]
# ! pandas_tab init

import pandas as pd

df = pd.DataFrame({
    "foo":  ["a", "a", "b", "a", "b", "c", "a"],
    "bar":  [4,   5,   7,   6,   7,   7,   5],
    "fizz": [12,  63,  23,  36,  21,  28,  42]
})

# One-way tabulation
df.tab("foo")

# Two-way tabulation
df.tab("foo", "bar")

# One-way with aggregation
df.tab("foo", values="fizz", aggfunc=pd.Series.mean)

# Two-way with aggregation
df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean)

Outputs:

>> # Two-way tabulation >>> df.tab("foo", "bar") bar 4 5 6 7 foo a 1 2 1 0 b 0 0 0 2 c 0 0 0 1 >>> # One-way with aggregation >>> df.tab("foo", values="fizz", aggfunc=pd.Series.mean) mean foo a 38.25 b 22.00 c 28.00 >>> # Two-way with aggregation >>> df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean) bar 4 5 6 7 foo a 12.0 52.5 36.0 NaN b NaN NaN NaN 22.0 c NaN NaN NaN 28.0 ">
>>> # One-way tabulation
>>> df.tab("foo")

     size  percent
foo               
a       4    57.14
b       2    28.57
c       1    14.29

>>> # Two-way tabulation
>>> df.tab("foo", "bar")

bar  4  5  6  7
foo            
a    1  2  1  0
b    0  0  0  2
c    0  0  0  1

>>> # One-way with aggregation
>>> df.tab("foo", values="fizz", aggfunc=pd.Series.mean)

      mean
foo       
a    38.25
b    22.00
c    28.00

>>> # Two-way with aggregation
>>> df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean)

bar     4     5     6     7
foo                        
a    12.0  52.5  36.0   NaN
b     NaN   NaN   NaN  22.0
c     NaN   NaN   NaN  28.0

Setup

Full Installation (IPython / Jupyter Integration)

The full installation includes a CLI that adds a startup script to IPython:

pip install pandas-tab[full]

Then, to enable the IPython / Jupyter startup script:

pandas_tab init

You can quickly remove the startup script as well:

pandas_tab delete

More on the startup script in the section IPython / Jupyter Integration.

Simple installation:

If you don't want the startup script, you don't need the extra dependencies. Simply install with:

pip install pandas-tab

IPython / Jupyter Integration

The startup script auto-loads pandas_tab each time you load up a new IPython kernel (i.e. each time you fire up or restart your Jupyter Notebook).

You can run the startup script in your terminal with pandas_tab init.

Without the startup script:

# WITHOUT STARTUP SCRIPT
import pandas as pd
import pandas_tab

df = pd.read_csv("foo.csv")
df.tab("x", "y")

Once you install the startup script, you don't need to do import pandas_tab:

# WITH PANDAS_TAB STARTUP SCRIPT INSTALLED
import pandas as pd

df = pd.read_csv("foo.csv")
df.tab("x", "y")

The IPython startup script is convenient, but there are some downsides to using and relying on it:

  • It needs to load Pandas in the background each time the kernel starts up. For typical data science workflows, this should not be a problem, but you may not want this if your workflows ever avoid Pandas.
  • The IPython integration relies on hidden state that is environment-dependent. People collaborating with you may be unable to replicate your Jupyter notebooks if there are any df.tab()'s in there and you don't import pandas_tab manually.

For that reason, I recommend the IPython integration for solo exploratory analysis, but for collaboration you should still import pandas_tab in your notebook.

Limitations / Known Issues

  • No tests or guarantees for 3+ way cross tabulations. Both pd.crosstab and pd.Series.value_counts support multi-indexing, however this behavior is not yet tested for pandas_tab.
  • Behavior for dropna kwarg mimics pd.crosstab (drops blank columns), not pd.value_counts (include NaN/None in the index), even for one-way tabulations.
  • No automatic hook into Pandas; you must import pandas_tab in your code to register the extensions. Pandas does not currently search entry points for extensions, other than for plotting backends, so it's not clear that there's a clean way around this.
  • Does not mimic Stata's behavior of taking unambiguous abbreviations of column names, and there is no option to turn this on/off.
  • Pandas 0.x is incompatible with Numpy 1.20.x. If using Pandas 0.x, you need Numpy 1.19.x.
  • (Add more stuff here?)
Owner
W.D.
memes
W.D.
MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

SisonkeBiotik 6 Nov 30, 2022
Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database

Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database, using a set of "harvesters", whose job it

Battery Intelligence Lab 20 Sep 28, 2022
First steps with Python in Life Sciences

First steps with Python in Life Sciences This course material is part of the "First Steps with Python in Life Science" three-day course of SIB-trainin

SIB Swiss Institute of Bioinformatics 22 Jan 08, 2023
PySpark Structured Streaming ROS Kafka ApacheSpark Cassandra

PySpark-Structured-Streaming-ROS-Kafka-ApacheSpark-Cassandra The purpose of this project is to demonstrate a structured streaming pipeline with Apache

Zekeriyya Demirci 5 Nov 13, 2022
Pipeline to convert a haploid assembly into diploid

HapDup (haplotype duplicator) is a pipeline to convert a haploid long read assembly into a dual diploid assembly. The reconstructed haplotypes

Mikhail Kolmogorov 50 Jan 05, 2023
Approximate Nearest Neighbor Search for Sparse Data in Python!

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Meta Research 906 Jan 01, 2023
Open source platform for Data Science Management automation

Hydrosphere examples This repo contains demo scenarios and pre-trained models to show Hydrosphere capabilities. Data and artifacts management Some mod

hydrosphere.io 6 Aug 10, 2021
Repositori untuk menyimpan material Long Course STMKGxHMGI tentang Geophysical Python for Seismic Data Analysis

Long Course "Geophysical Python for Seismic Data Analysis" Instruktur: Dr.rer.nat. Wiwit Suryanto, M.Si Dipersiapkan oleh: Anang Sahroni Waktu: Sesi 1

Anang Sahroni 0 Dec 04, 2021
Retail-Sim is python package to easily create synthetic dataset of retaile store.

Retailer's Sale Data Simulation Retail-Sim is python package to easily create synthetic dataset of retaile store. Simulation Model Simulator consists

Corca AI 7 Sep 30, 2022
Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Damien Farrell 81 Dec 26, 2022
OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase working capital.

Overview OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase

Tom 3 Feb 12, 2022
Python for Data Analysis, 2nd Edition

Python for Data Analysis, 2nd Edition Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media Buy

Wes McKinney 18.6k Jan 08, 2023
Finding project directories in Python (data science) projects, just like there R rprojroot and here packages

Find relative paths from a project root directory Finding project directories in Python (data science) projects, just like there R here and rprojroot

Daniel Chen 102 Nov 16, 2022
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Mohammed Hassan 13 Mar 31, 2022
Stitch together Nanopore tiled amplicon data without polishing a reference

Stitch together Nanopore tiled amplicon data using a reference guided approach Tiled amplicon data, like those produced from primers designed with pri

Amanda Warr 14 Aug 30, 2022
A probabilistic programming library for Bayesian deep learning, generative models, based on Tensorflow

ZhuSuan is a Python probabilistic programming library for Bayesian deep learning, which conjoins the complimentary advantages of Bayesian methods and

Tsinghua Machine Learning Group 2.2k Dec 28, 2022
Data processing with Pandas.

Processing-data-with-python This is a simple example showing how to use Pandas to create a dataframe and the processing data with python. The jupyter

1 Jan 23, 2022
Binance Kline Data With Python

Binance Kline Data by seunghan(gingerthorp) reference https://github.com/binance/binance-public-data/ All intervals are supported: 1m, 3m, 5m, 15m, 30

shquant 5 Jul 13, 2022
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
A columnar data container that can be compressed.

Unmaintained Package Notice Unfortunately, and due to lack of resources, the Blosc Development Team is unable to maintain this package anymore. During

944 Dec 09, 2022