Fast 1D and 2D histogram functions in Python

Overview

Azure Status asv

About

Sometimes you just want to compute simple 1D or 2D histograms with regular bins. Fast. No nonsense. Numpy's histogram functions are versatile, and can handle for example non-regular binning, but this versatility comes at the expense of performance.

The fast-histogram mini-package aims to provide simple and fast histogram functions for regular bins that don't compromise on performance. It doesn't do anything complicated - it just implements a simple histogram algorithm in C and keeps it simple. The aim is to have functions that are fast but also robust and reliable. The result is a 1D histogram function here that is 7-15x faster than numpy.histogram, and a 2D histogram function that is 20-25x faster than numpy.histogram2d.

To install:

pip install fast-histogram

or if you use conda you can instead do:

conda install -c conda-forge fast-histogram

The fast_histogram module then provides two functions: histogram1d and histogram2d:

from fast_histogram import histogram1d, histogram2d

Example

Here's an example of binning 10 million points into a regular 2D histogram:

In [1]: import numpy as np

In [2]: x = np.random.random(10_000_000)

In [3]: y = np.random.random(10_000_000)

In [4]: %timeit _ = np.histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
935 ms ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from fast_histogram import histogram2d

In [6]: %timeit _ = histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
40.2 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(note that 10_000_000 is possible in Python 3.6 syntax, use 10000000 instead in previous versions)

The version here is over 20 times faster! The following plot shows the speedup as a function of array size for the bin parameters shown above:

Comparison of performance between Numpy and fast-histogram

as well as results for the 1D case, also with 30 bins. The speedup for the 2D case is consistently between 20-25x, and for the 1D case goes from 15x for small arrays to around 7x for large arrays.

Q&A

Why don't the histogram functions return the edges?

Computing and returning the edges may seem trivial but it can slow things down by a factor of a few when computing histograms of 10^5 or fewer elements, so not returning the edges is a deliberate decision related to performance. You can easily compute the edges yourself if needed though, using numpy.linspace.

Doesn't package X already do this, but better?

This may very well be the case! If this duplicates another package, or if it is possible to use Numpy in a smarter way to get the same performance gains, please open an issue and I'll consider deprecating this package :)

One package that does include fast histogram functions (including in n-dimensions) and can compute other statistics is vaex, so take a look there if you need more advanced functionality!

Are the 2D histograms not transposed compared to what they should be?

There is technically no 'right' and 'wrong' orientation - here we adopt the convention which gives results consistent with Numpy, so:

numpy.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

should give the same result as:

fast_histogram.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

Why not contribute this to Numpy directly?

As mentioned above, the Numpy functions are much more versatile, so they could not be replaced by the ones here. One option would be to check in Numpy's functions for cases that are simple and dispatch to functions such as the ones here, or add dedicated functions for regular binning. I hope we can get this in Numpy in some form or another eventually, but for now, the aim is to have this available to packages that need to support a range of Numpy versions.

Why not use Cython?

I originally implemented this in Cython, but found that I could get a 50% performance improvement by going straight to a C extension.

What about using Numba?

I specifically want to keep this package as easy as possible to install, and while Numba is a great package, it is not trivial to install outside of Anaconda.

Could this be parallelized?

This may benefit from parallelization under certain circumstances. The easiest solution might be to use OpenMP, but this won't work on all platforms, so it would need to be made optional.

Couldn't you make it faster by using the GPU?

Almost certainly, though the aim here is to have an easily installable and portable package, and introducing GPUs is going to affect both of these.

Why make a package specifically for this? This is a tiny amount of functionality

Packages that need this could simply bundle their own C extension or Cython code to do this, but the main motivation for releasing this as a mini-package is to avoid making pure-Python packages into packages that require compilation just because of the need to compute fast histograms.

Can I contribute?

Yes please! This is not meant to be a finished package, and I welcome pull request to improve things.

Owner
Thomas Robitaille
Thomas Robitaille
Function Plotter: a simple application with GUI to plot mathematical functions

Function-Plotter Function Plotter is a simple application with GUI to plot mathe

Mohamed Nabawe 4 Jan 03, 2022
An adaptable Snakemake workflow which uses GATKs best practice recommendations to perform germline mutation calling starting with BAM files

Germline Mutation Calling This Snakemake workflow follows the GATK best-practice recommandations to call small germline variants. The pipeline require

12 Dec 24, 2022
YOPO is an interactive dashboard which generates various standard plots.

YOPO is an interactive dashboard which generates various standard plots.you can create various graphs and charts with a click of a button. This tool uses Dash and Flask in backend.

ADARSH C 38 Dec 20, 2022
Data Visualization Guide for Presentations, Reports, and Dashboards

This is a highly practical and example-based guide on visually representing data in reports and dashboards.

Anton Zhiyanov 395 Dec 29, 2022
High-level geospatial data visualization library for Python.

geoplot: geospatial data visualization geoplot is a high-level Python geospatial plotting library. It's an extension to cartopy and matplotlib which m

Aleksey Bilogur 1k Jan 01, 2023
IPython/Jupyter notebook module for Vega and Vega-Lite

IPython Vega IPython/Jupyter notebook module for Vega 5, and Vega-Lite 4. Notebooks with embedded visualizations can be viewed on GitHub and nbviewer.

Vega 335 Nov 29, 2022
nvitop, an interactive NVIDIA-GPU process viewer, the one-stop solution for GPU process management

An interactive NVIDIA-GPU process viewer, the one-stop solution for GPU process management.

Xuehai Pan 1.3k Jan 02, 2023
Peloton Stats to Google Sheets with Data Visualization through Seaborn and Plotly

Peloton Stats to Google Sheets with Data Visualization through Seaborn and Plotly Problem: 2 peloton users were looking for a way to track their metri

9 Jul 22, 2022
GDSHelpers is an open-source package for automatized pattern generation for nano-structuring.

GDSHelpers GDSHelpers in an open-source package for automatized pattern generation for nano-structuring. It allows exporting the pattern in the GDSII-

Helge Gehring 76 Dec 16, 2022
Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters

Somoclu Somoclu is a massively parallel implementation of self-organizing maps. It exploits multicore CPUs, it is able to rely on MPI for distributing

Peter Wittek 239 Nov 10, 2022
A pandas extension that solves all problems of Jalai/Iraninan/Shamsi dates

Jalali Pandas Extentsion A pandas extension that solves all problems of Jalai/Iraninan/Shamsi dates Features Series Extenstion Convert string to Jalal

51 Jan 02, 2023
GitHubPoster - Make everything a GitHub svg poster

GitHubPoster Make everything a GitHub svg poster 支持 Strava 开心词场 扇贝 Nintendo Switch GPX 多邻国 Issue

yihong 1.3k Jan 02, 2023
Python package that generates hardware pinout diagrams as SVG images

PinOut A Python package that generates hardware pinout diagrams as SVG images. The package is designed to be quite flexible and works well for general

336 Dec 20, 2022
A central task in drug discovery is searching, screening, and organizing large chemical databases

A central task in drug discovery is searching, screening, and organizing large chemical databases. Here, we implement clustering on molecular similarity. We support multiple methods to provide a inte

NVIDIA Corporation 124 Jan 07, 2023
哔咔漫画window客户端,界面使用PySide2,已实现分类、搜索、收藏夹、下载、在线观看、waifu2x等功能。

picacomic-windows 哔咔漫画window客户端,界面使用PySide2,已实现分类、搜索、收藏夹、下载、在线观看等功能。 功能介绍 登陆分流,还原安卓端的三个分流入口 分类,搜索,排行,收藏夹使用同一的逻辑,滚轮下滑自动加载下一页,双击打开 漫画详情,章节列表和评论列表 下载功能,目

1.8k Dec 31, 2022
Pebble is a stat's visualization tool, this will provide a skeleton to develop a monitoring tool.

Pebble is a stat's visualization tool, this will provide a skeleton to develop a monitoring tool.

Aravind Kumar G 2 Nov 17, 2021
Small project to recursively calculate and plot each successive order of the Hilbert Curve

hilbert-curve Small project to recursively calculate and plot each successive order of the Hilbert Curve. After watching 3Blue1Brown's video on Hilber

Stefan Mejlgaard 2 Nov 15, 2021
Visualization of the World Religion Data dataset by Correlates of War Project.

World Religion Data Visualization Visualization of the World Religion Data dataset by Correlates of War Project. Mostly personal project to famirializ

Emile Bangma 1 Oct 15, 2022
Cryptocurrency Centralized Exchange Visualization

This is a simple one that uses Grafina to visualize cryptocurrency from the Bitkub exchange. This service will make a request to the Bitkub API from your wallet and save the response to Postgresql. G

Popboon Mahachanawong 1 Nov 24, 2021
Library for exploring and validating machine learning data

TensorFlow Data Validation TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be hig

688 Jan 03, 2023