The "breathing k-means" algorithm with datasets and example notebooks

Last update: Nov 17, 2022

Overview

The Breathing K-Means Algorithm (with examples)

The Breathing K-Means is an approximation algorithm for the k-means problem that (on average) is better (higher solution quality) and faster (lower CPU time usage) than k-means++.

Techreport: https://arxiv.org/abs/2006.15666 (submitted for publication)

Typical results for the "Birch" data set (100000 points drawn from a mixture of 100 circular Gaussians). k=100

Can you spot the mistakes? :-)

Installation from pypi

pip install bkmeans

Local installation to run the examples

Clone the repository

git clone https://github.com/gittar/breathing-k-means

Enter the top directory.

cd breathing-k-means

Create the conda environment 'bkm' (or any other name) via

conda env create -n bkm -f environment.yml

Activate the created environment via

conda activate bkm

To run a jupyter notebook with examples, type, e.g.:

jupyter lab notebooks/2D.ipynb

Content

The top level folder contains the following subfolders

data/ - data sets used in the notebooks
notebooks/ - jupyter notebooks with all examples from the technical report
- 2D.ipynb 2D problems executed with helper functions for brevity
- 2D_detail.ipynb 2D problems executed with raw API
- 10+D.ipynb high-dimensional problems based on the data sets from the original k-means++ publication
src/
- bkmeans.py - reference implementation of breathing k-means
misc/
- aux.py - auxiliary functions
- dataset.py - general class to administer and plot data sets
- runfunctions.py - wrapper functions used in the notebook

API

The included class BKMeans is subclassed from scikit-learn's KMeans class and has, therefore, the same API. It can be used as a plug-in replacement for scikit-learn's KMeans.

There is one new parameters which can be ignored (left at default) for normal usage:

m (breathing depth), default: 5

The parameter m can also be used, however, to generate faster ( 1 < m < 5) or better (m>5) solutions. For details see the technical report.

Example 1: running on simple random data set

Code:

import numpy as np
from bkmeans import BKMeans

# generate random data set
X=np.random.rand(1000,2)

# create BKMeans instance
bkm = BKMeans(n_clusters=100)

# run the algorithm
bkm.fit(X)

# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)

Output:

1.1775040547902602

Example 2: comparison with k-means++ (multiple runs)

Code:

import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans

# random 2D data set
X=np.random.rand(1000,2)

# number of centroids
k=100

for i in range(5):
    # kmeans++
    km = KMeans(n_clusters=k)
    km.fit(X)

    # breathing k-means
    bkm = BKMeans(n_clusters=k)
    bkm.fit(X)

    # relative SSE improvement of bkm over km++
    imp = 1 - bkm.inertia_/km.inertia_
    print(f"SSE improvement over k-means++: {imp:.2%}")

Output:

SSE improvement over k-means++: 3.38%
SSE improvement over k-means++: 4.16%
SSE improvement over k-means++: 6.14%
SSE improvement over k-means++: 6.79%
SSE improvement over k-means++: 4.76%

Acknowledgements

Kudos go the scikit-learn team for their excellent sklearn.cluster.KMeans class, also to the developers and maintainers of the other packages used: numpy, scipy, matplotlib, jupyterlab

The "breathing k-means" algorithm with datasets and example notebooks

Related tags

Overview

The Breathing K-Means Algorithm (with examples)

Installation from pypi

Local installation to run the examples

Content

API

Example 1: running on simple random data set

Example 2: comparison with k-means++ (multiple runs)

Acknowledgements

Owner

Bernd Fritzke

[NeurIPS 2021] "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators"

System Combination for Grammatical Error Correction Based on Integer Programming

Backend code to use MCPI's python API to make infinite worlds with custom generation

Repositório para arquivos sobre o Módulo 1 do curso Top Coders da Let's Code + Safra

Simple ONNX operation generator. Simple Operation Generator for ONNX.

code from "Tensor decomposition of higher-order correlations by nonlinear Hebbian plasticity"

PyTorch implementation of Interpretable Explanations of Black Boxes by Meaningful Perturbation

Memoized coduals - Shows that it is possible to implement reverse mode autodiff using a variation on the dual numbers called the codual numbers

CR-Fill: Generative Image Inpainting with Auxiliary Contextual Reconstruction. ICCV 2021

ICML 21 - Voice2Series: Reprogramming Acoustic Models for Time Series Classification

Avalanche RL: an End-to-End Library for Continual Reinforcement Learning

Improving Calibration for Long-Tailed Recognition (CVPR2021)

S2s2net - Sentinel-2 Super-Resolution Segmentation Network

Combining Latent Space and Structured Kernels for Bayesian Optimization over Combinatorial Spaces

CRISCE: Automatically Generating Critical Driving Scenarios From Car Accident Sketches

[NeurIPS'21] Shape As Points: A Differentiable Poisson Solver

Anomaly detection in multi-agent trajectories: Code for training, evaluation and the OpenAI highway simulation.

A series of convenience functions to make basic image processing operations such as translation, rotation, resizing, skeletonization, and displaying Matplotlib images easier with OpenCV and Python.

Pytorch implementation of our method for regularizing nerual radiance fields for few-shot neural volume rendering.

Online-compatible Unsupervised Non-resonant Anomaly Detection Repository