Pandas and Dask test helper methods with beautiful error messages.

Last update: Nov 28, 2022

beavis

test helpers

These test helper methods are meant to be used in test suites. They provide descriptive error messages to allow for a seamless development workflow.

The test helpers are inspired by chispa and spark-fast-tests, popular test helper libraries for the Spark ecosystem.

Pandas has built-in testing methods that can also be used, but their error messages aren't as easy to parse. The following sections compare the built-in Pandas output with the Beavis output, so you can choose for yourself.

Column comparisons

The built-in assert_series_equal method does not make it easy to decipher the rows that are equal and the rows that are different, so quickly fixing your tests and maintaining flow is hard.

Here's the built-in error message when comparing series that are not equal.

import pandas as pd

df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])

>   ???
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [index]: [0, 1, 2, 3]
E   [left]:  [1042, 2, 9, 6]
E   [right]: [5, 2, 7, 6]

Here's the equivalent beavis call. Its error message aligns rows and highlights the mismatches in red.

import beavis

beavis.assert_pd_column_equality(df, "col1", "col2")

You can also compare columns in a Dask DataFrame.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
beavis.assert_dd_column_equality(ddf, "col1", "col2")

The assert_dd_column_equality error message is similarly descriptive.
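
To make the idea of a row-aligned column comparison concrete, here is a minimal sketch that uses only Pandas. It is not how beavis is implemented; column_mismatches is a hypothetical helper written for illustration, and it assumes that two missing values in the same row should count as equal.

```python
import pandas as pd


def column_mismatches(df: pd.DataFrame, col1: str, col2: str) -> pd.DataFrame:
    """Return only the rows where df[col1] and df[col2] differ.

    Hypothetical helper for illustration; not part of beavis or pandas.
    """
    left, right = df[col1], df[col2]
    # Assumption: two missing values in the same row are treated as equal.
    equal = (left == right) | (left.isna() & right.isna())
    return df.loc[~equal, [col1, col2]]


df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
mismatches = column_mismatches(df, "col1", "col2")
percent_different = 100 * len(mismatches) / len(df)

# Show the differing rows side by side, keeping their original index labels.
print(mismatches)
print(f"{percent_different:.1f} % of rows differ")
```

For the example DataFrame, this reports the rows at index 0 and 2, which matches the 50.0 % figure in the built-in Pandas message above.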

DataFrame comparisons

The built-in pandas.testing.assert_frame_equal method doesn't output an error message that's easy to understand. See this example.

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
pd.testing.assert_frame_equal(df1, df2)

E   AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
E
E   DataFrame.iloc[:, 0] (column name="col1") values are different (50.0 %)
E   [index]: [0, 1]
E   [left]:  [1, 2]
E   [right]: [5, 2]

beavis provides a nicer error message.

beavis.assert_pd_equality(df1, df2)

DataFrame comparison options:

check_index (default True)
check_dtype (default True)
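
The README only lists these option names and their defaults. The sketch below illustrates what options like these commonly mean, using only Pandas; frames_match is a hypothetical helper, and beavis's actual handling of check_index and check_dtype may differ.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
# Same values as df1, but with a different index and a float column.
df2 = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3, 4]}, index=[10, 11])


def frames_match(left, right, check_index=True, check_dtype=True):
    """Return True if the frames are equal under the given options.

    Hypothetical helper for illustration; not part of beavis or pandas.
    """
    if not check_index:
        # Ignore index labels by replacing both with a fresh 0..n-1 index.
        left = left.reset_index(drop=True)
        right = right.reset_index(drop=True)
    try:
        pd.testing.assert_frame_equal(left, right, check_dtype=check_dtype)
    except AssertionError:
        return False
    return True


strict = frames_match(df1, df2)
ignore_index_only = frames_match(df1, df2, check_index=False)
relaxed = frames_match(df1, df2, check_index=False, check_dtype=False)
print(strict, ignore_index_only, relaxed)
```

Under these assumptions, the strict comparison fails, ignoring only the index still fails because col1 is an integer column in one frame and a float column in the other, and relaxing both checks lets the frames compare as equal.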

Let's convert the Pandas DataFrames to Dask DataFrames and use the assert_dd_equality function to check they're equal.

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
beavis.assert_dd_equality(ddf1, ddf2)

These DataFrames aren't equal, so we'll get a good error message that's easy to debug.

Development

Install Poetry and run poetry install to create a virtual environment with all the Beavis dependencies on your machine.

Other useful commands:

poetry run pytest tests runs the test suite
poetry run black . formats the code
poetry build packages the library in a wheel file
poetry publish releases the library to PyPI (requires valid credentials)

Owner

Matthew Powers
