PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding (h3-py) documentation.

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{'lat': 37.769377, 'lng': -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+
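
Other functions from the vanilla binding are exposed the same way, as column-in/column-out UDFs. A rough sketch (assuming k_ring is exposed with the same argument order as in h3-py, with k supplied as a column):

>>> df = df.withColumn('k', F.lit(1))
>>> df = df.withColumn('neighbors', h3_pyspark.k_ring('h3_9', 'k'))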

Publishing

  1. Bump version in setup.cfg
  2. Publish:
python3 -m build
python3 -m twine upload --repository pypi dist/*
Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following Spark dataframe, where the column of H3 indices is created by applying the lat/lng pairs and the resolution to the h3_pyspark.geo_to_h3(lat, lng, resolution) function. However, I encountered the following error when I tried to check whether there are any nulls in the index column. It's not just isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

    dataframe: [screenshot]

    errors: [screenshot]
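
    A likely culprit is null lat/lng values reaching the UDF, since h3 requires real numbers. A minimal workaround sketch (hypothetical, not from the thread; assumes the nulls originate in the input columns):

    from pyspark.sql import functions as F
    import h3_pyspark

    # Guard against null inputs before applying the UDF: geo_to_h3 passes
    # values straight through to h3, which raises on None.
    df_clean = df.filter(F.col('lat').isNotNull() & F.col('lng').isNotNull())
    df_clean = df_clean.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))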

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Catches some edge cases that h3_line and polyfill would otherwise miss. The result could be overbroad, which is why the docstrings are changed to say "superset", but at least it should be complete.

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Currently the behavior for all UDFs is that if any row in your dataframe has a null value, the entire build will fail.

    This type of behavior would be better/more resilient:

    from pyspark.sql import functions as F, types as T

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        # Short-circuit on null input instead of failing the whole job.
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)
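
    A generic guard could wrap every UDF the same way. Here is a minimal sketch (an assumed implementation, not the library's exact code; the released fix ships a handle_nulls decorator, visible in the polyfill snippet below):

    import functools

    def handle_nulls(func):
        # Return None whenever any argument is null, rather than letting
        # h3 raise a TypeError that fails the whole job.
        @functools.wraps(func)
        def wrapper(*args):
            if any(arg is None for arg in args):
                return None
            return func(*args)
        return wrapper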
    
    opened by kevinschaich 1
  • Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    Previous behavior for problematic line: [screenshot]

    New behavior for same line: [screenshot]

    Previous behavior for problematic polygon: [screenshot]

    New behavior for same polygon: [screenshot]

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Reported by @rwaldman – in the worst case we can miss several hexes when a line's start and end points run east-to-west near a hex's north or south edge:

    [screenshot]

    The proposed solution, for long line segments (length ≥ s, where s is the hex side length at the selected resolution), is to interpolate several points along the line so that we catch the hexes in between:

    [screenshot]
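
    A minimal sketch of that interpolation (a hypothetical helper, not the repo's code; assumes h3-py v3's edge_length and point_dist, with points as (lat, lng) tuples):

    import h3

    def interpolate_points(start, end, resolution):
        # Sample enough intermediate points that consecutive samples are
        # no farther apart than one (average) hex edge at this resolution.
        edge_m = h3.edge_length(resolution, unit='m')
        dist_m = h3.point_dist(start, end, unit='m')
        n = max(int(dist_m / edge_m), 1)
        return [
            (start[0] + (end[0] - start[0]) * i / n,
             start[1] + (end[1] - start[1]) * i / n)
            for i in range(n + 1)
        ]
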
    opened by kevinschaich 0
  • polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON is provided; this is expected behavior when utilizing the native h3 library.

    However, I thought it would be helpful if this library were able to accept multipolygons. Could I get permission to push a PR?

    implementation in src/h3_pyspark/__init__.py

    # (Assumes json, h3, F, and T are imported, and that `handle_nulls` and
    # `sanitize_types` are the library's existing helpers.)
    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from the default:
        # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string.
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    test in tests/test_core.py

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
        h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
        print(h3_pyspark_test_args)
        integer = 12
        data = {
            "res": integer,
            "geo_json_conformant": True,
            "geojson": multipolygon,
        }
        df = spark.createDataFrame([data])
        actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
        actual = actual.collect()[0]["actual"]
        print(actual)
        expected = []
        for i in json.loads(multipolygon)["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            expected.extend(list(h3.polyfill(_polygon, integer, True)))
        expected = sanitize_types(expected)
        assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
Releases (1.2.6)
  • 1.2.6 (Mar 10, 2022)

  • 1.2.4 (Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

  • 1.2.3 (Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

  • 1.1.0 (Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.