PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For the available functions, see the documentation for the vanilla H3 Python bindings.

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+
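As a quick sanity check on the output above, the resolution can be read back out of the index string itself. This is a minimal sketch that assumes the standard H3 cell index bit layout (mode in bits 59-62, resolution in bits 52-55). The helper name `h3_fields_from_string` is hypothetical and is not part of h3-pyspark.

```python
def h3_fields_from_string(h3_index: str) -> dict:
    """Decode the mode and resolution fields from a hex-encoded H3 index.

    Assumes the standard 64-bit H3 layout; this is an illustrative helper,
    not a function provided by h3-pyspark.
    """
    value = int(h3_index, 16)
    return {
        "mode": (value >> 59) & 0xF,        # 1 means "cell" in the H3 spec
        "resolution": (value >> 52) & 0xF,  # 0 (coarsest) to 15 (finest)
    }


fields = h3_fields_from_string("89283082e73ffff")
print(fields)  # {'mode': 1, 'resolution': 9}
```

The decoded resolution matches the `resolution` column that was passed to `geo_to_h3`.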

Publishing

  1. Bump version in setup.cfg
  2. Publish:

python3 -m build
python3 -m twine upload --repository pypi dist/*

Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following Spark dataframe, in which the column of H3 indices is created by passing the lat/lng pairs and the resolution to h3_pyspark.geo_to_h3(lat, lng, resolution). However, I encountered the following error when I tried to check whether there are any nulls in the index column. It is not only isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

    dataframe: image

    errors: image

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Catches some edge cases that h3_line and polyfill would miss. The result could be overbroad, which is why the docstrings are changed to say "superset", but at least it should be complete.

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Currently the behavior for all UDFs is that if any row in your dataframe has a null value, the entire build will fail.

    This type of behavior would be better and more resilient:

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)
    
    opened by kevinschaich 1
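The null check proposed above can be factored into a reusable decorator so that it does not have to be repeated in every UDF. The sketch below is one assumed way to write such a helper in plain Python; the name `handle_nulls` mirrors a decorator referenced elsewhere on this page, but the library's actual implementation may differ, and `scale` is a hypothetical stand-in for a real H3 function.

```python
from functools import wraps


def handle_nulls(func):
    """Return None instead of calling func when any positional argument is None.

    Assumed implementation for illustration; not necessarily what
    h3-pyspark ships.
    """
    @wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@handle_nulls
def scale(value, factor):
    # Hypothetical stand-in for a function that cannot accept None.
    return value * factor


print(scale(2, 3))     # 6
print(scale(None, 3))  # None
```

In a PySpark job the decorator would sit directly beneath `@F.udf(...)`, so that rows with null inputs yield a null result rather than failing the whole build.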
  • Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    Previous behavior for problematic line:

    Screen Shot 2022-02-24 at 3 40 36 PM

    New behavior for same line:

    Screen Shot 2022-02-24 at 4 02 47 PM

    Previous behavior for problematic polygon:

    Screen Shot 2022-02-24 at 4 34 59 PM

    New behavior for same polygon:

    Screen Shot 2022-02-24 at 4 35 46 PM

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Reported by @rwaldman – we can miss several hexes in the worst case if a line's start and endpoints are east-to-west and towards the north or south edge:

    image

    The proposed solution is to interpolate several points along long line segments (length ≥ s, where s is the hex side length), with the spacing based on the selected resolution, so that we catch the hexes in between:

    image
    opened by kevinschaich 0
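The interpolation idea described in this issue can be sketched in a few lines. This is an illustration only, not the fix that was merged: it interpolates linearly in plain coordinate space, and `max_step` is a hypothetical parameter standing in for the hex side length s expressed in the same units as the coordinates. A real implementation would have to derive that length from the resolution and account for geodesic distance.

```python
import math


def interpolate_segment(start, end, max_step):
    """Return points from start to end (inclusive), spaced at most max_step apart.

    start and end are (x, y) tuples; spacing is measured in planar
    coordinate units, which is a simplifying assumption.
    """
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    # Number of equal sub-segments needed so that none exceeds max_step.
    n = max(1, math.ceil(length / max_step))
    return [
        (x0 + (x1 - x0) * i / n, y0 + (y1 - y0) * i / n)
        for i in range(n + 1)
    ]


points = interpolate_segment((0.0, 0.0), (1.0, 0.0), 0.25)
print(len(points))  # 5
```

Each interpolated point would then be indexed individually, so that hexes lying between the two endpoints are no longer skipped.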
  • polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON is provided. This is expected behavior when using the native h3 library.

    However, I thought it would be helpful if this library could accept MultiPolygons. Could I get permission to push a PR?

    implementation in src/h3_pyspark/__init__.py

    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from the default h3 binding:
        # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    test in tests/test_core.py

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
        h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
        print(h3_pyspark_test_args)
        integer = 12
        data = {
            "res": integer,
            "geo_json_conformant": True,
            "geojson": multipolygon,
        }
        df = spark.createDataFrame([data])
        actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
        actual = actual.collect()[0]["actual"]
        print(actual)
        expected = []
        for i in json.loads(multipolygon)["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            expected.extend(list(h3.polyfill(_polygon, integer, True)))
        expected = sanitize_types(expected)
        assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
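The MultiPolygon handling proposed in the last issue boils down to splitting the geometry into its member Polygons before filling each one. The sketch below isolates that step with the standard library only, so it can be checked without Spark or h3 installed. The function name `split_multipolygon` is hypothetical and is not part of h3-pyspark.

```python
import json


def split_multipolygon(geojson_str):
    """Split a GeoJSON geometry string into a list of Polygon dicts.

    A MultiPolygon yields one Polygon per member; any other geometry is
    returned unchanged as a single-element list. Illustrative helper only.
    """
    geometry = json.loads(geojson_str)
    if geometry["type"].lower() == "multipolygon":
        return [
            {"type": "Polygon", "coordinates": rings}
            for rings in geometry["coordinates"]
        ]
    return [geometry]


example = json.dumps({
    "type": "MultiPolygon",
    "coordinates": [
        [[[0, 0], [1, 0], [1, 1], [0, 0]]],
        [[[2, 2], [3, 2], [3, 3], [2, 2]]],
    ],
})
parts = split_multipolygon(example)
print(len(parts))  # 2
```

Each resulting Polygon could then be passed to the polyfill call in turn and the results concatenated, as in the proposed implementation above.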
Releases(1.2.6)
  • 1.2.6(Mar 10, 2022)

  • 1.2.4(Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

  • 1.2.3(Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

  • 1.1.0(Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.