geobeam - adds GIS capabilities to your Apache Beam and Dataflow pipelines.

Overview

geobeam adds GIS capabilities to your Apache Beam pipelines.

What does geobeam do?

geobeam enables you to ingest and analyze massive amounts of geospatial data in parallel using Dataflow. geobeam provides a set of FileBasedSource classes that make it easy to read, process, and write geospatial data, and provides a set of helpful Apache Beam transforms and utilities that make it easier to process GIS data in your Dataflow pipelines.

See the Full Documentation for complete API specification.

Requirements

  • Apache Beam 2.27+
  • Python 3.7+

Note: Make sure the Python version used to run the pipeline matches the version in the built container.

Supported input types

File format Data type Geobeam class
tiff raster GeotiffSource
shp vector ShapefileSource
gdb vector GeodatabaseSource

Included libraries

geobeam includes several python modules that allow you to perform a wide variety of operations and analyses on your geospatial data.

Module Version Description
gdal 3.2.1 python bindings for GDAL
rasterio 1.1.8 reads and writes geospatial raster data
fiona 1.8.18 reads and writes geospatial vector data
shapely 1.7.1 manipulation and analysis of geometric objects in the cartesian plane

How to Use

1. Install the module

pip install geobeam

2. Write your pipeline

Write a normal Apache Beam pipeline using one of geobeams file sources. See geobeam/examples for inspiration.

3. Run

Run locally

python -m geobeam.examples.geotiff_dem \
  --gcs_url gs://geobeam/examples/dem-clipped-test.tif \
  --dataset=examples \
  --table=dem \
  --band_column=elev \
  --centroid_only=true \
  --runner=DirectRunner \
  --temp_location <temp gs://> \
  --project <project_id>

You can also run "locally" in Cloud Shell using the py-37 container variants

Note: Some of the provided examples may take a very long time to run locally...

Run in Dataflow

Write a Dockerfile

This will run in Dataflow as a custom container based on the dataflow-geobeam/base image. See [geobeam/examples/Dockerfile] for an example that installed the latest geobeam from source.

FROM gcr.io/dataflow-geobeam/base
# FROM gcr.io/dataflow-geobeam/base-py37

RUN pip install geobeam

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
# build locally with docker
docker build -t gcr.io/<project_id>/example
docker push gcr.io/<project_id>/example

# or build with Cloud Build
gcloud builds submit --tag gcr.io/<project_id>/<name> --timeout=3600s --machine-type=n1-highcpu-8

Start the Dataflow job

Note on Python versions

If you are starting a Dataflow job on a machine running Python 3.7, you must use the images suffixed with py-37. (Cloud Shell runs Python 3.7 by default, as of Feb 2021). A separate version of the base image is built for Python 3.7, and is available at gcr.io/dataflow-geobeam/base-py37. The Python 3.7-compatible examples image is similarly-named gcr.io/dataflow-geobeam/example-py37

# run the geotiff_soilgrid example in dataflow
python -m geobeam.examples.geotiff_soilgrid \
  --gcs_url gs://geobeam/examples/AWCh3_M_sl1_250m_ll.tif \
  --dataset=examples \
  --table=soilgrid \
  --band_column=h3 \
  --runner=DataflowRunner \
  --worker_harness_container_image=gcr.io/dataflow-geobeam/example \
  --experiment=use_runner_v2 \
  --temp_location=<temp bucket> \
  --service_account_email <service account> \
  --region us-central1 \
  --max_num_workers 2 \
  --machine_type c2-standard-30 \
  --merge_blocks 64

Examples

Polygonize Raster

def run(options):
  from geobeam.io import GeotiffSource
  from geobeam.fn import format_record

  with beam.Pipeline(options) as p:
    (p  | 'ReadRaster' >> beam.io.Read(GeotiffSource(gcs_url))
        | 'FormatRecord' >> beam.Map(format_record, 'elev', 'float')
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.dem'))

Validate and Simplify Shapefile

def run(options):
  from geobeam.io import ShapefileSource
  from geobeam.fn import make_valid, filter_invalid, format_record

  with beam.Pipeline(options) as p:
    (p  | 'ReadShapefile' >> beam.io.Read(ShapefileSource(gcs_url))
        | 'Validate' >> beam.Map(make_valid)
        | 'FilterInvalid' >> beam.Filter(filter_invalid)
        | 'FormatRecord' >> beam.Map(format_record)
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.parcel'))

See geobeam/examples/ for complete examples.

A number of example pipelines are available in the geobeam/examples/ folder. To run them in your Google Cloud project, run the included terraform file to set up the Bigquery dataset and tables used by the example pipelines.

Open up Bigquery GeoViz to visualize your data.

Shapefile Example

The National Flood Hazard Layer loaded from a shapefile. Example pipeline at geobeam/examples/shapefile_nfhl.py

Raster Example

The Digital Elevation Model is a high-resolution model of elevation measurements at 1-meter resolution. (Values converted to centimeters). Example pipeline: geobeam/examples/geotiff_dem.py.

Included Transforms

The geobeam.fn module includes several Beam Transforms that you can use in your pipelines.

Module Description
geobeam.fn.make_valid Attempt to make all geometries valid.
geobeam.fn.filter_invalid Filter out invalid geometries that cannot be made valid
geobeam.fn.format_record Format the (props, geom) tuple received from a FileSource into a dict that can be inserted into the destination table

Execution parameters

Each FileSource accepts several parameters that you can use to configure how your data is loaded and processed. These can be parsed as pipeline arguments and passed into the respective FileSources as seen in the examples pipelines.

Parameter Input type Description Default Required?
skip_reproject All True to skip reprojection during read False No
in_epsg All An EPSG integer to override the input source CRS to reproject from No
band_number Raster The raster band to read from 1 No
include_nodata Raster True to include nodata values False No
centroid_only Raster True to only read pixel centroids False No
merge_blocks Raster Number of block windows to combine during read. Larger values will generate larger, better-connected polygons. No
layer_name Vector Name of layer to read Yes
gdb_name Vector Name of geodatabase directory in a gdb zip archive Yes, for GDB files

License

This is not an officially supported Google product, though support will be provided on a best-effort basis.

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Comments
  • Add get_bigquery_schema_dataflow

    Add get_bigquery_schema_dataflow

    Added get_bigquery_schema_dataflow to create a schema that can read files from Google Cloud Storage and generate schemas that can be used natively with Google Dataflow

    opened by mbforr 7
  • Unable to get worker harness container image

    Unable to get worker harness container image

    Unable to get the worker harness container images used in the examples.

    ~~➜ docker pull gcr.io/dataflow-geobeam/example-py37~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/base-py37~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/example~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/base~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ➜ docker pull gcr.io/dataflow-geobeam/base
    Using default tag: latest
    Error response from daemon: manifest for gcr.io/dataflow-geobeam/base:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/dataflow-geobeam/base/manifests/latest"
    
    opened by jskontorp 5
  • Encountering errors while trying to run examples geotiff examples

    Encountering errors while trying to run examples geotiff examples

    tl;dr This may be an issue with my environment (I'm running Python 3.9.13), but I've had no success getting any of the examples involving gridded data (e.g., geobeam.examples.geotiff_dem) to run locally. Have these been tested recently?

    I was bumping up against TypeError: only size-1 arrays can be converted to Python scalars [while running 'ElevToCentimeters'], which were fixed by using x.astype(int) where appropriate. Then I hit TypeError: format_record() takes from 1 to 2 positional arguments but 3 were given [while running 'FormatRecords']. (Maybe this line needs to be something like 'FormatRecords' >> beam.Map(format_rasterpolygon_record, 'int', known_args.band_column) instead?) Then I got TypeError: Object of type 'ndarray' is not JSON serializable [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] and stopped to jump in here. If it's just my environment, then I'm making changes needlessly. If it's the code, it seemed better for these fixes to be applied directly in the upstream repo.

    Any advice you have would be much appreciated!!

    bug documentation 
    opened by lzachmann 3
  • ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    I am fairly new to python and Apache beam, however, I used the shapefile_nfhl.py as an example to create a reader for GeoJSON files, therefore I imported the GeoJSONSource (as per documentation) from geobeam.io but when I run the application I get the following error ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    Am I missing something as I did follow the instructions to install geobeam. pip install geobeam

    I have tried this with python 3.7, 3.9 and 3.10, versions 3.7 and 3.9 gives this error where as 3.10 does not work at all - getting issues while installing rasterio.

    I am also running this on macOS Monterey (12.2.1)

    Here is my code:

    def run(pipeline_args, known_args): 
        import apache_beam as beam
        from apache_beam.io.gcp.internal.clients import bigquery as beam_bigquery
        from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
        from geobeam.io import GeoJSONSource
        from geobeam.fn import format_record, make_valid, filter_invalid
    
        pipeline_options = PipelineOptions([
            '--experiments', 'use_beam_bq_sink',
        ] + pipeline_args)
    
        with beam.Pipeline(options=pipeline_options) as p:
            (p
             | beam.io.Read(GeoJSONSource(known_args.gcs_url,
                 layer_name=known_args.layer_name))
             | 'MakeValid' >> beam.Map(make_valid)
             | 'FilterInvalid' >> beam.Filter(filter_invalid)
             | 'FormatRecords' >> beam.Map(format_record)
             | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                 beam_bigquery.TableReference(
                     datasetId=known_args.dataset,
                     tableId=known_args.table),
                 method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                 create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
    
    
    if __name__ == '__main__':
        import logging
        import argparse
    
        logging.getLogger().setLevel(logging.INFO)
    
        parser = argparse.ArgumentParser()
        parser.add_argument('--gcs_url')
        parser.add_argument('--dataset')
        parser.add_argument('--table')
        parser.add_argument('--layer_name')
        parser.add_argument('--in_epsg', type=int, default=None)
        known_args, pipeline_args = parser.parse_known_args()
    
        run(pipeline_args, known_args)```
    opened by migaelhartzenberg 3
  • Issue installing geobeam on GCP CloudShell

    Issue installing geobeam on GCP CloudShell

    Seeing the below issue while installing geobeam on GCP Cloud Shell.

    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-07s73wh5/orjson/
    

    Version of Python used from venv is 3.7.3

    (env) [email protected]:~ $ python
    Python 3.7.3 (default, Jan 22 2021, 20:04:44)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    

    Detailed error message

    Collecting orjson<4.0; python_version >= "3.6" (from apache-beam[gcp]>=2.27.0->geobeam)
      Using cached https://files.pythonhosted.org/packages/75/cd/eac8908d0b4a4b08067bc68c04e52d7601b0ed86bf2a6a3264c46dd51a84/orjson-3.6.3.tar.gz
      Installing build dependencies ... done
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/usr/lib/python3.7/tokenize.py", line 447, in open
            buffer = _builtin_open(filename, 'rb')
        FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-07s73wh5/orjson/setup.py'
    
    opened by hashkanna 1
  • Adding ESRIServerSource and GeoJSONSource

    Adding ESRIServerSource and GeoJSONSource

    Hey @tjwebb wanted to send these over - still need to do some testing but wanted to run them by you first.

    GeoJSONSource - this one should be fairly straightforward as it is a single file and Fiona can read the file natively

    ESRIServerSource - I added a package that can handle the transformation of ESRI JSON to GeoJSON, as well as loop through a layer request since the ESRI REST API generally limits features that can be requested to 1000 or 2000. I can write some of this code natively or we can use the package, but not sure if we want to limit the dependencies. The package in question is here.

    https://github.com/openaddresses/pyesridump

    Also any tips for testing locally would be great!

    opened by mbforr 1
  • Unable to load 5GB tif file to bigquery

    Unable to load 5GB tif file to bigquery

    It works fine for 1GB tif file. While trying to load 2GB ~ 5GB tif file it is failing with multiple errors during write to bigquery.

    If you would like to reproduce the errors, then you could get these datasets from here - https://files.isric.org/soilgrids/former/2017-03-10/data/ BDRLOG_M_250m_ll.tif OCDENS_M_sl1_250m_ll.tif ORCDRC_M_sl1_250m_ll.tif

    "Error processing instruction process_bundle-1256. Original traceback is Traceback (most recent call last): File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute response = task() File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in lambda: self.create_worker().do_instruction(request), request) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 606, in do_instruction return getattr(self, request_type)( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 644, in process_bundle bundle_processor.process_bundle(instruction_id)) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 999, in process_bundle input_op_by_transform_id[element.transform_id].process_encoded( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 228, in process_encoded self.output(decoded_value) File "apache_beam/runners/worker/operations.py", line 357, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 359, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 829, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/worker/operations.py", line 838, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/common.py", line 1247, in apache_beam.runners.common.DoFnRunner.process_with_sized_restriction File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 886, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1321, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "/usr/local/lib/python3.8/site-packages/future/utils/init.py", line 446, in raise_with_traceback raise exc.with_traceback(traceback) File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) RuntimeError: BrokenPipeError: [Errno 32] Broken pipe [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)-ptransform-3851']

    opened by aswinramakrishnan 1
  • Not able to docker or gcloud submit

    Not able to docker or gcloud submit

    Hi Travis, Thank you for creating geobeam package for our requirement.

    I am raising an issue here to just keep track.

    While using docker build -

     ---> 56341244044b
    Step 9/23 : RUN wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}
     ---> Running in 72bc532c27b8
    The command '/bin/sh -c wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}' returned a non-zero code: 5
    (global-env) [email protected] dataflow-geobeam % 
    (global-env) [email protected] dataflow-geobeam % 
    (global-env) [email protected] dataflow-geobeam % docker image ls                                                                                                        
    REPOSITORY                                                      TAG        IMAGE ID       CREATED         SIZE
    <none>                                                          <none>     56341244044b   5 minutes ago   2.55GB
    

    while trying to do gcloud submit command -

    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp  -fPIC -DPIC -o .libs/BufferOp.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp -o BufferOp.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferBuilder.lo -MD -MP -MF .deps/BufferBuilder.Tpo -c BufferBuilder.cpp -o BufferBuilder.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp  -fPIC -DPIC -o .libs/BufferParameters.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp -o BufferParameters.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferSubgraph.lo -MD -MP -MF .deps/BufferSubgraph.Tpo -c BufferSubgraph.cpp  -fPIC -DPIC -o .libs/BufferSubgraph.o
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     
    Your build timed out. Use the [--timeout=DURATION] flag to change the timeout threshold.
    ERROR: (gcloud.builds.submit) build ee748b6f-5347-4061-a81b-7f46959086c5 completed with status "TIMEOUT"
    
    documentation customer-reported 
    opened by aswinramakrishnan 1
  • Docstring type of fn.format_record is type but takes in string

    Docstring type of fn.format_record is type but takes in string

    Hi! Cool project that I want to test out but I noticed an inconsistency with the docstring and the code. Not sure which should be followed.

    https://github.com/GoogleCloudPlatform/dataflow-geobeam/blob/21479252be373b795a5c7d6626021b01a042e5de/geobeam/fn.py#L67-L91

    The docstring should be

            band_type (str, optional): Default to int. The data type of the
                raster band column to store in the database.
    ...
            p | beam.Map(geobeam.fn.format_record,
                band_column='elev', band_type='float'
    

    or the code should be

    def format_record(element, band_column=None, band_type=int):
        import json
    
        props, geom = element
        cast = band_type
    

    Thanks!

    opened by jtmiclat 1
  • Create BQ table from shapefile

    Create BQ table from shapefile

    Changes covering creation of the BQ table from the shp file schema + examples of loading "generic shapefile". By doing so we avoid the headache of table creation before, so any SHP could be loaded.

    opened by Vadoid 0
  • [BEAM-12879] Issue affecting examples

    [BEAM-12879] Issue affecting examples

    [BEAM-12879] Issue - https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

    Affecting the examples when reading files from the gs://geobeam bucket. Workaround - download zip files and put them in the own bucket.

    opened by Vadoid 0
  • Create BQ table from SHP schema

    Create BQ table from SHP schema

    Changes covering creation of the BQ table from the shp file schema + examples of loading "generic shapefile". By doing so we avoid the headache of table creation before, so any SHP could be loaded.

    opened by Vadoid 0
  • Geobeam Util  get_bigquery_schema_dataflow only for GDB files

    Geobeam Util get_bigquery_schema_dataflow only for GDB files

    As far as I understand current implementation of get_bigquery_schema_dataflow only works for GDB files and doesn't work for SHP files. Should we make the autoschema work for shapefiles as well via Fiona?

    opened by Vadoid 0
  • Updated the following: Dockerfile to get more recent gdal version

    Updated the following: Dockerfile to get more recent gdal version

    beam sdk 2.36.0 --> beam sdk 2.40.0 Addition of metview binary with the fix for debian curl 7.73.0 --> curl 7.83.1 (open ssl needed) geos 3.9.0 --> curl 3.10.3 sqlite 3330000 --> sqlite 3380500 proj 7.2.1 --> proj 9.0.0 (using cmake) openjpeg 2.3.1 --> openjpeg 2.5.0 addition of hdf5 1.10.5 addition of netcdf-c 4.9.0 gdal 3.2.1 --> gdal 3.5.1 (using cmake) making sure numpy always gets installed with gdal addition gcloud components alpha

    added longer timeout to cloudbuild.yaml because the intial build takes 1h20min at least

    opened by bahmandar 0
  • get_bigquery_schema_dataflow() issue and questions

    get_bigquery_schema_dataflow() issue and questions

    Hi, I am trying to use geobeam to ingest a shapefile into BigQuery, and creating the table with a schema from the shapefile if the table does not exist. I came across few issues and questions.

    I attempt this using a modified example shapefile_nfhl.py. And ran with this command.

    python -m shapefile_nfhl --runner DataflowRunner --project my-project --temp_location gs://mybucket-geobeam/data --region australia-southeast1 --worker_harness_container_image gcr.io/dataflow-geobeam/example --experiment use_runner_v2 --service_account_email [email protected] --gcs_url gs://geobeam/examples/510104_20170217.zip --dataset examples --table output_table
    

    Using get_bigquery_schema_dataflow() from geobeam.util is throwing error due to undefined variable.

    NameError: name 'gcs_url' is not defined
    

    I have opened a PR to fix this. #38

    Once the function is fixed, it seems that it does not accept a shapefile. Passing in the GCS URL to the zipped shapefile is throwing this error.

    Traceback (most recent call last):
      File "fiona/_shim.pyx", line 83, in fiona._shim.gdal_open_vector
      File "fiona/_err.pyx", line 291, in fiona._err.exc_wrap_pointer
    fiona._err.CPLE_OpenFailedError: '/vsigs/geobeam/examples/510104_20170217.zip' not recognized as a supported file format.
    

    Am I using the function in a wrong way or (zipped) shapefile is not support for this? For reference, this is the modified template. Thank you!

    opened by muazamkamal 0
  • centroid_only = false error for a particular GeoTIFF dataset

    centroid_only = false error for a particular GeoTIFF dataset

    When ingesting this cropland dataset https://developers.google.com/earth-engine/datasets/catalog/USDA_NASS_CDL?hl=en#citations: if I set the centroid_only parameter to false, I get the following error: Failed 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 319; errors: 1. Please look into the errors[] collection for more details.' reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs-ptransform-31'

    Full steps to reproduce are in the 'Ingesting EE data to BQ' blog.

    opened by remylouisew 0
Releases(v0.1.0)
Owner
Google Cloud Platform
Google Cloud Platform
A Python package for delineating nested surface depressions from digital elevation data.

Welcome to the lidar package lidar is Python package for delineating the nested hierarchy of surface depressions in digital elevation models (DEMs). I

Qiusheng Wu 166 Jan 03, 2023
Download and process satellite imagery in Python using Sentinel Hub services.

Description The sentinelhub Python package allows users to make OGC (WMS and WCS) web requests to download and process satellite images within your Py

Sentinel Hub 659 Dec 23, 2022
LicenseLocation - License Location With Python

LicenseLocation Hi,everyone! ❤ 🧡 💛 💚 💙 💜 This is my first project! ✔ Actual

The Bin 1 Jan 25, 2022
A simple python script that, given a location and a date, uses the Nasa Earth API to show a photo taken by the Landsat 8 satellite. The script must be executed on the command-line.

What does it do? Given a location and a date, it uses the Nasa Earth API to show a photo taken by the Landsat 8 satellite. The script must be executed

Caio 42 Nov 26, 2022
GeoIP Legacy Python API

MaxMind GeoIP Legacy Python Extension API Requirements Python 2.5+ or 3.3+ GeoIP Legacy C Library 1.4.7 or greater Installation With pip: $ pip instal

MaxMind 230 Nov 10, 2022
Logging the position of the car on an sdcard

audi-mmi-3g-gps-logging Logging the position of the car on an sdcard, startup script origin not clear to me, logging setup and time change is what I d

2 May 31, 2022
A Django application that provides country choices for use with forms, flag icons static files, and a country field for models.

Django Countries A Django application that provides country choices for use with forms, flag icons static files, and a country field for models. Insta

Chris Beaven 1.2k Jan 03, 2023
Imports VZD (Latvian State Land Service) open data into postgis enabled database

Python script main.py downloads and imports Latvian addresses into PostgreSQL database. Data contains parishes, counties, cities, towns, and streets.

Kaspars Foigts 7 Oct 26, 2022
Python interface to PROJ (cartographic projections and coordinate transformations library)

pyproj Python interface to PROJ (cartographic projections and coordinate transformations library). Documentation Stable: http://pyproj4.github.io/pypr

832 Dec 31, 2022
Hapi is a Python library for building Conceptual Distributed Model using HBV96 lumped model & Muskingum routing method

Current build status All platforms: Current release info Name Downloads Version Platforms Hapi - Hydrological library for Python Hapi is an open-sourc

Mostafa Farrag 15 Dec 26, 2022
Map Ookla server locations as a Kernel Density Estimation (KDE) geographic map plot.

Ookla Server KDE Plotting This notebook was created to map Ookla server locations as a Kernel Density Estimation (KDE) geographic map plot. Currently,

Jonathan Lo 1 Feb 12, 2022
A compilation of several single-beam bathymetry surveys of the Caribbean

Caribbean - Single-beam bathymetry This dataset is a compilation of several single-beam bathymetry surveys of the Caribbean ocean displaying a wide ra

Fatiando a Terra Datasets 0 Jan 20, 2022
Tile Map Service and OGC Tiles API for QGIS Server

Tiles API Add tiles API to QGIS Server Tiles Map Service API OGC Tiles API Tile Map Service API - TMS The TMS API provides these URLs: /tms/? to get i

3Liz 6 Dec 01, 2021
GebPy is a Python-based, open source tool for the generation of geological data of minerals, rocks and complete lithological sequences.

GebPy is a Python-based, open source tool for the generation of geological data of minerals, rocks and complete lithological sequences. The data can be generated randomly or with respect to user-defi

Maximilian Beeskow 16 Nov 29, 2022
Advanced raster and geometry manipulations

buzzard In a nutshell, the buzzard library provides powerful abstractions to manipulate together images and geometries that come from different kind o

Earthcube Lab 30 Jun 20, 2022
🌐 Local tile server for viewing geospatial raster files with ipyleaflet or folium

🌐 Local Tile Server for Geospatial Rasters Need to visualize a rather large (gigabytes) raster you have locally? This is for you. A Flask application

Bane Sullivan 192 Jan 04, 2023
Expose a GDAL file as a HTTP accessible on-the-fly COG

cogserver Expose any GDAL recognized raster file as a HTTP accessible on-the-fly COG (Cloud Optimized GeoTIFF) The on-the-fly COG file is not material

Even Rouault 73 Aug 04, 2022
Python 台灣行政區地圖 (2021)

Python 台灣行政區地圖 (2021) 以 python 讀取政府開放平台的 ShapeFile 地圖資訊。歡迎引用或是協作 另有縣市資訊、村里資訊與各種行政地圖資訊 例如: 直轄市、縣市界線(TWD97經緯度) 鄉鎮市區界線(TWD97經緯度) | 政府資料開放平臺: https://data

WeselyOng 12 Sep 27, 2022
Platform for building statistical models of cities and regions

UrbanSim UrbanSim is a platform for building statistical models of cities and regions. These models help forecast long-range patterns in real estate d

Urban Data Science Toolkit 419 Dec 30, 2022
Bacon - Band-limited Coordinate Networks for Multiscale Scene Representation

BACON: Band-limited Coordinate Networks for Multiscale Scene Representation Project Page | Video | Paper Official PyTorch implementation of BACON. BAC

Stanford Computational Imaging Lab 144 Dec 29, 2022