Template for a Dataflow Flex Template in Python

Overview

Dataflow Flex Template in Python

This repository contains a template for a Dataflow Flex Template written in Python that can easily be used to build Dataflow jobs to run in STOIX using Dataflow runner.

The code is based on the same example data as Google Cloud Python Quickstart, "King Lear" which is a tragedy written by William Shakespeare.

The Dataflow job reads the file content, count occurencies of each word and inserts it to a BigQuery table. The schedule date is also added to the table name producing a sharded table for the output.

Source data:

Template maintained by STOIX.

Configuration

The job is configured with the following pipeline options:

  • stoix_scheduled - Scheduled datetime as RFC3339
  • input_file - Text to read
  • output_dataset - BigQuery dataset for output table
  • output_table_prefix - BigQuery output table name prefix
  • project - Google Cloud project id

When using Dataflow runner, stoix_scheduled is automatically set and other pipeline options can be added as described in the Dataflow runner README.

Test the code

Tox is used to format, test and lint the code. Make sure to install it with pip install tox and then just run tox within the project folder.

Run pipeline

In order to work with the code locally, you can use Python virtual environments. Make sure to use Python version 3.7.10 as it is the version supported by Google Dataflow.

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e .

Run on local machine

See quickstart python for further description of arguments.

python -m main \
    --region europe-north1 \
    --runner DirectRunner \
    --stoix_scheduled 2021-01-01T00:00:00Z \
    --input_file gs://dataflow-samples/shakespeare/kinglear.txt \
    --output_table_prefix kinglear \
    --output_dataset 
   
     \
    --project 
    
      \
    --temp_location gs://
     
      /tmp/

     
    
   

Build Docker image for STOIX

In order to run the pipeline the Flex Template needs to be packaged in a Docker image and pushed to a Docker image repository. In this example Docker Hub is used.

Set the tag to the name and version of your pipeline, e.g: stoix/count-words:1.0.0.

$ docker build --tag stoix/count-words:1.0.0 .

Then upload the image to the Docker image repository.

$ docker push stoix/count-words:1.0.0

Run Dataflow on STOIX

Now the Dataflow Flex Template job can be ran using Dataflow runner. Add a new job with the image stoix/dataflow-runner and the following environment variables:

  • GCP_PROJECT_ID:
  • GCP_REGION: europe-north1
  • GCP_SERVICE_ACCOUNT: BASE64 encoded service account JSON
  • JOB_IMAGE: stoix/count-words:1.0.0
  • JOB_NAME_PREFIX: count-words
  • JOB_PARAM_INPUT_FILE: gs://dataflow-samples/shakespeare/kinglear.txt
  • JOB_PARAM_OUTPUT_DATASET: dataflow
  • JOB_PARAM_OUTPUT_TABLE_PREFIX: kinglear
  • JOB_SDK_LANGUAGE: python

Note: When running this in production, set GCP_SERVICE_ACCOUNT as a secret instead of environment variable.

License

MIT

Owner
STOIX
STOIX
Analyzing Earth Observation (EO) data is complex and solutions often require custom tailored algorithms.

eo-grow Earth observation framework for scaled-up processing in Python. Analyzing Earth Observation (EO) data is complex and solutions often require c

Sentinel Hub 18 Dec 23, 2022
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022
Fitting thermodynamic models with pycalphad

ESPEI ESPEI, or Extensible Self-optimizing Phase Equilibria Infrastructure, is a tool for thermodynamic database development within the CALPHAD method

Phases Research Lab 42 Sep 12, 2022
Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

google_takeout_parser parses both the Historical HTML and new JSON format for Google Takeouts caches individual takeout results behind cachew merge mu

Sean Breckenridge 27 Dec 28, 2022
CPSPEC is an astrophysical data reduction software for timing

CPSPEC manual Introduction CPSPEC is an astrophysical data reduction software for timing. Various timing properties, such as power spectra and cross s

Tenyo Kawamura 1 Oct 20, 2021
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 03, 2023
Zipline, a Pythonic Algorithmic Trading Library

Zipline is a Pythonic algorithmic trading library. It is an event-driven system for backtesting. Zipline is currently used in production as the backte

Quantopian, Inc. 15.7k Jan 07, 2023
Convert tables stored as images to an usable .csv file

Convert an image of numbers to a .csv file This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for

711 Dec 26, 2022
DaDRA (day-druh) is a Python library for Data-Driven Reachability Analysis.

DaDRA (day-druh) is a Python library for Data-Driven Reachability Analysis. The main goal of the package is to accelerate the process of computing estimates of forward reachable sets for nonlinear dy

2 Nov 08, 2021
Wafer Fault Detection - Wafer circleci with python

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

Avnish Yadav 14 Nov 21, 2022
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022
Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python 📊

Thomas 2 May 26, 2022
Stitch together Nanopore tiled amplicon data without polishing a reference

Stitch together Nanopore tiled amplicon data using a reference guided approach Tiled amplicon data, like those produced from primers designed with pri

Amanda Warr 14 Aug 30, 2022
Gaussian processes in TensorFlow

Website | Documentation (release) | Documentation (develop) | Glossary Table of Contents What does GPflow do? Installation Getting Started with GPflow

GPflow 1.7k Jan 06, 2023
TheMachineScraper 🐱‍👤 is an Information Grabber built for Machine Analysis

TheMachineScraper 🐱‍👤 is a tool made purely for analysing machine data for any reason.

doop 5 Dec 01, 2022
A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Uber Open Source 1.6k Dec 29, 2022
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment

Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment Brief explanation of PT Bukalapak.com Tbk Bukalapak was found

Najibulloh Asror 2 Feb 10, 2022
This python script allows you to manipulate the audience data from Sl.ido surveys

Slido-Automated-VoteBot This python script allows you to manipulate the audience data from Sl.ido surveys Since Slido blocks interference from automat

Pranav Menon 1 Jan 24, 2022
Active Learning demo using two small datasets

ActiveLearningDemo How to run step one put the dataset folder and use command below to split the dataset to the required structure run utils.py For ea

3 Nov 10, 2021