Data Orchestration Platform

Related tags

Miscellaneousdop
Overview

Table of contents

What is DOP

Design Concept

DOP is designed to simplify the orchestration effort across many connected components using a configuration file without the need to write any code. We have a vision to make orchestration easier to manage and more accessible to a wider group of people.

Here are some of the key design concept behind DOP,

  • Built on top of Apache Airflow - Utilises it’s DAG capabilities with interactive GUI
  • DAGs without code - YAML + SQL
  • Native capabilities (SQL) - Materialisation, Assertion and Invocation
  • Extensible via plugins - DBT job, Spark job, Egress job, Triggers, etc
  • Easy to setup and deploy - fully automated dev environment and easy to deploy
  • Open Source - open sourced under the MIT license

Please note that this project is heavily optimised to run with GCP (Google Cloud Platform) services which is our current focus. By focusing on one cloud provider, it allows us to really improve on end user experience through automation

A Typical DOP Orchestration Flow

Typical DOP Flow

Prerequisites - Run in Docker

Note that all the IAM related prerequisites will be available as a Terraform template soon!

For DOP Native Features

  1. Download and install Docker https://docs.docker.com/get-docker/ (if you are on Windows, please follow instruction here as there are some additional steps required for it to work https://docs.docker.com/docker-for-windows/install/)
  2. Download and install Google Cloud Platform (GCP) SDK following instructions here https://cloud.google.com/sdk/docs/install.
  3. Create a dedicated service account for docker with limited permissions for the development GCP project, the Docker instance is not designed to be connected to the production environment
    1. Call it dop-docker-user@<your GCP project id> and create it in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  4. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on thedevelopment project just for the dop-docker-user service account in order to enable Service Account Impersonation.
    Grant service account user
  5. Authenticating with your GCP environment by typing in gcloud auth application-default login in your terminal and following instructions. Make sure you proceed to the stage where application_default_credentials.json is created on your machine (For windows users, make a note of the path, this will be required on a later stage)
  6. Clone this repository to your machine.

For DBT

  1. Setup a service account for your GCP project called dop-dbt-user in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
  2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  3. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on the development project just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Instructions for Setting things up

Run Airflow with DOP in Docker - Mac

See README in the service project setup and follow instructions.

Once it's setup, you should see example DOP DAGs such as dop__example_covid19 Airflow in Docker

Run Airflow with DOP in Docker - Windows

This is currently working in progress, however the instructions on what needs to be done is in the Makefile

Run on Composer

Prerequisites

  1. Create a dedicate service account for Composer and call it dop-composer-user with following roles at project level
    • roles/bigquery.dataEditor
    • roles/bigquery.jobUser
    • roles/composer.worker
    • roles/compute.viewer
  2. Create a dedicated service account for DBT with limited permissions.
    1. [Already done in here if it’s DEV] Call it dop-dbt-user@<GCP project id> and create in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. [Already done in here if it’s DEV] Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
    3. The dop-composer-user will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Create Composer Cluster

  1. Use the service account already created dop-composer-user instead of the default service account
  2. Use the following environment variables
    DOP_PROJECT_ID : {REPLACE WITH THE GCP PROJECT ID WHERE DOP WILL PERSIST ALL DATA TO}
    DOP_LOCATION : {REPLACE WITH GCP REGION LOCATION WHRE DOP WILL PERSIST ALL DATA TO}
    DOP_SERVICE_PROJECT_PATH := {REPLACE WITH THE ABSOLUTE PATH OF THE Service Project, i.e. /home/airflow/gcs/dags/dop_{service project name}
    DOP_INFRA_PROJECT_ID := {REPLACE WITH THE GCP INFRASTRUCTURE PROJECT ID WHERE BUILD ARTIFACTS ARE STORED, i.e. a DBT docker image stored in GCR}
    
    and optionally
    DOP_GCR_PULL_SECRET_NAME:= {This maybe needed if the project storing the gcr images are not he same as where Cloud Composer runs, however this might be a better alternative https://medium.com/google-cloud/using-single-docker-repository-with-multiple-gke-projects-1672689f780c}
    
  3. Add the following Python Packages
    dataclasses==0.7
    
  4. Finally create a new node pool with the following k8 label
    key: cloud.google.com/gke-nodepool
    value: kubernetes-task-pool
    

Deployment

See Service Project README

Misc

Service Account Impersonation

Impersonation is a GCP feature allows a user / service account to impersonate as another service account.
This is a very useful feature and offers the following benefits

  • When doing development locally, especially with automation involved (i.e using Docker), it is very risky to interact with GCP services by using your user account directly because it may have a lot of permissions. By impersonate as another service account with less permissions, it is a lot safer (least privilege)
  • There is no credential needs to be downloaded, all permissions are linked to the user account. If an employee leaves the company, access to GCP will be revoked immediately because the impersonation process is no longer possible

The following diagram explains how we use Impersonation in DOP when it runs in Docker DOP Docker Account Impersonation

And when running DBT jobs on production, we are also using this technique to use the composer service account to impersonate as the dop-dbt-user service account so that service account keys are not required.

There are two very google articles explaining how impersonation works and why using it

You might also like...
Cross-platform config and manager for click console utilities.

climan Help the project financially: Donate: https://smartlegion.github.io/donate/ Yandex Money: https://yoomoney.ru/to/4100115206129186 PayPal: https

YourCity is a platform to match people to their prefect city.
YourCity is a platform to match people to their prefect city.

YourCity YourCity is a city matching App that matches users to their ideal city. It is a fullstack React App made with a Redux state manager and a bac

A multi-platform fuzzer for poking at userland binaries and servers

litefuzz A multi-platform fuzzer for poking at userland binaries and servers litefuzz intro why how it works what it does what it doesn't do support p

A platform for developers 👩‍💻  who wants to share their programs and projects.
A platform for developers 👩‍💻 who wants to share their programs and projects.

Fest-Practice-2021 This project is excluded from Hacktoberfest 2021. Please use this as a testing repo/project. A platform for developers 👩‍💻 who wa

Speed up your typing by some exercises in the multi-platform(Windows/Ubuntu).

Introduction This project purpose is speed up your typing by some exercises in the multi-platform(Windows/Ubuntu). Build Environment Software Environm

An Airdrop alternative for cross-platform users only for desktop with Python

PyDrop An Airdrop alternative for cross-platform users only for desktop with Python, -version 1.0 with less effort, just as a practice. ##############

Platform Tree for Xiaomi Redmi Note 7/7S (lavender)
Platform Tree for Xiaomi Redmi Note 7/7S (lavender)

The Xiaomi Redmi Note 7 (codenamed "lavender") is a mid-range smartphone from Xiaomi announced in January 2019. Device specifications Device Xiaomi Re

A Classroom Engagement Platform

Project Introduction This is project introduction Setup Setting up Postgres This is the most tricky part when setting up the application. You will nee

Traffic flow test platform, especially for reinforcement learning
Traffic flow test platform, especially for reinforcement learning

Traffic Flow Test Platform Traffic flow test platform, especially for reinforcement learning, named TFTP. A traffic signal control framework that can

Comments
  • Release DOP v0.3.0

    Release DOP v0.3.0

    A number of new features where added in this version

    DOP v0.3.0 — 2021-08-11

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    opened by dinigo 0
Releases(v0.3.0)
  • v0.3.0(Aug 17, 2021)

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Mar 30, 2021)

Owner
Datatonic
We accelerate business impact through Machine Learning and Analytics
Datatonic
Lightweight library for accessing data and configuration

accsr This lightweight library contains utilities for managing, loading, uploading, opening and generally wrangling data and configurations. It was ba

appliedAI Initiative 7 Mar 09, 2022
Python library for datamining glitch information from Gen 1 Pokémon GameBoy ROMs

g1utils This is a Python library for datamining information about various glitches (glitch Pokémon, glitch maps, etc.) from Gen 1 Pokémon ROMs. TODO A

1 Jan 13, 2022
Adam with minor modifications which give significant improvement

BAdam Modification of Adam [1] optimizer with increased stability and better performance. Tricks used: Decoupled weight decay as in AdamW [2]. Such de

19 May 11, 2022
Painel simples com consulta de cep,CNPJ,placa e ip

Painel mpm Um painel simples com consultas de IP, CNPJ, CEP e PLACA Início 🌐 apt update && apt upgrade -y pkg i python git pip install requests Insta

8 Feb 27, 2022
An awesome script to convert the University Of Oviedo web calendar to Google or Outlook calendars.

autoUniCalendar Un script en Python para convertir el calendario de la intranet de la Universidad de Oviedo en un calendario de Outlook o Google Calen

Bimo99B9 14 Sep 28, 2022
Fixed waypoint(pose) navigation for turtlebot simulation.

Turtlebot-NavigationStack-Fixed-Waypoints fixed waypoint(pose) navigation for turtlebot simulation. Task Details Task Permformed using Navigation Stac

Shanmukha Vishnu 1 Apr 08, 2022
Fastest Semantle solver this side of the Mississippi

semantle Fastest Semantle solver this side of the Mississippi. Roughly 3 average turns to win Measured against (part of) the word2vec-google-news-300

Frank Odom 8 Dec 26, 2022
Yandex Media Browser

Браузер медиа для плагина Yandex Station Включайте музыку, плейлисты и радио на Яндекс.Станции из Home Assistant! Скриншот Корневой раздел: Библиотека

Alexander Ryazanov 35 Dec 19, 2022
Rofi script to minimize / unminimize multiple windows in qtile

Qminimize Rofi script to minimize / unminimize multiple windows in qtile Additional requirements : EWMH module fuzzywuzzy module How to use it : - Clo

9 Sep 18, 2022
MIT version of the PyMca XRF Toolkit

PyMca This is the MIT version of the PyMca XRF Toolkit. Please read the LICENSE file for details. Installation Ready-to-use packages are available for

V. Armando Solé 43 Nov 23, 2022
A test repository to build a python package and publish the package to Artifact Registry using GCB

A test repository to build a python package and publish the package to Artifact Registry using GCB. Then have the package be a dependency in a GCF function.

1 Feb 09, 2022
A fast python implementation of DTU MVS 2014 evaluation

DTUeval-python A python implementation of DTU MVS 2014 evaluation. It only takes 1min for each mesh evaluation. And the gap between the two implementa

82 Dec 27, 2022
Wunderland desktop wallpaper and Microsoft Teams background.

Wunderland Professional Impress your colleagues, friends and family with this edition of the "Wunderland" wallpaper. With the nostalgic feel of the or

3 Dec 14, 2022
A Python version of Canvacord

A copy of canvacord made in python! Installation Run any of these commands in terminal: Mac / Linux pip install canvacord Windows python -m pip insta

10 Mar 28, 2022
dbt adapter for Firebolt

dbt-firebolt dbt adapter for Firebolt dbt-firebolt supports dbt 0.21 and newer Installation First, download the JDBC driver and place it wherever you'

23 Dec 14, 2022
BestBuy Script Designed to purchase any item when it becomes available.

prerequisites: Selnium; undetected-chromedriver. This Script is designed to order an Item provided a link from BestBuy.com only.

Bransen Smith 0 Jan 12, 2022
An open-source systems and controls toolbox for Python3

harold A control systems package for Python=3.6. Introduction This package is written with the ambition of providing a full-fledged control systems s

Ilhan Polat 157 Dec 05, 2022
Slientruss3d : Python for stable truss analysis tool

slientruss3d : Python for stable truss analysis tool Desciption slientruss3d is a python package which can solve the resistances, internal forces and

3 Dec 26, 2022
A bunch of codes for procedurally modeling and texturing futuristic cities.

Procedural Futuristic City This is our final project for CPSC 479. We created a procedural futuristic city complete with Worley noise procedural textu

1 Dec 22, 2021
A MCPI hack with many features.

Morpheus 2.0 A MCPI hack with many features To Use: You will need to install the keyboard, pysimplegui, and MCPI python modules and you will need to e

11 Oct 11, 2022