ETL flow framework based on Yaml configs in Python

Last update: Jul 06, 2022

Overview

A lightweight framework for building data flows. Flows are configured in YAML files. It supports schedules, task pools, and concurrency limits. It runs quickly, needs few resources, and works on Windows and Linux. Flows run in parallel via the threading library, state is kept internally in a SQLite database, data transformation is built in, and there is a web interface.
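To illustrate what a task pool with a concurrency limit means when flows run in threads, here is a minimal sketch using only the standard library. It is not FlowMaster's implementation: the pool names, the limits, and the functions `run_in_pool`, `fake_export`, and `run_flows` are hypothetical and exist only for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool definitions: pool name -> maximum concurrent tasks.
POOL_LIMITS = {"api": 2, "db": 1}

# One semaphore per pool enforces that pool's concurrency limit.
_semaphores = {
    name: threading.BoundedSemaphore(limit) for name, limit in POOL_LIMITS.items()
}

# Bookkeeping used only to observe how many tasks ran at the same time.
_lock = threading.Lock()
_active = {name: 0 for name in POOL_LIMITS}
peak = {name: 0 for name in POOL_LIMITS}


def run_in_pool(pool_name, task, *args):
    """Run task(*args) while holding one slot of the named pool."""
    with _semaphores[pool_name]:
        with _lock:
            _active[pool_name] += 1
            peak[pool_name] = max(peak[pool_name], _active[pool_name])
        try:
            return task(*args)
        finally:
            with _lock:
                _active[pool_name] -= 1


def fake_export(n):
    """Stand-in for a real export step (hypothetical)."""
    time.sleep(0.01)
    return n * 2


def run_flows(count=6, pool_name="api"):
    """Start `count` tasks in threads; the pool caps how many run at once."""
    with ThreadPoolExecutor(max_workers=count) as executor:
        futures = [
            executor.submit(run_in_pool, pool_name, fake_export, i)
            for i in range(count)
        ]
        return [future.result() for future in futures]
```

Calling `run_flows(6, "api")` starts six threads, but the semaphore lets at most two of them execute `fake_export` at any moment; `peak["api"]` records the highest concurrency observed.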

Connectors are currently available for the following sources

CSV file
SQLite
Postgres
MySQL
Yandex Metrika Management API
Yandex Metrika Stats API
Yandex Metrika Logs API
Yandex Direct API
Yandex Direct Report API
Criteo
Google Sheets

Storages

CSV file
Clickhouse

Documentation

Requirements

Python >= 3.9
virtual environment
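Both requirements can be checked from Python before installing. The sketch below uses only the standard library; the function names are made up for this example. The virtual-environment check compares `sys.prefix` with `sys.base_prefix`, which detects environments created by `venv` or recent `virtualenv`, but not conda environments.

```python
import sys

# Minimum version stated in the requirements above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def in_virtual_environment():
    """Return True inside a venv/virtualenv (conda is not detected)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Virtual environment active:", in_virtual_environment())
```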

Settings

Installing in a virtual environment is highly recommended.

FlowMaster needs a home directory. The default is '{HOME}/FlowMaster',
but you can optionally choose another location by setting the
FLOWMASTER_HOME environment variable.

For Windows

setx FLOWMASTER_HOME "{YOUR_PATH}"

For Linux

export FLOWMASTER_HOME={YOUR_PATH}
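The lookup described above (use FLOWMASTER_HOME if set, otherwise '{HOME}/FlowMaster') can be sketched in Python as follows. This is an illustration of the documented behaviour, not FlowMaster's own code: the function name `resolve_flowmaster_home` is hypothetical, and expanding `~` in the value is an assumption of this sketch.

```python
import os
from pathlib import Path


def resolve_flowmaster_home(environ=None):
    """Return the home directory implied by the environment.

    `environ` defaults to os.environ; passing a dict makes testing easy.
    """
    env = os.environ if environ is None else environ
    custom = env.get("FLOWMASTER_HOME")
    if custom:
        # Assumption of this sketch: a leading "~" is expanded.
        return Path(custom).expanduser()
    # Documented default: {HOME}/FlowMaster
    return Path.home() / "FlowMaster"


if __name__ == "__main__":
    print(resolve_flowmaster_home())
```

Printing the result is a quick way to confirm which directory a given shell session points at, since `export` only affects the current session and processes started from it.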

Installing

pip install flowmaster==0.7.1

# To install the web UI.
pip install flowmaster[webui]==0.7.1

# Optional libraries.
pip install flowmaster[clickhouse,postgres,mysql,yandexdirect,yandexmetrika,criteo,googlesheets]==0.7.1

Run

flowmaster run --help
flowmaster run

WEB UI

http://localhost:8822

CHANGELOG

Support

Telegram support chat

Author

Pavel Maksimov

Contacts: Telegram, Facebook

Good luck, friend! Leave a star ;)

Comments

No such file or directory: '/home/ubuntu/FlowMaster/pools.yaml'

Hi, this is a very good project, but I ran into the following problem while installing the library:

With vanilla Python pip, the package is not visible at all.
When installing via conda, the installation goes fine, but on launch I get:

(base) [email protected]:~/FlowMaster$ flowmaster run
Traceback (most recent call last):
  File "/home/ubuntu/miniforge3/bin/flowmaster", line 5, in <module>
    from flowmaster.__main__ import app
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/__main__.py", line 9, in <module>
    import flowmaster.cli.notebook
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/cli/notebook.py", line 5, in <module>
    from flowmaster.service import (
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/service.py", line 11, in <module>
    from flowmaster.operators.etl.policy import ETLNotebook
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/__init__.py", line 3, in <module>
    from flowmaster.operators.etl.providers.abstract import ProviderAbstract, ExportAbstract
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/providers/__init__.py", line 4, in <module>
    from flowmaster.operators.etl.providers.criteo import CriteoProvider
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/providers/criteo/__init__.py", line 2, in <module>
    from flowmaster.operators.etl.providers.criteo.export import (
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/providers/criteo/export.py", line 8, in <module>
    from flowmaster.executors import SleepIteration
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/executors/__init__.py", line 16, in <module>
    from flowmaster.pool import pools
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/pool.py", line 106, in <module>
    pools_dict = YamlHelper.parse_file(str(Settings.POOL_CONFIG_FILEPATH))
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/utils/yaml_helper.py", line 14, in parse_file
    with open(path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FlowMaster/pools.yaml'

What am I doing wrong? :(

opened by micweeks 1

Releases(0.7.1)

0.7.1(Aug 29, 2021)
prevented tasks from being scheduled from a single instance of the operator class

fixed error GeneratorExit

fixed transform array type for Clickhouse loader

0.6.1(Jun 22, 2021)
Redesigned executor

New

add policies 'time_limit_seconds_from_worktime', 'soft_time_limit_seconds'.

add provider 'flowmaster'

Fixes

fix schedule (interval seconds mode)

add logging 'loguru'

fix clear_statuses_of_lost_items

fix allow_execute_flow

change command 'db reset'

There are backward incompatible changes

new field 'expires_utc' in FlowItem

rename command 'run' to 'run_local' and rename command 'run_thread' to 'run'

add new class ExecutorIterationTask.

change, move, and rename class ThreadExecutor to ThreadAsyncExecutor.

change and rename class SleepTask to SleepIteration.

change and rename class TaskPool to NextIterationInPools.

ETLOperator return ExecutorIterationTask.

rename func order_flow to ordering_flow_tasks.

rename func start_executor to sync_executor.

rename field FlowItem.config_hash to FlowItem.notebook_hash

change FLOW_CONFIGS_DIR and rename FLOW_CONFIGS_DIR to NOTEBOOKS_DIR

rename objects config to notebook

add class Settings

0.5.0(May 25, 2021)

0.3.1(May 15, 2021)
There are backward incompatible changes

Add local executor

Fix Yandex Direct provider

Refactoring

0.2.2(May 13, 2021)

Add provider Yandex Direct

Refactoring

Incompatible changes
0.1.3(May 2, 2021)

0.1.0(May 1, 2021)


Owner

Павел Максимов

Python Data Engineer, Python Developer, ETL, Recommender Systems Developer
