Overview

redun

yet another redundant workflow engine

redun aims to be a more expressive and efficient workflow framework, built on top of the popular Python programming language. It takes the somewhat contrarian view that writing dataflows directly is unnecessarily restrictive, and that by doing so we lose abstractions we have come to rely on in most modern high-level languages (control flow, composability, recursion, higher-order functions, etc.). redun's key insight is that workflows can be expressed as lazy expressions, which are then evaluated by a scheduler that performs automatic parallelization, caching, and data provenance logging.

redun's key features are:

  • Workflows are defined by lazy expressions that, when evaluated, emit dynamic directed acyclic graphs (DAGs), enabling complex data flows (see the sketch below).
  • Incremental computation that is reactive to both data changes and code changes.
  • Workflow tasks can be executed on a variety of compute backends (threads, processes, AWS Batch jobs, Spark jobs, etc.).
  • Data changes are detected for in-memory values as well as external data sources such as files and object stores, using file hashing.
  • Code changes are detected by hashing individual Python functions and comparing against historical call graph recordings.
  • Past intermediate results are cached centrally and reused across workflows.
  • Past call graphs can be used as a data lineage record and can be queried for debugging and auditing.
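
To make the lazy-expression point concrete, here is a minimal sketch of the evaluation model (the task names and values are illustrative, not taken from redun's docs):

from redun import Scheduler, task

redun_namespace = "redun.examples.lazy"

@task()
def add(a: int, b: int) -> int:
    return a + b

@task()
def main() -> int:
    x = add(1, 2)   # Not executed yet: calling a task returns a lazy Expression.
    y = add(x, 4)   # Expressions nest, forming a DAG the scheduler can parallelize.
    return y

if __name__ == "__main__":
    scheduler = Scheduler()
    print(scheduler.run(main()))  # Evaluates the DAG and prints 7.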

See the docs, tutorial, and influences for more.

About the name: The name "redun" is self-deprecating (there are A LOT of workflow engines), but it is also a reference to its original inspiration, the redo build system.

Install

pip install redun

See developing for more information on working with the code.

Postgres backend

To use Postgres as a recording backend, use:

pip install redun[postgres]

The above assumes the following dependencies are installed:

  • pg_config (in the postgresql-devel package; on Ubuntu: apt-get install libpq-dev)
  • gcc (on Ubuntu or similar: sudo apt-get install gcc)
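
Once the extra is installed, point redun at the database via the backend configuration. A minimal sketch, assuming the default .redun/redun.ini config file (the connection string below is a placeholder):

# .redun/redun.ini
[backend]
db_uri = postgresql://user:password@localhost:5432/redun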

Use cases

redun's general approach to defining workflows makes it a good choice for implementing workflows for a wide variety of use cases.

Small taste

Here is a quick example of using redun for a familiar workflow, compiling a C program (full example). In general, any kind of data processing could be done within each task (e.g. reading and writing CSVs, DataFrames, databases, APIs).

File: """ Compile one C file into an object file. """ os.system(f"gcc -c {c_file.path}") return File(c_file.path.replace(".c", ".o")) @task() def link(prog_path: str, o_files: List[File]) -> File: """ Link several object files together into one program. """ o_files=" ".join(o_file.path for o_file in o_files) os.system(f"gcc -o {prog_path} {o_files}") return File(prog_path) @task() def make_prog(prog_path: str, c_files: List[File]) -> File: """ Compile one program from its source C files. """ o_files = [ compile(c_file) for c_file in c_files ] prog_file = link(prog_path, o_files) return prog_file # Definition of programs and their source C files. files = { "prog": [ File("prog.c"), File("lib.c"), ], "prog2": [ File("prog2.c"), File("lib.c"), ], } @task() def make(files : Dict[str, List[File]] = files) -> List[File]: """ Top-level task for compiling all the programs in the project. """ progs = [ make_prog(prog_path, c_files) for prog_path, c_files in files.items() ] return progs ">
# make.py

import os
from typing import Dict, List

from redun import task, File


redun_namespace = "redun.examples.compile"


@task()
def compile(c_file: File) -> File:
    """
    Compile one C file into an object file.
    """
    os.system(f"gcc -c {c_file.path}")
    return File(c_file.path.replace(".c", ".o"))


@task()
def link(prog_path: str, o_files: List[File]) -> File:
    """
    Link several object files together into one program.
    """
    o_files=" ".join(o_file.path for o_file in o_files)
    os.system(f"gcc -o {prog_path} {o_files}")
    return File(prog_path)


@task()
def make_prog(prog_path: str, c_files: List[File]) -> File:
    """
    Compile one program from its source C files.
    """
    o_files = [
        compile(c_file)
        for c_file in c_files
    ]
    prog_file = link(prog_path, o_files)
    return prog_file


# Definition of programs and their source C files.
files = {
    "prog": [
        File("prog.c"),
        File("lib.c"),
    ],
    "prog2": [
        File("prog2.c"),
        File("lib.c"),
    ],
}


@task()
def make(files: Dict[str, List[File]] = files) -> List[File]:
    """
    Top-level task for compiling all the programs in the project.
    """
    progs = [
        make_prog(prog_path, c_files)
        for prog_path, c_files in files.items()
    ]
    return progs

Notice that, besides the @task decorator, the code follows typical Python conventions and is organized like a sequential program.

We can run the workflow using the redun run command:

redun run make.py make

[redun] redun :: version 0.4.15
[redun] config dir: /Users/rasmus/projects/redun/examples/compile/.redun
[redun] Upgrading db from version -1.0 to 2.0...
[redun] Start Execution 69c40fe5-c081-4ca6-b232-e56a0a679d42:  redun run make.py make
[redun] Run    Job 72bdb973:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)], 'prog2': [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)]}) on default
[redun] Run    Job 096be12b:  redun.examples.compile.make_prog(prog_path='prog', c_files=[File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)]) on default
[redun] Run    Job 32ed5cf8:  redun.examples.compile.make_prog(prog_path='prog2', c_files=[File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)]) on default
[redun] Run    Job dfdd2ee2:  redun.examples.compile.compile(c_file=File(path=prog.c, hash=dfa3aba7)) on default
[redun] Run    Job 225f924d:  redun.examples.compile.compile(c_file=File(path=lib.c, hash=a2e6cbd9)) on default
[redun] Run    Job 3f9ea7ae:  redun.examples.compile.compile(c_file=File(path=prog2.c, hash=c748e4c7)) on default
[redun] Run    Job a8b21ec0:  redun.examples.compile.link(prog_path='prog', o_files=[File(path=prog.o, hash=4934098e), File(path=lib.o, hash=7caa7f9c)]) on default
[redun] Run    Job 5707a358:  redun.examples.compile.link(prog_path='prog2', o_files=[File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=7caa7f9c)]) on default
[redun]
[redun] | JOB STATUS 2021/06/18 10:34:29
[redun] | TASK                             PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                    0       0       0       0       8       8
[redun] | redun.examples.compile.compile         0       0       0       0       3       3
[redun] | redun.examples.compile.link            0       0       0       0       2       2
[redun] | redun.examples.compile.make            0       0       0       0       1       1
[redun] | redun.examples.compile.make_prog       0       0       0       0       2       2
[redun]
[File(path=prog, hash=a8d14a5e), File(path=prog2, hash=04bfff2f)]

This should have taken three C source files (lib.c, prog.c, and prog2.c), compiled them to three object files (lib.o, prog.o, prog2.o), and then linked them into two binaries (prog and prog2). Specifically, redun automatically determined the following dataflow DAG and performed the compiling and linking steps in separate threads:
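
Sketched roughly in text, the DAG looks like this:

prog.c  ──compile──▶ prog.o ─┐
                             ├──link──▶ prog
lib.c   ──compile──▶ lib.o ──┤
                             ├──link──▶ prog2
prog2.c ──compile──▶ prog2.o ┘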

Using the redun log command, we can see the full job tree of the most recent execution (denoted -):

redun log -

Exec 69c40fe5-c081-4ca6-b232-e56a0a679d42 [ DONE ] 2021-06-18 10:34:28:  run make.py make
Duration: 0:00:01.47

Jobs: 8 (DONE: 8, CACHED: 0, FAILED: 0)
--------------------------------------------------------------------------------
Job 72bdb973 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)], 'prog2': [File(path=prog2.c, hash=c748e4c7), Fil
  Job 096be12b [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make_prog('prog', [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)])
    Job dfdd2ee2 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=prog.c, hash=dfa3aba7))
    Job 225f924d [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=lib.c, hash=a2e6cbd9))
    Job a8b21ec0 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.link('prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=7caa7f9c)])
  Job 32ed5cf8 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make_prog('prog2', [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)])
    Job 3f9ea7ae [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=prog2.c, hash=c748e4c7))
    Job 5707a358 [ DONE ] 2021-06-18 10:34:29:  redun.examples.compile.link('prog2', [File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=7caa7f9c)])

Notice that redun automatically detected that lib.c only needed to be compiled once and that its result could be reused (a form of common subexpression elimination).

Using the --file option, we can see all files (or URLs) that were read, r, or written, w, by the workflow:

redun log --file

File 2b6a7ce0 2021-06-18 11:41:42 r  lib.c
File d90885ad 2021-06-18 11:41:42 rw lib.o
File 2f43c23c 2021-06-18 11:41:42 w  prog
File dfa3aba7 2021-06-18 10:34:28 r  prog.c
File 4934098e 2021-06-18 10:34:28 rw prog.o
File b4537ad7 2021-06-18 11:41:42 w  prog2
File c748e4c7 2021-06-18 10:34:28 r  prog2.c
File cd0b6b7e 2021-06-18 10:34:28 rw prog2.o

We can also look at the provenance of a single file, such as the binary prog:

redun log prog

File 2f43c23c 2021-06-18 11:41:42 w  prog
Produced by Job a8b21ec0

  Job a8b21ec0-e60b-4486-bcf4-4422be265608 [ DONE ] 2021-06-18 11:41:42:  redun.examples.compile.link('prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)])
  Traceback: Exec 4a2b624d > (1 Job) > Job 2f8b4b5f make_prog > Job a8b21ec0 link
  Duration: 0:00:00.24

    CallNode 6c56c8d472dc1d07cfd2634893043130b401dc84 redun.examples.compile.link
      Args:   'prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]
      Result: File(path=prog, hash=2f43c23c)

    Task a20ef6dc2ab4ed89869514707f94fe18c15f8f66 redun.examples.compile.link

      def link(prog_path: str, o_files: List[File]) -> File:
          """
          Link several object files together into one program.
          """
          o_files=" ".join(o_file.path for o_file in o_files)
          os.system(f"gcc -o {prog_path} {o_files}")
          return File(prog_path)


    Upstream dataflow:

      result = File(path=prog, hash=2f43c23c)

      result <-- <6c56c8d4> link(prog_path, o_files)
        prog_path = 'prog'
        o_files   = [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]

      prog_path <-- argument of <74cceb4e> make_prog(prog_path, c_files)
                <-- origin

      o_files <-- derives from
        compile_result   = <d90885ad> File(path=lib.o, hash=d90885ad)
        compile_result_2 = <4934098e> File(path=prog.o, hash=4934098e)

      compile_result <-- <45054a8f> compile(c_file)
        c_file = <2b6a7ce0> File(path=lib.c, hash=2b6a7ce0)

      c_file <-- argument of <74cceb4e> make_prog(prog_path, c_files)
             <-- argument of <45400ab5> make(files)
             <-- origin

      compile_result_2 <-- <8d85cebc> compile(c_file_2)
        c_file_2 = <dfa3aba7> File(path=prog.c, hash=dfa3aba7)

      c_file_2 <-- argument of <74cceb4e> make_prog(prog_path, c_files)
               <-- argument of <45400ab5> make(files)
               <-- origin

This output shows the original link task source code responsible for creating the program prog, as well as the full derivation, denoted "upstream dataflow". See the full example for a deeper explanation of this output. To understand more about the data structure that powers these kinds of queries, see call graphs.

We can change one of the input files, such as lib.c, and rerun the workflow. Due to redun's automatic incremental compute, only the minimal tasks are rerun:

redun run make.py make

[redun] redun :: version 0.4.15
[redun] config dir: /Users/rasmus/projects/redun/examples/compile/.redun
[redun] Start Execution 4a2b624d-b6c7-41cb-acca-ec440c2434db:  redun run make.py make
[redun] Run    Job 84d14769:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=2b6a7ce0)], 'prog2': [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=2b6a7ce0)]}) on default
[redun] Run    Job 2f8b4b5f:  redun.examples.compile.make_prog(prog_path='prog', c_files=[File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=2b6a7ce0)]) on default
[redun] Run    Job 4ae4eaf6:  redun.examples.compile.make_prog(prog_path='prog2', c_files=[File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=2b6a7ce0)]) on default
[redun] Cached Job 049a0006:  redun.examples.compile.compile(c_file=File(path=prog.c, hash=dfa3aba7)) (eval_hash=434cbbfe)
[redun] Run    Job 0f8df953:  redun.examples.compile.compile(c_file=File(path=lib.c, hash=2b6a7ce0)) on default
[redun] Cached Job 98d24081:  redun.examples.compile.compile(c_file=File(path=prog2.c, hash=c748e4c7)) (eval_hash=96ab0a2b)
[redun] Run    Job 8c95f048:  redun.examples.compile.link(prog_path='prog', o_files=[File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]) on default
[redun] Run    Job 9006bd19:  redun.examples.compile.link(prog_path='prog2', o_files=[File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=d90885ad)]) on default
[redun]
[redun] | JOB STATUS 2021/06/18 11:41:43
[redun] | TASK                             PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                    0       0       0       2       6       8
[redun] | redun.examples.compile.compile         0       0       0       2       1       3
[redun] | redun.examples.compile.link            0       0       0       0       2       2
[redun] | redun.examples.compile.make            0       0       0       0       1       1
[redun] | redun.examples.compile.make_prog       0       0       0       0       2       2
[redun]
[File(path=prog, hash=2f43c23c), File(path=prog2, hash=b4537ad7)]

Notice that two of the compile jobs are cached (prog.c and prog2.c), while compiling the changed library lib.c and the downstream link steps correctly rerun.

Check out the examples for more example workflows and features of redun. Also, see the design notes for more information on redun's design.

Mixed compute backends

In the above example, each task ran in its own thread. However, more generally each task can run in its own process, Docker container, AWS Batch job, or Spark job. With minimal configuration, users can lightly annotate where they would like each task to run. redun will automatically handle the data and code movement as well as backend scheduling:

@task(executor="process")
def a_process_task(a):
    # This task runs in its own process.
    b = a_batch_task(a)
    c = a_spark_task(b)
    return c

@task(executor="batch", memory=4, vcpus=5)
def a_batch_task(a):
    # This task runs in its own AWS Batch job.
    # ...

@task(executor="spark")
def a_spark_task(b):
    # This task runs in its own Spark job.
    sc = get_spark_context()
    # ...
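
As a rough sketch of what that configuration looks like, executors are declared in .redun/redun.ini; the image, queue, and bucket names below are placeholders, and the exact options for each executor type are covered in the executor documentation:

# .redun/redun.ini
[executors.batch]
type = aws_batch
image = YOUR_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/your-image
queue = your-batch-queue
s3_scratch = s3://your-bucket/redun/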

See the executor documentation for more.

What's the trick?

How did redun automatically perform parallel compute, caching, and data provenance in the example above? The trick is that redun builds up an expression graph representing the workflow and evaluates the expressions using graph reduction. For example, the workflow above goes through roughly the following evaluation process (a simplified sketch; at each reduction step the real scheduler also checks its cache before running a task):
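
make(files)
=> [make_prog('prog', [prog.c, lib.c]), make_prog('prog2', [prog2.c, lib.c])]
=> [link('prog',  [compile(prog.c), compile(lib.c)]),
    link('prog2', [compile(prog2.c), compile(lib.c)])]
=> [link('prog',  [File(prog.o), File(lib.o)]),
    link('prog2', [File(prog2.o), File(lib.o)])]
=> [File(prog), File(prog2)]

Note that the two occurrences of compile(lib.c) reduce to the same expression, which is why lib.c is compiled only once.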

For a more in-depth walk-through, see the scheduler tutorial.

Why not another workflow engine?

redun focuses on making multi-domain scientific pipelines easy to develop and deploy. Its automatic parallelism, caching, code and data reactivity, and data provenance features make it a great fit for such work. However, redun does not attempt to solve all possible workflow problems, so it's perfectly reasonable to supplement it with other tools. For example, while redun provides a very expressive way to define task parallelism, it does not attempt to perform the kind of fine-grained data parallelism more commonly provided by Spark or Dask. Fortunately, redun does not perform any "dirty tricks" (e.g. complex static analysis or call stack manipulation), so we have found it possible to safely combine redun with other frameworks (e.g. pyspark, pytorch, Dask, etc.) to achieve the benefits of each tool.

Lastly, redun does not provide its own compute cluster, but instead builds upon other systems that do, such as cloud provider services for batch Docker jobs or Spark jobs.

For more details on how redun compares to other related ideas, see the influences section.
