Introduction

This repository shows how to integrate Zeppelin with Airflow. The philosophy behind the integration is to make the transition from the development stage to the production stage as smooth as possible.
Zeppelin is good at data pipeline development (Spark, Flink, Hive, Python, Shell, etc.), while Airflow is the de facto standard for job orchestration.

How to run it

Step 1. Initialize the environment

Run the following commands to initialize the environment. This will:

Download Spark, which is used by Zeppelin
Download the Zeppelin Airflow plugins

git clone https://github.com/zjffdu/zeppelin_airflow.git
cd zeppelin_airflow
./init.sh

Step 2. Start Zeppelin + Airflow via docker-compose

docker-compose -f docker-compose-LocalExecutor.yml up -d

Step 3. Use Zeppelin + Airflow

Open http://localhost:8085 for Zeppelin and http://localhost:8083 for Airflow.
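
The containers can take a little while to start answering after docker-compose returns. The following is a minimal sketch of a readiness check, assuming curl is installed on the host; the function name wait_for_http and its default attempt count and delay are illustrative and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this repository).
# Polls a URL until it answers without an HTTP error, or gives up.
# Arguments: URL, max attempts (default 30), seconds between attempts (default 2).
wait_for_http() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: treat HTTP status >= 400 as failure, -o: discard the body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "up: $url"
      return 0
    fi
    # Only pause when another attempt will follow
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not reachable after $attempts attempts: $url" >&2
  return 1
}
```

With the function defined, running wait_for_http http://localhost:8085 followed by wait_for_http http://localhost:8083 checks Zeppelin and then Airflow on the ports given above.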

Airflow contains one DAG, zeppelin_example. This DAG runs 3 Zeppelin notes:

Python Tutorial/01. IPython Basics
Spark Tutorial/02. Spark Basics Features
Spark Tutorial/03. Spark SQL (PySpark)

Once you enable it, Airflow will run these Zeppelin notes.

Zeppelin does not run these notes directly; instead, it clones each note and runs the cloned note.
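
To illustrate the clone-then-run idea, here is a rough sketch that uses Zeppelin's notebook REST API from the shell: one POST to clone a note, then one POST to run all paragraphs of the clone. This is an assumption about how such a flow can be reproduced by hand, not a description of how the plugin in this repository is implemented. The function names, the ZEPPELIN_URL variable, and the clone name are hypothetical, and the note ID must be looked up in your own Zeppelin instance.

```shell
#!/bin/sh
# Assumption: Zeppelin is reachable on the port used in Step 3.
ZEPPELIN_URL=${ZEPPELIN_URL:-http://localhost:8085}

# Hypothetical helper: read a Zeppelin JSON reply on stdin, e.g.
# {"status":"OK","message":"","body":"2AZPHY918"}, and print the "body" string.
extract_body() {
  sed -n 's/.*"body"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: clone the note with the given ID under a new name,
# run every paragraph of the clone, and print the clone's ID.
clone_and_run() {
  note_id=$1
  clone_name=$2
  # Clone: POST /api/notebook/<noteId> with the new name as JSON
  clone_id=$(curl -sf -X POST -H 'Content-Type: application/json' \
    -d "{\"name\": \"$clone_name\"}" \
    "$ZEPPELIN_URL/api/notebook/$note_id" | extract_body)
  # An empty ID means the clone request failed or returned no body
  [ -n "$clone_id" ] || return 1
  # Run all paragraphs of the clone: POST /api/notebook/job/<noteId>
  curl -sf -X POST "$ZEPPELIN_URL/api/notebook/job/$clone_id" >/dev/null || return 1
  echo "$clone_id"
}
```

Running the clone rather than the original leaves the note you are still developing untouched while the scheduled copy executes.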

More features will come soon, stay tuned.

Owner

Jeff Zhang
