Full automated data pipeline using docker images

Overview

Create postgres tables from CSV files

This first section is only relate to creating tables from CSV files using postgres container alone. Just one of my experiments. If you interest, you can just follow these steps (only if your working environment support bash):

sh scripts/prep.sh

The prep.sh will handle everything for you by doing follwing:

# Start postgres db container
docker-compose -f postgres.yaml up -d
# Sleep to make sure the container is fully up running
sleep 3 

# I have problem with mouting csv files via docker compose, so here we go
# Copy csv and setup.sql to create required tables
docker cp ./csv/ my_postgres:
docker cp ./scripts/setup.sql my_postgres:setup.sql

# Execute the script in postgres db
docker exec -it my_postgres psql -p5432 --dbname=postgres --username=postgres --file=setup.sql 

# Shutdown the container
docker-compose -f postgres.yaml down --remove-orphans

I had problem with mount volumn that I can't mount the files under csv and scripts folders. Which still can be improved with a proper mount. But let's skip it for now to save time.

Initial Setup/Start Airflow container

This section will use a separate docker-compose.yaml than the above test. It will be relate due to the fact that we want to use airflow to schedule the tasks above (create table and load data). To do so, do the following. First prepare folders. You can call a new folder specifically for this if you want.

# (optional) mkdir airflow && cd airflow
mkdir ./dags ./logs ./plugins

Next we need the airflow docker-compose.yaml in our airflow directory

curl -O https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml

Next make sure we will have a proper permission to initial Airflow

.env ">
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Then we must initial Airflow instance

docker-compose up airflow-init

Wait until the initial finished then (you can use -d to detach if you want)

docker-compose up

Now you will be able to connect to Airflow GUI via http://localhost:8080/

Create Airflow DAG task

First thing, you need to setup connection for postgres database. Go to tab Admin > Connection > +, wow you have to fill details of the connection:

Connection Id: postgres_default
Connection Type: 'Postgres'
Host: 
   
    
Schema: postgres (default)
Login: 
    
     
Password: 
     
      
Port: 
      

      
     
    
   

Click "Test" button to check your connection then save. Now click at the Airflow icon to return to home page. You should see task name "create_postgres_tables". Try to run by clicking start button select "Trigger DAG".

You might also like...
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apac

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities

Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities. This is aimed at those looking to get into the field of Data Science or those who are already in the field and looking to solve a real-world project with python.

Flenser is a simple, minimal, automated exploratory data analysis tool.

Flenser Have you ever been handed a dataset you've never seen before? Flenser is a simple, minimal, automated exploratory data analysis tool. It runs

Lale is a Python library for semi-automated data science.
Lale is a Python library for semi-automated data science.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-safe fashion.

Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Pipeline and Dataset helpers for complex algorithm evaluation.

tpcp - Tiny Pipelines for Complex Problems A generic way to build object-oriented datasets and algorithm pipelines and tools to evaluate them pip inst

Releases(airflow-postgres-dag)
Gaussian processes in TensorFlow

Website | Documentation (release) | Documentation (develop) | Glossary Table of Contents What does GPflow do? Installation Getting Started with GPflow

GPflow 1.7k Jan 06, 2023
talkbox is a scikit for signal/speech processing, to extend scipy capabilities in that domain.

talkbox is a scikit for signal/speech processing, to extend scipy capabilities in that domain.

David Cournapeau 76 Nov 30, 2022
Show you how to integrate Zeppelin with Airflow

Introduction This repository is to show you how to integrate Zeppelin with Airflow. The philosophy behind the ingtegration is to make the transition f

Jeff Zhang 11 Dec 30, 2022
A Numba-based two-point correlation function calculator using a grid decomposition

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Lehman Garrison 3 Aug 24, 2022
Detecting Underwater Objects (DUO)

Underwater object detection for robot picking has attracted a lot of interest. However, it is still an unsolved problem due to several challenges. We take steps towards making it more realistic by ad

27 Dec 12, 2022
Data and code accompanying the paper Politics and Virality in the Time of Twitter

Politics and Virality in the Time of Twitter Data and code accompanying the paper Politics and Virality in the Time of Twitter. In specific: the code

Cardiff NLP 3 Jul 02, 2022
This python script allows you to manipulate the audience data from Sl.ido surveys

Slido-Automated-VoteBot This python script allows you to manipulate the audience data from Sl.ido surveys Since Slido blocks interference from automat

Pranav Menon 1 Jan 24, 2022
A fast, flexible, and performant feature selection package for python.

linselect A fast, flexible, and performant feature selection package for python. Package in a nutshell It's built on stepwise linear regression When p

88 Dec 06, 2022
Intercepting proxy + analysis toolkit for Second Life compatible virtual worlds

Hippolyzer Hippolyzer is a revival of Linden Lab's PyOGP library targeting modern Python 3, with a focus on debugging issues in Second Life-compatible

Salad Dais 6 Sep 01, 2022
Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Aryan Raj 7 Sep 04, 2022
📊 Python Flask game that consolidates data from Nasdaq, allowing the user to practice buying and selling stocks.

Web Trader Web Trader is a trading website that consolidates data from Nasdaq, allowing the user to search up the ticker symbol and price of any stock

Paulina Khew 21 Aug 30, 2022
AWS Glue ETL Code Samples

AWS Glue ETL Code Samples This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilit

AWS Samples 1.2k Jan 03, 2023
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Mobeen Ahmed 1 Nov 01, 2021
An ETL framework + Monitoring UI/API (experimental project for learning purposes)

Fastlane An ETL framework for building pipelines, and Flask based web API/UI for monitoring pipelines. Project structure fastlane |- fastlane: (ETL fr

Dan Katz 2 Jan 06, 2022
Jupyter notebooks for the book "The Elements of Statistical Learning".

This repository contains Jupyter notebooks implementing the algorithms found in the book and summary of the textbook.

Madiyar 369 Dec 30, 2022
Feature Detection Based Template Matching

Feature Detection Based Template Matching The classification of the photos was made using the OpenCv template Matching method. Installation Use the pa

Muhammet Erem 2 Nov 18, 2021
Intake is a lightweight package for finding, investigating, loading and disseminating data.

Intake: A general interface for loading data Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps

Intake 851 Jan 01, 2023
Python Package for DataHerb: create, search, and load datasets.

The Python Package for DataHerb A DataHerb Core Service to Create and Load Datasets.

DataHerb 4 Feb 11, 2022
This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

Ishan Hegde 1 Nov 17, 2021
This module is used to create Convolutional AutoEncoders for Variational Data Assimilation

VarDACAE This module is used to create Convolutional AutoEncoders for Variational Data Assimilation. A user can define, create and train an AE for Dat

Julian Mack 23 Dec 16, 2022