Full automated data pipeline using docker images

Overview

Create postgres tables from CSV files

This first section is only relate to creating tables from CSV files using postgres container alone. Just one of my experiments. If you interest, you can just follow these steps (only if your working environment support bash):

sh scripts/prep.sh

The prep.sh will handle everything for you by doing follwing:

# Start postgres db container
docker-compose -f postgres.yaml up -d
# Sleep to make sure the container is fully up running
sleep 3 

# I have problem with mouting csv files via docker compose, so here we go
# Copy csv and setup.sql to create required tables
docker cp ./csv/ my_postgres:
docker cp ./scripts/setup.sql my_postgres:setup.sql

# Execute the script in postgres db
docker exec -it my_postgres psql -p5432 --dbname=postgres --username=postgres --file=setup.sql 

# Shutdown the container
docker-compose -f postgres.yaml down --remove-orphans

I had problem with mount volumn that I can't mount the files under csv and scripts folders. Which still can be improved with a proper mount. But let's skip it for now to save time.

Initial Setup/Start Airflow container

This section will use a separate docker-compose.yaml than the above test. It will be relate due to the fact that we want to use airflow to schedule the tasks above (create table and load data). To do so, do the following. First prepare folders. You can call a new folder specifically for this if you want.

# (optional) mkdir airflow && cd airflow
mkdir ./dags ./logs ./plugins

Next we need the airflow docker-compose.yaml in our airflow directory

curl -O https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml

Next make sure we will have a proper permission to initial Airflow

.env ">
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Then we must initial Airflow instance

docker-compose up airflow-init

Wait until the initial finished then (you can use -d to detach if you want)

docker-compose up

Now you will be able to connect to Airflow GUI via http://localhost:8080/

Create Airflow DAG task

First thing, you need to setup connection for postgres database. Go to tab Admin > Connection > +, wow you have to fill details of the connection:

Connection Id: postgres_default
Connection Type: 'Postgres'
Host: 
   
    
Schema: postgres (default)
Login: 
    
     
Password: 
     
      
Port: 
      

      
     
    
   

Click "Test" button to check your connection then save. Now click at the Airflow icon to return to home page. You should see task name "create_postgres_tables". Try to run by clicking start button select "Trigger DAG".

You might also like...
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apac

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities

Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities. This is aimed at those looking to get into the field of Data Science or those who are already in the field and looking to solve a real-world project with python.

Flenser is a simple, minimal, automated exploratory data analysis tool.

Flenser Have you ever been handed a dataset you've never seen before? Flenser is a simple, minimal, automated exploratory data analysis tool. It runs

Lale is a Python library for semi-automated data science.
Lale is a Python library for semi-automated data science.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-safe fashion.

Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Pipeline and Dataset helpers for complex algorithm evaluation.

tpcp - Tiny Pipelines for Complex Problems A generic way to build object-oriented datasets and algorithm pipelines and tools to evaluate them pip inst

Releases(airflow-postgres-dag)
Data pipelines built with polars

valves Warning: the project is very much work in progress. Valves is a collection of functions for your data .pipe()-lines. This project aimes to host

14 Jan 03, 2023
Clean and reusable data-sciency notebooks.

KPACUBO KPACUBO is a set Jupyter notebooks focused on the best practices in both software development and data science, namely, code reuse, explicit d

Matvey Morozov 1 Jan 28, 2022
Port of dplyr and other related R packages in python, using pipda.

Unlike other similar packages in python that just mimic the piping syntax, datar follows the API designs from the original packages as much as possible, and is tested thoroughly with the cases from t

179 Dec 21, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
Bamboolib - a GUI for pandas DataFrames

Community repository of bamboolib bamboolib is joining forces with Databricks. For more information, please read our announcement. Please note that th

Tobias Krabel 863 Jan 08, 2023
Techdegree Data Analysis Project 2

Basketball Team Stats Tool In this project you will be writing a program that reads from the "constants" data (PLAYERS and TEAMS) in constants.py. Thi

2 Oct 23, 2021
Statistical Rethinking course winter 2022

Statistical Rethinking (2022 Edition) Instructor: Richard McElreath Lectures: Uploaded Playlist and pre-recorded, two per week Discussion: Online, F

Richard McElreath 3.9k Dec 31, 2022
The micro-framework to create dataframes from functions.

The micro-framework to create dataframes from functions.

Stitch Fix Technology 762 Jan 07, 2023
Exploratory data analysis

Exploratory data analysis An Exploratory data analysis APP TAPIWA CHAMBOKO 🚀 About Me I'm a full stack developer experienced in deploying artificial

tapiwa chamboko 1 Nov 07, 2021
This repository contains some analysis of possible nerdle answers

Nerdle Analysis https://nerdlegame.com/ This repository contains some analysis of possible nerdle answers. Here's a quick overview: nerdle.py contains

0 Dec 16, 2022
Find exposed data in Azure with this public blob scanner

BlobHunter A tool for scanning Azure blob storage accounts for publicly opened blobs. BlobHunter is a part of "Hunting Azure Blobs Exposes Millions of

CyberArk 250 Jan 03, 2023
Scraping and analysis of leetcode-compensations page.

Leetcode compensations report Scraping and analysis of leetcode-compensations page.

utsav 96 Jan 01, 2023
Pip install minimal-pandas-api-for-polars

Minimal Pandas API for Polars Install From PyPI: pip install minimal-pandas-api-for-polars Example Usage (see tests/test_minimal_pandas_api_for_polars

Austin Ray 6 Oct 16, 2022
Pipeline and Dataset helpers for complex algorithm evaluation.

tpcp - Tiny Pipelines for Complex Problems A generic way to build object-oriented datasets and algorithm pipelines and tools to evaluate them pip inst

Machine Learning and Data Analytics Lab FAU 3 Dec 07, 2022
Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python 📊

Thomas 2 May 26, 2022
A library to create multi-page Streamlit applications with ease.

A library to create multi-page Streamlit applications with ease.

Jackson Storm 107 Jan 04, 2023
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

About The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficien

ROOT 2k Dec 29, 2022
Python reader for Linked Data in HDF5 files

Linked Data are becoming more popular for user-created metadata in HDF5 files.

The HDF Group 8 May 17, 2022
PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

Emmanuel Boateng Sifah 1 Jan 19, 2022
Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Backtesting the "Cramer Effect" & Recommendations from Cramer Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which

Gábor Vecsei 12 Aug 30, 2022