PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Overview

PrimaryBid

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

Part1

This part involves ingesting an application lifecycle raw data in .csv formats (“CC Application Lifecycle.csv”). The data is transformed to return various Application stages as column names, and the time of stage completion, as values against each customer ID via python.

Files included in this section include:

  • Solution Directory:
    • application_etl.py (Contains transformation class for application lifecycle raw data)
    • run_application_etl.py (Ingest and executes transformations for application lifecycle raw data)
  • Test Directory:
    • test_application_etl.py (runs a series of test for objects in the transformation class)
    • Input Directory (Contains all the input test files)
    • Output Directory (Contains all the output test files)

Execution:

  1. Execute run_application_etl.py to obtain output file for transformed application lifecycle data.

Modifications:

  1. Extra transformation, bug fixes and other modification can be added in application_etl.py as an object.
  2. For new transformations (new functions), add a test for the function in test_application_etl.py and execute it with pytest -vv.
  3. Call the object in run_application_etl.py after test passes to return desired output.

Part2

This part presents an architectural design to ingest data from a MongoDB database - into a Redshift data platform. The solution accomodates the addition of more data sources in the near future. The DDL scripts which form part of the solution is resusable for ingesting and loading data into redshift.

Files included in this section establishes the creation of target tables for the data ingestion process:

  • dwh.cfg (Infrastucture parameters and configuration)
  • DDL_queries.py (DDL queries to drop, creat, copy/insert data into Redshift)
  • table_setup_load.py (Class to manage the establish connection to database setup and teardown of tables in Redshift)
  • execute_ddl_process.py (script to execute processes in table_setup_load class)
  • test_execute_ddl_process.py (script to test the setup and teardown of resources.)
  • requirement.txt (key libraries needed to execute .py scripts)
  • makefile (file to automate process of installing and testing libraries and .py scripts respectively.)

Execution:

  1. Execute execute_ddl_process.py to create and load data into target tables from S3.

Modifications:

  1. Bucket file sources and other config paramters can be added in dwh.cfg
  2. New DDl queries which includes ingesting data from multiple tables from aggregations/joins can be added in DDL_queries.py.
  3. For other functions not captured in this section work, custom functions can be added in table_setup_load.py
  4. Before executing scripts for production environments, test the modifications by executing test_execute_ddl_process.py

The architecture below highlights the processes involved in ingesting data from various data sources into redshift

  • Architeture

Data Architecture

Owner
Emmanuel Boateng Sifah
Computer scientist, Doctoral researcher, Solutions engineer, Data scientist, Data analyst and Data engineer
Emmanuel Boateng Sifah
Pip install minimal-pandas-api-for-polars

Minimal Pandas API for Polars Install From PyPI: pip install minimal-pandas-api-for-polars Example Usage (see tests/test_minimal_pandas_api_for_polars

Austin Ray 6 Oct 16, 2022
This is an analysis and prediction project for house prices in King County, USA based on certain features of the house

This is a project for analysis and estimation of House Prices in King County USA The .csv file contains the data of the house and the .ipynb file con

Amit Prakash 1 Jan 21, 2022
Stream-Kafka-ELK-Stack - Weather data streaming using Apache Kafka and Elastic Stack.

Streaming Data Pipeline - Kafka + ELK Stack Streaming weather data using Apache Kafka and Elastic Stack. Data source: https://openweathermap.org/api O

Felipe Demenech Vasconcelos 2 Jan 20, 2022
Python reader for Linked Data in HDF5 files

Linked Data are becoming more popular for user-created metadata in HDF5 files.

The HDF Group 8 May 17, 2022
bigdata_analyse 大数据分析项目

bigdata_analyse 大数据分析项目 wish 采用不同的技术栈,通过对不同行业的数据集进行分析,期望达到以下目标: 了解不同领域的业务分析指标 深化数据处理、数据分析、数据可视化能力 增加大数据批处理、流处理的实践经验 增加数据挖掘的实践经验

Way 2.4k Dec 30, 2022
CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner.

CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner. It is aimed to integrate this tool with several more features including providing a U

Ravi Prakash 3 Jun 27, 2021
Data science/Analysis Health Care Portfolio

Health-Care-DS-Projects Data Science/Analysis Health Care Portfolio Consists Of 3 Projects: Mexico Covid-19 project, analyze the patient medical histo

Mohamed Abd El-Mohsen 1 Feb 13, 2022
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

DAGsHub 359 Dec 22, 2022
Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

2 Nov 20, 2021
Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Teo Calvo 5 Apr 26, 2022
Picka: A Python module for data generation and randomization.

Picka: A Python module for data generation and randomization. Author: Anthony Long Version: 1.0.1 - Fixed the broken image stuff. Whoops What is Picka

Anthony 108 Nov 30, 2021
Manage large and heterogeneous data spaces on the file system.

signac - simple data management The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, and reproduc

Glotzer Group 109 Dec 14, 2022
Python Practicum - prepare for your Data Science interview or get a refresher.

Python-Practicum Python Practicum - prepare for your Data Science interview or get a refresher. Data Data visualization using data on births from the

Jovan Trajceski 1 Jul 27, 2021
Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data.

PremiershipPlayerAnalysis Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data. No

5 Sep 06, 2021
This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

1 Dec 28, 2021
Bigdata Simulation Library Of Dream By Sandman Books

BIGDATA SIMULATION LIBRARY OF DREAM BY SANDMAN BOOKS ================= Solution Architecture Description In the realm of Dreaming, its ruler SANDMAN,

Maycon Cypriano 3 Jun 30, 2022
Performance analysis of predictive (alpha) stock factors

Alphalens Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open sour

Quantopian, Inc. 2.5k Jan 09, 2023
SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

SNV Pipeline SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

East Genomics 1 Nov 02, 2021