Overview

Project Scope

The scope of this project is to build a data warehouse on Google Cloud Platform that helps answer common business questions and powers dashboards. To that end, a conceptual data model and a data pipeline will be defined.

Architecture

Data is uploaded to a Google Cloud Storage bucket. GCS acts as the data lake where all raw files are stored. Data is then loaded into staging tables on BigQuery. The ETL process takes data from those staging tables and creates the data mart tables. An Airflow instance, deployed on Google Compute Engine or run locally, orchestrates the pipeline.

Here are the justifications for the technologies used:

  • Google Cloud Storage: acts as the data lake and scales seamlessly as the volume of raw data grows.
  • Google BigQuery: acts as the database engine for the data warehouse, data marts, and ETL processes. BigQuery is a serverless solution that can easily and effectively process petabyte-scale datasets.
  • Apache Airflow: orchestrates the workflow, issuing the commands that load data into BigQuery and the SQL queries that run the ETL. Airflow does not process any data itself, which lets the architecture scale.

Data Model

The database is designed following star-schema principles, with 1 fact table and 7 dimension tables; a sketch of how the fact table is built follows the list below.

[Image: star-schema data model diagram]

  • F_IMMIGRATION_DATA: contains immigration information such as arrival date, departure date, visa type, gender, country of origin, etc.
  • D_TIME: contains date attributes for the date columns
  • D_PORT: contains port_id and port_name
  • D_AIRPORT: contains airports within a state
  • D_STATE: contains state_id and state_name
  • D_COUNTRY: contains country_id and country_name
  • D_WEATHER: contains average weather for a state
  • D_CITY_DEMO: contains demographic information for a city
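
To make the ETL step concrete, here is a minimal sketch of how the fact table could be assembled with the google-cloud-bigquery Python client. The staging column names (cicid, i94port, i94res, i94addr, ...) follow the I94 dataset's conventions but are assumptions here, as are the join keys:

from google.cloud import bigquery

client = bigquery.Client(project="cloud-data-lake")

# Hypothetical ETL query: join staging rows to the dimension keys.
# Column names follow the I94 dataset but are assumptions here.
fact_query = """
CREATE OR REPLACE TABLE `IMMIGRATION_DWH.F_IMMIGRATION_DATA` AS
SELECT
    s.cicid, s.arrdate, s.depdate, s.i94visa, s.gender,
    p.port_id, c.country_id, st.state_id
FROM `IMMIGRATION_DWH_STAGING.immigration_data` s
LEFT JOIN `IMMIGRATION_DWH.D_PORT` p ON s.i94port = p.port_id
LEFT JOIN `IMMIGRATION_DWH.D_COUNTRY` c ON s.i94res = c.country_id
LEFT JOIN `IMMIGRATION_DWH.D_STATE` st ON s.i94addr = st.state_id
"""
client.query(fact_query).result()  # block until the job finishes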

Data pipeline

This project uses Airflow for orchestration.

[Image: Airflow DAG graph view]

A DummyOperator, start_pipeline, kicks off the pipeline, followed by 4 load operations. Those operations load data from the GCS bucket into BigQuery tables. The immigration data is loaded as parquet files while the others are CSV formatted. Row-count checks run after each load to BigQuery.

Next, the pipeline loads 3 master data objects from the I94 data dictionary. Then the F_IMMIGRATION_DATA table is created and checked to make sure it contains no duplicates. The other dimension tables are also created, and the pipeline finishes. A sketch of what this DAG looks like in code follows below.
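
A minimal sketch of that DAG, assuming Airflow 2 with the Google provider installed (import paths differ on older versions; the dataset, bucket, and task names mirror the parameters used later in this README):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("cloud_data_lake_pipeline", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    start_pipeline = DummyOperator(task_id="start_pipeline")

    # One of the 4 load operations: copy the parquet files from GCS into a staging table.
    load_immigration = GCSToBigQueryOperator(
        task_id="load_immigration_data",
        bucket="cloud-data-lake-gcp",
        source_objects=["immigration_data/*.parquet"],
        destination_project_dataset_table="IMMIGRATION_DWH_STAGING.immigration_data",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
    )

    start_pipeline >> load_immigration  # the CSV loads and row checks chain on in the same way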

Scenarios

Data increase by 100x

The current infrastructure easily supports a 100x increase in data size. GCS and BigQuery can handle petabyte-scale data, and Airflow is not a bottleneck since it only issues commands to other services.

Pipelines run at 7 AM daily and must update a dashboard. Would it still work?

Schedule the DAG to run daily at 7 AM, and set up DAG retries plus email/Slack notifications on failures, as in the sketch below.
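
A minimal sketch of that configuration (the email address is a placeholder):

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "retries": 3,                         # retry failed tasks a few times
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-team@example.com"],   # placeholder address
}

dag = DAG(
    "cloud_data_lake_pipeline",
    default_args=default_args,
    schedule_interval="0 7 * * *",        # every day at 7 AM
    start_date=datetime(2021, 1, 1),
    catchup=False,
)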

Make it available to 100+ people

BigQuery is auto-scaling, so it can easily handle 100+ people accessing it. If even more people or services need access to the database, we can add steps to write to a horizontally scalable NoSQL database such as Datastore, Bigtable, or Cassandra.

Project Instructions

GCP setup

Follow these steps:

  • Create a project on GCP
  • Enable billing by adding a credit card (you have free credits worth $300)
  • Navigate to IAM and create a service account
  • Grant the account the Project Owner role. This is convenient for this project, but it is not recommended for a production system. Download the account's JSON key and keep it somewhere safe; a quick way to verify it works is sketched after this list.
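
A quick sanity check that the key works, assuming the google-cloud-storage package is installed (the key path and project id are placeholders):

import os

from google.cloud import storage

# Point the client at the downloaded service-account key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your-key.json"

client = storage.Client(project="your-project-id")
print([bucket.name for bucket in client.list_buckets()])  # should print without errors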

Create a bucket on your project and upload the data with the following structure:

gs://cloud-data-lake-gcp/airports/:
gs://cloud-data-lake-gcp/airports/airport-codes_csv.csv
gs://cloud-data-lake-gcp/airports/airport_codes.json

gs://cloud-data-lake-gcp/cities/:
gs://cloud-data-lake-gcp/cities/us-cities-demographics.csv
gs://cloud-data-lake-gcp/cities/us_cities_demo.json

gs://cloud-data-lake-gcp/immigration_data/:
gs://cloud-data-lake-gcp/immigration_data/part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00001-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00002-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00003-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00004-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00005-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00006-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00007-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00008-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00009-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00010-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00011-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00012-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00013-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet

gs://cloud-data-lake-gcp/master_data/:
gs://cloud-data-lake-gcp/master_data/I94ADDR.csv
gs://cloud-data-lake-gcp/master_data/I94CIT_I94RES.csv
gs://cloud-data-lake-gcp/master_data/I94PORT.csv

gs://cloud-data-lake-gcp/weather/:
gs://cloud-data-lake-gcp/weather/GlobalLandTemperaturesByCity.csv
gs://cloud-data-lake-gcp/weather/temperature_by_city.json

You can copy the data to your own bucket by running the following:

gsutil cp -r gs://cloud-data-lake-gcp/ gs://{your_bucket_name}
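
To confirm the copy worked, you can list one of the prefixes in your bucket with the same client library (the bucket name is your own):

from google.cloud import storage

client = storage.Client()

# List the immigration files to confirm the copy succeeded; expect 14 parquet parts.
for blob in client.list_blobs("your_bucket_name", prefix="immigration_data/"):
    print(blob.name)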

Local setup

Clone the project and set up the local environment as follows.

Install Docker if it is not already installed (see the official Docker documentation).

Install the Astronomer CLI (see the Astronomer documentation).

Run the following command to bring up the Airflow instance:

astro d start

You can look at the logs by running make logs if you need to debug something. You can access and manage the pipeline by entering the following address in a browser:

localhost:8080/admin/

If everything is set up correctly, you will see the following screen:

[Image: Airflow UI home screen]

Navigate to Admin -> Connections and paste in the credentials for the following two connections: bigquery_default and google_cloud_default

[Image: Airflow connection configuration]
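
Alternatively, Airflow can pick connections up from environment variables (for example in the Astronomer .env file). A hedged sketch, assuming the key file is mounted at /usr/local/airflow/gcp_key.json inside the container; the exact URI format depends on your Airflow version:

AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__project=cloud-data-lake&extra__google_cloud_platform__key_path=/usr/local/airflow/gcp_key.json
AIRFLOW_CONN_BIGQUERY_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__project=cloud-data-lake&extra__google_cloud_platform__key_path=/usr/local/airflow/gcp_key.json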

Navigate to the main dag at dags/cloud-data-lake-pipeline.py and change the following parameters to match your own setup:

project_id = 'cloud-data-lake'
staging_dataset = 'IMMIGRATION_DWH_STAGING'
dwh_dataset = 'IMMIGRATION_DWH'
gs_bucket = 'cloud-data-lake-gcp'

You can then trigger the dag and the pipeline will run.

The data warehouse

The final data warehouse looks like this: [Image: data warehouse tables in BigQuery]
