Challenge proposed by IGTI in its Cloud Data Engineer bootcamp

Overview

Module 4 Challenge - Cloud Data Engineer Bootcamp - IGTI

Objectives

  • Create the infrastructure as code
  • Using a Kubernetes cluster on Azure
    • Ingest the Enade 2017 data with Python into the data lake on Azure
    • Transform the data from the bronze layer to the silver layer using the Delta format
    • Enrich the data from the silver layer to the gold layer using the Delta format
  • Use an Azure Synapse Serverless SQL pool to serve the data
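
As a rough illustration of the bronze-to-silver objective, the sketch below reads raw files from a bronze folder and rewrites them as a Delta table in a silver folder. It is not the job shipped in this repo: the function names, the `enade2017` folder name, the folder layout, and the semicolon delimiter are all assumptions made for the example, and `spark` is expected to be an existing SparkSession with the Delta Lake package configured.

```python
def layer_path(account: str, container: str, layer: str, table: str) -> str:
    """Build an ADLS Gen2 URI for one layer of the lake (hypothetical layout)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{layer}/{table}"


def bronze_to_silver(spark, account: str, container: str, table: str = "enade2017") -> None:
    """Read raw files from bronze and rewrite them as a Delta table in silver.

    `spark` must be a SparkSession that already has Delta Lake available.
    """
    raw = (
        spark.read
        .option("header", True)
        .option("sep", ";")  # assumption: the raw file is semicolon-delimited
        .csv(layer_path(account, container, "bronze", table))
    )
    (
        raw.dropDuplicates()      # illustrative cleanup step only
        .write.format("delta")
        .mode("overwrite")
        .save(layer_path(account, container, "silver", table))
    )
```

The silver-to-gold step would follow the same read/transform/write shape, reading with `format("delta")` instead of CSV.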

Architecture

[Image: architecture diagram]

Steps

Create the infrastructure

source infra/00-variables

bash infra/01-create-rg.sh

bash infra/02-create-cluster-k8s.sh

bash infra/03-create-lake.sh

bash infra/04-create-synapse.sh

bash infra/05-access-assignments.sh

Prepare k8s

Download the kubeconfig file

bash infra/02-get-kubeconfig.sh

To shorten the commands, use an alias

alias k=kubectl

Create the namespaces

k create namespace processing
k create namespace ingestion

Create the Service Account and Role Binding

k apply -f k8s/crb-spark.yaml

Create the secrets

k create secret generic azure-service-account --from-env-file=.env --namespace processing
k create secret generic azure-service-account --from-env-file=.env --namespace ingestion
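
Both secrets are built from the same `.env` file, so a missing entry only shows up later when a pod fails to authenticate. The sketch below is one way to check the file before creating the secrets. It parses `KEY=VALUE` lines, skipping blank lines and `#` comments, and it is stricter than kubectl because it rejects lines without `=`. The key names in the usage lines are hypothetical; use whatever names your ingestion and Spark code actually read.

```python
def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and lines starting with '#'."""
    pairs = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {line!r}")
        pairs[key] = value
    return pairs


def missing_keys(text: str, required: list) -> list:
    """Return the required keys that the env file does not define."""
    present = parse_env_file(text)
    return [key for key in required if key not in present]


# Hypothetical key names, for illustration only.
sample = "# service principal\nAZURE_TENANT_ID=tenant\nAZURE_CLIENT_ID=client\n"
print(missing_keys(sample, ["AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"]))
# prints ['AZURE_CLIENT_SECRET']
```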

Install the Spark Operator

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator

helm repo update

helm install spark spark-operator/spark-operator --set image.tag=v1beta2-1.2.3-3.1.1 --namespace processing

Ingestion app

Ingestion Image

docker build ingestion -f ingestion/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4-ingestion --network=host

docker push otaciliopsf/cde-bootcamp:desafio-mod4-ingestion

Apply ingestion job

# first give the pod a new unique name
# (written to a temp file because redirecting straight back into the input file can truncate it before it is read)
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))" \
< k8s/ingestion-job.yaml > k8s/ingestion-job.yaml.tmp &&
mv k8s/ingestion-job.yaml.tmp k8s/ingestion-job.yaml

k apply -f k8s/ingestion-job.yaml
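
The renaming one-liner above is dense, so here is the same idea spelled out as a small function. It works on a plain dict rather than a YAML file, and the manifest name used here is hypothetical. It assumes, as the one-liner does, that the existing name ends in an 8-character suffix that can be thrown away.

```python
import uuid


def with_fresh_suffix(name: str, suffix_len: int = 8) -> str:
    """Replace the last `suffix_len` characters of `name` with random hex digits."""
    return name[:-suffix_len] + uuid.uuid4().hex[:suffix_len]


# Hypothetical manifest; the real one is loaded from k8s/ingestion-job.yaml.
manifest = {"metadata": {"name": "ingestion-job-00000000"}}
manifest["metadata"]["name"] = with_fresh_suffix(manifest["metadata"]["name"])
print(manifest["metadata"]["name"])
```

Because the suffix is replaced rather than appended, the name keeps the same length on every run.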

Logs

ING_POD_NAME=$(cat k8s/ingestion-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")

k logs $ING_POD_NAME -n ingestion --follow

Spark

Build the job image

docker build spark -f spark/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4

docker push otaciliopsf/cde-bootcamp:desafio-mod4

Apply job

# first give the Spark Application a new unique name
# (written to a temp file because redirecting straight back into the input file can truncate it before it is read)
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))" \
< k8s/spark-job.yaml > k8s/spark-job.yaml.tmp &&
mv k8s/spark-job.yaml.tmp k8s/spark-job.yaml

k apply -f k8s/spark-job.yaml

Logs

SPARK_APP_NAME=$(cat k8s/spark-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")'-driver'

k logs $SPARK_APP_NAME -n processing --follow
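
The `-driver` suffix in the command above comes from the Spark Operator's default naming: the driver pod is called `<application name>-driver`. A minimal sketch of how a helper script might assemble the same log command follows; the function name and the example application name are made up for illustration.

```python
def driver_logs_command(app_name: str, namespace: str = "processing") -> list:
    """Build the kubectl argument list that follows the driver pod's logs."""
    driver_pod = f"{app_name}-driver"  # default driver pod name used by the operator
    return ["kubectl", "logs", driver_pod, "-n", namespace, "--follow"]


# Hypothetical application name, for illustration only.
print(" ".join(driver_logs_command("spark-job-1a2b3c4d")))
```

The list form can be passed directly to `subprocess.run` without shell quoting concerns.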

Azure Synapse Serverless SQL pool

Open the Synapse workspace using the link generated by

bash infra/04-get-workspace-url.sh

To get started, follow these steps

[Image: steps-synapse]

Run the contents of the create-synapse-view.sql script in the Synapse workspace to create the view over the table in the lake

That's it: Synapse is ready to receive queries.
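
For readers who want a feel for what a view over a Delta folder looks like in a serverless SQL pool, the sketch below renders one such statement. It is not the content of create-synapse-view.sql, which may differ. The view name and the folder URL are hypothetical, and the statement has to be run in a user database rather than `master`.

```python
def delta_view_sql(view_name: str, delta_folder_url: str) -> str:
    """Render a CREATE VIEW statement that reads a Delta folder through OPENROWSET."""
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT * FROM OPENROWSET(\n"
        f"    BULK '{delta_folder_url}',\n"
        f"    FORMAT = 'DELTA'\n"
        f") AS rows;"
    )


# Hypothetical view name and lake location, for illustration only.
print(delta_view_sql(
    "enade_gold",
    "https://<account>.dfs.core.windows.net/<container>/gold/enade2017/",
))
```

Once such a view exists, it can be queried like any table, for example with `SELECT TOP 10 * FROM enade_gold`.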

Cleaning up the resources

bash infra/99-delete-service-principal.sh

bash infra/99-delete-rg.sh

Conclusion

Following the steps above, it is possible to run queries directly against the gold layer of the Delta Lake using Synapse.

Owner
Otacilio Filho
Data Engineer Azure | Python | Spark | Databricks