Challenge proposed by IGTI in its Cloud Data Engineer bootcamp

Overview

Module 4 Challenge - Cloud Data Engineer Bootcamp - IGTI

Objectives

  • Create infrastructure as code
  • Use a Kubernetes cluster on Azure
    • Ingest the Enade 2017 data with Python into the data lake on Azure
    • Transform the data from the bronze layer to the silver layer using the Delta format
    • Enrich the data from the silver layer to the gold layer using the Delta format
  • Use Azure Synapse Serverless SQL Pool to serve the data

Architecture

(architecture diagram)

Steps

Create the infrastructure

source infra/00-variables
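
infra/00-variables is not reproduced here; as a hypothetical sketch, based on the names the later scripts suggest, it would export values along these lines:

# hypothetical sketch of infra/00-variables - all names and values are assumptions
export RESOURCE_GROUP="rg-cde-desafio-mod4"
export LOCATION="eastus2"
export AKS_CLUSTER_NAME="aks-cde-desafio"
export STORAGE_ACCOUNT_NAME="lakecdedesafio"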

bash infra/01-create-rg.sh

bash infra/02-create-cluster-k8s.sh

bash infra/03-create-lake.sh

bash infra/04-create-synapse.sh

bash infra/05-access-assignments.sh
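
The scripts themselves live in the infra/ folder; as an illustration, the first two roughly boil down to Azure CLI calls like the following (flags and sizes are assumptions, not the repo's actual contents):

# hypothetical sketch of 01-create-rg.sh and 02-create-cluster-k8s.sh
az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
az aks create --resource-group "$RESOURCE_GROUP" --name "$AKS_CLUSTER_NAME" --node-count 3 --generate-ssh-keys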

Prepare k8s

Download the kubeconfig file

bash infra/02-get-kubeconfig.sh
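
This most likely wraps az aks get-credentials; a minimal equivalent, assuming the variable names sketched above:

az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$AKS_CLUSTER_NAME"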

To shorten the commands, set an alias

alias k=kubectl

Create the namespaces

k create namespace processing
k create namespace ingestion

Create the Service Account and Role Binding

k apply -f k8s/crb-spark.yaml
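
The manifest itself is in the repo; a minimal sketch of the usual service account plus cluster role binding for the Spark Operator (the names spark and crb-spark are assumptions) could be applied inline like this:

# hypothetical sketch of what k8s/crb-spark.yaml may contain
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: processing
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: crb-spark
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: processing
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
EOF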

Create the secrets

k create secret generic azure-service-account --from-env-file=.env --namespace processing
k create secret generic azure-service-account --from-env-file=.env --namespace ingestion
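
The .env file carries the Azure service principal credentials as KEY=VALUE pairs. A hypothetical example - the key names are assumptions (they match what the Azure SDKs commonly read), and the file should never be committed:

# hypothetical .env contents
AZURE_TENANT_ID=<tenant-id>
AZURE_CLIENT_ID=<app-id>
AZURE_CLIENT_SECRET=<client-secret>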

Install the Spark Operator

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator

helm repo update

helm install spark spark-operator/spark-operator --set image.tag=v1beta2-1.2.3-3.1.1 --namespace processing
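
To sanity-check the installation, confirm the release and the operator pod are up:

helm status spark --namespace processing
k get pods -n processing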

Ingestion app

Ingestion Image

docker build ingestion -f ingestion/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4-ingestion --network=host

docker push otaciliopsf/cde-bootcamp:desafio-mod4-ingestion

Apply ingestion job

# first, give the pod a unique name
# (write to a temp file: redirecting the pipeline straight back into the
#  same file would truncate it before cat finishes reading it)
cat k8s/ingestion-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))"\
> k8s/ingestion-job.yaml.tmp && mv k8s/ingestion-job.yaml.tmp k8s/ingestion-job.yaml

k apply -f k8s/ingestion-job.yaml
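
Before tailing the logs, you can confirm the pod was scheduled:

k get pods -n ingestion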

Logs

ING_POD_NAME=$(cat k8s/ingestion-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")

k logs $ING_POD_NAME -n ingestion --follow

Spark

Create Job Image

docker build spark -f spark/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4

docker push otaciliopsf/cde-bootcamp:desafio-mod4

Apply job

# first, give the Spark Application a unique name
# (same temp-file pattern as above to avoid truncating the input file)
cat k8s/spark-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))"\
> k8s/spark-job.yaml.tmp && mv k8s/spark-job.yaml.tmp k8s/spark-job.yaml

k apply -f k8s/spark-job.yaml
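
For reference, here is a trimmed sketch of the general shape of a SparkApplication manifest accepted by the operator - the spec values below (main file, resources) are assumptions, not the repo's actual k8s/spark-job.yaml:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-job-00000000
  namespace: processing
spec:
  type: Python
  mode: cluster
  image: otaciliopsf/cde-bootcamp:desafio-mod4
  mainApplicationFile: local:///app/main.py
  sparkVersion: "3.1.1"
  driver:
    serviceAccount: spark
    cores: 1
    memory: 2g
  executor:
    instances: 2
    cores: 2
    memory: 4g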

Logs

SPARK_APP_NAME=$(cat k8s/spark-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")'-driver'

k logs $SPARK_APP_NAME -n processing --follow

Azure Synapse Serverless SQL Pool

Access the Synapse workspace through the generated link

bash infra/04-get-workspace-url.sh

To start using it, follow these steps

(screenshot: Synapse setup steps)

Run the contents of the create-synapse-view.sql script in the Synapse workspace to create the view over the table in the lake.
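
The script is in the repo; as a hypothetical sketch, a serverless view over a Delta folder in the gold layer generally follows this pattern (storage account, container, and path are placeholders, not the repo's actual values):

-- hypothetical sketch; point the BULK URL at the real gold layer path
CREATE OR ALTER VIEW dbo.enade_gold AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/gold/enade/',
    FORMAT = 'DELTA'
) AS result;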

Done: Synapse is now ready to receive queries.

Cleaning up the resources

bash infra/99-delete-service-principal.sh

bash infra/99-delete-rg.sh
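
Deleting the resource group removes everything created inside it; a minimal equivalent of the script, assuming the variable names from infra/00-variables:

az group delete --name "$RESOURCE_GROUP" --yes --no-wait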

Conclusion

By following the steps above, it is possible to run queries directly against the gold layer of the delta lake using Synapse.

Owner

Otacilio Filho - Data Engineer | Azure | Python | Spark | Databricks