A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
Pampy: The Pattern Matching for Python you always dreamed of.

Pampy: Pattern Matching for Python Pampy is pretty small (150 lines), reasonably fast, and often makes your code more readable and hence easier to rea

Claudio Santini 3.5k Dec 30, 2022
This project is about for notifying moderators about uploaded photos on server.

This project is about for notifying moderators (people who moderate data from photos) about uploaded photos on server.

1 Nov 24, 2021
Dungeon Dice Rolls is an aplication that the user can roll dices (d4, d6, d8, d10, d12, d20 and d100) and store the results in one of the 6 arrays.

Dungeon Dice Rolls is an aplication that the user can roll dices (d4, d6, d8, d10, d12, d20 and d100) and store the results in one of the 6 arrays.

Bracero 1 Dec 31, 2021
A web app that is written entirely in Python

University Project About I made this web app to finish a project assigned by my teacher. It is written entirely in Python, thanks to streamlit to make

15 Nov 27, 2022
A site that went kinda viral that lets you put Bernie Sanders in places

Bernie In Places An app that accidentally went viral! Read the story in WIRED here Install First, create a python virtual environment, and install all

310 Aug 22, 2022
Python dictionaries with advanced dot notation access

from box import Box movie_box = Box({ "Robin Hood: Men in Tights": { "imdb stars": 6.7, "length": 104 } }) movie_box.Robin_Hood_Men_in_Tights.imdb_s

Chris Griffith 2.1k Dec 28, 2022
Airbrake Python

airbrake-python Note. Python 3.4+ are advised to use new Airbrake Python notifier which supports async API and code hunks. Python 2.7 users should con

Airbrake 51 Dec 22, 2022
Medical appointments No-Show classifier

Medical Appointments No-shows Why do 20% of patients miss their scheduled appointments? A person makes a doctor appointment, receives all the instruct

4 Apr 20, 2022
Syntax highlighting for yarn.lock and bun.lockb files

Yarn.lock Syntax Highlighting Syntax highlighting for yarn.lock and bun.lockb files Installation Plugin is not publushed yet on Package Control, to in

Alexander Kuznetsov 4 Jul 06, 2022
My solutions for Advent of Code 2021 🌟🎄

🌟 Advent of Code 2021 🎄 My solutions for Advent of Code 2021. About · What is Advent of Code? · Contents · Usage · Table of puzzles (TODO: add final

Amanda P. Pinha 2 Dec 05, 2022
Projeto de análise de dados com SQL

Project-Analizyng-International-Debt-Statistics- Projeto de análise de dados com SQL - Plataforma Data Camp Descrição do Projeto : Não é que nós human

Lorrayne Silva 1 Feb 01, 2022
A flexible free and unlimited python tool to translate between different languages in a simple way using multiple translators.

deep-translator Translation for humans A flexible FREE and UNLIMITED tool to translate between different languages in a simple way using multiple tran

Nidhal Baccouri 806 Jan 04, 2023
SpellingBeeSolver - This program generates solutions to NYT style spelling bee problems.

SpellingBeeSolver This program generates solutions to NYT style spelling bee problems. The initial version of this program is being written in Python

1 Jan 01, 2022
Solcast Integration for Home Assistant

Solcast Solar Home Assistant(https://www.home-assistant.io/) Component This custom component integrates the Solcast API into Home Assistant. Modified

Greg 45 Dec 20, 2022
A simple code for processing images to local binary pattern.

This figure is gotten from this link https://link.springer.com/chapter/10.1007/978-3-030-01449-0_24 LBP-Local-Binary-Pattern A simple code for process

Happy N. Monday 3 Feb 15, 2022
Lightweight Scheduled Blocks Checker for Current Epoch. No cardano-node Required, data is taken from blockfrost.io

ReLeaderLogs For Cardano Stakepool Operators: Lightweight Scheduled Blocks Checker for Current Epoch. No cardano-node Required, data is taken from blo

SNAKE (Cardano Stakepool) 2 Oct 19, 2021
Rufus port to linux, writed on Python3

Rufus-for-Linux Rufus port to linux, writed on Python3 Программа будет иметь тот же интерфейс что и оригинал, и тот же функционал. Программа создается

6 Jan 07, 2022
This program is meant to take the pain out of generating nice bash PS1 prompts.

TOC PS1 Installation / Quickstart License Other Docs Examples PS1 Command Help PS1 ↑ This program is meant to take the pain out of generating nice bas

Steven Hollingsworth 6 Jun 19, 2022
Scripts used in the RayStation medical radiation dosimetry treatment planning system

Med Phys Scripts These are scripts that I, the medical physics assistant at Cookeville Regional Medical Center, wrote for use in our radiation therapy

Kaley White 2 Oct 19, 2022
In this repo i inherit the pos module and added QR code to pos receipt

odoo-pos-inherit In this repo i inherit the pos module and added QR code to pos receipt 1- Create new Odoo Module using command line $ python odoo-bin

5 Apr 09, 2022