A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
Some usefull scripts for the Nastran's 145 solution (Flutter Analysis) using the pyNastran package.

nastran-aero-flutter This project is intended to analyse the Supersonic Panel Flutter using the NASTRAN software. The project uses the pyNastran and t

zuckberj 11 Nov 16, 2022
Яндекс тренировки по алгоритмам. Июнь 2021

Young&&Yandex Тренировки по алгоритмам Если вы хотите попасть на летнюю стажировку в Яндекс, но пока не уверены в своих силах, приходите на наши трени

Podlevskiy Viktor 6 Sep 03, 2021
Timetable scripts for python

Timetable Scripts timetable_to_json: https://beta.elektronplus.pl/timetable classes_taught_by_teacher: a.adam (aa) ['1Tc', '1Td', '3Te', '3Ti', '4Tf',

Elektron++ 2 Jan 02, 2022
pgvector support for Python

pgvector-python pgvector support for Python Great for online recommendations 🎉 Supports Django, SQLAlchemy, Psycopg 2, Psycopg 3, and asyncpg Install

Andrew Kane 37 Dec 20, 2022
NotesToCommands - a fully customizable notes / command template program, allowing users to instantly execute terminal commands

NotesToCommands is a fully customizable notes / command template program, allowing users to instantly execute terminal commands with dynamic arguments grouped into sections in their notes/files. It w

zxro 5 Jul 02, 2022
An evolutionary multi-agent platform based on mesa and NEAT

An evolutionary multi-agent platform based on mesa and NEAT

Valerio1988 6 Dec 04, 2022
Bionic is Python Framework for crafting beautiful, fast user experiences for web and is free and open source.

Bionic is Python Framework for crafting beautiful, fast user experiences for web and is free and open source. Getting Started This is an example of ho

14 Apr 10, 2022
A python server markup language

PSML - Python server markup language How to install: python install.py

LMFS 6 May 18, 2022
A sage package for working with circular genomes represented by signed or unsigned permutations

Circular genome tools (cgt) A sage package for working with circular genomes represented by signed or unsigned permutations. It includes tools for con

Joshua Stevenson 1 Mar 10, 2022
Vaccine for STOP/DJVU ransomware, prevents encryption

STOP/DJVU Ransomware Vaccine Prevents STOP/DJVU Ransomware from encrypting your files. This tool does not prevent the infection itself. STOP ransomwar

Karsten Hahn 16 May 31, 2022
Function Plotter✨

Function-Plotter Build With : Python PyQt5 unittest matplotlib Getting Started This is an list of needed instructions to set up your project locally,

Ahmed Lotfy 3 Jan 06, 2022
Task dispatcher for Postgres

Features a task being ran as an OS process supports task queue with priority and process limit per node fully database driven (a worker and task can b

2 Dec 06, 2021
msgqywx 使用企业微信的应用消息推送实时信息

msgqywx 使用企业微信的应用消息推送实时信息

Demon Finch 8 Dec 18, 2022
ArinjoyTheDev 1 Jul 17, 2022
Hospitality app for ERPNext to manage hotels & restaurants.

Hospitality ERPNext Hospitality module is designed to handle workflows for Hotels and Restaurants. Manage Restaurants The Restaurant module in ERPNext

Frappe 19 Dec 26, 2022
Djangoblog - A blogging site where people can make their accout and write blogs and read other author's blogs

This a blogging site where people can make their accout and write blogs and read other author's blogs.

1 Jan 26, 2022
Find all social media accounts with a username!

Aliens_eye FIND ALL SOCIAL MEDIA ACCOUNTS WITH A USERNAME! OSINT To install: Open terminal and type: git clone https://github.com/BLINKING-IDIOT/Alien

Aaron Thomas 84 Dec 28, 2022
A collection of existing KGQA datasets in the form of the huggingface datasets library, aiming to provide an easy-to-use access to them.

KGQA Datasets Brief Introduction This repository is a collection of existing KGQA datasets in the form of the huggingface datasets library, aiming to

Semantic Systems research group 21 Jan 06, 2023
Ramadhan countdown - Simple daily reminder about upcoming Ramadhan

Ramadhan Countdown Bot Simple bot for displaying daily reminder about Islamic pr

Abdurrahman Shofy Adianto 1 Feb 06, 2022
🤖🤖 Jarvis is an virtual assistant which can some tasks easy for you like surfing on web opening an app and much more... 🤖🤖

Jarvis 🤖 🤖 Jarvis is an virtual assistant which can some tasks easy for you like surfing on web opening an app and much more... 🤖 🤖 Developer : su

1 Nov 08, 2021