Projects that implement various aspects of Data Engineering.

Last update: Oct 14, 2021

Overview

DATAWAREHOUSE ON AWS

The purpose of this project is to build a data warehouse that holds user activity data for the music streaming application 'Sparkify'. The data model is implemented on AWS cloud infrastructure with the following components:

AWS S3 - stores the source datasets.

AWS Redshift
>for staging the extracted data
>for storing the resulting data model (facts and dimensions)

The data model designed for this project is a star schema.

Table and attribute details:

Fact Table
songplays: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables
users: user_id, first_name, last_name, gender, level
songs: song_id, title, artist_id, year, duration
artists: artist_id, name, location, lattitude, longitude
time: start_time, hour, day, week, month, year, weekday
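
As a rough illustration of how this schema might be expressed in sql_queries.py, the sketch below defines the tables as Redshift DDL strings. Only the table and column names come from the listing above; the data types, keys, and the variable names (such as `create_table_queries`) are assumptions, not the project's actual definitions.

```python
# Sketch only: column types and constraints are assumed, not taken from the project.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INTEGER NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INTEGER,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR,
    artist_id VARCHAR,
    year      INTEGER,
    duration  DOUBLE PRECISION
);
"""

# The column name "lattitude" is spelled as in the attribute listing.
artist_table_create = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id VARCHAR PRIMARY KEY,
    name      VARCHAR,
    location  VARCHAR,
    lattitude DOUBLE PRECISION,
    longitude DOUBLE PRECISION
);
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour       INTEGER,
    day        INTEGER,
    week       INTEGER,
    month      INTEGER,
    year       INTEGER,
    weekday    INTEGER
);
"""

# Dimension tables first, then the fact table that refers to them.
create_table_queries = [
    user_table_create,
    song_table_create,
    artist_table_create,
    time_table_create,
    songplay_table_create,
]
```

Keeping each statement as a named string and collecting them in a list lets a script such as create_tables.py loop over the list instead of hard-coding SQL.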

The source datasets to be extracted into the dimensional model are two sets of JSON files:

Song data: s3://udacity-dend/song_data - Data for all songs, with their respective artists, available in the application library.

Log data: s3://udacity-dend/log_data - Data for user events and activity on the application.
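
Loading these files into Redshift staging tables is usually done with the COPY command. The sketch below builds such a statement in Python. The helper name `build_copy_sql`, the staging table names, and the IAM role ARN are hypothetical placeholders. `JSON 'auto'` assumes the JSON keys match the staging column names; if they do not, a JSONPaths file would be needed instead.

```python
def build_copy_sql(table: str, s3_path: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads JSON files from S3 into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS JSON 'auto';"
    )


# Placeholder role ARN for illustration only; replace with a real role.
EXAMPLE_ROLE_ARN = "arn:aws:iam::123456789012:role/example-redshift-role"

# Staging table names are assumptions; the S3 paths come from the README.
staging_songs_copy = build_copy_sql(
    "staging_songs", "s3://udacity-dend/song_data", EXAMPLE_ROLE_ARN
)
staging_events_copy = build_copy_sql(
    "staging_events", "s3://udacity-dend/log_data", EXAMPLE_ROLE_ARN
)

print(staging_songs_copy)
```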

The data warehouse is implemented on AWS Redshift, which uses a PostgreSQL-compatible SQL dialect.

The ETL pipeline that extracts and loads data from source to target is implemented in Python.
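
A minimal sketch of how such a pipeline might run its SQL in order is shown below. The helper `run_queries` is hypothetical and works with any Python DB-API connection. Here an in-memory SQLite database stands in for Redshift so the example runs locally; the real project would connect to the cluster with a PostgreSQL driver such as psycopg2. The table contents are made-up sample rows.

```python
import sqlite3


def run_queries(conn, queries):
    """Execute each SQL statement in order on a DB-API connection, committing after each."""
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
        conn.commit()
    cur.close()


# SQLite stands in for Redshift here; this is an assumption made for local testing.
conn = sqlite3.connect(":memory:")

run_queries(conn, [
    # Stage: raw rows, possibly with duplicates.
    "CREATE TABLE staging_users (user_id INTEGER, level TEXT)",
    "INSERT INTO staging_users VALUES (1, 'free'), (1, 'free'), (2, 'paid')",
    # Load: de-duplicate staged rows into the dimension table.
    "CREATE TABLE users AS SELECT DISTINCT user_id, level FROM staging_users",
])

user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(user_count)
```

Separating the stage and load statements this way mirrors the extract, stage, and load flow described for etl.py.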

TODO steps:

Create sql_queries.py - defines the queries that create and populate the tables of the proposed data model

Run create_tables.py - creates the tables by executing the queries from sql_queries.py

Run etl.py - runs the data pipeline built over the data model, which extracts, stages, and loads data from AWS S3 into the data warehouse on AWS Redshift

Design and run analytical queries on the populated data model to gain insight into user events in the streaming application
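
As one example of the kind of analytical query the last step refers to, the sketch below counts song plays per user by joining the fact table to the users dimension. The query and its name are illustrative, not taken from the project. An in-memory SQLite database with a few made-up rows stands in for the populated Redshift warehouse.

```python
import sqlite3

# Illustrative query: the most active users by number of song plays.
most_active_users_query = """
SELECT u.user_id, u.first_name, u.last_name, COUNT(*) AS plays
FROM songplays AS sp
JOIN users AS u ON u.user_id = sp.user_id
GROUP BY u.user_id, u.first_name, u.last_name
ORDER BY plays DESC, u.user_id
LIMIT 5;
"""

# SQLite stand-in with fabricated sample rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER, first_name TEXT, last_name TEXT, gender TEXT, level TEXT
);
CREATE TABLE songplays (
    songplay_id INTEGER, start_time TEXT, user_id INTEGER, level TEXT,
    song_id TEXT, artist_id TEXT, session_id INTEGER, location TEXT, user_agent TEXT
);
INSERT INTO users VALUES
    (1, 'Ada', 'Example', 'F', 'paid'),
    (2, 'Bo', 'Sample', 'M', 'free');
INSERT INTO songplays (songplay_id, start_time, user_id, level) VALUES
    (1, '2018-11-01 10:00:00', 1, 'paid'),
    (2, '2018-11-01 10:05:00', 1, 'paid'),
    (3, '2018-11-01 11:00:00', 2, 'free');
""")

rows = conn.execute(most_active_users_query).fetchall()
for row in rows:
    print(row)
```

Because the star schema keeps user attributes in the users dimension, the query needs only a single join from the fact table.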

An easy-to-use feature store

PySpark project that performs joins on Spark DataFrames.

ASOUL live-stream chat (danmaku) scraping and data analysis

Uses MIT/MEDSL, New York Times, and US Census data sources to analyze per-county COVID-19 deaths.

scikit-survival is a Python module for survival analysis built on top of scikit-learn.

Lale is a Python library for semi-automated data science.

Challenge proposed by IGTI in its Cloud Data Engineer bootcamp

X-news - Data pipeline using Scrapy, Kafka, Spark Streaming, Spark ML, Elasticsearch, and Kibana

A fast, flexible, and performant feature selection package for python.

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

INFO-H515 - Big Data Scalable Analytics

BAyesian Model-Building Interface (Bambi) in Python.

Intercepting proxy + analysis toolkit for Second Life compatible virtual worlds

MapReader: A computer vision pipeline for the semantic exploration of maps at scale

Creating a statistical model to predict 10 year treasury yields

Package for decomposing EMG signals into motor unit firings, as used in Formento et al 2021.

WithPipe is a simple utility for functional piping in Python.

Zipline, a Pythonic Algorithmic Trading Library

Python implementation of reliability measures such as Omega.

Analyze gravitational-wave data stored at the LIGO/Virgo observatories