The purpose of this project is to build a data warehouse that accommodates user activity data for the music streaming application 'Sparkify'. The data model is implemented on AWS cloud infrastructure with the following components -
- AWS S3 - source datasets.
- AWS Redshift
  - for staging the extracted data
  - for storing the resultant data model (facts and dimensions)
- Fact Table
songplays: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- Dimension Tables
users: user_id, first_name, last_name, gender, level
songs: song_id, title, artist_id, year, duration
artists: artist_id, name, location, latitude, longitude
time: start_time, hour, day, week, month, year, weekday
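As a sketch of what the fact-table definition in sql_queries.py might look like, here is the songplays DDL carried in a Python string. The column types and constraints are assumptions for illustration; the actual file defines all seven tables.

```python
# Hypothetical sketch of the songplays fact-table DDL, in the style of
# sql_queries.py. Column types and constraints are assumptions, not the
# project's actual definitions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```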
The source data consists of two sets of JSON files:
- Song data: s3://udacity-dend/song_data - metadata for all songs and their respective artists available in the application library.
- Log data: s3://udacity-dend/log_data - data for user events and activity on the application.
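Staging JSON datasets like these from S3 into Redshift is typically done with the COPY command. A minimal sketch, again as a Python query string: the staging table name and IAM role ARN are placeholders, and the real project may use a JSONPaths file rather than `'auto'` for the log data.

```python
# Sketch of a Redshift COPY statement for staging the song data from S3.
# "staging_songs" and the IAM role ARN are placeholder assumptions.
staging_songs_copy = """
COPY staging_songs
FROM 's3://udacity-dend/song_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS JSON 'auto'
REGION 'us-west-2';
"""
```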
- Create sql_queries.py - to design and build the tables for the proposed data model
- Run create_tables.py - to create the tables by executing the database queries from sql_queries.py
- Run etl.py - to run the data pipeline built over the data model, which extracts data from AWS S3, stages it, and loads it into the data warehouse on AWS Redshift
- Design and fire analytical queries on the populated data model to gain insights into user events on the streaming application
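The steps above can be sketched as follows: etl.py typically loops over lists of COPY and INSERT statements against a database cursor, and an analytical query then runs against the populated model. The query lists, function names, and the sample query below are illustrative stand-ins, not the project's actual code.

```python
# Minimal sketch of the etl.py flow: stage raw data from S3, then load the
# star schema. The query lists stand in for those defined in sql_queries.py.
copy_table_queries = [
    "COPY staging_events FROM 's3://udacity-dend/log_data' ...",
    "COPY staging_songs FROM 's3://udacity-dend/song_data' ...",
]
insert_table_queries = [
    "INSERT INTO songplays (...) SELECT ... FROM staging_events ...",
    "INSERT INTO users (...) SELECT DISTINCT ... FROM staging_events ...",
]

def load_staging_tables(cur, conn):
    # Stage the raw JSON from S3 into Redshift staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

def insert_tables(cur, conn):
    # Transform staged rows into the fact and dimension tables.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()

# An illustrative analytical query against the populated model:
# how many song plays come from each subscription level?
plays_by_level = """
SELECT level, COUNT(*) AS plays
FROM songplays
GROUP BY level
ORDER BY plays DESC;
"""
```

In etl.py the cursor and connection would come from a psycopg2 connection to the Redshift cluster; the fakes here only illustrate the call sequence.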