Pyspark Spotify ETL

Description

This is my first data engineering project. It extracts the user's recently played tracks from Spotify's API, transforms the data, and loads it into PostgreSQL through a SQLAlchemy engine. The data is shown as a Spark DataFrame before loading, and the whole ETL job is scheduled with crontab. The access token never has to be renewed by hand: the script begins with an HTTP POST to Spotify's token API, which requests a fresh token on every run.
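As a rough illustration of the token step, the sketch below builds the POST request that trades a refresh token for a new access token. It is not the project's actual code: the function names and placeholder credentials are hypothetical, and it assumes the standard library's urllib rather than whatever HTTP client the script really uses.

```python
import base64
import json
import urllib.parse
import urllib.request

# Spotify's token endpoint.
TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build (but do not send) the POST request that asks for a new access token."""
    # The client ID and secret travel as HTTP Basic credentials.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic_auth = base64.b64encode(credentials).decode("ascii")
    # The form body names the grant type and carries the refresh token.
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": refresh_token}
    ).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic_auth}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(request):
    """Send the request and return the access token from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]


# Placeholder values (hypothetical); replace them with your own credentials.
request = build_refresh_request("id", "secret", "my-refresh-token")
print(request.get_method(), request.full_url)
```

Calling fetch_access_token(request) at the start of each run would then return a token that is valid for that run.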

The goal is to help others who, like me, want to become data engineers create their first project.

Essentials

Extra modules the script must import (all part of Python's standard library): sys, json, datetime.

ETL Execution

Install all the necessary libraries from the Pipfile.
Read the "Token_request_instructions" to get your own refresh token. If you'd rather skip that, you can get a temporary token from https://developer.spotify.com/console/get-recently-played/, but it has to be replaced every hour.
Add your PostgreSQL credentials to the engine variable. If you use another RDBMS, see https://docs.sqlalchemy.org/en/14/core/engines.html for the matching connection URL format.
Create SQL Database/Table (Optional).
Create a bash file. This file is where you write the paths to Spark, Python, and your script. Cron runs jobs with a minimal environment, so without this file you'll get a "ModuleNotFoundError" for each module your script imports. (Think of it as the ETL's own ~/.bash_profile.)
Create a new crontab, or use the existing one if you want the job to run at midnight every day.
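For the credentials step above, here is a minimal sketch of composing the PostgreSQL connection URL that a SQLAlchemy engine expects. The helper name and all credential values are hypothetical placeholders. The URL-escaping matters only when the username or password contains characters such as "@" or "/".

```python
from urllib.parse import quote_plus


def postgres_url(user, password, host, port, database):
    """Compose a PostgreSQL connection URL, escaping special characters in the credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder credentials (hypothetical); replace them with your own.
url = postgres_url("etl_user", "p@ss/word", "localhost", 5432, "spotify")
print(url)

# With SQLAlchemy installed, this string is what gets passed to create_engine:
#   engine = sqlalchemy.create_engine(url)
```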

Extras

To verify that your scheduled job is working, temporarily change the crontab schedule to "* * * * *", which runs the job every minute.
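To see why "* * * * *" is a handy test schedule, the sketch below checks whether a five-field cron expression would fire at a given minute. It is a simplified illustration, not part of the project: it understands only "*" and single integers (no ranges, lists, or steps), and it assumes "0 0 * * *" as the usual expression for midnight every day.

```python
from datetime import datetime


def _field_matches(field, value):
    """Match one cron field; this sketch supports only '*' or a single integer."""
    return field == "*" or int(field) == value


def cron_matches(expression, moment):
    """Return True if a five-field cron expression would fire at the given minute."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = moment.isoweekday() % 7  # cron counts Sunday as 0
    time_ok = (
        _field_matches(minute, moment.minute)
        and _field_matches(hour, moment.hour)
        and _field_matches(month, moment.month)
    )
    day_ok = _field_matches(day, moment.day)
    weekday_ok = _field_matches(weekday, cron_weekday)
    # When both day fields are restricted, cron fires if either one matches.
    if day != "*" and weekday != "*":
        return time_ok and (day_ok or weekday_ok)
    return time_ok and day_ok and weekday_ok


midnight = datetime(2024, 1, 1, 0, 0)
afternoon = datetime(2024, 1, 1, 15, 30)

print(cron_matches("0 0 * * *", midnight))    # True: the daily midnight schedule
print(cron_matches("0 0 * * *", afternoon))   # False
print(cron_matches("* * * * *", afternoon))   # True: fires every minute
```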
Other Spotify scopes are listed at https://developer.spotify.com/documentation/general/guides/scopes/, in case you want to use something other than "recently played tracks".
Thank you, Karolina Sowinska, for your DE Beginners guide.
