ETL pipeline on movie data using Python and PostgreSQL

Last update: Jul 07, 2021

Movies-ETL

Overview

This project built an automated Extract, Transform, Load (ETL) pipeline. The pipeline extracted movie data from Wikipedia, Kaggle, and MovieLens, then cleaned, transformed, and merged it using Pandas. The product was a merged table of movies and ratings loaded into PostgreSQL.
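As a rough illustration of the merge step, the sketch below counts how often each movie received each rating value and joins those counts onto a movie table with Pandas. The tiny DataFrames and the column names (`kaggle_id`, `movieId`, `rating`) are assumptions made for the example, not the project's actual schema or code.

```python
import pandas as pd

# Tiny stand-ins for the cleaned movie table and the raw ratings table.
# All values and column names here are hypothetical.
movies = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "kaggle_id": [1, 2, 3],
    "title": ["Example A", "Example B", "Example C"],
})
ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 4.0, 5.0, 3.0],
})

# Count how many times each movie received each rating value,
# producing one column per rating level.
rating_counts = (
    ratings.groupby(["movieId", "rating"]).size()
    .unstack(fill_value=0)
    .add_prefix("rating_")
)

# A left merge keeps every movie, including those with no ratings at all.
movies_with_ratings = movies.merge(
    rating_counts, left_on="kaggle_id", right_index=True, how="left"
)

# Movies without any ratings get NaN from the merge; treat that as zero.
rating_cols = list(rating_counts.columns)
movies_with_ratings[rating_cols] = movies_with_ratings[rating_cols].fillna(0)
```

A left merge is used here so that movies without ratings stay in the output; an inner merge would silently drop them.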

Resources

Data sources:
- movies_metadata.csv
- ratings.csv
- wikipedia_movies.json
Software:
- Python
- PostgreSQL
- Pandas
- SQLAlchemy
- Regular Expressions

Results

Final output table: FINAL_Merged_Movies_and_Ratings.csv
Datasets uploaded to PostgreSQL for other users to analyze movie data (Hackathon).
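A minimal sketch of how a large CSV such as ratings.csv could be loaded into a database in chunks, so the whole file never has to sit in memory at once. The function name, the chunk size, and the sample rows are assumptions for illustration. The demo uses an in-memory SQLite connection only so the sketch runs without a server; for PostgreSQL, `con` would instead be an SQLAlchemy engine created from your own connection string.

```python
import io
import sqlite3
import time

import pandas as pd


def load_csv_in_chunks(csv_source, table_name, con, chunksize=1_000_000):
    """Append a large CSV to a SQL table chunk by chunk; return rows loaded."""
    rows_loaded = 0
    start = time.time()
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # "append" creates the table on the first chunk and adds to it afterwards.
        chunk.to_sql(name=table_name, con=con, if_exists="append", index=False)
        rows_loaded += len(chunk)
        print(f"{rows_loaded} rows loaded, {time.time() - start:.1f}s elapsed")
    return rows_loaded


# Demo with made-up rows and an in-memory SQLite database (an assumption,
# standing in for the PostgreSQL target).
demo_csv = io.StringIO("userId,movieId,rating\n1,10,4.0\n1,20,3.5\n2,10,5.0\n")
demo_con = sqlite3.connect(":memory:")
total = load_csv_in_chunks(demo_csv, "ratings", demo_con, chunksize=2)
```

Printing the running row count and elapsed time after each chunk gives a simple progress indicator for a load that may take a long time.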

Summary

The pipeline was created under the following assumptions:

- I was able to join the Wikipedia, Kaggle, and ratings movie data on the IMDb ID column.
- The Wikipedia dataset didn't have an IMDb ID column, so I had to extract the ID from the URL link given.
- Each dataset had to be cleaned on its own because of overlapping columns (such as 'Director' and 'Directed By'), unnecessary columns, many null values, TV shows, outliers, duplicates, incorrect data types, formatting problems, and other errors.
- The Wikipedia movie data was in JSON format.
- Not every movie had a rating for each rating level.
- The ratings dataset had more than 26 million entries, which created a time constraint and a data-processing challenge.
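Extracting the IMDb ID from a URL, as mentioned in the list above, can be sketched with a regular expression. This is an illustration under assumptions, not the project's actual code: the field name `imdb_link`, the sample records, and the pattern (the letters "tt" followed by at least seven digits) are all hypothetical.

```python
import re
from typing import Optional

# Assumed ID shape: "tt" followed by at least seven digits.
IMDB_ID_PATTERN = re.compile(r"tt\d{7,}")


def extract_imdb_id(imdb_link: Optional[str]) -> Optional[str]:
    """Return the IMDb ID found in a link, or None when there is none."""
    if not isinstance(imdb_link, str):
        return None  # missing links (None or NaN) carry no ID
    match = IMDB_ID_PATTERN.search(imdb_link)
    return match.group(0) if match else None


# Hypothetical records shaped like entries in wikipedia_movies.json.
sample_movies = [
    {"title": "Example A", "imdb_link": "https://www.imdb.com/title/tt0098987/"},
    {"title": "Example B", "imdb_link": None},
]
for movie in sample_movies:
    movie["imdb_id"] = extract_imdb_id(movie["imdb_link"])
```

On a Pandas column, `Series.str.extract` with the same pattern wrapped in a capture group would do the equivalent work in one call. Rows where no ID is found can then be dropped before merging on the ID.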

Owner

Juan Nicolas Serrano
