Processing NYC Taxi Data using a PySpark ETL pipeline

Description

This project extracts, transforms, and loads a large volume of data from the NYC Taxi Rides dataset (hosted on AWS S3). It reads large CSV files (~2 GB per month) and applies transformations such as datatype conversions and dropping unusable rows and columns. Finally, the data is written back in Parquet format. This saves time for downstream tasks such as machine learning. It also saves a large amount of space (~97% reduction from CSV to Parquet), making the data easier to store for downstream tasks.
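
The transformations described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the `COLUMN_TYPES` mapping, the row filters, and the function names are assumptions chosen for the example, and the real settings live in the project's own config files.

```python
# Hypothetical column-to-type mapping (an assumption for this sketch);
# the project's real column settings are kept in configs/cols_config.json.
COLUMN_TYPES = {
    "tpep_pickup_datetime": "timestamp",
    "tpep_dropoff_datetime": "timestamp",
    "passenger_count": "int",
    "trip_distance": "double",
    "fare_amount": "double",
}


def build_cast_expressions(column_types):
    """Return Spark SQL expressions that cast each column to its target type."""
    return [
        f"CAST(`{name}` AS {dtype}) AS `{name}`"
        for name, dtype in column_types.items()
    ]


def run_etl(input_path, output_path):
    """Read a CSV file, cast types, drop unusable rows/columns, write Parquet."""
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nyc-taxi-etl-sketch").getOrCreate()

    # Without a schema, every CSV column is read as a string.
    raw = spark.read.csv(input_path, header=True)

    cleaned = (
        # selectExpr keeps only the listed columns, so any others are dropped.
        raw.selectExpr(*build_cast_expressions(COLUMN_TYPES))
        # Assumed row filters: drop rows missing key fields or with no distance.
        .dropna(subset=["tpep_pickup_datetime", "fare_amount"])
        .filter("trip_distance > 0")
    )

    cleaned.write.mode("overwrite").parquet(output_path)
    spark.stop()
```

Casting through a single `selectExpr` call keeps the type mapping in one place, so it could be loaded from a config file instead of being hard-coded.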

How to use it (Using GCP as the cloud service of choice)

Set up a bucket on Google Cloud Storage (GCS)
Use get_raw_data.sh to download the raw CSV files from S3 to the GCS bucket
Set up a Dataproc cluster on GCP
SSH into the master node and copy the entire project folder to the persistent disk
Edit the application's configuration file
Submit the job: spark-submit main.py --filename [raw_data_filename], or run submit_job.sh with the appropriate arguments
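
As a rough illustration of the job-submission step, the entry point could parse the `--filename` argument and load the application configuration along these lines. Only `main.py`, the `--filename` flag, and `configs/app_config.json` come from this README; the `--config` flag, the function names, and the shape of the configuration are hypothetical.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the command-line arguments passed after main.py."""
    parser = argparse.ArgumentParser(description="NYC taxi ETL job (sketch)")
    parser.add_argument(
        "--filename",
        required=True,
        help="name of the raw CSV file to process",
    )
    # Hypothetical flag: the README only says that a config file is edited.
    parser.add_argument(
        "--config",
        default="configs/app_config.json",
        help="path to the application config",
    )
    return parser.parse_args(argv)


def load_config(path):
    """Load the JSON application config into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def main(argv=None):
    """Entry point: read the arguments and config, then hand off to the ETL."""
    args = parse_args(argv)
    config = load_config(args.config)
    # The actual ETL call would go here; this sketch only reports its inputs.
    print(f"Processing {args.filename} with {len(config)} config entries")
```

In a real script, `main()` would be called under the usual `__main__` guard, with `--filename` passed as in the submit step above.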

Project structure

root/
|---bash/
|   |---create_cluster.sh
|   |---install.sh
|---configs/
|   |---app_config.json
|   |---cols_config.json
|---jobs/
|   |---etl_tasks.py
|   |---transformations.py
|---get_raw_data.sh
|---main.py
|---requirements.txt
|---submit_job.sh

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Owner

Unnikrishnan
