PizzaOrders_DataPipeline

Tony owns a new pizza shop.
He knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire,
so he had one more genius idea to combine with it - he was going to Uberize it - and so Pizza Runner was launched!

Tony started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Tony’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.

Now he wants to know how his business is doing, and he needs answers to some questions from the data. However, the data is not stored in an appropriate format. He approaches a data engineer to process and store the data for him and to answer his questions.

The data is stored in the following CSV files:

customer_orders.csv
Columns: order_id, customer_id, pizza_id, exclusions, extras, order_time
pizza_names.csv
Columns: pizza_id, pizza_name
pizza_recipes.csv
Columns: pizza_id, toppings
pizza_toppings.csv
Columns: topping_id, topping_name
runner_orders.csv
Columns: order_id, runner_id, pickup_time, distance, duration, cancellation
runners.csv
Columns: runner_id, registration_date
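Since the files are described as not being in an appropriate format, a quick header check before loading can catch a mismatched file early. The following is a minimal sketch, assuming each CSV begins with a header row (the source does not say so); `EXPECTED_COLUMNS` and `check_header` are hypothetical names, and the sample row is invented for illustration.

```python
import csv
import io

# Expected header for each file, copied from the column lists above.
EXPECTED_COLUMNS = {
    "customer_orders.csv": ["order_id", "customer_id", "pizza_id",
                            "exclusions", "extras", "order_time"],
    "pizza_names.csv": ["pizza_id", "pizza_name"],
    "pizza_recipes.csv": ["pizza_id", "toppings"],
    "pizza_toppings.csv": ["topping_id", "topping_name"],
    "runner_orders.csv": ["order_id", "runner_id", "pickup_time",
                          "distance", "duration", "cancellation"],
    "runners.csv": ["runner_id", "registration_date"],
}


def check_header(filename, stream):
    """Return True if the first CSV row of `stream` matches the expected columns.

    Assumption: the file starts with a header row.
    """
    header = next(csv.reader(stream), [])
    return [name.strip() for name in header] == EXPECTED_COLUMNS[filename]


# Hypothetical in-memory sample standing in for the real runners.csv.
sample = io.StringIO("runner_id,registration_date\n1,2021-01-01\n")
runners_ok = check_header("runners.csv", sample)
print(runners_ok)
```

With real files, the same function can be called on `open(path, newline="")` for each entry in `EXPECTED_COLUMNS`.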

The questions Tony wanted answered

How many pizzas were ordered?
How many unique customer orders were made?
How many successful orders were delivered by each runner?
How many of each type of pizza was delivered?
How many Vegetarian and Meatlovers were ordered by each customer?
What was the maximum number of pizzas delivered in a single order?
For each customer, how many delivered pizzas had at least 1 change and how many had no changes?
How many pizzas were delivered that had both exclusions and extras?
What was the total volume of pizzas ordered for each hour of the day?
What was the volume of orders for each day of the week?
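To make the intent of a few of these questions concrete, here is a small sketch in plain Python. It is not the project's PySpark code. The rows are hypothetical and already cleaned, an empty `cancellation` value is assumed to mean the order was delivered, and "unique customer orders" is read as the number of distinct `order_id` values.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-cleaned rows; the real CSV contents are not shown here.
customer_orders = [
    {"order_id": 1, "customer_id": 101, "pizza_id": 1,
     "exclusions": "", "extras": "", "order_time": "2021-01-01 18:05:02"},
    {"order_id": 2, "customer_id": 101, "pizza_id": 1,
     "exclusions": "", "extras": "", "order_time": "2021-01-01 19:00:52"},
    {"order_id": 3, "customer_id": 102, "pizza_id": 1,
     "exclusions": "", "extras": "", "order_time": "2021-01-02 23:51:23"},
    {"order_id": 3, "customer_id": 102, "pizza_id": 2,
     "exclusions": "4", "extras": "1", "order_time": "2021-01-02 23:51:23"},
]
runner_orders = [
    {"order_id": 1, "runner_id": 1, "cancellation": ""},
    {"order_id": 2, "runner_id": 1, "cancellation": "Customer Cancellation"},
    {"order_id": 3, "runner_id": 2, "cancellation": ""},
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# How many pizzas were ordered? One row per pizza.
pizzas_ordered = len(customer_orders)

# How many unique customer orders were made? Distinct order_id values.
unique_orders = len({row["order_id"] for row in customer_orders})

# How many successful orders were delivered by each runner?
# Assumption: an empty cancellation value means the order was delivered.
delivered_ids = {r["order_id"] for r in runner_orders if not r["cancellation"]}
deliveries_per_runner = Counter(
    r["runner_id"] for r in runner_orders if r["order_id"] in delivered_ids
)

# Maximum number of pizzas delivered in a single order.
delivered = [row for row in customer_orders if row["order_id"] in delivered_ids]
max_pizzas_in_order = max(Counter(row["order_id"] for row in delivered).values())

# Delivered pizzas that had both exclusions and extras.
both_changes = sum(1 for row in delivered if row["exclusions"] and row["extras"])

# Volume of pizzas ordered for each hour of the day.
pizzas_per_hour = Counter(
    datetime.strptime(row["order_time"], TIME_FORMAT).hour
    for row in customer_orders
)

# Volume of orders for each day of the week, counting each order once.
order_times = {row["order_id"]: row["order_time"] for row in customer_orders}
orders_per_weekday = Counter(
    DAY_NAMES[datetime.strptime(ts, TIME_FORMAT).weekday()]
    for ts in order_times.values()
)
```

The same grouping logic maps onto `groupBy` and `count` operations once the tables are available in Spark.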

Requirements

Store the data in MySQL tables
Using Sqoop, import the data into Hive
Using PySpark, get the results for the questions
Store the results in a separate table
Automate the entire process in Airflow
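The Sqoop step can be sketched as a helper that builds one `sqoop import` command per table, which a scheduler task could then run. This is an illustration under stated assumptions, not the project's actual script: the JDBC URL, user name, password file path, and the idea that MySQL table names mirror the CSV file names are all hypothetical, and `sqoop_import_args` is an invented helper name.

```python
# Assumption: one MySQL table per CSV file, named after the file.
TABLES = [
    "customer_orders",
    "pizza_names",
    "pizza_recipes",
    "pizza_toppings",
    "runner_orders",
    "runners",
]


def sqoop_import_args(table, jdbc_url, username, password_file):
    """Build the argument list for importing one MySQL table into Hive."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--hive-import",          # load the imported data into Hive
        "--hive-table", table,    # keep the same table name in Hive
        "--num-mappers", "1",     # a single mapper needs no split column
    ]


# Hypothetical connection details, for illustration only.
commands = [
    sqoop_import_args(
        table,
        jdbc_url="jdbc:mysql://localhost:3306/pizza_runner",
        username="tony",
        password_file="/user/tony/mysql.password",
    )
    for table in TABLES
]

for args in commands:
    print(" ".join(args))
```

Each argument list can be passed to `subprocess.run(args, check=True)` on a machine where Sqoop is installed, or joined into a shell command for a scheduler task.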

Pizza Orders data pipeline use case, solved with SQL, Sqoop, HDFS, Hive, and Airflow.


Owner

Melwin Varghese P
