This mini project showcase how to build and debug Apache Spark application using Python

Overview

Spark Python

by Denny Imanuel

This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark application on Spark container

Spark on Localhost

Requirement

  1. PyCharm IDE - You need to install PyCharm IDE
  2. Java JDK - You need to install Java JDK and set JAVA_HOME env
  3. Python - You need to install Python and set PYTHONPATH env
  4. Spark Hadoop - You need to install Spark Hadoop and set HADOOP_HOME and SPARK_HOME env

For more info: https://dotnet.microsoft.com/en-us/learn/data/spark-tutorial/install-spark

Run Config

To run Spark app run Spark Submit command or create a new 'Run Config' under Shell Script as follows:

\SparkPython\venv\Scripts\python.exe" spark-submit --class SparkPython SparkPython.py">
set PYSPARK_PYTHON "
    
     \SparkPython\venv\Scripts\python.exe"
spark-submit --class SparkPython SparkPython.py

    

Build Config

To build Spark app run Spark Submit command or create a new 'Build Config' under Python Debug Server as follows:

venv\Scripts\activate
pip install pydevd-pycharm~=
    

   

Debug Config

To debug Spark app create 'Debug Config' using standard Python configuration file and then insert following code. In order to debug run above 'Build Config' first, set breakpoint, and then run this 'Debug Config':

import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=8888, stdoutToServer=True, stderrToServer=True)

Spark on Docker

Requirement

  1. Rider IDE / Visual Studio - You need to install Rider IDE or Visual Studio
  2. Docker Desktop - You need to install Docker Desktop to run Docker
  3. Spark Image - Make sure you pull same version of Spark image as your local Spark:

docker pull bitnami/spark:3.1.2

Spark Clusters

Docker Compose below will run Spark cluster in master and worker node. First comment the debug line(6,7) and then pack the venv folder into venv.tar.gz and then submit both SparkPython.py file and venv.tar.gz to Spark cluster.

docker-compose up
spark-submit --master spark://localhost:7070 --class SparkPython SparkPython.py --archives venv.tar.gz

Output Result

If the Spark application is successfully build it should print out result table as follows:

Owner
Denny Imanuel
This repos shows how to develop mini application using various kind of framework in different programing languages (C#, Java, Python, Angular, React, Vue, etc)
Denny Imanuel
MoRecon - A tool for reconstructing missing frames in motion capture data.

MoRecon - A tool for reconstructing missing frames in motion capture data.

Yuki Nishidate 38 Dec 03, 2022
Desafio proposto pela IGTI em seu bootcamp de Cloud Data Engineer

Desafio Modulo 4 - Cloud Data Engineer Bootcamp - IGTI Objetivos Criar infraestrutura como código Utuilizando um cluster Kubernetes na Azure Ingestão

Otacilio Filho 4 Jan 23, 2022
Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database

Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database, using a set of "harvesters", whose job it

Battery Intelligence Lab 20 Sep 28, 2022
Bigdata Simulation Library Of Dream By Sandman Books

BIGDATA SIMULATION LIBRARY OF DREAM BY SANDMAN BOOKS ================= Solution Architecture Description In the realm of Dreaming, its ruler SANDMAN,

Maycon Cypriano 3 Jun 30, 2022
Python-based Space Physics Environment Data Analysis Software

pySPEDAS pySPEDAS is an implementation of the SPEDAS framework for Python. The Space Physics Environment Data Analysis Software (SPEDAS) framework is

SPEDAS 98 Dec 22, 2022
CS50 pset9: Using flask API to create a web application to exchange stocks' shares.

C$50 Finance In this guide we want to implement a website via which users can “register”, “login” “buy” and “sell” stocks, like below: Background If y

1 Jan 24, 2022
MDAnalysis is a Python library to analyze molecular dynamics simulations.

MDAnalysis Repository README [*] MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale,

MDAnalysis 933 Dec 28, 2022
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021
Data imputations library to preprocess datasets with missing data

Impyute is a library of missing data imputation algorithms. This library was designed to be super lightweight, here's a sneak peak at what impyute can do.

Elton Law 329 Dec 05, 2022
Data cleaning tools for Business analysis

Datacleaning datacleaning tools for Business analysis This program is made for Vicky's work. You can use it, too. 数据清洗 该数据清洗工具是为了商业分析 这个程序是为了Vicky的工作而

Lin Jian 3 Nov 16, 2021
Stitch together Nanopore tiled amplicon data without polishing a reference

Stitch together Nanopore tiled amplicon data using a reference guided approach Tiled amplicon data, like those produced from primers designed with pri

Amanda Warr 14 Aug 30, 2022
Sample code for Harry's Airflow online trainng course

Sample code for Harry's Airflow online trainng course You can find the videos on youtube or bilibili. I am working on adding below things: the slide p

102 Dec 30, 2022
A Python package for modular causal inference analysis and model evaluations

Causal Inference 360 A Python package for inferring causal effects from observational data. Description Causal inference analysis enables estimating t

International Business Machines 506 Dec 19, 2022
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Thanh Dat Vu 1 Feb 27, 2022
Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)

Spark-DeltaLake-Demo Reliable, Scalable Machine Learning (2022) This project was completed in an attempt to become better acquainted with the latest b

8 Mar 21, 2022
A CLI tool to reduce the friction between data scientists by reducing git conflicts removing notebook metadata and gracefully resolving git conflicts.

databooks is a package for reducing the friction data scientists while using Jupyter notebooks, by reducing the number of git conflicts between different notebooks and assisting in the resolution of

dataroots 86 Dec 25, 2022
ETL pipeline on movie data using Python and postgreSQL

Movies-ETL ETL pipeline on movie data using Python and postgreSQL Overview This project consisted on a automated Extraction, Transformation and Load p

Juan Nicolas Serrano 0 Jul 07, 2021
Statistical package in Python based on Pandas

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. F

Raphael Vallat 1.2k Dec 31, 2022
Data pipelines built with polars

valves Warning: the project is very much work in progress. Valves is a collection of functions for your data .pipe()-lines. This project aimes to host

14 Jan 03, 2023
Statistical Rethinking: A Bayesian Course Using CmdStanPy and Plotnine

Statistical Rethinking: A Bayesian Course Using CmdStanPy and Plotnine Intro This repo contains the python/stan version of the Statistical Rethinking

Andrés Suárez 3 Nov 08, 2022