Spark-DeltaLake-Demo

Reliable, Scalable Machine Learning (2022)

I completed this project to become better acquainted with the latest big data tools. Further details can be found on my blog here.

The world is producing an exponentially increasing amount of digital data, and the tools we use to derive insights from data are evolving just as rapidly.

In recent years, a new architecture called the Data Lakehouse has gained prominence as an enterprise solution for storing and processing big data. This trend piqued my interest and led me to explore some of the key underlying technologies behind it.

Of particular focus are two open-source technologies: Delta Lake and Apache Spark. Delta Lake adds a metadata layer on top of data lakes, bringing ACID transaction guarantees and time travel (the ability to query earlier versions of a table) to what has often been a messy approach to data science at scale. Apache Spark is a distributed processing engine for a diverse set of workloads (e.g., SQL queries, machine learning, stream processing), and it can be programmed in Python, R, Scala, and other languages.
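As a rough illustration of the time travel feature described above, here is a minimal PySpark sketch. It is not taken from the notebook in this repository. It assumes the `pyspark` and `delta-spark` packages are installed, and the names `DELTA_SPARK_CONF`, `build_spark`, and `time_travel_demo` are hypothetical helpers introduced only for this example.

```python
"""Sketch: write two versions of a Delta table, then read the first one back."""

# Settings that enable Delta Lake's SQL extension and catalog in a Spark session.
DELTA_SPARK_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_spark(app_name="delta-time-travel-sketch"):
    """Create a local SparkSession with Delta Lake enabled (hypothetical helper)."""
    # Imported lazily so this file can be loaded without Spark installed.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in DELTA_SPARK_CONF.items():
        builder = builder.config(key, value)
    # configure_spark_with_delta_pip adds the matching Delta Lake jars.
    return configure_spark_with_delta_pip(builder).getOrCreate()


def time_travel_demo(spark, path):
    """Write version 0, overwrite it as version 1, and return both row counts."""
    # Version 0 of the table: five rows.
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
    # Version 1 of the table: ten rows, committed as a new entry in the Delta log.
    spark.range(0, 10).write.format("delta").mode("overwrite").save(path)

    latest = spark.read.format("delta").load(path)
    # Time travel: read the table as it was at version 0.
    first = spark.read.format("delta").option("versionAsOf", 0).load(path)
    return first.count(), latest.count()
```

Calling `time_travel_demo(build_spark(), path)` with a fresh, writable `path` should return the row count of version 0 followed by the row count of the latest version, which shows that the overwrite did not destroy the earlier snapshot.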

I believe these technologies, among several others detailed further on my blog, will play a major role in how businesses leverage the power of data going forward. As such, this research prepares me well to confront many emerging data engineering and data science challenges.

The demonstration linked below is deployed with the Binder service, which runs a Jupyter notebook in the cloud from a custom Docker image built from the supporting files in this repository.

Live Link:

Contained in this repository:

- Jupyter notebook demonstrating Apache Spark and Delta Lake
- Files to construct a custom Docker image deployed using Binder
  - Dockerfile
  - docker-compose.yml
  - requirements.txt

Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)
