BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Last update: Jan 06, 2022

Overview

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Introduction

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Installation

Download the BigDL packages, or install BigDL with pip (for example, inside a conda environment).

How to run Program on Spark

Usage: spark-submit-with-bigdl.sh [options] file.py

Options:

--master MASTER_URL: the Spark master to connect to (spark://host:port, yarn, k8s://host:port, or local).
local[k]: run Spark locally with k worker threads (ideally, k equals the number of logical cores on your machine).
file.py: the Python program to execute.
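As a minimal sketch of the usage line above, the following Python helper assembles the command for a local run. The helper name `build_submit_command` and the choice of `local[4]` are illustrative assumptions, not part of BigDL; only `spark-submit-with-bigdl.sh`, the master option, and `file.py` come from the text.

```python
import shlex


def build_submit_command(script, master="local[4]", extra_options=()):
    """Return the argument list for launching `script` through the BigDL wrapper.

    `build_submit_command` is a hypothetical helper written for this example.
    """
    return ["spark-submit-with-bigdl.sh", "--master", master, *extra_options, script]


# Assumed example: run file.py locally with 4 worker threads.
cmd = build_submit_command("file.py", master="local[4]")

# shlex.join quotes arguments such as local[4] so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the wrapper script is on the PATH.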

System configuration

The programs were run on a system with the following configuration:

System/Host Processor: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
CPU(s): 48
Core(s) per socket: 12
Socket(s): 2
Memory: 183 G (free)

Data Description and Run Model

The data is MNIST, a dataset of 70,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9. It is split into two parts: 60,000 training images and 10,000 test images.

For this problem, we use an LSTM model to classify the MNIST digits.
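The text does not say how the images are fed to the LSTM. A common approach, assumed here, is to treat each 28×28 image as a sequence of 28 time steps (one per pixel row) with 28 features each. The sketch below shows that reshaping in plain Python on a synthetic image; `image_to_sequence` is a hypothetical helper, not a BigDL function.

```python
IMAGE_SIDE = 28  # MNIST images are 28x28 pixels


def image_to_sequence(flat_pixels, side=IMAGE_SIDE):
    """Split a flat list of side*side pixel values into `side` rows.

    Each row then serves as one time step of `side` features for a
    recurrent model (an assumed input layout, not confirmed by the source).
    """
    if len(flat_pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(flat_pixels)}")
    return [flat_pixels[row * side:(row + 1) * side] for row in range(side)]


# Synthetic stand-in for one flattened grayscale image (values 0..255).
fake_image = [i % 256 for i in range(IMAGE_SIDE * IMAGE_SIDE)]
sequence = image_to_sequence(fake_image)

print(len(sequence), len(sequence[0]))  # 28 28
```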

BigDL Performance Evaluation

Execution running time

Computation Evaluation (SPEED UP)
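Speedup is conventionally defined as the running time of a baseline configuration divided by the running time with more workers, and parallel efficiency as speedup divided by the number of workers. The sketch below computes both under that standard definition. The function names and the timings are hypothetical placeholders, not measurements from this project.

```python
def speedup(t_baseline, t_parallel):
    """Speedup of a parallel run relative to the baseline running time."""
    if t_baseline <= 0 or t_parallel <= 0:
        raise ValueError("running times must be positive")
    return t_baseline / t_parallel


def efficiency(t_baseline, t_parallel, workers):
    """Parallel efficiency: speedup divided by the number of workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_baseline, t_parallel) / workers


# Hypothetical running times in seconds, keyed by k in local[k].
timings = {1: 400.0, 2: 200.0, 4: 125.0}

for k, seconds in sorted(timings.items()):
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, k)
    print(f"local[{k}]: speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency close to 1 means that adding worker threads reduces the running time almost proportionally.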

Owner

Vo Cong Thanh
