INFO-H515 - Big Data Scalable Analytics

Jacopo De Stefani, Giovanni Buroni, Théo Verhelst and Gianluca Bontempi - Machine Learning Group

Exercise classes - Overview

This repository contains the material for the exercise classes of the ULB/VUB Big Data Analytics master course (first semester 2022), advanced analytics part.

These hands-on sessions provide:

  • Session 1: An introduction to Spark and its Machine Learning (ML) library. The case study for the first session is a churn prediction problem: how to predict which customers will quit a subscription to a given service? The session covers the basics of loading and formatting a dataset for training an ML algorithm with the Spark ML library, and illustrates the use of different Spark ML algorithms and accuracy metrics to address the prediction problem.

  • Sessions 2 to 4: An in-depth coverage of the use of the Map/Reduce programming model for distributing machine learning algorithms, and of their implementation in Spark (a minimal sketch of the idea is given after this list). Sessions 2, 3, and 4 cover, respectively, the from-scratch Map/Reduce implementations of

    • Session 2: Linear regression (ordinary least squares and stochastic gradient descent). The algorithms are applied to an artificial dataset, illustrating both the NumPy and the Map/Reduce implementations of OLS and SGD.
    • Session 3: Streaming analytics with Recursive Least Squares and model racing. The algorithms are implemented using Spark Streaming, on a data stream coming from a Kafka broker. The RLS approach is then compared with established ML approaches.
    • Session 4: Recommender system with alternating least squares, using as a case study a movie recommendation problem.

    After detailing the Map/Reduce techniques for solving these problems, each session ends with an example of how to use the corresponding algorithm with Spark ML, and of how to get insight into the way Spark distributes the task using the Spark user interface.

  • Session 5: An overview of a deep learning framework (Keras/TensorFlow), and its use for image classification with convolutional neural networks.
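
As a minimal taste of the Map/Reduce approach of Sessions 2 to 4, the snippet below estimates OLS coefficients with PySpark by computing the sufficient statistics X^T X and X^T y as a sum over the records of an RDD, and then solving the normal equations. It is a sketch on artificial data, not the course implementation:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("ols-mapreduce").getOrCreate()
sc = spark.sparkContext

# Artificial dataset: y = 3*x0 - 2*x1 + noise
X = np.random.randn(1000, 2)
y = X @ np.array([3.0, -2.0]) + 0.1 * np.random.randn(1000)
rdd = sc.parallelize(list(zip(X, y)))

# Map: each record (x, y) contributes x x^T and x*y; Reduce: sum the contributions
xtx, xty = rdd.map(lambda p: (np.outer(p[0], p[0]), p[0] * p[1])) \
              .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Solve the normal equations (X^T X) beta = X^T y; beta should be close to [3, -2]
beta = np.linalg.solve(xtx, xty)
print(beta)

spark.stop()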

The material is available as a set of Jupyter notebooks.

Clone this repository

From the command line, use

git clone https://github.com/Yannael/BigDataAnalytics_INFOH515

If you use the course cluster, you will have to transfer this folder to the cluster over SFTP.

Environment setup

These notebooks rely on different technologies and frameworks for Big Data and machine learning (Spark, Kafka, Keras, and TensorFlow). We summarize below different ways to set up your environment.

Local setup (Linux)

Python

Install Anaconda Python (see https://www.anaconda.com/download/; choose the latest Linux distribution, Python 3.9 at the time of writing these instructions).

Make sure the binaries are in your PATH. The Anaconda installer offers to add them at the end of the installation process. If you decline, you may later add

export ANACONDA_HOME=where_you_installed_anaconda
export PATH=$ANACONDA_HOME/bin:$PATH

to your .bashrc.

Spark

Download from https://spark.apache.org/downloads.html (use version 3.2.0 (October 2021), prebuilt for Apache Hadoop 3.3). Untar, add the executables to your PATH, and add the Python libraries to PYTHONPATH:

export SPARK_HOME=where_you_untarred_spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYTHONPATH="$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip"

(The py4j version varies across Spark releases; check the exact file name in $SPARK_HOME/python/lib and adjust the path accordingly.)
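
To check that the Spark Python libraries are correctly picked up, the following lines, run in a Python shell opened after the exports above, should print the Spark version (a quick sanity check, assuming a standard Spark layout):

import pyspark
print(pyspark.__version__)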

Kafka

Download from https://kafka.apache.org/downloads, and untar the archive. Start ZooKeeper and the Kafka broker with

export KAFKA_HOME=where_you_untarred_kafka
# Start ZooKeeper first, then the Kafka broker, both in the background; logs go to $HOME
nohup $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties > $HOME/zookeeper.log 2>&1 &
nohup $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties > $HOME/kafka.log 2>&1 &
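
As a quick sanity check of the broker, a test message can be sent and read back from Python. The sketch below assumes the kafka-python package (pip install kafka-python) and a broker listening on localhost:9092, the default in server.properties:

from kafka import KafkaProducer, KafkaConsumer

# Send one test message to a throwaway topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello from INFO-H515")
producer.flush()

# Read it back; the consumer gives up after 5 seconds if nothing arrives
consumer = KafkaConsumer("test-topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
    break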

Keras and TensorFlow

Install with pip

pip install tensorflow
pip install keras
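
To verify the installation, and as a preview of the kind of model used in Session 5, the sketch below builds a minimal convolutional network with Keras (a toy architecture for 28x28 grayscale images, not the one used in the course):

import tensorflow as tf
from tensorflow import keras

# Toy CNN for 28x28 grayscale images (e.g. MNIST-like data)
model = keras.Sequential([
    keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints layer shapes and parameter counts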

Notebook

Jupyter Notebook is included in Anaconda. Start it with

jupyter notebook

and open http://127.0.0.1:8888 in your browser.

Docker

To ease the setup of the environment, we also prepared a Docker container that provides a ready-to-use environment. See the docker folder for instructions on installing Docker, downloading the course container, and getting started with it.

Note that the Dockerfile essentially follows the steps of the 'local' installation above.

Check if your setup is working

After setting up your environment (either in Docker or on your own machine), you should be able to run the notebook and the scripts in the Check_Setup folder.

Spark - Test with Check_Setup notebook

  • Open the notebook Check_Setup/Demo_RDD_local.ipynb
  • Run all cells

Follow the instructions in Check_Setup/Demo_RDD_local.ipynb to get access to the Spark UI.

Kafka - Test with Check_Setup scripts

  1. Run the script Check_Setup/0_kafka_startup.sh to start ZooKeeper and Kafka.
  2. Run the script Check_Setup/1_kafka_test_topic.sh to check whether a topic can be created and deleted successfully.
  3. In two separate terminals:
    1. First start Check_Setup/2_kafka_test_sender.sh, and try sending some messages by entering some text and pressing Enter to conclude each message.
    2. Then start Check_Setup/3_kafka_test_receiver.sh, and check that the messages sent by the sender are correctly received.

FAQ

Owner
Yann-Aël Le Borgne
Postdoc @ Machine Learning Group - Computer Science Department - Université Libre de Bruxelles - Belgium