INFO-H515 - Big Data Scalable Analytics

Jacopo De Stefani, Giovanni Buroni, Théo Verhelst and Gianluca Bontempi - Machine Learning Group

Exercise classes - Overview

This repository contains the material for the exercise classes of the ULB/VUB Big Data Analytics master course (first semester 2022) - Advanced analytics part.

These hands-on sessions provide:

  • Session 1: An introduction to Spark and its Machine Learning (ML) library. The case study for the first session is a churn prediction problem: how can we predict which customers will cancel their subscription to a given service? The session covers the basics of loading and formatting a dataset for training an ML algorithm with the Spark ML library, and illustrates the use of different Spark ML algorithms and accuracy metrics to address the prediction problem.

  • Sessions 2 to 4: An in-depth coverage of the Map/Reduce programming model for distributing machine learning algorithms, and of their implementation in Spark. Sessions 2, 3, and 4 cover, respectively, the Map/Reduce implementations from scratch of

    • Session 2: Linear regression (ordinary least squares and stochastic gradient descent). The algorithms are applied to an artificial dataset, with both NumPy and Map/Reduce implementations of OLS and SGD.
    • Session 3: Streaming analytics with Recursive Least Squares and model racing. The algorithms are implemented using Spark Streaming, on a data stream coming from a Kafka broker. The RLS approach is then compared with established ML approaches.
    • Session 4: Recommender system with alternating least squares, using as a case study a movie recommendation problem.

    After detailing the Map/Reduce techniques for solving these problems, each session ends with an example of how to use the corresponding algorithm with Spark ML, and shows how to use the Spark user interface to see how Spark distributes the task.
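As a rough illustration of the Map/Reduce idea behind Session 2, the sketch below computes an OLS fit from per-chunk sufficient statistics. It is not the notebooks' code: the function names are hypothetical, and plain Python `map` and `reduce` over a list of NumPy chunks stand in for the partitions of a Spark RDD.

```python
from functools import reduce

import numpy as np


def map_partition(partition):
    """Map step: turn one chunk of rows into its local statistics X^T X and X^T y."""
    X, y = partition
    return X.T @ X, X.T @ y


def reduce_stats(a, b):
    """Reduce step: sum the statistics of two chunks."""
    return a[0] + b[0], a[1] + b[1]


def ols_map_reduce(partitions):
    """Sum the statistics over all chunks, then solve the normal equations."""
    xtx, xty = reduce(reduce_stats, map(map_partition, partitions))
    return np.linalg.solve(xtx, xty)


# Artificial dataset: y = 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=1000)

# Split the rows into 4 chunks, standing in for the partitions of an RDD
partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
beta = ols_map_reduce(partitions)
```

Because the sums of X^T X and X^T y are associative, the chunks can be processed in any order and on any worker, which is what makes the computation easy to distribute.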

  • Session 5: An overview of a deep learning framework (Keras/Tensorflow), and its use for image classification using convolutional neural networks.

The material is available as a set of Jupyter notebooks.

Clone this repository

From the command line, use

git clone https://github.com/Yannael/BigDataAnalytics_INFOH515

If using the course cluster, you will have to use SFTP to send this folder to the cluster.

Environment setup

These notebooks rely on several technologies and frameworks for Big Data and machine learning (Spark, Kafka, Keras and TensorFlow). We summarize below several ways to set up your environment.

Local setup (Linux)

Python

Install Anaconda Python (see https://www.anaconda.com/download/) and choose the latest Linux distribution (Python 3.9 at the time of writing these instructions).

Make sure the binaries are in your PATH. The Anaconda installer offers to add them at the end of the installation process. If you decline, you may later add

export ANACONDA_HOME=where_you_installed_anaconda
export PATH=$ANACONDA_HOME/bin:$PATH

to your .bashrc.

Spark

Download Spark from https://spark.apache.org/downloads.html. Use version 3.2.0 (October 2021), prebuilt for Apache Hadoop 3.3. Untar the archive, then add the executables to your PATH and the Python libraries to your PYTHONPATH:

export SPARK_HOME=where_you_untarred_spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
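# The py4j file name must match the one shipped in $SPARK_HOME/python/lib (0.10.9.2 for Spark 3.2.0)
export PYTHONPATH="$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip"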
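The py4j file name in the PYTHONPATH line depends on the Spark release, so a hardcoded version can silently point to a file that does not exist. The helper below is a minimal sketch (the function name is hypothetical, not part of the course material) that looks up the actual file names under a Spark installation, assuming the usual `python/lib` layout of the prebuilt archives.

```python
import glob
import os


def pyspark_path_entries(spark_home):
    """Return the PYTHONPATH entries for a Spark install, or raise if they are missing."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    # The py4j archive is versioned, so match it by pattern instead of by exact name
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not os.path.isfile(pyspark_zip) or not py4j_zips:
        raise FileNotFoundError(f"No PySpark libraries found under {lib_dir}")
    return [pyspark_zip, py4j_zips[-1]]
```

Calling `os.pathsep.join(pyspark_path_entries(os.environ["SPARK_HOME"]))` gives a string that can be pasted into the `export PYTHONPATH=...` line.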

Kafka

Download Kafka from https://kafka.apache.org/downloads, and untar the archive. Start ZooKeeper, then Kafka, with

export KAFKA_HOME=where_you_untarred_kafka
nohup $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties  > $HOME/zookeeper.log 2>&1 &
nohup $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties > $HOME/kafka.log 2>&1 &
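Since both services are started in the background with `nohup`, a startup failure only shows up in the log files. The sketch below is one way to check from Python that both processes accept TCP connections. It assumes the default ports of the stock configuration files (2181 for ZooKeeper, 9092 for Kafka) and uses hypothetical function names.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_kafka_stack(host="localhost", zookeeper_port=2181, kafka_port=9092):
    """Report which of the two services accepts connections (default ports assumed)."""
    return {
        "zookeeper": port_is_open(host, zookeeper_port),
        "kafka": port_is_open(host, kafka_port),
    }
```

An open port only shows that a process is listening. The scripts in Check_Setup remain the actual functional test.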

Keras and TensorFlow

Install with pip

pip install tensorflow
pip install keras

Notebook

Jupyter Notebook is part of Anaconda. Start it with

jupyter notebook

and open http://127.0.0.1:8888 in your browser.

Docker

To ease the setup of the environment, we also prepared a Docker container that provides a ready-to-use environment. See the docker folder for instructions on installing Docker, downloading the course container, and getting started with it.

Note that the Dockerfile script essentially follows the steps for the 'local' installation.

Check if your setup is working

After setting up your environment (either in Docker or on your own machine), you should be able to run the notebook and scripts in Check_Setup.

Spark - Test with Check_Setup notebook

  • Open notebook from Check_Setup/Demo_RDD_local.ipynb
  • Run all cells

Follow the instructions in Check_Setup/Demo_RDD_local.ipynb to access the Spark UI.

Kafka - Test with Check_Setup scripts

  1. Run the script Check_Setup/0_kafka_startup.sh to start Zookeeper and Kafka.
  2. Run the script Check_Setup/1_kafka_test_topic.sh to check whether a topic can be created and deleted successfully.
  3. In two separate terminals:
    1. First start Check_Setup/2_kafka_test_sender.sh, and try sending some messages by entering some text and ending each message with the Enter key.
    2. Then start Check_Setup/3_kafka_test_receiver.sh, and check that the messages sent by the sender are correctly received.

FAQ

Owner
Yann-Aël Le Borgne
Postdoc @ Machine Learning Group - Computer Science Department - Université Libre de Bruxelles - Belgium