Python Automated Machine Learning library for tabular data.

Overview

Read the Docs Lines of code GitHub issues GitHub Repo stars GitHub contributors


Logo

Simple but powerful Automated Machine Learning library for tabular data. It uses efficient in-memory SAP HANA algorithms to automate routine Data Science tasks.
πŸ“š Explore the docs Β»

🐞 Report Bug Β· πŸ†• Request Feature

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About the project

Disclaimer

This library is an open-source research project and is not part of any official SAP products.

What's this?

This is a simple but accurate Automated Machine Learning library. Based on SAP HANA powerful in-memory algorithms, it provides high accuracy in multiple machine learning tasks. Our library also uses numerous data preprocessing functions to automate routine data cleaning tasks. So, hana_automl goes through all AutoML steps and makes Data Science work easier.

What is SAP HANA?

From www.sap.com: SAP HANA is a high-performance in-memory database that speeds data-driven, real-time decisions and actions.

Web app

https://share.streamlit.io/dan0nchik/sap-hana-automl/main/web.py

Documentation

https://sap-hana-automl.readthedocs.io/en/latest/index.html

Benchmarks

https://github.com/dan0nchik/SAP-HANA-AutoML/blob/main/comparison_openml.ipynb

ML tasks:

  • Binary classification
  • Regression
  • Multiclass classification
  • Forecasting

Steps automated:

  • Data exploration
  • Data preparation
  • Feature engineering
  • Model selection
  • Model training
  • Hyperparameter tuning

πŸ‘‡ By the end of summer 2021, blue part will be fully automated by our library Logo

Clients

Streamlit client Streamlit client

Built With

Getting Started

To get a package up and running, follow these simple steps.

Prerequisites

Make sure you have the following:

  1. βœ… Setup SAP HANA (skip this step if you have an instance with PAL enabled). There are 2 ways to do that.
    In HANA Cloud:

    • Create a free trial account
    • Setup an instance
    • Enable PAL - Predictive Analysis Library. It is vital to enable it because we use their algorithms.

    In Virtual Machine:

    • Rent a virtual machine in Azure, AWS, Google Cloud, etc.
    • Install HANA instance there or on your PC (if you have >32 Gb RAM).
    • Enable PAL - Predictive Analysis Library. It is vital to enable it because we use their algorithms.
  2. βœ… Installed software

  • Python > 3.6
    Skip this step if python --version returns > 3.6
  • Cython
    pip3 install Cython

Installation

There are 2 ways to install the library

  • Stable: from pypi
    pip3 install hana_automl
  • Latest: from the repository
    pip3 install https://github.com/dan0nchik/SAP-HANA-AutoML/archive/dev.zip
    Note: latest version may contain bugs, be careful!

After installation

Check that PAL (Predictive Analysis Library) is installed and roles are granted

  • Read docs section about that.
  • If you don't want to read docs, run this code
    from hana_automl.utils.scripts import setup_user
    from hana_ml.dataframe import ConnectionContext
    
    cc = ConnectionContext(address='address', user='user', password='password', port=39015)
    
    # replace with credentials of user that will be created or granted a role to run PAL.
    setup_user(connection_context=cc, username='user', password="password")

Usage

From code

Our library in a few lines of code

Connect to database.

from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext(address='address',
                     user='username',
                     password='password',
                     port=1234)

Create AutoML model and fit it.

from hana_automl.automl import AutoML

model = AutoML(cc)
model.fit(
  file_path='path to training dataset', # it may be HANA table/view, or pandas DataFrame
  steps=10, # number of iterations
  target='target', # column to predict
  time_limit=120 # time limit in seconds
)

Predict.

model.predict(
file_path='path to test dataset',
id_column='ID',
verbose=1
)

For more examples, please refer to the Documentation

How to run Streamlit client

  1. Clone repository: git clone https://github.com/dan0nchik/SAP-HANA-AutoML.git
  2. Install dependencies: pip3 install -r requirements.txt
  3. Run GUI: streamlit run ./web.py

Roadmap

See the open issues for a list of proposed features (and known issues). Feel free to report any bugs :)

Contributing

Any contributions you make are greatly appreciated πŸ‘ !

  1. Fork the Project

  2. Create your Feature Branch (git checkout -b feature/NewFeature)

  3. Install dependencies

    pip3 install Cython
    pip3 install -r requirements.txt
  4. Create credentials.py file in tests directory Your files should look like this:

    SAP-HANA-AutoML
    β”‚   README.md
    β”‚   all other files   
    β”‚   .....
    |
    └───tests
        β”‚   test files...
        β”‚   credentials.py
    

    Copy and paste this piece of code there and replace it with your credentials:

    host = "host"
    user = "username"
    password = "password"
    port = 39015 # or any port you need
    schema = "your schema"

    Don't worry, this file is in .gitignore, so your credentials won't be seen by anyone.

  5. Make some changes

  6. Write tests that cover your code in tests directory

  7. Run tests (under SAP-HANA-AutoML directory)

    pytest
  8. Commit your changes (git commit -m 'Add some amazing features')

  9. Push to the branch (git push origin feature/AmazingFeature)

  10. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.
Don't really understand license? Check out the MIT license summary.

Contact

Authors: @While-true-codeanything, @DbusAI, @dan0nchik

Project Link: https://github.com/dan0nchik/SAP-HANA-AutoML

Owner
Daniel Khromov
Learning Swift, C#, and Data Science
Daniel Khromov
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

alkaline-ml 1.3k Dec 22, 2022
Merlion: A Machine Learning Framework for Time Series Intelligence

Merlion is a Python library for time series intelligence. It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processi

Salesforce 2.8k Jan 05, 2023
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries. Documenta

2.5k Jan 07, 2023
A python library for easy manipulation and forecasting of time series.

Time Series Made Easy in Python darts is a python library for easy manipulation and forecasting of time series. It contains a variety of models, from

Unit8 5.2k Jan 04, 2023
K-means clustering is a method used for clustering analysis, especially in data mining and statistics.

K Means Algorithm What is K Means This algorithm is an iterative algorithm that partitions the dataset according to their features into K number of pr

1 Nov 01, 2021
Machine learning template for projects based on sklearn library.

Machine learning template for projects based on sklearn library.

Janez Lapajne 17 Oct 28, 2022
This repo implements a Topological SLAM: Deep Visual Odometry with Long Term Place Recognition (Loop Closure Detection)

This repo implements a topological SLAM system. Deep Visual Odometry (DF-VO) and Visual Place Recognition are combined to form the topological SLAM system.

Best of Australian Centre for Robotic Vision (ACRV) 32 Jun 23, 2022
In this Repo a simple Sklearn Model will be trained and pushed to MLFlow

SKlearn_to_MLFLow In this Repo a simple Sklearn Model will be trained and pushed to MLFlow Install This Repo is based on poetry python3 -m venv .venv

1 Dec 13, 2021
A library of sklearn compatible categorical variable encoders

Categorical Encoding Methods A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques

2.1k Jan 07, 2023
An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models

Seldon Core: Blazing Fast, Industry-Ready ML An open source platform to deploy your machine learning models on Kubernetes at massive scale. Overview S

Seldon 3.5k Jan 01, 2023
This is an auto-ML tool specialized in detecting of outliers

Auto-ML tool specialized in detecting of outliers Description This tool will allows you, with a Dash visualization, to compare 10 models of machine le

1 Nov 03, 2021
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

TD Ameritrade 2.5k Jan 06, 2023
Production Grade Machine Learning Service

This project is made to help you scale from a basic Machine Learning project for research purposes to a production grade Machine Learning web service

Abdullah Zaiter 10 Apr 04, 2022
Land Cover Classification Random Forest

You can perform Land Cover Classification on Satellite Images using Random Forest and visualize the result using Earthpy package. Make sure to install the required packages and such as

Dr. Sander Ali Khowaja 1 Jan 21, 2022
monolish: MONOlithic Liner equation Solvers for Highly-parallel architecture

monolish is a linear equation solver library that monolithically fuses variable data type, matrix structures, matrix data format, vendor specific data transfer APIs, and vendor specific numerical alg

RICOS Co. Ltd. 179 Dec 21, 2022
The MLOps is the process of continuous integration and continuous delivery of Machine Learning artifacts as a software product, keeping it inside a loop of Design, Model Development and Operations.

MLOps The MLOps is the process of continuous integration and continuous delivery of Machine Learning artifacts as a software product, keeping it insid

Maykon Schots 25 Nov 27, 2022
dirty_cat is a Python module for machine-learning on dirty categorical variables.

dirty_cat dirty_cat is a Python module for machine-learning on dirty categorical variables.

637 Dec 29, 2022
My capstone project for Udacity's Machine Learning Nanodegree

MLND-Capstone My capstone project for Udacity's Machine Learning Nanodegree Lane Detection with Deep Learning In this project, I use a deep learning-b

Michael Virgo 407 Dec 12, 2022
The easy way to combine mlflow, hydra and optuna into one machine learning pipeline.

mlflow_hydra_optuna_the_easy_way The easy way to combine mlflow, hydra and optuna into one machine learning pipeline. Objective TODO Usage 1. build do

shibuiwilliam 9 Sep 09, 2022
(3D): LeGO-LOAM, LIO-SAM, and LVI-SAM installation and application

SLAM-application: installation and test (3D): LeGO-LOAM, LIO-SAM, and LVI-SAM Tested on Quadruped robot in Gazebo ● Results: video, video2 Requirement

EungChang-Mason-Lee 203 Dec 26, 2022