Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning

Overview

MLJAR Automated Machine Learning for Humans


Documentation: https://supervised.mljar.com/

Source Code: https://github.com/mljar/mljar-supervised


Automated Machine Learning 🚀

mljar-supervised is an Automated Machine Learning Python package that works with tabular data. It is designed to save time for data scientists. It abstracts the common way to preprocess the data, construct the machine learning models, and perform hyper-parameter tuning to find the best model 🏆. It is not a black box: you can see exactly how the ML pipeline is constructed (with a detailed Markdown report for each ML model).

The mljar-supervised will help you with:

  • explaining and understanding your data (Automatic Exploratory Data Analysis),
  • trying many different machine learning models (Algorithm Selection and Hyper-Parameters tuning),
  • creating Markdown reports from analysis with details about all models (Automatic Documentation),
  • saving, re-running and loading the analysis and ML models.

It has four built-in modes of work:

  • Explain mode, which is ideal for explaining and understanding the data, with many data explanations such as decision tree visualization, linear model coefficients, permutation importance and SHAP explanations,
  • Perform mode, for building ML pipelines to use in production,
  • Compete mode, which trains highly-tuned ML models with ensembling and stacking, for use in ML competitions,
  • Optuna mode, which searches for highly-tuned ML models; use it when performance matters most and computation time is not limited (available from version 0.10.0).

Of course, you can further customize the details of each mode to meet the requirements.
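
As a rough illustration of such customization, the sketch below collects a few constructor arguments in a dictionary and passes them to AutoML. The parameter names all appear elsewhere in this README or in the issues quoted at the end; the values are assumptions chosen for the example. The import is guarded so that the snippet also runs where the package is not installed.

```python
# Illustrative settings only: the parameter names are taken from this README,
# the values are assumptions made for the example.
automl_params = {
    "mode": "Compete",                       # start from one of the built-in modes
    "total_time_limit": 3600,                # seconds for the whole run (assumed value)
    "algorithms": ["LightGBM", "CatBoost"],  # restrict the algorithms that are tried
    "stack_models": True,                    # build a level 2 ensemble
    "explain_level": 0,                      # switch explanations off
}

try:
    from supervised.automl import AutoML
except ImportError:
    AutoML = None  # package not installed; the dictionary still shows the intent

if AutoML is not None:
    automl = AutoML(**automl_params)
    # automl.fit(X_train, y_train) would start the training here
```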

What's good in it? 💥

  • It uses many algorithms: Baseline, Linear, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Networks, and Nearest Neighbors.
  • It can compute an Ensemble based on the greedy algorithm from the Caruana paper.
  • It can stack models to build a level 2 ensemble (available in Compete mode or after setting the stack_models parameter).
  • It can do feature preprocessing, such as missing value imputation and converting categoricals. It can also handle target value preprocessing.
  • It can do advanced feature engineering: Golden Features, Features Selection, Text and Time Transformations.
  • It can tune hyper-parameters with a not-so-random-search algorithm (random search over a defined set of values) and hill climbing to fine-tune final models.
  • It can compute the Baseline for your data, so you will know whether you need Machine Learning or not!
  • It has extensive explanations. The package trains simple Decision Trees with max_depth <= 5, so you can easily visualize them with dtreeviz to better understand your data.
  • It uses simple linear regression and includes its coefficients in the summary report, so you can check which features are used the most in the linear model.
  • It cares about the explainability of models: for every algorithm, feature importance is computed based on permutation. Additionally, SHAP explanations are computed for every algorithm: feature importance, dependence plots, and decision plots (explanations can be switched off with the explain_level parameter).
  • There is automatic documentation for every ML experiment run with AutoML. mljar-supervised creates Markdown reports from AutoML training, full of ML details, metrics and charts.
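
The tuning idea mentioned in the list above (random search over a defined set of values, followed by hill climbing) can be sketched in a few lines. This is a toy illustration, not the package's tuner: the search space, the `toy_score` function and all names are hypothetical, and a real run would train and validate a model instead of evaluating a formula.

```python
import random

# Hypothetical search space: every hyper-parameter has a fixed list of allowed values.
SEARCH_SPACE = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}


def toy_score(params):
    """Stand-in for a validation loss (lower is better)."""
    return (params["max_depth"] - 5) ** 2 + (params["learning_rate"] - 0.1) ** 2


def random_search(space, n_trials, rng):
    """Draw n_trials configurations from the predefined values and keep the best."""
    trials = [{name: rng.choice(values) for name, values in space.items()}
              for _ in range(n_trials)]
    return min(trials, key=toy_score)


def hill_climb(space, start, max_steps=20):
    """Move one hyper-parameter at a time to a neighbouring allowed value
    for as long as the score keeps improving."""
    best = dict(start)
    for _ in range(max_steps):
        improved = False
        for name, values in space.items():
            idx = values.index(best[name])
            for j in (idx - 1, idx + 1):
                if 0 <= j < len(values):
                    candidate = {**best, name: values[j]}
                    if toy_score(candidate) < toy_score(best):
                        best, improved = candidate, True
        if not improved:
            break
    return best


rng = random.Random(0)
start = random_search(SEARCH_SPACE, n_trials=3, rng=rng)
tuned = hill_climb(SEARCH_SPACE, start)
print(start, "->", tuned)
```

Random search gives a reasonable starting point cheaply; hill climbing then explores only the immediate neighbourhood of that point, which is why it suits the final fine-tuning step.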

Automatic Documentation

The AutoML Report

The report from running AutoML contains a table with information about each model's score and the time needed to train it. For each model there is a link that you can click to see the model's details. The performance of all ML models is presented as scatter and box plots, so you can visually inspect which algorithms perform best 🏆.

AutoML leaderboard

The Decision Tree Report

An example of a Decision Tree summary with tree visualization. For classification tasks additional metrics are provided:

  • confusion matrix
  • threshold (optimized in the case of binary classification task)
  • F1 score
  • Accuracy
  • Precision, Recall, MCC

Decision Tree summary
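
To make the metrics listed above concrete, here is a small self-contained sketch that derives them from confusion-matrix counts and picks a decision threshold for a binary task. The README does not say which metric the threshold is optimized for, so using F1 here is an assumption; the function names and the toy data are hypothetical.

```python
import math


def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, fn, tn) for labels in {0, 1} at the given threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, accuracy and MCC from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}


def best_threshold(y_true, y_prob, metric="f1"):
    """Try every distinct predicted probability as a threshold and keep the best one."""
    candidates = sorted(set(y_prob))
    return max(candidates,
               key=lambda t: metrics(*confusion_counts(y_true, y_prob, t))[metric])


y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
threshold = best_threshold(y_true, y_prob)
report = metrics(*confusion_counts(y_true, y_prob, threshold))
print(threshold, report)
```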

The LightGBM Report

An example of a LightGBM summary:

LightGBM summary

Available Modes 📚

Details about the AutoML modes are presented in a table in the docs.

Explain

automl = AutoML(mode="Explain")

It is intended for users who want to explain and understand the data.

  • It uses a 75%/25% train/test split.
  • It uses: Baseline, Linear, Decision Tree, Random Forest, Xgboost, Neural Network algorithms and ensemble.
  • It has full explanations: learning curves, importance plots, and SHAP plots.
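
The importance plots rely on permutation-based feature importance. The idea can be sketched without any library: shuffle one column at a time and measure how much the error grows. This is a generic illustration with hypothetical names and a toy model, not the package's implementation.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Importance of feature j = mean increase in MSE after shuffling column j."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Toy model that depends on feature 0 only and ignores feature 1."""
    return 3.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

Because the toy model ignores the second feature, shuffling it cannot change the predictions, so its importance is zero, while shuffling the first feature increases the error.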

Perform

automl = AutoML(mode="Perform")

It should be used when the user wants to train a model that will be used in real-life use cases.

  • It uses 5-fold CV.
  • It uses: Linear, Random Forest, LightGBM, Xgboost, CatBoost and Neural Network. It uses ensembling.
  • It has learning curves and importance plots in reports.

Compete

automl = AutoML(mode="Compete")

It should be used for machine learning competitions.

  • It adapts the validation strategy depending on dataset size and total_time_limit. It can be: train/test split (80/20), 5-fold CV or 10-fold CV.
  • It uses: Linear, Decision Tree, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Network and Nearest Neighbors. It uses ensemble and stacking.
  • It has only learning curves in the reports.

Optuna

automl = AutoML(mode="Optuna", optuna_time_budget=3600)

It should be used when the performance is the most important and time is not limited.

  • It uses 10-fold CV.
  • It uses: Random Forest, Extra Trees, LightGBM, Xgboost, and CatBoost. Each algorithm is tuned by the Optuna framework for optuna_time_budget seconds. Algorithms are tuned on the original data, without advanced feature engineering.
  • It uses advanced feature engineering, stacking and ensembling. The hyper-parameters found for the original data are reused in those steps.
  • It produces learning curves in the reports.

Examples

👉 Binary Classification Example

There is a simple interface available with fit and predict methods.

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
    skipinitialspace=True,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[:-1]], df["income"], test_size=0.25
)

automl = AutoML()
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)

AutoML fit will print:

Create directory AutoML_1
AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will optimize for metric: logloss
1_Baseline final logloss 0.5519845471086654 time 0.08 seconds
2_DecisionTree final logloss 0.3655910192804364 time 10.28 seconds
3_Linear final logloss 0.38139916864708445 time 3.19 seconds
4_Default_RandomForest final logloss 0.2975204390214936 time 79.19 seconds
5_Default_Xgboost final logloss 0.2731086827200411 time 5.17 seconds
6_Default_NeuralNetwork final logloss 0.319812276905242 time 21.19 seconds
Ensemble final logloss 0.2731086821194617 time 1.43 seconds

The run produces the following reports:

  • the AutoML results in a Markdown report,
  • the Xgboost Markdown report; please take a look at the dependence plots produced by the SHAP package 💖,
  • the Decision Tree Markdown report; please take a look at the tree visualization,
  • the Logistic Regression Markdown report; please take a look at the coefficients table, and compare the SHAP plots between Xgboost, Decision Tree and Logistic Regression.
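
The Ensemble entry in the log above combines the trained models. The README attributes the method to the greedy algorithm from the Caruana paper; the sketch below shows the general idea of that algorithm (selection with replacement, stopping when the validation loss no longer improves). The function names and the toy predictions are hypothetical, and the details of the package's own implementation may differ.

```python
import math
from collections import Counter


def logloss(y_true, y_prob, eps=1e-15):
    """Binary log loss with clipping to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def greedy_ensemble(y_true, model_preds, max_rounds=20):
    """Greedy selection with replacement.

    model_preds maps a model name to its validation predictions.
    Returns (Counter of how often each model was selected, ensemble loss).
    """
    selected = Counter()
    running_sum = [0.0] * len(y_true)
    best_loss = float("inf")
    for _ in range(max_rounds):
        size = sum(selected.values()) + 1
        round_best = None
        for name, preds in model_preds.items():
            # Loss of the averaged ensemble if this model were added once more.
            avg = [(s + p) / size for s, p in zip(running_sum, preds)]
            loss = logloss(y_true, avg)
            if round_best is None or loss < round_best[1]:
                round_best = (name, loss)
        if round_best[1] >= best_loss:
            break  # no candidate improves the ensemble any further
        name, best_loss = round_best
        selected[name] += 1
        running_sum = [s + p for s, p in zip(running_sum, model_preds[name])]
    return selected, best_loss


y_valid = [1, 0, 1, 0]
preds = {
    "A": [0.9, 0.4, 0.6, 0.1],
    "B": [0.6, 0.1, 0.9, 0.4],
    "C": [0.5, 0.5, 0.5, 0.5],
}
selected, loss = greedy_ensemble(y_valid, preds)
print(dict(selected), loss)
```

In this toy case, averaging models A and B beats either of them alone, while adding the uninformative model C never helps, so it is left out.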

👉 Multi-Class Classification Example

Example code for classifying the optical recognition of handwritten digits dataset. Running this code for less than 30 minutes results in a test accuracy of ~98%.

import pandas as pd
# scikit-learn utilities
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# mljar-supervised package
from supervised.automl import AutoML

# load the data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(digits.data), digits.target, stratify=digits.target, test_size=0.25,
    random_state=123
)

# train models with AutoML
automl = AutoML(mode="Perform")
automl.fit(X_train, y_train)

# compute the accuracy on test data
predictions = automl.predict_all(X_test)
print(predictions.head())
print("Test accuracy:", accuracy_score(y_test, predictions["label"].astype(int)))

👉 Regression Example

A regression example on the Boston house prices data. On test data it scores a mean squared error (MSE) of ~10.85.

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML # mljar-supervised

# Load the data
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(housing.data, columns=housing.feature_names),
    housing.target,
    test_size=0.25,
    random_state=123,
)

# train models with AutoML
automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

# compute the MSE on test data
predictions = automl.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))

👉 More Examples

Documentation 📚

For details please check mljar-supervised docs.

Installation 📦

From the PyPI repository:

pip install mljar-supervised

From source code:

git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
python setup.py install

Installation for development

git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
virtualenv venv --python=python3.6
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements_dev.txt

Running in Docker:

FROM python:3.7-slim-buster
RUN apt-get update && apt-get -y update
RUN apt-get install -y build-essential python3-pip python3-dev
RUN pip3 -q install pip --upgrade
RUN pip3 install mljar-supervised jupyter
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]

Contributing

To get started take a look at our Contribution Guide for information about our process and where you can fit in!

Contributors

License 👔

mljar-supervised is provided under the MIT license.

MLJAR ❤️

mljar-supervised is an open-source project created by MLJAR. We care about ease of use in Machine Learning. mljar.com provides a beautiful and simple user interface for building machine learning models.

Comments
  • model structure differences between ensemble/stacked/ensemble_stacked

    Good day! I am reading your manual now but can't tell the model structure differences between ensemble/stacked/ensemble_stacked...

    Following pictures are json files from the example code and the questions are listed below, could you please help to answer them?

    1. The meaning of "repeat" here.
    2. How can I understand the model structure for these three pictures?

    ensemble.json image

    Optuna_extratrees_stacked/framework.json image

    Ensemble_stacked/ensemble.json image

    Best regards

    opened by Tonywhitemin 29
  • I have been facing this issue for 2 days. I have no idea what's causing it.

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-2-ea67f5362246> in <module>
          3 from sklearn.model_selection import train_test_split
          4 from sklearn.metrics import mean_squared_error
    ----> 5 from supervised.automl import AutoML # mljar-supervised
    
    /opt/conda/lib/python3.7/site-packages/supervised/__init__.py in <module>
          1 __version__ = "0.7.15"
          2 
    ----> 3 from supervised.automl import AutoML
    
    /opt/conda/lib/python3.7/site-packages/supervised/automl.py in <module>
          1 import logging
          2 
    ----> 3 from supervised.base_automl import BaseAutoML
          4 
          5 from supervised.utils.config import LOG_LEVEL
    
    /opt/conda/lib/python3.7/site-packages/supervised/base_automl.py in <module>
         17 from sklearn.metrics import r2_score, accuracy_score
         18 
    ---> 19 from supervised.algorithms.registry import AlgorithmsRegistry
         20 from supervised.algorithms.registry import BINARY_CLASSIFICATION
         21 from supervised.algorithms.registry import MULTICLASS_CLASSIFICATION
    
    /opt/conda/lib/python3.7/site-packages/supervised/algorithms/registry.py in <module>
         62 # Import algorithm to be registered
         63 import supervised.algorithms.random_forest
    ---> 64 import supervised.algorithms.xgboost
         65 import supervised.algorithms.decision_tree
         66 import supervised.algorithms.baseline
    
    /opt/conda/lib/python3.7/site-packages/supervised/algorithms/xgboost.py in <module>
          4 import pandas as pd
          5 import os
    ----> 6 import xgboost as xgb
          7 
          8 from supervised.algorithms.algorithm import BaseAlgorithm
    
    /opt/conda/lib/python3.7/site-packages/xgboost/__init__.py in <module>
          7 import os
          8 
    ----> 9 from .core import DMatrix, DeviceQuantileDMatrix, Booster
         10 from .training import train, cv
         11 from . import rabit  # noqa
    
    /opt/conda/lib/python3.7/site-packages/xgboost/core.py in <module>
         17 import scipy.sparse
         18 
    ---> 19 from .compat import (
         20     STRING_TYPES, DataFrame, py_str,
         21     PANDAS_INSTALLED,
    
    /opt/conda/lib/python3.7/site-packages/xgboost/compat.py in <module>
        106 # cudf
        107 try:
    --> 108     from cudf import concat as CUDF_concat
        109 except ImportError:
        110     CUDF_concat = None
    
    /opt/conda/lib/python3.7/site-packages/cudf/__init__.py in <module>
          9 import rmm
         10 
    ---> 11 from cudf import core, datasets, testing
         12 from cudf._version import get_versions
         13 from cudf.core import (
    
    /opt/conda/lib/python3.7/site-packages/cudf/core/__init__.py in <module>
          1 # Copyright (c) 2018-2019, NVIDIA CORPORATION.
    ----> 2 from cudf.core import buffer, column
          3 from cudf.core.buffer import Buffer
          4 from cudf.core.dataframe import DataFrame, from_pandas, merge
          5 from cudf.core.index import (
    
    /opt/conda/lib/python3.7/site-packages/cudf/core/column/__init__.py in <module>
          1 # Copyright (c) 2020, NVIDIA CORPORATION.
          2 
    ----> 3 from cudf.core.column.categorical import CategoricalColumn
          4 from cudf.core.column.column import (
          5     ColumnBase,
    
    /opt/conda/lib/python3.7/site-packages/cudf/core/column/categorical.py in <module>
          6 
          7 import cudf
    ----> 8 from cudf import _lib as libcudf
          9 from cudf._lib.transform import bools_to_mask
         10 from cudf.core.buffer import Buffer
    
    /opt/conda/lib/python3.7/site-packages/cudf/_lib/__init__.py in <module>
          2 import numpy as np
          3 
    ----> 4 from . import (
          5     avro,
          6     binaryop,
    
    cudf/_lib/gpuarrow.pyx in init cudf._lib.gpuarrow()
    
    AttributeError: module 'pyarrow.lib' has no attribute 'IpcWriteOptions'
    
    installation 
    opened by kingabzpro 27
  • Custom eval metric

    Hello,

    I was wondering if it's possible to add fully custom eval metrics.

    My specific use case is one where I would like to add up values of an arbitrary vector for all predictions that exceed a (percentile) threshold. In general, however, it would be great to have the option to decouple the eval metric used for fitting from the one used for evaluation/tuning.

    Ideally, the user would be able to supply a function such as those in sklearn.metrics, accepting target values and predictions and returning a float. Whether to assume minimisation or maximisation (or have an extra parameter) isn't particularly important, imho.

    Thoughts?
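
To make the request concrete, a metric of the kind described could look like the sketch below. It is a standalone function with hypothetical names, not an mljar-supervised API; the nearest-rank percentile and the strict "greater than" comparison are assumptions about what "exceed a (percentile) threshold" means.

```python
import math


def sum_above_percentile(values, predictions, percentile=90.0):
    """Sum values[i] for every prediction strictly above the given percentile
    of the predictions (nearest-rank definition)."""
    ranked = sorted(predictions)
    # Nearest-rank percentile: index of the smallest prediction that has at
    # least `percentile` percent of all predictions at or below it.
    k = max(0, math.ceil(percentile * len(ranked) / 100.0) - 1)
    cutoff = ranked[k]
    return float(sum(v for v, p in zip(values, predictions) if p > cutoff))


predictions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
score = sum_above_percentile(values, predictions, percentile=80.0)
print(score)
```

Here the 80th percentile of the predictions is 0.8, so only the two highest predictions pass the threshold and their values are summed.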

    enhancement 
    opened by ecod3r 26
  • Can not load saved model

    When I reloaded my model to do prediction, I got the following error:


    KeyError                                  Traceback (most recent call last)
    ~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in load(self, path)
        184                 ):
    --> 185                     ens = Ensemble.load(path, model_subpath, models_map)
        186                     self._models += [ens]

    ~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/ensemble.py in load(results_path, model_subpath, models_map)
        436             ensemble.selected_models += [
    --> 437                 {"model": models_map[m["model"]], "repeat": m["repeat"]}
        438             ]

    KeyError: '15_LightGBM'

    During handling of the above exception, another exception occurred:

    AutoMLException                           Traceback (most recent call last)
    in
    ----> 1 automl.predict(X_test)

    ~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/automl.py in predict(self, X)
        346             AutoMLException: Model has not yet been fitted.
        347         """
    --> 348         return self._predict(X)
        349
        350     def predict_proba(self, X):

    ~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in _predict(self, X)
       1298     def _predict(self, X):
       1299
    -> 1300         predictions = self._base_predict(X)
       1301         # Return predictions
       1302         # If classification task the result is in column 'label'

    ~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in _base_predict(self, X, model)
       1230         if model is None:
       1231             if self._best_model is None:
    -> 1232                 self.load(self.results_path)
       1233             model = self._best_model
       1234

    ~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in load(self, path)
        211
        212         except Exception as e:
    --> 213             raise AutoMLException(f"Cannot load AutoML directory. {str(e)}")
        214
        215     def get_leaderboard(

    AutoMLException: Cannot load AutoML directory. '15_LightGBM'

    When I tried to refit it, it said: "This model has already been fitted. You can use predict methods or select a new 'results_path' for a new 'fit()'". I used the same method to train 5 models; the other 3 models are okay, but two had this error.

    I used pip install -q -U git+https://github.com/mljar/[email protected] to reinstall your package. I think there are bugs when you updated LightGBM.

    bug 
    opened by xuzhang5788 21
  • Support for r2 metric in Optuna mode

    Currently, r2 metric evaluation is not supported in the tuner/optuna/tuner.py file.

    if eval_metric.name not in ["auc", "logloss", "rmse", "mae", "mape"]:
        raise AutoMLException(f"Metric {eval_metric.name} is not supported")

    When I manually add 'r2' to the list, I encounter the following error.

    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1054, in _fit
        trained = self.train_model(params)
      File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 356, in train_model
        mf.train(results_path, model_subpath)
      File "/usr/local/lib/python3.8/dist-packages/supervised/model_framework.py", line 185, in train
        self.learner_params = optuna_tuner.optimize(
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/tuner.py", line 106, in optimize
        objective = LightgbmObjective(
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/lightgbm.py", line 61, in __init__
        self.eval_metric_name = metric_name_mapping[ml_task][self.eval_metric.name]
    KeyError: 'r2'

    Is this a known limitation, and if so, is there a way to work around it?

    enhancement help wanted 
    opened by Possums 21
  • unable to load models

    Hello, I trained some models and gave a folder to save them in, but when I try to load the models with the command below, I get an error.

    automl = AutoML(
      mode="Compete",
      model_time_limit=(15)*60,
      n_jobs=-1,
      results_path="/media/autosk4/",
      explain_level=0,  
      algorithms=["LightGBM","CatBoost"],
      start_random_models=2
    )
    
    2021-04-17 09:17:50,775 supervised.exceptions ERROR Cannot load AutoML directory. '1_Default_LightGBM_GoldenFeatures'
    
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    ~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in load(self, path)
        185                 ):
    --> 186                     ens = Ensemble.load(path, model_subpath, models_map)
        187                     self._models += [ens]
    
    ~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/ensemble.py in load(results_path, model_subpath, models_map)
        436             ensemble.selected_models += [
    --> 437                 {"model": models_map[m["model"]], "repeat": m["repeat"]}
        438             ]
    
    KeyError: '1_Default_LightGBM_GoldenFeatures'
    
    During handling of the above exception, another exception occurred:
    
    AutoMLException                           Traceback (most recent call last)
    <ipython-input-6-437ae6b31a0f> in <module>
          6                 algorithms=["LightGBM","CatBoost"],start_random_models=2)
          7 
    ----> 8 predictions = automl.predict(X_test)
          9 
         10 predictions[X_test['momkene_out']!=2]=0
    
    ~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/automl.py in predict(self, X)
        344             AutoMLException: Model has not yet been fitted.
        345         """
    --> 346         return self._predict(X)
        347 
        348     def predict_proba(self, X):
    
    ~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in _predict(self, X)
       1298     def _predict(self, X):
       1299 
    -> 1300         predictions = self._base_predict(X)
       1301         # Return predictions
       1302         # If classification task the result is in column 'label'
    
    ~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in _base_predict(self, X, model)
       1230         if model is None:
       1231             if self._best_model is None:
    -> 1232                 self.load(self.results_path)
       1233             model = self._best_model
       1234 
    
    ~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in load(self, path)
        212 
        213         except Exception as e:
    --> 214             raise AutoMLException(f"Cannot load AutoML directory. {str(e)}")
        215 
        216     def get_leaderboard(
    
    AutoMLException: Cannot load AutoML directory. '1_Default_LightGBM_GoldenFeatures'
    

    These are the files in the 1_default_light_... folder:

    framework.json		     learner_fold_2_training.log
    learner_fold_0.lightgbm      learning_curves.png
    learner_fold_0_training.log  predictions_out_of_folds.csv
    learner_fold_1.lightgbm      README.html
    learner_fold_1_training.log  README.md
    learner_fold_2.lightgbm      status.txt
    
    bug 
    opened by nasergh 18
  • Custom CV strategy

    I have trained an automl model, where the ensemble seems to work well with the test set. Thus, I want to try my own CV scheme in a 'leave one year out' way (removing a year, training on other years, and testing on selected year).

    For this, I need to be able to retrain the ensemble like a scikit-learn pipeline. How can I retrain the ensemble itself? The '.fit' function does not seem to follow the sklearn estimator convention (accepting a numpy array as input).
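
The 'leave one year out' scheme described in this issue can be expressed as a small index generator. The sketch below is library-independent and uses hypothetical names; whether and how such index pairs can be handed to AutoML depends on the installed version and is not confirmed here.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, test_idx) for every distinct year.

    `years` holds one entry per row of the dataset.
    """
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        test_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, test_idx


years = [2018, 2018, 2019, 2020, 2020, 2019]
folds = list(leave_one_year_out(years))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Each fold trains on all rows from the other years and tests on the rows of the held-out year, so no year appears on both sides of a split.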

    enhancement 
    opened by drorhilman 16
  • Saving mljar automl model for future use

    Hi, traditionally I have been using the pickle package to save models in a pkl file and reuse them continuously on live data. I see the mljar model has to json and from json methods. Could you please create a small PoC or example with documentation showing how we could reuse it for daily / live data? Thanks. :)

    bug enhancement 
    opened by vivek2319 14
  • Error in kaggle notebook kernel

    There is a problem with the pyarrow version in the Kaggle notebook. There is an error:

     [Errno 2] No such file or directory: 'AutoML_1/y.parquet'
    

    See kaggle comment for details: https://www.kaggle.com/mt77pp/mljar-autoeda-automl-prediction/comments#1197484

    bug 
    opened by pplonski 13
  • Ensemble model only using 2 models to ensemble

    When I look at the readme.md file of the Ensemble folder, it shows only 2 models out of so many others that it used to ensemble. Is there a reason for this? Also, when I look at the Ensemble_stacked, it shows just 1, "Ensemble" model as the one used for stack_ensemble.

    opened by alitirmizi23 11
  • Prediction time is taking longer than expected

    Hi! I've banged my head against the wall for a couple of days and can't solve this. Prediction times are taking much longer than is to be expected. Running an AutoML model for a regression task takes upwards of 3 seconds for a single prediction.

    I believe this is because the predict method of the AutoML class loads every saved model each time you ask for a prediction. It would be better to load all models once at initialization and call their prediction methods without reloading them each time.
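
The load-once pattern proposed in this issue can be sketched generically as follows. All names are hypothetical and the loader is a stand-in for reading model files from disk; the sketch says nothing about how AutoML.predict behaves internally.

```python
class CachedPredictor:
    """Load the models on first use, then reuse them for every prediction."""

    def __init__(self, loader):
        self._loader = loader  # callable that returns a list of model callables
        self._models = None
        self.load_count = 0    # counts how often the (slow) loader actually ran

    def _ensure_loaded(self):
        if self._models is None:
            self._models = self._loader()
            self.load_count += 1

    def predict(self, row):
        """Average the predictions of all cached models for a single row."""
        self._ensure_loaded()
        preds = [model(row) for model in self._models]
        return sum(preds) / len(preds)


def slow_loader():
    """Stand-in for reading several model files from disk."""
    return [lambda row: 2.0 * row[0], lambda row: 2.0 * row[0] + 1.0]


predictor = CachedPredictor(slow_loader)
results = [predictor.predict([x]) for x in (1.0, 2.0, 3.0)]
print(results, predictor.load_count)
```

Three predictions are served here, but the loader runs only once, which is the behaviour the issue asks for.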

    enhancement 
    opened by salomonMuriel 11
  • ERROR:supervised.exceptions:No models produced.

    I have tried installing in various ways on Google Colab, but all of them show the following results. Please help me.

    ERROR:supervised.exceptions:No models produced.
    Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.
    AutoML directory: AutoML_3
    The task is multiclass_classification with evaluation metric logloss
    AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
    AutoML will ensemble available models
    AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
    'Baseline'
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 195, in generate_params
        return self.simple_algorithms_params(models_cnt)
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 382, in simple_algorithms_params
        params = self._get_model_params(model_type, seed=i + 1)
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 954, in _get_model_params
        model_info = AlgorithmsRegistry.registry[self._ml_task][model_type]
    KeyError: 'Baseline'

    Skip simple_algorithms because no parameters were generated.
    'Xgboost'
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 197, in generate_params
        return self.default_params(models_cnt)
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 431, in default_params
        if self.skip_if_rows_cols_limit(model_type):
      File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 400, in skip_if_rows_cols_limit
        max_rows_limit = AlgorithmsRegistry.get_max_rows_limit(
      File "/usr/local/lib/python3.8/dist-packages/supervised/algorithms/registry.py", line 51, in get_max_rows_limit
        return AlgorithmsRegistry.registry[ml_task][algorithm_name]["additional"][
    KeyError: 'Xgboost'

    Skip default_algorithms because no parameters were generated.

    AutoMLException                           Traceback (most recent call last)
    in
    ----> 1 automl.fit(X, y)

    2 frames
    /usr/local/lib/python3.8/dist-packages/supervised/base_automl.py in _fit(self, X, y, sample_weight, cv)
       1049             if "hill_climbing" in step or step in ["ensemble", "stack"]:
       1050                 if len(self._models) == 0:
    -> 1051                     raise AutoMLException(
       1052                         "No models produced. \nPlease check your data or"
       1053                         " submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new."

    AutoMLException: No models produced.
    Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.

    opened by moonjoo98 5
  • standardize the project using `pipenv`

    At present, setting up the development environment is difficult and inefficient.

    • Various packages are not compatible with the latest version of python(3.11.x)
      • pip install numba does not work with python 3.11.x (screenshot attached below)
    • pytest version is also outdated, which no longer works. Updating pytest to latest version solves the problem (5.3.5 to 7.2.0)

    We can use package manager like pipenv to track all the dependencies used in the project.

    References

    pipenv documentation - https://pipenv.pypa.io/en/latest/

    I'm using the latest version of python viz 3.11.0

    Screenshot of pip install numba

    opened by nkilm 0
  • No matching distribution found for catboost>=0.24.4

    Steps to reproduce this error

    • create virtualenv
    • install requirements using pip install -r requirements.txt

    Following message will be displayed. image

    Had to install catboost separately using pip install catboost.

    opened by nkilm 0
  • change Arial font in base_automl.py to avoid issues in linux

    In base_automl.py there is the following configuration: font-family: Arial

    It causes Ubuntu Docker containers running mljar to write the following message repeatedly:

    WARNING | matplotlib.font_manager - findfont: Font family 'Arial' not found.

    Can this font be changed to some non-Office font so that this message disappears?

    enhancement 
    opened by yairVanti 6
  • No Shap outputs

    Hi, I'm not seeing any shap outputs when using the following:

    # Initialize AutoML in Explain Mode
    automl = AutoML(mode="Explain", 
                    explain_level=2,
                   ml_task='multiclass_classification')
    automl.fit(X, y)
    

    This is in spite of shap being properly installed. What I get out of the above code is the following:

    AutoML directory: AutoML_7
    The task is multiclass_classification with evaluation metric logloss
    AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Neural Network']
    AutoML will ensemble available models
    AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
    * Step simple_algorithms will try to check up to 3 models
    1_Baseline logloss 3.229533 trained in 25.56 seconds
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    2_DecisionTree logloss 2.15877 trained in 59.34 seconds
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    lbfgs failed to converge (status=1):
    STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
    
    Increase the number of iterations (max_iter) or scale the data as shown in:
        https://scikit-learn.org/stable/modules/preprocessing.html
    Please also refer to the documentation for alternative solver options:
        https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
    3_Linear logloss 1.707406 trained in 47.68 seconds
    * Step default_algorithms will try to check up to 2 models
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    4_Default_NeuralNetwork logloss 4.045366 trained in 7.02 seconds
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    5_Default_RandomForest logloss 1.858415 trained in 75.39 seconds
    * Step ensemble will try to check up to 1 model
    Ensemble logloss 1.288517 trained in 0.56 seconds
    AutoML fit time: 226.47 seconds
    AutoML best model: Ensemble
    AutoML(explain_level=2, ml_task='multiclass_classification')
    
    bug help wanted 
    opened by dbrami 3
  • ensemble.json not found error when training in Compete mode with total_time_limit

    After training in Compete mode, I get this error when trying to load the model:

    automl = AutoML(mode='Compete', results_path=model_path, total_time_limit=24*3600, eval_metric=sign_penalty)
    automl_trained = AutoML(results_path=model_path)
    automl_predictions = automl_trained.predict(X_test)
    
    FileNotFoundError                         Traceback (most recent call last)
    File c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py:199, in BaseAutoML.load(self, path)
        196 if model_subpath.endswith("Ensemble") or model_subpath.endswith(
        197     "Ensemble_Stacked"
        198 ):
    --> 199     ens = Ensemble.load(path, model_subpath, models_map)
        200     self._models += [ens]
    
    File c:\ProgramData\Anaconda3\lib\site-packages\supervised\ensemble.py:435, in Ensemble.load(results_path, model_subpath, models_map)
        433 logger.info(f"Loading ensemble from {model_path}")
    --> 435 json_desc = json.load(open(os.path.join(model_path, "ensemble.json")))
        437 ensemble = Ensemble(json_desc.get("optimize_metric"), json_desc.get("ml_task"))
    
    FileNotFoundError: [Errno 2] No such file or directory: 'trained_models/Compete_%_change_close_BTCUSDT_spot_15m_custom_loss+2h\\Ensemble\\ensemble.json'
    
    During handling of the above exception, another exception occurred:
    
    AutoMLException                           Traceback (most recent call last)
    c:\dev\Python\Mastermind\mastermind\training\LAB_MLJAR_custom_loss.ipynb Cell 15 in <cell line: 2>()
          1 automl_trained = AutoML(results_path=model_path)
    ----> 2 automl_predictions = automl_trained.predict(X_test)
          3 pd.Series(automl_predictions).describe()
    
    File c:\ProgramData\Anaconda3\lib\site-packages\supervised\automl.py:387, in AutoML.predict(self, X)
    ...
        223         self.n_classes = self._data_info["n_classes"]
        225 except Exception as e:
    --> 226     raise AutoMLException(f"Cannot load AutoML directory. {str(e)}")
    
    AutoMLException: Cannot load AutoML directory. [Errno 2] No such file or directory: 'trained_models/Compete_%_change_close_BTCUSDT_spot_15m_custom_loss+2h\\Ensemble\\ensemble.json'
    

    The errors.md file:

    ## Error for Ensemble
    
    The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    Traceback (most recent call last):
      File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 1083, in _fit
        trained = self.ensemble_step(
      File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 401, in ensemble_step
        self.ensemble.fit(oofs, target, sample_weight)
      File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\ensemble.py", line 237, in fit
        if self.metric.improvement(previous=min_score, current=score):
    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    
    
    Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new
    
    ## Error for Ensemble_Stacked
    
    The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    Traceback (most recent call last):
      File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 1083, in _fit
        trained = self.ensemble_step(
      File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 401, in ensemble_step
        self.ensemble.fit(oofs, target, sample_weight)
      File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\ensemble.py", line 237, in fit
        if self.metric.improvement(previous=min_score, current=score):
    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    
    
    Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new
    
    
    opened by Karlheinzniebuhr 4
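The traceback in errors.md fails at `if self.metric.improvement(previous=min_score, current=score):`, which suggests (as an assumption, since the definition of `sign_penalty` is not shown) that the custom metric returned an array instead of a single number. The sketch below shows a hypothetical metric that reduces to one float. The function name `sign_penalty_scalar`, the penalty factor, and the signature `(y_true, y_predicted, sample_weight=None)` are all assumptions for illustration.

```python
import numpy as np


def sign_penalty_scalar(y_true, y_predicted, sample_weight=None):
    """Hypothetical metric: squared error, doubled when the predicted sign is wrong.

    Returns a single float (lower is better), never an array.
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_predicted = np.asarray(y_predicted, dtype=float).ravel()
    # Penalty factor of 2.0 for wrong-sign predictions is an arbitrary choice.
    penalty = np.where(np.sign(y_true) != np.sign(y_predicted), 2.0, 1.0)
    per_sample = penalty * (y_true - y_predicted) ** 2
    # Reduce the per-sample values to one number so comparisons stay unambiguous.
    return float(np.average(per_sample, weights=sample_weight))


if __name__ == "__main__":
    # Comparing arrays inside `if` reproduces the error message from errors.md.
    try:
        if np.array([0.1, 0.2]) < np.array([0.3, 0.1]):
            pass
    except ValueError as err:
        print("Array-valued score:", err)

    print("Scalar score:", sign_penalty_scalar([1.0, -1.0], [0.5, 0.5]))
```

Because the ensemble step failed during training, no ensemble.json was written. That would explain the FileNotFoundError when the results directory is loaded later.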
Releases(v0.11.5)
  • v0.11.5(Dec 30, 2022)

    Bug fixes and updates

    • #595 replace boston example dataset with California housing dataset, replace mse metric with squared_error for tree based algorithms from sklearn
    • #596 change the import method for dtreeviz package
  • v0.11.4(Dec 14, 2022)

  • v0.11.3(Aug 16, 2022)

  • v0.11.2(Mar 2, 2022)

    Enhancements

    • #523 Add type hints to AutoML class, thank you @DanielR59
    • #519 save train&validation index to file in train/test split, thanks @filipsPL @MaciekEO

    Bug fixes

    • #496 fix exception in baseline mode, thanks @DanielR59 @moshe-rl
    • #522 fixed requirements issue, thanks @DanielR59 @MaciekEO
    • #514 remove warning, thanks @MaciekEO
    • #511 disable EDA, thanks @MaciekEO
  • v0.11.0(Sep 6, 2021)

    Bug fixes

    • #463 change multiprocessing to Parallel with loky
    • #462 handle large data for tree visualization in regression
    • #419 remove/hide warnings
    • #411 loosen dependencies for numpy and scipy
  • 0.10.4(Jun 8, 2021)

    Enhancements

    • #81 add scatter plot predicted vs target in regression
    • #158 add ROC curve for binary classification
    • #336 add visualization for Optuna results
    • #352 add support for Colab
    • #374 update seaborn
    • #378 set golden features number
    • #379 switch off boost_on_errors step in Optuna mode
    • #380 add custom cross validation strategy
    • #386 add correlation heatmap
    • #387 add residual plot
    • #389 add feature importance heatmap
    • #390 add custom eval metric
    • #393 update sklearn

    Bug fixes

    • #308 fix error in kaggle kernel
    • #353, #355, #366, #368, #376, #382, #383, #384 fixes

    Docs

    • #391 add info about hyperparameters optimization methods

    A big thank you for your help to: @ecoskian, @xuzhang5788, @xiaobo, @RafaD5, @drorhilman, @strelzoff-erdc, @muxuezi, @tresoldi. THANK YOU!

  • 0.10.3(Apr 1, 2021)

    Enhancements

    • #343 set seed in Optuna
    • #344 set eval_metric directly in all algorithms
    • #350 add estimated train time in Optuna mode
    • #342 add optuna_verbose param in AutoML()
    • #354 add KNN in Optuna
    • #356 add Neural Network in Optuna
    • #357, #348 use mljar wrapper for Random Forest and Extra Trees
    • #358 add extra_tree param in LightGBM
    • #359 switch off feature engineering in Optuna mode - only highly tuned models are produced
    • #361 list all eval_metric in error message
    • #362 add accuracy eval_metric
    • #340 support for r2

    Bug fixes

    • #347 don't include Optuna tuning time in total_time_limit
    • #360 missing auc scores for training in CatBoost
  • 0.10.2(Mar 17, 2021)

  • 0.10.1(Mar 16, 2021)

    Enhancements

    • #332 We added Optuna framework for hyperparameters tuning. It can be used by setting mode="Optuna" in AutoML. You can read more details at blog post: https://mljar.com/blog/automl-optuna/
  • 0.9.1(Mar 2, 2021)

    Enhancements

    • #179 add need_retrain() method to detect performance decrease
    • #226 extract rules from decision tree
    • #310 add support for MAPE
    • #312 optimize prediction time
    • #313 set stacking time threshold depending on best model train time
    • #320 search for model with prediction time constraint
    • #322 n_jobs as a parameter
    • #328 disable stacking for small (nrows < 500) datasets

    Bug fixes

    • #214 move directory after training
    • #246 raise exception when small time limit and no models are trained
    • #247 proper display for optimize AUC and R2
    • #306 add mix_encoding argument in AutoML constructor
    • #308 fix dependencies error in kaggle notebook
    • #314 bug fix in hill climbing in Perform mode
    • #323 fix catboost bug with tree limit
    • #324 #325 bug for feature importance for small data
  • 0.8.8(Feb 3, 2021)

  • 0.8.4(Jan 29, 2021)

  • 0.8.0(Jan 22, 2021)

    Enhancements

    • #300 Add step with k-means additional features
    • #299 Add Boost On Errors step
    • #154 Sample weight available
    • #229 Sort leaderboard (disabled for now for debug purposes)

    Bug fixes

    • #301 Fix storing unique keys in mljar tuner only for trained models
    • #275 #248 small fixes
  • 0.7.19(Jan 12, 2021)

  • 0.7.18(Jan 11, 2021)

  • 0.7.17(Jan 11, 2021)

  • 0.7.16(Jan 10, 2021)

    Bug fixes

    • #283 Don't use Random Feature model

    Enhancements

    • #284 Check time for features selection
    • #286 Add R2 score
    • #288 Improve algorithms order in not_so_random step
  • 0.7.15(Dec 17, 2020)

  • 0.7.13(Dec 11, 2020)

  • 0.7.12(Dec 8, 2020)

    Enhancements

    • #223 Support for repeated validation
    • #266 Adjust validation for small datasets

    Bug fixes

    • #265 fix validation warning
    • #264 fix EDA tests
    • #261 better error message for missing golden features

    Dependencies

    • #260 update fastparquet to 0.4.1
  • 0.7.11(Dec 3, 2020)

  • 0.7.10(Dec 1, 2020)

    Enhancements

    • #250 New strategies for categorical encoding
    • #257 Control algorithm order in not-so-random step

    Bug fixes

    • #255 Fix overwrite in adjusted models
  • 0.7.9(Nov 30, 2020)

  • 0.7.8(Nov 27, 2020)

    Enhancements

    • #249 Adjust validation type based on data
    • #251 add more eval_metrics in regression
    • #252 add traceback to error reports

    Bug fixes

    • #253 Fix error when text data has missing values in test fold
  • 0.7.7(Nov 26, 2020)

    Enhancements

    • #73 Optimize AUC

    Bug fixes

    • #136 RMSE in Extra Trees and Random Forest
    • #243 Switch off Xgboost and CatBoost for multiclass with many classes (in extreme cases, also switch off Extra Trees and Random Forest)
    • #245 Fix ordering of prediction columns
  • 0.7.6(Nov 24, 2020)

    Enhancements

    • #240 Change algorithm execution order for default algorithms

    Bug fixes:

    • #236 Wrong labels for target predictions in the case of -1, 1 target
    • #238 Object of type float32 is not JSON serializable
    • #239 Value Error: Input contains NaN in numpy training array
  • 0.7.5(Nov 23, 2020)

  • 0.7.4(Nov 23, 2020)

    Enhancements

    • #184 Change Keras+TF Neural Networks to scikit-learn MLP
    • #233 Limit stacking number of classes and models
    • #232 Remove Linear model from Compete mode
    • #208 Improve importance computation for large number of columns
    • #205 Remove small learning rates for Xgboost

    Bug fixes:

    • #231 Restricted characters in feature_names in Xgboost
    • #227 Fix strings in golden_features.json - thank you @SuryaThiru!
    • #215 Assure at least 20 samples (or k_folds) for each class

    Docs update:

    • #213 Update docs in AutoML - thank you @shahules786!
  • 0.7.3(Sep 21, 2020)

    New features :sparkles:

    • #176 extended EDA - thanks to @shahules786

    Bug fixes :bug:

    • #201 error in golden features sampling
    • #199 bug for float multi-class labels
    • #196 add exception for empty data
    • #195 set threshold for accuracy metric instead f1
    • #194 ensemble should be best model if has more than 1 model
    • #193 fixed predict after model loading
    • #192 update pyarrow
    • #191 hide shap warnings
    • #190 fix in preprocessing
    • #188 fix type in feature selection - thanks to @uditswaroopa
  • 0.7.2(Sep 15, 2020)

    Bug fixes :bug:

    • #187 fix wrong order in golden features step
    • #186 fix _get_results_path
    • #185 fix models loading
    • #184 exception when drop all features during selection
    • #182 catch exceptions from model and log to errors.md
    • #181 remove forbidden characters in EDA
    • #177 change docstring to google-style
    • #175 remove tuning_mode parameter from AutoML
Owner
MLJAR
Machine Learning Made Simple
A logistic regression model for health insurance purchasing prediction

Logistic_Regression_Model A logistic regression model for health insurance purchasing prediction This code is using these packages, so please make sur

ShawnWang 1 Nov 29, 2021
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022
PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.

PyNNDescent PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbo

Leland McInnes 699 Jan 09, 2023
A library to generate synthetic time series data by easy-to-use factors and generator

timeseries-generator This repository consists of a Python package that generates synthetic time series datasets in a generic way (under /timeseries_ge

Nike Inc. 87 Dec 20, 2022
LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms

LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms Based on the work by Smith et al. (2021) Query

5 Aug 06, 2022
Real-time stream processing for python

Streamz Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelin

Python Streamz 1.1k Dec 28, 2022
Required for a machine learning pipeline data preprocessing and variable engineering script needs to be prepared

Feature-Engineering Required for a machine learning pipeline data preprocessing and variable engineering script needs to be prepared. When the dataset

kemalgunay 5 Apr 21, 2022
Python based GBDT implementation

Py-boost: a research tool for exploring GBDTs Modern gradient boosting toolkits are very complex and are written in low-level programming languages. A

Sberbank AI Lab 20 Sep 21, 2022
Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining

**Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining.** S

Sebastian Raschka 4k Dec 30, 2022
slim-python is a package to learn customized scoring systems for decision-making problems.

slim-python is a package to learn customized scoring systems for decision-making problems. These are simple decision aids that let users make yes-no p

Berk Ustun 37 Nov 02, 2022
Simplify stop motion animation with machine learning.

Simplify stop motion animation with machine learning.

Nick Bild 25 Sep 15, 2022
Crunchdao - Python API for the Crunchdao machine learning tournament

Python API for the Crunchdao machine learning tournament Interact with the Crunc

3 Jan 19, 2022
Scikit learn library models to account for data and concept drift.

liquid_scikit_learn Scikit learn library models to account for data and concept drift. This python library focuses on solving data drift and concept d

7 Nov 18, 2021
Getting Profit and Loss Make Easy From Binance

Getting Profit and Loss Make Easy From Binance I have been in Binance Automated Trading for some time and have generated a lot of transaction records,

17 Dec 21, 2022
A library of sklearn compatible categorical variable encoders

Categorical Encoding Methods A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques

2.1k Jan 07, 2023
Probabilistic time series modeling in Python

GluonTS - Probabilistic Time Series Modeling in Python GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet (

Amazon Web Services - Labs 3.3k Jan 03, 2023
Sleep stages are classified with the help of ML. We have used 4 different ML algorithms (SVM, KNN, RF, NN) to demonstrate them

Sleep stages are classified with the help of ML. We have used 4 different ML algorithms (SVM, KNN, RF, NN) to demonstrate them.

Anirudh Edpuganti 3 Apr 03, 2022
ML Optimizers from scratch using JAX

Toy implementations of some popular ML optimizers using Python/JAX

Shreyansh Singh 38 Jul 29, 2022
A naive Bayes model for cancer classification using a set of documents

Naive Bayes text classification model for cancer and noncancer documents Author: Alex King Purpose Requirements/files included How to use 1. Purpose The

Alex W King 1 Nov 24, 2021
Python-based implementations of algorithms for learning on imbalanced data.

ND DIAL: Imbalanced Algorithms Minimalist Python-based implementations of algorithms for imbalanced learning. Includes deep and representational learn

DIAL | Notre Dame 220 Dec 13, 2022