
Overview


https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/logo.png?raw=true

Sklearn-genetic-opt

Hyperparameter tuning and feature selection for scikit-learn models, using evolutionary algorithms.

This is meant to be an alternative to popular methods inside scikit-learn, such as Grid Search and Randomized Grid Search for hyperparameter tuning, and to RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that optimizes (maximizes or minimizes) the cross-validation score; it can be used for both regression and classification problems.

Documentation is available here

Main Features:

  • GASearchCV: Main class of the package for hyperparameter tuning; holds the evolutionary cross-validation optimization routine.
  • GAFeatureSelectionCV: Main class of the package for feature selection.
  • Algorithms: Set of different evolutionary algorithms to use as an optimization procedure.
  • Callbacks: Custom evaluation strategies to generate early-stopping rules, logging (into TensorBoard, .pkl files, etc.), or your own custom logic (see the short example after this list).
  • Plots: Generate pre-defined plots to understand the optimization process.
  • MLflow: Built-in integration with MLflow to log all the hyperparameters, cv-scores and the fitted models.
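
For example, a callback can stop the search early when the fitness stops improving. A minimal sketch, assuming a GASearchCV or GAFeatureSelectionCV instance named evolved_estimator has already been created as in the examples below:

from sklearn_genetic.callbacks import ConsecutiveStopping

# Stop if the 'fitness' metric has not improved over the last 5 generations
callback = ConsecutiveStopping(generations=5, metric='fitness')

# Callbacks are passed to the fit call:
# evolved_estimator.fit(X_train, y_train, callbacks=callback)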

Demos on Features:

Visualize the progress of your training:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/progress_bar.gif?raw=true

Real-time metrics visualization and comparison across runs:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/tensorboard_log.png?raw=true

Sampled distribution of hyperparameters:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/density.png?raw=true

Artifacts logging:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/mlflow_artifacts_4.png?raw=true

Usage:

Install sklearn-genetic-opt

It's advised to install sklearn-genetic-opt inside a virtual environment; within the environment, run:

pip install sklearn-genetic-opt

If you want to get all the features, including the plotting and MLflow logging capabilities, install all the extra packages:

pip install sklearn-genetic-opt[all]

The only optional dependency that the last command does not install is TensorFlow; it is usually advisable to check which distribution works best for your setup.

Example: Hyperparameter Tuning

from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

data = load_digits()
n_samples = len(data.images)
X = data.images.reshape((n_samples, -1))
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier()

param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
              'bootstrap': Categorical([True, False]),
              'max_depth': Integer(2, 30),
              'max_leaf_nodes': Integer(2, 35),
              'n_estimators': Integer(100, 300)}

cv = StratifiedKFold(n_splits=3, shuffle=True)

evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=10,
                               generations=35,
                               param_grid=param_grid,
                               n_jobs=-1,
                               verbose=True,
                               keep_top_k=4)

# Train and optimize the estimator
evolved_estimator.fit(X_train, y_train)
# Best parameters found
print(evolved_estimator.best_params_)
# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)
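
The matplotlib import above can also be put to use; a minimal sketch, assuming the plotting extras are installed (e.g. via pip install sklearn-genetic-opt[all]), that plots the fitness evolution across generations:

# Plot the cv-score (fitness) evolution over the generations
from sklearn_genetic.plots import plot_fitness_evolution

plot_fitness_evolution(evolved_estimator)
plt.show()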

Example: Feature Selection

import matplotlib.pyplot as plt
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np

data = load_iris()
X, y = data["data"], data["target"]

# Add random non-important features
noise = np.random.uniform(0, 10, size=(X.shape[0], 5))
X = np.hstack((X, noise))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = SVC(gamma='auto')

evolved_estimator = GAFeatureSelectionCV(
    estimator=clf,
    scoring="accuracy",
    population_size=30,
    generations=20,
    n_jobs=-1)

# Train and select the features
evolved_estimator.fit(X_train, y_train)

# Features selected by the algorithm
features = evolved_estimator.best_features_
print(features)

# Predict only with the subset of selected features
y_predict_ga = evolved_estimator.predict(X_test[:, features])
print(accuracy_score(y_test, y_predict_ga))
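
GAFeatureSelectionCV works with the same plotting and callback utilities, so the matplotlib import above can be used in the same way; again a minimal sketch that assumes the plotting extras are installed:

# Plot the fitness evolution of the feature-selection run
from sklearn_genetic.plots import plot_fitness_evolution

plot_fitness_evolution(evolved_estimator)
plt.show()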

Changelog

See the changelog for notes on the changes to Sklearn-genetic-opt.

Important links

Source code

You can check the latest development version with the command:

git clone https://github.com/rodrigo-arenas/Sklearn-genetic-opt.git

Install the development dependencies:

pip install -r dev-requirements.txt

Check the latest in-development documentation: https://sklearn-genetic-opt.readthedocs.io/en/latest/

Contributing

Contributions are more than welcome! There are several open opportunities in the ongoing project, so please get in touch if you would like to help out. Make sure to check the current issues and also the Contribution guide.

Big thanks to the people who are helping with this project!

Contributors

Testing

After installation, you can launch the test suite from outside the source directory:

pytest sklearn_genetic
Comments
  • [Feature] Parallel Coordinates plot

    [Feature] Parallel Coordinates plot

    Is your feature request related to a problem? Please describe. NA

    Describe the solution you'd like Implement in the sklearn_genetic.plots module a function named plot_parallel_coordinates to inspect the results of the learning process

    Describe alternatives you've considered The function should take two arguments:

    • estimator: A fitted estimator from sklearn_genetic.GASearchCV
    • features: list, default=None. Subset of features to plot; if None, it plots all the features by default

    The function should return an object to plot parallel coordinates according to the pandas.plotting.parallel_coordinates function

    The data to plot is available in the estimator.logbook object; look at the implementation of the plot_search_space function to see how to convert this data into a pandas DataFrame

    The function must select only the non-categorical variables; this can be done by inspecting the estimator.space object and comparing against the data types defined in sklearn_genetic.space, i.e. Categorical, Continuous and Integer, and color the lines against the "score" column. In the same way, it must validate and issue a warning if a Categorical variable is passed in the features parameter (a rough sketch of such a helper follows the reference links below)

    Additional context Links of some implementations:

    • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.parallel_coordinates.html
    • https://www.mathworks.com/help/stats/feature-selection-and-feature-transformation.html#buwh6hc-1
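
    A rough, hypothetical sketch of such a helper (not part of the package): it assumes the sampled hyperparameters and the "score" column have already been extracted into a pandas DataFrame, e.g. from estimator.logbook as plot_search_space does.

    import pandas as pd
    from pandas.plotting import parallel_coordinates

    def plot_parallel_coordinates(df, features=None, score_col="score", bins=4):
        # Keep only the numeric (non-categorical) columns, as described above
        numeric = df.select_dtypes(include="number").copy()
        if features is not None:
            keep = [f for f in features if f in numeric.columns]
            numeric = numeric[keep + [score_col]]
        # parallel_coordinates colors lines by a class column, so bin the score
        numeric["score_bin"] = pd.qcut(numeric[score_col], q=bins, duplicates="drop").astype(str)
        return parallel_coordinates(numeric.drop(columns=[score_col]), "score_bin", colormap="viridis")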
    help wanted good first issue new feature up-for-grabs 
    opened by rodrigo-arenas 14
  • [FEATURE] Report multiple scoring metrics

    [FEATURE] Report multiple scoring metrics

    Hello,

    I have been looking into your package and it is really cool. Thank you for putting a lot of effort in developing such an amazing tool.

    Is your feature request related to a problem? Please describe. GASearchCV, unlike GridSearchCV, only accepts one scoring metric. Obviously, the algorithm can only use one metric to decide which models will carry over to the next generation. However, I think it would be useful to view different scoring metrics for the best models (e.g. R2, MAE, RMSE), which intrinsically may provide a slightly different idea of model performance to the user. Of course we would still be able to decide which metric should be used to select the best models within each generation.

    Describe the solution you'd expect I think the implementation of multiple scoring metrics in GASearchCV could be similar to the one implemented in GridSearchCV regarding this specific matter. I show below some examples of this implementation in GridSearchCV:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV
    
    #generate input
    X = np.random.normal(75, 10, (1000, 2))
    y = np.random.normal(200, 20, 1000)
    params = {"degree": [2, 3], "C": [10, 20, 50]}
    
    #calculate both R2 and MAE for each tested model, but model refit is performed based on the combination of hyperparameters with the best R2
    grid = GridSearchCV(SVR(), param_grid=params, scoring=["neg_mean_absolute_error",  "r2"], refit="r2")
    
    #another way of doing the above, but this time using aliases for the scorers
    grid = GridSearchCV(SVR(), param_grid=params, scoring={"MAE": "neg_mean_absolute_error", "R2": "r2"}, refit="R2")
    
    #perform grid search
    grid.fit(X, y)
    

    If you call grid.cv_results_ in this example, you will see the output dict will have a mean_test_MAE and mean_test_R2 keys (in the case of the second example).
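
    Since this issue was opened, the 0.8.0 release notes below state that GASearchCV supports multi-metric evaluation the same way scikit-learn does. A hedged sketch of how that might look, mirroring the GridSearchCV example above (the dict form of scoring and the exact accepted values are assumptions based on those notes):

    from sklearn.svm import SVR
    from sklearn_genetic import GASearchCV
    from sklearn_genetic.space import Continuous, Integer

    ga_params = {"C": Continuous(10, 50), "degree": Integer(2, 3)}
    ga_search = GASearchCV(estimator=SVR(kernel="poly"),
                           cv=3,
                           param_grid=ga_params,
                           scoring={"MAE": "neg_mean_absolute_error", "R2": "r2"},
                           refit="R2",
                           population_size=10,
                           generations=5)
    # ga_search.fit(X, y)  # X, y as generated above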

    enhancement help wanted new feature up-for-grabs 
    opened by poroc300 11
  • MLPClassifier - ValueError: shuffle must be either True or False, got True.

    MLPClassifier - ValueError: shuffle must be either True or False, got True.

    System information Windows 10 Sklearn-genetic-opt version: 0.6.1 Scikit-learn version: 0.24.2 Python version: Python 3.7

    Describe the bug When using the GASearchCV class with MLPClassifier as the estimator, I get the error in the title. In my param_grid, I simply have it set to Categorical([True, False]), but it doesn't seem to play well. Wondering what could be causing it?

    To Reproduce Could recreate it by creating a binary classification dataset from sklearn, then implementing this:

        curr_params = {"shuffle": Categorical([True, False])}
    
        evolved_estimator = GASearchCV(estimator=MLPClassifier(),
                                       cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
                                       scoring='balanced_accuracy',
                                       population_size=30,
                                       generations=30,
                                       tournament_size=3,
                                       elitism=True,
                                       crossover_probability=0.8,
                                       mutation_probability=0.1,
                                       param_grid=curr_params,
                                       criteria='max',
                                       algorithm='eaMuPlusLambda',
                                       n_jobs=1,
                                       verbose=True,
                                       keep_top_k=1)
    

    Expected behavior Seems to only be an issue with MLPClassifier so far, but should set the parameter shuffle to True or False.

    bug 
    opened by windowshopr 9
  • Wrong output for GAFeatureSelectionCV only when using max_features for RandomForestClassifier

    Wrong output for GAFeatureSelectionCV only when using max_features for RandomForestClassifier

    System information OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10.0.19044 Sklearn-genetic-opt version: 0.8.0 Scikit-learn version: 1.0.1 Python version: 3.8.11

    Describe the bug A clear and concise description of what the bug is.

    Firstly, great work!!

    When using :

    clf_RF    = RandomForestClassifier(random_state=0)
    
    evolved_estimator = GAFeatureSelectionCV(
        estimator   = clf_RF,
        cv          = 5,
        population_size=20, 
        generations =40,
        crossover_probability=0.8,
        mutation_probability = 0.075,
        n_jobs      = -1,
        scoring     = "accuracy",
        max_features = 300
        )
    
    # Train and select the features
    evolved_estimator.fit(X, y)
    

    the output looks like this:

    gen	nevals	fitness	fitness_std	fitness_max	fitness_min
    0  	20    	-10000 	0          	-10000     	-10000     
    1  	32    	-10000 	0          	-10000     	-10000     
    2  	33    	-10000 	0          	-10000     	-10000     
    3  	37    	-10000 	0          	-10000     	-10000     
    4  	36    	-10000 	0          	-10000     	-10000     
    5  	37    	-10000 	0          	-10000     	-10000     
    6  	36    	-10000 	0          	-10000     	-10000     
    7  	36    	-10000 	0          	-10000     	-10000     
    8  	36    	-10000 	0          	-10000     	-10000     
    9  	33    	-10000 	0          	-10000     	-10000     
    10 	33    	-10000 	0          	-10000     	-10000     
    11 	34    	-10000 	0          	-10000     	-10000     
    12 	34    	-10000 	0          	-10000     	-10000     
    13 	36    	-10000 	0          	-10000     	-10000     
    14 	33    	-10000 	0          	-10000     	-10000     
    15 	34    	-10000 	0          	-10000     	-10000     
    16 	33    	-10000 	0          	-10000     	-10000     
    17 	37    	-10000 	0          	-10000     	-10000     
    18 	35    	-10000 	0          	-10000     	-10000     
    19 	37    	-10000 	0          	-10000     	-10000     
    20 	34    	-10000 	0          	-10000     	-10000     
    21 	35    	-10000 	0          	-10000     	-10000     
    22 	35    	-10000 	0          	-10000     	-10000 
    

    This doesn't happen when removing the max_features parameter.

    bug 
    opened by cewinharhar 5
  •  change [FEATURE]

    change [FEATURE]

    Main Features: GASearchCV: Principal class of the package, holds the evolutionary cross-validation optimization routine. Algorithms: Set of different evolutionary algorithms to use as an optimization procedure. Callbacks: Custom evaluation strategies to generate early stopping rules, logging (into TensorBoard, .pkl files, etc) or your custom logic. Plots: Generate pre-defined plots to understand the optimization process. MLflow: Build-in integration with mlflow to log all the hyperparameters, cv-scores and the fitted models

    new feature 
    opened by Fancy-angel 5
  • [FEATURE] Add in CTRL + C Early Stopping!

    [FEATURE] Add in CTRL + C Early Stopping!

    Is your feature request related to a problem? Please describe. Nope.

    • A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] None.

    • What is the use case for this feature? When a user wants to stop the optimization process manually (without using a callback), they could press CTRL + C to stop. The best evolved_estimator at the time of pressing CTRL + C will be returned and optimization will stop, allowing the rest of the script to continue.

    Describe the solution you'd expect See above

    • A clear and concise description of what you want to happen. TPOT is a good reference for this. The user presses CTRL + C after at least 1 pipeline has been fitted, and the best pipeline found until that point is used. The rest of the script can continue after that, like the evolved_estimator.predict() function.

    • Describe the workflow you want to enable See above.

    Additional context Love the tool! Would be cool to see this implemented :D

    help wanted new feature up-for-grabs 
    opened by windowshopr 5
  • Mlflow test

    Mlflow test

    I used the default mlruns file store and I created a test to ensure the mlruns folder is removed. I tried to cover all the tests you requested. If there's anything I should change, just give me a shout.

    I'm trying to get a docker container running with the mlflow server and a backend. I do think it's going to take a bit of time to figure out though because I'm not too clued up with docker. So hopefully these tests are ok for now.

    opened by Turtle24 4
  • Can a vector of weights be specified in `param_grid` within GASearchCV (somehow)?

    Can a vector of weights be specified in `param_grid` within GASearchCV (somehow)?

    The idea is to take in predictions from an arbitrary number of models, and find optimal weights that maximize the accuracy of the ensembled model.

    Here's the estimator that I wrote:

    from typing import List, Optional
    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.utils import check_X_y, check_array
    from sklearn.utils.estimator_checks import check_estimator
    from sklearn.utils.validation import check_is_fitted
    from sklearn.metrics import mean_absolute_error
    
    
    class WeightedAverageEnsemble(BaseEstimator, RegressorMixin):
        """
        
        >>> wae = WeightedAverageEnsemble()
        >>> X = np.random.rand(20, 5)
        >>> y = np.random.rand(20, 1)
        >>> wae.fit(X, y)
        >>> wae.predict(X)
        
        >>> wae = WeightedAverageEnsemble(weights=[0.25, 0.75])
        >>> X = np.random.rand(20, 2)
        >>> y = np.random.rand(20, 1)
        >>> wae.fit(X, y)
        >>> wae.predict(X)
    
        Parameters
        ----------
        BaseEstimator : _type_
            _description_
        RegressorMixin : _type_
            _description_
        """
    
        def __init__(self, weights: Optional[List[float]] = None):
            if weights is not None:
                assert np.isclose(sum(weights), 1.0)
            self.weights = weights
    
        def fit(self, X, y):
            # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
            X, y = check_X_y(X, y, accept_sparse=False)
            self.is_fitted_ = True
            self.n_features_in_ = X.shape[1]
            if self.weights is None:
                self._mod_weights = np.ones(self.n_features_in_) / self.n_features_in_
                # equivalent to:
                # w = np.ones(self.n_features_in_).reshape(1, -1)
                # w = sklearn.preprocessing.normalize(w, norm="l1", axis=1)
            else:
                self._mod_weights = self.weights
            return self
    
        def predict(self, X):
            # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
            X = check_array(X, accept_sparse=False)
            check_is_fitted(self, "is_fitted_")
            W = np.tile(self._mod_weights, (X.shape[0], 1))
            y = np.einsum("ij, ij->i", W, X)
            # should be equivalent to: y = np.sum(W * X)
            # loop with np.dot might also be fast due to BLAS compatibility
            # https://stackoverflow.com/a/26168677/13697228
            # https://stackoverflow.com/a/39657770/13697228
            return y
    
        def score(self, X, y, **kwargs):
            y_pred = self.predict(X)
            return mean_absolute_error(y, y_pred, **kwargs)
    
    
    check_estimator(WeightedAverageEnsemble())
    

    Related: https://machinelearningmastery.com/weighted-average-ensemble-with-python

    How would you suggest optimizing weights since it's a vector that can change in size based on the size of the input data?

    question 
    opened by sgbaird 3
  • [FEATURE] Add threshold parameter to ConsecutiveStopping

    [FEATURE] Add threshold parameter to ConsecutiveStopping

    Is your feature request related to a problem? Please describe. I ran GASearchCV with a callback that stopped the optimization if the fitness was no greater than at least one value of fitness from the last 5 generations.

    callback = ConsecutiveStopping(generations=5, metric='fitness')
    

    Checking the log information while the algorithm was running, I have noticed that the reported fitness (-12.7893) was the same for more than 5 consecutive generations (please see the attached image). Under these circumstances, I would have expected the algorithm to have stopped much earlier (in generation 8).

    consecutive_stopping

    I assume the algorithm did not stop because the logbook only shows 4 decimal places. However, given that fitness improved very little after generation 8, I think in some situations the user could have the option to provide a threshold value to ConsecutiveStopping, which would make the algorithm stop after N consecutive generations if the improvement in fitness (or any other metric) was no greater than a specific threshold (e.g. 0.0001). This could make the algorithm finish much faster on some occasions.

    Describe the solution you'd expect I have made a custom callback (which hopefully is correct) to achieve what I want (the documentation was quite helpful). Please feel free to make any comments regarding my code:

    from sklearn_genetic.callbacks.base import BaseCallback
    
    class ConsecutiveStoppingThreshold(BaseCallback):
        def __init__(self, threshold, N, metric='fitness'):
            self.threshold = threshold
            self.N = N
            self.metric = metric
            
        def on_step(self, record, logbook, estimator=None):
            #not enough data points
            if len(logbook) <= self.N:
                return False
            
            #get the last N metrics
            stats = logbook.select(self.metric)[-self.N :]
            
            #find the difference between max and min fitness in the last metrics
            diff = max(stats) - min(stats)
            
            if self.threshold > diff:
                return True
            return False
    

    I have tested this code and it appears to work fine. From my perspective, this type of callback is very useful and, therefore, I think it should be more easily accessible to users. In my opinion, you could do one of the following:

    1. Show an explicit example, in the section "Custom callbacks" in the package's homepage, where you demonstrate how to achieve the above.
    2. Or have a threshold argument in ConsecutiveStopping where the user can provide a float to determine how much improvement is allowed after N consecutive generations.
    new feature 
    opened by poroc300 3
  • understand_cv documentation spelling updates

    understand_cv documentation spelling updates

    I updated the understand_cv document's grammar and spelling a bit. This is for #43, and I'll slowly go over all the documentation because I want to better understand the package and eventually work on the inner workings.

    opened by Turtle24 3
  • [FEATURE] GAFeaturesSelectionCV

    [FEATURE] GAFeaturesSelectionCV

    Is your feature request related to a problem? Please describe. This feature will extend the package's functionality to include feature selection using evolutionary algorithms. Currently, only hyperparameter tuning is supported.

    Describe the solution you'd expect

    Implement the class GAFeaturesSelectionCV inside sklearn_genetic.genetic_search with the following functionalities:

    • This class should take the same parameters as GASearchCV except for param_grid; the estimator should have its own defined parameters.
    • Perform cross-validation over different sets of features that are selected using evolutionary algorithms. The same sklearn_genetic.algorithms options must be available as the optimization routine.
    • The class should be able to work with the existing features of the package, such as Callbacks and the fitness evolution plot.
    • All the documentation must be updated, indicating which functionality of the package is compatible only with GASearchCV (e.g. most likely plot_search_space won't be compatible with feature selection).
    • It must accept a GASearchCV instance as the estimator.
    • There must be an attribute called best_features_ that has the final selected features by the model.

    Additional context The evolutionary algorithm can be defined by assigning a gene to each parameter: if the gene is 1, the parameter is selected; 0 otherwise (a small illustration follows below).

    Note: I'll be working on this feature, but as always, new ideas and contributions are welcome
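
    A tiny illustration of that encoding (not the package's internal code): each candidate solution is a binary vector with one gene per feature, and only the features whose gene is 1 are used when computing the cross-validation score.

    import numpy as np

    X = np.random.rand(10, 6)           # 10 samples, 6 candidate features
    individual = [1, 0, 1, 1, 0, 0]     # example gene vector produced by the GA
    X_subset = X[:, np.asarray(individual, dtype=bool)]
    print(X_subset.shape)               # (10, 3)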

    new feature 
    opened by rodrigo-arenas 3
  • [FEATURE] Feature selection and optimization simultaneously

    [FEATURE] Feature selection and optimization simultaneously

    It seems to me that the best approach to optimizing an estimator would be to run both feature selection AND hyperparameter optimization simultaneously within the same evolution process. It would be complex, but it would probably yield better results than running one after the other.

    Is this something I can do within the current framework, or does this require new code?

    Also, do you think it is even a good idea in the first place?

    new feature 
    opened by doodledood 0
  • [FEATURE]

    [FEATURE]

    new feature 
    opened by rawanm900 2
  • Contributing to this project

    Contributing to this project

    Hello, as part of our studies, some friends and I have to contribute to an open-source project related to optimization. Your project seems particularly interesting.

    Do you need help with a particular feature? Which subject would you advise us to work on?

    Have a nice day, Pierre C.

    question 
    opened by pierrechagn 1
  • [FEATURE] Conda package

    [FEATURE] Conda package

    Is your feature request related to a problem? Please describe. May I ask if there are plans to release a conda package in the near future?

    I want to use this package within a project whose virtual environment is created with conda and all installed packages are also from conda/conda-forge. I have pip installed in the environment and tried to install sklearn-genetic-opt via pip as stated in the docs (pip install sklearn-genetic-opt). pip identified the dependencies and installed them (deap, numpy, etc.). The problem though is that it doesn't integrate well with the environment. For instance, I have pandas 1.5.0 installed in the conda environment, but when I open a Python session and run import sklearn_genetic, the interpreter returns me an error claiming that pandas is not installed.

    Describe the solution you'd expect The package would be easier to use if it were possible to install it within conda.

    Additional context Everything I reported refers to a Windows 10 21H2 machine.

    new feature 
    opened by abianco88 2
  • GAFeatureSelectionCV - <classifier> object has no attribute 'transform'

    GAFeatureSelectionCV - object has no attribute 'transform'

    System information OS Platform and Distribution: Windows 11 Home Sklearn-genetic-opt version: 0.9.0 deap version: 1.3.3 Scikit-learn version: 1.1.2 Python version: 3.8.13

    Describe the bug I have fitted an instance of GAFeatureSelectionCV using LGBMClassifier

    clf_dim = LGBMClassifier()
    gen_opt = GAFeatureSelectionCV(
                                   clf_dim, cv=5, scoring='avg_prec', refit=True, 
                                   generations=20, population_size=50, tournament_size=3,
                                   mutation_probability=0.8, crossover_probability=0.2, elitism=True, keep_top_k=1,
                                   n_jobs=1, verbose=True, 
                                  )
    

    and got the expected results in the various output attributes such as .best_estimator_ and n_features_in_

    However, unlike the example provided in the documentation, I am not attempting to use the selected features and the estimator directly to predict results on test data.

    Instead, I am trying to follow the traditional scikit-learn approach of incorporating this estimator to select features as step 'dim' in the following pipeline, before passing them on to another classifier at the end of the pipeline.

    This requires that the 'transformer' based on GAFeatureSelectionCV supports a transform() method, which it does. However, when I try to use the transform method of the fitted estimator standalone, as in:

    gen_opt.transform(X_t)
    

    I get an error suggesting that

    'LGBMClassifier' object has no attribute 'transform'

    I went on to define a pipeline with the estimator as below:

    pipe_dim_full = Pipeline(
        steps=[
            ('enc', encode), 
            ('dim', gen_opt), 
            ('clf', clf), 
        ], 
    )
    

    and upon trying to fit it, I get a somewhat contradictory error:

    TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True), estimator=LGBMClassifier(n_jobs=1, random_state=0, verbose=-1), generations=20, n_jobs=18, return_train_score=True, scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1))' (type <class 'sklearn_genetic.genetic_search.GAFeatureSelectionCV'>) doesn't

    As it stands, GAFeatureSelectionCV can't be used in a pipeline without the transform() method being fixed, which is unfortunate as I really like it and was looking forward to using GA across my pipeline.

    To Reproduce Steps to reproduce the behavior: As described above. Please reach out if you need more detail.

    Expected behavior The transform method should produce a matrix with n_features_in_ columns of the input matrix

    Additional context There is another module based on deap that successfully offers feature selection by genetic algorithm. Here is a link for reference https://sklearn-genetic.readthedocs.io/en/latest/api.html

    bug 
    opened by RNarayan73 1
  • [FEATURE] Support for XGBoost early stopping

    [FEATURE] Support for XGBoost early stopping

    Thanks for such a cool package.

    I'm using GASearchCV to hypertune an XGBoost model. However, it fails if I use early stopping in fit(). Can early stopping (and the additional XGBoost fitting params) be used with GASearchCV().fit()?

    Thanks, Hayden

    new feature 
    opened by hrampadarath 1
Releases(0.9.0)
  • 0.9.0(Jun 6, 2022)

    This release comes with new features and general performance improvements

    Features:

    • Introducing Adaptive Schedulers to enable adaptive mutation and crossover probabilities; the currently supported schedulers are ConstantAdapter, ExponentialAdapter, InverseAdapter, and PotentialAdapter (see the sketch after this list)

    • Added a random_state parameter (default=None) to the Continuous, Categorical and Integer classes from space, to fix the random seed during hyperparameter sampling.
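
    A hedged usage sketch of the schedulers (the module path and parameter names below are assumptions based on these notes):

    # Sketch only: decay the mutation probability from 0.9 towards 0.1 while
    # the crossover probability grows from 0.2 towards 0.9.
    from sklearn_genetic.schedules import ExponentialAdapter  # module path assumed

    mutation_adapter = ExponentialAdapter(initial_value=0.9, end_value=0.1, adaptive_rate=0.1)
    crossover_adapter = ExponentialAdapter(initial_value=0.2, end_value=0.9, adaptive_rate=0.1)
    # The adapters would then be passed as mutation_probability and
    # crossover_probability to GASearchCV / GAFeatureSelectionCV.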

    API Changes:

    • Changed the default values of mutation_probability and crossover_probability to 0.8 and 0.2, respectively.

    • The weighted_choice function used in GAFeatureSelectionCV was re-written to give more probability to a number of features closer to the max_features parameter

    • Removed unused and broken function plot_parallel_coordinates()

    Bug Fixes

    • Now, when using the plot_search_space() function, all the parameters get cast as np.float64 to avoid errors on the seaborn package while plotting bool values.
  • 0.8.1(Mar 9, 2022)

    This release implements a change: when the max_features parameter of the GAFeatureSelectionCV class is set, the initial population is now sampled giving more probability to solutions with fewer than max_features features.

  • 0.8.0(Jan 5, 2022)

    This release comes with some requested features and enhancements.

    Features:

    • Class GAFeatureSelectionCV now has a parameter called max_features, int, default=None. If it's not None, it will penalize individuals with more features than max_features, putting a "soft" upper bound to the number of features to be selected.

    • Classes GASearchCV and GAFeatureSelectionCV now support multi-metric evaluation the same way scikit-learn does; you will see this reflected on the logbook and cv_results_ objects, where now you get results for each metric. As in scikit-learn, if multi-metric is used, the refit parameter must be a str specifying the metric to evaluate the cv-scores.

    • Training gracefully stops if interrupted by some of these exceptions: KeyboardInterrupt, SystemExit, StopIteration. When one of these exceptions is raised, the model finishes the current generation and saves the current best model. It only works if at least one generation has been completed.

    API Changes:

    • The following parameters changed their default values to create more extensive and different models with better results:

      • population_size from 10 to 50

      • generations from 40 to 80

      • mutation_probability from 0.1 to 0.2

    Docs:

    • A new notebook called Iris_multimetric was added to showcase the new multi-metric capabilities.
  • 0.7.0(Nov 17, 2021)

    This is an exciting release! It introduces feature selection capabilities to the package.

    Features:

    • GAFeatureSelectionCV class for feature selection along with any scikit-learn classifier or regressor. It optimizes the cv-score while minimizing the number of features to select. This class is compatible with the mlflow and tensorboard integration, the Callbacks, and the plot_fitness_evolution function.

    API Changes:

    The module mlflow was renamed to mlflow_log to avoid unexpected errors in name resolution.

  • 0.6.1(Aug 4, 2021)

    This is a minor release that fixes a couple of bugs and adds some minor options.

    Features:

    • Added the parameter generations to DeltaThreshold. Now it compares the maximum and minimum values of a metric from the last generations, instead of just the current and previous ones. The default value is 2, so the behavior remains the same as in previous versions.

    Bug Fixes:

    • When a param_grid of length 1 is provided, a user warning is raised instead of an error. Internally, the crossover operation is swapped to use DEAP's tools.cxSimulatedBinaryBounded.
    • When using the Continuous class with boundaries lower and upper, a uniform distribution with limits [lower, lower + upper] was sampled; now it is properly sampled within the [lower, upper] limits.
  • 0.6.0(Jul 5, 2021)

    This is a big release with several new features and enhancements! 🎊

    Features:

    • Added the ProgressBar callback; it uses a tqdm progress bar to show how many generations are left in the training process.

    • Added the TensorBoard callback to log the generation metrics, watch them in real time while the models are trained, and compare different runs in your TensorBoard instance.

    • Added the TimerStopping callback to stop the iterations after a total (threshold) fitting time has elapsed.

    • Added a new parallel coordinates plot using plot_parallel_coordinates by @Raul9595

    • Now, if one or more callbacks decide to stop the algorithm, it will print their class names so you know which callbacks were responsible for the stopping.

    • Added support for extra methods coming from scikit-learn's BaseSearchCV, like cv_results_, best_index_ and refit_time_ among others.

    • Added methods on_start and on_end to BaseCallback. Now the algorithms check for the callbacks like this:

      • on_start: When the evolutionary algorithm is called from the GASearchCV.fit method.

      • on_step: When the evolutionary algorithm finishes a generation (no change here).

      • on_end: At the end of the last generation.

    Bug Fixes:

    • A missing statement caused the callbacks to be evaluated only from generation 1, ignoring generation 0. This is now properly handled and callbacks work from generation 0.

    API Changes:

    • The modules sklearn_genetic.plots and sklearn_genetic.mlflow.MLflowConfig now require an explicit installation of seaborn and mlflow; those are optionally installed using pip install sklearn-genetic-opt[all].
    • The GASearchCV.logbook property now has extra information that comes from the scikit-learn cross_validate function.
    • An optional extra parameter was added to GASearchCV, named return_train_score: bool, default=False. As in scikit-learn, it controls if the cv_results_ should have the training scores.

    Docs:

    • Edited all demos to be in the jupyter notebook format.
    • Added embedded Jupyter notebook examples to the Read the Docs page.
    • The modules of the package now have a summary of their classes/functions in the docs.
    • Updated the callbacks and custom callbacks tutorials to add a new TensorBoard callback and the new methods on the base callback.

    Internal:

    • Now the HallOfFame (hof) uses self.best_params_ for position 0, to be consistent with the scikit-learn API and parameters like self.best_index_
    • MLflow now has unit tests by @Turtle24

    Thanks to new contributors for helping in this project! @Raul9595 @Turtle24

  • 0.5.0(Jun 22, 2021)

    Features:

    • Built-in integration with MLflow using the class sklearn_genetic.mlflow.MLflowConfig and the new parameter log_config of the class sklearn_genetic.GASearchCV

    • Implemented the callback sklearn_genetic.callbacks.LogbookSaver which saves the estimator.logbook object with all the fitted hyperparameters and their cross-validation score

    • Added the parameter estimator to all the functions on the module sklearn_genetic.callbacks

    Docs:

    • Added user guide "Integrating with MLflow"
    • Update the tutorial "Custom Callbacks" for new API inheritance behavior

    Internal:

    • Added a base class sklearn_genetic.callbacks.base.BaseCallback from which all callbacks must inherit
    • Now coverage report doesn't take into account the lines with # pragma: no cover and # noqa
  • 0.4.1(Jun 2, 2021)

    Docs:

    • Added user guide on "Understanding the evaluation process"
    • Several guides on contributing, code of conduct
    • Added important links
    • Docs requirements are now independent of package requirements

    Internal:

    • Changed the test CI from Travis to GitHub Actions
  • 0.4.0(May 31, 2021)

    Features:

    • Implemented the callbacks module to stop the optimization process based on the current iteration metrics; currently implemented: sklearn_genetic.callbacks.ThresholdStopping, sklearn_genetic.callbacks.ConsecutiveStopping and sklearn_genetic.callbacks.DeltaThreshold.
    • The algorithms 'eaSimple', 'eaMuPlusLambda', 'eaMuCommaLambda' are now implemented in the module sklearn_genetic.algorithms for more control over their options, rather than taking them from the deap.algorithms module.
    • Implemented the sklearn_genetic.plots module and added the function sklearn_genetic.plots.plot_search_space; this function plots a mix of contour, scatter and histogram plots over all the fitted hyperparameters and their cross-validation score.
    • Documentation based on reStructuredText with Sphinx, hosted on Read the Docs. It includes documentation of the public classes and functions, as well as several tutorials on how to use the package; link: https://sklearn-genetic-opt.readthedocs.io/
    • Added best_params_ and best_estimator_ properties after fitting GASearchCV.
    • Added optional parameters refit, pre_dispatch and error_score.

    API Changes:

    • Removed support for Python 3.6 and changed the supported library versions to match scikit-learn's current version.
    • Several internal changes on the documentation and variables naming style to be compatible with Sphinx.
    • Removed the parameters continuous_parameters, categorical_parameters and integer_parameters in GASearchCV, replacing them with param_grid.
  • 0.3.0(May 28, 2021)

    Features:

    • Added the space module to better control the data types and ranges of each hyperparameter and the distribution to sample random values from, and to merge all data types into one Space class that can work with the new param_grid parameter
    • Changed continuous_parameters, categorical_parameters and integer_parameters to param_grid; the former still work but will be removed in a future version
    • Added the option to use the eaMuCommaLambda algorithm from deap
    • The mu and lambda_ parameters of the internal eaMuPlusLambda and eaMuCommaLambda are now expressed in terms of the initial population size and not the number of generations
  • 0.2.1(May 27, 2021)

    Features:

    • Enabled deap's eaMuPlusLambda algorithm for the optimization process; it is now the default routine
    • Added the parameter keep_top_k to control the number of solutions in the hall of fame (hof)
    • Changed the default parameters crossover_probability from 1 to 0.8 and generations from 50 to 40

    Internal

    • Changed parameters with pre-defined options to use pydantic models

    Fixes

    • Fixed the logging of the scoring metric in the logbook; it is now part of the parameters and is shown only once
  • 0.2.0(May 25, 2021)

    Features:

    • Added logbook and history properties to the fitted GASearchCV to enable post-fit analysis
    • Elitism = False now implements roulette selection instead of ignoring the parameter

    API Changes:

    • Refactored the optimization algorithm to use the deap package instead of a custom implementation; this causes the removal of several methods, properties and variables inside the GASearchCV class
    • The parameter encoding_length has been removed; it is no longer required by the GASearchCV class
    • Renamed the property of the fitted estimator from best_params_ to best_params
    • The verbosity now prints the deap log of the fitness function, its standard deviation, and the max and min values from each generation
    • The variable GASearchCV._best_solutions was removed; it is replaced by GASearchCV.logbook and GASearchCV.history
  • 0.1.1(Apr 28, 2021)

    Bug Fixes:

    • Fixed unexpected overwrites in if statements
    • Corrected validation when the parameters dicts are empty

    Enhancements:

    • Added a criteria parameter to control whether it's a minimization or maximization problem with respect to the scoring metric
    • Plot of the fitness function over generations
    • Unit tests for the whole package
    • Examples with regression problems
    • Implementation of some magic methods
    • Documentation of the GASearchCV class parameters
  • 0.1.0(Apr 27, 2021)
