Hyperparameter tuning and feature selection for scikit-learn models, using evolutionary algorithms.

Overview


https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/logo.png?raw=true

Sklearn-genetic-opt


Sklearn-genetic-opt is meant to be an alternative to popular scikit-learn methods such as Grid Search and Randomized Search for hyperparameter tuning, and RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that optimizes (maximizes or minimizes) the cross-validation scores. It can be used for both regression and classification problems.
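
To make the idea concrete, here is a minimal, library-free sketch of an evolutionary search over a hyperparameter space. It is not the package's implementation and does not use DEAP: the search space, the `toy_cv_score` stand-in for a cross-validation score, and the selection, crossover and mutation rules are all assumptions chosen for illustration.

```python
import random

# Hypothetical search space: inclusive integer bounds per hyperparameter.
SPACE = {"max_depth": (2, 30), "n_estimators": (100, 300)}


def toy_cv_score(params):
    """Stand-in for a cross-validation score (higher is better)."""
    return -abs(params["max_depth"] - 12) - abs(params["n_estimators"] - 200) / 10


def sample(rng):
    """Draw one random set of hyperparameters from the space."""
    return {name: rng.randint(low, high) for name, (low, high) in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: take each hyperparameter from one parent at random."""
    return {name: rng.choice((a[name], b[name])) for name in SPACE}


def mutate(params, rng, rate=0.3):
    """Resample each hyperparameter with probability `rate`."""
    child = dict(params)
    for name, (low, high) in SPACE.items():
        if rng.random() < rate:
            child[name] = rng.randint(low, high)
    return child


def evolve(population_size=10, generations=35, seed=0):
    rng = random.Random(seed)
    population = [sample(rng) for _ in range(population_size)]
    history = [max(toy_cv_score(p) for p in population)]
    for _ in range(generations):
        # Keep the better half as parents, then refill with mutated offspring.
        ranked = sorted(population, key=toy_cv_score, reverse=True)
        parents = ranked[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
        history.append(max(toy_cv_score(p) for p in population))
    best = max(population, key=toy_cv_score)
    return best, history


best_params, history = evolve()
print(best_params, history[-1])
```

Because the best individuals are carried over unchanged, the best score per generation never decreases in this sketch; in a real search the score would come from cross-validating the estimator.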

Documentation is available here

Main Features:

  • GASearchCV: Main class of the package for hyperparameter tuning; holds the evolutionary cross-validation optimization routine.
  • GAFeatureSelectionCV: Main class of the package for feature selection.
  • Algorithms: Set of different evolutionary algorithms to use as an optimization procedure.
  • Callbacks: Custom evaluation strategies to generate early stopping rules, logging (into TensorBoard, .pkl files, etc.) or your custom logic.
  • Plots: Generate pre-defined plots to understand the optimization process.
  • MLflow: Built-in integration with mlflow to log all the hyperparameters, cv-scores and the fitted models.

Demos on Features:

Visualize the progress of your training:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/progress_bar.gif?raw=true

Real-time metrics visualization and comparison across runs:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/tensorboard_log.png?raw=true

Sampled distribution of hyperparameters:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/density.png?raw=true

Artifacts logging:

https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/mlflow_artifacts_4.png?raw=true

Usage:

Install sklearn-genetic-opt

It's advised to install sklearn-genetic-opt in a virtual environment. Inside the environment, run:

pip install sklearn-genetic-opt

If you want to get all the features, including plotting and mlflow logging capabilities, install all the extra packages:

pip install sklearn-genetic-opt[all]

The only optional dependency that the last command does not install is TensorFlow; it is usually best to check which distribution works better for you.

Example: Hyperparameter Tuning

from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

data = load_digits()
n_samples = len(data.images)
X = data.images.reshape((n_samples, -1))
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier()

param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
              'bootstrap': Categorical([True, False]),
              'max_depth': Integer(2, 30),
              'max_leaf_nodes': Integer(2, 35),
              'n_estimators': Integer(100, 300)}

cv = StratifiedKFold(n_splits=3, shuffle=True)

evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=10,
                               generations=35,
                               param_grid=param_grid,
                               n_jobs=-1,
                               verbose=True,
                               keep_top_k=4)

# Train and optimize the estimator
evolved_estimator.fit(X_train, y_train)
# Best parameters found
print(evolved_estimator.best_params_)
# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)

Example: Feature Selection

import matplotlib.pyplot as plt
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np

data = load_iris()
X, y = data["data"], data["target"]

# Add random non-important features
noise = np.random.uniform(0, 10, size=(X.shape[0], 5))
X = np.hstack((X, noise))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = SVC(gamma='auto')

evolved_estimator = GAFeatureSelectionCV(
    estimator=clf,
    scoring="accuracy",
    population_size=30,
    generations=20,
    n_jobs=-1)

# Train and select the features
evolved_estimator.fit(X_train, y_train)

# Features selected by the algorithm
features = evolved_estimator.best_features_
print(features)

# Predict only with the subset of selected features
y_predict_ga = evolved_estimator.predict(X_test[:, features])
print(accuracy_score(y_test, y_predict_ga))

Changelog

See the changelog for notes on changes to Sklearn-genetic-opt.

Important links

Source code

You can check the latest development version with the command:

git clone https://github.com/rodrigo-arenas/Sklearn-genetic-opt.git

Install the development dependencies:

pip install -r dev-requirements.txt

Check the latest in-development documentation: https://sklearn-genetic-opt.readthedocs.io/en/latest/

Contributing

Contributions are more than welcome! There are several opportunities in the ongoing project, so please get in touch if you would like to help out. Make sure to check the current issues and also the Contribution guide.

Big thanks to the people who are helping with this project!

Contributors

Testing

After installation, you can launch the test suite from outside the source directory:

pytest sklearn_genetic
Comments
  • [Feature] Parallel Coordinates plot


    Is your feature request related to a problem? Please describe. NA

    Describe the solution you'd like Implement in the sklearn_genetic.plots module a function named plot_parallel_coordinates to inspect the results of the learning process

    Describe alternatives you've considered The function should take two arguments:

    • estimator: A fitted estimator from sklearn_genetic.GASearchCV
    • features: list, default=None. Subset of features to plot, if None it plots all the features by default

    The function should return an object to plot parallel coordinates according to the pandas.plotting.parallel_coordinates function

    The data to plot is available in the estimator.logbook object; see the implementation of the plot_search_space function for how to convert this data to a pandas data frame

    The function must select only the non-categorical variables. This can be done by inspecting the estimator.space object and comparing against the data types defined in sklearn_genetic.space (i.e. Categorical, Continuous and Integer), and it must color the lines by the "score" column. In the same way, it must validate the features parameter and issue a warning if a Categorical one is passed

    Additional context Links of some implementations:

    • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.parallel_coordinates.html
    • https://www.mathworks.com/help/stats/feature-selection-and-feature-transformation.html#buwh6hc-1
    help wanted good first issue new feature up-for-grabs 
    opened by rodrigo-arenas 14
  • [FEATURE] Report multiple scoring metrics


    Hello,

    I have been looking into your package and it is really cool. Thank you for putting a lot of effort in developing such an amazing tool.

    Is your feature request related to a problem? Please describe. GASearchCV, unlike GridSearchCV, only accepts one scoring metric. Obviously, the algorithm can only use one metric to decide which models will carry over to the next generation. However, I think it would be useful to view different scoring metrics for the best models (e.g. R2, MAE, RMSE), which intrinsically may provide a slightly different idea of model performance to the user. Of course we would still be able to decide which metric should be used to select the best models within each generation.

    Describe the solution you'd expect I think the implementation of multiple scoring metrics in GASearchCV could be similar to the one implemented in GridSearchCV regarding this specific matter. I show below some examples of this implementation in GridSearchCV:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV
    
    #generate input
    X = np.random.normal(75, 10, (1000, 2))
    y = np.random.normal(200, 20, 1000)
    params = {"degree": [2, 3], "C": [10, 20, 50]}
    
    #calculate both R2 and MAE for each tested model, but model refit is performed based on the combination of hyperparameters with the best R2
    grid = GridSearchCV(SVR(), param_grid=params, scoring=["neg_mean_absolute_error",  "r2"], refit="r2")
    
    #another way of doing the above, but this time using aliases for the scorers
    grid = GridSearchCV(SVR(), param_grid=params, scoring={"MAE": "neg_mean_absolute_error",  "R2": "r2"}, refit="R2")
    
    #perform grid search
    grid.fit(X, y)
    

    If you call grid.cv_results_ in this example, you will see the output dict will have a mean_test_MAE and mean_test_R2 keys (in the case of the second example).

    enhancement help wanted new feature up-for-grabs 
    opened by poroc300 11
  • MLPClassifier - ValueError: shuffle must be either True or False, got True.


    System information Windows 10 Sklearn-genetic-opt version: 0.6.1 Scikit-learn version: 0.24.2 Python version: Python 3.7

    Describe the bug When using the GASearchCV class with MLPClassifier as the estimator, I get the error in the title. In my param_grid, I simply have it set to Categorical([True, False]), but it doesn't seem to play well. Wondering what could be causing it?

    To Reproduce Could recreate it by creating a binary classification dataset from sklearn, then implementing this:

        curr_params = {"shuffle": Categorical([True, False])}
    
        evolved_estimator = GASearchCV(estimator=MLPClassifier(),
                                       cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
                                       scoring='balanced_accuracy',
                                       population_size=30,
                                       generations=30,
                                       tournament_size=3,
                                       elitism=True,
                                       crossover_probability=0.8,
                                       mutation_probability=0.1,
                                       param_grid=curr_params,
                                       criteria='max',
                                       algorithm='eaMuPlusLambda',
                                       n_jobs=1,
                                       verbose=True,
                                       keep_top_k=1)
    

    Expected behavior Seems to only be an issue with MLPClassifier so far, but should set the parameter shuffle to True or False.

    Screenshots

    Additional context

    bug 
    opened by windowshopr 9
  • Wrong output for GAFeatureSelectionCV only when using max_features for RandomForestClassifier


    System information OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10.0.19044 Sklearn-genetic-opt version: 0.8.0 Scikit-learn version: 1.0.1 Python version: 3.8.11

    Describe the bug A clear and concise description of what the bug is.

    Firstly, great work!!

    When using :

    clf_RF    = RandomForestClassifier(random_state=0)
    
    evolved_estimator = GAFeatureSelectionCV(
        estimator   = clf_RF,
        cv          = 5,
        population_size=20, 
        generations =40,
        crossover_probability=0.8,
        mutation_probability = 0.075,
        n_jobs      = -1,
        scoring     = "accuracy",
        max_features = 300
        )
    
    # Train and select the features
    evolved_estimator.fit(X, y)
    

    the Output looks like this;

    gen	nevals	fitness	fitness_std	fitness_max	fitness_min
    0  	20    	-10000 	0          	-10000     	-10000     
    1  	32    	-10000 	0          	-10000     	-10000     
    2  	33    	-10000 	0          	-10000     	-10000     
    3  	37    	-10000 	0          	-10000     	-10000     
    4  	36    	-10000 	0          	-10000     	-10000     
    5  	37    	-10000 	0          	-10000     	-10000     
    6  	36    	-10000 	0          	-10000     	-10000     
    7  	36    	-10000 	0          	-10000     	-10000     
    8  	36    	-10000 	0          	-10000     	-10000     
    9  	33    	-10000 	0          	-10000     	-10000     
    10 	33    	-10000 	0          	-10000     	-10000     
    11 	34    	-10000 	0          	-10000     	-10000     
    12 	34    	-10000 	0          	-10000     	-10000     
    13 	36    	-10000 	0          	-10000     	-10000     
    14 	33    	-10000 	0          	-10000     	-10000     
    15 	34    	-10000 	0          	-10000     	-10000     
    16 	33    	-10000 	0          	-10000     	-10000     
    17 	37    	-10000 	0          	-10000     	-10000     
    18 	35    	-10000 	0          	-10000     	-10000     
    19 	37    	-10000 	0          	-10000     	-10000     
    20 	34    	-10000 	0          	-10000     	-10000     
    21 	35    	-10000 	0          	-10000     	-10000     
    22 	35    	-10000 	0          	-10000     	-10000 
    

    This doesn't happen when removing the max_features parameter.

    To Reproduce Steps to reproduce the behavior:

    1. Go to '...'
    2. Add this code '....'
    3. Run with this command '....'
    4. See error

    Expected behavior A clear and concise description of what you expected to happen.

    Screenshots If applicable, add screenshots to help explain your problem.

    Additional context Add any other context about the problem here.

    bug 
    opened by cewinharhar 5
  •  change [FEATURE]


    Main Features: GASearchCV: Principal class of the package, holds the evolutionary cross-validation optimization routine. Algorithms: Set of different evolutionary algorithms to use as an optimization procedure. Callbacks: Custom evaluation strategies to generate early stopping rules, logging (into TensorBoard, .pkl files, etc) or your custom logic. Plots: Generate pre-defined plots to understand the optimization process. MLflow: Build-in integration with mlflow to log all the hyperparameters, cv-scores and the fitted models

    new feature 
    opened by Fancy-angel 5
  • [FEATURE] Add in CTRL + C Early Stopping!


    Is your feature request related to a problem? Please describe. Nope.

    • A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] None.

    • What is the use case for this feature? When a user wants to stop the optimization process manually (without using a callback), they could press CTRL + C to stop. The best evolved_estimator at the time of pressing CTRL + C will be returned and optimization will stop, allowing the rest of the script to continue.

    Describe the solution you'd expect See above

    • A clear and concise description of what you want to happen. TPOT is a good reference for this. The user presses CTRL + C after at least 1 pipeline has been fitted, and the best pipeline found until that point is used. The rest of the script can continue after that, like the evolved_estimator.predict() function.

    • Describe the workflow you want to enable See above.

    Additional context Love the tool! Would be cool to see this implemented :D

    help wanted new feature up-for-grabs 
    opened by windowshopr 5
  • Mlflow test


    I used the default mlruns file store and I created a test to ensure the mlruns folder is removed. I tried to cover all the tests you requested. If there's anything I should change, just give me a shout.

    I'm trying to get a docker container running with the mlflow server and a backend. I do think it's going to take a bit of time to figure out though because I'm not too clued up with docker. So hopefully these tests are ok for now.

    opened by Turtle24 4
  • Can a vector of weights be specified in `param_grid` within GASearchCV (somehow)?


    The idea is to take in predictions from an arbitrary number of models, and find optimal weights that maximize the accuracy of the ensembled model.

    Here's the estimator that I wrote:

    from typing import List, Optional
    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.utils import check_X_y, check_array
    from sklearn.utils.estimator_checks import check_estimator
    from sklearn.utils.validation import check_is_fitted
    from sklearn.metrics import mean_absolute_error
    
    
    class WeightedAverageEnsemble(BaseEstimator, RegressorMixin):
        """
        
        >>> wae = WeightedAverageEnsemble()
        >>> X = np.random.rand(20, 5)
        >>> y = np.random.rand(20, 1)
        >>> wae.fit(X, y)
        >>> wae.predict(X)
        
        >>> wae = WeightedAverageEnsemble(weights=[0.25, 0.75])
        >>> X = np.random.rand(20, 2)
        >>> y = np.random.rand(20, 1)
        >>> wae.fit(X, y)
        >>> wae.predict(X)
    
        Parameters
        ----------
        BaseEstimator : _type_
            _description_
        RegressorMixin : _type_
            _description_
        """
    
        def __init__(self, weights: Optional[List[float]] = None):
            if weights is not None:
                assert np.isclose(sum(weights), 1.0)
            self.weights = weights
    
        def fit(self, X, y):
            # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
            X, y = check_X_y(X, y, accept_sparse=False)
            self.is_fitted_ = True
            self.n_features_in_ = X.shape[1]
            if self.weights is None:
                self._mod_weights = np.ones(self.n_features_in_) / self.n_features_in_
                # equivalent to:
                # w = np.ones(self.n_features_in_).reshape(1, -1)
                # w = sklearn.preprocessing.normalize(w, norm="l1", axis=1)
            else:
                self._mod_weights = self.weights
            return self
    
        def predict(self, X):
            # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
            X = check_array(X, accept_sparse=False)
            check_is_fitted(self, "is_fitted_")
            W = np.tile(self._mod_weights, (X.shape[0], 1))
            y = np.einsum("ij, ij->i", W, X)
            # should be equivalent to: y = np.sum(W * X, axis=1)
            # loop with np.dot might also be fast due to BLAS compatibility
            # https://stackoverflow.com/a/26168677/13697228
            # https://stackoverflow.com/a/39657770/13697228
            return y
    
        def score(self, X, y, **kwargs):
            y_pred = self.predict(X)
            return mean_absolute_error(y, y_pred, **kwargs)
    
    
    check_estimator(WeightedAverageEnsemble())
    

    Related: https://machinelearningmastery.com/weighted-average-ensemble-with-python

    How would you suggest optimizing weights since it's a vector that can change in size based on the size of the input data?

    question 
    opened by sgbaird 3
  • [FEATURE] Add threshold parameter to ConsecutiveStopping


    Is your feature request related to a problem? Please describe. I ran GASearchCV with a callback that stopped the optimization if the fitness was no greater than at least one value of fitness from the last 5 generations.

    callback = ConsecutiveStopping(generations=5, metric='fitness')
    

    Checking the log information while the algorithm was running, I have noticed that the reported fitness (-12.7893) was the same for more than 5 consecutive generations (please see the attached image). Under these circumstances, I would have expected the algorithm to have stopped much earlier (in generation 8).

    consecutive_stopping

    I assume the algorithm did not stop because the logbook only shows 4 decimal places. However, given that fitness improved very little after generation 8, I think in some situations the user could have the option to provide a threshold value to ConsecutiveStopping, which would make the algorithm stop after N consecutive generations if the improvement in fitness (or any other metric) was no greater than a specific threshold (e.g. 0.0001). This could make the algorithm finish much faster on some occasions.

    Describe the solution you'd expect I have made a custom callback (which hopefully is correct) to achieve what I want (the documentation was quite helpful). Please feel free to make any comments regarding my code:

    from sklearn_genetic.callbacks.base import BaseCallback
    
    class ConsecutiveStoppingThreshold(BaseCallback):
        def __init__(self, threshold, N, metric='fitness'):
            self.threshold = threshold
            self.N = N
            self.metric = metric
            
        def on_step(self, record, logbook, estimator=None):
            #not enough data points
            if len(logbook) <= self.N:
                return False
            
            #get the last N metrics
            stats = logbook.select(self.metric)[-self.N :]
            
            #find the difference between max and min fitness in the last metrics
            diff = max(stats) - min(stats)
            
            if self.threshold > diff:
                return True
            return False
    

    I have tested this code and it appears to work fine. In my perspective, such type of callback is very useful and, therefore, I think it should be more easily accessible to users. In my opinion, you could do one of the following:

    1. Show an explicit example, in the section "Custom callbacks" in the package's homepage, where you demonstrate how to achieve the above.
    2. Or have a threshold argument in ConsecutiveStopping where the user can provide a float to determine how much improvement is allowed after N consecutive generations.
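
The difference between a strict "no improvement" rule and a threshold rule can be illustrated without the library. The sketch below is an assumption-laden toy: `strict_no_improvement` is a hypothetical rule (not necessarily what ConsecutiveStopping does), `threshold_stop` mirrors the max-minus-min check from the custom callback above, and the fitness values are made up so that they all display as -12.7893 at 4 decimal places while still improving slightly.

```python
def strict_no_improvement(history, n):
    """Hypothetical strict rule: stop only if the latest value is not
    greater than any of the previous n values."""
    if len(history) <= n:
        return False
    last = history[-1]
    return all(last <= past for past in history[-(n + 1):-1])


def threshold_stop(history, n, threshold):
    """Threshold rule: stop if the spread of the last n values is below threshold."""
    if len(history) <= n:
        return False
    recent = history[-n:]
    return (max(recent) - min(recent)) < threshold


# Made-up fitness values: tiny improvements hidden by 4-decimal display.
history = [-13.5, -13.0, -12.78934, -12.78933, -12.78932,
           -12.78931, -12.78930, -12.78929]

displayed = [f"{value:.4f}" for value in history[-6:]]
print(displayed)                                     # all show -12.7893
print(strict_no_improvement(history, n=5))           # False: still improving
print(threshold_stop(history, n=5, threshold=1e-4))  # True: improvement is negligible
```

The strict rule keeps running because each generation is marginally better than the last, while the threshold rule treats changes smaller than 0.0001 as no progress.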
    new feature 
    opened by poroc300 3
  • understand_cv documentation spelling updates


    I updated the understand_cv documents grammar and spelling a bit. This is for #43 and I'll slowly go over all the documentation because I want to better understand the package and work on the inner workings eventually.

    opened by Turtle24 3
  • [FEATURE] GAFeaturesSelectionCV


    Is your feature request related to a problem? Please describe. This feature will extend the package's functionality to include feature selection using evolutionary algorithms. Currently, only hyperparameter tuning is supported.

    Describe the solution you'd expect

    Implement the class GAFeaturesSelectionCV inside sklearn_genetic.genetic_search with the following functionalities:

    • This class should take the same parameters as GASearchCV except for param_grid; the estimator should have its own defined parameters.
    • Perform cross-validation over different sets of features that are selected using evolutionary algorithms. The same sklearn_genetic.algorithms options must be available as the optimization routine.
    • The class should be able to work with the existing features of the package, such as Callbacks and plotting the fitness evolution.
    • All the documentation must be updated, indicating which functionality of the package is compatible only with GASearchCV (e.g. most likely plot_search_space won't be compatible with feature selection).
    • It must accept a GASearchCV instance as the estimator.
    • There must be an attribute called best_features_ that has the final selected features by the model.

    Additional context The evolutionary algorithm can be defined by assigning a gene to each feature: if the gene is 1, the feature is selected; if it is 0, it is not.

    Note: I'll be working on this feature, but as always, new ideas and contributions are welcome
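
The gene encoding described here can be sketched in a few lines of plain Python. This is only an illustration of the binary-mask idea; `apply_mask` and `random_individual` are hypothetical helpers, not part of sklearn_genetic.

```python
import random


def random_individual(n_features, rng):
    """One candidate solution: a 0/1 gene per feature."""
    return [rng.randint(0, 1) for _ in range(n_features)]


def apply_mask(rows, individual):
    """Keep only the columns whose gene is 1."""
    return [[value for value, gene in zip(row, individual) if gene == 1]
            for row in rows]


X = [[5.1, 3.5, 1.4, 0.2],
     [4.9, 3.0, 1.4, 0.2]]

individual = [1, 0, 1, 0]          # select the first and third features
X_selected = apply_mask(X, individual)
print(X_selected)                  # [[5.1, 1.4], [4.9, 1.4]]

rng = random.Random(0)
print(random_individual(len(X[0]), rng))
```

Each individual would then be scored by cross-validating the estimator on the masked data, so the evolutionary loop searches over masks instead of hyperparameters.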

    new feature 
    opened by rodrigo-arenas 3
  • [FEATURE] Feature selection and optimization simultaneously


    It seems to me that the best approach to optimizing an estimator would be to run both feature selection AND hyperparameter optimization simultaneously within the same evolution process. It would be complex but probably yield better results instead of using one after the other.

    Is this something I can do within the current framework, or does this require new code?

    Also, do you think it is even a good idea in the first place?

    new feature 
    opened by doodledood 0
  • [FEATURE]


    Is your feature request related to a problem? Please describe.

    • A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
    • What is the use case for this feature?

    Describe the solution you'd expect

    • A clear and concise description of what you want to happen.
    • Describe the workflow you want to enable

    Additional context Add any other context or screenshots about the feature request here.

    new feature 
    opened by rawanm900 2
  • Contributing to this project


    Hello, as part of our studies, me and some friends have to contribute to an opensource project related to optimization. Your project seems particularly interesting.

    Do you need help on a particular feature? On which subject do you advise us to work?

    Have a nice day, Pierre C.

    question 
    opened by pierrechagn 1
  • [FEATURE] Conda package


    Is your feature request related to a problem? Please describe. May I ask if there are plans to release a conda package in the near future?

    I want to use this package within a project whose virtual environment is created with conda and all installed packages are also from conda/conda-forge. I have pip installed in the environment and tried to install sklearn-genetic-opt via pip as stated in the docs (pip install sklearn-genetic-opt). pip identified the dependencies and installed them (deap, numpy, etc.). The problem though is that it doesn't integrate well with the environment. For instance, I have pandas 1.5.0 installed in the conda environment, but when I open a Python session and run import sklearn_genetic, the interpreter returns me an error claiming that pandas is not installed.

    Describe the solution you'd expect The package would be easier to use if it were possible to install it within conda.

    Additional context Everything I reported refers to a Windows 10 21H2 machine.

    new feature 
    opened by abianco88 2
  • GAFeatureSelectionCV - <classifier> object has no attribute 'transform'


    System information OS Platform and Distribution: Windows 11 Home Sklearn-genetic-opt version: 0.9.0 deap version: 1.3.3 Scikit-learn version: 1.1.2 Python version: 3.8.13

    Describe the bug I have fitted an instance of GAFeatureSelectionCV using LGBMClassifier

    clf_dim = LGBMClassifier()
    gen_opt = GAFeatureSelectionCV(
                                   clf_dim, cv=5, scoring='avg_prec', refit=True, 
                                   generations=20, population_size=50, tournament_size=3,
                                   mutation_probability=0.8, crossover_probability=0.2, elitism=True, keep_top_k=1,
                                   n_jobs=1, verbose=True, 
                                  )
    

    and got the expected results in the various output attributes such as .best_estimator_ and n_features_in_

    However, unlike the example provided in the documentation, I am not attempting to use the selected features and the estimator directly to predict results on test data.

    Instead, I am trying to follow the traditional scikit-learn approach of incorporating this estimator to select features as step 'dim' in the following pipeline, before passing them on to another classifier at the end of the pipeline image

    This requires that the 'transformer' based on GAFeatureSelectionCV supports a transform() method, which it does. However, when I try to use the transform method of the fitted estimator standalone, as in:

    gen_opt.transform(X_t)
    

    I get an error suggesting that

    'LGBMClassifier' object has no attribute 'transform'

    I went on to define a pipeline with the estimator as below:

    pipe_dim_full = Pipeline(
        steps=[
            ('enc', encode), 
            ('dim', gen_opt), 
            ('clf', clf), 
        ], 
    )
    

    and upon trying to fit it, I get a somewhat contradictory error:

    TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True), estimator=LGBMClassifier(n_jobs=1, random_state=0, verbose=-1), generations=20, n_jobs=18, return_train_score=True, scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1))' (type <class 'sklearn_genetic.genetic_search.GAFeatureSelectionCV'>) doesn't

    As it stands, GAFeatureSelectionCV can't be used in a pipeline without the transform() method being fixed, which is unfortunate as I really like it and was looking forward to using GA across my pipeline.

    To Reproduce Steps to reproduce the behavior: As described above. Please reach out if you need more detail.

    Expected behavior The transform method should produce a matrix with n_features_in_ columns of the input matrix

    Additional context There is another module based on deap that successfully offers feature selection by genetic algorithm. Here is a link for reference https://sklearn-genetic.readthedocs.io/en/latest/api.html

    bug 
    opened by RNarayan73 1
  • [FEATURE] Support for XGBoost early stopping


    Thanks for such a cool package.

    I'm using GASearchCV to hypertune an xgboost model. However, it fails if I use early stopping in fit(). Can early stopping (and the additional xgboost fitting params) be used with GASearchCV().fit()?

    Thanks, Hayden

    new feature 
    opened by hrampadarath 1
Releases(0.9.0)
  • 0.9.0(Jun 6, 2022)

    This release comes with new features and general performance improvements

    Features:

    • Introducing Adaptive Schedulers to enable adaptive mutation and crossover probabilities; currently, supported schedulers are: ConstantAdapter, ExponentialAdapter, InverseAdapter, and PotentialAdapter

    • Add random_state parameter (default=None) in the Continuous, Categorical and Integer classes from space to fix the random seed during hyperparameter sampling.
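
As a rough illustration of what an adaptive probability can look like, the sketch below decays a value exponentially from a starting point toward an end point as generations pass. The function name, its parameters and the formula are assumptions made for this example; they are not taken from the package and may differ from how ExponentialAdapter is actually defined.

```python
import math


def exponential_schedule(initial_value, end_value, rate, generation):
    """Hypothetical schedule: move from initial_value toward end_value
    exponentially as the generation index grows."""
    return end_value + (initial_value - end_value) * math.exp(-rate * generation)


# Example: a mutation probability that starts at 0.8 and decays toward 0.2.
values = [exponential_schedule(0.8, 0.2, rate=0.1, generation=g) for g in range(40)]

for generation in (0, 5, 10, 20, 39):
    print(generation, round(values[generation], 4))
```

A high starting value favors exploration in early generations, and the decay shifts the search toward exploitation later on.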

    API Changes:

    • Changed the default values of mutation_probability and crossover_probability to 0.8 and 0.2, respectively.

    • The weighted_choice function used in GAFeatureSelectionCV was re-written to give more probability to a number of features closer to the max_features parameter

    • Removed unused and broken function plot_parallel_coordinates()

    Bug Fixes

    • Now, when using the plot_search_space() function, all the parameters get cast as np.float64 to avoid errors on the seaborn package while plotting bool values.
  • 0.8.1(Mar 9, 2022)

    This release changes how the initial population is built when the max_features parameter of GAFeatureSelectionCV is set: it is now sampled giving more probability to solutions with fewer than max_features features.

  • 0.8.0(Jan 5, 2022)

    This release comes with some requested features and enhancements.

    Features:

    • Class GAFeatureSelectionCV now has a parameter called max_features, int, default=None. If it's not None, it will penalize individuals with more features than max_features, putting a "soft" upper bound to the number of features to be selected.

    • Classes GASearchCV and GAFeatureSelectionCV now support multi-metric evaluation the same way scikit-learn does; you will see this reflected on the logbook and cv_results_ objects, where now you get results for each metric. As in scikit-learn, if multi-metric is used, the refit parameter must be a str specifying the metric to evaluate the cv-scores.

    • Training gracefully stops if interrupted by some of these exceptions: KeyboardInterrupt, SystemExit, StopIteration. When one of these exceptions is raised, the model finishes the current generation and saves the current best model. It only works if at least one generation has been completed.
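
    The graceful-stop behavior can be pictured with a simplified loop like the one below. The names evolve and fake_generation are hypothetical, and the sketch discards the interrupted generation instead of finishing it, so it only approximates what the release note describes:

```python
from typing import Callable, Optional, Tuple

STOP_EXCEPTIONS = (KeyboardInterrupt, SystemExit, StopIteration)


def evolve(run_generation: Callable[[int], float],
           generations: int) -> Tuple[Optional[float], int]:
    """Run up to `generations` steps, keeping the best score seen so far.

    Returns (best_score, completed_generations).
    """
    best_score = None
    completed = 0
    for gen in range(generations):
        try:
            score = run_generation(gen)
        except STOP_EXCEPTIONS:
            if completed == 0:
                raise  # nothing to fall back on before the first generation ends
            break
        if best_score is None or score > best_score:
            best_score = score
        completed += 1
    return best_score, completed


def fake_generation(gen: int) -> float:
    """Hypothetical generation step that is interrupted at generation 3."""
    if gen == 3:
        raise KeyboardInterrupt
    return 0.5 + 0.1 * gen


best, done = evolve(fake_generation, generations=10)
```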

    API Changes:

    • The following parameters changed their default values to create more extensive and different models with better results:

      • population_size from 10 to 50

      • generations from 40 to 80

      • mutation_probability from 0.1 to 0.2

    Docs:

    • A new notebook called Iris_multimetric was added to showcase the new multi-metric capabilities.
  • 0.7.0(Nov 17, 2021)

    This is an exciting release! It introduces feature selection capabilities to the package.

    Features:

    • GAFeatureSelectionCV class for feature selection along with any scikit-learn classifier or regressor. It optimizes the cv-score while minimizing the number of features to select. This class is compatible with the mlflow and tensorboard integration, the Callbacks, and the plot_fitness_evolution function.

    API Changes:

    • The module mlflow was renamed to mlflow_log to avoid unexpected errors on name resolution.

  • 0.6.1(Aug 4, 2021)

    This is a minor release that fixes a couple of bugs and adds some minor options.

    Features:

    • Added the parameter generations to DeltaThreshold. Now it compares the maximum and minimum values of a metric from the last generations, instead of just the current and previous ones. The default value is 2, so the behavior remains the same as in previous versions.
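
    As a rough sketch of this rule (the function should_stop and its arguments are assumptions for illustration, not the callback's actual code), the check compares the spread of the metric over the last generations against a threshold:

```python
from typing import Sequence


def should_stop(history: Sequence[float], threshold: float, generations: int = 2) -> bool:
    """Stop when the spread of the last `generations` values is within threshold."""
    if len(history) < generations:
        return False  # not enough generations to compare yet
    window = history[-generations:]
    return max(window) - min(window) <= threshold


fitness_history = [0.70, 0.80, 0.85, 0.86, 0.861]
stop_with_default = should_stop(fitness_history, threshold=0.02)
stop_with_wider_window = should_stop(fitness_history, threshold=0.02, generations=4)
```

    With generations=2 the window holds only the current and previous values, which matches the earlier behavior.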

    Bug Fixes:

    • When a param_grid of length 1 is provided, a user warning is raised instead of an error. Internally, the crossover operation is swapped to DEAP's tools.cxSimulatedBinaryBounded.
    • When using the Continuous class with boundaries lower and upper, values were sampled from a uniform distribution with limits [lower, lower + upper]; they are now properly sampled from [lower, upper].
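
    The sampling fix comes down to using the interval width rather than the upper bound as the scale. A minimal sketch with the standard library, where sample_continuous is a hypothetical helper and not the package's code:

```python
import random


def sample_continuous(lower: float, upper: float, rng: random.Random) -> float:
    """Draw one value uniformly from [lower, upper]."""
    # The width of the interval is (upper - lower); using `upper` as the
    # width would instead cover [lower, lower + upper].
    return lower + (upper - lower) * rng.random()


rng = random.Random(0)
samples = [sample_continuous(0.1, 0.5, rng) for _ in range(1000)]
```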
  • 0.6.0(Jul 5, 2021)

    This is a big release with several new features and enhancements! 🎊

    Features:

    • Added the ProgressBar callback; it uses a tqdm progress bar to show how many generations are left in the training process.

    • Added the TensorBoard callback to log the generation metrics, watch in real-time while the models are trained, and compare different runs in your TensorBoard instance.

    • Added the TimerStopping callback to stop the iterations after a total (threshold) fitting time has elapsed.

    • Added new parallel coordinates plot using plot_parallel_coordinates by @Raul9595

    • Now, if one or more callbacks decide to stop the algorithm, their class names are printed to show which callbacks were responsible for the stopping.

    • Added support for extra methods coming from scikit-learn's BaseSearchCV, like cv_results_, best_index_ and refit_time_ among others.

    • Added methods on_start and on_end to BaseCallback. Now the algorithms check for the callbacks like this:

      • on_start: When the evolutionary algorithm is called from the GASearchCV.fit method.

      • on_step: When the evolutionary algorithm finishes a generation (no change here).

      • on_end: At the end of the last generation.
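
    The order of these hooks can be sketched with a toy loop. RecordingCallback and run are hypothetical stand-ins written for this example; the real BaseCallback methods take additional arguments, and the convention that on_step returns True to request stopping is an assumption of this sketch:

```python
class RecordingCallback:
    """Hypothetical stand-in callback that records which hooks were called."""

    def __init__(self):
        self.events = []

    def on_start(self):
        self.events.append("start")

    def on_step(self, generation):
        self.events.append(f"step {generation}")
        return False  # in this sketch, returning True would request early stopping

    def on_end(self):
        self.events.append("end")


def run(callback, generations):
    """Toy optimization loop showing when each hook fires."""
    callback.on_start()
    for generation in range(generations):
        # ... evaluate and evolve the population here ...
        if callback.on_step(generation):
            break
    callback.on_end()


callback = RecordingCallback()
run(callback, generations=3)
```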

    Bug Fixes:

    • A missing statement caused the callbacks to be evaluated starting from generation 1, ignoring generation 0. This is now properly handled and callbacks work from generation 0.

    API Changes:

    • The modules sklearn_genetic.plots and sklearn_genetic.mlflow.MLflowConfig now require an explicit installation of seaborn and mlflow; those can be optionally installed using pip install sklearn-genetic-opt[all].
    • The GASearchCV.logbook property now has extra information that comes from the scikit-learn cross_validate function.
    • An optional extra parameter was added to GASearchCV, named return_train_score: bool, default=False. As in scikit-learn, it controls if the cv_results_ should have the training scores.

    Docs:

    • Edited all demos to be in the Jupyter notebook format.
    • Added embedded Jupyter notebook examples to the Read the Docs page.
    • The modules of the package now have a summary of their classes/functions in the docs.
    • Updated the callbacks and custom callbacks tutorials to add a new TensorBoard callback and the new methods on the base callback.

    Internal:

    • Now the HallofFame (hof) uses the self.best_params_ for the position 0, to be consistent with the scikit-learn API and parameters like self.best_index_
    • MLflow now has unit tests by @Turtle24

    Thanks to new contributors for helping in this project! @Raul9595 @Turtle24

  • 0.5.0(Jun 22, 2021)

    Features:

    • Built-in integration with MLflow using the class sklearn_genetic.mlflow.MLflowConfig and the new parameter log_config from the class sklearn_genetic.GASearchCV

    • Implemented the callback sklearn_genetic.callbacks.LogbookSaver which saves the estimator.logbook object with all the fitted hyperparameters and their cross-validation score

    • Added the parameter estimator to all the functions on the module sklearn_genetic.callbacks

    Docs:

    • Added user guide "Integrating with MLflow"
    • Update the tutorial "Custom Callbacks" for new API inheritance behavior

    Internal:

    • Added a base class sklearn_genetic.callbacks.base.BaseCallback from which all callbacks must inherit
    • Now coverage report doesn't take into account the lines with # pragma: no cover and # noqa
  • 0.4.1(Jun 2, 2021)

    Docs:

    • Added user guide on "Understanding the evaluation process"
    • Several guides on contributing, code of conduct
    • Added important links
    • Docs requirements are now independent of package requirements

    Internal:

    • Changed test CI from Travis to GitHub Actions
  • 0.4.0(May 31, 2021)

    Features:

    • Implemented the callbacks module to stop the optimization process based on the current iteration metrics. Currently implemented: sklearn_genetic.callbacks.ThresholdStopping, sklearn_genetic.callbacks.ConsecutiveStopping and sklearn_genetic.callbacks.DeltaThreshold.
    • The algorithms 'eaSimple', 'eaMuPlusLambda' and 'eaMuCommaLambda' are now implemented in the module sklearn_genetic.algorithms for more control over their options, rather than taken from the deap.algorithms module.
    • Implemented the sklearn_genetic.plots module and added the function sklearn_genetic.plots.plot_search_space; this function plots a mix of contour, scatter and histogram plots over all the fitted hyperparameters and their cross-validation score.
    • Documentation based on rst with Sphinx, hosted on Read the Docs. It includes public classes and functions documentation as well as several tutorials on how to use the package, link: https://sklearn-genetic-opt.readthedocs.io/
    • Added best_params_ and best_estimator_ properties after fitting GASearchCV.
    • Added optional parameters refit, pre_dispatch and error_score.

    API Changes:

    • Removed support for python 3.6, changed the libraries supported versions to be the same as scikit-learn current version.
    • Several internal changes on the documentation and variables naming style to be compatible with Sphinx.
    • Removed the parameters continuous_parameters, categorical_parameters and integer_parameters in GASearchCV, replacing them with param_grid.
  • 0.3.0(May 28, 2021)

    Features:

    • Added the space module to better control the data types and ranges of each hyperparameter and the distribution to sample random values from, and to merge all data types into one Space class that works with the new param_grid parameter
    • Replaced continuous_parameters, categorical_parameters and integer_parameters with param_grid; the old parameters still work but will be removed in a future version
    • Added the option to use the eaMuCommaLambda algorithm from deap
    • The mu and lambda_ parameters of the internal eaMuPlusLambda and eaMuCommaLambda are now defined in terms of the initial population size and not the number of generations
  • 0.2.1(May 27, 2021)

    Features:

    • Enabled deap's eaMuPlusLambda algorithm for the optimization process; it is now the default routine
    • Added the parameter keep_top_k to control the amount of solutions in the hall of fame (hof)
    • Changed default parameters crossover_probability from 1 to 0.8 and generations from 50 to 40

    Internal

    • Changed parameters with pre-defined options to use pydantic models

    Fixes

    • Fixed the logging of the scoring metric in the logbook; it is now part of the parameters and is shown only once
  • 0.2.0(May 25, 2021)

    Features:

    • Added the logbook and history properties to the fitted GASearchCV for post-fit analysis
    • Elitism = False now implements a roulette selection instead of ignoring the parameter
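
    Roulette selection picks individuals with probability proportional to their fitness. A minimal sketch with the standard library follows; the helper roulette_select is hypothetical and this is not the routine the package or deap uses:

```python
import random
from typing import List, Sequence


def roulette_select(fitnesses: Sequence[float], k: int, rng: random.Random) -> List[int]:
    """Pick k indices with probability proportional to (non-negative) fitness."""
    if any(f < 0 for f in fitnesses) or sum(fitnesses) <= 0:
        raise ValueError("roulette selection needs non-negative fitness with a positive total")
    return rng.choices(range(len(fitnesses)), weights=fitnesses, k=k)


rng = random.Random(42)
picked = roulette_select([0.1, 0.1, 0.8], k=1000, rng=rng)
```

    Unlike elitism, weaker individuals still have a chance of being selected, which keeps more diversity in the population.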

    API Changes:

    • Refactored the optimization algorithm to use the deap package instead of a custom implementation; this caused the removal of several methods, properties and variables inside the GASearchCV class
    • The parameter encoding_length has been removed; it is no longer required by the GASearchCV class
    • Renamed the property of the fitted estimator from best_params_ to best_params
    • The verbosity now prints the deap log of the fitness function, its standard deviation, max and min values from each generation
    • The variable GASearchCV._best_solutions was removed; it is meant to be replaced with GASearchCV.logbook and GASearchCV.history
  • 0.1.1(Apr 28, 2021)

    Bug Fixes:

    • Fixes unexpected overwrites over if statements
    • Corrects validation when parameter dicts are empty

    Enhancements:

    • Criteria parameter to control if it's a minimization or maximization problem with respect to the scoring metric
    • Plot fitness function over generations
    • Unit tests for all the package
    • Examples with regression problems
    • Implementation of some magic methods
    • Documentation of the GASearchCV class parameters
  • 0.1.0(Apr 27, 2021)

Owner
Rodrigo Arenas