A modular active learning framework for Python

Overview

modAL

Modular Active Learning framework for Python3

Introduction

modAL is an active learning framework for Python3, designed with modularity, flexibility and extensibility in mind. Built on top of scikit-learn, it allows you to rapidly create active learning workflows with nearly complete freedom. What is more, you can easily replace parts with your custom built solutions, allowing you to design novel algorithms with ease.

Active learning from bird's-eye view

With the recent explosion of available data, you can have millions of unlabelled examples with a high cost to obtain labels. For instance, when trying to predict the sentiment of tweets, obtaining a training set can require immense manual labour. But worry not, active learning comes to the rescue! In general, AL is a framework allowing you to increase classification performance by intelligently querying you to label the most informative instances. To give an example, suppose that you have the following data and classifier with shaded regions signifying the classification probability.

Suppose that you can query the label of an unlabelled instance, but it costs you a lot. Which one would you choose? By querying an instance in the uncertain region, you surely obtain more information than by querying at random. Active learning gives you a set of tools to handle problems like this. In general, an active learning workflow looks like the following.

The key components of any workflow are the model you choose, the uncertainty measure you use and the query strategy you apply to request labels. With modAL, instead of choosing from a small set of built-in components, you have the freedom to seamlessly integrate scikit-learn or Keras models into your algorithm and easily tailor your custom query strategies and uncertainty measures.
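To make these components concrete, here is a minimal NumPy-only sketch. It is not modAL's implementation: the probability matrix `proba` and both function names are made up for illustration. It uses one minus the top class probability as the uncertainty measure and queries the instance where that value is largest.

```python
import numpy as np

# Hypothetical class probabilities for four unlabelled instances (rows sum to 1).
proba = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.10, 0.90],
])

def classification_uncertainty(proba):
    """Uncertainty measure: one minus the probability of the most likely class."""
    return 1.0 - proba.max(axis=1)

def query_most_uncertain(proba):
    """Query strategy: index of the instance the model is least sure about."""
    return int(np.argmax(classification_uncertainty(proba)))

query_idx = query_most_uncertain(proba)
print(query_idx)  # 1, the instance closest to the decision boundary
```

Swapping either function for another one changes the behaviour of the loop without touching the model, which is the kind of modularity described above.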

modAL in action

Let's see what modAL can do for you!

From zero to one in a few lines of code

Active learning with a scikit-learn classifier, for instance RandomForestClassifier, can be as simple as the following.

from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

# initializing the learner
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_training, y_training=y_training
)

# query for labels
query_idx, query_inst = learner.query(X_pool)

# ...obtaining new labels from the Oracle...

# supply label for queried instance
learner.teach(X_pool[query_idx], y_new)

Replacing parts quickly

If you would like to use different uncertainty measures and query strategies than the default uncertainty sampling, you can either replace them with one of several built-in strategies or design your own by following a few very simple design principles. For instance, replacing the default uncertainty measure with classification entropy looks like the following.

from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling
from sklearn.ensemble import RandomForestClassifier

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=entropy_sampling,
    X_training=X_training, y_training=y_training
)
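For intuition about what an entropy-based measure rewards, the following NumPy sketch computes the Shannon entropy of each row of a made-up probability matrix. The function name and the data are assumptions for illustration, and this is not necessarily how modAL computes it internally.

```python
import numpy as np

def classification_entropy(proba):
    """Shannon entropy (natural log) of each row of class probabilities."""
    proba = np.asarray(proba, dtype=float)
    # Treat 0 * log(0) as 0 by only taking the log of the nonzero entries.
    log_p = np.log(proba, where=proba > 0, out=np.zeros_like(proba))
    return -(proba * log_p).sum(axis=1)

proba = np.array([
    [1.0, 0.0, 0.0],   # fully confident
    [0.5, 0.5, 0.0],   # torn between two classes
    [1/3, 1/3, 1/3],   # maximally uncertain
])
entropies = classification_entropy(proba)
query_idx = int(np.argmax(entropies))  # the row with the flattest distribution
```

Unlike a measure based only on the top class probability, entropy takes the whole distribution into account, which matters when there are more than two classes.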

Replacing parts with your own solutions

modAL was designed to make it easy for you to implement your own query strategy. For example, implementing and using a simple random sampling strategy is as easy as the following.

import numpy as np
from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

def random_sampling(classifier, X_pool):
    n_samples = len(X_pool)
    query_idx = np.random.choice(range(n_samples))
    return query_idx, X_pool[query_idx]

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=random_sampling,
    X_training=X_training, y_training=y_training
)

For more details on how to implement your custom strategies, visit the page Extending modAL!
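As a further illustration of the same two-argument signature, here is a hedged sketch of a margin-based strategy. `StubClassifier` and `margin_sampling_sketch` are hypothetical names introduced for this example; the stub only mimics an estimator that exposes `predict_proba`, so the snippet runs without a trained model.

```python
import numpy as np

class StubClassifier:
    """Hypothetical stand-in for a fitted estimator exposing predict_proba."""
    def predict_proba(self, X):
        # Probability of class 1 grows with the single feature; purely illustrative.
        p1 = 1.0 / (1.0 + np.exp(-X[:, 0]))
        return np.column_stack([1.0 - p1, p1])

def margin_sampling_sketch(classifier, X_pool):
    """Pick the instance whose two most likely classes are closest in probability."""
    proba = classifier.predict_proba(X_pool)
    top_two = np.sort(proba, axis=1)[:, -2:]   # two largest probabilities per row
    margin = top_two[:, 1] - top_two[:, 0]
    query_idx = int(np.argmin(margin))
    return query_idx, X_pool[query_idx]

X_pool = np.array([[-3.0], [0.1], [2.5]])
query_idx, query_inst = margin_sampling_sketch(StubClassifier(), X_pool)
```

The function returns the index together with the instance, mirroring the random_sampling example above.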

An example with active regression

To see modAL in real action, let's consider an active regression problem with Gaussian Processes! In this example, we shall try to learn the noisy sine function:

import numpy as np

X = np.random.choice(np.linspace(0, 20, 10000), size=200, replace=False).reshape(-1, 1)
y = np.sin(X) + np.random.normal(scale=0.3, size=X.shape)

For active learning, we shall define a custom query strategy tailored to Gaussian processes. In a nutshell, a query strategy in modAL is a function taking (at least) two arguments (an estimator object and a pool of examples) and outputting the index of the queried instance. In our case, the arguments are regressor and X.

def GP_regression_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    return np.argmax(std)

After setting up the query strategy and the data, the active learner can be initialized.

from modAL.models import ActiveLearner
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF

n_initial = 5
initial_idx = np.random.choice(range(len(X)), size=n_initial, replace=False)
X_training, y_training = X[initial_idx], y[initial_idx]

kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
         + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))

regressor = ActiveLearner(
    estimator=GaussianProcessRegressor(kernel=kernel),
    query_strategy=GP_regression_std,
    X_training=X_training.reshape(-1, 1), y_training=y_training.reshape(-1, 1)
)

The initial regressor is not very accurate.

The blue band enveloping the regressor represents the standard deviation of the Gaussian process at the given point. Now we are ready to do active learning!

# active learning
n_queries = 10
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(X)
    regressor.teach(X[query_idx].reshape(1, -1), y[query_idx].reshape(1, -1))
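The loop above queries from the full X on every iteration, so nothing prevents a point from being selected again. If you prefer to draw each point at most once, one option is to delete it from the pool after labelling. The sketch below shows only that bookkeeping: `stand_in_query` is a hypothetical placeholder for regressor.query, and the call to teach is left as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
X_pool = np.linspace(0, 20, 50).reshape(-1, 1)
y_pool = np.sin(X_pool) + rng.normal(scale=0.3, size=X_pool.shape)

def stand_in_query(X):
    """Hypothetical placeholder for regressor.query: picks a random pool index."""
    return int(rng.integers(len(X)))

X_labelled, y_labelled = [], []
n_queries = 10
for _ in range(n_queries):
    query_idx = stand_in_query(X_pool)
    # This is where regressor.teach(...) would be called with the new label.
    X_labelled.append(X_pool[query_idx])
    y_labelled.append(y_pool[query_idx])
    # Drop the queried point so it cannot be selected again.
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
```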

After a few queries, we can see that the prediction is much improved.

Additional examples

Including this one, many examples are available.

Installation

modAL requires

  • Python >= 3.5
  • NumPy >= 1.13
  • SciPy >= 0.18
  • scikit-learn >= 0.18

You can install modAL directly with pip:

pip install modAL

Alternatively, you can install modAL directly from source:

pip install git+https://github.com/modAL-python/modAL.git

Documentation

You can find the documentation of modAL at https://modAL-python.github.io, where several tutorials and working examples are available, along with a complete API reference. For running the examples, Matplotlib >= 2.0 is recommended.

Citing

If you use modAL in your projects, you can cite it as

@article{modAL2018,
    title={mod{AL}: {A} modular active learning framework for {P}ython},
    author={Tivadar Danka and Peter Horvath},
    url={https://github.com/modAL-python/modAL},
    note={available on arXiv at \url{https://arxiv.org/abs/1805.00979}}
}

About the developer

modAL is developed by me, Tivadar Danka (aka cosmic-cortex on GitHub). I have a PhD in pure mathematics, but I fell in love with biology and machine learning right after I finished my PhD. I have changed fields and now I work in the Bioimage Analysis and Machine Learning Group of Peter Horvath, where I am working to develop active learning strategies for intelligent sample analysis in biology. During my work I realized that in Python, creating and prototyping active learning workflows can be made really easy and fast with scikit-learn, so I ended up developing a general framework for this. The result is modAL :) If you have any questions, requests or suggestions, you can contact me at [email protected]! I hope you'll find modAL useful!

Comments
  • Pandas support & support for applying transformations configured in sklearn.pipeline

    Most notable changes

    • query strategies now only return the indices of the selected instances, the query method then includes the instances themselves
      • old interface is still supported, but its usage results in a deprecation warning
    • added on_transformed parameter to learners; when True and the estimator uses sklearn.pipeline, the transformations configured in that pipeline are applied before calculating metrics on the data set
      • Committees also support this functionality, but as they have no X_training (it could be different for each of their learners), the training data cannot yet be transformed
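The first change in the list can be pictured with the NumPy-only sketch below. It is an illustration of the described interface, not code from the PR: `random_sampling_indices` and the free function `query` are hypothetical names, with `query` standing in for the learner's query method.

```python
import numpy as np

def random_sampling_indices(classifier, X_pool, n_instances=1):
    """New-style strategy sketch: return only the indices of the selected instances."""
    rng = np.random.default_rng()
    return rng.choice(len(X_pool), size=n_instances, replace=False)

def query(strategy, classifier, X_pool, **kwargs):
    """Hypothetical stand-in for the learner's query method: it looks up the instances itself."""
    query_idx = strategy(classifier, X_pool, **kwargs)
    return query_idx, X_pool[query_idx]

X_pool = np.arange(20).reshape(10, 2)
query_idx, query_inst = query(random_sampling_indices, None, X_pool, n_instances=3)
```

Callers still receive both the indices and the instances, while the strategy itself no longer has to index into the pool.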

    Note

    @cosmic-cortex , after playing around with your code, I must say you have created a great library! I am open to discussion to get this functionality merged, but please don't feel any pressure to do so if you are not satisfied with the implementation. I just needed to resolve #104 for my project and my fork is now sufficient for my needs.

    Note2

    Not sure where this functionality should be addressed in the docs.

    opened by BoyanH 15
  • vote_entropy

    I guess the vote_entropy and KL_Divergence are not being returned correctly, and all values are zero. Also, if I am doing it wrong, can you suggest a code snippet showing how to use KL_Divergence or vote_entropy instead of consensus entropy for querying points when using query by committee?
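For reference, vote entropy can be computed by hand from the committee's hard votes. The sketch below is a NumPy-only illustration with made-up votes; `vote_entropy_sketch` is a hypothetical helper, not the library's function.

```python
import numpy as np

def vote_entropy_sketch(votes, n_classes):
    """Vote entropy per instance.

    votes: integer array of shape (n_instances, n_learners), each entry a predicted class.
    """
    votes = np.asarray(votes)
    n_learners = votes.shape[1]
    # Fraction of committee members voting for each class, per instance.
    fractions = np.stack(
        [(votes == c).sum(axis=1) / n_learners for c in range(n_classes)], axis=1
    )
    log_f = np.log(fractions, where=fractions > 0, out=np.zeros_like(fractions))
    return -(fractions * log_f).sum(axis=1)

votes = np.array([
    [0, 0, 0],   # unanimous
    [0, 1, 0],   # one dissenting vote
    [0, 1, 2],   # complete disagreement
])
entropy = vote_entropy_sketch(votes, n_classes=3)
query_idx = int(np.argmax(entropy))
```

Note that the value is exactly zero whenever all committee members agree on an instance, so a committee of near-identical learners would produce zeros everywhere. That is one possible explanation worth checking, not a diagnosis of this report.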

    opened by srivastavapravesh14-zz 10
  • cold start handling in ranked batch sampling

    Hi!

    The behavior of cold start handling in ranked batch sampling seems different from that in Cardoso et al.'s "Ranked batch-mode active learning".

    https://github.com/modAL-python/modAL/blob/452898fc181b6d4ae6399dfdcb311ceb952c8486/modAL/batch.py#L133-L139

    In modAL's implementation, in the case of cold start, the instance selected by select_cold_start_instance is not added to the instance list instance_index_ranking. While in "Ranked batch-mode active learning", the instance selected by select_cold_start_instance seems to be the first item in instance_index_ranking.

    https://github.com/modAL-python/modAL/blob/452898fc181b6d4ae6399dfdcb311ceb952c8486/modAL/batch.py#L46

    If my understanding of the algorithm proposed in the paper and modAL's implementation is correct, we can change the return of select_cold_start_instance to return best_coldstart_instance_index, X[best_coldstart_instance_index].reshape(1, -1), store best_coldstart_instance_index in instance_index_ranking, and revise ranked_batch correspondingly.

    opened by zhangyu94 10
  • Support batch-mode queries?

    Hi,

    I've run into a bit of a use-case that I'm not sure is quite supported by modAL – nor the broader libraries for active learning – but would be relatively simple to implement. After reviewing modAL's internals a bit, I don't think it officially supports active learning with batch-mode queries.

    The sampling strategies (for example, uncertainty sampling) do support the n_instances parameter, but from what I can tell, uncertainty sampling may return redundant/sub-optimal queries if we return more than one instance from the unlabeled set. This is a bit prohibitive in settings where we'd like to ask an active learner to return multiple (if not all) examples from the unlabeled set/pool, and the computational cost for re-training an active learning model goes without saying.
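The redundancy concern can be illustrated with a small NumPy sketch. The pool, the uncertainty scores and the helper `top_n_uncertain` are all made up for this example; it only shows what a plain "take the n most uncertain" selection does and is not modAL's code.

```python
import numpy as np

def top_n_uncertain(uncertainty, n_instances):
    """Indices of the n_instances largest uncertainty scores, most uncertain first."""
    uncertainty = np.asarray(uncertainty)
    # argpartition finds the top n without fully sorting the array.
    top = np.argpartition(-uncertainty, n_instances - 1)[:n_instances]
    return top[np.argsort(-uncertainty[top])]

# Hypothetical pool: instances 0 and 1 are near-duplicates with almost the same score.
X_pool = np.array([[0.50, 0.50], [0.51, 0.50], [3.00, 1.00], [-2.00, 0.00]])
uncertainty = np.array([0.49, 0.48, 0.30, 0.05])

batch_idx = top_n_uncertain(uncertainty, n_instances=2)
```

Here the batch consists of the two near-duplicate points, so the second label adds little beyond the first. Batch-mode methods try to avoid this by also accounting for similarity between the selected instances.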

    I found requests for batch-mode support in the popular libact library (issues #57 and #89) but, to the best of my knowledge, I'm not sure they were addressed in any of their PRs.

    In that case, does it make sense to implement something like Ranked batch-mode active learning by Cardoso et al.? I took a crack at it this weekend for a better personal understanding, but if it's worth integrating and supporting in modAL I'm happy to polish it and talk it through in a PR.

    Thanks!

    opened by dataframing 10
  • Pytorch runnable example

    This is a runnable example of modAL using PyTorch models wrapped with skorch. It is very similar to the one in modAL/examples/keras_integration.py.

    opened by damienlancry 9
  • use different query strategies

    I am using Keras/TensorFlow models with this framework and the ActiveLearner class. As soon as I try to change the query strategy, different errors occur.

    learner = ActiveLearner(
        estimator=classifier,
        query_strategy=expected_error_reduction,
        X_training=x_initial_training,
        y_training=y_initial_training,
    )
    prescore = learner.score(x_test, y_test)
    n_queries = 50
    postscore = np.zeros(shape=(n_queries, 1))
    for idx in range(n_queries):
        print('Query no. %d' % (idx + 1))
        query_idx, query_instance = learner.query(x_pool)
        learner.teach(
            X=x_pool[query_idx],
            y=y_pool[query_idx],
            only_new=True,
            epochs=10,
            validation_data=(x_val, y_val),
        )
        # remove queried instances from pool
        x_pool = np.delete(x_pool, query_idx, axis=0)
        y_pool = np.delete(y_pool, query_idx, axis=0)
        postscore[idx, 0] = learner.score(x_test, y_test)
    

    What do I have to change to use the different strategies? The training input has a 3D shape. So far I have tried all the uncertainty methods, of which only the default selection worked. Now I am trying the expected_error_reduction strategy, but errors occur there as well.

    I am afraid the 3D shape of the training data is breaking all the other algorithms, but for an LSTM this kind of shape is required.

    opened by alexv1247 9
  • docs: refactor documentation

    Autoconversion of docstrings with pyment doesn't work well, because the initial format was not following a strict standard. So there are a lot of manual corrections. I have chosen Google style for docstring, however conversion from it to NumPy style with pyment could be easier.

    The first half of modAL.models looks good, but there may be some improvements (further deduplication) in coming days. Review and comments on committed parts could help to finish the whole refactoring (I hope, by the weekend).

    opened by nikolay-bushkov 9
  • DBAL with Image Data implementation using modAL

    I created an example script trying to reproduce the results of Deep Bayesian Active Learning with Image Data using modAL. I used this Keras code from one of the authors. I cannot think of anything I am doing differently, and yet their code works and mine does not. For the acquisition function, instead of using their modified Keras, I used Yarin Gal's implementation (first author). Can you spot any mistake in my code? EDIT: I actually found a mistake in my code: I was not really computing the entropy but rather the other half of the BALD function. I fixed this mistake and am currently running the code. EDIT2: Still not working

    opened by damienlancry 8
  • Entropy sampling query strategy unstable

    I'm using the entropy sampling strategy to select samples for RandomForest classification with 7 classes. However, when I run my query with entropy sampling (I also tried uncertainty sampling), I get a different result every time. The selected samples are never the same, even though I have not changed my input data.

    Thank you in advance for your help.

    opened by YousraH 8
  • about learner.teach

    It seems that each time we run learner.teach, the model is fit to the initial data plus the new data from scratch, just like an untrained new model. Can the model instead learn only the new data, starting from the weights already trained on the initial data?

    opened by luxu1220 7
  • Using RandomForestClassifier on vectors for predicting labels gives "Found input variables with inconsistent numbers of samples"

    I am learning from Active Regression tutorial page but it has not taken up the case of applying learners to more than one dimension vectors ( I was not able to find a specific example in the doc for this, so please point if you know one ).

    In the function named my_stuff

    My learner is

    regressor = ActiveLearner(
            estimator=RandomForestClassifier(),
            query_strategy=entropy_sampling,
            X_training=X_training, y_training=y_training.ravel()
        )
    
    

    My dataset X is (13084, 50) ( meaning 13084 vectors each having 50 length ) and y is (13084, 1) ( similar meaning ).

    Here X_training is (5, 50) and y_training is (5, 1). In this section of the code( taken blatantly from the tutorial page mentioned above ):

    for idx in range(n_queries):
            query_idx, query_instance = regressor.query(X)
            print(query_idx, 'query_idx', X_training.shape, y_training.shape)
            regressor.teach(X[query_idx].reshape(-1, 1), y[query_idx].reshape(-1, 1))
    

    The program ended abruptly, so upon using python debugger I found the error:

    ValueError: Found input variables with inconsistent numbers of samples: [50, 1]
    > /path/to/file/predict.py(286)my_stuff()
    -> regressor.teach(X[query_idx].reshape(-1, 1), y[query_idx].reshape(-1, 1))
    
    

    Here X[query_idx].reshape(-1, 1) has shape (50, 1) and y[query_idx].reshape(-1, 1) has shape (1, 1).

    What would be the correct procedure for the teach procedure?
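The shapes reported in the error can be reproduced with NumPy alone. The sketch below uses zero-filled arrays of the sizes given in this report and assumes that the query returns an array holding a single index; the index value 42 is arbitrary.

```python
import numpy as np

X = np.zeros((13084, 50))
y = np.zeros((13084, 1))
query_idx = np.array([42])   # assumed: the query returns an array with one index

shape_selected = X[query_idx].shape                  # (1, 50): one sample, 50 features
shape_column = X[query_idx].reshape(-1, 1).shape     # (50, 1): read as 50 samples of one feature
shape_row = X[query_idx].reshape(1, -1).shape        # (1, 50): one sample again
shape_label = y[query_idx].ravel().shape             # (1,): one label
```

Under that assumption, passing X[query_idx] (or its reshape(1, -1) form) together with a one-element label array keeps the number of samples consistent, whereas reshape(-1, 1) turns the 50 features into 50 samples, which matches the [50, 1] in the error message.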

    opened by berserker1 6
  • Which sampling method is best for very unbalanced data?

    Hi!

    I am wondering, which of the implemented sampling strategies handles unbalanced data best? I believe if I get the top 10000 uncertain data instances, but 99 % are in the same class, this would not help much for the next training process iteration, right?

    Thank you in advance!

    opened by vandreslime 0
  • Can I use modAL with estimators from other libraries than scikit-learn like xgboost?

    Hi there,

    I have already trained some good working estimators (xgboost, catboost & lightgbm). I would like to add an active learner, because we need to decide which data to label continuously.

    The documentation says, that I need to use a scikit-learn estimator object. Does that mean I can't use the models from xgboost, catboost & lightgbm? I used the models from the libraries with the same names.

    And another question (for my understanding). Do I give an estimator that is already trained, or does the active learner train a model from scratch?

    I am new to the field of active learning, so thank you very much!

    opened by vandreslime 0
  • Proof of concept for allowing non-sklearn estimators

    Not sure if there is any desire for this feature, but in this PR I have sketched out a way to use virtually any estimator type with the ActiveLearner and BayesianOptimizer classes.

    Motivation

    Allow us to use other training and inference facilities, such as HuggingFace models that are trained using the Trainer class, use AWS SageMaker Estimators, etc. With this added flexibility, the training and inference does not need to even run on the same hardware as the modAL code. This brings the suite of sampling methods here to many new applications, particularly resource-intensive deep learning models that typically don't fit that great under the sklearn interface.

    Implementation

    Rather than call the classic sklearn estimator functions such as fit, predict, predict_proba, and score, this PR adds a layer of callables that can be overridden: fit_func, predict_func, predict_proba_func, and score_func.

        def __init__(self,
                     estimator: BaseEstimator,
                     query_strategy: Callable = uncertainty_sampling,
                     X_training: Optional[modALinput] = None,
                     y_training: Optional[modALinput] = None,
                     bootstrap_init: bool = False,
                     on_transformed: bool = False,
                     force_all_finite: bool = True,
                     fit_func: FitFunction = SKLearnFitFunction(),
                     predict_func: PredictFunction = SKLearnPredictFunction(),
                     predict_proba_func: PredictProbaFunction = SKLearnPredictProbaFunction(),
                     score_func: ScoreFunction = SKLearnScoreFunction(),
                     **fit_kwargs
                     ) -> None:
    

    I added SKLearn implementations of each by default (included their corresponding Protocol classes as well). Here's how fit works:

    class FitFunction(Protocol):
        def __call__(self, estimator: GenericEstimator, X, y, **kwargs) -> GenericEstimator:
            raise NotImplementedError
    # ...
    class SKLearnFitFunction(FitFunction):
        def __call__(self, estimator: BaseEstimator, X, y, **kwargs) -> BaseEstimator:
            return estimator.fit(X=X, y=y, **kwargs)
    

    I'll also note that the changes in this PR don't break any of the existing tests.
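The callable-layer idea can be tried out in isolation with the pure-Python sketch below. Everything in it apart from the FitFunction name is hypothetical: `MeanEstimator`, `MeanFitFunction` and `fit_with` are toy stand-ins, and `fit_with` only imitates how a learner might delegate fitting, not how the PR wires it up.

```python
from typing import Any, Protocol

class FitFunction(Protocol):
    def __call__(self, estimator: Any, X: Any, y: Any, **kwargs: Any) -> Any: ...

class MeanEstimator:
    """Hypothetical non-sklearn estimator: predicts the mean of the labels it has seen."""
    def __init__(self) -> None:
        self.mean_ = None

    def train(self, values):
        self.mean_ = sum(values) / len(values)
        return self

class MeanFitFunction:
    """Adapter that satisfies FitFunction by calling the estimator's own train method."""
    def __call__(self, estimator, X, y, **kwargs):
        return estimator.train(y)

def fit_with(fit_func: FitFunction, estimator, X, y, **fit_kwargs):
    """What a learner might do internally: delegate fitting to the supplied callable."""
    return fit_func(estimator, X, y, **fit_kwargs)

fitted = fit_with(MeanFitFunction(), MeanEstimator(), X=[[1], [2], [3]], y=[2.0, 4.0, 6.0])
print(fitted.mean_)  # 4.0
```

Because Protocol relies on structural typing, the adapter does not need to inherit from FitFunction; any callable with a compatible signature will do.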

    Usage

    When using SageMaker, we might implement fit and predict_proba in this manner:

    class CustomEstimator:
        hf_predictor: Union[HuggingFacePredictor, Predictor]
        hf_estimator: HuggingFace
    
        def __init__(self, hf_predictor: HuggingFacePredictor, hf_estimator: HuggingFace):
            self.hf_predictor = hf_predictor
            self.hf_estimator = hf_estimator
    
    class CustomFitFunction(FitFunction):
        def __call__(self, estimator: CustomEstimator, X, y, **kwargs) -> CustomEstimator:
            # notice we don't use `y` -- the label is baked into the HuggingFace Dataset
            return estimator.hf_estimator.fit(X=X, **kwargs)
    
    class CustomPredictProbaFunction(PredictProbaFunction):
        @staticmethod
        def hf_prediction_to_proba(predictions: Union[List[Dict], object],
                                   positive_class_label: str = 'LABEL_1',
                                   negative_class_label: str = 'LABEL_0') -> np.array:
            label_key: str = 'label'
            score_key: str = 'score'
            p = []
            for prediction in predictions:
                if positive_class_label == prediction[label_key]:
                    score = prediction[score_key]
                    p.append([score, 1.0 - score])
                if negative_class_label == prediction[label_key]:
                    score = prediction[score_key]
                    p.append([1.0 - score, score])
            return np.array(p)
    
        def __call__(self, estimator: CustomEstimator, X, **kwargs) -> np.array:
            return self.hf_prediction_to_proba(
                predictions=estimator.hf_predictor.predict(dict(inputs=X))
            )
    
    estimator = CustomEstimator(hf_predictor=hf_predictor, hf_estimator=hf_estimator)
    
    learner = ActiveLearner(
        estimator=estimator,
        fit_func=CustomFitFunction(),
        predict_proba_func=CustomPredictProbaFunction(),
        X_training=train_dataset # standard HuggingFace Dataset instead of your typical types for `X` in `sklearn`
    )
    

    If you've made it this far, I'd ask that you forgive the clunkiness. This was a rough sketch of an idea I wanted to get written down before I forgot it. Anyways, would love some feedback, and if you think this PR is worth finishing, let me know. I can say for me, this would unlock a lot of really useful applications.

    opened by adelevie 2
  • TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

    trying to run this notebook https://www.kaggle.com/code/kmader/active-learning-optimization-improvement/notebook

    getting an error in learner.teach step (and also in pd.concat(seq_iter.compute()) step):

    # initializing the learner
    from modAL.models import ActiveLearner
    initial_df = all_papaya_samples_df.sample(20, random_state=2018)
    learner = ActiveLearner(
        estimator=SVC(kernel = 'rbf', probability=True, random_state = 2018),
        X_training=initial_df[['firmness', 'redness']], 
        y_training=initial_df['tastiness']
    )
    # query for labels
    X_pool = all_papaya_samples_df[['firmness', 'redness']].values
    y_pool = all_papaya_samples_df['tastiness'].values
    query_idx, query_inst = learner.query(X_pool)
    query_idx, query_inst
    fig, m_axs = plt.subplots(2, 3, figsize = (12, 12))
    last_pts = initial_df.shape[0]
    queried_pts = []
    
    for c_ax, c_pts in zip(m_axs.flatten(), np.linspace(20, 350, 6).astype(int)):
        for _ in range(c_pts-last_pts):
            query_idx, _ = learner.query(X_pool)
            queried_pts += [query_idx]
            learner.teach(X_pool[query_idx], y_pool[query_idx])
        last_pts = c_pts
        fit_and_show_model(learner, 
                           None, 
                           title_str = 'Sampled: {}'.format(c_pts),
                           ax = c_ax,
                           fit_model = False
                          )
    
    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_28372/2173050794.py in <module>
          6         query_idx, _ = learner.query(X_pool)
          7         queried_pts += [query_idx]
    ----> 8         learner.teach(X_pool[query_idx], y_pool[query_idx])
          9     last_pts = c_pts
         10     fit_and_show_model(learner, 
    
    /opt/conda/lib/python3.7/site-packages/modAL/models/learners.py in teach(self, X, y, bootstrap, only_new, **fit_kwargs)
         96             **fit_kwargs: Keyword arguments to be passed to the fit method of the predictor.
         97         """
    ---> 98         self._add_training_data(X, y)
         99         if not only_new:
        100             self._fit_to_known(bootstrap=bootstrap, **fit_kwargs)
    
    /opt/conda/lib/python3.7/site-packages/modAL/models/base.py in _add_training_data(self, X, y)
         94         else:
         95             try:
    ---> 96                 self.X_training = data_vstack((self.X_training, X))
         97                 self.y_training = data_vstack((self.y_training, y))
         98             except ValueError:
    
    /opt/conda/lib/python3.7/site-packages/modAL/utils/data.py in data_vstack(blocks)
         22         return sp.vstack(blocks)
         23     elif isinstance(blocks[0], pd.DataFrame):
    ---> 24         return blocks[0].append(blocks[1:])
         25     elif isinstance(blocks[0], np.ndarray):
         26         return np.concatenate(blocks)
    
    /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in append(self, other, ignore_index, verify_integrity, sort)
       8967                 ignore_index=ignore_index,
       8968                 verify_integrity=verify_integrity,
    -> 8969                 sort=sort,
       8970             )
       8971         ).__finalize__(self, method="append")
    
    /opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        309                     stacklevel=stacklevel,
        310                 )
    --> 311             return func(*args, **kwargs)
        312 
        313         return wrapper
    
    /opt/conda/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
        302         verify_integrity=verify_integrity,
        303         copy=copy,
    --> 304         sort=sort,
        305     )
        306 
    
    /opt/conda/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
        382                     "only Series and DataFrame objs are valid"
        383                 )
    --> 384                 raise TypeError(msg)
        385 
        386             ndims.add(obj.ndim)
    
    TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
    
    opened by akamil-etsy 0
  • AttributeError: bootstrap_init

    I am trying to apply the package for sklearn RandomForestClassifier like this:

    learner = ActiveLearner(
        estimator=RandomForestClassifier(),
        query_strategy=modAL.uncertainty.uncertainty_sampling,
        X_training=X_train0, y_training=y_train
    )
    
    learner
    

    Then the following error appears:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/core/formatters.py:973, in MimeBundleFormatter.__call__(self, obj, include, exclude)
        970     method = get_real_method(obj, self.print_method)
        972     if method is not None:
    --> 973         return method(include=include, exclude=exclude)
        974     return None
        975 else:
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:614, in BaseEstimator._repr_mimebundle_(self, **kwargs)
        612 def _repr_mimebundle_(self, **kwargs):
        613     """Mime bundle used by jupyter kernels to display estimator"""
    --> 614     output = {"text/plain": repr(self)}
        615     if get_config()["display"] == "diagram":
        616         output["text/html"] = estimator_html_repr(self)
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:279, in BaseEstimator.__repr__(self, N_CHAR_MAX)
        271 # use ellipsis for sequences with a lot of elements
        272 pp = _EstimatorPrettyPrinter(
        273     compact=True,
        274     indent=1,
        275     indent_at_name=True,
        276     n_max_elements_to_show=N_MAX_ELEMENTS_TO_SHOW,
        277 )
    --> 279 repr_ = pp.pformat(self)
        281 # Use bruteforce ellipsis when there are a lot of non-blank characters
        282 n_nonblank = len("".join(repr_.split()))
    
    File ~/tensorflow-test/env/lib/python3.8/pprint.py:153, in PrettyPrinter.pformat(self, object)
        151 def pformat(self, object):
        152     sio = _StringIO()
    --> 153     self._format(object, sio, 0, 0, {}, 0)
        154     return sio.getvalue()
    
    File ~/tensorflow-test/env/lib/python3.8/pprint.py:170, in PrettyPrinter._format(self, object, stream, indent, allowance, context, level)
        168     self._readable = False
        169     return
    --> 170 rep = self._repr(object, context, level)
        171 max_width = self._width - indent - allowance
        172 if len(rep) > max_width:
    
    File ~/tensorflow-test/env/lib/python3.8/pprint.py:404, in PrettyPrinter._repr(self, object, context, level)
        403 def _repr(self, object, context, level):
    --> 404     repr, readable, recursive = self.format(object, context.copy(),
        405                                             self._depth, level)
        406     if not readable:
        407         self._readable = False
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:189, in _EstimatorPrettyPrinter.format(self, object, context, maxlevels, level)
        188 def format(self, object, context, maxlevels, level):
    --> 189     return _safe_repr(
        190         object, context, maxlevels, level, changed_only=self._changed_only
        191     )
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:440, in _safe_repr(object, context, maxlevels, level, changed_only)
        438 recursive = False
        439 if changed_only:
    --> 440     params = _changed_params(object)
        441 else:
        442     params = object.get_params(deep=False)
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:93, in _changed_params(estimator)
         89 def _changed_params(estimator):
         90     """Return dict (param_name: value) of parameters that were given to
         91     estimator with non-default values."""
    ---> 93     params = estimator.get_params(deep=False)
         94     init_func = getattr(estimator.__init__, "deprecated_original", estimator.__init__)
         95     init_params = inspect.signature(init_func).parameters
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:210, in BaseEstimator.get_params(self, deep)
        208 out = dict()
        209 for key in self._get_param_names():
    --> 210     value = getattr(self, key)
        211     if deep and hasattr(value, "get_params"):
        212         deep_items = value.get_params().items()
    
    AttributeError: 'ActiveLearner' object has no attribute 'bootstrap_init'
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/core/formatters.py:707, in PlainTextFormatter.__call__(self, obj)
        700 stream = StringIO()
        701 printer = pretty.RepresentationPrinter(stream, self.verbose,
        702     self.max_width, self.newline,
        703     max_seq_length=self.max_seq_length,
        704     singleton_pprinters=self.singleton_printers,
        705     type_pprinters=self.type_printers,
        706     deferred_pprinters=self.deferred_printers)
    --> 707 printer.pretty(obj)
        708 printer.flush()
        709 return stream.getvalue()
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
        407                         return meth(obj, self, cycle)
        408                 if cls is not object \
        409                         and callable(cls.__dict__.get('__repr__')):
    --> 410                     return _repr_pprint(obj, self, cycle)
        412     return _default_pprint(obj, self, cycle)
        413 finally:
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
        776 """A pprint that just redirects to the normal repr function."""
        777 # Find newlines and replace them with p.break_()
    --> 778 output = repr(obj)
        779 lines = output.splitlines()
        780 with p.group():
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:279, in BaseEstimator.__repr__(self, N_CHAR_MAX)
        271 # use ellipsis for sequences with a lot of elements
        272 pp = _EstimatorPrettyPrinter(
        273     compact=True,
        274     indent=1,
        275     indent_at_name=True,
        276     n_max_elements_to_show=N_MAX_ELEMENTS_TO_SHOW,
        277 )
    --> 279 repr_ = pp.pformat(self)
        281 # Use bruteforce ellipsis when there are a lot of non-blank characters
        282 n_nonblank = len("".join(repr_.split()))
    
    File ~/tensorflow-test/env/lib/python3.8/pprint.py:153, in PrettyPrinter.pformat(self, object)
        151 def pformat(self, object):
        152     sio = _StringIO()
    --> 153     self._format(object, sio, 0, 0, {}, 0)
        154     return sio.getvalue()
    
    File ~/tensorflow-test/env/lib/python3.8/pprint.py:170, in PrettyPrinter._format(self, object, stream, indent, allowance, context, level)
        168     self._readable = False
        169     return
    --> 170 rep = self._repr(object, context, level)
        171 max_width = self._width - indent - allowance
        172 if len(rep) > max_width:
    
    File ~/tensorflow-test/env/lib/python3.8/pprint.py:404, in PrettyPrinter._repr(self, object, context, level)
        403 def _repr(self, object, context, level):
    --> 404     repr, readable, recursive = self.format(object, context.copy(),
        405                                             self._depth, level)
        406     if not readable:
        407         self._readable = False
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:189, in _EstimatorPrettyPrinter.format(self, object, context, maxlevels, level)
        188 def format(self, object, context, maxlevels, level):
    --> 189     return _safe_repr(
        190         object, context, maxlevels, level, changed_only=self._changed_only
        191     )
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:440, in _safe_repr(object, context, maxlevels, level, changed_only)
        438 recursive = False
        439 if changed_only:
    --> 440     params = _changed_params(object)
        441 else:
        442     params = object.get_params(deep=False)
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:93, in _changed_params(estimator)
         89 def _changed_params(estimator):
         90     """Return dict (param_name: value) of parameters that were given to
         91     estimator with non-default values."""
    ---> 93     params = estimator.get_params(deep=False)
         94     init_func = getattr(estimator.__init__, "deprecated_original", estimator.__init__)
         95     init_params = inspect.signature(init_func).parameters
    
    File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:210, in BaseEstimator.get_params(self, deep)
        208 out = dict()
        209 for key in self._get_param_names():
    --> 210     value = getattr(self, key)
        211     if deep and hasattr(value, "get_params"):
        212         deep_items = value.get_params().items()
    
    AttributeError: 'ActiveLearner' object has no attribute 'bootstrap_init'
    
    
    

    I have to run it with Python 3.8 because I am using TensorFlow on a Mac with the M1 chip, which still has some dependency issues. Otherwise, nothing differs from the usual way I feed in the RF model (the data formats are correct). Any idea why it is calling this attribute?
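
The last frames of the traceback show `get_params` reading the constructor's parameter names and calling `getattr(self, key)` for each one. One plausible reading, which is an assumption and not confirmed in this thread, is that the constructor accepts `bootstrap_init` but never stores it on the instance. The sketch below reproduces that mechanism in isolation; `MiniBaseEstimator`, `BrokenLearner` and `FixedLearner` are hypothetical stand-ins, not scikit-learn or modAL classes.

```python
import inspect


class MiniBaseEstimator:
    """Hypothetical stand-in for the get_params logic shown in the traceback."""

    def get_params(self):
        # Parameter names are read from the __init__ signature ...
        names = [
            name
            for name in inspect.signature(type(self).__init__).parameters
            if name != "self"
        ]
        # ... and each one is then looked up as an attribute of the instance.
        return {name: getattr(self, name) for name in names}


class BrokenLearner(MiniBaseEstimator):
    def __init__(self, estimator, bootstrap_init=False):
        self.estimator = estimator
        # bootstrap_init is accepted here but never stored on self.


class FixedLearner(MiniBaseEstimator):
    def __init__(self, estimator, bootstrap_init=False):
        self.estimator = estimator
        self.bootstrap_init = bootstrap_init  # stored, so getattr succeeds


try:
    BrokenLearner("rf").get_params()
except AttributeError as err:
    print(err)  # the message names the missing 'bootstrap_init' attribute

print(FixedLearner("rf").get_params())
```

If this reading is right, the error comes from printing the learner (which triggers `__repr__`), not from feeding in the model or the data.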

    opened by luisignaciomenendez 1
  • decision_function instead of predict_proba

    Several non-probabilistic estimators, SVMs in particular, can be used with uncertainty sampling. Scikit-learn estimators that support the decision_function method can be used with the closest-to-hyperplane selection algorithm [Bloodgood]. This is a very popular strategy in AL research and would be very easy to implement.
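
A minimal sketch of what such a closest-to-hyperplane strategy could look like for the binary case. It assumes `decision_function` returns one signed score per instance, with 0 on the boundary. `closest_to_hyperplane` and `ToyLinearClassifier` are hypothetical names used for illustration only; they are not part of modAL or scikit-learn.

```python
def closest_to_hyperplane(classifier, X_pool, n_instances=1):
    """Return indices of the pool instances nearest the decision boundary.

    Binary case only: one signed score per instance, 0 on the hyperplane.
    """
    scores = classifier.decision_function(X_pool)
    # Smaller |score| means closer to the boundary, i.e. more uncertain.
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return ranked[:n_instances]


class ToyLinearClassifier:
    """Hypothetical stand-in exposing a scikit-learn style decision_function."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def decision_function(self, X):
        return [
            sum(w * x for w, x in zip(self.weights, row)) + self.bias
            for row in X
        ]


clf = ToyLinearClassifier(weights=[1.0, -1.0], bias=0.0)
X_pool = [[3.0, 0.0], [1.0, 0.9], [0.0, 2.0], [0.5, 0.7]]
print(closest_to_hyperplane(clf, X_pool, n_instances=2))  # prints [1, 3]
```

Multiclass estimators return one score per class, so they would need a different reduction (for example, the gap between the two largest scores); that case is not covered here.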

    opened by lkurlandski 5
Releases(0.4.1)
  • 0.4.1(Jan 7, 2021)

  • 0.4.0(Nov 1, 2020)

    Release notes

    modAL 0.4.0 is finally here! This new release is made possible by the contributions of @BoyanH, @damienlancry, and @OskarLiew, many thanks to them!

    New features

    • pandas.DataFrame support, thanks to @BoyanH! This was a frequently requested feature which I was unable to properly implement, but @BoyanH has found a solution for this in #105.
    • Support for scikit-learn pipelines, also by @BoyanH. Now learners support querying on the transformed data by setting on_transformed=True upon initialization.

    Changes

    • Query strategies should no longer return the selected instances, only the indices for the queried objects. (See #104 by @BoyanH.)
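
As an illustration of the change above, a custom query strategy written for this convention returns indices only. The sketch below uses a hypothetical random sampler on plain Python lists; how the learner consumes the indices internally is not shown.

```python
import random


def random_sampling(classifier, X_pool, n_instances=1, rng=None):
    """Custom query strategy sketch: returns indices only, not the instances."""
    rng = rng or random.Random()
    return rng.sample(range(len(X_pool)), n_instances)


X_pool = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
query_idx = random_sampling(classifier=None, X_pool=X_pool, n_instances=2)

# The caller looks up the instances from the returned indices when needed.
queried = [X_pool[i] for i in query_idx]
print(query_idx, queried)
```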

    Fixes

    • Committee now sets classes when fitting, which solves the error that occurred when no training data was provided during initialization. This fix was contributed in #100 by @OskarLiew, thanks for that!
    • Some typos in the ranked batch mode sampling example, fixed by @damienlancry.
  • 0.3.6(Aug 21, 2020)

  • 0.3.5(Nov 11, 2019)

    Changes

    • ActiveLearner now supports np.nan and np.inf in the data by setting force_all_finite=False upon initialization. #58
    • Bayesian optimization fixed for multidimensional functions.
    • Calls to check_X_y no longer convert between datatypes. #49
    • Expected error reduction implementation error fixed. #45
    • modAL.utils.data_vstack now falls back to numpy.concatenate if possible.
    • Multidimensional data for ranked batch sampling and expected error reduction fixed. #41

    Fixes by @zhangyu94:

    • modAL.selection.shuffled_argmax #32
    • Cold start instance in modAL.batch.ranked_batch fixed. #30
    • Best instance index in modAL.batch.select_instance fixed. #29
  • 0.3.4(Dec 5, 2018)

    New features

    • To handle the case when the maximum utility score is not unique, a random tie break option was introduced. From this version on, passing random_tie_break=True to the query strategies first shuffles the pool and then uses a stable sort to find the instances to query. When the maximum utility score is not unique, this is equivalent to randomly sampling from the top-scoring instances.
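
The shuffle-then-stable-sort idea can be sketched in a few lines. This is an illustration under stated assumptions, not modAL's implementation: `shuffled_top_n` is a hypothetical name, and it relies on Python's `sorted` being stable.

```python
import random


def shuffled_top_n(values, n_instances=1, rng=None):
    """Pick the top-scoring indices, breaking ties uniformly at random."""
    rng = rng or random.Random()
    order = list(range(len(values)))
    rng.shuffle(order)  # randomise the relative order of all indices
    # sorted() is stable (also with reverse=True), so indices with equal
    # scores keep their shuffled relative order.
    ranked = sorted(order, key=lambda i: values[i], reverse=True)
    return ranked[:n_instances]


utility = [0.2, 0.9, 0.9, 0.1]  # indices 1 and 2 tie for the maximum
print(shuffled_top_n(utility))  # either [1] or [2], chosen at random
```

Without the shuffle, a stable sort would always return the first tied index, which biases queries towards the start of the pool.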

    Changes

    • modAL.expected_error.expected_error_reduction runtime improved by omitting unnecessary cloning of the estimator for every instance in the pool.
  • 0.3.3(Nov 30, 2018)

  • 0.3.2(Nov 26, 2018)

  • 0.3.1(Oct 2, 2018)

    Release notes

    The new release of modAL is here! This is a milestone in its evolution, because it has just received its first contributions from the open source community! :) Thanks to @dataframing and @nikolay-bushkov for their work! Hoping to see many more contributions from the community, because modAL still has a long way to go! :)

    New features

    • Ranked batch mode queries by @dataframing. With this query strategy, several instances can be queried for labeling, which alleviates a lot of problems in uncertainty sampling. For details, see Ranked batch mode learning by Cardoso et al.
    • Sparse matrix support by @nikolay-bushkov. From now on, if the estimator can handle sparse matrices, you can use them to fit the active learning models!
    • Cold start support has been added to all the models. This means that now learner.query() can be used without training the model first.

    Changes

    • The documentation has undergone a major refactoring thanks to @nikolay-bushkov! Type annotations have been added and the docstrings were refactored to follow the Google docstring style. The website has been changed accordingly. Instead of GitHub Pages, ReadTheDocs is used, and the old website is merged with the API reference. Regarding the examples, Jupyter notebooks were added by @dataframing. For details, check it out at https://modAL-python.github.io/!
    • .query() methods changed for BaseLearner and BaseCommittee to allow more general arguments for query strategies. Now it can accept any argument as long as the query_strategy function supports it.
    • .score() method was added for Committee. Fixes #6.
    • The modAL.density module was refactored using functions from sklearn.metrics.pairwise. This resulted in a major increase in performance as well as a more sustainable codebase for the module.

    Bugfixes

    • 1D array handling issues fixed, numpy.vstack calls replaced with numpy.concatenate. Fixes #15.
    • np.sum(generator) calls were replaced with np.sum(np.fromiter(generator)) because of the deprecation of the former.
  • 0.3.0(Apr 25, 2018)

    Release notes

    New features

    • Bayesian optimization. Bayesian optimization is a method for optimizing black box functions for which evaluation may be expensive and derivatives may not be available. It uses a query loop very similar to active learning, which makes it possible to implement it using an API identical to the ActiveLearner. Values are sampled by strategies that estimate the possible gain for each point. Three such strategies are currently implemented: probability of improvement, expected improvement and upper confidence bounds.
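
For reference, the three strategies named above are commonly defined as follows for a maximization problem with a Gaussian posterior. This is a textbook-style sketch, not modAL's code: the function names, the `tradeoff` and `beta` parameters, and the assumption `std > 0` are all choices made for this illustration.

```python
import math


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best, tradeoff=0.0):
    # Probability that the point beats the best value observed so far.
    return _cdf((mean - best - tradeoff) / std)


def expected_improvement(mean, std, best, tradeoff=0.0):
    # Expected amount by which the point beats the best value so far.
    z = (mean - best - tradeoff) / std
    return (mean - best - tradeoff) * _cdf(z) + std * _pdf(z)


def upper_confidence_bound(mean, std, beta=1.0):
    # Optimistic estimate: posterior mean plus a multiple of the uncertainty.
    return mean + beta * std


# Hypothetical posterior (mean, std) at three candidate points.
candidates = [(0.8, 0.05), (0.6, 0.5), (0.95, 0.01)]
best_so_far = 0.9
scores = [expected_improvement(m, s, best_so_far) for m, s in candidates]
next_idx = max(range(len(scores)), key=lambda i: scores[i])
print(next_idx)  # the uncertain second candidate wins under EI
```

The example shows why these measures differ from simply picking the highest mean: the second candidate has the lowest mean but the largest uncertainty, so its expected improvement is highest.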

    Changes

    • modAL.models.BaseLearner abstract base class implemented. ActiveLearner and BayesianOptimizer both inherit from it.
    • modAL.models.ActiveLearner.query() now passes the ActiveLearner object to the query function instead of just the estimator.

    Fixes

    • modAL.utils.selection.multi_argmax() now works for arrays with shape (-1, ) as well as (-1, 1).
  • 0.2.1(Apr 18, 2018)

    Release notes

    New features

    • modAL.utils.combination.make_query_strategy function factory to make the implementation of custom query strategies easier.
    • ActiveLearner and Committee models can be fitted using new data only by passing only_new=True to their .teach() methods. This is useful when working with models where the fitting does not occur from scratch, for instance tensorflow or keras models.
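
The function-factory idea behind custom query strategies can be sketched as composing a utility measure with a selector. The names below (`compose_query_strategy`, `least_confidence`, `top_n`) are hypothetical and chosen for this illustration; the sketch does not reproduce the signature or return value of modAL's own factory.

```python
def compose_query_strategy(utility_measure, selector):
    """Build a query strategy from a utility measure and a selector."""

    def query_strategy(classifier, X, **selector_kwargs):
        utility = utility_measure(classifier, X)
        return selector(utility, **selector_kwargs)

    return query_strategy


def least_confidence(classifier, X):
    # Here the classifier is any callable returning P(positive) for one x.
    return [1.0 - max(p, 1.0 - p) for p in (classifier(x) for x in X)]


def top_n(utility, n_instances=1):
    ranked = sorted(range(len(utility)), key=lambda i: utility[i], reverse=True)
    return ranked[:n_instances]


strategy = compose_query_strategy(least_confidence, top_n)
toy_model = lambda x: x  # treats the feature itself as P(positive)
print(strategy(toy_model, [0.05, 0.45, 0.9, 0.6], n_instances=2))  # prints [1, 3]
```

Keeping the measure and the selector separate means either can be swapped without touching the other.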

    Fixes

    • Checks added to modAL.utils.selection.weighted_random() to avoid division by zero.
    • ABC metaclassing now compatible with earlier Python versions (i.e. Python 2.7). Fixes #3.
    • sklearn.utils.check_array calls removed from modAL.models; performing checks is now up to the estimator. As a consequence, images don't need to be flattened. Fixes #5.
    • BaseCommittee now inherits from sklearn.base.BaseEstimator.
    • modAL.utils.combination.make_linear_combination rewritten using genexps, resulting in performance increase.
  • 0.2.0(Feb 10, 2018)

    Release notes

    New features

    • Information density measures. With the information_density function in modAL.density, density-based information metrics can be employed.
    • Functions for making new utility measures by linear combinations and products. With the function factories in modAL.utils.combination, functions can be transformed into their linear combination and product.
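
To make the combination idea concrete, here is a small sketch of two function factories that turn several utility measures into their weighted sum or their product of powers. `linear_combination_of` and `product_of` are hypothetical names for this illustration, and the toy measures return plain lists; the real factories live in modAL.utils.combination and may differ in signature.

```python
import math


def linear_combination_of(functions, weights):
    """Return a function computing sum_k weights[k] * functions[k](...)."""

    def combined(*args, **kwargs):
        outputs = [f(*args, **kwargs) for f in functions]
        n = len(outputs[0])
        return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(n)]

    return combined


def product_of(functions, exponents):
    """Return a function computing prod_k functions[k](...) ** exponents[k]."""

    def combined(*args, **kwargs):
        outputs = [f(*args, **kwargs) for f in functions]
        n = len(outputs[0])
        return [
            math.prod(out[i] ** e for e, out in zip(exponents, outputs))
            for i in range(n)
        ]

    return combined


# Toy measures standing in for an uncertainty score and a density score.
uncertainty = lambda X: [0.2, 0.8]
density = lambda X: [0.5, 0.25]

weighted = linear_combination_of([uncertainty, density], weights=[1.0, 2.0])
multiplied = product_of([uncertainty, density], exponents=[1, 1])
print(weighted(None), multiplied(None))
```

A product of uncertainty and density favours instances that are both uncertain and representative of the pool, which is the usual motivation for density-weighted sampling.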

    Changes

    • ActiveLearner constructor arguments renamed: predictor was renamed to estimator, and X_initial and y_initial were renamed to X_training and y_training.
    • ActiveLearner, Committee and CommitteeRegressor now also inherit from sklearn.base.BaseEstimator. Because of this, the get_params() and set_params() methods, for instance, can be used.
    • The private attributes of ActiveLearner, Committee and CommitteeRegressor are now exposed as public attributes.
    • As a result of the previous, the classes now can be cloned with sklearn.base.clone.
  • 0.1.0(Jan 8, 2018)

    modAL 0.1.0

    Modular Active Learning framework for Python3

    Release notes

    modAL is finally released! For its capabilities and documentation, see the page https://cosmic-cortex.github.io/modAL/!

    Installation

    modAL requires

    • Python >= 3.5
    • NumPy >= 1.13
    • SciPy >= 0.18
    • scikit-learn >= 0.18

    You can install modAL directly with pip:

    pip install modAL
    

    Alternatively, you can install modAL directly from source:

    pip install git+https://github.com/cosmic-cortex/modAL.git
Owner
modAL
A modular active learning framework for Python3
modAL
AP1 Transcription Factor Binding Site Prediction

A machine learning project that predicted binding sites of AP1 transcription factor, using ChIP-Seq data and local DNA shape information.

1 Jan 21, 2022
Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis.

Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.

Jeong-Yoon Lee 720 Dec 25, 2022
Adversarial Framework for (non-) Parametric Image Stylisation Mosaics

Fully Adversarial Mosaics (FAMOS) Pytorch implementation of the paper "Copy the Old or Paint Anew? An Adversarial Framework for (non-) Parametric Imag

Zalando Research 120 Dec 24, 2022
dirty_cat is a Python module for machine-learning on dirty categorical variables.

dirty_cat dirty_cat is a Python module for machine-learning on dirty categorical variables.

637 Dec 29, 2022
ml4ir: Machine Learning for Information Retrieval

ml4ir: Machine Learning for Information Retrieval | changelog Quickstart → ml4ir Read the Docs | ml4ir pypi | python ReadMe ml4ir is an open source li

Salesforce 77 Jan 06, 2023
A repository of PyBullet utility functions for robotic motion planning, manipulation planning, and task and motion planning

pybullet-planning (previously ss-pybullet) A repository of PyBullet utility functions for robotic motion planning, manipulation planning, and task and

Caelan Garrett 260 Dec 27, 2022
LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRerank, Seq2Slate.

LibRerank LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRer

126 Dec 28, 2022
Multiple Linear Regression using the LinearRegression class from sklearn.linear_model library

Multiple-Linear-Regression-master - A Python program to implement Multiple Linear Regression using the LinearRegression class from the sklearn.linear_model library

Kushal Shingote 1 Feb 06, 2022
A collection of video resources for machine learning

Machine Learning Videos This is a collection of recorded talks at machine learning conferences, workshops, seminars, summer schools, and miscellaneous

Dustin Tran 1.5k Dec 29, 2022
Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

Amplo 10 May 15, 2022
Banpei is a Python package of the anomaly detection.

Banpei Banpei is a Python package of the anomaly detection. Anomaly detection is a technique used to identify unusual patterns that do not conform to

Hirofumi Tsuruta 282 Jan 03, 2023
SPCL 48 Dec 12, 2022
A project based example of Data pipelines, ML workflow management, API endpoints and Monitoring.

MLOps template with examples for Data pipelines, ML workflow management, API development and Monitoring.

Utsav 33 Dec 03, 2022
Cool Python features for machine learning that I used to be too afraid to use. Will be updated as I have more time / learn more.

python-is-cool A gentle guide to the Python features that I didn't know existed or was too afraid to use. This will be updated as I learn more and bec

Chip Huyen 3.3k Jan 05, 2023
ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

Broad Institute 65 Dec 20, 2022
distfit - Probability density fitting

Python package for probability density function fitting of univariate distributions of non-censored data

Erdogan Taskesen 187 Dec 30, 2022
LibTraffic is a unified, flexible and comprehensive traffic prediction library based on PyTorch

LibTraffic is a unified, flexible and comprehensive traffic prediction library, which provides researchers with a credibly experimental tool and a convenient development framework. Our library is imp

432 Jan 05, 2023
BentoML is a flexible, high-performance framework for serving, managing, and deploying machine learning models.

Model Serving Made Easy BentoML is a flexible, high-performance framework for serving, managing, and deploying machine learning models. Supports multi

BentoML 4.4k Jan 04, 2023
Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational model)

Sum-Square_Error-Business-Analytical-Tool- Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational m

om Podey 1 Dec 03, 2021