Metric learning algorithms in Python

Overview

metric-learn: Metric Learning in Python

metric-learn contains efficient Python implementations of several popular supervised and weakly-supervised metric learning algorithms. As part of scikit-learn-contrib, the API of metric-learn is compatible with scikit-learn, the leading library for machine learning in Python. This allows you to use all the scikit-learn routines (for pipelining, model selection, etc.) with metric learning algorithms through a unified interface.

Algorithms

  • Large Margin Nearest Neighbor (LMNN)
  • Information Theoretic Metric Learning (ITML)
  • Sparse Determinant Metric Learning (SDML)
  • Least Squares Metric Learning (LSML)
  • Sparse Compositional Metric Learning (SCML)
  • Neighborhood Components Analysis (NCA)
  • Local Fisher Discriminant Analysis (LFDA)
  • Relative Components Analysis (RCA)
  • Metric Learning for Kernel Regression (MLKR)
  • Mahalanobis Metric for Clustering (MMC)

Dependencies

  • Python 3.6+ (the last version supporting Python 2 and Python 3.5 was v0.5.0)
  • numpy, scipy, scikit-learn>=0.20.3

Optional dependencies

  • For SDML, using skggm will allow the algorithm to solve problematic cases (install from commit a0ed406). pip install 'git+https://github.com/skggm/[email protected]' to install the required version of skggm from GitHub.
  • For running the examples only: matplotlib

Installation/Setup

  • If you use Anaconda: conda install -c conda-forge metric-learn. See more options here.
  • To install from PyPI: pip install metric-learn.
  • For a manual install of the latest code, download the source repository and run python setup.py install. You may then run pytest test to run all tests (you will need to have the pytest package installed).

Usage

See the sphinx documentation for full documentation about installation, API, usage, and examples.

Citation

If you use metric-learn in a scientific publication, we would appreciate citations to the following paper:

metric-learn: Metric Learning Algorithms in Python, de Vazelhes et al., Journal of Machine Learning Research, 21(138):1-6, 2020.

Bibtex entry:

@article{metric-learn,
  title = {metric-learn: {M}etric {L}earning {A}lgorithms in {P}ython},
  author = {{de Vazelhes}, William and {Carey}, CJ and {Tang}, Yuan and
            {Vauquier}, Nathalie and {Bellet}, Aur{\'e}lien},
  journal = {Journal of Machine Learning Research},
  year = {2020},
  volume = {21},
  number = {138},
  pages = {1--6}
}
Comments
  • [MRG] Address comments for sklearn-contrib integration

    Hi, we've made a request for inclusion in scikit-learn-contrib. This PR intends to address the comments on that issue: https://github.com/scikit-learn-contrib/scikit-learn-contrib/issues/40

    TODO:

    • [x] Fix flake8 errors. Some errors remain due to unused imports in metric_learn/__init__.py, but I guess these are needed, right? Also, inverse_covariance.quic is imported but unused, but this is expected since the import only serves to define the variable HAS_SKGGM. I don't know if there's another way to bypass this. Here is the log of flake8 after the fixes:
    ./test/metric_learn_test.py:17:3: F401 'inverse_covariance.quic' imported but unused
    ./metric_learn/__init__.py:3:1: F401 '.constraints.Constraints' imported but unused
    ./metric_learn/__init__.py:4:1: F401 '.covariance.Covariance' imported but unused
    ./metric_learn/__init__.py:5:1: F401 '.itml.ITML' imported but unused
    ./metric_learn/__init__.py:5:1: F401 '.itml.ITML_Supervised' imported but unused
    ./metric_learn/__init__.py:6:1: F401 '.lmnn.LMNN' imported but unused
    ./metric_learn/__init__.py:7:1: F401 '.lsml.LSML' imported but unused
    ./metric_learn/__init__.py:7:1: F401 '.lsml.LSML_Supervised' imported but unused
    ./metric_learn/__init__.py:8:1: F401 '.sdml.SDML_Supervised' imported but unused
    ./metric_learn/__init__.py:8:1: F401 '.sdml.SDML' imported but unused
    ./metric_learn/__init__.py:9:1: F401 '.nca.NCA' imported but unused
    ./metric_learn/__init__.py:10:1: F401 '.lfda.LFDA' imported but unused
    ./metric_learn/__init__.py:11:1: F401 '.rca.RCA' imported but unused
    ./metric_learn/__init__.py:11:1: F401 '.rca.RCA_Supervised' imported but unused
    ./metric_learn/__init__.py:12:1: F401 '.mlkr.MLKR' imported but unused
    ./metric_learn/__init__.py:13:1: F401 '.mmc.MMC_Supervised' imported but unused
    ./metric_learn/__init__.py:13:1: F401 '.mmc.MMC' imported but unused
    ./metric_learn/__init__.py:15:1: F401 '._version.__version__' imported but unused
    

    Note that I ignored some errors: E111 (indentation is not a multiple of four) and E114 (indentation is not a multiple of four (comment)).

    • [x] Put Python 3.7 in the CI tests
    opened by wdevazelhes 36
  • [MRG] Refactor the metric() method

    Fixes #147

    TODO:

    • [x] Add some tests
    • [x] Add references to the right parts of documentation (like Mahalanobis Distances) in the docstrings (if possible)
    • [x] Emphasize a bit more the difference and links between this and score_pairs in the docstring
    • [x] Be careful that it should work on 1D arrays
    • [x] Be careful that it should not return a float if given 2D arrays
    • [x] Remove useless np.atleast_2d (those in transformer_from_metric and those just before returning the transformer_)
    opened by wdevazelhes 33
  • [MRG] Enhance documentation

    This PR enhances the documentation by fixing issues about the doc and adding other improvements

    Fixes #155 Fixes #149 Fixes #150 Fixes #135

    TODO:

    • [x] Add link to docstring in documentation's titles of algos (fixes #155)
    • [x] Fix #149
    • [x] Fix #150
    • [x] Put the description of algorithms that have a supervised version in the non-supervised version (and not at the top of the page). This saves the user from scrolling up every time after following a link. We could also make separate pages for algos and their supervised version
    • [x] Check that links still work (e.g. that the MMC link points to the docstring of MMC; I had a problem with that, and MahalanobisMixin was not working either)
    • [x] Check that no num_dims remains (they should all be changed into n_components, I saw one in MLKR for instance)
    • [x] Solve the TODOs inside the .rst files
    • [x] Put a list of the methods in the docstring page (automatically), like in scikit-learn. This will make it easier to find methods without having to scroll down every time
    • Ensure that the doctests work -> postponed, already opened in #156
    • [x] some arguments are not documented, ensure that they are all documented
    • Maybe add metric_learn.constraints.positive_negative_pairs docstring -> postponed, discussed in #227
    • [x] Address https://github.com/metric-learn/metric-learn/pull/208#pullrequestreview-248843755
    opened by wdevazelhes 28
  • [MRG] Create new Mahalanobis mixin

    This PR creates a new Mahalanobis Mixin (cf issue https://github.com/metric-learn/metric-learn/issues/91), which is a common interface for all algorithms that learn a Mahalanobis-type (pseudo) distance of the form (x - x')^T M (x - x'). Right now all algorithms are of this form, but there might be others in the future.
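
As a minimal sketch of the form above (plain Python, not metric-learn's implementation; the names mahalanobis_sq, M and L are illustrative assumptions), the following computes (x - x')^T M (x - x') and checks that, when M = L^T L, it equals the squared Euclidean distance between the transformed points Lx and Lx'.

```python
def matvec(A, v):
    """Multiply matrix A (given as a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def mahalanobis_sq(x, y, M):
    """Return (x - y)^T M (x - y) for a square matrix M."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return sum(di * mi for di, mi in zip(d, matvec(M, d)))


# Hypothetical linear map L; M = L^T L is positive semidefinite by construction.
L = [[2.0, 0.0],
     [1.0, 1.0]]
n = len(L[0])
M = [[sum(row[i] * row[j] for row in L) for j in range(n)] for i in range(n)]

x = [1.0, 2.0]
y = [3.0, 1.0]

dist_with_M = mahalanobis_sq(x, y, M)

# Same quantity computed in the transformed space: ||Lx - Ly||^2.
Lx, Ly = matvec(L, x), matvec(L, y)
dist_transformed = sum((p - q) ** 2 for p, q in zip(Lx, Ly))
```

Both routes give the same value, which is why a transform-style method and a pairwise scoring method can share one underlying matrix.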

    This interface will enforce that an attribute metric_ exists, add documentation for it in the docstring of child classes, and make it possible to factor out the computations of embed functions (similar to what is done now with transform) and of the score_pairs function. These functions will come in later PRs, so right now this Mixin seems a bit artificial, but that is temporary.

    I also used the opportunity of this PR to improve the way the attribute metric_ is returned, checking that the matrix indeed exists (i.e. it has been explicitly initialized or the estimator has been fitted), and raising a warning otherwise. Don't hesitate to comment on this last part, or to tell me if it should belong to a separate PR.

    TODO:

    • [x] Create the class
    • [x] Make current algorithms inherit from it
    • [x] Use this opportunity to improve the metric_ property
    • [x] Maybe add some more tests
    • [x] Fix docstrings to render nicely (as if metric_ was a regular Attribute of the class): done by copying, see https://github.com/metric-learn/metric-learn/pull/96#issuecomment-415036218
    • [x] Check array at predict, embed etc, to only allow arrays of the right shape => EDIT: full checking will be done in the "preprocessor" PR

    EDIT: Initially we were thinking of doing also an ExplicitMixin that would be for metric learning algorithms that have a way to embed points in a space where the metric is the learned one. Since all algorithms are of this form for now, we will not implement it but rather implement all the functions in MahalanobisMixin (see https://github.com/metric-learn/metric-learn/pull/95#issuecomment-394689505)

    • [ ] Add embed function => EDIT: for now we will let only a transform function (see https://github.com/metric-learn/metric-learn/pull/96#issuecomment-407118297)
    • [x] Add score_pairs function
    opened by wdevazelhes 26
  • [MRG] Add preprocessor option

    This PR adds an argument preprocessor to the initialization of weakly supervised algorithms: an option that allows them to accept two types of input, either 3D arrays of tuples of points as before, or 2D arrays of indices/identifiers of points. In the latter case, the preprocessor given as input to the Metric Learner is used to retrieve points from identifiers. The preprocessor is basically a callable that is called on a list of identifiers (or just on one identifier? we'll have to decide), and that returns the points associated with these identifiers. If instead of a callable the user provides an array X, the "indexing" preprocessor is created automatically (i.e. a preprocessor such that preprocessor(id) returns X[id]). We could also imagine other shortcuts like this (for instance, the user could provide a string naming a root folder containing images, which would create a preprocessor such that preprocessor("img2.png") returns the vector associated with the image located at "rootfolder/img2.png").
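
The "indexing" preprocessor idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names make_indexing_preprocessor and tuples_from_indices are hypothetical, and the preprocessor is assumed to be called on a list of identifiers (one of the two options left open above).

```python
def make_indexing_preprocessor(X):
    """Return a callable that maps a list of identifiers to the points X[id]."""
    def preprocessor(ids):
        return [X[i] for i in ids]
    return preprocessor


def tuples_from_indices(index_tuples, preprocessor):
    """Turn a 2D list of identifiers into a 3D list of tuples of points."""
    return [preprocessor(t) for t in index_tuples]


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # toy points, one row per point
pairs_of_ids = [[0, 1], [2, 0]]             # 2D input: one row of identifiers per pair

preprocessor = make_indexing_preprocessor(X)
pairs = tuples_from_indices(pairs_of_ids, preprocessor)  # 3D: (n_pairs, 2, n_features)
```

A custom callable (for example one that loads an image from a path) could be passed in place of the indexing preprocessor without changing tuples_from_indices.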

    Note: This PR branched from MahalanobisMixin PR, so as soon as MahalanobisMixin is merged the diff should become more readable.

    TODO:

    • [x] Add comments and more info to this PR
    • [x] Add tests: in progress: add tests to check that the output format of check_input is as specified
    • [ ] Add documentation -> probably in another PR
    • [x] Make it simpler with unified check_input function
    • [x] Refactor the check_input function and its tests to be cleaner
    • [x] Write docstrings
    • [x] Refactor tests to have the list of metric learners at one place
    • [x] Fix the linalg bug
    opened by wdevazelhes 25
  • Push a new release to PyPI

    There are a lot of good changes since v0.3, so I think we're almost ready to release v0.4.

    Once most (hopefully all) of the remaining items on the milestone are finished, we should tag a commit on Github and push the new builds to PyPI.

    @terrytangyuan: I'll update this issue when we're good to go.

    opened by perimosocordiae 25
  • [MRG+1] Threshold for pairs learners

    This PR fits a threshold for tuples learners to allow a predict (and scoring) on pairs

    Fixes #131 Fixes #165

    Finally, we decided that it would be good to have at least a minimal implementation of threshold calibration inside metric-learn, so that _PairsClassifiers can have a threshold hence a predict directly out of the box, without the need for a MetaEstimator. A MetaEstimator could however be used outside the algorithm for more precise threshold calibration (with cross-validation).

    The following features should be implemented:

    • [x] We should have two methods for _PairsClassifiers: set_threshold() and calibrate_threshold(validation_set, method='max_tpr', args={'min_tnr': 0.1}). set_threshold will set the threshold to a hard value, and calibrate_threshold will take a validation set and a method ('accuracy', 'f1'...) and will find the threshold which optimizes the metric in method on the validation set. We went for the same syntax as scikit-learn's PR.

    • [ ] At fit time, we should either have a simple rule to set a threshold (for instance the median of distances, or the midpoint between the mean distance of positive pairs and the mean distance of negative pairs), or we should return calibrate_threshold(trainset), and in this case also raise a warning at the end saying that the threshold has been fitted on the trainset, so we should check scikit-learn's utilities for calibration to have a calibration less prone to overfitting. Also in this case we could allow passing arguments to fit, like fit(pairs, y, threshold_method='max_tpr', threshold_args={'min_tnr': 0.1})

    • [x] The following scores should be implemented:

      • 'accuracy'
      • 'f_beta'
      • 'max_tpr'
      • 'max_tnr'

      See scikit-learn's calibration PR (https://github.com/scikit-learn/scikit-learn/pull/10117) for more details, and its documentation here: https://35753-843222-gh.circle-artifacts.com/0/doc/modules/calibration.html
    • [x] For some estimators for which a natural threshold exists (like ITML: the mean between the lower threshold and the higher threshold), we should put this threshold

    • [x] Decide what to do by default, rule of thumb scoring or calibration on trainset ?
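
To make the 'accuracy' strategy above concrete, here is a minimal sketch in plain Python. It is not the code of this PR: the function name calibrate_threshold_accuracy is hypothetical, and it assumes that a pair is predicted similar (+1) when its distance is at most the threshold and dissimilar (-1) otherwise.

```python
def calibrate_threshold_accuracy(distances, labels):
    """Pick the distance threshold that maximizes accuracy on a validation set.

    A pair is predicted similar (+1) when its distance is <= threshold,
    and dissimilar (-1) otherwise.
    """
    # Candidates: one value below every distance (predict all dissimilar),
    # then each observed distance.
    candidates = [min(distances) - 1.0] + sorted(set(distances))
    best_threshold, best_accuracy = None, -1.0
    for t in candidates:
        correct = sum((1 if d <= t else -1) == y
                      for d, y in zip(distances, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:  # ties keep the smallest threshold
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Toy validation set: distances between pairs and their true labels.
distances = [0.2, 0.5, 0.9, 1.4, 2.0]
labels = [1, 1, -1, 1, -1]
threshold, accuracy = calibrate_threshold_accuracy(distances, labels)
```

The extra candidate below the smallest distance covers the case where the best choice is to predict every pair as dissimilar.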

    Questions:

    • Should we do the same thing for QuadrupletsLearners ? (the natural threshold is 0, so I don't think it would make as much sense here, and users would maybe rather use the future meta estimator from scikit-learn's PR https://github.com/scikit-learn/scikit-learn/pull/10117), but since we will already have it implemented for pairs, and maybe for coherence, we could have it also for Quadruplets Learners

    TODO:

    • [x] Implement tests that check that we can use custom scores in cross val (cf #165). This needs to have a predict hence is related to this PR.
    • [x] Implement API tests (that the behaviour is as expected)
    • [x] Implement numerical tests (on examples where we know the f_1 score etc.)
    • [x] Implement the actual method (cf. features above)
    • [x] Add this in the doc: also talk in the doc about the CalibratedClassifierCV
    • [x] Think about which scorings make sense with quadruplets and what impact it has on the code
    • [x] Use/Adapt CalibratedClassifierCV or an equivalent for quadruplets
    • [x] Maybe test CalibratedClassifierCV that it returns a coherent value (like all that have predict_proba = 0.8 have indeed 80% success) CalibratedClassifier's behaviour should be tested in scikit-learn: in metric learn we should just test the API (but on a small draft example I tested it for ITML and it worked more or less)
    • [ ] Add a case (and a test case) where the best accuracy is when we need to reject all points (so threshold = best score + 1). See if this applies too to other strategies
    opened by wdevazelhes 24
  • [MRG] FIX: make proposal for sdml formulation

    Digging into SDML's code and paper, I don't understand some parts of the implementation. I think it should be as proposed in this PR. Tell me if I'm wrong

    Looking at this paper about SDML https://icml.cc/Conferences/2009/papers/46.pdf, line , and this paper about Graphical Lasso http://statweb.stanford.edu/~tibs/ftp/graph.pdf: it seems that in SDML, we want to optimize equation 8, which can indeed be done with Graphical Lasso according to the paper on Graphical Lasso. In fact, equation 8 in SDML's paper is the same as equation 1 in Graphical Lasso's paper (up to a minus sign). The following variables are equivalent:

    | SDML's paper | Graphical Lasso's paper |
    |--------------|-------------------------|
    | M            | theta                   |
    | P            | S                       |

    where in SDML's paper, P = M_0^-1 + eta * X.L.X^T.

    But note that in SDML's paper, M_0^-1 is the inverse of the a priori Mahalanobis matrix, which can be indeed initialized to the inverse of the covariance matrix. So then M_0^-1 will be the inverse of the inverse of the covariance matrix hence the covariance matrix itself.

    So we should just compute P = emp. covariance matrix + self.balance_param * loss_matrix and run graphical lasso on this (and not: invert the covariance matrix, compute P = this_inverse + self.balance_param * loss_matrix, invert the result, and run Graphical Lasso on this, as is done currently).

    And in both cases, we want to evaluate the sparse inverse of M/theta, so things are OK for that

    Also, I didn't get the hack to ensure positive semidefinite, doesn't it change the result ?

    This PR's modification does not fix the issues we had (like plot_sandwich.py does not have better results with this). So maybe let's not merge it until we have a whole fix for SDML.

    There are other things that could explain why SDML doesn't work, like the choice of the optimization parameter alpha, and also the fact that graph_lasso seems to sometimes work badly, see https://github.com/scikit-learn/scikit-learn/issues/6887 and https://github.com/scikit-learn/scikit-learn/issues/11417

    So first tell me if you agree with this modification (not on merging it or not, just whether it's the right formula or not), so that we can look elsewhere to see how to fix SDML's error.

    TODO:

    • [x] Put a message just before merge, in the release draft, to announce that skggm is needed for SDML
    • [x] replace sklearn's pinvh (deprecated) by scipy's pinvh
    • [x] deal with the 1D input case for sdml
    • [x] Add a small test on the non SPD case that skggm can solve
    • [x] Make travis indeed install skggm if the version allows it
    opened by wdevazelhes 24
  • Fix covariance initialization when matrix is not invertible

    This PR fixes #276, an issue that arises in the context of covariance initialization on an algorithm that doesn't require a PD matrix. The computed matrix can be singular and therefore not invertible, which led to uninformative warnings. This is fixed by using a pseudo-inverse when a PD matrix is not required.

    opened by grudloff 23
  • [MRG] Export notebook to gallery

    Fixes #141 #153

    Hi, I've just converted @bhargavvader 's notebook from #27 into a sphinx-gallery file (with this snippet: https://gist.github.com/chsasank/7218ca16f8d022e02a9c0deb94a310fe). This way, it will appear nicely in the documentation, and it also lets us check that every algorithm works fine. There are a few things to change to make the PR mergeable (to compile the doc, you need sphinx-gallery):

    • [x] As discussed with @bellet, the iris dataset is maybe not the most expressive dataset for metric learning, we might want to find a dataset where classes are even more mixed and where metric learning gives a very advantageous separation
    • [x] Some parts seem to be broken (see the logo "broken"), I need to see why
    • [x] On my computer, the plan of the notebook appears in the left toolbar, we might not want that (we might want to see only two tabs (because we have two examples in metric-learn/examples) on the left sidebar and not tens of tabs)
    • [x] Some examples seem not to work super well in terms of separation, I need to see why
    opened by wdevazelhes 21
  • [MRG+2] Update repo to work with both new and old scikit-learn

    Fixes #311

    • I added the workaround suggested by @bellet here: https://github.com/scikit-learn-contrib/metric-learn/issues/311#issuecomment-804229956, so that imports work both in sklearn <0.24 and >= 0.24 (EDIT: actually the old import strategy was already deprecated in 0.22, so I compared the version to 0.22)
    • as well as an additional travis job to test it (I chose the one with python 3.6 since it's the oldest python, so I thought that if something goes wrong it might be this one, and also with skggm, since I'm thinking something has more chance to go wrong when using another package... I could have put all the checks (python3.6+3.7, with/without skggm), but then I thought the travis test suite might take quite some time)
    opened by wdevazelhes 18
  • Adjustment of the validation of the number of target neighbors

    Before the actual optimization process, the parameters are checked for validity. Lines 175 - 177 check whether the chosen k is valid in the context of the training data. According to the definition of LMNN by Weinberger et al., each class must have at least k+1 elements, so that there are at least k target neighbors for each data point. The implementation, however, only checks whether self.n_neighbors <= required_k (in fact the code checks the opposite in order to throw an error), where required_k is the number of elements of the smallest class. This check accepts a choice of k for a class that has exactly k elements, which shouldn't be the case, because in such a small class a point ends up selected as its own target neighbor. To determine the target neighbors, a distance matrix of all points within the class is computed. To prevent the point itself from being recognized as nearest neighbor, the diagonal of this matrix is set to infinity. If a class has only k elements, all elements of the class are chosen as target neighbors, including the current point itself (even though it has a distance of infinity to itself according to the distance matrix). As a result, each point of such a class effectively has one target neighbor fewer than points of classes with more training data, which can have unintended influences on the final transformation depending on the dataset used.

    To prevent this, it is sufficient to adjust the validation so that self.n_neighbors < required_k must hold.
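
The stricter check can be sketched as follows. This is an illustrative stand-alone function, not the library's code: check_n_neighbors is a hypothetical name, and required_k follows the description above (the size of the smallest class).

```python
from collections import Counter


def check_n_neighbors(labels, n_neighbors):
    """Raise ValueError unless every class has at least n_neighbors + 1 points."""
    required_k = min(Counter(labels).values())  # size of the smallest class
    if not n_neighbors < required_k:
        raise ValueError(
            f"n_neighbors={n_neighbors} requires at least {n_neighbors + 1} "
            f"points per class, but the smallest class has {required_k}")
    return required_k


labels = [0, 0, 0, 1, 1, 1, 1]   # smallest class has 3 points
check_n_neighbors(labels, 2)     # accepted: each point has 2 other same-class points
```

With the same labels, n_neighbors=3 is rejected here, whereas a check of the form n_neighbors <= required_k would have accepted it.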

    opened by JanekBlankenburg 1
  • Refactor LMNN as a triplets learner

    Addresses my request in #210.

    1. Introduces a _BaseLMNN / LMNN class to operate on triplets, s.t. d(triplets[i, 0], triplets[i, 1]) < d(triplets[i, 0], triplets[i, 2]) - the same setup as SCML.
    • from this definition of triplets, create a 'label mask', an nxn matrix with mask[i,j] = 1 and mask[i,k] = -1 for each triplet (i, j, k) (else 0).
    • This simply reformulates loss_grad / _find_impostors to operate on this label mask: impostors (points that violate the large margin) are detected by evaluating the squared distances of the entries implied by -1 values in the mask.
    • the desired parameter k can be inferred from the triplets by counting unique occurrences of genuine and impostor pairs
    2. Renames LMNN to LMNN_Supervised.
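
A minimal sketch of the 'label mask' construction described above, in plain Python. It assumes triplets are given as index triplets (i, j, k) into n points; the function name label_mask and the per-anchor count at the end are illustrative, not taken from this PR.

```python
def label_mask(triplets, n):
    """Build an n x n mask: +1 at (i, j), -1 at (i, k) for each triplet (i, j, k)."""
    mask = [[0] * n for _ in range(n)]
    for i, j, k in triplets:
        mask[i][j] = 1    # j should be close to anchor i (genuine pair)
        mask[i][k] = -1   # k should be farther from anchor i (impostor pair)
    return mask


triplets = [(0, 1, 2), (1, 0, 3), (2, 3, 0)]
mask = label_mask(triplets, 4)

# Number of distinct genuine pairs per anchor, one possible way to read off k.
targets_per_anchor = [row.count(1) for row in mask]
```

Because the mask stores each (anchor, other) pair once, repeated triplets sharing the same genuine pair are counted a single time.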
    opened by zdk123 0
  • SCML: Add warm_start parameter

    opened by maxi-marufo 2
  • [WIP] Add model selection example with LFW dataset and KNN task

    I created a model selection example for supervised Mahalanobis learners, to show the effectiveness of the linear transformation.

    I use a "large" dataset from sklearn: the Labeled Faces in the Wild (LFW) people dataset (classification). It's a bit more complex than iris, and for that reason I use PCA to reduce dimensionality.

    The usual pipeline would be: PCA-> Classifier, but in this case we try PCA-> Metric learner-> Classifier, and we compare how precision, recall and f1 scores vary relative to the first scenario, which I call the baseline.

    To compare models I fixed the last Classifier being a KNeighborsClassifier.

    In general, all supervised learners are able to outperform the baseline.

    I think this example can be useful to users, because it's hard to know beforehand which model will perform best on a given dataset.

    Note: The models' parameters are not tuned; this example acts as a "final" comparison between models.

    opened by mvargas33 0
  • [DOC] [WIP] Developers documentation page

    @bellet @perimosocordiae @terrytangyuan @wdevazelhes

    Motivated by #259, I wrote these docs for new developers as a guide on how to contribute to the package. I followed the scikit-learn guideline here, but after talking with @bellet, it's better to keep things simple, in terms of governance for instance.

    I also considered comments at #205 and in #13 .

    I also propose that API or major changes require a MLEP (Metric Learning Enhancement Proposal) document, with Github Discussions being the place to post and review it, because huge API changes are sometimes linked to more than one PR. Take the OASIS discussion at #336 as a very simple and informal MLEP. (In general I took this idea from sklearn).

    The main sections are:

    • Contributing: Introduction, values, general guideline, how to contribute code (PR process), how to test, how to compile the docs.
    • Metric learn governance: Roles, decision-making process, in general. How to proceed with API changes (MLEP)
    • Implement a new algorithm: Criteria of selection, and how to proceed.
    • API Structure: A quick review of the API for devs to know what classes to inherit from, and which methods to implement, and where.
    • MLEP Template: The template to be used in Github discussion for major changes.

    What is left:

    • How to make a release
    • How to publish the docs (the gh-page branch thing)
    • How to update at Pypi and Conda.

    And because this has not been discussed in the past, take this draft as what it is: a draft, especially the governance part and the MLEP part. Maybe something much simpler is enough, like the Github discussion #336 that I opened, but with a general template.

    Best! 😁

    Ps: I'm testing CSS usage in some parts, ignore it

    opened by mvargas33 1
  • 3. [WIP] OASIS algorithm implementation

    Hi!

    I am currently implementing the OASIS algorithm and I open this PR to make the implementation transparent while working on it. Any discussion, questions or comments are very welcome.

    This PR is under the WIP (Work In Progress) tag because as of now, I have a draft implementation of the algorithm out-of-the-package itself. It's a file in the root directory, with a test file in root as well.

    Over these days I will move the algorithm to metric_learn folder to make it compatible with the current API. Same for testing.

    Current testing only checks that nothing is broken. I'll add some tests on KNN tasks to verify that the algorithm performs better, at least on a handmade toy example.

    This PR depends on the Bilinear PR #329 acceptance beforehand.

    opened by mvargas33 1
Releases(v0.6.2)
  • v0.6.2(Jul 2, 2020)

  • v0.6.1(Jul 2, 2020)

  • v0.6.0(Jul 1, 2020)

    This release features various fixes and improvements, as well as a new triplet-based algorithm, SCML (see http://researchers.lille.inria.fr/abellet/papers/aaai14.pdf), and an associated Triplets API. Triplets-based metric learning algorithms are used in settings where we have an "anchor" sample that we want to be closer to a "positive" sample than to a "negative" sample. Consistent with related packages like scikit-learn, we have also dropped support for Python 2 and Python 3.5.

    New algorithms

    • Add Sparse Compositional Metric Learning (SCML) (#278)

    General updates on the package

    • Drop support for python 2 and python 3.5 (#291)
    • Add the Triplets API (#279)
    • Solve issues in the documentation (#265, #266, #271, #274, #280)
    • Allow installation from conda (#283)
    • Fix covariance initialization when matrix is not invertible (#277)
    • Add more robust checks that an estimator is fitted (#267)

    Improvements to existing algorithms

    • Improve LMNN's verbose (#253)
    • Fix chunk generation in RCA (#254, #263)
  • v0.5.0(Jul 18, 2019)

    This is a major release in which the API (in particular for weakly-supervised algorithms) was largely refurbished in order to make it more unified and largely compatible with scikit-learn. Note that for this reason, you might encounter a significant amount of DeprecationWarning and ChangedBehaviourWarning. These warnings will disappear in version 0.6.0. The changes are summarized below:

    • All algorithms:

      • Uniformize initialization for all algorithms: all algorithms that have a 'prior' or an 'init' as a parameter can now choose it in a unified way, between (more) different choices ('identity', 'random', etc...) (#195)
      • Rename num_dims to n_components for algorithms that have such a parameter. (#193)
      • metric() method has been renamed into get_mahalanobis_matrix (#152)
      • You can now use the function score_pairs to score a set of pairs of points (it returns the distance between them), or get_metric to get a metric function that can be plugged into scikit-learn estimators like any scipy distance.
    • Weakly supervised algorithms

      • major API changes (#139, #217, #220, #197, #168) allowing greater compatibility with scikit-learn routines:
        • in order to fit weakly supervised algorithms, users now have to provide 3d arrays of tuples (and possibly an array of labels y). For pairs learners, instead of X and [a, b, c, d] as before, we should have an array pairs such that pairs[i] = (X[a[k]], X[b[k]]) if y[i] == 1 or (X[c[k]], X[d[k]]) if y[i] != 1, where k is some integer (you can obtain such a representation by stacking a and b horizontally, then c and d, stacking these vertically, and taking X[this array of indices]). For quadruplets learners, the input has the same form, except that there is no need for y and that the 3d array is an array of 4-tuples instead of 2-tuples. The first two elements of each quadruplet are the ones that we want to be more similar to each other than the last two.
        • Alternatively, a "preprocessor" can be used, if users instead want to give tuples of indices and not tuples of plain points, for less redundant manipulation of data. Custom preprocessor can be easily written for advanced use (e.g., to load and encode images from file paths).
        • You can also use predict on a given pair or quadruplet, i.e. predict whether the pair is similar or not, or in the case of quadruplets, whether a given new quadruplet is in the right ordering or not
        • For pairs, this prediction depends on a threshold that can be set with set_threshold and calibrated on some data with calibrate_threshold.
        • For pairs, a default score is defined, which is the AUC (Area under the ROC Curve). For quadruplets, the default score is the accuracy (proportion of quadruplets given in the right order).
        • All of the above allows the algorithms to be compatible with scikit-learn for cross-validation, grid-search etc...
        • For more information about these changes, see the new documentation
    • Supervised algorithms

      • deprecation of num_labeled parameter (#119)
      • ITML_Supervised bounds must now be set in init and not fit anymore (#163)
      • deprecation of use_pca in LMNN (#231).
      • the random seed for generating constraints has now to be put at initialization rather than fit time (#224).
      • removed preprocessing the data for RCA (#194).
      • removed shogun dependency for LMNN (#216).
    • Improved documentation:

      • mathematical formulation of algorithms (#178)
      • general introduction to metric learning, use cases, different problem formulations (#145)
      • description of the API in the user guide (#208 and #229)
    • Bug fixes:

      • scikit-learn's fix https://github.com/scikit-learn/scikit-learn/pull/13276 fixed SDML when the matrix to reconstruct is PSD, and the use of skggm fixed it in cases where the matrix is not PSD but we can still converge. The use of skggm is now recommended (i.e. we recommend installing skggm to use SDML).
      • For all the algorithms that had a parameter num_dims (renamed to n_components, see above), the value is now checked to be between 1 and n_features, where n_features is the number of dimensions of the input space.
      • LMNN did not update impostors at each iteration, which could result in problematic cases. Impostors are now recomputed at each iteration, which solves these problems (#228).
      • The pseudo-inverse is now used in Covariance instead of the plain inverse, so Covariance works even when the covariance matrix is not invertible (e.g. if the data lies in a subspace of smaller dimension) (#206).
      • There was an error in #101 that caused LMNN to return a wrong gradient (one dot product with L was missing). This has been fixed in #201.
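The pseudo-inverse fix for Covariance listed above can be illustrated with NumPy alone. This is a sketch of the underlying idea under made-up toy data, not the library's implementation: when the data lies on a line in 2D, the covariance matrix is singular, but np.linalg.pinv still yields a usable symmetric matrix.

```python
import numpy as np

# Data lying on a line in 2D: the second feature is exactly twice the first,
# so the covariance matrix has rank 1 and no ordinary inverse exists.
t = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([t, 2.0 * t])

cov = np.cov(X, rowvar=False)          # shape (2, 2), singular
rank = np.linalg.matrix_rank(cov)

# The Moore-Penrose pseudo-inverse is defined even for singular matrices.
M = np.linalg.pinv(cov)

# Squared Mahalanobis-style distance between two points, using M.
diff = X[3] - X[0]
squared_distance = float(diff @ M @ diff)
```

Here M satisfies cov @ M @ cov == cov up to rounding, which is the defining property that replaces exact inversion.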
  • v0.4.0(Sep 5, 2018)

    • Two newly introduced algorithms:
      • MLKR (Metric Learning for Kernel Regression)
      • MMC (Mahalanobis Metric for Clustering)
    • Improved documentation and examples
    • Performance improvements
    • Minor bug fixes
  • v0.3.0(Jul 13, 2016)

    Constraints are now managed with a unified interface (metric_learn.Constraints), which makes it easy to generate various input formats from (possibly) partial label information.
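The idea of deriving constraints from partial label information can be sketched as follows. This is a hypothetical helper written for illustration only: the function name pairs_from_partial_labels, the use of -1 to mark unlabeled points, and the exhaustive enumeration of pairs are all assumptions, not the API or behavior of metric_learn.Constraints.

```python
import itertools
import numpy as np

def pairs_from_partial_labels(labels, unlabeled=-1):
    """Hypothetical helper: build index pairs from partially known labels.

    Points whose label equals `unlabeled` are ignored. Returns two integer
    arrays of shape (n, 2): same-label pairs and different-label pairs.
    """
    labels = np.asarray(labels)
    labeled_idx = np.flatnonzero(labels != unlabeled)
    similar, dissimilar = [], []
    # Enumerate every unordered pair of labeled points.
    for i, j in itertools.combinations(labeled_idx, 2):
        if labels[i] == labels[j]:
            similar.append((i, j))
        else:
            dissimilar.append((i, j))
    return (np.array(similar, dtype=int).reshape(-1, 2),
            np.array(dissimilar, dtype=int).reshape(-1, 2))

# Point 3 has no known label and therefore appears in no constraint.
labels = [0, 0, 1, -1, 1]
similar, dissimilar = pairs_from_partial_labels(labels)
```

In practice one would usually sample a subset of such pairs rather than enumerate all of them, since the number of pairs grows quadratically with the number of labeled points.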

  • v0.2.1(May 16, 2016)

  • v0.2.0(Nov 7, 2015)

  • v0.1.1(Oct 7, 2015)

    This minor release adds two new methods:

    • Local Fisher Discriminant Analysis (LFDA)
    • Relative Components Analysis (RCA)

    The performance of the non-Shogun LMNN implementation has also been improved, and it should now consume less memory.

    This release also includes the new Sphinx documentation and improved docstrings for many of the classes and methods.

  • v0.1.0(Sep 16, 2015)
