🌊 River is a Python library for online machine learning.

Overview


River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. River's ambition is to be the go-to library for doing machine learning on streaming data.

⚡️ Quickstart

As a quick example, we'll train a logistic regression to classify the website phishing dataset. Here's a look at the first observation in the dataset.

>>> from pprint import pprint
>>> from river import datasets

>>> dataset = datasets.Phishing()

>>> for x, y in dataset:
...     pprint(x)
...     print(y)
...     break
{'age_of_domain': 1,
 'anchor_from_other_domain': 0.0,
 'empty_server_form_handler': 0.0,
 'https': 0.0,
 'ip_in_url': 1,
 'is_popular': 0.5,
 'long_url': 1.0,
 'popup_window': 0.0,
 'request_from_other_domain': 0.0}
True

Now let's run the model on the dataset in a streaming fashion. We sequentially interleave predictions and model updates. Meanwhile, we update a performance metric to see how well the model is doing.

>>> from river import compose
>>> from river import linear_model
>>> from river import metrics
>>> from river import preprocessing

>>> model = compose.Pipeline(
...     preprocessing.StandardScaler(),
...     linear_model.LogisticRegression()
... )

>>> metric = metrics.Accuracy()

>>> for x, y in dataset:
...     y_pred = model.predict_one(x)      # make a prediction
...     metric = metric.update(y, y_pred)  # update the metric
...     model = model.learn_one(x, y)      # make the model learn

>>> metric
Accuracy: 89.20%

🛠 Installation

River is intended to work with Python 3.6 or above. Installation can be done with pip:

pip install river

There are wheels available for Linux, macOS, and Windows, which means that you most probably won't have to build River from source.

You can install the latest development version from GitHub like so:

pip install git+https://github.com/online-ml/river --upgrade

Or, through SSH:

pip install git+ssh://git@github.com/online-ml/river.git --upgrade

🧠 Philosophy

Machine learning is often done in a batch setting, whereby a model is fitted to a dataset in one go. This results in a static model which has to be retrained in order to learn from new data. In many cases, this is neither elegant nor efficient, and usually incurs a fair amount of technical debt. Indeed, if you're using a batch model, then you need to think about maintaining a training set, monitoring real-time performance, model retraining, etc.

With River, we encourage a different approach, which is to learn continuously from a stream of data. This means that the model processes one observation at a time, and can therefore be updated on the fly. This makes it possible to learn from massive datasets that don't fit in main memory. Online machine learning also integrates nicely in cases where new data is constantly arriving. It shines in many use cases, such as time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT applications. If you're bored with retraining models and want to instead build dynamic models, then online machine learning (and therefore River!) might be what you're looking for.
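To make "one observation at a time" concrete, here is a minimal sketch in plain Python of a running mean and variance, updated with Welford's algorithm. The `RunningStats` class is hypothetical and is not part of River's API. It only illustrates that each update needs constant memory and never revisits past data.

```python
class RunningStats:
    """Running mean and variance, updated one value at a time (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        # Sample variance; returns 0.0 when fewer than two values have been seen
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # the stream is consumed once, value by value
```

After the loop, `stats.mean` and `stats.variance` match what a batch computation over the full list would give, without the list ever being held in memory by the statistic itself.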

Here are some benefits of using River (and online machine learning in general):

  • Incremental: models can update themselves in real-time.
  • Adaptive: models can adapt to concept drift.
  • Production-ready: working with data streams makes it simple to replicate production scenarios during model development.
  • Efficient: models don't have to be retrained and require little compute power, which lowers their carbon footprint.
  • Fast: when the goal is to learn and predict with a single instance at a time, then River is an order of magnitude faster than PyTorch, TensorFlow, and scikit-learn.

🔥 Features

  • Linear models with a wide array of optimizers
  • Nearest neighbors, decision trees, naïve Bayes
  • Progressive model validation
  • Model pipelines as a first-class citizen
  • Anomaly detection
  • Recommender systems
  • Time series forecasting
  • Imbalanced learning
  • Clustering
  • Feature extraction and selection
  • Online statistics and metrics
  • Built-in datasets
  • And much more

🔗 Useful links

👁️ Media

👍 Contributing

Feel free to contribute in any way you like. We're always open to new ideas and approaches.

There are three ways for users to get involved:

  • Issue tracker: this is the place to report bugs and to request minor features or small improvements. Issues should be short-lived and solved as fast as possible.
  • Discussions: you can ask for new features, submit your questions and get help, propose new ideas, or even show the community what you are achieving with River! If you have a new technique or want to port new functionality to River, this is the place to discuss it.
  • Roadmap: you can check what we are doing, see the next planned milestones for River, and look for cool ideas that still need someone to make them a reality!

Please check out the contribution guidelines if you want to bring modifications to the code base. You can view the list of people who have contributed here.

❤️ They've used us

These are companies that we know have been using River, be it in production or for prototyping.

companies

Feel welcome to get in touch if you want us to add your company logo!

🤝 Affiliations

Sponsors

sponsors

Collaborating institutions and groups

collaborations

💬 Citation

If River has been useful for your research and you would like to cite it in a scientific publication, please refer to this paper:

@misc{2020river,
      title={River: machine learning for streaming data in Python},
      author={Jacob Montiel and Max Halford and Saulo Martiello Mastelini
              and Geoffrey Bolmier and Raphael Sourty and Robin Vaysse
              and Adil Zouitine and Heitor Murilo Gomes and Jesse Read
              and Talel Abdessalem and Albert Bifet},
      year={2020},
      eprint={2012.04740},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

📝 License

River is free and open-source software licensed under the 3-clause BSD license.

Comments
  • refactoring neighbors models to use simple collections queue

    This is a follow up to discussion in https://github.com/online-ml/river/issues/891#issuecomment-1080993289.

    The current implementation of nearest neighbors was done for a performance tweak, but it is limited because we cannot be flexible to add slightly varying features (the vector size for X is fixed and you get an error if you vary from the first one) or to add additional metadata about a point in the window such as a UID that can be used to look up a point later. From an online ML standpoint (as I understand it) we should optimize for this kind of flexibility over a small performance tweak (but coming from the world of HPC I totally get the performance bit!)

    This refactor takes the base logic from creme.neighbors, and adds some additional features,

    1. a minimum distance to consider adding a new datum to the window, ensuring we have a more variable window that does better at learning. The default for my model was 0.05 because I had ~250K points with quite a bit of repetition so adding redundancy was expensive and led to bad results, but I chose a default of 0.0 here so the default mimics what you would expect.
    2. Given the window can change (and classes within it) I realized that the prediction dict would potentially have a lot of extra "no longer existing in the window" classes. So I added a class cleanup function for classification to iterate over data in the window and ensure the classes known to the class accurately reflect the current window. The default is False so this will not run (and the model maintains all memory of classes it has seen and returns 0 probability of the class) but it can be set to True to always run after learn (potentially at the cost of performance) or can be run on demand (e.g., I can see an online-ml server setting it to False, and then loading and running once a day or something like that).
    3. optional support to add a UID, which is just another variable in the list added to the queue! This I've found incredibly useful in production cases where you do not just want a prediction back, but you want to look more into the closest points, metadata wise. The data in the window should be useful beyond having x, y, and class, and allowing an additional identifier allows the implementer to put more information elsewhere.
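The three features above can be sketched with a plain `collections.deque`. This is a minimal illustration and not the code from the PR: the class name `WindowNeighbors`, the parameter name `min_distance_keep`, and the `find_nearest` method are all hypothetical. Class cleanup is left out to keep it short.

```python
import collections
import math


def euclidean(a, b):
    """Distance between two feature dicts; missing keys count as 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


class WindowNeighbors:
    """Hypothetical sketch: keeps the last `window_size` points in a deque."""

    def __init__(self, n_neighbors=3, window_size=50, min_distance_keep=0.0):
        self.n_neighbors = n_neighbors
        self.min_distance_keep = min_distance_keep
        self.window = collections.deque(maxlen=window_size)

    def learn_one(self, x, y=None, uid=None):
        # Skip points that are too close to one already in the window
        if self.min_distance_keep > 0 and any(
            euclidean(x, other) < self.min_distance_keep for other, _, _ in self.window
        ):
            return self
        self.window.append((x, y, uid))  # the deque drops the oldest entry when full
        return self

    def find_nearest(self, x):
        """Return up to n_neighbors (distance, y, uid) tuples, closest first."""
        scored = [(euclidean(x, other), y, uid) for other, y, uid in self.window]
        return sorted(scored, key=lambda item: item[0])[: self.n_neighbors]


model = WindowNeighbors(n_neighbors=2, min_distance_keep=0.05)
model.learn_one({"a": 0.0, "b": 0.0}, y="ham", uid="p1")
model.learn_one({"a": 0.01}, y="ham", uid="p2")            # too close to p1, skipped
model.learn_one({"a": 1.0, "c": 2.0}, y="spam", uid="p3")  # a different feature set is fine
nearest = model.find_nearest({"a": 0.9, "c": 2.0})
```

Because each entry is a tuple of dict, label, and UID, the feature set can vary from point to point, and the UID comes back with every neighbor so the caller can look up extra metadata elsewhere.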

    For the reasons in 3, I renamed the BaseNeighbors class to be KNeighbors because people can use it as a very simple model to return the neighbors verbatim.

    This is likely a start (and I have not run tests locally) so we can discuss further changes that need to be done, a strategy for doing them, and what features / additional classes of the old implementation we want to preserve. I had this on another branch, but for some reason running pre-commit made changes outside of my changes (perhaps a different version of black or flake8 or something?) so I created a fresh clone and added the changed files verbatim.

    Signed-off-by: vsoch [email protected]

    opened by vsoch 63
  • Using Creme on non scikit learn dataset.

    Hi @MaxHalford ! Apologies for the beginner question.

    How do we use creme on non-scikit-learn datasets? Basically, I have two dictionaries.

    X=

    {(Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'AAPL'): -0.344719768623466, (Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'CSCO'): 0.20302817925763325, (Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'INTC'): -0.13517037149368347,

    And Y =

    {(Timestamp('2012-01-10 00:00:00'), 'AAPL'): -0.0013231888852133222, (Timestamp('2012-01-10 00:00:00'), 'CSCO'): 0.005841741901221553, (Timestamp('2012-01-10 00:00:00'), 'INTC'): -0.006252442360296984, (Timestamp('2012-01-10 00:00:00'), 'MSFT'): -0.014727011494252928, (Timestamp('2012-01-10 00:00:00'), 'SPY'): -0.003097653527452948, (Timestamp('2012-01-11 00:00:00'), 'AAPL'): -0.0004970178926441138, (Timestamp('2012-01-11 00:00:00'), 'CSCO'): 0.002621919244887305, (Timestamp('2012-01-11 00:00:00'), 'INTC'): 0.0019379844961240345,

    And I want to do a basic iterative linear regression on the data sets.

    Best, Andrew
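One possible way to feed such dictionaries to an online model is to regroup them into a stream of `(features, target)` pairs. The sketch below is an assumption about what is wanted, not an official recipe. `iter_observations` is a hypothetical helper, `datetime` stands in for `pandas.Timestamp`, and the values are rounded toy numbers.

```python
from collections import defaultdict
from datetime import datetime

# Toy data shaped like the dictionaries above
day = datetime(2012, 1, 10)
X = {
    (day, "volume_change_ratio", "AAPL"): -0.3447,
    (day, "volume_change_ratio", "CSCO"): 0.2030,
    (day, "volume_change_ratio", "INTC"): -0.1352,
}
Y = {
    (day, "AAPL"): -0.0013,
    (day, "CSCO"): 0.0058,
    (day, "INTC"): -0.0063,
    (day, "MSFT"): -0.0147,  # no features for this key, so it is skipped below
}


def iter_observations(X, Y):
    """Yield (features, target) pairs ordered by (timestamp, ticker)."""
    features = defaultdict(dict)
    for (timestamp, name, ticker), value in X.items():
        features[(timestamp, ticker)][name] = value
    for key in sorted(features):
        if key in Y:
            yield features[key], Y[key]


stream = list(iter_observations(X, Y))
```

Each pair in `stream` is a plain feature dict plus a float target, which is the shape the per-observation learning methods expect (`fit_one` in creme, `learn_one` in River).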

    opened by andrewczgithub 34
  • Numpy import error with River

    Hi,

    I read through all previous issues to make sure I have the right versions. I have:

    • numpy 1.20.1
    • river 0.10.1

    However I still got the "ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 232 from C header, got 216 from PyObject"

    What should I do? Thank you!

    opened by mai-n-coleman 32
  • Bandits regressors for model selection (new PR to use Github CI/CD)

    Description

    The PR introduces bandits (epsilon-greedy and UCB) for model selection (see issue #270). The PR concerns only regressors, but I can add the classifiers in a subsequent PR.

    Using the classes is straightforward:

    bandit = UCBRegressor(models=models, metric=metrics.MSE())

    for x, y in data.take(N):
        y_pred = bandit.predict_one(x=x)  # predict before learning
        bandit.learn_one(x=x, y=y)        # then update with the true value

    best_model = bandit.best_model


    There are convenience methods such as:

    • percentage_pulled: to get the percentage each arm was pulled
    • best_model: return the model with the highest average reward

    Also I added a method add_models where the user can add models on the fly.

    I am also working on a notebook that studies the behavior of the bandits for model selection. The notebook also includes Exp3, which seems promising but has numerical stability issues and yields counter-intuitive results (see section 3 of the NB). That's why I kept it out of this PR. More generally, the performance of UCB and epsilon-greedy is rather good, but there seems to be some variance in it.

    Improvements

    It's still WIP on the following points:

    • docstring, mainly add examples + cleaning.
    • some comments might be removed.
    • the name of the classes and the methods are open for changes
    Feature 
    opened by etiennekintzler 25
  • Implement bandits

    We now have SuccessiveHalvingClassifier and SuccessiveHalvingRegressor in the model_selection module to perform, well, model selection. This allows doing hyperparameter-tuning by initializing a model with different parameter configurations and running them against each other. All in all it seems to be working pretty well and we should be getting some feedback on it soon. The implementations are handy because they implement the fit_one/predict_one interface and therefore make the whole process transparent to users. In other words you can use them as you would any other model. This design will go a long way and should make things simple in terms of deployment (I'm thinking of you chantilly).

    The next step would be to implement multi-armed bandits. In progressive validation, all the remaining models are updated. This is called a "full feedback" situation. Bandits, on the other hand, use partial feedback, because only one model is picked and trained at a time. This is more efficient because it results in fewer model evaluations, but might also converge more slowly. Most bandit algorithms assume that the performance of the models is constant through time (this is called "stationarity"). However, the performance of each model is bound to change through time because the act of picking modifies the model. Therefore ideally we need to look into non-stationary bandit algorithms. Here are some more details and references.
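To illustrate the partial-feedback idea, here is a minimal epsilon-greedy sketch in plain Python. It is not the implementation that ended up in the library: `EpsilonGreedyRegressor`, `MeanRegressor`, and the choice of negative absolute error as the reward are all assumptions made for the example. It also uses plain sample averages, so it shares the stationarity assumption discussed above.

```python
import random


class MeanRegressor:
    """Toy model: predicts a running mean of the targets, scaled by a fixed factor."""

    def __init__(self, scale):
        self.scale = scale
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.scale * self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n
        return self


class EpsilonGreedyRegressor:
    """Hypothetical sketch of partial feedback: one model is picked and trained per step."""

    def __init__(self, models, epsilon=0.1, seed=None):
        self.models = models
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = [0] * len(models)
        self.avg_reward = [0.0] * len(models)

    @property
    def best_index(self):
        return max(range(len(self.models)), key=self.avg_reward.__getitem__)

    def predict_one(self, x):
        return self.models[self.best_index].predict_one(x)

    def learn_one(self, x, y):
        # Explore with probability epsilon, otherwise exploit the best arm so far
        if self.rng.random() < self.epsilon:
            arm = self.rng.randrange(len(self.models))
        else:
            arm = self.best_index
        model = self.models[arm]
        reward = -abs(y - model.predict_one(x))  # negative absolute error
        self.pulls[arm] += 1
        self.avg_reward[arm] += (reward - self.avg_reward[arm]) / self.pulls[arm]
        model.learn_one(x, y)  # only the pulled model is updated
        return self


bandit = EpsilonGreedyRegressor(
    models=[MeanRegressor(1.0), MeanRegressor(0.5), MeanRegressor(2.0)],
    epsilon=0.1,
    seed=42,
)
for _ in range(500):
    bandit.learn_one(x={}, y=10.0)
```

Since rewards are never positive and the averages start at zero, every arm gets tried at least once before the greedy choice settles. `bandit.pulls` then shows how unevenly the training budget was spent across the models.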

    Here are the algorithms I would like to see implemented:

    Also see this paper. We can probably open separate issues for each implementation. I think that the current best way of proceeding is to provide one implementation for regression and one for classification in each case, much like what is done for successive halving. There's also this paper that discusses delayed feedback and how it affects bandits.

    Feature 
    opened by MaxHalford 24
  • Error when using OutputCodeClassifier for Code-size greater than 20

    Hello again,

    I was testing the OCC classifier with more than 90 classes and the accuracy is very poor. I assume I need a huge code size. However, I was testing different code sizes (starting with a code size of 10) and recording the accuracy when I came to a code size of 40 and received the following error: OverflowError: Python int too large to convert to C ssize_t. Is there a way we can modify the OCC classifier to allow for compact codes (as short as possible) while still providing enough discriminating power between the different classes?
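As a rough illustration of what "as short as possible" means, the sketch below computes the smallest code size that still gives every class a distinct binary code. Both helpers are hypothetical and are not how OutputCodeClassifier builds its code book.

```python
def minimal_code_size(n_classes):
    """Smallest number of bits that gives every class a distinct binary code."""
    return max(1, (n_classes - 1).bit_length())


def compact_codes(classes):
    """Map each class to a distinct tuple of bits of minimal length."""
    size = minimal_code_size(len(classes))
    return {
        label: tuple((index >> bit) & 1 for bit in reversed(range(size)))
        for index, label in enumerate(classes)
    }


codes = compact_codes(list(range(90)))  # 90 classes fit in 7 bits
```

The trade-off is that codes this short can differ from one another in a single bit, so one wrong binary prediction is enough to land on another class. Longer codes buy error correction at the cost of more binary classifiers.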

    opened by Yasmen-Wahba 23
  • Adaptive Random Forest Regressor/Hoeffding Tree Regressor splitting gives AttributeError

    Versions

    • river version: River 0.1.0
    • Python version: Python 3.7.1
    • Operating system: Windows 10 Enterprise v1803

    Describe the issue

    Have been playing around with River/Creme for a couple of weeks now, and it's super useful for a project I am currently working on. That being said, I'm still pretty new to the workings of the algorithms, so unsure whether this is a bug or an issue with my setup.

    When I call learn_one on the ARFR or HTR, I receive: "Attribute Error "NoneType" object has no attribute '_left'" from line 113 in best_evaluated_split_suggestion.py.

    I have implemented a delayed prequential evaluation algorithm, and inspecting the loop, the error seems to be thrown once the first delay period has been exceeded, i.e. when the model can first make a prediction that isn't zero. Before this point, learn_one doesn't throw this error.

    Currently, I am using the default ARFR as in the example, with a simple Linear Regression as the leaf_model. The linear regression model itself has been working with my data, when not used in the ARFR. I want to try the ensemble/tree models with the data to see if accuracy is improved due to the drift detection methods that are included.

    Has anyone also seen this error being thrown, or know the causes of it? Let me know if more information is needed! Thanks.

    Bug 
    opened by JPalm1 23
  • Feature/l1 implementation

    Yo @MaxHalford check out this

    [Thanks to @gbolmier for inspiration]

    Addresses #618 I've decided to implement L1 cumulative only because:

    • L1 naive is just not efficient plus will likely take more hassle than it should (e.g. sign method for VectorDict)
    • Truncated Gradient approach, according to the paper, is overshadowed by L1 cumulative plus takes 3 params to tune instead of 1 (no thanks, as if online model tuning is not tricky enough)

    Done some tests and showcase: https://gist.github.com/ColdTeapot273K/b7865bbfb9ad2e473b474c39c3d40413

    Bonus:

    • works with learn_many out of the box
    • appears to be better than scikit-learn's impl of l1 on SGDRegressor (I think it uses truncated under the hood?)

    As you might notice this implementation implies either l1 or l2 is used at a time. A small price to pay for the ability to finally do proper, optimal sparse feature selection.
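For readers unfamiliar with the technique, here is a minimal sketch of SGD with a cumulative L1 penalty, assuming that "L1 cumulative" refers to the scheme of Tsuruoka et al. (2009). It is written for plain dicts and a squared loss. It is not the code from this PR, and `sgd_l1_cumulative` is a hypothetical name.

```python
def sgd_l1_cumulative(data, lr=0.1, l1=0.05, epochs=100):
    """Least-squares SGD on dict features with a cumulative L1 penalty."""
    w = {}   # weights
    q = {}   # total penalty actually applied to each weight so far
    u = 0.0  # total penalty each weight could have received so far
    for _ in range(epochs):
        for x, y in data:
            error = sum(w.get(k, 0.0) * v for k, v in x.items()) - y
            u += lr * l1
            for k, v in x.items():
                w_k = w.get(k, 0.0) - lr * error * v  # plain gradient step
                before = w_k
                # Clip towards zero by the penalty this weight has not yet received
                if w_k > 0:
                    w_k = max(0.0, w_k - (u + q.get(k, 0.0)))
                elif w_k < 0:
                    w_k = min(0.0, w_k + (u - q.get(k, 0.0)))
                q[k] = q.get(k, 0.0) + (w_k - before)
                w[k] = w_k
    return w


# Toy data: the target depends on "a" only, "b" is an irrelevant feature
data = [
    ({"a": 1.0, "b": 0.5}, 2.0),
    ({"a": 2.0, "b": -0.5}, 4.0),
    ({"a": -1.0, "b": 0.3}, -2.0),
]
weights = sgd_l1_cumulative(data)
```

The single hyperparameter `l1` controls how strongly weights are pulled to zero, which is the "1 param instead of 3" argument made above. Only features present in an observation are touched, so the penalty is applied lazily for sparse inputs.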

    Got problems with vectorizing the penalty. Didn't see a graceful way of imitating boolean indexing in VectorDict or at least a way to transfer from numpy restricted computation domain into VectorDict. So decided to just leave vectorized version as a dry-run thing with a test attached (see gist), maybe you'll have better ideas (@MaxHalford you should have the access to my forks).

    Cheers.

    opened by ColdTeapot273K 22
  • Using Creme model in Python 2

    Hi Guys,

    Suppose, I trained a logistic-regression classifier using Python 3.x as Creme only supports Python 3. How can I use the model for classification in Python 2?
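One possible approach, sketched under assumptions: a logistic regression is fully described by its weights and intercept, so these can be exported to JSON from Python 3 and the prediction re-implemented with the standard library only. The weights below are made-up numbers and `predict_proba_one` is a hypothetical helper. How the weights are read out of the trained model is left open here.

```python
import json
import math

# Assumed to have been written from the Python 3 side, e.g. with json.dump
exported = json.dumps(
    {"weights": {"age_of_domain": 0.8, "long_url": -1.2}, "intercept": 0.1}
)


def predict_proba_one(params, x):
    """Probability of the positive class for a feature dict x."""
    score = params["intercept"]
    for name, value in x.items():
        score += params["weights"].get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-score))


params = json.loads(exported)
p = predict_proba_one(params, {"age_of_domain": 1, "long_url": 1.0})
```

The function only relies on `json` and `math`, so it should not need anything that is specific to Python 3. Any preprocessing that ran before the model, such as scaling, would have to be exported and replayed in the same way.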

    opened by aamirkhan34 22
  • Question about centroid changes

    heyo! So this might be better suited for chantilly, but it's generally about river so I hope it's okay to put here. I'm creating a server that has a cluster model on the backend to handle clustering types of errors. I was planning on using a cluster model and then assigning new errors (saved as objects in the database) to clusters, but I realize if I have a preference for models with a changing number of centers then my assignments would also likely change over time (and not be valid). This makes the idea of saving the cluster id with the error object not a great idea. But then I was thinking: as long as we have some kind of ability for a model to output changes, e.g.:

    • Cluster 8 no longer exists
    • Clusters 10 and 12 were merged into 10
    • Cluster 5 was split into 5 and 101

    Then it could be feasible to act on those events, and in the case of the above:

    • Remove the assignment of any errors to cluster 8 (reclassify if it's computationally easy)
    • Find objects assigned to 10 and 12, assign all to 10
    • Remove the assignment of 5, and reclassify if feasible
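The event handling described in the two lists above could look roughly like this on the server side. The event format (dicts with a `type` key) and the `apply_cluster_events` helper are hypothetical. Nothing here is an existing river API.

```python
def apply_cluster_events(assignments, events):
    """Update {error_id: cluster_id} in place; return ids that need reclassifying."""
    to_reclassify = []
    for event in events:
        kind = event["type"]
        if kind == "merged":
            # Every member of a source cluster moves to the surviving cluster
            for error_id, cluster in assignments.items():
                if cluster in event["sources"]:
                    assignments[error_id] = event["target"]
        elif kind in ("removed", "split"):
            # Membership is no longer known: drop it and flag for reclassification
            for error_id, cluster in assignments.items():
                if cluster == event["cluster"]:
                    assignments[error_id] = None
                    to_reclassify.append(error_id)
    return to_reclassify


assignments = {"e1": 8, "e2": 10, "e3": 12, "e4": 5, "e5": 3}
events = [
    {"type": "removed", "cluster": 8},
    {"type": "merged", "sources": [10, 12], "target": 10},
    {"type": "split", "cluster": 5, "into": [5, 101]},
]
pending = apply_cluster_events(assignments, events)
```

The ids in `pending` are the ones to run through the model again if that is computationally cheap. Everything else keeps a valid cluster id without being touched.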

    So my question - how do we integrate this functionality into river, at least exposing enough for a wrapper of some kind to assess changes? Or is it just a bad design idea on my part?

    opened by vsoch 21
  • Timeseries forecast evaluation

    2 matters here:

    1. After introducing the ThresholdFilter in the time_series pipeline, it is now hard to evaluate models with vectors that have anomalies. Is it possible to pass the anomaly score (1/0) through to the output of model.forecast()? That way one could skip metrics.update for those anomalies when looping over the dataset.

    2. For a-posteriori evaluation of a time_series forecaster, I am trying to replay the dataset with a pre-trained model and see how that model would have performed at "reconstructing" the whole dataset. To do this I would simply need to call model.forecast() using values in the past. But is this possible? How should I call the forecast method? And more generally, what is the difference between horizon and xs in the forecast signature?

    opened by dberardo-com 18
  • venv caching for faster CI

    As discussed with @MaxHalford on Discord, caching the virtual environment allows us to bypass pip installs completely. The existing approach caches downloaded pip packages, but pip install takes a considerable amount of time even then.

    opened by boragokbakan 0
  • Onelearn implementation: AMF & Mondrian Tree Classifiers

    Description of the PR

    Hi! 👋

    This is a first version of Onelearn's library (classifiers only) implementation in River. It contains:

    • Mondrian Tree (Base and Classifier)
    • Aggregated Mondrian Forest (Base and Classifier)

    The original repository with proof of working implementation can be found here (see script.py).

    Known drawbacks of the current implementation

    • Management of labels: labels must be positive integers at the moment, since it just makes my life easier. Please tell me how you would prefer this to be implemented in River's framework (I've seen labels presented as dictionaries, but I'm not sure what these dictionaries contain: int? string?). This might be a simple trick for me to change how labels are managed I think, just need to know your preferences.
    • Tree branches: I use branches mainly as the global tree structure, I'm not quite sure how to use them better. Feel free to comment on that too if you have better suggestions!
    • Examples of usage: MondrianTreeClassifier and AMFClassifier would need examples of implementations for users. I actually implemented one on my repo here already, but I didn't manage to compile River so I couldn't get the scores 😢
    • Random state: I have a random state attribute, but it's not used at the moment. I'm not exactly sure where to put the random state in the Mondrian process, I'm afraid it'd be breaking the whole thing. If any expert in Random Forest could advise me on where random state should be placed, that would be great ❤️

    Notes on the utils

    Currently I placed two functions in the utils section:

    • sample_discrete
    • log_sum_2_exp

    It might seem overkill to place them as utils right now looking at where they're used in the code, but I'll need them for the regressors too when the time comes. Maybe there's a better place for them keeping the regressors in mind though.
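For context, here is a guess at what two helpers with these names typically compute. The function names come from the PR, but the bodies below are assumptions based on the names alone and are not copied from it.

```python
import math
import random


def log_sum_2_exp(a, b):
    """Compute log(exp(a) + exp(b)) without overflowing for large inputs."""
    if a < b:
        a, b = b, a
    # exp(b - a) is at most 1 here, so it cannot overflow
    return a + math.log1p(math.exp(b - a))


def sample_discrete(distribution, rng=random):
    """Draw an index i with probability distribution[i] (weights must sum to 1)."""
    threshold = rng.random()
    cumulative = 0.0
    for index, probability in enumerate(distribution):
        cumulative += probability
        if threshold < cumulative:
            return index
    return len(distribution) - 1  # guard against floating-point round-off


stable = log_sum_2_exp(1000.0, 1000.0)  # a naive math.exp(1000.0) would overflow
choice = sample_discrete([0.0, 1.0, 0.0])
```

Both functions are independent of whether the tree is a classifier or a regressor, which fits the plan to reuse them later.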

    opened by AlexandreChaussard 6
  • add xstream

    We are currently working to implement PySAD into River, for the "polytechnique-project". It is a pull request for adding the method "xstream".

    Sophie Normand

    opened by Sophie-Normand 0
  • Refactor benchmarks

    Hi 👋,

    I spent some time on the benchmarks and tried to refactor them. I used vega for visualising the performances. This requires the mkdocs-charts-plugin which sometimes fails for the livedocs when reloading the documentation. Further, the index.md in the benchmark folder might get pretty big at some point.

    As there are some drawbacks, I would like some feedback from you to see if I am on the right track.

    Best, Cedric

    Improvement 
    opened by kulbachcedric 6