🌊 River is a Python library for online machine learning.

Overview

River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. River's ambition is to be the go-to library for doing machine learning on streaming data.

⚡️ Quickstart

As a quick example, we'll train a logistic regression to classify the website phishing dataset. Here's a look at the first observation in the dataset.

>>> from pprint import pprint
>>> from river import datasets

>>> dataset = datasets.Phishing()

>>> for x, y in dataset:
...     pprint(x)
...     print(y)
...     break
{'age_of_domain': 1,
 'anchor_from_other_domain': 0.0,
 'empty_server_form_handler': 0.0,
 'https': 0.0,
 'ip_in_url': 1,
 'is_popular': 0.5,
 'long_url': 1.0,
 'popup_window': 0.0,
 'request_from_other_domain': 0.0}
True

Now let's run the model on the dataset in a streaming fashion. We sequentially interleave predictions and model updates. Meanwhile, we update a performance metric to see how well the model is doing.

>>> from river import compose
>>> from river import linear_model
>>> from river import metrics
>>> from river import preprocessing

>>> model = compose.Pipeline(
...     preprocessing.StandardScaler(),
...     linear_model.LogisticRegression()
... )

>>> metric = metrics.Accuracy()

>>> for x, y in dataset:
...     y_pred = model.predict_one(x)      # make a prediction
...     metric = metric.update(y, y_pred)  # update the metric
...     model = model.learn_one(x, y)      # make the model learn

>>> metric
Accuracy: 89.20%
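
River's evaluate module ships a progressive_val_score helper that runs this same predict-then-learn interleaving for you. As a minimal sketch (on a fresh model, since the one above has already seen the data):

from river import evaluate

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

evaluate.progressive_val_score(dataset, model, metrics.Accuracy())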

🛠 Installation

River is intended to work with Python 3.6 or above. Installation can be done with pip:

pip install river

There are wheels available for Linux, macOS, and Windows, which means that you most probably won't have to build River from source.

You can install the latest development version from GitHub like so:

pip install git+https://github.com/online-ml/river --upgrade

Or, through SSH:

pip install git+ssh://[email protected]/online-ml/river.git --upgrade

🧠 Philosophy

Machine learning is often done in a batch setting, whereby a model is fitted to a dataset in one go. This results in a static model which has to be retrained in order to learn from new data. In many cases this is neither elegant nor efficient, and it usually incurs a fair amount of technical debt. Indeed, if you're using a batch model, then you need to think about maintaining a training set, monitoring real-time performance, model retraining, etc.

With River, we encourage a different approach, which is to continuously learn from a stream of data. This means that the model processes one observation at a time, and can therefore be updated on the fly. This makes it possible to learn from massive datasets that don't fit in main memory. Online machine learning also integrates nicely with cases where new data is constantly arriving. It shines in many use cases, such as time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT applications. If you're bored with retraining models and want to build dynamic models instead, then online machine learning (and therefore River!) might be what you're looking for.

Here are some benefits of using River (and online machine learning in general):

  • Incremental: models can update themselves in real-time.
  • Adaptive: models can adapt to concept drift.
  • Production-ready: working with data streams makes it simple to replicate production scenarios during model development.
  • Efficient: models don't have to be retrained and require little compute power, which lowers their carbon footprint.
  • Fast: when the goal is to learn and predict with a single instance at a time, then River is an order of magnitude faster than PyTorch, TensorFlow, and scikit-learn.

🔥 Features

  • Linear models with a wide array of optimizers
  • Nearest neighbors, decision trees, naïve Bayes
  • Progressive model validation
  • Model pipelines as a first-class citizen
  • Anomaly detection
  • Recommender systems
  • Time series forecasting
  • Imbalanced learning
  • Clustering
  • Feature extraction and selection
  • Online statistics and metrics
  • Built-in datasets
  • And much more

🔗 Useful links

πŸ‘οΈ Media

👐 Contributing

Feel free to contribute in any way you like, we're always open to new ideas and approaches.

There are three ways for users to get involved:

  • Issue tracker: this is the place to report bugs, request minor features, or suggest small improvements. Issues should be short-lived and solved as quickly as possible.
  • Discussions: you can ask for new features, submit your questions and get help, propose new ideas, or even show the community what you are achieving with River! If you have a new technique or want to port new functionality to River, this is the place to discuss it.
  • Roadmap: you can check what we are doing, what the next planned milestones for River are, and look for cool ideas that still need someone to make them a reality!

Please check out the contribution guidelines if you want to bring modifications to the code base. You can view the list of people who have contributed here.

❤️ They've used us

These are companies that we know have been using River, be it in production or for prototyping.

Feel welcome to get in touch if you want us to add your company logo!

🤝 Affiliations

Sponsors

Collaborating institutions and groups

💬 Citation

If River has been useful for your research and you would like to cite it in a scientific publication, please refer to this paper:

@misc{2020river,
      title={River: machine learning for streaming data in Python},
      author={Jacob Montiel and Max Halford and Saulo Martiello Mastelini
              and Geoffrey Bolmier and Raphael Sourty and Robin Vaysse
              and Adil Zouitine and Heitor Murilo Gomes and Jesse Read
              and Talel Abdessalem and Albert Bifet},
      year={2020},
      eprint={2012.04740},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

📝 License

River is free and open-source software licensed under the 3-clause BSD license.

Comments
  • refactoring neighbors models to use simple collections queue

    This is a follow up to discussion in https://github.com/online-ml/river/issues/891#issuecomment-1080993289.

    The current implementation of nearest neighbors was done as a performance tweak, but it is limited: we cannot flexibly handle slightly varying features (the vector size for x is fixed and you get an error if you vary from the first one), nor add metadata about a point in the window, such as a UID that can be used to look up the point later. From an online ML standpoint (as I understand it) we should optimize for this kind of flexibility over a small performance tweak (though coming from the world of HPC, I totally get the performance bit!).

    This refactor takes the base logic from creme.neighbors, and adds some additional features,

    1. a minimum distance to consider adding a new datum to the window, ensuring we have a more varied window that does better at learning. The default for my model was 0.05 because I had ~250K points with quite a bit of repetition, so adding redundancy was expensive and led to bad results, but I chose a default of 0.0 here so the default mimics what you would expect.
    2. Given that the window can change (and the classes within it), I realized that the prediction dict would potentially have a lot of extra "no longer existing in the window" classes. So I added a class cleanup function for classification that iterates over data in the window and ensures the classes known to the model accurately reflect the current window. The default is False so this will not run (and the model maintains all memory of classes it has seen and returns 0 probability for them), but it can be set to True to always run after learn (potentially at the cost of performance), or it can be run on demand (e.g., I can see an online-ml server setting it to False, and then loading and running it once a day or something like that).
    3. optional support to add a UID, which is just another variable in the list added to the queue! I've found this incredibly useful in production cases where you do not just want a prediction back, but you want to look more into the closest points, metadata-wise. The data in the window should be useful beyond having x, y, and class, and allowing an additional identifier lets the implementer put more information elsewhere.

    For the reasons in 3, I renamed the BaseNeighbors class to be KNeighbors because people can use it as a very simple model to return the neighbors verbatim.
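
    For illustration, here is a rough sketch of the core window logic described above (a deque-backed window with a minimum-distance gate and an optional UID); the names are mine, not the PR's actual API:

    import collections

    class NeighborsWindow:
        """Sketch: store (x, y, uid) tuples, skipping near-duplicate points."""

        def __init__(self, window_size=1000, min_distance_keep=0.0):
            self.window = collections.deque(maxlen=window_size)
            self.min_distance_keep = min_distance_keep

        def _distance(self, a, b):
            # Euclidean distance over the union of keys, so feature sets may vary
            keys = set(a) | set(b)
            return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys) ** 0.5

        def learn_one(self, x, y, uid=None):
            # Only add the point if it is far enough from every stored point;
            # the 0.0 default keeps everything, mimicking the usual behavior
            if all(self._distance(x, xi) >= self.min_distance_keep for xi, _, _ in self.window):
                self.window.append((x, y, uid))
            return self

        def neighbors(self, x, k=5):
            # Return the k closest (x, y, uid) tuples, verbatim
            return sorted(self.window, key=lambda item: self._distance(x, item[0]))[:k]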

    This is likely a start (and I have not run tests locally) so we can discuss further changes that need to be done, a strategy for doing them, and what features / additional classes of the old implementation we want to preserve. I had this on another branch, but for some reason running pre-commit made changes outside of my changes (perhaps a different version of black or flake8 or something?) so I created a fresh clone and added the changed files verbatim.

    Signed-off-by: vsoch [email protected]

    opened by vsoch 63
  • Using Creme on non scikit learn dataset.

    Hi @MaxHalford! Apologies for the beginner question.

    How do we use creme on non-scikit-learn datasets? Basically, I have two dictionaries.

    X=

    {(Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'AAPL'): -0.344719768623466, (Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'CSCO'): 0.20302817925763325, (Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'INTC'): -0.13517037149368347,

    And Y =

    {(Timestamp('2012-01-10 00:00:00'), 'AAPL'): -0.0013231888852133222, (Timestamp('2012-01-10 00:00:00'), 'CSCO'): 0.005841741901221553, (Timestamp('2012-01-10 00:00:00'), 'INTC'): -0.006252442360296984, (Timestamp('2012-01-10 00:00:00'), 'MSFT'): -0.014727011494252928, (Timestamp('2012-01-10 00:00:00'), 'SPY'): -0.003097653527452948, (Timestamp('2012-01-11 00:00:00'), 'AAPL'): -0.0004970178926441138, (Timestamp('2012-01-11 00:00:00'), 'CSCO'): 0.002621919244887305, (Timestamp('2012-01-11 00:00:00'), 'INTC'): 0.0019379844961240345,

    And I want to do a basic iterative linear regression on the data sets.

    Best, Andrew
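
    For what it's worth, here is a minimal sketch of one way to do this (not from the thread): regroup the flat dictionaries into one feature dict per (timestamp, ticker) key, then learn online. It uses River's current API (in creme, learn_one is called fit_one), with X and Y being the dictionaries above:

    from river import linear_model, preprocessing

    model = preprocessing.StandardScaler() | linear_model.LinearRegression()

    # Regroup X into one feature dict per (timestamp, ticker) pair
    features = {}
    for (ts, feature_name, ticker), value in X.items():
        features.setdefault((ts, ticker), {})[feature_name] = value

    # Iterate in timestamp order and interleave predictions with updates
    for key in sorted(features):
        if key in Y:  # learn only when the target is available
            x, y = features[key], Y[key]
            y_pred = model.predict_one(x)
            model.learn_one(x, y)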

    opened by andrewczgithub 34
  • Numpy import error with River

    Hi,

    I read through all previous issues to make sure I have the right versions. I have:

    • numpy 1.20.1
    • river 0.10.1

    However, I still got the "ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 232 from C header, got 216 from PyObject" error.

    What should I do? Thank you!

    opened by mai-n-coleman 32
  • Bandits regressors for model selection (new PR to use Github CI/CD)

    Description

    The PR introduces bandits (epsilon-greedy and UCB) for model selection (see issue #270). The PR concerns only regressors, but I can add the classifiers in a subsequent PR.

    The use of the classes is straightforward:

    bandit = UCBRegressor(models=models, metric=metrics.MSE())

    for x, y in data.take(N):
        y_pred = bandit.predict_one(x=x)
        bandit.learn_one(x=x, y=y)

    best_model = bandit.best_model

    There are convenience methods such as:

    • percentage_pulled: to get the percentage each arm was pulled
    • best_model: to return the model with the highest average reward

    Also, I added an add_models method so the user can add models on the fly.

    I am also working on a notebook that studies the behavior of the bandits for model selection. The notebook also includes Exp3, which seems promising but has numerical stability issues and yields counter-intuitive results (see section 3 of the NB). That's why I kept it out of this PR. More generally, the performance of UCB and epsilon-greedy is rather good, but there seems to be some variance in it.

    Improvements

    It's still WIP on the following points:

    • docstrings, mainly adding examples + cleaning.
    • some comments might be removed.
    • the names of the classes and methods are open to change.
    Feature 
    opened by etiennekintzler 25
  • Implement bandits

    We now have SuccessiveHalvingClassifier and SuccessiveHalvingRegressor in the model_selection module to perform, well, model selection. This allows doing hyperparameter tuning by initializing a model with different parameter configurations and running them against each other. All in all it seems to be working pretty well and we should be getting some feedback on it soon. The implementations are handy because they implement the fit_one/predict_one interface and therefore make the whole process transparent to users. In other words, you can use them as you would any other model. This design will go a long way and should make things simple in terms of deployment (I'm thinking of you, chantilly).

    The next step would be to implement multi-armed bandits. In progressive validation, all the remaining models are updated. This is called a "full feedback" situation. Bandits, on the other hand, use partial feedback, because only one model is picked and trained at a time. This is more efficient because it results in fewer model evaluations, but it might also converge more slowly. Most bandit algorithms assume that the performance of the models is constant through time (this is called "stationarity"). However, the performance of each model is bound to change through time, because the act of picking modifies the model. Therefore, ideally, we need to look into non-stationary bandit algorithms. Here are some more details and references.

    Here are the algorithms I would like to see implemented:

    Also see this paper. We can probably open separate issues for each implementation. I think that the current best way of proceeding is to provide one implementation for regression and one for classification in each case, much like what is done for successive halving. There's also this paper that discusses delayed feedback and how it affects bandits.
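
    To make the partial feedback idea concrete, here is a rough epsilon-greedy sketch (illustrative names, not a proposed API, and assuming a higher-is-better metric such as accuracy); only the picked arm is evaluated and updated:

    import random

    def epsilon_greedy_step(models, arm_metrics, x, y, epsilon=0.1):
        """Pick one arm, update only that model (partial feedback)."""
        if random.random() < epsilon:
            arm = random.randrange(len(models))  # explore
        else:
            # exploit: pick the arm whose running metric is best so far
            arm = max(range(len(models)), key=lambda i: arm_metrics[i].get())
        y_pred = models[arm].predict_one(x)
        arm_metrics[arm].update(y, y_pred)  # reward signal for this arm only
        models[arm].learn_one(x, y)
        return arm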

    Feature 
    opened by MaxHalford 24
  • Error when using OutputCodeClassifier for Code-size greater than 20

    Hello again,

    I was testing the OCC classifier with more than 90 classes and the accuracy is very poor. I assume I need a huge code size. However, I was testing different code sizes (starting with a code size of 10) and recording the accuracy when I came to a code size of 40 and received the following error: OverflowError: Python int too large to convert to C ssize_t. Is there a way we can modify the OCC classifier to allow for compact codes (as short as possible) while still providing enough discriminating power between the different classes?

    opened by Yasmen-Wahba 23
  • Adaptive Random Forest Regressor/Hoeffding Tree Regressor splitting gives AttributeError

    Versions

    • river version: 0.1.0
    • Python version: 3.7.1
    • Operating system: Windows 10 Enterprise v1803

    Describe the issue

    Have been playing around with River/Creme for a couple of weeks now, and it's super useful for a project I am currently working on. That being said, I'm still pretty new to the workings of the algorithms, so unsure whether this is a bug or an issue with my setup.

    When I call learn_one on the ARFR or HTR, I receive AttributeError: 'NoneType' object has no attribute '_left' from line 113 in best_evaluated_split_suggestion.py.

    I have implemented a delayed prequential evaluation algorithm, and inspecting the loop, the error seems to be thrown once the first delay period has been exceeded, i.e. when the model can first make a prediction that isn't zero. Before this point, learn_one doesn't throw this error.

    Currently, I am using the default ARFR as in the example, with a simple Linear Regression as the leaf_model. The linear regression model itself has been working with my data, when not used in the ARFR. I want to try the ensemble/tree models with the data to see if accuracy is improved due to the drift detection methods that are included.

    Has anyone also seen this error being thrown, or know the causes of it? Let me know if more information is needed! Thanks.
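
    For context, a hypothetical reconstruction of the setup described (not the reporter's exact code, and parameter names may differ across versions):

    from river import ensemble, linear_model, preprocessing

    model = preprocessing.StandardScaler() | ensemble.AdaptiveRandomForestRegressor(
        leaf_model=linear_model.LinearRegression(),
        seed=42,
    )

    for x, y in stream:  # `stream` yields (features, target) pairs with delayed labels
        y_pred = model.predict_one(x)
        model.learn_one(x, y)  # the reported AttributeError surfaces here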

    Bug 
    opened by JPalm1 23
  • Feature/l1 implementation

    Yo @MaxHalford check out this

    [Thanks to @gbolmier for inspiration]

    Addresses #618. I've decided to implement only cumulative L1 because:

    • naive L1 is just not efficient, plus it will likely take more hassle than it should (e.g. a sign method for VectorDict)
    • the truncated gradient approach, according to the paper, is overshadowed by cumulative L1, plus it takes 3 params to tune instead of 1 (no thanks, as if online model tuning is not tricky enough)

    Done some tests and showcase: https://gist.github.com/ColdTeapot273K/b7865bbfb9ad2e473b474c39c3d40413
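
    For readers unfamiliar with the cumulative approach, here is a rough sketch of the idea from the cumulative-penalty paper (Tsuruoka et al., 2009), with illustrative names and plain dicts rather than VectorDict:

    def apply_cumulative_l1(weights, applied, u):
        """Clip each weight toward zero by the cumulative L1 budget.

        weights: feature -> current weight (mutated in place)
        applied: feature -> total penalty applied to that weight so far
        u: total absolute penalty each weight could have received so far,
           i.e. the running sum of learning_rate * l1 over all updates
        """
        for i, w in weights.items():
            z = w
            q = applied.get(i, 0.0)
            if w > 0:
                weights[i] = max(0.0, w - (u + q))
            elif w < 0:
                weights[i] = min(0.0, w + (u - q))
            applied[i] = q + (weights[i] - z)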

    Bonus:

    • works with learn_many out of the box
    • appears to be better than scikit-learn's implementation of L1 on SGDRegressor (I think it uses truncated gradient under the hood?)

    As you might notice, this implementation implies that only one of L1 or L2 is used at a time. A small price to pay for the ability to finally do proper, optimal sparse feature selection.

    I got problems with vectorizing the penalty. I didn't see a graceful way of imitating boolean indexing in VectorDict, or at least a way to transfer from the numpy-restricted computation domain into VectorDict. So I decided to just leave the vectorized version as a dry-run thing with a test attached (see gist); maybe you'll have better ideas (@MaxHalford, you should have access to my forks).

    Cheers.

    opened by ColdTeapot273K 22
  • Using Creme model in Python 2

    Hi Guys,

    Suppose I trained a logistic regression classifier using Python 3.x, as Creme only supports Python 3. How can I use the model for classification in Python 2?

    opened by aamirkhan34 22
  • Question about centroid changes

    heyo! So this might be better suited for chantilly, but it's generally about river so I hope it's okay to put it here. I'm creating a server that has a cluster model on the backend to handle clustering types of errors. I was planning on using a cluster model and then assigning new errors (saved as objects in the database) to clusters, but I realize that if I have a preference for models with a changing number of centers, then my assignments would also likely change over time (and no longer be valid). This makes the idea of saving the cluster id with the error object not a great one. But then I was thinking: as long as we have some kind of ability for a model to output changes, e.g.:

    • Cluster 8 no longer exists
    • Clusters 10 and 12 were merged into 10
    • Cluster 5 was split into 5 and 101

    Then it could be feasible to act on those events, and in the case of the above:

    • Remove the assignment of any errors to cluster 8 (reclassify if it's computationally easy)
    • find objects assigned to 10 and 12, assign all to 10
    • Remove the assignment of 5, and reclassify if feasible

    So my question - how do we integrate this functionality into river, at least exposing enough for a wrapper of some kind to assess changes? Or is it just a bad design idea on my part?
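
    In case it helps the discussion, here is a rough sketch of what such a wrapper could look like (hypothetical names; it assumes the wrapped model exposes a centers dict, as river's cluster.KMeans does, and detecting merges/splits would need more than key diffing):

    class ChangeTrackingClusterer:
        def __init__(self, model):
            self.model = model

        def learn_one(self, x):
            before = set(self.model.centers)
            self.model.learn_one(x)
            after = set(self.model.centers)
            events = [("removed", c) for c in before - after]  # e.g. cluster 8 is gone
            events += [("created", c) for c in after - before]
            return events  # the caller reacts, e.g. by reassigning stored errors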

    opened by vsoch 21
  • Timeseries forecast evaluation

    2 matters here:

    1. after introducing the ThresholdFilter in the time_series pipeline, it is now hard to evaluate models with vectors that have anomalies. Is it possible to "pass through" the anomaly score (1/0) to the output of model.forecast()? That way one could skip the metric update for those anomalies when looping over the dataset.

    2. for a-posteriori evaluation of a time_series forecaster, I am trying to replay the dataset with a pre-trained model and see how that model would have performed at "reconstructing" the whole dataset. To do this I would simply need to call model.forecast() using values in the past, but is this possible? How should I call the forecast method? And more generally, what is the difference between horizon and xs in the forecast signature?
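
    For reference, a rough sketch of a replay loop with a pre-trained river time_series forecaster; as far as I understand the signature, horizon is the number of steps ahead to predict, while xs is an optional list with one exogenous-feature dict per step of the horizon:

    predictions = []
    for x, y in dataset:
        forecast = model.forecast(horizon=1, xs=[x])
        predictions.append(forecast[0])  # one-step-ahead reconstruction
        model.learn_one(y=y, x=x)        # omit this line for a pure frozen replay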

    opened by dberardo-com 18
  • venv caching for faster CI

    As discussed with @MaxHalford on Discord, caching the virtual environment allows us to completely bypass pip installs. The existing approach caches downloaded pip packages, but pip install takes a considerable amount of time even then.

    opened by boragokbakan 0
  • Onelearn implementation: AMF & Mondrian Tree Classifiers

    Description of the PR

    Hi! 👋

    This is a first version of Onelearn's library (classifiers only) implementation in River. It contains:

    • Mondrian Tree (Base and Classifier)
    • Aggregated Mondrian Forest (Base and Classifier)

    The original repository with proof of working implementation can be found here (see script.py).

    Known drawbacks of the current implementation

    • Management of labels: labels must be positive integers at the moment, since it just makes my life easier. Please tell me how you would prefer this to be implemented in River's framework (I've seen labels presented as dictionaries, but I'm not sure what these dictionaries contain: int? string?). Changing how labels are managed might be a simple trick for me, I just need to know your preferences.
    • Tree branches: I use branches mainly as the global tree structure; I'm not quite sure how to use them better. Feel free to comment on that too if you have better suggestions!
    • Examples of usage: MondrianTreeClassifier and AMFClassifier would need usage examples (a tentative sketch follows this list). I actually implemented one on my repo here already, but I didn't manage to compile River so I couldn't get the scores 😢
    • Random state: I have a random state attribute, but it's not used at the moment. I'm not exactly sure where to put the random state in the Mondrian process; I'm afraid it'd break the whole thing. If any expert in random forests could advise me on where the random state should be placed, that would be great ❤️
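
    For the examples point above, here is a tentative usage sketch (hypothetical constructor arguments; it assumes AMFClassifier follows river's Classifier interface):

    from river import metrics

    model = AMFClassifier(n_estimators=10, seed=1)  # hypothetical args
    metric = metrics.Accuracy()

    for x, y in stream:  # y as a positive integer label, per the note above
        y_pred = model.predict_one(x)
        if y_pred is not None:  # no prediction before the first update
            metric.update(y, y_pred)
        model.learn_one(x, y)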

    Notes on the utils

    Currently I placed two functions in the utils section:

    • sample_discrete
    • log_sum_2_exp

    It might seem overkill to place them in utils right now, looking at where they're used in the code, but I'll need them for the regressors too when the time comes. Maybe there's a better place for them, keeping the regressors in mind, though.

    opened by AlexandreChaussard 6
  • add xstream

    We are currently working on implementing PySAD in River, for the "polytechnique-project". This is a pull request adding the "xstream" method.

    Sophie Normand

    opened by Sophie-Normand 0
  • Refactor benchmarks

    Hi 👋,

    I spent some time on the benchmarks and tried to refactor them. I used Vega for visualising the performances. This requires the mkdocs-charts-plugin, which sometimes fails for the livedocs when reloading the documentation. Furthermore, the index.md in the benchmark folder might get pretty big at some point.

    As there are some drawbacks, I would like some feedback from you to see if I am on the right track.

    Best, Cedric

    Improvement 
    opened by kulbachcedric 6