A fast Python implementation of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize

Overview

funk-svd

funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize contest.

Numba is used to speed up our algorithm, enabling us to run over 10 times faster than Surprise's Cython implementation (cf. benchmark notebook).

| MovieLens 20M | RMSE | MAE  | Time          |
|---------------|------|------|---------------|
| Surprise      | 0.88 | 0.68 | 10 min 40 sec |
| funk-svd      | 0.88 | 0.68 | 42 sec        |

Installation

Run pip install git+https://github.com/gbolmier/funk-svd in your terminal.

Contributing

All contributions, bug reports, bug fixes, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributor guide.

Quick example

run_experiment.py:

>>> from funk_svd.dataset import fetch_ml_ratings
>>> from funk_svd import SVD

>>> from sklearn.metrics import mean_absolute_error


>>> df = fetch_ml_ratings(variant='100k')

>>> train = df.sample(frac=0.8, random_state=7)
>>> val = df.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
>>> test = df.drop(train.index.tolist()).drop(val.index.tolist())

>>> svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
...           early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

>>> svd.fit(X=train, X_val=val)
Preprocessing data...

Epoch 1/...

>>> pred = svd.predict(test)
>>> mae = mean_absolute_error(test['rating'], pred)

>>> print(f'Test MAE: {mae:.2f}')
Test MAE: 0.75

Funk SVD for recommendation in a nutshell

We have a huge sparse matrix storing the known ratings for a set of users and items.

The idea is to estimate unknown ratings by factorizing the rating matrix into two smaller matrices representing user and item characteristics.

We call these two matrices the user and item latent factors. By taking the dot product between them, we can reconstruct the rating matrix. The trick is that the previously empty entries now contain estimated ratings.
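As a minimal sketch of this reconstruction, the snippet below builds two small latent factor matrices with NumPy and multiplies them. The names `P`, `Q`, `R_hat` and the toy sizes are assumptions chosen for illustration; they are not taken from the funk-svd code base.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, n_factors = 4, 5, 2

# Hypothetical latent factor matrices: one row per user and one row per item.
P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors

# Dense reconstruction: entry (u, i) is the dot product of row u of P
# and row i of Q, so every user/item pair gets an estimate.
R_hat = P @ Q.T

# A single estimate can be computed without building the full matrix.
u, i = 2, 3
r_ui = P[u] @ Q[i]
```

In practice the full matrix is rarely materialized for large datasets; computing individual entries on demand, as in the last two lines, is usually enough.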

In order to get more accurate results, the global average rating as well as the user and item biases are used in addition:

where K stands for known ratings.
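The formula for these terms is not reproduced in this text. One common baseline formulation, given here as an assumption rather than as the library's exact definition, takes the global average over all known ratings and defines each bias as an average deviation from it. The symbols $K_u$ (items rated by user $u$) and $K_i$ (users who rated item $i$) are introduced for illustration only.

```latex
\mu = \frac{1}{|K|} \sum_{(u,i) \in K} r_{ui},
\qquad
b_u = \frac{1}{|K_u|} \sum_{i \in K_u} \left( r_{ui} - \mu \right),
\qquad
b_i = \frac{1}{|K_i|} \sum_{u \in K_i} \left( r_{ui} - \mu \right)
```

When the biases are learned by SGD, as described further down, closed forms like these serve at most as a starting point or a sanity check.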

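Then, we can estimate any rating by adding the global average, the user bias, the item bias, and the dot product of the user's and the item's latent factors.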
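Written out, and assuming $p_u$ and $q_i$ denote the latent factor vectors of user $u$ and item $i$ with $F$ factors each (notation chosen here, not taken from the original), the standard biased matrix factorization prediction reads:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + p_u^{\top} q_i
             = \mu + b_u + b_i + \sum_{f=1}^{F} p_{uf} \, q_{if}
```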

The learning step consists in performing the SGD algorithm where for each known rating the biases and latent factors are updated as follows:

where alpha is the learning rate and lambda is the regularization term.
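The update step can be sketched in plain NumPy as follows. This is an illustrative version of the standard regularized SGD updates for biased matrix factorization, not the library's Numba implementation; the function names `sgd_epoch` and `rmse`, the toy ratings, and the initialization are all assumptions. Here `lr` plays the role of alpha and `reg` the role of lambda.

```python
import numpy as np


def sgd_epoch(ratings, mu, bu, bi, P, Q, lr=0.001, reg=0.005):
    """Run one SGD pass over (user, item, rating) triples, updating in place."""
    for u, i, r in ratings:
        # Error of the current prediction for this known rating.
        err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])

        # Bias updates: b <- b + alpha * (err - lambda * b)
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])

        # Factor updates; keep a copy of p_u so q_i uses its pre-update value.
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * pu)
        Q[i] += lr * (err * pu - reg * Q[i])


def rmse(ratings, mu, bu, bi, P, Q):
    """Root mean squared error over the given known ratings."""
    errors = [r - (mu + bu[u] + bi[i] + P[u] @ Q[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


# Toy data: (user index, item index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 3, 3, 2
mu = float(np.mean([r for _, _, r in ratings]))
bu, bi = np.zeros(n_users), np.zeros(n_items)
P = rng.normal(scale=0.1, size=(n_users, n_factors))
Q = rng.normal(scale=0.1, size=(n_items, n_factors))

before = rmse(ratings, mu, bu, bi, P, Q)
for _ in range(200):
    sgd_epoch(ratings, mu, bu, bi, P, Q, lr=0.01, reg=0.005)
after = rmse(ratings, mu, bu, bi, P, Q)
```

The training error on the toy ratings drops between `before` and `after`. A pure Python loop like this is slow on real datasets, which is the kind of hot loop a JIT compiler such as Numba is meant to accelerate.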

References

License

Released under the MIT license.

Owner
Geoffrey Bolmier