High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

Last update: Jan 08, 2023

Overview

What is xLearn?

xLearn is a high-performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM), all of which can be used to solve large-scale machine learning problems. xLearn is especially useful for solving machine learning problems on large-scale sparse data. Many real-world datasets involve high-dimensional sparse feature vectors, for example a recommendation system where the number of categories and users is on the order of millions. If you are a user of liblinear, libfm, or libffm, xLearn is now another, better choice.
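To make the three model families concrete, here is a minimal pure-Python sketch of how LR, FM, and FFM score one sparse example. It follows the textbook formulas rather than xLearn's C++ implementation, which may differ in details; the function names and the dict-based parameter layout are assumptions made for illustration only.

```python
from itertools import combinations


def dot(a, b):
    """Inner product of two equal-length latent vectors."""
    return sum(p * q for p, q in zip(a, b))


def lr_score(x, bias, w):
    """Linear term. x is a list of (field, feature, value) triples;
    w maps feature id -> weight (hypothetical layout)."""
    return bias + sum(w.get(j, 0.0) * val for _, j, val in x)


def fm_score(x, bias, w, v):
    """FM: linear term plus <v_i, v_j> * x_i * x_j over all feature pairs.
    v maps feature id -> latent vector."""
    score = lr_score(x, bias, w)
    for (_, j1, x1), (_, j2, x2) in combinations(x, 2):
        score += dot(v[j1], v[j2]) * x1 * x2
    return score


def ffm_score(x, bias, w, v):
    """FFM: each feature keeps one latent vector per *field* of its partner.
    v maps (feature id, partner field) -> latent vector."""
    score = lr_score(x, bias, w)
    for (f1, j1, x1), (f2, j2, x2) in combinations(x, 2):
        score += dot(v[(j1, f2)], v[(j2, f1)]) * x1 * x2
    return score


# One example with two active features: feature 0 in field 0, feature 3 in field 1.
x = [(0, 0, 1.0), (1, 3, 1.0)]
w = {0: 0.5, 3: -0.25}
v_fm = {0: [1.0, 2.0], 3: [0.5, 0.5]}
v_ffm = {(0, 1): [1.0, 0.0], (3, 0): [2.0, 3.0]}

print(lr_score(x, 0.1, w))
print(fm_score(x, 0.1, w, v_fm))
print(ffm_score(x, 0.1, w, v_ffm))
```

The only difference between the FM and FFM functions is the lookup key for the latent vectors, which is why FFM needs more parameters on data with many fields.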

Get Started! (English)

Get Started! (Chinese)

Performance

xLearn is written in high-performance C++ with careful design and optimizations. Our system is designed to maximize CPU and memory utilization, provide cache-aware computation, and support lock-free learning. By combining these insights, xLearn is 5x-13x faster than similar systems.

Ease-of-use

xLearn does not rely on any third-party library; users can simply clone the code and compile it with cmake. xLearn also provides a very simple Python and CLI interface for data scientists, and offers many useful features that are widely used in machine learning and data mining competitions, such as cross-validation and early-stop.
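As a rough illustration of the Python interface, the sketch below trains an FFM model with a validation set (which is what early-stop monitors), writes predictions, and runs cross-validation. The calls and parameter keys follow the xLearn Python documentation as best recalled; the file paths are hypothetical placeholders, and the import is done lazily so the helper functions can be defined without xLearn installed.

```python
def make_params():
    """Hyper-parameters are passed to xLearn as a plain dict."""
    return {
        "task": "binary",   # binary classification
        "lr": 0.2,          # learning rate
        "lambda": 0.002,    # regularization strength
        "metric": "acc",    # metric reported on the validation set
        "epoch": 10,
        "stop_window": 3,   # early-stop window
    }


def train_and_predict(train_path, valid_path, test_path, model_path, output_path):
    """Train on libffm-format text files and write predictions (paths are placeholders)."""
    import xlearn as xl  # lazy import: requires xLearn to be installed

    model = xl.create_ffm()
    model.setTrain(train_path)
    model.setValidate(valid_path)      # validation data drives early-stop
    model.fit(make_params(), model_path)

    model.setTest(test_path)
    model.setSigmoid()                 # map raw scores into (0, 1)
    model.predict(model_path, output_path)


def cross_validate(train_path, folds=5):
    """Run k-fold cross-validation on the training file."""
    import xlearn as xl

    model = xl.create_ffm()
    model.setTrain(train_path)
    model.cv(dict(make_params(), fold=folds))
```

A call such as train_and_predict("train.txt", "valid.txt", "test.txt", "model.out", "output.txt") would be the typical entry point, assuming those files exist.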

Scalability

xLearn can be used for solving large-scale machine learning problems. First, xLearn supports out-of-core training, which can handle very large data (TB scale) using just the disk of a single PC. In addition, xLearn supports distributed training, which scales beyond billions of examples across many machines by using the Parameter Server framework.
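The idea behind out-of-core training can be sketched in a few lines: read the data in fixed-size blocks so that only one block is in memory at a time, and update the model as each block streams past. This is a simplified illustration using logistic regression and libsvm-style lines; it is not xLearn's actual reader, and the function names and block size are assumptions.

```python
import io
import math
from itertools import islice


def read_blocks(fileobj, block_size):
    """Yield lists of at most block_size parsed examples from a libsvm-style stream."""
    while True:
        lines = list(islice(fileobj, block_size))
        if not lines:
            return
        block = []
        for line in lines:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            label = float(parts[0])
            feats = [(int(i), float(v)) for i, v in (p.split(":") for p in parts[1:])]
            block.append((label, feats))
        yield block


def sgd_epoch(fileobj, w, lr=0.1, block_size=2):
    """One streaming pass of logistic-regression SGD; labels are 0 or 1."""
    for block in read_blocks(fileobj, block_size):
        for y, feats in block:
            z = sum(w.get(i, 0.0) * v for i, v in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            for i, v in feats:
                w[i] = w.get(i, 0.0) - lr * (p - y) * v
    return w


# An in-memory stream stands in for a large file on disk.
data = io.StringIO("1 0:1 2:1\n0 1:1 2:1\n1 0:1 3:1\n")
weights = {}
for _ in range(20):
    data.seek(0)          # rewind for the next epoch
    sgd_epoch(data, weights)
print(weights)
```

With a real file the same loop works unchanged, because the file object is consumed lazily line by line.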

How to Contribute

xLearn has been developed and used by many active community members. Your help is very valuable in making it better for everyone.

Please contribute if you find any bug in xLearn.
Contribute new features you want to see in xLearn.
Contribute to the tests to make it more reliable.
Contribute to the documents to make them clearer for everyone.
Contribute to the examples to share your experience with other users.
Open an issue if you run into problems during development.

Please post issues and contributions in English so that everyone can benefit from them.

Contributors (rank randomly)

For Enterprise Users and Call for Sponsors

If you are an enterprise user and find xLearn useful in your work, please let us know, and we will be glad to add your company logo here. We also welcome you to become a sponsor to make this project better.

What's New

2019-10-13 Andrew Kane added Ruby bindings for xLearn!
2019-4-25 xLearn 0.4.4 version release. Main update:
- Support Python DMatrix
- Better Windows support
- Fix bugs in previous version
2019-3-25 xLearn 0.4.3 version release. Main update:
- Fix bugs in previous version
2019-3-12 xLearn 0.4.2 version release. Main update:
- Release Windows version of xLearn
2019-1-30 xLearn 0.4.1 version release. Main update:
- More flexible data reader
2018-11-22 xLearn 0.4.0 version release. Main update:
- Fix bugs in previous version
- Add online learning for xLearn
2018-11-10 xLearn 0.3.8 version release. Main update:
- Fix bugs in previous version.
- Update early-stop mechanism.
2018-11-08 xLearn reaches 2,000 stars! Congrats!
2018-10-29 xLearn 0.3.7 version release. Main update:
- Add incremental Reader, which can save 50% memory cost.
2018-10-22 xLearn 0.3.5 version release. Main update:
- Fix bugs in 0.3.4.
2018-10-21 xLearn 0.3.4 version release. Main update:
- Fix bugs in on-disk training.
- Support new file format.
2018-10-14 xLearn 0.3.3 version release. Main update:
- Fix segmentation fault in prediction task.
- Update early-stop mechanism.
2018-09-21 xLearn 0.3.2 version release. Main update:
- Fix bugs in previous version
- New TXT format for model output
2018-09-08 xLearn uses a new logo.
2018-09-07 The Chinese document is available now!
2018-03-08 xLearn 0.3.0 version release. Main update:
- Fix bugs in previous version
- Solved the memory leak problem for on-disk learning
- Support TXT model checkpoint
- Support Scikit-Learn API
2017-12-18 xLearn 0.2.0 version release. Main update:
- Fix bugs in previous version
- Support pip installation
- New Documents
- Faster FTRL algorithm
2017-11-24 The first version (0.1.0) of xLearn is released!

Comments

Different results at each run

I am not sure if this is expected, but the following example predicts different results each time I run it:

import numpy as np
import xlearn as xl
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


iris_data = load_iris()
X = iris_data['data']
y = (iris_data['target'] == 2)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

linear_model = xl.LRModel(task='binary', init=0.1,
                          epoch=10, lr=0.1,
                          reg_lambda=1.0, opt='sgd')


linear_model.fit(X_train, y_train,
                 eval_set=[X_val, y_val],
                 is_lock_free=False)


y_pred = linear_model.predict(X_val)

This is version 0.4.2, run via the sklearn API on Windows: y_pred is never the same. Is that expected?

Thank you

opened by octoscore 19

cmake version > 3.0 ?

When I install from pip, I get "Exception: Please install CMake first", but my CentOS has cmake 2.8.12.2.

When I build from source with ./build.sh, I get: CMake Error at CMakeLists.txt:24 (cmake_minimum_required): CMake 3.0 or higher is required. You are running version 2.8.12.2

Is there a simple way to install xlearn, or a simple way to update cmake to 3.0?

opened by xxyy1 11
initial implementation of sklearn interface

Please see discussion in https://github.com/aksnzhy/xlearn/issues/68. Currently, the sklearn interface converts numpy array to libsvm format internally with the use of temporary files. Further improvement could be done once the in-memory conversion is ready.
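The temporary-file conversion described above can be pictured roughly as follows. This is a pure-Python sketch of the general idea, not the code from the pull request: the function names are hypothetical, plain nested lists stand in for a numpy array, and zero-based column indices are an assumption.

```python
import os
import tempfile


def to_libsvm_lines(X, y):
    """Convert dense rows to libsvm text lines, skipping zero entries."""
    lines = []
    for row, label in zip(X, y):
        feats = " ".join(f"{j}:{float(v)!r}" for j, v in enumerate(row) if v != 0)
        lines.append(f"{int(label)} {feats}".rstrip())
    return lines


def write_temp_libsvm(X, y):
    """Write the converted data to a temporary file and return its path.
    The caller is responsible for deleting the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=".libsvm", text=True)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(to_libsvm_lines(X, y)) + "\n")
    return path


X = [[5.1, 0.0, 1.4], [0.0, 3.0, 0.0]]
y = [True, False]

path = write_temp_libsvm(X, y)
with open(path) as f:
    content = f.read()
os.remove(path)  # clean up the temporary file
print(content)
```

Writing to disk and reading back costs time and space on every fit call, which is why an in-memory conversion is the natural follow-up.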

opened by randxie 10
run_example failed in MacOS

I installed a new version with build.sh on macOS (Xcode and CMake installed first). When running run_example.sh, it aborted with the error message:

MacOS: Mojave 10.14.4 xlearn: 0.43

(The pip-installed version 0.40a works on my computer, but has a bug that was just fixed in the latest version.)

opened by cuiwow 9
Application in recommender system
Hi, aksnzhy! Thank you for this library. Can you please guide me a bit? I have a dataset with four columns: transaction_count, user, item, item_colour. I want to recommend some items to users based on transaction_count. I can use ALS with the transaction_count, user and item columns, for example with the "implicit" library. But if I want to take item_colour into account, I need to use, for example, FFM. So I create an FFM-formatted file:

transaction_count user_id:value_id:1 item_id:value_id:1 item_colour_id:value_id:1
5 0:0:1 1:3:1 2:5:1
3 0:1:1 1:4:1 2:6:1
8 0:2:1 1:3:1 2:7:1

and train my model. But if I want to recommend the top 5 items with some colours to a user, I need to create all combinations of user:item:colour rows, score them, then for each user sort all item:colour variants by modeled probability and select the 5 best. The problem is that such a list of all possible combinations explodes with my dimensions (users=80000, items=14000, colours=5) and becomes impossible to handle. Is there any trick for implementing this?
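One way to avoid materializing the full user:item:colour table is to process one user at a time and keep only the best k candidates in a small heap. The sketch below shows that pattern; the score function is a hypothetical stand-in for a trained model, and the approach only bounds memory, not the total number of scoring calls.

```python
import heapq
from itertools import product


def top_k_for_user(user, items, colours, score, k=5):
    """Score every (item, colour) pair for one user, keeping only the k best.
    product() is lazy, so the candidate list is never stored in full."""
    candidates = product(items, colours)
    return heapq.nlargest(k, candidates, key=lambda ic: score(user, ic[0], ic[1]))


def toy_score(user, item, colour):
    """Stand-in scorer for illustration only; a real one would call the model."""
    return -abs(item - user) - 0.1 * colour


top = top_k_for_user(3, range(10), range(2), toy_score, k=3)
print(top)
```

To cut the number of scoring calls as well, the candidate set per user would need to be narrowed first, for example to popular items or to the output of a cheaper model such as ALS.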
opened by Tych0n 9
xlearn for windows
xLearn for Windows

Built with VS2017 (x64); passes all existing tests on Windows 10. The unix-like version of xLearn was then checked on WSL. Some information about this pull request:

Finished:

xLearn for Windows, including the CLI and the Python package

Unfinished:

Test the unit tests with pthread.

Compile an x86 release version (fm_score_test does not yet pass for the x86 release build).

Supplement:

I could not find what changed in the latest update to common.h, so I based my modifications on the version before the latest update. Some files have formatting changes only.
opened by etveritas 9

segmentation fault!! No other failed messages.

Thanks for your wonderful work! Here is my failure message.

I don't have any other information to locate the error. Could you help me?

I ran the same code three times and got the same error.

My environment is Python 3.5.2, Ubuntu 16.04. The RAM is 32GB.

Here are my error messages:

[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (../data/cfg/ffm_train.txt.bin) found. Skip converting text to binary.
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (../data/cfg/ffm_valid.txt.bin) found. Skip converting text to binary.
[------------] Number of Feature: 32870706
[------------] Number of Field: 50
[------------] Time cost for reading problem: 38.60 (sec)
[ ACTION     ] Initialize model ...
[1]    26156 segmentation fault (core dumped)  python3 train_xlearn.py

opened by fuxuemingzhu 9

max feature count in one line

It seems there is a limit on the number of feature ids in one line of a sample in the training file. Is the max size 10000?

[ ACTION ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (train.problem.5.bin) NOT found. Convert text file to binary file.

opened by HyperGroups 8
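Question about the parameters in the model.txt saved by ffm_model.setTXTModel

For example: does the 1025805 in v_1025805_41 correspond to the feature id in the input features?

Feature ids that do not appear in the input samples should not be counted, right? In that case, wouldn't my feature id encoding differ from the ids in model.txt? (The feature ids of my input samples are encoded over all features, and some of them may never appear in the samples.)

For example, with the following feature encoding of the input samples (libffm format):

1 1:1:1 2:3:1
0 1:1:1 2:1:1

if field 2 has three features 1, 2, and 3 but only 1 and 3 appear in the samples, will the v_featureid_fields entries in model.txt include feature 2 of field 2?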
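To check which ids actually occur in a libffm file before comparing them with model.txt, a small parser is enough. The sketch below reads lines of the form label field:feature:value and collects the (field, feature) pairs that were seen; it says nothing about how xLearn sizes or names its model entries, and the function names are assumptions.

```python
def parse_libffm_line(line):
    """Parse 'label field:feature:value ...' into (label, [(field, feature, value), ...])."""
    parts = line.split()
    label = float(parts[0])
    triples = []
    for token in parts[1:]:
        field, feature, value = token.split(":")
        triples.append((int(field), int(feature), float(value)))
    return label, triples


def observed_pairs(lines):
    """Collect every (field, feature) pair that occurs in the data."""
    seen = set()
    for line in lines:
        _, triples = parse_libffm_line(line)
        seen.update((field, feature) for field, feature, _ in triples)
    return seen


# The two sample lines from the question.
sample = ["1 1:1:1 2:3:1", "0 1:1:1 2:1:1"]
seen = observed_pairs(sample)
print(sorted(seen))
print((2, 2) in seen)  # feature 2 of field 2 never appears in the samples
```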

opened by fly12357 7
L2 regularization seems to be duplicated for FTRL optimization

L2 regularization seems to be applied twice for FTRL optimization. Take LR as an example: the proximal operator in FTRL already covers L2 regularization, so the separate L2 term seems redundant. FM and FFM have a similar problem.
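For reference, the textbook per-coordinate FTRL-Proximal update shows where L2 enters the closed-form solution. This is a sketch of the standard algorithm, not xLearn's code; the parameter names alpha, beta, l1, and l2 are the conventional ones and are assumptions here.

```python
import math


def ftrl_weight(z, n, alpha, beta, l1, l2):
    """Closed-form weight for one coordinate from the FTRL-Proximal state (z, n).
    l1 produces sparsity via the threshold; l2 appears in the denominator."""
    if abs(z) <= l1:
        return 0.0
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2)


def ftrl_update(z, n, grad, alpha, beta, l1, l2):
    """Fold one loss gradient into the state and return the new (z, n).
    grad is the gradient of the loss only, with no L2 term added."""
    w = ftrl_weight(z, n, alpha, beta, l1, l2)
    sigma = (math.sqrt(n + grad * grad) - math.sqrt(n)) / alpha
    return z + grad - sigma * w, n + grad * grad


z, n = ftrl_update(0.0, 0.0, grad=1.0, alpha=1.0, beta=1.0, l1=0.0, l2=0.0)
print(ftrl_weight(z, n, alpha=1.0, beta=1.0, l1=0.0, l2=0.0))
print(ftrl_weight(z, n, alpha=1.0, beta=1.0, l1=0.0, l2=1.0))  # shrunk by l2
```

Because l2 already shrinks the weight in ftrl_weight, adding an extra l2 * w term to grad before calling ftrl_update would penalize the weight twice, which is the concern raised in this issue.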

opened by matricer 7
python setup.py install reports an error

First, I compiled successfully with Visual Studio 2017 in the windows folder. Then I went into xlearn\python-package, opened a command prompt, and ran python setup.py install, which failed with the following error:

Traceback (most recent call last):
File "setup.py", line 29, in LIB_PATH = os.path.relpath(libfile, CURRENT_DIR) for libfile in libpa nd_lib_path']
File "xlearn/libpath.py", line 63, in find_lib_path 'Cannot find xlearn Library in the candidate path'
XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

I see other users have run into this problem as well. How can it be solved?

opened by huichengxiao 6
Failed to convert feature matrix X and label y to xlearn data format

I got an error when I ran the "example_FM_wine.py" file.

Failed to convert feature matrix X and label y to xlearn data format
During handling of the above exception, another exception occurred:
File "D:\Src\example_FM_wine.py", line 50, in fm_model.fit(X_train,

The same error occurs in the "example_LR_iris.py" file. The error does not occur in the "example_FFM_criteo.py" file (because that file passes a txt file as a parameter when calling the fit function).

How do I fix it? Thanks a lot.

opened by dudududa007 0
Library crashing when using it in an HDFS environment

In the model training step, specifically at the "fit" call, the command below crashes the kernel.

train and test are txt files read from HDFS.

train = train.decode()
test = test.decode()
ffm_model = xl.create_ffm()
ffm_model.setTrain(train)
ffm_model.fit(param,str)

where str is the HDFS path to our production folder.

I am sorry if the issue is badly explained or lacks context; this is the first time I have reported an issue to a library's owner.

What have we tried?

import subprocess
cat = subprocess.Popen(["hadoop", "fs", "-cat", str], stdout=subprocess.PIPE)
ffm_model.fit(param,cat)

The file specified in str already exists but is empty.

opened by diegomaca 0
add new functions

At present, the metrics calculated by the algorithm are only written to the log, so their values cannot be retrieved programmatically. I therefore added calculations of logloss, acc, and prec. In doing so, I found that logloss needs prediction values between 0 and 1, while acc and prec need label values of 0 and 1, so the setSigmoid and setSign methods have to be called. However, when they are called one after the other, the second call does not override the earlier setting but adds a new one. This change therefore adds two methods, disableSigmoid and disableSign.
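The distinction drawn above, probabilities for logloss versus hard 0/1 labels for acc and prec, can be illustrated with a few plain-Python helpers. This is a sketch of the standard metric definitions, not the code from this pull request, and all function names are assumptions.

```python
import math


def sigmoid(score):
    """Map a raw model score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood; needs probabilities, not hard labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def accuracy(labels, preds):
    """Fraction of hard 0/1 predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def precision(labels, preds):
    """True positives divided by predicted positives (0.0 if none predicted)."""
    predicted_pos = sum(preds)
    true_pos = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    return true_pos / predicted_pos if predicted_pos else 0.0


raw = [2.0, -1.0, 0.5, -3.0]      # raw scores from some model
labels = [1, 0, 0, 0]

probs = [sigmoid(s) for s in raw]           # what logloss consumes
preds = [1 if s > 0 else 0 for s in raw]    # sign-style hard predictions

print(log_loss(labels, probs))
print(accuracy(labels, preds))
print(precision(labels, preds))
```

Since the two output modes feed different metrics, being able to switch one off before enabling the other avoids mixing them in a single prediction run.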

opened by mmarzl17 0

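How does xlearn process the dataset before calling the C++ API "Train(vector, vector)"?

Hello. When using xlearn, we pass the data X to the Python fit interface, and Python calls down layer by layer until it reaches the underlying C++ train interface. As far as I can see, that C++ interface uses the reader data type, so some data processing must happen on the way from the Python side to the C++ side. I am not familiar with how Python calls into C++. How can I see this data processing, and which file contains it? Finally, in which file are functions such as "_LIB.XLearnSetTrain" defined? I cannot find it. I would be very grateful for a reply. Thanks!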

opened by jason1894 0
xlearn was installed on windows anaconda successfully but is not working

xlearn was installed on Windows Anaconda successfully but is not working. The error is:

import xlearn as xl
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\sndr\anaconda3_Apr_2020\lib\site-packages\xlearn\__init__.py", line 18, in from .xlearn import *
File "C:\Users\sndr\anaconda3_Apr_2020\lib\site-packages\xlearn\xlearn.py", line 19, in from .base import _LIB, XLearnHandle
File "C:\Users\sndr\anaconda3_Apr_2020\lib\site-packages\xlearn\base.py", line 34, in _LIB = _load_lib()
File "C:\Users\sndr\anaconda3_Apr_2020\lib\site-packages\xlearn\base.py", line 27, in _load_lib lib_path = find_lib_path()
File "C:\Users\sndr\anaconda3_Apr_2020\lib\site-packages\xlearn\libpath.py", line 59, in find_lib_path 'Cannot find xlearn Library in the candidate path'
xlearn.libpath.XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

opened by Sandy4321 1
The xlearn Python API's .predict method doesn't kill the threads it creates after execution, which leads to resource exhaustion.

I was getting a strange "resource exhausted" bug after running the xlearn FM model's predict method for a while.

When I profiled the processes with htop, I noticed that the number of threads increases by 8 each time model.predict("model/model.out", f"output/output.txt") is invoked, which leads to resource exhaustion once the number of threads reaches a critical level.

One workaround I found is to invoke model.predict in a separate process via the multiprocessing module; however, this is extremely slow when model.predict needs to be invoked many times.

Is there a way to kill the created threads after the execution of the predict method has completed?

opened by HovhannesManushyan 2

Releases(v0.4.4)

v0.4.4(Apr 25, 2019)
Main update:
- Support Python DMatrix.
- Better Windows support.
xlearn-0.4.4-py2.py3-none-manylinux1_x86_64.whl(221.03 KB)
xlearn-0.4.4-py2.py3-none-win_amd64.whl(365.72 KB)
xlearn-0.4.4.tar.gz(4.91 MB)

v0.4.3(Mar 25, 2019)
Main update:
- Fix bugs in previous version.
- Provide binary Python package on Windows. Supports Python (x64) in these versions: py2.7, py3.4, py3.5, py3.6, py3.7.
xlearn-0.4.3-py2.py3-none-win_amd64.whl(365.72 KB)

v0.4.2(Mar 12, 2019)
Main update:
- Release Windows version of xLearn.

v0.4.1(Jan 30, 2019)
Main update:
- More flexible data reader.

v0.4.0(Nov 22, 2018)
- Fix bugs in previous version.
- Add online learning for xLearn.

v0.3.8(Nov 10, 2018)
Main update:
- Fix bugs in previous version.
- Update early-stop mechanism.

v0.3.7(Oct 29, 2018)
Main update:
- Fix bugs in 0.3.6.

v0.3.6(Oct 29, 2018)
Main update:
- Add incremental Reader, which can save 50% memory cost.

v(Oct 29, 2018)

0.3.5(Oct 22, 2018)
- Fix bugs in 0.3.4 version.

v0.3.4(Oct 21, 2018)
Main update:
- Fix bugs in on-disk training in previous version.
- Support new file format.

0.3.3(Oct 14, 2018)
Main update:
- Solve segmentation fault in prediction.
- Update early-stop.

0.3.2(Sep 22, 2018)
- Fix bugs in previous version and re-design the TXT model output format.

v0.3.1(Mar 9, 2018)
- Fix the memory leak bug in xLearn 0.3.0.

v0.3.0(Mar 9, 2018)

Owner

Chao Ma

I focus on distributed systems and large-scale machine learning.

GitHub Repository https://xlearn-doc.readthedocs.io/en/latest/index.html
