sk-dist: Distributed scikit-learn meta-estimators in PySpark
What is it?
sk-dist is a Python package for machine learning built on top of scikit-learn and is distributed under the Apache 2.0 software license. The sk-dist module can be thought of as "distributed scikit-learn", as its core functionality is to extend the scikit-learn built-in joblib parallelization of meta-estimator training to Spark. A popular use case is the parallelization of grid search.
Check out the blog post for more information on the motivation and use cases of sk-dist.
Main Features
- Distributed Training - sk-dist parallelizes the training of scikit-learn meta-estimators with PySpark. This allows distributed training of these estimators without any constraint on the physical resources of any one machine. In all cases, Spark artifacts are automatically stripped from the fitted estimator. These estimators can then be pickled and un-pickled for prediction tasks, operating identically at predict time to their scikit-learn counterparts. Supported tasks are:
  - Grid Search: Hyperparameter optimization techniques, particularly GridSearchCV and RandomizedSearchCV, are distributed such that each parameter set candidate is trained in parallel.
  - Multiclass Strategies: Multiclass classification strategies, particularly OneVsRestClassifier and OneVsOneClassifier, are distributed such that each binary problem is trained in parallel.
  - Tree Ensembles: Decision tree ensembles for classification and regression, particularly RandomForest and ExtraTrees, are distributed such that each tree is trained in parallel.
- Distributed Prediction - sk-dist provides a prediction module which builds vectorized UDFs for PySpark DataFrames using fitted scikit-learn estimators. This distributes the predict and predict_proba methods of scikit-learn estimators, enabling large scale prediction with scikit-learn.
- Feature Encoding - sk-dist provides a flexible feature encoding utility called Encoderizer which encodes mix-typed feature spaces using either default behavior or user-defined customizable settings. It is particularly aimed at text features, but it additionally handles numeric and dictionary type feature spaces.
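To make the Distributed Training idea concrete, here is a minimal standard-library sketch of why grid search parallelizes so naturally: every parameter candidate is an independent fit. This is not sk-dist's implementation. The names expand_grid, fit_and_score and parallel_grid_search are hypothetical, the scoring rule is a toy stand-in for a cross-validated fit, and a local thread pool plays the role that Spark executors would play in a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Expand {"name": [values, ...]} into a list of candidate dicts."""
    names = sorted(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]


def fit_and_score(params):
    """Toy stand-in for fitting one candidate with cross-validation.

    Assumption for illustration only: the score peaks at C=1.0, depth=3.
    """
    return -abs(params["C"] - 1.0) - abs(params["depth"] - 3)


def parallel_grid_search(param_grid, max_workers=4):
    """Score every candidate in parallel and return the best one."""
    candidates = expand_grid(param_grid)
    # Candidates do not depend on each other, so they can be mapped over
    # any pool of workers (threads here, cluster executors in principle).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(fit_and_score, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


if __name__ == "__main__":
    grid = {"C": [0.1, 1.0, 10.0], "depth": [2, 3, 4]}
    best_params, best_score = parallel_grid_search(grid)
    print(best_params, best_score)
```

The same shape applies to the other supported tasks: one task per binary problem for the multiclass strategies, and one task per tree for the ensembles.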
Installation
Dependencies
sk-dist requires:
Dependency Notes
- Versions of numpy, scipy and joblib that are compatible with any supported version of scikit-learn should be sufficient for sk-dist.
- sk-dist is not supported with Python 2.
Spark Dependencies
Most sk-dist functionality requires a Spark installation as well as PySpark. Some functionality can run without Spark, so Spark-related dependencies are not strictly required. The connection between sk-dist and Spark relies solely on a sparkContext passed as an argument to various sk-dist classes upon instantiation.
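Because Spark is optional, one way to write code that works in both settings is to obtain a sparkContext only when PySpark is importable. The helper below is a sketch under that assumption. The function name get_spark_context and the application name are hypothetical and not part of sk-dist, while SparkSession.builder.appName(...).getOrCreate() and the sparkContext attribute are standard PySpark.

```python
def get_spark_context(app_name="sk-dist-example"):
    """Return a SparkContext if PySpark is installed, otherwise None.

    Hypothetical helper for illustration; sk-dist does not ship it.
    """
    try:
        # Imported lazily so this module still loads without PySpark.
        from pyspark.sql import SparkSession
    except ImportError:
        return None
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark.sparkContext
```

The returned object is what would be handed to an sk-dist class at instantiation. The project's examples appear to name that argument sc, but check the class signature for the version you have installed.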
A variety of Spark configurations and setups will work. It is left up to the user to configure their own Spark setup. The testing suite runs Spark 2.4 and Spark 3.0, though any Spark 2.0+ versions are expected to work.
An additional Spark-related dependency is pyarrow, which is used only for skdist.predict functions. These use vectorized pandas UDFs, which require pyarrow>=0.8.0 and are tested with pyarrow==0.16.0. Depending on the Spark version, it may be necessary to set spark.conf.set("spark.sql.execution.arrow.enabled", "true") in the Spark configuration.
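The Arrow setting can be applied from Python once a session exists. The sketch below wraps the call in two hypothetical helpers, arrow_config_key and enable_arrow, neither of which is part of sk-dist. It also assumes the Spark 3.0 rename of the property to spark.sql.execution.arrow.pyspark.enabled (the older key is deprecated there), so it picks the key from the session's version string.

```python
def arrow_config_key(spark_version):
    """Pick the Arrow property name for a Spark version string like "2.4.5"."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        # Spark 3.0 renamed the property; the old key is deprecated there.
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def enable_arrow(spark):
    """Enable Arrow-based pandas UDF execution on a SparkSession-like object.

    Only spark.version and spark.conf.set(key, value) are used.
    Returns the key that was set.
    """
    key = arrow_config_key(spark.version)
    spark.conf.set(key, "true")
    return key
```

Calling enable_arrow(spark) on a live SparkSession before building prediction UDFs should be enough. Whether it is needed at all depends on the Spark version and its defaults.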
User Installation
The easiest way to install sk-dist is with pip:
pip install --upgrade sk-dist
You can also download the source code:
git clone https://github.com/Ibotta/sk-dist.git
Testing
With pytest installed, you can run tests locally:
pytest sk-dist
Examples
The package contains numerous examples of how to use sk-dist in practice. Examples of note are:
- Grid Search with XGBoost
- Spark ML Benchmark Comparison
- Encoderizer with 20 Newsgroups
- One-Vs-Rest vs One-Vs-One
- Large Scale Sklearn Prediction with PySpark UDFs
Gradient Boosting
sk-dist has been tested with a number of popular gradient boosting packages that conform to the scikit-learn API. This includes xgboost and catboost. These will need to be installed in addition to sk-dist on all nodes of the Spark cluster via a node bootstrap script. Version compatibility is left up to the user.
Support for lightgbm is not guaranteed, as it requires additional installations on all nodes of the Spark cluster. This may work given proper installation but has not been tested with sk-dist.
Background
The project was started at Ibotta Inc. on the machine learning team and open sourced in 2019.
It is currently maintained by the machine learning team at Ibotta. Special thanks to those who contributed to sk-dist while it was initially in development at Ibotta:
Thanks to James Foley for logo artwork.