lightweight python wrapper for vowpal wabbit

Last update: Nov 24, 2022

Overview

vowpal_porpoise

Lightweight python wrapper for vowpal_wabbit.

Why: Scalable, blazingly fast machine learning.

Install

1. Install vowpal_wabbit: clone the repository and run make.
2. Install cython: pip install cython
3. Clone vowpal_porpoise.
4. Run: python setup.py install

Now you can do: import vowpal_porpoise from python.
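Because vowpal_porpoise drives the vw binary as a subprocess, a quick sanity check after installing is to confirm that an executable named vw is on the PATH. This is a minimal sketch, assuming Python 3.3+ for shutil.which; the helper name find_vw is hypothetical and not part of vowpal_porpoise:

```python
import shutil


def find_vw(binary_name="vw"):
    """Return the full path of the vw executable, or None if it is not on PATH."""
    return shutil.which(binary_name)
```

Calling find_vw() returns None when no vw executable is reachable, in which case launching the subprocess would likely fail as well.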

Examples

Standard Interface

Linear regression with l1 penalty:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test',    # a name for the model
        passes=10,         # vw arg: passes
        loss='quadratic',  # vw arg: loss
        learning_rate=10,  # vw arg: learning_rate
        l1=0.01)           # vw arg: l1

# Inside the with training() block a vw process will be 
# open to communication
with vw.training():
    for instance in ['1 |big red square',\
                      '0 |small blue circle']:
        vw.push_instance(instance)

    # here stdin will close
# here the vw process will have finished

# Inside the with predicting() block we can stream instances and 
# acquire their labels
with vw.predicting():
    for instance in ['1 |large burnt sienna rhombus',\
                      '0 |little teal oval']:
        vw.push_instance(instance)

# Read the predictions like this:
predictions = list(vw.read_predictions_())
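The instances above are plain strings in the vw text format, where whitespace separates features, '|' starts a namespace and ':' separates a feature from its value. When features come from free text, those characters need to be replaced before a token is used as a feature name. A minimal sketch; to_vw_line is a hypothetical helper, not part of vowpal_porpoise:

```python
import re

# Characters with special meaning in the vw text format: whitespace separates
# features, '|' starts a namespace, and ':' separates a feature from its value.
_SPECIAL = re.compile(r"[\s|:]+")


def to_vw_line(label, tokens):
    """Build a line such as '1 |big red square' from a label and string tokens."""
    clean = [_SPECIAL.sub("_", str(tok)) for tok in tokens]
    clean = [tok for tok in clean if tok]  # drop tokens that were empty
    return "%s |%s" % (label, " ".join(clean))
```

The result can be passed to vw.push_instance() in the same way as the hand-written strings.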

L-BFGS with a rank-5 approximation:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test_lbfgs', # a name for the model
        passes=10,            # vw arg: passes
        lbfgs=True,           # turn on lbfgs
        mem=5)                # lbfgs rank

Latent Dirichlet Allocation with 100 topics:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test_lda',  # a name for the model
        passes=10,           # vw arg: passes
        lda=100,             # turn on lda
        minibatch=100)       # set the minibatch size

Scikit-learn Interface

vowpal_porpoise also ships with an interface into scikit-learn, which allows awesome experiment-level stuff like cross-validation:

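# Written against older scikit-learn releases: sklearn.cross_validation and
# sklearn.grid_search were removed in scikit-learn 0.20.
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score
from vowpal_porpoise.sklearn import VW_Classifier

# X_train, y_train and the parameters grid are assumed to be defined already;
# see example_sklearn.py for a complete script.
GridSearchCV(
        VW_Classifier(loss='logistic', moniker='example_sklearn',
                      passes=10, silent=True, learning_rate=10),
        param_grid=parameters,
        score_func=f1_score,
        cv=StratifiedKFold(y_train),
).fit(X_train, y_train)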

Check out example_sklearn.py for more details
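The debug output quoted in the comments on this page shows instances of the form '2 | 42:2.000000 29:16.000000 ...' reaching vw from the scikit-learn wrapper, i.e. a label followed by index:value pairs. A minimal sketch of producing such a line from one dense feature row; row_to_vw is a hypothetical helper, and skipping zero-valued features is an assumption rather than a description of what VW_Classifier does internally:

```python
def row_to_vw(label, row):
    """Format one dense feature row as 'label | index:value ...', skipping zeros."""
    feats = ["%d:%f" % (i, v) for i, v in enumerate(row) if v != 0]
    return "%s | %s" % (label, " ".join(feats))
```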

Library Interface (DISABLED as of 2013-08-12)

Via the VW interface:

with vw.predicting_library():
    for instance in ['1 |large burnt sienna rhombus', \
                      '1 |little teal oval']:
        prediction = vw.push_instance(instance)

Now the predictions are returned directly to the parent process, rather than having to read from disk. See examples/example1.py for more details.

Alternatively you can use the raw library interface:

import vw_c
vw = vw_c.VW("--loss=quadratic --l1=0.01 -f model")
vw.learn("1 |this is a positive example")
vw.learn("0 |this is a negative example")
vw.finish()

Currently does not support passes due to some limitations in the underlying vw C code.

Need more examples?

example1.py: SimpleModel class wrapper around VP (both standard and library flavors)
example_library.py: Demonstrates the low-level vw library wrapper, classifying lines of Alice in Wonderland vs. Through the Looking Glass.

Why

vowpal_wabbit is insanely fast and scalable. vowpal_porpoise is slower, but only during the initial training pass. Once the data has been properly cached it will idle while vowpal_wabbit does all the heavy lifting. Furthermore, vowpal_porpoise was designed to be lightweight and not to get in the way of vowpal_wabbit's scalability, e.g. it allows distributed learning via --nodes and does not require data to be batched in memory. In our research work we use vowpal_porpoise on an 80-node cluster running over multiple terabytes of data.

The main benefit of vowpal_porpoise is allowing rapid prototyping of new models and feature extractors. We found that we had been doing this in an ad-hoc way using python scripts to shuffle around massive gzipped text files, so we just closed the loop and made vowpal_wabbit a python library.

How it works

vowpal_porpoise wraps the vw binary in a subprocess, using stdin to push data and temporary files to pull predictions. Why not use the prediction labels vw provides on stdout? It turns out that the python GIL basically makes streaming in and out of a process (even asynchronously) painfully difficult. If you know of a clever way to get around this, please email me. In other languages (e.g. in a forthcoming scala wrapper) this is not an issue.
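The pattern described here (push instances over stdin, read results back from a temporary file once the child has exited) can be sketched without vw itself. In the sketch below, the child is a stand-in Python process that writes the length of each input line as its "prediction"; the stand-in and the name run_through_child are assumptions for illustration and this is not the code in vw.py:

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the vw binary (an assumption for illustration): a child Python
# process that reads lines from stdin and writes one "prediction" per line
# (here just the stripped line length) to the file named in argv[1].
CHILD = (
    "import sys\n"
    "with open(sys.argv[1], 'w') as out:\n"
    "    for line in sys.stdin:\n"
    "        out.write('%d\\n' % len(line.strip()))\n"
)


def run_through_child(instances):
    """Push instances over stdin, then read results back from a temp file."""
    fd, pred_path = tempfile.mkstemp(suffix=".prediction")
    os.close(fd)
    try:
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, pred_path],
            stdin=subprocess.PIPE,
            universal_newlines=True,
        )
        for instance in instances:
            proc.stdin.write(instance + "\n")
        proc.stdin.close()  # signals end of input to the child
        if proc.wait() != 0:
            raise RuntimeError("child exited with code %d" % proc.returncode)
        with open(pred_path) as f:
            return [float(line) for line in f]
    finally:
        os.remove(pred_path)
```

Because the parent only writes to the pipe and reads the file after the child has exited, it never has to read and write the same process concurrently.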

Alternatively, you can use a pure api call (vw_c, wrapping libvw) for prediction.

Contact

Joseph Reisinger @josephreisinger

Contributors

Austin Waters ([email protected])
Joseph Reisinger ([email protected])
Daniel Duckworth ([email protected])

License

Apache 2.0

Comments

Issue with example1.py

Hi, guys!

When I run example1.py it raises an exception.

"""
[email protected]:~/vowpal_porpoise/examples$ python example1.py
example1: training
[DEBUG] No existing model file or not options.incremental
[DEBUG] Running command: "vw --learning_rate=15.000000 --power_t=1.000000 --passes 10 --cache_file /home/kolesman/vowpal_porpoise/examples/example1.cache -f /home/kolesman/vowpal_porpoise/examples/example1.model"
done streaming.
final_regressor = /home/kolesman/vowpal_porpoise/examples/example1.model
Num weight bits = 18
learning rate = 15
initial_t = 0
power_t = 1
decay_learning_rate = 1
creating cache_file = /home/kolesman/vowpal_porpoise/examples/example1.cache
Reading datafile =
num sources = 1
average    since       example   example   current   current   current
loss       last        counter   weight    label     predict   features
0.360904   0.360904    3         3.0       1.0000    0.7933    5
0.266263   0.171622    6         6.0       0.0000    0.2465    5
-nan       -nan        11        11.0      0.0000    0.0000    5 h
-nan       -nan        22        22.0      0.0000    0.0000    5 h
-nan       -nan        44        44.0      1.0000    1.0000    5 h
Traceback (most recent call last):
  File "example1.py", line 86, in <module>
    for (instance, prediction) in SimpleModel('example1').train(instances).predict(instances):
  File "example1.py", line 44, in train
    print 'done streaming.'
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.py", line 167, in training
    self.close_process()
  File "/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.py", line 203, in close_process
    (self.vw_process.pid, self.vw_process.command, self.vw_process.returncode))
Exception: vw_process 22007 (vw --learning_rate=15.000000 --power_t=1.000000 --passes 10 --cache_file /home/kolesman/vowpal_porpoise/examples/example1.cache -f /home/kolesman/vowpal_porpoise/examples/example1.model) exited abnormally with return code -11
"""

Do you have any idea what the source of the problem is?

opened by kolesman 2

Make tagged VW data work

For whatever reason, when the VW data is tagged, the parser barfs on reading the prediction file because it gets the prediction value and the tag back. This fixes it for me.
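The problem described here is that a prediction-file line can carry a tag after the value ("value tag") rather than just the value. A minimal sketch of a parser that tolerates both shapes; parse_prediction_line is a hypothetical name, and this is not the patch from this pull request:

```python
def parse_prediction_line(line):
    """Return the numeric prediction from a line shaped 'value' or 'value tag'."""
    fields = line.split()
    if not fields:
        raise ValueError("empty prediction line")
    # The prediction is the first whitespace-separated field; any tag follows it.
    return float(fields[0])
```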

opened by mswimmer 1

Added support for nn (single layer) in sklearn interface

Adding support for nn to be called from the wrapper.

[DEBUG] Running command: "vw --learning_rate=5.000000 --l2=0.000010 --oaa=10 --nn=4 --passes 10 --cache_file /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.cache -f /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.model"
[DEBUG] Running command: "vw --learning_rate=5.000000 --l2=0.000010 --oaa=10 --nn=4 -t -i /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.model -p /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.predictionecqcFA"
Confusion Matrix:
[[34  0  0  0  1  0  0  0  0  0]
 [ 0 29  0  0  0  0  0  0  0  7]
 [ 0  0 35  0  0  0  0  0  0  0]
 [ 0  0  0 24  0  4  0  3  6  0]
 [ 0  0  0  0 34  0  0  0  3  0]
 [ 0  0  0  0  0 37  0  0  0  0]
 [ 0  0  0  0  0  0 37  0  0  0]
 [ 0  0  0  1  0  0  0 32  2  1]
 [ 0  1  0  0  0  1  0  1 30  0]
 [ 0  0  0  0  0  2  0  1  3 31]]
0.89717036724

Adding @aboSamoor (as he is interested in this CL too)

opened by viveksck 1

Encode Cython as a setup-time dependency of vowpal porpoise

Encoding Cython as a setup-time dependency makes it much easier to use vowpal porpoise in nicely packaged distributions.

Without Cython as a setup-time dependency, you might have a requirements.txt with these lines:

Cython
git+http://github.com/josephreisinger/vowpal_porpoise.git#egg=vowpal_porpoise

and try to execute "pip install -r requirements.txt" (or, for instance, push to Heroku and expect it to do so).

Unfortunately the installation process for Cython will not be completed before vowpal porpoise needs it. By specifying Cython as a setup-time dependency in the vowpal porpoise setup.py, Cython will be downloaded and available before it is needed, and you don't have to specify it as a dependency elsewhere. Using my modified setup.py, I can now run "pip install git+http://github.com/josephreisinger/vowpal_porpoise.git#egg=vowpal_porpoise" without any mention of Cython.
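In setuptools, a setup-time dependency is declared with the setup_requires keyword. The sketch below shows only the general shape and is not the modified setup.py from this pull request; the package name, version and the run_setup wrapper are placeholders chosen for illustration:

```python
# Sketch of a setup.py; the metadata values are placeholders, not the
# project's real configuration.
SETUP_KWARGS = dict(
    name="example_package",     # hypothetical package name
    version="0.0.0",            # placeholder version
    setup_requires=["Cython"],  # made available before the build step runs
    install_requires=[],
)


def run_setup():
    """Call setuptools.setup with the keyword arguments above."""
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

A real setup.py would call setup() at module level; the call is wrapped in a function here only so that the snippet can be imported without starting a build.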

opened by mattbornski 0

Update example_sklearn.py

y must be a binary list, otherwise will result in an error:

[DEBUG] No existing model file or not options.incremental
[DEBUG] Running command: "vw --learning_rate=10.000000 --l2=0.000010 --loss_function=logistic --passes 10 --cache_file /Users/datle/Desktop/example_sklearn.cache -f /Users/datle/Desktop/example_sklearn.model"
[DEBUG] Running command: "vw --learning_rate=10.000000 --l2=0.000010 --loss_function=logistic -t -i /Users/datle/Desktop/example_sklearn.model -p /Users/datle/Desktop/example_sklearn.predictiond9d1DV"
Traceback (most recent call last):
  File "test.py", line 72, in <module>
    main()
  File "test.py", line 58, in main
    ).fit(X_train, y_train)
  File "/Library/Python/2.7/site-packages/sklearn/grid_search.py", line 732, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Library/Python/2.7/site-packages/sklearn/grid_search.py", line 505, in _fit
    for parameters in parameter_iterable
  File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1478, in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)
  File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1534, in _score
    score = scorer(estimator, X_test, y_test)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/scorer.py", line 201, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/base.py", line 295, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 179, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 84, in _check_targets
    "".format(type_true, type_pred))
ValueError: Can't handle mix of binary and continuous
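The ValueError in this traceback is raised by accuracy_score when the true labels are binary but the values returned by predict are continuous. One way to make the two comparable is to threshold the scores into labels before scoring. A minimal sketch; to_binary_predictions is a hypothetical helper, and the threshold and label values are assumptions that depend on how the labels are encoded:

```python
def to_binary_predictions(scores, threshold=0.5, positive=1, negative=0):
    """Map continuous scores to two labels by comparing against a threshold."""
    return [positive if s >= threshold else negative for s in scores]
```

With labels encoded as -1 and 1, the same helper can be called with threshold=0.0 and negative=-1.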

opened by lenguyenthedat 0

Can't run example 1

Hi, if I try to run example1 after installing everything, I get the following error:

File "example1.py", line 86, in <module>
    for (instance, prediction) in SimpleModel('example1').train(instances).predict(instances):
  File "example1.py", line 37, in train
    with self.model.training():
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "build/bdist.macosx-10.9-intel/egg/vowpal_porpoise/vw.py", line 168, in training
  File "build/bdist.macosx-10.9-intel/egg/vowpal_porpoise/vw.py", line 194, in start_training
  File "build/bdist.macosx-10.9-intel/egg/vowpal_porpoise/vw.py", line 266, in make_subprocess
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

I'd hazard a guess the cache file is not getting created. Please help?

opened by Kaydeeb0y 0

Doesn't work on ipython notebook

I'm trying to use vowpal porpoise from my IPython Notebook web interface. Running this code:

from vowpal_porpoise import VW
vw = VW(vw='vw_new', 
   passes=2,
   moniker='log_train.vw', 
   loss='logistic')
with vw.training():
    pass

I get this:

---------------------------------------------------------------------------
UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-7-39be08ecca54> in <module>()
      3    moniker='log_train.vw',
      4    loss='logistic')
----> 5 with vw.training():
      6     pass

/usr/lib/python2.7/contextlib.pyc in __enter__(self)
     15     def __enter__(self):
     16         try:
---> 17             return self.gen.next()
     18         except StopIteration:
     19             raise RuntimeError("generator didn't yield")

/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in training(self)
    166     @contextmanager
    167     def training(self):
--> 168         self.start_training()
    169         yield
    170         self.close_process()

/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in start_training(self)
    192 
    193         # Run the actual training
--> 194         self.vw_process = self.make_subprocess(self.vw_train_command(cache_file, model_file))
    195 
    196         # set the instance pusher

/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in make_subprocess(self, command)
    264             stderr.write(command + '\n')
    265         self.log.debug('Running command: "%s"' % str(command))
--> 266         result = subprocess.Popen(shlex.split(str(command)), stdin=subprocess.PIPE, stdout=stdout, stderr=stderr, close_fds=True, universal_newlines=True)
    267         result.command = command
    268         return result

/usr/lib/python2.7/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
    670         (p2cread, p2cwrite,
    671          c2pread, c2pwrite,
--> 672          errread, errwrite) = self._get_handles(stdin, stdout, stderr)
    673 
    674         self._execute_child(args, executable, preexec_fn, close_fds,

/usr/lib/python2.7/subprocess.pyc in _get_handles(self, stdin, stdout, stderr)
   1063             else:
   1064                 # Assuming file-like object
-> 1065                 errwrite = stderr.fileno()
   1066 
   1067             return (p2cread, p2cwrite,

/usr/local/lib/python2.7/dist-packages/IPython/kernel/zmq/iostream.pyc in fileno(self)
    192 
    193     def fileno(self):
--> 194         raise UnsupportedOperation("IOStream has no fileno.")
    195 
    196     def write(self, string):

UnsupportedOperation: IOStream has no fileno.
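The traceback shows subprocess.Popen calling stderr.fileno() on the notebook's replacement stream, which has no underlying file descriptor. A minimal sketch of a guard that falls back to subprocess.DEVNULL when a stream cannot provide one, assuming Python 3.3+; popen_safe_stream is a hypothetical helper and not a fix that exists in vowpal_porpoise:

```python
import io
import subprocess


def popen_safe_stream(stream):
    """Return stream if it has a real file descriptor, else subprocess.DEVNULL."""
    try:
        stream.fileno()
    except (AttributeError, io.UnsupportedOperation, ValueError):
        # In-memory or proxied streams cannot be handed to a child process.
        return subprocess.DEVNULL
    return stream
```

The return value can be passed as the stdout or stderr argument of subprocess.Popen; output sent to DEVNULL is discarded, so a real log file may be preferable when the messages matter.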

opened by khalman-m 0

Make input format for cross validation consistent with that of VW

First of all, this is a great wrapper! It was very nice to see the linear regression with l1 penalty example take input in the VW format. However, it would be great for beginners like me to have a similar example for getting the GridSearch to work with VW.

opened by Legend 0

GridSearchCV with n_jobs > 1 (Parallelized) with VW classifier results in a Broken Pipe error

    127         with self.vw_.training():
    128             for instance in examples:
    129                 self.vw_.push_instance(instance)  <-----
    130
    131         # learning done after "with" statement
    132         return self
    133

...........................................................................
/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in push_instance_stdin(self=<vowpal_porpoise.vw.VW instance>, instance='2 | 42:2.000000 29:16.000000 60:13.000000 61:16....3:16.000000 52:16.000000 33:7.000000 37:16.000000')
    204         if self.vw_process.wait() != 0:
    205             raise Exception("vw_process %d (%s) exited abnormally with return code %d" %
    206                             (self.vw_process.pid, self.vw_process.command, self.vw_process.returncode))
    207
    208     def push_instance_stdin(self, instance):
--> 209         self.vw_process.stdin.write(('%s\n' % instance).encode('utf8'))
    210
    211     def start_predicting(self):
    212         model_file = self.get_model_file()
    213         # Be sure that the prediction file has a unique filename, since many processes may try to

IOError: [Errno 32] Broken pipe

To reproduce: Just pass the parameter n_jobs = 10 to GridSearchCV in example_sklearn.py
opened by viveksck 0

Releases(0.3)

0.3(Aug 13, 2013)

Owner

Joseph Reisinger

GitHub Repository http://josephreisinger.github.io/vowpal_porpoise/