a feature engineering wrapper for sklearn

Last update: Nov 18, 2022

Overview

Few

Few is a Feature Engineering Wrapper for scikit-learn. Few looks for a set of feature transformations that work best with a specified machine learning algorithm in order to improve model estimation and prediction. In doing so, Few is able to provide the user with a set of concise, engineered features that describe their data.

Few uses genetic programming to generate, search and update engineered features. It incorporates feedback from the ML process to select important features, while also scoring them internally.
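
The generate, score and survive cycle described above can be illustrated with a deliberately simplified sketch. This is not Few's implementation: the primitive set, the function names (`random_feature`, `evolve`) and the use of absolute correlation as the score are all assumptions made for illustration, and the loop uses random generation plus survival only, without crossover, tree mutation, or feedback from an ML model.

```python
import random

def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

# Hypothetical primitive set; a real system would use a much richer one.
UNARY = {"id": lambda v: v, "neg": lambda v: -v, "sq": lambda v: v * v}

def random_feature(n_inputs, rng):
    """Draw a random candidate feature as a (name, function-of-row) pair."""
    op = rng.choice(sorted(UNARY))
    i = rng.randrange(n_inputs)
    return f"{op}(x_{i})", (lambda row: UNARY[op](row[i]))

def evolve(X, y, generations=20, population_size=10, seed=0):
    """Keep the candidate features that correlate best with the target."""
    rng = random.Random(seed)
    n_inputs = len(X[0])

    def fitness(feature):
        _, func = feature
        return abs(correlation([func(row) for row in X], y))

    population = [random_feature(n_inputs, rng) for _ in range(population_size)]
    for _ in range(generations):
        offspring = [random_feature(n_inputs, rng) for _ in range(population_size)]
        # Survival step: parents and offspring compete, the best ones are kept.
        ranked = sorted(population + offspring, key=fitness, reverse=True)
        population = ranked[:population_size]
    best = population[0]
    return best[0], fitness(best)

# Toy data: the target is the square of the first input; the second is constant.
X = [[float(v), 1.0] for v in range(-3, 4)]
y = [row[0] ** 2 for row in X]
best_name, best_score = evolve(X, y)
```

On this toy data the raw inputs have zero correlation with the target, so only the squared feature can score well, which is the kind of gain an engineered feature is meant to provide.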

Install

You can use pip to install FEW from PyPI:

pip install few

or you can clone the git repo and add it to your Python path. Then from the repo, run

python setup.py install

Mac users

Some Mac users have reported issues when installing with old versions of gcc (such as gcc-4.2) because the random.h library is not included. I recommend installing gcc-4.8 or greater for use with Few. After updating the compiler, you can reinstall with

CC=gcc-4.8 python setup.py install

Usage

Few uses the same nomenclature as sklearn supervised learning modules. Here is a simple example script:

# import few and the ML method to wrap
from few import FEW
from sklearn.linear_model import LassoLarsCV
# initialize
learner = FEW(generations=100, population_size=25, ml=LassoLarsCV())
# fit model (X and y are your training features and target)
learner.fit(X, y)
# generate prediction
y_pred = learner.predict(X_unseen)
# get feature transformation
Phi = learner.transform(X_unseen)

You can also call Few from the terminal as

python -m few.few data_file_name

Try python -m few.few --help to see the available options.

Examples

Check out few_example.py to see how to apply FEW to a regression dataset.

Publications

If you use Few, please reference our publications:

La Cava, W., and Moore, J.H. A general feature engineering wrapper for machine learning using epsilon-lexicase survival. Proceedings of the 20th European Conference on Genetic Programming (EuroGP 2017), Amsterdam, Netherlands. preprint

La Cava, W., and Moore, J.H. Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods. GECCO '17: Proceedings of the 2017 Genetic and Evolutionary Computation Conference. Berlin, Germany. arxiv

Acknowledgments

This method is being developed to study the genetic causes of human disease in the Epistasis Lab at UPenn. Work is partially supported by the Warren Center for Network and Data Science. Thanks to Randy Olson and TPOT for Python guidance.

Comments

Added roc_auc as a fit_choice

Tested on a single sample dataset and it seems to work well.

Currently, it is not compatible with any lexicase selection variants because there is no function that returns a vector of roc auc values. I am not sure what such a function would look like, because it is impossible to compute the roc auc of a single prediction.
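
To make the last point concrete, here is a minimal sketch of a pairwise (Mann-Whitney style) ROC AUC. The function name `roc_auc` is hypothetical and this is not the scorer Few uses; it only shows that the metric compares positive samples against negative ones, so it has no per-sample value.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        # With one sample (or one class) there are no pairs to compare.
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

try:
    roc_auc([1], [0.9])
    single_sample_defined = True
except ValueError:
    single_sample_defined = False
```

Because the score is defined over pairs, any per-case error vector for lexicase selection would have to come from a different quantity than the AUC itself.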

opened by erp12 12

Error with installation

Hello,

I'm trying out this package but am running into some issues. Any ideas? I have VS 14.16 on my PC and get this error when typing 'pip install few'. At first it asked for eigency, but after installing that, this error popped up.

Sincerely, G

bug windows

opened by GinoWoz1 8

Lexicase survival

This has issues and doesn't run. Creating request for analysis in this separate branch.

Gives error 'bool' is not a type identifier in

cdef void _epsilon_lexicase "epsilon_lexicase"(Map[ArrayXXd] & F, int n, int d, int num_selections, Map[ArrayXi] & locs, bool lex_size, Map[ArrayXi] &sizes)

opened by rgupta90 8

feat vs few?

Greetings!

I would like to know if there is any practical difference between the two projects. I'm asking this because testing feat would require a lot more effort than few and, as such, I need to know if it is worth it.

Thanks in advance!

opened by echo66 5

ImportError: dlopen ... symbol not found

Hi, I've cloned few, built and installed on OS X 10.12 using:

CC=gcc-7 python setup.py install

But I'm getting a symbol not found error on import of the few module.

I note a few warnings during the build process beginning with: #warning "Using deprecated NumPy API, disable it by ...

and then finally:

g++ -bundle -undefined dynamic_lookup -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/few/lib/few_lib.o -o build/lib.macosx-10.7-x86_64-3.6/few_lib.cpython-36m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]

Any advice what to check next? Otherwise, I'm not entirely clear on why I'm seeing a clang message, so that, along with the indicated warning is my first avenue to explore.

opened by jay-reynolds 5

cross_val changes

I am not sure which is the better approach: commenting out the previous code or removing it. For now, it is commented out. I also kept self._training_features and self._training_labels assigned, as they are used by functions in other Python files that are called here.

opened by rgupta90 4

normalize feature transformations

Normalize feature transformations automatically before feeding them into the ML fit method. Store the transformer so that it can be used in prediction/transformation as well.
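
A minimal sketch of the idea, assuming a hypothetical `ColumnScaler` class written here in plain Python (in practice scikit-learn's StandardScaler plays this role): the statistics are learned once at fit time and the same stored object is reused on unseen data.

```python
from statistics import fmean, pstdev

class ColumnScaler:
    """Tiny stand-in for a standard scaler: learns per-column mean and std."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means_ = [fmean(c) for c in cols]
        # Constant columns have std 0; fall back to 1.0 to avoid dividing by zero.
        self.stds_ = [pstdev(c) or 1.0 for c in cols]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means_, self.stds_)]
            for row in rows
        ]

train = [[1.0, 10.0], [3.0, 10.0]]
scaler = ColumnScaler().fit(train)           # stored after fitting
train_scaled = scaler.transform(train)       # used when fitting the ML method
unseen_scaled = scaler.transform([[5.0, 10.0]])  # reused at prediction time
```

Reusing the fitted object matters because refitting the scaler on unseen data would shift the features the ML model was trained on.
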
enhancement

opened by lacava 4

few.model() and few.print_model()

Hello!

Thanks for sharing your work, this is really cool!

I was wondering if you could provide a bit of explanation as to the difference between these two outputs of the algorithm.

Also, is there any (outside) documentation on all this?

Thanks in advance!

Kind regards, Theodore.

opened by TheodoreGalanos 4

random numbers seed not working?

Greetings!

I have the following code:

# imports assumed by this snippet
import numpy as np
from few import FEW
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

feats_gen = FEW(
    ml=DecisionTreeClassifier(random_state=10, max_depth=None, min_samples_leaf=5),
    population_size=100, tourn_size=2,
    mutation_rate=0.5, crossover_rate=0.5,
    sel='epsilon_lexicase',
    clean=True,
    mdr=True, boolean=True,
    random_state=10, verbosity=1,
    scoring_function=roc_auc_score,
    max_depth=10, min_depth=1, max_depth_init=1,
    classification=True,
    generations=50, max_stall=None,
    names=list(X_train.select_dtypes(include=[np.number]).columns))

feats_gen.fit(X_train.select_dtypes(include=[np.number]).values, 
              y_train.astype(int).values)

test_ = preprocessing_pipeline.transform(e.test)

X_test = test_.X
y_test = test_[test_.target_name].astype(int)

roc_auc_score(y_test, feats_gen._best_estimator.predict_proba(feats_gen.transform(X_test.select_dtypes(include=[np.number]).values))[:, 1])

Every time I run this code, I get different ROC AUC values in both training and test. I'm pretty sure preprocessing_pipeline is deterministic.
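
One common way a seeded routine can still vary between runs is a single draw from a random number generator that the seed does not control. The sketch below illustrates that general mechanism with a hypothetical `noisy_search` function; it is not a diagnosis of what happens inside FEW.

```python
import random

def noisy_search(seed, use_global_rng=False):
    """Toy stochastic routine: returns five pseudo-random draws."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(4)]
    # One draw from the unseeded module-level RNG is enough to break reproducibility.
    draws.append(random.random() if use_global_rng else rng.random())
    return draws

# Every draw comes from the seeded generator: identical across runs.
reproducible = noisy_search(10) == noisy_search(10)

# The last draw bypasses the seed: the runs differ even though the seed matches.
leaky = noisy_search(10, use_global_rng=True) == noisy_search(10, use_global_rng=True)
```

A check of this kind (run twice with the same seed and compare intermediate values) can help narrow down which step introduces the variation.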

opened by echo66 3

If original features are found by FEW, transform() method fails with TypeError

If original features are found by FEW, transform() method fails with TypeError

For example:

print('Model: {}'.format(learner.print_model()))

Model: original features

Phi = learner.transform(X_test.values)

TypeError Traceback (most recent call last)
in ()
----> 1 Phi = learner.transform(X_test.values)

~/anaconda3/envs/ml/lib/python3.6/site-packages/FEW-0.0.38-py3.6-macosx-10.7-x86_64.egg/few/few.py in transform(self, x, inds, labels)
    395 # return np.asarray(Parallel(n_jobs=10)(delayed(self.out)(I,x,labels,self.otype) for I in self._best_inds)).transpose()
    396 return np.asarray(
--> 397 [self.out(I,x,labels,self.otype) for I in self._best_inds]).transpose()
    398
    399

TypeError: 'NoneType' object is not iterable
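
The traceback suggests that the list of best individuals is None when the model is reported as "original features". A hedged sketch of one possible guard follows; `transform_or_passthrough` and its arguments are hypothetical names, and this is not the fix applied in FEW.

```python
def transform_or_passthrough(x, best_inds, evaluate):
    """Return engineered features, or x unchanged when no individuals were kept."""
    if best_inds is None:
        # Assumption: None means the model fell back to the original features.
        return x
    # One engineered column per individual, then transpose so rows are samples.
    columns = [evaluate(ind, x) for ind in best_inds]
    return [list(row) for row in zip(*columns)]

x = [[1, 2], [3, 4]]

def double_column(ind, rows):
    """Toy 'individual': doubles the input column with index ind."""
    return [row[ind] * 2 for row in rows]

unchanged = transform_or_passthrough(x, None, double_column)
engineered = transform_or_passthrough(x, [0, 1], double_column)
```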

opened by jay-reynolds 3

Revert "standard scaler pipeline"

Reverts lacava/few#26

This passes the tests but throws an error on the dataset I'm currently applying it to. The error comes from the fit method, line 314 in few.py:

$ python -m few.few ../../data/maize/d_maize-dent-tass.csv -p 100 -max_depth 3 -ms 25 --weight_parents

warning: ValueError in ml fit. X.shape: (100, 100) y_t shape: (146,)
First ten entries X: [10 rows x 100 columns, every entry equal to 1.]
First ten entries y_t: [ 826.47  884.33  904.55  848.71  879.46  885.12  905.36  886.69  821.05
  912.51]
equations: ['x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55']
FEW parameters: {'min_depth': 1, 'ml': None, 'elitism': True, 'clean': False, 'erc': False, 'classification': False, 'crossover_rate': 0.5, 'c': True, 'op_weight': 1, 'scoring_function': None, 'population_size': '100', 'mdr': False, 'weight_parents': True, 'max_depth': 3, 'tourn_size': 2, 'mutation_rate': 0.5, 'max_depth_init': 2, 'random_state': None, 'max_stall': 25, 'otype': 'f', 'generations': 100, 'sel': 'epsilon_lexicase', 'verbosity': 1, 'track_diversity': False, 'seed_with_ml': True, 'disable_update_check': False, 'fit_choice': None, 'boolean': False}
Traceback (most recent call last):
  File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 314, in fit
    self.ml.fit(self.X[self.valid_loc(),:].transpose(),y_t)
  File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 1141, in fit
    axis=0)(all_alphas)
  File "/home/bill/anaconda3/lib/python3.5/site-packages/scipy/interpolate/interpolate.py", line 483, in __init__
    "least %d entries" % minval)
ValueError: x and y arrays must have at least 2 entries

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 906, in <module>
    main()
  File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 893, in main
    learner.fit(training_features, training_labels)
  File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 330, in fit
    raise(ValueError)
ValueError
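
The second traceback ends in a bare ValueError with no message, so the shapes have to be read from the warning printed earlier. A small sketch of re-raising with that context attached, using a hypothetical helper `fit_with_context` (not part of FEW):

```python
def fit_with_context(fit, X_rows, y):
    """Call fit(X_rows, y); on failure re-raise with the input shapes attached."""
    try:
        return fit(X_rows, y)
    except ValueError as err:
        n_cols = len(X_rows[0]) if X_rows else 0
        # "from err" keeps the original exception as the cause in the traceback.
        raise ValueError(
            f"ml fit failed: X shape ({len(X_rows)}, {n_cols}), y length {len(y)}"
        ) from err

def failing_fit(X_rows, y):
    """Stand-in for an estimator whose fit method rejects the input."""
    raise ValueError("x and y arrays must have at least 2 entries")

try:
    fit_with_context(failing_fit, [[1.0] * 100 for _ in range(100)], [0.0] * 146)
except ValueError as exc:
    message = str(exc)
    cause = exc.__cause__
```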

opened by lacava 2

Recent scikit learn support

I was trying to test the package, but ran into a few issues with a recent scikit-learn.

Although not used, Parallel and delayed were imported from sklearn, where they are no longer available, so I replaced the import with joblib. Also, Imputer is now SimpleImputer and has moved from preprocessing to impute.

These changes allow for compatibility with scikit-learn 0.23.
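
For code that must run against both old and new scikit-learn versions, the renamed imports can be resolved with a fallback. The helper `first_available` below is a hypothetical name used for illustration; the module paths are the new and legacy locations mentioned above.

```python
import importlib

def first_available(candidates):
    """Return the first attribute that imports cleanly from (module, name) pairs."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            continue
    return None  # nothing importable, e.g. scikit-learn is not installed

# New location first, legacy location as a fallback.
Imputer = first_available([
    ("sklearn.impute", "SimpleImputer"),
    ("sklearn.preprocessing", "Imputer"),
])
Parallel = first_available([
    ("joblib", "Parallel"),
    ("sklearn.externals.joblib", "Parallel"),
])
```

The constructor arguments of the old and new imputer classes are not identical, so call sites may still need adjusting after the import resolves.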

opened by fdion 0

Issues with current ML validation score

Hello,

Thanks for the help so far. I was able to get the tool up and running in windows.

However, I am observing two weird things.

When I use Gradient Boost Regressor - my score gets worse by the generation even when I switched the scoring function sign. The first score is nearly my best score I have gotten by myself (no feature engineering done on data set).

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_GB.ipynb

When I use Random Forest - same scorer - current ML validation score returns as 0 and runs really fast

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_RF.ipynb

I think I am missing something on how to use this tool but no idea what. I am trying to use this in tandem with TPOT as I am exploring feature creation GA/GP based tools. Sincerely appreciate any advice/guidance you can provide.

Sincerely, G

opened by GinoWoz1 3

add encoding operators for GWAS

Add operators that re-encode input SNPs based on different encodings, including add, dom, rec, het, sub-add, and super-add. We need to resolve how the underlying data would be represented; maybe assume the input is additive?
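
As a sketch of what such operators could look like, assuming the input is additive (0, 1 or 2 copies of the minor allele), the lookup tables below cover the add, dom, rec and het encodings. The names `ENCODINGS` and `reencode` are hypothetical, and sub-add and super-add are left out because the value they assign to heterozygotes is a modeling choice not specified here.

```python
# Maps an additive genotype (0, 1, 2) to its value under each encoding.
ENCODINGS = {
    "add": {0: 0, 1: 1, 2: 2},  # additive: count of minor alleles
    "dom": {0: 0, 1: 1, 2: 1},  # dominant: at least one minor allele
    "rec": {0: 0, 1: 0, 2: 1},  # recessive: two minor alleles
    "het": {0: 0, 1: 1, 2: 0},  # heterozygous: exactly one minor allele
}

def reencode(genotypes, encoding):
    """Re-encode a list of additive genotypes under the named encoding."""
    table = ENCODINGS[encoding]
    return [table[g] for g in genotypes]

snp = [0, 1, 2, 1]
dominant = reencode(snp, "dom")
recessive = reencode(snp, "rec")
heterozygous = reencode(snp, "het")
```
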
enhancement

opened by lacava 0

low GPU utilization with tf option

I'm getting low utilization of the GPU using the tensorflow evaluation strategy. There are a few things to try:

Use this method to profile tensorflow and see where the inefficiencies lie.

According to this, using feed_dict is not a good idea. Need to look into using pipelines or variables for feeding input data to the graphs.

bug enhancement

opened by lacava 0

Releases (0.0.8)

0.0.8 (Dec 15, 2016)

first published release

Owner

William La Cava

Research associate at UPenn, developing ML for applications in biomedical informatics

GitHub Repository https://lacava.github.io/few

