a feature engineering wrapper for sklearn

Overview

Few

Few is a Feature Engineering Wrapper for scikit-learn. Few looks for a set of feature transformations that work best with a specified machine learning algorithm in order to improve model estimation and prediction. In doing so, Few is able to provide the user with a set of concise, engineered features that describe their data.

Few uses genetic programming to generate, search and update engineered features. It incorporates feedback from the ML process to select important features, while also scoring them internally.
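
The generate, score, and select loop described above can be illustrated with a toy sketch. This is not Few's implementation: it uses plain random generation and truncation survival in place of genetic programming operators, and it scores each candidate by its absolute Pearson correlation with the target as a stand-in for feedback from the ML method. All names (`random_feature`, `abs_correlation`, `search`) are hypothetical.

```python
import math
import random

# Hypothetical building blocks; Few's own operator set may differ.
UNARY = {"sq": lambda a: a * a, "abs": abs}
BINARY = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def random_feature(n_inputs, rng):
    """Draw one random candidate transformation of the raw inputs."""
    i, j = rng.randrange(n_inputs), rng.randrange(n_inputs)
    if rng.random() < 0.5:
        op = rng.choice(sorted(UNARY))
        return (f"{op}(x_{i})", lambda row, f=UNARY[op], i=i: f(row[i]))
    op = rng.choice(sorted(BINARY))
    return (f"{op}(x_{i},x_{j})",
            lambda row, f=BINARY[op], i=i, j=j: f(row[i], row[j]))


def abs_correlation(values, y):
    """Stand-in score: |Pearson r| between a feature column and the target."""
    n = len(y)
    mv, my = sum(values) / n, sum(y) / n
    cov = sum((v - mv) * (t - my) for v, t in zip(values, y))
    sv = math.sqrt(sum((v - mv) ** 2 for v in values))
    sy = math.sqrt(sum((t - my) ** 2 for t in y))
    return 0.0 if sv == 0 or sy == 0 else abs(cov / (sv * sy))


def search(X, y, generations=20, population_size=10, seed=0):
    """Generate candidates, score them, keep the best, and repeat."""
    rng = random.Random(seed)
    n_inputs = len(X[0])
    population = [random_feature(n_inputs, rng) for _ in range(population_size)]
    for _ in range(generations):
        offspring = [random_feature(n_inputs, rng) for _ in range(population_size)]
        scored = [
            (abs_correlation([f(row) for row in X], y), name, f)
            for name, f in population + offspring
        ]
        scored.sort(key=lambda t: t[0], reverse=True)
        population = [(name, f) for _, name, f in scored[:population_size]]
    # Return (name, score) pairs, best first.
    return [(name, abs_correlation([f(row) for row in X], y))
            for name, f in population]


# Toy data whose target is the product of the two inputs.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(50)]
y = [a * b for a, b in X]
best = search(X, y)
print(best[0])
```

On this toy problem the search only has to stumble on the product of the two inputs; a real run would replace the correlation score with the fitted model's own assessment of each feature.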

Install

You can use pip to install Few from PyPI:

pip install few

or you can clone the git repo and add it to your Python path. Then from the repo, run

python setup.py install

Mac users

Some Mac users have reported problems installing with old versions of gcc (such as gcc-4.2) because the random.h library is not included. I recommend installing gcc-4.8 or later for use with Few. After updating the compiler, you can reinstall with

CC=gcc-4.8 python setup.py install

Usage

Few uses the same nomenclature as sklearn supervised learning modules. Here is a simple example script:

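# import few and the ML method it will wrap
from few import FEW
from sklearn.linear_model import LassoLarsCV
# initialize
learner = FEW(generations=100, population_size=25, ml=LassoLarsCV())
# fit model (X, y are your training features and labels)
learner.fit(X, y)
# generate prediction (X_unseen holds held-out features)
y_pred = learner.predict(X_unseen)
# get feature transformation
Phi = learner.transform(X_unseen)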

You can also call Few from the terminal as

python -m few.few data_file_name 

Try python -m few.few --help to see the available options.

Examples

Check out few_example.py to see how to apply FEW to a regression dataset.

Publications

If you use Few, please reference our publications:

La Cava, W., and Moore, J.H. A general feature engineering wrapper for machine learning using epsilon-lexicase survival. Proceedings of the 20th European Conference on Genetic Programming (EuroGP 2017), Amsterdam, Netherlands. preprint

La Cava, W., and Moore, J.H. Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods. GECCO '17: Proceedings of the 2017 Genetic and Evolutionary Computation Conference. Berlin, Germany. arxiv

Acknowledgments

This method is being developed to study the genetic causes of human disease in the Epistasis Lab at UPenn. Work is partially supported by the Warren Center for Network and Data Science. Thanks to Randy Olson and TPOT for Python guidance.

Comments
  • Added roc_auc as a fit_choice

    Tested on a single sample dataset and it seems to work well.

    Currently, it is not compatible with any lexicase selection variants because there is no function that returns a vector of roc auc values. I am not sure what such a function would look like, because it is impossible to compute the roc auc of a single prediction.
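
The limitation described above follows from how ROC AUC is defined. A minimal sketch, assuming binary labels coded as 0/1 and a hypothetical helper name `roc_auc`: the score is the fraction of (positive, negative) pairs that the predictions rank correctly, so it is only defined over a set of samples containing both classes and has no natural per-sample value.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)  # 3 of the 4 pairs are ranked correctly -> 0.75

# A single prediction has no pair to compare against, so the score is undefined.
try:
    roc_auc([1], [0.9])
    single_sample_defined = True
except ValueError:
    single_sample_defined = False
```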

    opened by erp12 12
  • Error with installation

    Hello,

    Trying to attempt this package but running into some issues, any idea? I have VS 14.16 now on PC and getting this error when typing 'pip install few'. At first it was asking for eigency but now after that installation this error popped up.

    Sincerely, G

    bug windows 
    opened by GinoWoz1 8
  • Lexicase survival

    This has issues and doesn't run. Creating request for analysis in this separate branch.

    Gives error 'bool' is not a type identifier in

    cdef void _epsilon_lexicase "epsilon_lexicase"(Map[ArrayXXd] & F, int n, int d, int num_selections, Map[ArrayXi] & locs, bool lex_size, Map[ArrayXi] &sizes)

    opened by rgupta90 8
  • feat vs few?

    Greetings!

    I would like to know if there is any practical difference between the two projects. I'm asking this because testing feat would require a lot more effort than few and, as such, I need to know if it is worth it.

    Thanks in advance!

    opened by echo66 5
  • ImportError: dlopen  ... symbol not found

    Hi, I've cloned few, built and installed on OS X 10.12 using:

    CC=gcc-7 python setup.py install

    But I'm getting a symbol not found error on import of the few module.

    I note a few warnings during the build process beginning with: #warning "Using deprecated NumPy API, disable it by ...

    and then finally:

    g++ -bundle -undefined dynamic_lookup -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/few/lib/few_lib.o -o build/lib.macosx-10.7-x86_64-3.6/few_lib.cpython-36m-darwin.so
    clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]

    Any advice what to check next? Otherwise, I'm not entirely clear on why I'm seeing a clang message, so that, along with the indicated warning is my first avenue to explore.

    opened by jay-reynolds 5
  • cross_val changes

    I am not sure which is the best approach, commenting out the previous code or removing it. Thus, commented as of now. Also, kept self._training_features and self._training_labels assigned as these are being used in functions in other python files, which are being called here.

    opened by rgupta90 4
  • normalize feature transformations

    Normalize feature transformations automatically before feeding them into the ML fit method. Store the transformer so that it can be reused in prediction/transformation as well.
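
The idea of fitting a normalizer once and reusing it at prediction time can be sketched as follows. This is an illustration only, not the code that was merged into Few; `FeatureScaler` is a hypothetical stand-in for a standard scaler, written in plain Python so that it runs without extra dependencies.

```python
import math


class FeatureScaler:
    """Hypothetical z-score scaler: statistics come from training data only."""

    def fit(self, rows):
        n = len(rows)
        cols = list(zip(*rows))
        self.means = [sum(c) / n for c in cols]
        # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
        self.stds = [
            math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, self.means)
        ]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Fit on the engineered training features, then keep the fitted object around.
train_features = [[1.0, 10.0], [3.0, 10.0]]
scaler = FeatureScaler().fit(train_features)
train_scaled = scaler.transform(train_features)

# At prediction time the stored scaler is reused, never refitted.
unseen_scaled = scaler.transform([[2.0, 11.0]])
```

Reusing the stored statistics keeps training and prediction on the same scale; refitting on unseen data would silently change what each feature value means to the model.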

    enhancement 
    opened by lacava 4
  • few.model() and few.print_model()

    Hello!

    Thanks for sharing your work, this is really cool!

    I was wondering if you could provide a bit of explanation as to the difference between these two outputs of the algorithm.

    Also, is there any (outside) documentation on all this?

    Thanks in advance!

    Kind regards, Theodore.

    opened by TheodoreGalanos 4
  • random numbers seed not working?

    Greetings!

    I have the following code:

    feats_gen = FEW(
                    ml=DecisionTreeClassifier(random_state=10, max_depth=None, min_samples_leaf=5), 
                    population_size=100, tourn_size=2,                 
                    mutation_rate=0.5, crossover_rate=0.5, 
                    sel='epsilon_lexicase',   
                    clean=True,                
                    mdr=True, boolean=True, 
                    random_state=10, verbosity=1, 
                    scoring_function=roc_auc_score, 
                    max_depth=10, min_depth=1, max_depth_init=1, 
                    classification=True, 
                    generations=50, max_stall=None, 
                    names=list(X_train.select_dtypes(include=[np.number]).columns))
    
    feats_gen.fit(X_train.select_dtypes(include=[np.number]).values, 
                  y_train.astype(int).values)
    
    test_ = preprocessing_pipeline.transform(e.test)
    
    X_test = test_.X
    y_test = test_[test_.target_name].astype(int)
    
    roc_auc_score(y_test, feats_gen._best_estimator.predict_proba(feats_gen.transform(X_test.select_dtypes(include=[np.number]).values))[:, 1])
    

    Every time I run this code, I get different ROC AUC values in both training and test. I'm pretty sure preprocessing_pipeline is deterministic.

    opened by echo66 3
  • If original features are found by FEW, transform() method fails with TypeError

    If original features are found by FEW, transform() method fails with TypeError

    eg:

    print('Model: {}'.format(learner.print_model()))

    Model: original features

    Phi = learner.transform(X_test.values)


    TypeError Traceback (most recent call last) in ()
    ----> 1 Phi = learner.transform(X_test.values)

    ~/anaconda3/envs/ml/lib/python3.6/site-packages/FEW-0.0.38-py3.6-macosx-10.7-x86_64.egg/few/few.py in transform(self, x, inds, labels)
        395 # return np.asarray(Parallel(n_jobs=10)(delayed(self.out)(I,x,labels,self.otype) for I in self._best_inds)).transpose()
        396 return np.asarray(
    --> 397 [self.out(I,x,labels,self.otype) for I in self._best_inds]).transpose()
        398
        399

    TypeError: 'NoneType' object is not iterable
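
The traceback suggests that `_best_inds` is `None` when the model falls back to the original features, so iterating over it fails. One possible guard is sketched below. `TransformSketch` is a hypothetical stand-in, not Few's class, and returning the input unchanged in the fallback case is an assumption about the intended behavior.

```python
class TransformSketch:
    """Hypothetical stand-in for a fitted learner; not Few's actual class."""

    def __init__(self, best_inds=None):
        # None models the "original features" case from the report above.
        self._best_inds = best_inds

    def transform(self, x):
        if self._best_inds is None:
            # No engineered features were kept: pass the input through unchanged.
            return [list(row) for row in x]
        # Each individual is assumed to be a callable mapping a row to one value.
        return [[ind(row) for ind in self._best_inds] for row in x]


# Fallback case: the original features come back as-is.
passthrough = TransformSketch().transform([[1, 2], [3, 4]])

# Normal case: one engineered feature, the sum of both inputs.
engineered = TransformSketch([lambda row: row[0] + row[1]]).transform([[1, 2], [3, 4]])
```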

    opened by jay-reynolds 3
  • Revert "standard scaler pipeline"

    Reverts lacava/few#26

    This passes the tests but throws an error on the dataset I'm currently applying it to. The error comes in the fit method, line 314 in few.py:

    $python -m few.few ../../data/maize/d_maize-dent-tass.csv -p 100 -max_depth 3 -ms 25 --weight_parents 
    
    warning: ValueError in ml fit. X.shape: (100, 100) y_t shape: (146,)
    First ten entries X: [[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
       1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]]
    First ten entries y_t: [ 826.47  884.33  904.55  848.71  879.46  885.12  905.36  886.69  821.05
      912.51]
    equations: ['x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_16313', 'x_16313', 'x_16313', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55', 'x_16313', 'x_55', 'x_55', 'x_55', 'x_55', 'x_55']
    FEW parameters: {'min_depth': 1, 'ml': None, 'elitism': True, 'clean': False, 'erc': False, 'classification': False, 'crossover_rate': 0.5, 'c': True, 'op_weight': 1, 'scoring_function': None, 'population_size': '100', 'mdr': False, 'weight_parents': True, 'max_depth': 3, 'tourn_size': 2, 'mutation_rate': 0.5, 'max_depth_init': 2, 'random_state': None, 'max_stall': 25, 'otype': 'f', 'generations': 100, 'sel': 'epsilon_lexicase', 'verbosity': 1, 'track_diversity': False, 'seed_with_ml': True, 'disable_update_check': False, 'fit_choice': None, 'boolean': False}
    Traceback (most recent call last):
      File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 314, in fit
        self.ml.fit(self.X[self.valid_loc(),:].transpose(),y_t)
      File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/pipeline.py", line 270, in fit
        self._final_estimator.fit(Xt, y, **fit_params)
      File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 1141, in fit
        axis=0)(all_alphas)
      File "/home/bill/anaconda3/lib/python3.5/site-packages/scipy/interpolate/interpolate.py", line 483, in __init__
        "least %d entries" % minval)
    ValueError: x and y arrays must have at least 2 entries
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 906, in <module>
        main()
      File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 893, in main
        learner.fit(training_features, training_labels)
      File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 330, in fit
        raise(ValueError)
    ValueError
    
    opened by lacava 2
  • Recent scikit learn support

    I was trying to test the package, but ran into a few issues with a recent scikit-learn.

    Although not used, Parallel and delayed were imported. They are no longer in sklearn, so I replaced the import with joblib. Also, Imputer is now SimpleImputer and no longer lives in preprocessing but in impute.

    These changes allow for compatibility with scikit-learn 0.23
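
A version-tolerant import is one way to support both old and new scikit-learn releases at once. The sketch below is not the change made in this pull request; `first_importable` is a hypothetical helper, and the module paths are the old and new locations named above. It returns `None` when nothing can be imported, so it also runs on a machine without scikit-learn.

```python
import importlib


def first_importable(candidates):
    """Return the first attribute found among (module, name) pairs, else None."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment; try the next location
        if hasattr(module, attr):
            return getattr(module, attr)
    return None


# Newer location first, older location as a fallback.
SimpleImputer = first_importable([
    ("sklearn.impute", "SimpleImputer"),
    ("sklearn.preprocessing", "Imputer"),
])
Parallel = first_importable([
    ("joblib", "Parallel"),
    ("sklearn.externals.joblib", "Parallel"),
])
```

Note that the old `Imputer` and the newer `SimpleImputer` do not take identical arguments, so aliasing one to the other only helps when the call site sticks to options both accept.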

    opened by fdion 0
  • Issues with current ML validation score

    Hello,

    Thanks for the help so far. I was able to get the tool up and running in windows.

    However, 2 weird things I am observing.

    1. When I use Gradient Boost Regressor - my score gets worse by the generation even when I switched the scoring function sign. The first score is nearly my best score I have gotten by myself (no feature engineering done on data set).

    https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_GB.ipynb

    2. When I use Random Forest - same scorer - current ML validation score returns as 0 and runs really fast

    https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_RF.ipynb

    I think I am missing something on how to use this tool but no idea what. I am trying to use this in tandem with TPOT as I am exploring feature creation GA/GP based tools. Sincerely appreciate any advice/guidance you can provide.

    Sincerely, G

    opened by GinoWoz1 3
  • add encoding operators for GWAS

    add operators that re-encode input SNPs based on different encodings. include (add, dom, rec, het, sub-add, super-add). Need to resolve how underlying data would be represented; maybe assume the input is additive?
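
A rough sketch of what such re-encoding operators could look like, assuming the input is additively coded as minor-allele counts 0/1/2. The add, dom, rec, and het maps follow the usual definitions; the heterozygote weights used for sub-add and super-add (0.5 and 1.5) are illustrative assumptions, and `ENCODINGS` and `encode` are hypothetical names.

```python
# Maps from an additive genotype (0, 1 or 2 copies of the minor allele)
# to the re-encoded value.
ENCODINGS = {
    "add": {0: 0.0, 1: 1.0, 2: 2.0},        # additive: unchanged
    "dom": {0: 0.0, 1: 1.0, 2: 1.0},        # dominant: one copy is enough
    "rec": {0: 0.0, 1: 0.0, 2: 1.0},        # recessive: needs two copies
    "het": {0: 0.0, 1: 1.0, 2: 0.0},        # heterozygous only
    # Assumed weights: heterozygote below / above the additive midpoint.
    "sub-add": {0: 0.0, 1: 0.5, 2: 2.0},
    "super-add": {0: 0.0, 1: 1.5, 2: 2.0},
}


def encode(genotypes, scheme):
    """Re-encode a list of additive genotypes under the named scheme."""
    table = ENCODINGS[scheme]
    return [table[g] for g in genotypes]


snp = [0, 1, 2, 1]
dominant = encode(snp, "dom")
recessive = encode(snp, "rec")
```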

    enhancement 
    opened by lacava 0
  • low GPU utilization with tf option

    I'm getting low utilization of the GPU using the tensorflow evaluation strategy. There are a few things to try:

    • use this method to profile tensorflow and see where the inefficiencies lie.

    • according to this, using feed_dict is not a good idea. need to look into using pipelines or variables for feeding input data to the graphs.

    bug enhancement 
    opened by lacava 0
Releases(0.0.8)
Owner
William La Cava
Research associate at UPenn, developing ML for applications in biomedical informatics