Python package for concise, transparent, and accurate predictive modeling

Overview


Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use.

📚 docs • 📖 demo notebooks

Modern machine-learning models are increasingly complex, often making them difficult to interpret. This package provides a simple interface for fitting and using state-of-the-art interpretable models, all compatible with scikit-learn. These models can often replace black-box models (e.g. random forests) with simpler models (e.g. rule lists) while improving interpretability and computational efficiency, all without sacrificing predictive accuracy! Simply import a classifier or regressor and use the fit and predict methods, same as standard scikit-learn models.

from imodels import BoostedRulesClassifier, FIGSClassifier, SkopeRulesClassifier
from imodels import RuleFitRegressor, HSTreeRegressorCV, SLIMRegressor

model = BoostedRulesClassifier()  # initialize a model
model.fit(X_train, y_train)   # fit model
preds = model.predict(X_test) # predictions: shape is (n_test,)
preds_proba = model.predict_proba(X_test) # predicted probabilities: shape is (n_test, n_classes)
print(model) # print the rule-based model

-----------------------------
# the model consists of the following 3 rules
# if X1 > 5: then 80.5% risk
# else if X2 > 5: then 40% risk
# else: 10% risk
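
To make the printed output concrete, the three rules above can be read as an ordered if/else chain in which the first matching rule wins. The sketch below is an illustration only, not imodels code: `rule_list_risk` is a hypothetical helper, and the thresholds and risks are copied from the example printout.

```python
def rule_list_risk(x1: float, x2: float) -> float:
    """Evaluate the example rule list; the first matching rule wins."""
    if x1 > 5:
        return 0.805
    elif x2 > 5:
        return 0.40
    return 0.10


if __name__ == "__main__":
    # One sample per rule: first rule, second rule, default.
    for x1, x2 in [(6, 0), (1, 7), (1, 1)]:
        print(x1, x2, rule_list_risk(x1, x2))
```

Because the rules are ordered, a sample with both X1 > 5 and X2 > 5 is handled by the first rule only.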

Installation

Install with pip install imodels (see here for help).

Supported models

| Model | Reference | Description |
|---|---|---|
| Rulefit rule set | 🗂️, 🔗, 📄 | Fits a sparse linear model on rules extracted from decision trees |
| Skope rule set | 🗂️, 🔗 | Extracts rules from gradient-boosted trees, deduplicates them, then linearly combines them based on their OOB precision |
| Boosted rule set | 🗂️, 🔗, 📄 | Sequentially fits a set of rules with AdaBoost |
| Slipper rule set | 🗂️, 📄 | Sequentially learns a set of rules with SLIPPER |
| Bayesian rule set | 🗂️, 🔗, 📄 | Finds concise rule set with Bayesian sampling (slow) |
| Optimal rule list | 🗂️, 🔗, 📄 | Fits rule list using global optimization for sparsity (CORELS) |
| Bayesian rule list | 🗂️, 🔗, 📄 | Fits compact rule list distribution with Bayesian sampling (slow) |
| Greedy rule list | 🗂️, 🔗 | Uses CART to fit a list (only a single path), rather than a tree |
| OneR rule list | 🗂️, 📄 | Fits rule list restricted to only one feature |
| Optimal rule tree | 🗂️, 🔗, 📄 | Fits succinct tree using global optimization for sparsity (GOSDT) |
| Greedy rule tree | 🗂️, 🔗, 📄 | Greedily fits tree using CART |
| C4.5 rule tree | 🗂️, 🔗, 📄 | Greedily fits tree using C4.5 |
| Iterative random forest | 🗂️, 🔗, 📄 | Repeatedly fits random forest, giving features with high importance a higher chance of being selected |
| Sparse integer linear model | 🗂️, 📄 | Sparse linear model with integer coefficients |
| Greedy tree sums | 🗂️, 📄 | Sum of small trees with very few total rules (FIGS) |
| Hierarchical shrinkage wrapper | 🗂️, 📄 | Improves any tree-based model with ultra-fast, post-hoc regularization |
| Distillation wrapper | 🗂️ | Trains a black-box model, then distills it into an interpretable model |
| More models | ⌛ | (Coming soon!) Lightweight Rule Induction, MLRules, ... |

Legend: Docs 🗂️, Reference code implementation 🔗, Research paper 📄

What's the difference between the models?

Each of the above models ends up in one of the following forms, which aim to be simultaneously simple to understand and highly predictive:

Rule set • Rule list • Rule tree • Algebraic models

Different models and algorithms vary not only in their final form but also in different choices made during modeling, such as how they generate, select, and postprocess rules:

Rule candidate generation • Rule selection • Rule postprocessing

  • Ex. RuleFit vs. SkopeRules: RuleFit and SkopeRules differ only in the way they prune rules: RuleFit uses a linear model, whereas SkopeRules heuristically deduplicates rules sharing overlap.
  • Ex. Bayesian rule lists vs. greedy rule lists: Bayesian rule lists and greedy rule lists differ in how they select rules: Bayesian rule lists perform a global optimization over possible rule lists, while greedy rule lists pick splits sequentially to maximize a given criterion.
  • Ex. FPSkope vs. SkopeRules: FPSkope and SkopeRules differ only in the way they generate candidate rules: FPSkope uses FPgrowth, whereas SkopeRules extracts rules from decision trees.

Demo notebooks

Demos are contained in the notebooks folder.

  • Quickstart demo: Shows how to fit, predict, and visualize with different interpretable models
  • Quickstart colab demo: Shows how to fit, predict, and visualize with different interpretable models
  • Clinical decision rule notebook: Shows an example of using imodels for deriving a clinical decision rule
  • Posthoc analysis: We also include some demos of posthoc analysis, which occurs after fitting models: posthoc.ipynb shows different simple analyses to interpret a trained model, and uncertainty.ipynb contains basic code to get uncertainty estimates for a model

Support for different tasks

Different models support different machine-learning tasks. Current support is given below; each of these models can be imported directly from imodels (e.g. from imodels import RuleFitClassifier):

| Model | Binary classification | Regression | Notes |
|---|---|---|---|
| Rulefit rule set | RuleFitClassifier | RuleFitRegressor | |
| Skope rule set | SkopeRulesClassifier | | |
| Boosted rule set | BoostedRulesClassifier | | |
| SLIPPER rule set | SlipperClassifier | | |
| Bayesian rule set | BayesianRuleSetClassifier | | Fails for large problems |
| Optimal rule list (CORELS) | OptimalRuleListClassifier | | Requires corels, fails for large problems |
| Bayesian rule list | BayesianRuleListClassifier | | |
| Greedy rule list | GreedyRuleListClassifier | | |
| OneR rule list | OneRClassifier | | |
| Optimal rule tree (GOSDT) | OptimalTreeClassifier | | Requires gosdt, fails for large problems |
| Greedy rule tree (CART) | GreedyTreeClassifier | GreedyTreeRegressor | |
| C4.5 rule tree | C45TreeClassifier | | |
| Iterative random forest | IRFClassifier | | Requires irf |
| Sparse integer linear model | SLIMClassifier | SLIMRegressor | Requires extra dependencies for speed |
| Greedy tree sums (FIGS) | FIGSClassifier | FIGSRegressor | |
| Hierarchical shrinkage | HSTreeClassifierCV | HSTreeRegressorCV | Wraps any sklearn tree-based model |
| Distillation | | DistilledRegressor | Wraps any sklearn-compatible models |

Extras

  • Data-wrangling functions for working with popular tabular datasets (e.g. compas). These functions, in conjunction with imodels-data and imodels-experiments, make it simple to download data and run experiments on new models.
  • Explain classification errors with a simple posthoc function. Fit an interpretable model to explain a previous model's errors (ex. in this notebook 📓).
  • Fast and effective discretizers for data preprocessing.

| Discretizer | Reference | Description |
|---|---|---|
| MDLP | 🗂️, 🔗, 📄 | Discretize using entropy minimization heuristic |
| Simple | 🗂️, 🔗 | Simple KBins discretization |
| Random Forest | 🗂️ | Discretize into bins based on random forest split popularity |

  • Rule-based utils for customizing models: The code here contains many useful and customizable functions for rule-based learning in the [util folder](https://csinva.io/imodels/util/index.html). This includes functions / classes for rule deduplication, rule screening, and converting between trees, rulesets, and neural networks.

Our favorite models

After developing and playing with imodels, we developed a few new models to overcome limitations of existing interpretable models.

FIGS: Fast interpretable greedy-tree sums

📄 Paper, 🔗 Post, 📌 Citation

Fast Interpretable Greedy-Tree Sums (FIGS) is an algorithm for fitting concise rule-based models. Specifically, FIGS generalizes CART to simultaneously grow a flexible number of trees in a summation. The total number of splits across all the trees can be restricted by a pre-specified threshold, keeping the model interpretable. Experiments across a wide array of real-world datasets show that FIGS achieves state-of-the-art prediction performance when restricted to just a few splits (e.g. fewer than 20).

Example FIGS model. FIGS learns a sum of trees with a flexible number of trees; to make its prediction, it sums the result from each tree.
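
The prediction step described in the caption can be sketched in plain Python. This is a minimal illustration of summing per-tree outputs, not the FIGS implementation: the `Node` class, the `<=` split convention, and the two example trees are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """A split node (feature set) or a leaf (feature is None, value set)."""
    value: float = 0.0
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None   # taken when x[feature] <= threshold
    right: Optional["Node"] = None  # taken when x[feature] > threshold


def predict_tree(node: Node, x: Sequence[float]) -> float:
    """Follow splits until a leaf is reached and return its value."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def predict_sum(trees: Sequence[Node], x: Sequence[float]) -> float:
    """Tree-sum prediction: add up the contribution of every tree."""
    return sum(predict_tree(t, x) for t in trees)


# Two small hypothetical trees with three splits in total.
tree_a = Node(feature=0, threshold=5.0,
              left=Node(value=0.1),
              right=Node(feature=1, threshold=2.0,
                         left=Node(value=0.3), right=Node(value=0.6)))
tree_b = Node(feature=2, threshold=0.5,
              left=Node(value=0.0), right=Node(value=0.2))

print(predict_sum([tree_a, tree_b], [6.0, 3.0, 1.0]))
```

Counting splits across both trees (three here) is what the pre-specified threshold on total splits would limit.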

Hierarchical shrinkage: post-hoc regularization for tree-based methods

📄 Paper, 🔗 Post, 📌 Citation

Hierarchical shrinkage is an extremely fast post-hoc regularization method which works on any decision tree (or tree-based ensemble, such as Random Forest). It does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors (using a single regularization parameter). Experiments over a wide variety of datasets show that hierarchical shrinkage substantially increases the predictive performance of individual decision trees and decision-tree ensembles.
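
The shrinkage step can be sketched for a single root-to-leaf path. The sketch assumes the formula from the hierarchical shrinkage paper, in which each parent-to-child change in mean is divided by 1 + lam / N(parent); the function name, the path representation, and the example numbers are hypothetical and not taken from imodels.

```python
from typing import List, Tuple


def shrunk_prediction(path: List[Tuple[float, int]], lam: float) -> float:
    """Shrink a leaf prediction toward the means of its ancestors.

    path: (mean of y, number of samples) for each node from root to leaf.
    lam:  the single regularization parameter (lam = 0 means no shrinkage).
    """
    prediction = path[0][0]  # start from the root mean
    for (parent_mean, parent_n), (child_mean, _) in zip(path, path[1:]):
        # Each parent-to-child step is damped more when the parent is small.
        prediction += (child_mean - parent_mean) / (1.0 + lam / parent_n)
    return prediction


# Hypothetical root-to-leaf path: root, internal node, leaf.
path = [(0.50, 100), (0.70, 40), (0.90, 10)]
print(shrunk_prediction(path, lam=0.0))   # recovers the unshrunk leaf mean
print(shrunk_prediction(path, lam=50.0))  # pulled toward the ancestor means
```

Only the node values change; the splits themselves are untouched, which is why the method can be applied after fitting.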

References

Readings
  • Interpretable ML good quick overview: murdoch et al. 2019, pdf
  • Interpretable ML book: molnar 2019, pdf
  • Case for interpretable models rather than post-hoc explanation: rudin 2019, pdf
  • Review on evaluating interpretability: doshi-velez & kim 2017, pdf
Reference implementations (also linked above): The code here derives heavily from the wonderful work of previous projects. We seek to extract, unify, and maintain key parts of these projects.
Related packages
  • gplearn: symbolic regression/classification
  • pysr: fast symbolic regression
  • pygam: generalized additive models
  • interpretml: boosting-based gam
  • h2o ai: gams + glms (and more)
  • optbinning: data discretization / scoring models
Updates
  • For updates, star the repo, see this related repo, or follow @csinva_
  • Please make sure to give authors of original methods / base implementations appropriate credit!
  • Contributing: pull requests very welcome!

If it's useful for you, please star/cite the package, and make sure to give authors of original methods / base implementations credit:

@software{
    imodels2021,
    title        = {{imodels: a python package for fitting interpretable models}},
    journal      = {Journal of Open Source Software},
    publisher    = {The Open Journal},
    year         = {2021},
    author       = {Singh, Chandan and Nasseri, Keyan and Tan, Yan Shuo and Tang, Tiffany and Yu, Bin},
    volume       = {6},
    number       = {61},
    pages        = {3192},
    doi          = {10.21105/joss.03192},
    url          = {https://doi.org/10.21105/joss.03192},
}
Comments
  • Added Gini Importances

    Hi @csinva how do the new Gini importances look?

    I based the calculation off sklearn's code from here and here, though it needed to be made recursive as we do not have arrays of all the nodes and their properties.

    There is a demo of the new code in the FIGS_viz_demo.ipynb notebook. I am a bit concerned with the None impurity in the root node of the second tree:

    node_id: 0, left.node_id: 1, right.node_id: 2, impurity: None
    

    I filled it with 0 for the calculation for now:

                    importance_data_tree[node.feature] += (
                        np.sum(node.value_sklearn) * (node.impurity if node.impurity is not None else 0.) -
                        np.sum(node.left.value_sklearn) * node.left.impurity -
                        np.sum(node.right.value_sklearn) * node.right.impurity
                    )
    

    Is None expected if the tree has just one split?

    Also, after taking the mean and normalizing, most of the importances are negative. I think this is fine, as we just care about the relative order of the features, but wanted to get your opinion as well.
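
One alternative to filling a missing impurity with 0, sketched here as an assumption rather than as the code in this PR: recompute the Gini impurity from the node's class counts. With a fill value of 0 the node's own term vanishes, so that split's contribution can only be zero or negative, which may be one source of negative importances. The helper names and the example counts are hypothetical, and the sketch covers classification only.

```python
from typing import Optional, Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini impurity 1 - sum(p_k^2) from per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(counts: Sequence[float],
                      left_counts: Sequence[float],
                      right_counts: Sequence[float],
                      impurity: Optional[float] = None) -> float:
    """Weighted impurity decrease of one split.

    Falls back to the Gini computed from counts when impurity is None.
    """
    if impurity is None:
        impurity = gini(counts)
    return (sum(counts) * impurity
            - sum(left_counts) * gini(left_counts)
            - sum(right_counts) * gini(right_counts))


# Hypothetical node with 10 negatives and 10 positives, split perfectly.
print(impurity_decrease([10, 10], [10, 0], [0, 10]))  # 20 * 0.5 = 10.0
```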

    BTW I noticed that we have an unused variable in plot():

    criterion = "squared_error" if isinstance(self, RegressorMixin) else "gini"
    

    Is this needed for anything, or should we delete it?

    opened by mepland 15
  • Fixed FIGS plotting

    Fixed Issue 132, FIGS plots not appearing correctly.

    The primary bug was in the assignment of node ids here.

                right = next(node_counter)
                left = next(node_counter)
    

    They were being improperly set during the recursion of _update_node(nd). I've fixed this by assigning a new node_num variable after the trees are created during fit() here and using that instead:

            # add node_num to final tree
            for tree_ in self.trees_:
                node_counter = iter(range(0, int(1e06)))
                def _add_node_num(node: Node):
                    if node is None:
                        return
                    node.setattrs(node_num=next(node_counter))
                    _add_node_num(node.left)
                    _add_node_num(node.right)
    
                _add_node_num(tree_)
    

    I also took the opportunity to return a real sklearn DecisionTreeClassifier or DecisionTreeRegressor object, filling the parameters, including tree_, with the __setstate__() method, building on this SO question. In order to do this, I needed the impurity at each node and the "value" as expected by sklearn, i.e. value = np.array([neg_count, pos_count], dtype=float). If we further rewrite the FIGS class to save this 2D "value" alongside the current value, perhaps as value_sklearn, I wouldn't need X_train, y_train for the extract_sklearn_tree_from_figs function, and the subsequent plotting functions.

    @csinva does my implementation of the impurity variable look correct? I see the impurities are recomputed after I grab my impurity values, so I expect not. Perhaps you could fix this, or let me know the best way to get the final impurity at each node? I'll also wait for the go ahead on adding the value_sklearn variable, and refactoring away the dependence on X_train, y_train in the plotting functions.

    opened by mepland 8
  • 'BoostedRulesClassifier' object has no attribute 'complexity_'

    After imodels was updated to 1.3.8, we get the error message 'BoostedRulesClassifier' object has no attribute 'complexity_'. Was the attribute removed or renamed? It is generally better to keep public APIs/attributes unchanged during minor releases; is there any plan to add it back?

    opened by yinweisu 6
  • FIGS Fixes

    • Added SKompiler integration, which required the new n_features_in_ member variable.
      • Note the demo FIGS model currently requires https://github.com/mepland/SKompiler/tree/fixes to run which fixes a bug in SKompiler. TLDR SKompiler was not letting trees run if they use less than all the available features, like the demo FIGS tree 0.
    • Fixed bug in n_features
    -    n_features = np.unique(features[np.where( 0 < features )]).size
    +    n_features = np.unique(features[np.where( 0 <= features )]).size
    
    • Improved markdown comments in FIGS_viz_demo.ipynb
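
Why the comparison in the n_features fix matters: in sklearn's tree arrays, leaves are marked with a negative feature index (-2), while 0 is a valid index for the first feature, so `0 < features` silently drops feature 0. A pure-Python sketch of the same count, where `count_used_features` and the example array are hypothetical:

```python
def count_used_features(features):
    """Count distinct feature indices used by splits (leaves are negative)."""
    return len({f for f in features if f >= 0})


# Hypothetical per-node feature array: splits on features 0 and 3, three leaves.
features = [0, 3, -2, -2, -2]
print(count_used_features(features))        # 2
print(len({f for f in features if f > 0}))  # 1: the old check misses feature 0
```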
    opened by mepland 4
  • HSTree Multiclass Classification Support

    Does HSTree support multiclass classification problems with RandomForest / ExtraTrees as the estimator?

    From my initial tests it appears buggy. Calling predict_proba with the final model results in lots of NaN predictions, along with warnings during training such as:

    /Users/neerick/workspace/virtual/autogluon/lib/python3.8/site-packages/imodels/tree/hierarchical_shrinkage.py:87: RuntimeWarning: invalid value encountered in double_scalars
      val = tree.value[i][0, 1] / (tree.value[i][0, 0] + tree.value[i][0, 1])  # binary classification
    

    If helpful I can try to create a reproducible example.
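
A possible explanation, offered as an assumption and not confirmed in this thread: the quoted line reads only the first two class counts, so in a multiclass node that contains neither class 0 nor class 1 it evaluates 0 / 0. The sketch below normalizes over all classes and guards the empty case; `class_probabilities` is a hypothetical helper, not part of imodels.

```python
from typing import List, Sequence


def class_probabilities(counts: Sequence[float]) -> List[float]:
    """Turn per-class counts at a node into probabilities over all classes."""
    total = sum(counts)
    if total == 0:
        # No samples recorded: fall back to a uniform distribution, not NaN.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


# A three-class node holding only class-2 samples.
counts = [0.0, 0.0, 7.0]
# The binary formula, counts[1] / (counts[0] + counts[1]), would be 0 / 0 here.
print(class_probabilities(counts))  # [0.0, 0.0, 1.0]
```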

    Here is an example result comparing with sklearn default RF (_og_) with accuracy metric. Because HSTree returns many NaN predictions, the scores are very low.

    One observation is the scores get worse the more trees there are in HSTree forests. I'd guess the likelihood of returning a NaN result is increasing with the number of trees.

                           model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
    0       RandomForest_og_n300    0.711651   0.723618        0.985573       0.050956  0.519926                 0.985573                0.050956           0.519926            1       True          1
    1       RandomForest_og_n100    0.710154   0.748744        0.453769       0.019050  0.170951                 0.453769                0.019050           0.170951            1       True          2
    2        WeightedEnsemble_L2    0.710154   0.748744        0.464755       0.019376  0.295161                 0.010986                0.000326           0.124210            2       True         36
    3        RandomForest_og_n40    0.700636   0.698492        0.193009       0.010738  0.088012                 0.193009                0.010738           0.088012            1       True          3
    4        RandomForest_og_n20    0.692039   0.698492        0.103616       0.007549  0.057396                 0.103616                0.007549           0.057396            1       True          4
    5        RandomForest_og_n10    0.674165   0.688442        0.075296       0.006166  0.041720                 0.075296                0.006166           0.041720            1       True          5
    6     RandomForest_hs=10_n10    0.521949   0.537688        0.070260       0.005246  0.082384                 0.070260                0.005246           0.082384            1       True         15
    7     RandomForest_hs=50_n10    0.520839   0.517588        0.075151       0.004875  0.071219                 0.075151                0.004875           0.071219            1       True         20
    8    RandomForest_hs=0.1_n10    0.520796   0.537688        0.074070       0.005233  0.093299                 0.074070                0.005233           0.093299            1       True         35
    9      RandomForest_hs=1_n10    0.520692   0.542714        0.077687       0.005690  0.075061                 0.077687                0.005690           0.075061            1       True         10
    10   RandomForest_hs=100_n10    0.519246   0.517588        0.075059       0.006019  0.082536                 0.075059                0.006019           0.082536            1       True         25
    11   RandomForest_hs=500_n10    0.488877   0.517588        0.072145       0.005125  0.072223                 0.072145                0.005125           0.072223            1       True         30
    12     RandomForest_hs=1_n20    0.485125   0.472362        0.113002       0.006484  0.123639                 0.113002                0.006484           0.123639            1       True          9
    13   RandomForest_hs=0.1_n20    0.485005   0.472362        0.111342       0.005953  0.146246                 0.111342                0.005953           0.146246            1       True         34
    14    RandomForest_hs=10_n20    0.484833   0.482412        0.104076       0.006577  0.131909                 0.104076                0.006577           0.131909            1       True         14
    15    RandomForest_hs=50_n20    0.482896   0.482412        0.115057       0.006263  0.130512                 0.115057                0.006263           0.130512            1       True         19
    16   RandomForest_hs=100_n20    0.480840   0.482412        0.108625       0.006045  0.135224                 0.108625                0.006045           0.135224            1       True         24
    17   RandomForest_hs=500_n20    0.458035   0.467337        0.108658       0.006302  0.123907                 0.108658                0.006302           0.123907            1       True         29
    18     RandomForest_hs=1_n40    0.451434   0.467337        0.185129       0.010619  0.210639                 0.185129                0.010619           0.210639            1       True          8
    19   RandomForest_hs=0.1_n40    0.451382   0.467337        0.170597       0.009024  0.244322                 0.170597                0.009024           0.244322            1       True         33
    20    RandomForest_hs=10_n40    0.451322   0.467337        0.173382       0.009955  0.210795                 0.173382                0.009955           0.210795            1       True         13
    21    RandomForest_hs=50_n40    0.450350   0.467337        0.170041       0.008673  0.236081                 0.170041                0.008673           0.236081            1       True         18
    22   RandomForest_hs=100_n40    0.449119   0.467337        0.169396       0.010918  0.226784                 0.169396                0.010918           0.226784            1       True         23
    23   RandomForest_hs=500_n40    0.435832   0.472362        0.162881       0.009256  0.202447                 0.162881                0.009256           0.202447            1       True         28
    24    RandomForest_hs=1_n100    0.420419   0.452261        0.442328       0.017688  0.480776                 0.442328                0.017688           0.480776            1       True          7
    25  RandomForest_hs=0.1_n100    0.420411   0.452261        0.354523       0.018247  0.548557                 0.354523                0.018247           0.548557            1       True         32
    26   RandomForest_hs=10_n100    0.419981   0.452261        0.355097       0.017487  0.469547                 0.355097                0.017487           0.469547            1       True         12
    27   RandomForest_hs=50_n100    0.419034   0.447236        0.344341       0.021125  0.465810                 0.344341                0.021125           0.465810            1       True         17
    28  RandomForest_hs=100_n100    0.418672   0.447236        0.372041       0.018402  0.477048                 0.372041                0.018402           0.477048            1       True         22
    29  RandomForest_hs=500_n100    0.415256   0.457286        0.338696       0.017128  0.492786                 0.338696                0.017128           0.492786            1       True         27
    30  RandomForest_hs=0.1_n300    0.381049   0.391960        0.967061       0.045552  1.533075                 0.967061                0.045552           1.533075            1       True         31
    31   RandomForest_hs=10_n300    0.381049   0.391960        1.109062       0.054005  1.442369                 1.109062                0.054005           1.442369            1       True         11
    32    RandomForest_hs=1_n300    0.381040   0.391960        1.677277       0.055421  2.346773                 1.677277                0.055421           2.346773            1       True          6
    33   RandomForest_hs=50_n300    0.380945   0.391960        0.889030       0.053650  1.320377                 0.889030                0.053650           1.320377            1       True         16
    34  RandomForest_hs=100_n300    0.380885   0.391960        1.031198       0.045266  1.254918                 1.031198                0.045266           1.254918            1       True         21
    35  RandomForest_hs=500_n300    0.380816   0.391960        0.948715       0.050209  1.266396                 0.948715                0.050209           1.266396            1       True         26
    
    
    enhancement 
    opened by Innixma 4
  • Two exactly same rules by RulefitClassifier

    Hello~

    When I use the RulefitClassifier, it returns two exactly identical rules with different coefficients. Does the internal structure fail to aggregate duplicate rules? I have also tried using Rulefit directly, and it does not seem to have this problem.

    The following image is part of my result.

    bug 
    opened by Yannahhh 4
  • BoostedRulesClassifier sometimes throws an exception

    Hi,

    When I use the BoostedRulesClassifier, it sometimes throws an exception as follows:

    This BoostedRulesClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

    I find that the exception results from the implementation of the class RuleSet:

    def _eval_weighted_rule_sum(self, X) -> np.ndarray:

        check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders'])

        X = check_array(X)

        if X.shape[1] != self.n_features_:
            raise ValueError("X.shape[1] = %d should be equal to %d, the number of features at training time."
                             " Please reshape your data."
                             % (X.shape[1], self.n_features_))

        df = pd.DataFrame(X, columns=self.feature_placeholders)
        selected_rules = self.rules_without_feature_names_

        scores = np.zeros(X.shape[0])
        for r in selected_rules:
            features_r_uses = list(map(lambda x: x[0], r.agg_dict.keys()))
            scores[df[features_r_uses].query(str(r)).index.values] += r.args[0]

        return scores
    

    Specifically, when the computer runs the check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders']), it finds that self.rules_without_feature_names_ does not exist, so the computer throws the above exception.

    Reviewing my code and data set further, I found that my training set is easy to classify, so the training error of the estimator is close to zero. This may trigger a bug in the fit function of the class BoostedRulesClassifier:

        for _ in range(self.n_estimators):
            # Fit a classifier with the specific weights
            clf = self.estimator()
            clf.fit(X, y, sample_weight=w)  # uses w as the sampling weight!
            preds = clf.predict(X)
            self.estimator_mean_prediction_.append(np.mean(preds))  # just for printing

            # Indicator function
            miss = preds != y

            # Equivalent with 1/-1 to update weights
            miss2 = np.ones(miss.size)
            miss2[~miss] = -1

            # Error
            err_m = np.dot(w, miss) / sum(w)

            if err_m < 1e-3:
                return self

            # Alpha
            alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))

            # New weights
            w = np.multiply(w, np.exp([float(x) * alpha_m for x in miss2]))

            self.estimators_.append(deepcopy(clf))
            self.estimator_weights_.append(alpha_m)
            self.estimator_errors_.append(err_m)

        rules = []

    Because err_m is zero, fit returns self directly without executing the subsequent statements; in that case, self.rules_without_feature_names_ does not exist.

    My current solution to this bug is to modify the following code fragment in the fit function of the class BoostedRulesClassifier:

            # Error
            err_m = np.dot(w, miss) / sum(w)

            # modification ###########################
            if err_m < 1e-3:
                # return self
                w = np.ones(miss.size) / len(y)
                self.estimators_.append(deepcopy(clf))
                self.estimator_weights_.append(float("inf"))
                self.estimator_errors_.append(err_m)
                break
            ####################################
            # Alpha
            alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))

    I'm not sure whether it may introduce new defects, but it does solve the exception.
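
An alternative to storing an infinite weight, sketched as an assumption and not as the maintainers' fix: clamp the error away from 0 before computing alpha, so a perfect estimator is still kept with a large but finite weight. The function name and the `eps` default are hypothetical.

```python
import math


def adaboost_alpha(err: float, eps: float = 1e-10) -> float:
    """Estimator weight 0.5 * log((1 - err) / err), with err clamped to [eps, 1 - eps]."""
    err = min(max(err, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - err) / err)


print(adaboost_alpha(0.0))   # large but finite
print(adaboost_alpha(0.25))  # 0.5 * log(3)
```

A finite weight keeps any later weighted sum over estimators well defined, which an infinite weight would not.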

    opened by Wan-xiaohui 3
  • GreedyRuleListClassifier has wildly varying performance and sometimes crashes

    When running a certain number of experiments with different splits of a given dataset, I see that GreedyRuleListClassifier's accuracy wildly varies, and sometimes the code (see for loop below) crashes.

    So, for example running 10 experiments like this, with different random splits of the same set:

    import sklearn.datasets
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    from imodels import GreedyRuleListClassifier

    X, Y = sklearn.datasets.load_breast_cancer(as_frame=True, return_X_y=True)

    model = GreedyRuleListClassifier(max_depth=10)

    for i in range(10):
        try:
            # a new random split on every iteration (no fixed seed)
            X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
            model.fit(X_train, y_train, feature_names=X_train.columns)
            y_pred = model.predict(X_test)
            score = accuracy_score(y_test.values, y_pred)
            print('Accuracy:\n', score)
        except KeyError as e:
            print("Failed with KeyError")
    

    Will give as output something along the lines of

    Accuracy: 0.6081871345029239
    Failed with KeyError
    Accuracy: 0.4619883040935672
    Accuracy: 0.45614035087719296
    Accuracy: 0.2222222222222222
    Failed with KeyError
    Failed with KeyError
    Failed with KeyError
    Accuracy: 0.18128654970760233
    Failed with KeyError
    

    Is this intended behavior? While my test dataset is smallish, the variation in accuracy is still surprising to me, and so is the KeyError. I'm using scikit-learn==1.0.2 and imodels==1.3.6 and can edit the issue here to add more details.

    Incidentally, the same behaviour was observed in https://datascience.stackexchange.com/a/116283/50519, noticed by @jonnor.

    Thanks!

    opened by davidefiocco 3
  • Issue with feature_names in GreedyRuleListClassifier

    When I pass feature_names=X.columns, only the first feature appears by name in the rule list; the others appear as feat i. I am unable to fix this and would appreciate your support.

    Here is the output snippet:

    Selected features: Index(['Processor(P99)_Q', 'Opto(F99)_Q', 'Logic(L99)_Am', 'Qualcom', 'Toshiba', 'ABB', 'Whirlpool', 'Honeywell'], dtype='object')
    mean 0.6 (30 pts)
    if Whirlpool >= 153 then 1.0 (16 pts)
    mean 0.143 (14 pts)
    if feat 1 >= 16882885 then 1.0 (2 pts)
    mean 0.0 (12 pts)

    opened by pauldebdeep9 3
  • Complexity comparisons

    • Compare all models on several UCI datasets
    • Generate complexity-accuracy plots for each model
    • Cache comparison results for convenience
    • Set self.complexity when fitting models
    opened by keyan3 3
  • Test fixes

    • Fixed an issue where the GitHub build would pass even if the tests actually failed
    • Added missing random seeding in Skope

    I skipped testing predict_proba for Skope altogether. The thought behind this is that even if you write a predict_proba that uses eval_weighted_rule_sum, it still won't match the predictions, since Skope predicts based only on whether the score is positive or not. I'm not sure if our Skope needs to have this method at all (the original Skope implementation doesn't).

    opened by keyan3 3
  • FIGS Demo Notebook Update

    @csinva let's wait on merging this for a few weeks, until both imodels and dtreeviz release new minor versions. I have a few changes I want to make then:

    • Remove path to ~/imodels
    • Use 'leaftype': 'barh'
    • Update color scheme
    • Possibly add numeric leaf predictions and split visualizations
    opened by mepland 0
  • Full sample_weight support for FIGS

    Some parts of FIGS do not support sample_weight including the extract_sklearn_tree_from_figs() function and feature_importances_.

    Originally posted by @mepland in https://github.com/csinva/imodels/issues/89#issuecomment-1367595878

    opened by mepland 0
  • Implement Dynamic CDI

    Implement a Dynamic CDI class based on FIGS.

    TODOs:

    • [ ] Implement an sklearn-compatible class named D-FIGS in a new file imodels/tree/dynamic_figs.py
    • [ ] Write a test using the PECARN IAI dataset

    More details:

    • The D-FIGS class should inherit from the FIGS class and take an additional dictionary at initialization that specifies the feature phases. When applying the fit or predict methods, the class should verify that the matrix $X$ is compatible with the feature tiers. For example, phase 2 features can be available (not NA) only if all phase 1 features are available (we may refine this logic later).
    • D-FIGS should infer the phase from the matrix.
    • The tests should be written in a new file named imodels/tests/dynamic_figs_test.py, using pytest (see the package documentation, or use the FIGS test as a reference).
    • Before you start writing code, please write down a short description detailing how you are going to implement the dynamic fitting algorithm. Specifically: How does the model infer the current phase of the patient? How do you store the different models for different phases and ensure these are compatible with one another?
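The compatibility check and phase inference described in this list could be sketched roughly as follows. Everything here is an assumption made for illustration: the `phases` dictionary layout (phase number mapped to column indices), the use of None/NaN for missing values, and the function names `is_missing` and `infer_phase`. It is not an existing imodels API.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_phase(row, phases):
    """Return the highest fully observed phase for one sample (0 if none).

    `phases` maps phase number -> list of column indices (hypothetical layout).
    Raises ValueError if a later-phase feature is present while an earlier
    phase is incomplete.
    """
    current = 0
    blocked = False  # becomes True once an incomplete phase is seen
    for phase in sorted(phases):
        values = [row[i] for i in phases[phase]]
        complete = not any(is_missing(v) for v in values)
        any_present = any(not is_missing(v) for v in values)
        if blocked and any_present:
            raise ValueError(
                f"phase {phase} has values but an earlier phase is incomplete"
            )
        if complete and not blocked:
            current = phase
        else:
            blocked = True
    return current


phases = {1: [0, 1], 2: [2, 3]}
nan = float("nan")
phase_a = infer_phase([1.0, 2.0, nan, nan], phases)  # only phase 1 observed
phase_b = infer_phase([1.0, 2.0, 3.0, 4.0], phases)  # both phases observed
print(phase_a, phase_b)
```

A fit or predict method could call such a check on every row of $X$ and then dispatch to the model stored for the inferred phase; how those per-phase models are stored is left open here, as in the description above.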

    @aagarwal1996

    opened by OmerRonen 1
  • Add support for `dtreeviz` visualizations

    Add any required translation code to allow imodels trees to be plotted with dtreeviz. This basically boils down to successfully generating a ShadowDecTree object from an imodels tree.

    We can reuse the existing ShadowSKDTree constructor by converting imodels trees into sklearn objects, then calling:

    sk_dtree = ShadowSKDTree(tree_classifier, X, y, features, target, [0, 1])
    

    Alternatively, we can make an imodels specific implementation of ShadowDecTree, similar to the sklearn implementation here, but that may be more work than necessary.

    opened by mepland 0
  • RuleFitClassifier(tree_generator = GradientBoostingClassifier()) not working as per documentation

    Hi,

    When I pass a GradientBoostingClassifier() that was fitted and optimized separately through the scikit-learn API as RuleFitClassifier(tree_generator = GradientBoostingClassifier()), fitting the RuleFitClassifier raises the following error:

    ValueError: n_estimators=1 must be larger or equal to estimators_.shape[0]=100 when warm_start==True

    Inspecting what is inside RuleFitClassifier(tree_generator = GradientBoostingClassifier()) after fitting the model shows that the GradientBoostingClassifier() no longer has the parameters I optimized beforehand; it has become GradientBoostingClassifier(max_leaf_nodes=4, n_estimators=1, random_state=0, warm_start=True). I am not sure why these GradientBoostingClassifier() parameters are changed inside the RuleFitClassifier() object.

    If RuleFitClassifier(tree_generator = None), everything works well.
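The ValueError itself can be reproduced with scikit-learn alone. The sketch below assumes, based on the parameters reported above, that n_estimators=1 and warm_start=True were set on an estimator that had already been fitted with 100 trees; the synthetic data is made up. It also shows that an unfitted copy made with `sklearn.base.clone` does not hit this particular error. Whether passing such a copy resolves the RuleFitClassifier behaviour is not tested here.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# An estimator that was fitted (and possibly tuned) beforehand.
fitted_gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mimic the reported parameters being applied to the already fitted object.
fitted_gb.set_params(n_estimators=1, warm_start=True)
try:
    fitted_gb.fit(X, y)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# An unfitted clone carries the same parameters but no fitted trees,
# so there are no existing estimators to conflict with n_estimators=1.
fresh_gb = clone(fitted_gb)
fresh_gb.fit(X, y)
n_trees = fresh_gb.estimators_.shape[0]
print(n_trees)
```

With warm_start=True, scikit-learn keeps the existing trees and only adds new ones, which is why asking for fewer estimators than are already fitted is rejected.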

    As per documentation:

    tree_generator : Optional: this object will be used as provided to generate the rules. This will override almost all the other properties above. Must be GradientBoostingRegressor(), GradientBoostingClassifier(), or RandomForestRegressor()

    • Which properties of RuleFitClassifier() are overridden when tree_generator=GradientBoostingClassifier()?
    • Why does this behavior occur?

    The closest solution I found is in Issue #34, but the behavior is still not clear.

    Any help will be highly appreciated.

    Many thanks!

    opened by Manuelhrokr 0
Releases(v1.3.11)
Owner
Chandan Singh
Working on interpretable machine learning across domains 🧠⚕️🦠 Let's do good with models.
Unofficial pytorch implementation of the paper "Context Reasoning Attention Network for Image Super-Resolution (ICCV 2021)"

CRAN Unofficial pytorch implementation of the paper "Context Reasoning Attention Network for Image Super-Resolution (ICCV 2021)" This code doesn't exa

4 Nov 11, 2021
This project used bitcoin, S&P500, and gold to construct an investment portfolio that aimed to minimize risk by minimizing variance.

minvar_invest_portfolio This project used bitcoin, S&P500, and gold to construct an investment portfolio that aimed to minimize risk by minimizing var

1 Jan 06, 2022
A collection of Machine Learning Models To Web Api which are built on open source technologies/frameworks like Django, Flask.

Author Ibrahim Koné From-Machine-Learning-Models-To-WebAPI A collection of Machine Learning Models To Web Api which are built on open source technolog

Ibrahim Koné 2 May 24, 2022
Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

GENDIS GENetic DIscovery of Shapelets In the time series classification domain, shapelets are small subseries that are discriminative for a certain cl

IDLab Services 90 Oct 28, 2022
Markov bot - A Writing bot based on Markov Chain for Data Structure Lab

A writing bot based on Markov chains. Front end: built with html/css. Demo: shows the corresponding output for the given text, i.e. the result of training on a corpus supplied by the user. Back end: several interfaces to implement: parse

DysprosiumDy 9 May 05, 2022
Adversarial Framework for (non-) Parametric Image Stylisation Mosaics

Fully Adversarial Mosaics (FAMOS) Pytorch implementation of the paper "Copy the Old or Paint Anew? An Adversarial Framework for (non-) Parametric Imag

Zalando Research 120 Dec 24, 2022
Client - 🔥 A tool for visualizing and tracking your machine learning experiments

Weights and Biases Use W&B to build better models faster. Track and visualize all the pieces of your machine learning pipeline, from datasets to produ

Weights & Biases 5.2k Jan 03, 2023
Firebase + Cloudrun + Machine learning

A simple end to end consumer lending decision engine powered by Google Cloud Platform (firebase hosting and cloudrun)

Emmanuel Ogunwede 8 Aug 16, 2022
Implementation of K-Nearest Neighbors Algorithm Using PySpark

KNN With Spark Implementation of KNN using PySpark. The KNN was used on two separate datasets (https://archive.ics.uci.edu/ml/datasets/iris and https:

Zachary Petroff 4 Dec 30, 2022
BudouX is the successor to Budou, the machine learning powered line break organizer tool.

BudouX Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool. It is standalone

Google 868 Jan 05, 2023
scikit-fem is a lightweight Python 3.7+ library for performing finite element assembly.

scikit-fem is a lightweight Python 3.7+ library for performing finite element assembly. Its main purpose is the transformation of bilinear forms into sparse matrices and linear forms into vectors.

Tom Gustafsson 297 Dec 13, 2022
Distributed scikit-learn meta-estimators in PySpark

sk-dist: Distributed scikit-learn meta-estimators in PySpark What is it? sk-dist is a Python package for machine learning built on top of scikit-learn

Ibotta 282 Dec 09, 2022
Made in collaboration with Chris George for Art + ML Spring 2019.

Deepdream Eyes Made in collaboration with Chris George for Art + ML Spring 2019.

Francisco Cabrera 1 Jan 12, 2022
The project's goal is to show a real world application of image segmentation using k means algorithm

The project's goal is to show a real world application of image segmentation using k means algorithm

2 Jan 22, 2022
A Python implementation of FastDTW

fastdtw Python implementation of FastDTW [1], which is an approximate Dynamic Time Warping (DTW) algorithm that provides optimal or near-optimal align

tanitter 651 Jan 04, 2023
Crypto-trading - ML techniques are used to forecast short term returns in 14 popular cryptocurrencies

Crypto-trading - ML techniques are used to forecast short term returns in 14 popular cryptocurrencies. We have amassed a dataset of millions of rows of high-frequency market data dating back to 2018 w

Panagiotis (Panos) Mavritsakis 4 Sep 22, 2022
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022
Automated machine learning: Review of the state-of-the-art and opportunities for healthcare

Automated machine learning: Review of the state-of-the-art and opportunities for healthcare

42 Dec 23, 2022
CorrProxies - Optimizing Machine Learning Inference Queries with Correlative Proxy Models

CorrProxies - Optimizing Machine Learning Inference Queries with Correlative Proxy Models

ZhihuiYangCS 8 Jun 07, 2022
UpliftML: A Python Package for Scalable Uplift Modeling

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base l

Booking.com 254 Dec 31, 2022