High performance Python GLMs with all the features!

Overview

glum

Generalized linear models (GLMs) are a core statistical tool that includes many common methods, like least-squares regression, Poisson regression and logistic regression, as special cases. At QuantCo, we have used GLMs in e-commerce pricing, insurance claims prediction and more. We have developed glum, a fast Python-first GLM library. The development was based on a fork of scikit-learn, so it has a scikit-learn-like API. We are thankful for the starting point provided by Christian Lorentzen in that PR!

glum is at least as feature-complete as existing GLM libraries like glmnet or h2o. It supports

  • Built-in cross validation for optimal regularization, efficiently exploiting a “regularization path”
  • L1 regularization, which produces sparse and easily interpretable solutions
  • L2 regularization, including variable matrix-valued (Tikhonov) penalties, which are useful in modeling correlated effects
  • Elastic net regularization
  • Normal, Poisson, logistic, gamma, and Tweedie distributions, plus varied and customizable link functions
  • Box constraints, linear inequality constraints, sample weights, offsets
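
As a rough illustration of how the L1, L2 and elastic net options relate, the sketch below computes a combined penalty from `alpha` and `l1_ratio`, the two arguments used in the housing example further down. It assumes the scikit-learn-style parametrization; the function name `elastic_net_penalty` is hypothetical and is not part of glum.

```python
import numpy as np


def elastic_net_penalty(coef, alpha, l1_ratio):
    """Elastic net penalty under the assumed scikit-learn-style convention.

    l1_ratio=1 gives a pure L1 (lasso) penalty, l1_ratio=0 a pure L2 (ridge)
    penalty, and values in between mix the two.
    """
    coef = np.asarray(coef, dtype=float)
    l1_term = np.sum(np.abs(coef))
    l2_term = 0.5 * np.sum(coef ** 2)
    return alpha * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)


coef = [1.0, -2.0, 0.0]
lasso_penalty = elastic_net_penalty(coef, alpha=0.1, l1_ratio=1.0)
ridge_penalty = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.0)
mixed_penalty = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.5)
```

The matrix-valued (Tikhonov) variant mentioned above would replace the plain sum of squares with a quadratic form; that case is not shown here.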

This repo also includes tools for benchmarking GLM implementations in the glum_benchmarks module. For details on the benchmarking, see here. Although the performance of glum relative to glmnet and h2o depends on the specific problem, we find that it is consistently much faster for a wide range of problems.

For more information on glum, including tutorials and API reference, please see the documentation.

Why did we choose the name glum? We wanted a name that had the letters GLM and wasn't easily confused with any existing implementation. And we thought glum sounded like a funny name (and not glum at all!). If you need a more professional sounding name, feel free to pronounce it as G-L-um. Or maybe it stands for "Generalized linear... ummm... modeling?"

A classic example predicting housing prices

>>> from sklearn.datasets import fetch_openml
>>> from glum import GeneralizedLinearRegressor
>>>
>>> # This dataset contains house sale prices for King County, which includes
>>> # Seattle. It includes homes sold between May 2014 and May 2015.
>>> house_data = fetch_openml(name="house_sales", version=3, as_frame=True)
>>>
>>> # Use only select features
>>> X = house_data.data[
...     [
...         "bedrooms",
...         "bathrooms",
...         "sqft_living",
...         "floors",
...         "waterfront",
...         "view",
...         "condition",
...         "grade",
...         "yr_built",
...         "yr_renovated",
...     ]
... ].copy()
>>>
>>>
>>> # Model whether a house had an above or below median price via a Binomial
>>> # distribution. We'll be doing L1-regularized logistic regression.
>>> price = house_data.target
>>> y = (price < price.median()).values.astype(int)
>>> model = GeneralizedLinearRegressor(
...     family='binomial',
...     l1_ratio=1.0,
...     alpha=0.001
... )
>>>
>>> _ = model.fit(X=X, y=y)
>>>
>>> # get_formatted_diagnostics returns details about the steps taken by the iterative solver
>>> diags = model.get_formatted_diagnostics(full_report=True)
>>> diags[['objective_fct']]
        objective_fct
n_iter               
0            0.693091
1            0.489500
2            0.449585
3            0.443681
4            0.443498
5            0.443497

Installation

Please install the package through conda-forge:

conda install glum -c conda-forge
Comments
  • [Critical] Benchmarks on various data sets

    Based on 091fe022af7f8bd2a05210de6cc42bc1030fbb93

    Machine: r5.4xlarge (16 vCPUs | 128 GB RAM)

    I ran the sklearn_fork using

    • all available data sets
    • for 1M observations and 18M observations
    • dense vs sparse
    • on different numbers of threads
    • lasso and elnet

    Results

                                                                        n_iter    runtime  intercept obj_val rel_obj_val
    problem                      num_rows storage threads library                                                       
    narrow_insurance_l2_poisson  1000000  dense   1       sklearn_fork       5     2.8577    -3.8547  0.3192           0
                                                  2       sklearn_fork       5     2.1641    -3.8547  0.3192           0
                                                  4       sklearn_fork       5     1.8084    -3.8547  0.3192           0
                                                  8       sklearn_fork       5     1.6458    -3.8547  0.3192           0
                                                  16      sklearn_fork       5     1.6295    -3.8547  0.3192           0
                                          sparse  1       sklearn_fork       5     2.4790    -3.8547  0.3192           0
                                                  2       sklearn_fork       5     1.9025    -3.8547  0.3192           0
                                                  4       sklearn_fork       5     1.5203    -3.8547  0.3192           0
                                                  8       sklearn_fork       5     2.4176    -3.8547  0.3192           0
                                                  16      sklearn_fork       5     2.4066    -3.8547  0.3192           0
                                 18000000 dense   1       sklearn_fork       5    52.9698    -3.8305  0.3189           0
                                                  2       sklearn_fork       5    40.7352    -3.8305  0.3189           0
                                                  4       sklearn_fork       5    33.9417    -3.8305  0.3189           0
                                                  8       sklearn_fork       5    30.6861    -3.8305  0.3189           0
                                                  16      sklearn_fork       5    30.5071    -3.8305  0.3189           0
                                          sparse  1       sklearn_fork       5    53.0467    -3.8305  0.3189           0
                                                  2       sklearn_fork       5    42.5917    -3.8305  0.3189           0
                                                  4       sklearn_fork       5    36.1416    -3.8305  0.3189           0
                                                  8       sklearn_fork       5    39.9059    -3.8305  0.3189           0
                                                  16      sklearn_fork       5    46.8101    -3.8305  0.3189           0
    narrow_insurance_net_poisson 1000000  dense   1       sklearn_fork       9     5.2903    -3.7820  0.3199           0
                                                  2       sklearn_fork       9     4.0050    -3.7820  0.3199           0
                                                  4       sklearn_fork       9     3.3740    -3.7820  0.3199           0
                                                  8       sklearn_fork       9     3.0434    -3.7820  0.3199           0
                                                  16      sklearn_fork       9     3.0169    -3.7820  0.3199           0
                                          sparse  1       sklearn_fork       9     4.4610    -3.7820  0.3199           0
                                                  2       sklearn_fork       9     3.4356    -3.7820  0.3199           0
                                                  4       sklearn_fork       9     2.6984    -3.7820  0.3199           0
                                                  8       sklearn_fork       9     3.6768    -3.7820  0.3199           0
                                                  16      sklearn_fork       9     4.3575    -3.7820  0.3199           0
                                 18000000 dense   1       sklearn_fork       9    98.1811    -3.7678  0.3196           0
                                                  2       sklearn_fork       9    74.8176    -3.7678  0.3196           0
                                                  4       sklearn_fork       9    63.2101    -3.7678  0.3196           0
                                                  8       sklearn_fork       9    57.3406    -3.7678  0.3196           0
                                                  16      sklearn_fork       9    56.7188    -3.7678  0.3196           0
                                          sparse  1       sklearn_fork       9    98.1400    -3.7678  0.3196           0
                                                  2       sklearn_fork       9    78.0241    -3.7678  0.3196           0
                                                  4       sklearn_fork       9    65.7398    -3.7678  0.3196           0
                                                  8       sklearn_fork       9    83.0182    -3.7678  0.3196           0
                                                  16      sklearn_fork       9    85.0394    -3.7678  0.3196           0
    real_insurance_l2_poisson    1000000  dense   1       sklearn_fork       5     5.1556    -3.3377  0.1601           0
                                                  2       sklearn_fork       5     3.7674    -3.3377  0.1616           0
                                                  4       sklearn_fork       5     3.0280    -3.3377  0.1612           0
                                                  8       sklearn_fork       5     2.6941    -3.3377  0.1623           0
                                                  16      sklearn_fork       5     2.8711    -3.3377  0.1612           0
                                          sparse  1       sklearn_fork       5    13.8171    -3.3377  0.1618           0
                                                  2       sklearn_fork       5     8.8360    -3.3377  0.1619           0
                                                  4       sklearn_fork       5     6.4723    -3.3377  0.1608           0
                                                  8       sklearn_fork       5     5.5610    -3.3377    0.16           0
                                                  16      sklearn_fork       5    10.6910    -3.3377  0.1617           0
                                 18000000 dense   1       sklearn_fork       5    88.1266    -3.3665  0.1609           0
                                                  2       sklearn_fork       5    61.3444    -3.3665  0.1608           0
                                                  4       sklearn_fork       5    48.0059    -3.3665  0.1608           0
                                                  8       sklearn_fork       5    41.3488    -3.3665  0.1608           0
                                                  16      sklearn_fork       5    41.2265    -3.3665  0.1611           0
                                          sparse  1       sklearn_fork       5   265.7090    -3.3665  0.1608           0
                                                  2       sklearn_fork       5   173.5621    -3.3665  0.1607           0
                                                  4       sklearn_fork       5   124.4030    -3.3665  0.1607           0
                                                  8       sklearn_fork       5   106.1407    -3.3665  0.1606           0
                                                  16      sklearn_fork       5   192.5957    -3.3665   0.161           0
    real_insurance_net_poisson   1000000  dense   1       sklearn_fork      10     9.7671    -3.3532  0.1612           0
                                                  2       sklearn_fork      10     6.9492    -3.3532  0.1609           0
                                                  4       sklearn_fork      10     5.5663    -3.3532  0.1614           0
                                                  8       sklearn_fork      10     4.7798    -3.3532  0.1606           0
                                                  16      sklearn_fork      10     4.8909    -3.3532  0.1603           0
                                          sparse  1       sklearn_fork      10    26.5831    -3.3532  0.1635           0
                                                  2       sklearn_fork      10    16.5875    -3.3532  0.1629           0
                                                  4       sklearn_fork      10    11.3910    -3.3532   0.162           0
                                                  8       sklearn_fork      10     9.4460    -3.3532  0.1612           0
                                                  16      sklearn_fork      10    19.7150    -3.3532  0.1608           0
                                 18000000 dense   1       sklearn_fork      10   175.4268    -3.3576  0.1618           0
                                                  2       sklearn_fork      10   121.9683    -3.3576  0.1618           0
                                                  4       sklearn_fork      10    95.6292    -3.3576  0.1617           0
                                                  8       sklearn_fork      10    82.3669    -3.3576  0.1617           0
                                                  16      sklearn_fork      10    82.6210    -3.3576  0.1618           0
                                          sparse  1       sklearn_fork      10   511.6008    -3.3576  0.1618           0
                                                  2       sklearn_fork      10   324.6799    -3.3576  0.1616           0
                                                  4       sklearn_fork      10   227.5025    -3.3576  0.1615           0
                                                  8       sklearn_fork      10   190.4679    -3.3576  0.1616           0
                                                  16      sklearn_fork      10   359.1789    -3.3576  0.1617           0
    wide_insurance_l2_poisson    1000000  dense   1       sklearn_fork      10    49.0238    -2.0280  0.1422           0
                                                  2       sklearn_fork      10    30.6427    -2.0280  0.1422           0
                                                  4       sklearn_fork      10    21.8896    -2.0280  0.1422           0
                                                  8       sklearn_fork      10    17.2667    -2.0280  0.1422           0
                                                  16      sklearn_fork      10    17.4586    -2.0280  0.1422           0
                                          sparse  1       sklearn_fork      10    11.1084    -2.0280  0.1422           0
                                                  2       sklearn_fork      10     8.6160    -2.0280  0.1422           0
                                                  4       sklearn_fork      10     7.4407    -2.0280  0.1422           0
                                                  8       sklearn_fork      10     7.2905    -2.0280  0.1422           0
                                                  16      sklearn_fork      10     7.5111    -2.0280  0.1422           0
                                 18000000 dense   1       sklearn_fork      13  1171.3334    -2.1096  0.1403           0
                                                  2       sklearn_fork      29  1546.3203    -2.1096  0.1403           0
                                                  4       sklearn_fork      10   405.3979    -2.1096  0.1403           0
                                                  8       sklearn_fork      10   322.3633    -2.1096  0.1403           0
                                                  16      sklearn_fork      10   320.1045    -2.1096  0.1403           0
                                          sparse  1       sklearn_fork      10   241.5324    -2.1096  0.1403           0
                                                  2       sklearn_fork      20   352.8254    -2.1096  0.1403           0
                                                  4       sklearn_fork      20   307.4050    -2.1096  0.1403           0
                                                  8       sklearn_fork      16   248.4476    -2.1096  0.1403           0
                                                  16      sklearn_fork      16   218.3485    -2.1096  0.1403           0
    wide_insurance_net_poisson   1000000  dense   1       sklearn_fork      13    66.1757    -2.2057  0.1426           0
                                                  2       sklearn_fork      13    42.3533    -2.2057  0.1426           0
                                                  4       sklearn_fork      13    30.3914    -2.2057  0.1426           0
                                                  8       sklearn_fork      13    24.3199    -2.2057  0.1426           0
                                                  16      sklearn_fork      13    24.6798    -2.2057  0.1426           0
                                          sparse  1       sklearn_fork      13    16.0367    -2.2057  0.1426           0
                                                  2       sklearn_fork      13    12.3721    -2.2057  0.1426           0
                                                  4       sklearn_fork      13    10.9711    -2.2057  0.1426           0
                                                  8       sklearn_fork      13    10.9589    -2.2057  0.1426           0
                                                  16      sklearn_fork      13    11.0781    -2.2057  0.1426           0
                                 18000000 dense   1       sklearn_fork      15  1416.9511    -2.3355  0.1402           0
                                                  2       sklearn_fork      15   895.1590    -2.3355  0.1402           0
                                                  4       sklearn_fork      15   622.3026    -2.3355  0.1402           0
                                                  8       sklearn_fork      15   501.4581    -2.3355  0.1402           0
                                                  16      sklearn_fork      15   491.9906    -2.3355  0.1402           0
                                          sparse  1       sklearn_fork      15   378.4437    -2.3355  0.1402           0
                                                  2       sklearn_fork      15   297.5474    -2.3355  0.1402           0
                                                  4       sklearn_fork      15   261.0684    -2.3355  0.1402           0
                                                  8       sklearn_fork      15   256.5559    -2.3355  0.1402           0
                                                  16      sklearn_fork      15   225.0999    -2.3355  0.1402           0
    
    

    Results as CSV (zipped)

    this week's work 
    opened by jtilly 20
  • [Critical] Improve performance for sparse matrices

    Reported by @jtilly: "I'm having some difficulties getting good results using real data (on 1 million rows for now). The script and corresponding log file that I'm using in our infrastructure are here: https://gist.github.com/jtilly/d2ff9b7bd6c690a35db052d1730e0a06 I'm comparing the sklearn_fork vs glmnet_python. Performance doesn't look that great for the sklearn_fork implementations once we add L1 penalties or make things sparse. I'm also having difficulties aligning coefficients (and predictions) between the sklearn fork and glmnet. I'm not sure what the best way to debug the problem is. I also integrated the real data set into the glm_benchmarks package (see https://github.com/Quantco/glm_benchmarks/pull/47). If you get the chance, could you take a quick look at what I implemented to make sure I didn't screw things up anywhere? Also, any insights into why the glmnet and sklearn results don't align? I'm also not an expert user of the glm_benchmarks package (yet), so if you have any ideas how to pull debug information out of it, please let me know."

    this week's work 
    opened by ElizabethSantorellaQC 15
  • Fuse alpha and alphas parameters

    I was wondering if it'd make sense to skip this check if alpha_search is active, since we don't use alpha in that case (ref).

    Also, we could simplify isinstance(self.alpha, float) or isinstance(self.alpha, int) to isinstance(self.alpha, (float, int)). :)
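
    A minimal sketch of the suggested simplification. The helper names are hypothetical and only stand in for the check on `self.alpha`; the point is that `isinstance` accepts a tuple of types, so both forms give the same answer.

```python
def is_scalar_alpha_verbose(alpha):
    # Original form: two separate isinstance calls joined with `or`.
    return isinstance(alpha, float) or isinstance(alpha, int)


def is_scalar_alpha(alpha):
    # Simplified form: one isinstance call with a tuple of accepted types.
    return isinstance(alpha, (float, int))


candidates = [0.1, 1, None, [0.1, 0.2], "0.1", True]
results_verbose = [is_scalar_alpha_verbose(a) for a in candidates]
results_simple = [is_scalar_alpha(a) for a in candidates]
```

    Note that `bool` is a subclass of `int`, so both versions also accept `True` and `False`.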

    code quality 
    opened by lbittarello 12
  • Fuse alpha and alphas parameters

    Checklist

    • [x] Added a CHANGELOG.rst entry
    • alphas is now deprecated in GeneralizedLinearRegressor (not in the CV version).
    • Instead, alpha and alpha_search are used to automatically detect the intent.

    This fixes issue #335.

    opened by MarcAntoineSchmidtQC 11
  • Adding a script to produce a nice benchmark figure against h2o/glmnet

    This PR adds a tool that produces benchmark figures for all four datasets, both regularization types and five distributions.

    For example, this figure shows performance for a lasso penalty on the intermediate-insurance dataset: image

    Checklist

    • [x] I don't think a changelog entry is needed because this is not user visible.
    • [x] Fixed up the r-glmnet benchmark: enabled the binomial distribution (I think this was failing on a previous version of glmnet, but it works now!) and enabled sparse matrices, which improves performance for most problems.
    • [x] wrote docs/benchmarks/benchmark_figure.py which produces the figures and also stores the docs/benchmarks/benchmark_data.csv file with benchmark results.
    • [x] output the figures to docs/_static which is included in the repo. It might seem bad to include pdfs/pngs in the repo, but these are hard to regenerate and are going to be included in the documentation pages.
    opened by tbenthompson 10
  • [Major] Add a class for efficient operations on categorical features (sparse/dense split done)

    One-hot encoding categorical variables generates matrices where all nonzero elements are 1, and there is only one nonzero element per row. It is possible to store these matrices with much less memory than a general sparse matrix and to operate on them more efficiently. We could improve performance a lot by adding a class that represents our data as a partitioned matrix composed of several one-hot encoded matrices (and perhaps also a dense block).
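
    To make the idea concrete, here is a minimal sketch of a one-hot matrix stored as one integer code per row. The class name `OneHotMatrix` and its methods are hypothetical and only illustrate the storage and the two matrix-vector products; they are not the class that was eventually implemented.

```python
import numpy as np


class OneHotMatrix:
    """One-hot encoded matrix stored as a single category code per row."""

    def __init__(self, codes, n_categories):
        self.codes = np.asarray(codes, dtype=np.intp)
        self.n_categories = int(n_categories)
        self.shape = (self.codes.shape[0], self.n_categories)

    def matvec(self, v):
        # X @ v: each row selects exactly one entry of v.
        return np.asarray(v, dtype=float)[self.codes]

    def rmatvec(self, w):
        # X.T @ w: sum the entries of w within each category.
        return np.bincount(self.codes, weights=w, minlength=self.n_categories)

    def toarray(self):
        # Dense equivalent, used here only to check the results.
        dense = np.zeros(self.shape)
        dense[np.arange(self.shape[0]), self.codes] = 1.0
        return dense


X_cat = OneHotMatrix(codes=[0, 2, 1, 2], n_categories=3)
v = np.array([10.0, 20.0, 30.0])
w = np.array([1.0, 2.0, 3.0, 4.0])
```

    Storage is one integer per row instead of the data, indices and index pointer arrays of a general sparse format, and both products avoid any multiplication.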

    this week's work performance 
    opened by ElizabethSantorellaQC 10
  • Add support for linear inequality constraints

    Closes #342 (or maybe not, we will see). Closes #344 (temporary adjustments to benchmark code).

    Tasks for an MVP

    • [x] Add a new solver based on scipy.optimize.minimize(method='trust-constr') (see here under "Notes")
    • [x] Make this solver available in the GLM and ensure it produces correct results when optimising without constraints
      • [x] Via pytest cases (I test it parallel to lbfgs)
      • [x] Via benchmark problems (manually for some selected cases at this point)
    • [x] Add parameters to pass linear inequality constraints to the GLM API (A_ineq θ <= b_ineq)
    • [x] Support fitting with an intercept (i.e., extend A_ineq and b_ineq)
    • [x] Formulate dedicated test cases for the new solver & new type of constraints
      • [x] Test equality of bounds and analogous inequality constraints
      • [x] ~General tests for inequality constraints (due to a lack of benchmark software, construct simple test case with a clear optimal solution)~ I suggest skipping this part, as the underlying algorithm does not differentiate between "real" inequality constraints and "quasi" bound constraints, as long as we pass them in the form A theta <= b. We already test the latter case, and do so against a trustworthy benchmark.
    • [ ] Analyze convergence behavior
      • [x] Poisson family, narrow_insurance_dataset
      • [ ] (possibly other combinations)

    Tasks for a productive version

    • [x] Support inequality constraints in the CV GLM
    • [x] Extend Docstrings for new functionality, incl. warning about runtime
    • [x] Various safety checks when inequality constraints are present
      • [x] Only allow either bounds or inequality constraints
      • [x] ~Handle case of initially infeasible starting point under inequality constraints~ (I verified this is not an issue for trust-constr)
      • [x] Analogy to check_bounds
    • [x] refactor and share _get_obj_and_derivative between the gradient descent solvers
    • [x] Add a CHANGELOG.rst entry
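
    The constraint form A_ineq θ <= b_ineq from the task list can also be tried directly with SciPy, independently of the GLM class. The sketch below is an assumption-laden illustration, not the solver added in this PR: it fits a small Poisson regression on synthetic data (not the insurance dataset) with `scipy.optimize.minimize(method="trust-constr")` and a single `LinearConstraint` row; all variable names are made up for the example.

```python
import numpy as np
from scipy.optimize import LinearConstraint, minimize

# Synthetic Poisson data (illustrative only).
rng = np.random.default_rng(0)
n_rows, n_cols = 500, 3
X = rng.normal(size=(n_rows, n_cols))
theta_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ theta_true))


def objective(theta):
    """Mean negative Poisson log-likelihood (log link), constants in y dropped."""
    eta = X @ theta
    return np.mean(np.exp(eta) - y * eta)


def gradient(theta):
    eta = X @ theta
    return X.T @ (np.exp(eta) - y) / n_rows


def hessian(theta):
    mu = np.exp(X @ theta)
    return (X * mu[:, None]).T @ X / n_rows


# One inequality row: theta[0] - theta[1] <= 0, i.e. theta[0] <= theta[1].
A_ineq = np.array([[1.0, -1.0, 0.0]])
b_ineq = np.array([0.0])
constraint = LinearConstraint(A_ineq, -np.inf, b_ineq)

x0 = np.zeros(n_cols)
unconstrained = minimize(
    objective, x0, jac=gradient, hess=hessian, method="trust-constr"
)
constrained = minimize(
    objective,
    x0,
    jac=gradient,
    hess=hessian,
    method="trust-constr",
    constraints=[constraint],
)
```

    Because the unconstrained estimate has theta[0] > theta[1], the constraint binds and the constrained fit ends up with the two coefficients approximately equal, at a slightly worse objective value.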

    Example

    import numpy as np
    import pandas as pd
    import plotnine as pn
    
    from quantcore.glm_benchmarks.problems import (
        load_data,
        generate_narrow_insurance_dataset,
    )
    from quantcore.glm import GeneralizedLinearRegressor
    
    # Load parts of the French Motor Insurance dataset
    dat = load_data(generate_narrow_insurance_dataset)
    dat["X"] = dat["X"].loc[:, lambda x: x.columns.str.startswith("DrivAge")]
    X, y = dat["X"], dat["y"]
    
    kwargs_shared = {
        "family": "poisson",
        "l1_ratio": 0,
        "alpha": 0,
        "fit_intercept": False,
    }
    
    # Define constraints (manual for now, convenience function tbd)
    A_ineq = np.zeros(shape=(2, X.shape[1]))
    b_ineq = np.zeros(shape=(2))
    
    # Bound constraint on DrivAge_0 <= -0.80
    A_ineq[0, X.columns == "DrivAge_0"] = 1
    b_ineq[0] = -0.80
    
    # Inequality constraint to ensure DrivAge_5 <= DrivAge_6
    A_ineq[1, X.columns == "DrivAge_5"] = 1
    A_ineq[1, X.columns == "DrivAge_6"] = -1
    
    # Fit models and plot coefficients
    
    mdls = {
        "auto": GeneralizedLinearRegressor(**kwargs_shared),
        "lbfgs": GeneralizedLinearRegressor(solver="lbfgs", **kwargs_shared),
        "trust-(un)constr": GeneralizedLinearRegressor(
            solver="trust-constr",
            **kwargs_shared,
        ),
        "trust-constr": GeneralizedLinearRegressor(
            solver="trust-constr",
            A_ineq=A_ineq,
            b_ineq=b_ineq,
            **kwargs_shared,
        ),
    }
    
    coefs = []
    for name, mdl in mdls.items():
        mdl.fit(X=X, y=y)
        coefs.append(pd.DataFrame(dict(name=X.columns, coef=mdl.coef_)).assign(model=name))
        print(
            f"model {name}: mean(y): {np.mean(y):.6f}, "
            f"mean(pred): {np.mean(mdl.predict(X=X)):.6f}"
        )
    
    df_coefs = pd.concat(coefs)
    
    (
        pn.ggplot(df_coefs, pn.aes(x="name", y="coef", color="model"))
        + pn.geom_point(position=pn.positions.position_jitter(width=0.15))
        + pn.geom_line(pn.aes(group="model"))
        + pn.theme_minimal()
        + pn.labs(
            x="factor",
            y="estimate",
            color="solver",
            title="Note: jitter only added horizontally",
        )
    )
    
    opened by PhilippRuchser 9
  • [Minor] In the objective function, drop terms that are not dependent on the parameters.

    [EDIT] See the conversation below.

    Currently, in _eta_mu_deviance, we compute the deviance, then multiply it by 0.5 and add the L1 and L2 penalty terms to obtain an objective function value. Strictly speaking, this is not the penalized negative log-likelihood, but it differs from it only by a constant that depends on y. Computing the deviance is more complicated for most distribution/link function pairs than computing the log-likelihood. For example, for Poisson with a log link, the per-observation LL (dropping terms that depend only on y) is:

    y[i] * eta[i] - mu[i]
    

    whereas the deviance as currently implemented is:

            if y[i] == 0:
                unit_deviance = 2 * (-y[i] + mu_out[i])
            else:
                unit_deviance = 2 * ((y[i] * (log(y[i]) - eta_out[i] - 1)) + mu_out[i])
    

    Since we don't actually need a deviance, we should compute the log-likelihood.
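
    A quick numerical check of the claim, as a sketch with made-up function names: for the Poisson family with a log link, half the unit deviance plus the log-likelihood kernel equals y*log(y) - y, which does not depend on eta. Minimizing either quantity therefore gives the same parameters.

```python
import numpy as np


def poisson_ll_kernel(y, eta):
    # Log-likelihood per observation without the log(y!) term.
    return y * eta - np.exp(eta)


def poisson_half_deviance(y, eta):
    # Half of the unit deviance, written so that y == 0 needs no special branch.
    mu = np.exp(eta)
    safe_y = np.where(y > 0, y, 1.0)
    y_log_y = np.where(y > 0, y * np.log(safe_y), 0.0)
    return y_log_y - y * eta - y + mu


y = np.array([0.0, 1.0, 3.0, 7.0])
eta_a = np.array([-1.0, 0.2, 0.9, 2.0])
eta_b = np.array([0.5, -0.4, 1.5, 1.0])

# The sum is the same for any eta, so it is a constant in the parameters.
gap_a = poisson_half_deviance(y, eta_a) + poisson_ll_kernel(y, eta_a)
gap_b = poisson_half_deviance(y, eta_b) + poisson_ll_kernel(y, eta_b)
```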

    opened by tbenthompson 9
  • Different libraries may deal with constants in the log-likelihood differently

    Log-likelihoods sometimes have some ugly constants, like pi for the normal distribution or a factorial for Poisson. Libraries may reasonably choose to omit these constants. This presents a problem if they make different decisions about how to treat such constants, since omitting a constant will change the strength of the regularization, leading to different optimal solutions.
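
    To make the concern concrete, it helps to separate two cases. A purely additive constant in the objective leaves the minimizer unchanged, but a multiplicative factor on the loss (for example a 1/n or 1/σ² scaling) changes the balance between loss and penalty. The sketch below shows this for ridge regression with a closed-form solution; `ridge_solution` and `loss_scale` are hypothetical names, and the data are synthetic.

```python
import numpy as np


def ridge_solution(X, y, alpha, loss_scale=1.0):
    """Minimize loss_scale * 0.5 * ||y - X b||^2 + 0.5 * alpha * ||b||^2."""
    n_cols = X.shape[1]
    lhs = loss_scale * (X.T @ X) + alpha * np.eye(n_cols)
    rhs = loss_scale * (X.T @ y)
    return np.linalg.solve(lhs, rhs)


rng = np.random.default_rng(0)
n_rows = 50
X = rng.normal(size=(n_rows, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n_rows)

# Library A averages the loss over rows, library B sums it.
coef_mean_loss = ridge_solution(X, y, alpha=2.0, loss_scale=1.0 / n_rows)
coef_sum_loss = ridge_solution(X, y, alpha=2.0, loss_scale=1.0)

# Rescaling alpha by the same factor makes the two conventions agree.
coef_sum_loss_rescaled = ridge_solution(X, y, alpha=2.0 * n_rows, loss_scale=1.0)
```

    With the same nominal alpha the two conventions give different coefficients, which is why the penalty strength has to be corrected for each library's scaling before results can be compared.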

    Two possible approaches:

    • Suggested by @tbenthompson: Fit cross-validated models along a regularization path, and report the one with the lowest cross-validated error. This will be slow, but sidesteps the issue.
    • Figure out how the libraries are dealing with constants and correct for it.
    question 
    opened by ElizabethSantorellaQC 9
  • New feature: add information criteria for model diagnostics

    Checklist:

    • [x] Added a CHANGELOG.rst entry
    • [x] Decision on what to do with ridge/elastic net regularisation. See here for more details.
    • [x] Decision on what to do with CV implementation. Decision: information criteria are primarily a metric for estimating out of sample generalisation from in sample measures. If cross validation is applicable to the setting (i.e., there is enough data) it does not seem to make sense to use these metrics so we have not supplied them on the GeneralizedLinearRegressorCV class.

    Closes #516.

    Summary: This change implements the calculations of the aic, aicc and bic information criteria for the trained model. I placed these criteria as properties/attributes on the glum.GeneralizedLinearRegressor class.

    Notes:

    • These information criteria require an "effective number of parameters" of the model. In the case of an unregularized model, this is simply the count of features/parameters. In the case of a lasso model, this is the count of the non-zero parameters on the ML fit. In the case of a ridge/elastic net model, I could not find a consensus on how this should be implemented so I opted to add a warning here. Does anyone have a better suggestion or is there something that I have missed? Decision: I have answered this in more detail here but I will argue that it does not make sense to compute these values for models that use L2 regularisation. I think keeping the warning there when L2 regularisation is used is sufficient to highlight this issue.
    • The method to compute these scores is always called from the fit method of glum.GeneralizedLinearRegressor, because we require the X and y data to compute these values. This allows for a simple interface where the score is an attribute of the trained regressor, e.g. regressor.aic, regressor.aicc or regressor.bic. Alternatively, this could be done as a separate method call that accepts X and y as arguments: regressor.aic(X, y), regressor.aicc(X, y) or regressor.bic(X, y). I opted for the first choice as (1) I think this is neater and (2) it makes sense to compute these values at train time; but I am happy to change to the alternative. Decision: We have decided not to compute these statistics at train time due to the unnecessary computational overhead on the fit method. Rather, passing the dataset at call-time seems preferable.
    • These information criteria are only available when the noise model is one of BinomialDistribution, GammaDistribution, NormalDistribution, PoissonDistribution. See line 1640. This is because we require the definition of the likelihood function. Maybe I missed them but are there other families that we should include here? Decision: include the TweedieDistribution that generalises the above.
    • Calling aic, aicc or bic on the model before it has been trained returns None. We could rather log a warning or throw an exception. I am not sure if there is a standard/preferred approach here?
      Decision: raise error.
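For reference, the three criteria discussed above can be sketched from their textbook definitions. This is only an illustration, not the code in this PR: the function name `information_criteria` and its arguments are hypothetical, and the caller is assumed to supply the maximised log-likelihood and the effective number of parameters (for example, the count of non-zero coefficients for a lasso fit).

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return (aic, aicc, bic) from the textbook definitions.

    log_likelihood: maximised log-likelihood on the training data
    k: effective number of parameters
    n: number of training observations
    """
    if n - k - 1 <= 0:
        raise ValueError("aicc needs more observations than parameters + 1")
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction added on top of the AIC.
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return aic, aicc, bic


aic, aicc, bic = information_criteria(log_likelihood=-120.0, k=3, n=50)
print(aic, aicc, bic)
```

Passing the log-likelihood explicitly mirrors the decision above to compute the statistics at call time rather than inside fit.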
    opened by NicholasHoernleQC 8
  • Illegal instruction / segfault in `glm.fit()` from the Getting Started example

    Hi!

    The example from the Getting Started page generates an "Illegal instruction" segfault with Python 3.9 on Linux:

    $ python 3.9
    Python 3.9.0 (default, Oct  6 2020, 11:01:41)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    >>> import sklearn
    >>> from sklearn.datasets import fetch_openml
    >>> from glum import GeneralizedLinearRegressor, GeneralizedLinearRegressorCV
    >>> house_data = fetch_openml(name="house_sales", version=3, as_frame=True)
    >>> X = house_data.data[
    ...     [
    ...         "bedrooms",
    ...         "bathrooms",
    ...         "sqft_living",
    ...         "floors",
    ...         "waterfront",
    ...         "view",
    ...         "condition",
    ...         "grade",
    ...         "yr_built",
    ...     ]
    ... ].copy()
    >>> y = house_data.target
    >>> X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    ...     X, y, test_size = 0.3, random_state=5
    ... )
    >>> glm = GeneralizedLinearRegressor(family="normal", alpha=0.1, l1_ratio=1)
    >>> glm.fit(X_train, y_train)
    Illegal instruction (core dumped)
    

    The error is here:

    Program terminated with signal 4, Illegal instruction.
    #0  0x00007f8d4dd2e973 in dense_baseTrue<double> ([email protected]=0x7f8d4ce00ac0, L=0x7f8d4d809340, [email protected]=0x3f031d0, [email protected]=9, imin2=0, imax2=9, jmin2=0,
        jmax2=9, kmin=0, kmax=512, innerblock=128, kstep=512, d=<optimized out>) at src/tabmat/ext/dense_helpers.cpp:73
    

    This is from version 2.1.0 installed with pip, using the PyPI wheels:

    Collecting glum
      Downloading glum-2.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 44.8 MB/s eta 0:00:00
    Collecting pandas
      Downloading pandas-1.4.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.7/11.7 MB 92.4 MB/s eta 0:00:00
    Collecting joblib
      Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.0/307.0 kB 106.2 MB/s eta 0:00:00
    Collecting scipy
      Downloading scipy-1.8.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.2/42.2 MB 141.2 MB/s eta 0:00:00
    Collecting tabmat>=3.0.1
      Downloading tabmat-3.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.8/5.8 MB 75.3 MB/s eta 0:00:00
    Collecting numpy
      Downloading numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.1/17.1 MB 94.6 MB/s eta 0:00:00
    Collecting numexpr
      Downloading numexpr-2.8.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (380 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 380.5/380.5 kB 86.3 MB/s eta 0:00:00
    Collecting scikit-learn>=0.23
      Downloading scikit_learn-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.8 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30.8/30.8 MB 145.3 MB/s eta 0:00:00
    Collecting threadpoolctl>=2.0.0
      Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
    Collecting packaging
      Downloading packaging-21.3-py3-none-any.whl (40 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 kB 86.0 MB/s eta 0:00:00
    Collecting pytz>=2020.1
      Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 503.5/503.5 kB 87.0 MB/s eta 0:00:00
    Collecting python-dateutil>=2.8.1
      Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 247.7/247.7 kB 77.3 MB/s eta 0:00:00
    Requirement already satisfied: six>=1.5 in /share/software/user/open/python/3.9.0/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas->glum) (1.16.0)
    Collecting pyparsing!=3.0.5,>=2.0.2
      Downloading pyparsing-3.0.9-py3-none-any.whl (98 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.3/98.3 kB 72.3 MB/s eta 0:00:00
    Installing collected packages: pytz, threadpoolctl, python-dateutil, pyparsing, numpy, joblib, scipy, pandas, packaging, tabmat, scikit-learn, numexpr, glum
    Successfully installed glum-2.1.0 joblib-1.1.0 numexpr-2.8.3 numpy-1.23.0 packaging-21.3 pandas-1.4.3 pyparsing-3.0.9 python-dateutil-2.8.2 pytz-2022.1 scikit-learn-1.1.1 scipy-1.8.1 tabmat-3.1.0 threadpoolctl-3.1.0
    
    opened by kcgthb 7
  • Docs for `P1` are a bit unclear

    In the API reference for glum.GeneralizedLinearRegressor: https://glum.readthedocs.io/en/latest/glm.html#glum.GeneralizedLinearRegressor

    It says about P2:

    With this option, you can set the P2 matrix in the L2 penalty w * P2 * w. This gives a fine control over this penalty (Tikhonov regularization). A 2d array is directly used as the square matrix P2.

    But for P1:

    With this array, you can exclude coefficients from the L1 penalty. Set the corresponding value to 1 (include) or 0 (exclude).

    The latter one gives the impression that P1 is only for inclusion/exclusion instead of also being usable as a per-feature multiplier.
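A small sketch of the distinction, assuming the L1 term has the form alpha * l1_ratio * sum_j P1[j] * |w[j]| (an assumption based on the description above, not a quote of glum's code). The helper `weighted_l1_penalty` is hypothetical and only evaluates the penalty; it does not fit a model.

```python
def weighted_l1_penalty(coef, P1, alpha=1.0, l1_ratio=1.0):
    """Evaluate alpha * l1_ratio * sum_j P1[j] * |coef[j]| (assumed form)."""
    if len(coef) != len(P1):
        raise ValueError("coef and P1 must have the same length")
    return alpha * l1_ratio * sum(p * abs(w) for p, w in zip(P1, coef))


coef = [0.5, -2.0, 1.5]

# 0/1 values: the second coefficient is excluded from the penalty.
include_exclude = [1.0, 0.0, 1.0]
# Arbitrary non-negative values act as per-feature multipliers.
multipliers = [1.0, 0.5, 3.0]

print(weighted_l1_penalty(coef, include_exclude))  # 2.0
print(weighted_l1_penalty(coef, multipliers))      # 6.0
```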

    opened by david-cortes 1
  • Interactions

    Fantastic project.

    I would love to see the possibility to add interactions on the fly, just like H2O. There, you can provide a list of interaction pairs or, alternatively, a list of columns with pairwise interactions.

    This would be especially useful as scikit-learn preprocessing does not allow creating dummy encodings for categorical X and then calculating their product with another feature. (At least not with neat code.)

    opened by mayer79 0
  • Implement distributional anchor regression in glum

    I am interested in domain generalization (DG, also "external validity") of statistical / machine learning models. Anchor regression [1] is a recent idea interpolating between OLS and IV. [3] gives ideas to generalize anchor regression to more general distributions (including classification). [2] is a "nice-to-read" summary, including ideas on how to extend to non-linear settings.

    To my knowledge, no efficient implementations for anchor regression or classification exist. I'd be interested in contributing this to my favorite GLM library but would need some guidance.

    What is Anchor Regression?

    Anchor regression improves the DG / external validity of OLS by adding a regularization term penalizing the correlation between a so-called anchor variable and the regression's residuals. The anchor variable is assumed to be exogenous to the system, i.e., not directly causally affected by covariates, the outcome, or relevant hidden variables. See the following causal graph:

    graph LR
    A --> U & X & Y
    U --> X & Y
    X --> Y
    

    What is an anchor? Say we are interested in predicting health outcomes in the ICU. Possibly valid anchor variables would be hospital id (one-hot encoded) or some transformation of time of year. The choice of anchor depends on the application. If we would like to predict out of time but on the same hospitals as seen in training, using time of year as anchor suffices, and the hospital id should be included in the covariates (X). If, however, we would like to generalize across hospitals (i.e., predict on unseen hospitals), we need to include hospital id as an anchor (and exclude it from the covariates). A similar example would be insurance with geographical location and time of year.

    Write $P_A$ for the $\ell_2$-projection onto the column-space of $A$ (i.e., $P_A(\cdot) = \mathbb{E}[\cdot \mid A]$) and let $\gamma>0$. In a regression setting, the anchor regression solution is given by:

    $$ b^\gamma = \underset{b}{\arg\min} \ \mathbb{E}_\textrm{train}[((\mathrm{Id} - P_A)(Y - X^T b))^2] + \gamma \, \mathbb{E}_\textrm{train}[(P_A(Y - X^T b))^2]. $$

    Given samples from $P_\mathrm{train}$, and writing $\Pi_A$ for the projection onto the column space of $A$, this can be estimated as

    $$ \hat b^\gamma = \underset{b}{\arg\min} \ \|(\mathrm{Id} - \Pi_A)(Y - X^T b)\|_2^2 + \gamma \|\Pi_A (Y - X^T b)\|_2^2. $$

    [1] show that the anchor regression solution protects against the worst-case risk with respect to distribution shifts induced through the anchor variable. Here $\gamma$ controls the size of the set of distributions the method protects against, which is generated by $\sqrt{\gamma}$-times the shifts as seen in the training data [1, Theorem 1].

    In an instrumental variable (IV) setting (no direct causal effect $A \to U$, $A \to Y$, "sufficient" effect $A \to X$), anchor regression interpolates between OLS and IV regression, with $\hat b^\gamma$ converging to the IV solution for $\gamma \to \infty$. This is because the IV solution can be written as

    $$ \hat b^\textrm{IV} = \underset{b \colon \mathrm{Cor}(A, X^T b - Y)=0}{\arg\min} \|Y - X^T b\|_2^2. $$

    In low-dimensional settings, the estimate $\hat b^\gamma$ can be computed by ordinary least squares on the transformed data

    $$ \tilde X := (\mathrm{Id} - \Pi_A) X + \sqrt{\gamma} \Pi_A X \ \ \textrm{ and }\ \ \tilde Y := (\mathrm{Id} - \Pi_A)Y + \sqrt{\gamma} \Pi_A Y, $$

    where $\Pi_A = A (A^T A)^{-1} A^T$ (this need not be computed explicitly, though).
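The transformation can be sketched in a few lines of numpy. This is a minimal illustration on simulated data, not an existing glum feature: the names `anchor_regression` and `anchor_objective` are hypothetical, samples are stored as rows of `X`, and the projection is applied through a least-squares solve so that the matrix $\Pi_A$ is never formed.

```python
import numpy as np


def proj(A, v):
    """Project v (vector or matrix) onto the column space of A."""
    return A @ np.linalg.lstsq(A, v, rcond=None)[0]


def anchor_regression(X, y, A, gamma):
    """Ordinary least squares on the transformed data (X_tilde, y_tilde)."""
    s = np.sqrt(gamma)
    X_tilde = X - proj(A, X) + s * proj(A, X)
    y_tilde = y - proj(A, y) + s * proj(A, y)
    return np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]


def anchor_objective(b, X, y, A, gamma):
    """Sample version of the anchor regression objective."""
    res = y - X @ b
    res_proj = proj(A, res)
    return np.sum((res - res_proj) ** 2) + gamma * np.sum(res_proj ** 2)


# Simulated data: the anchor A shifts both the covariates and the outcome.
rng = np.random.default_rng(0)
n, p, q = 200, 4, 2
A = rng.normal(size=(n, q))
X = A @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))
y = X @ np.ones(p) + A @ np.ones(q) + rng.normal(size=n)

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_anchor = anchor_regression(X, y, A, gamma=5.0)
print(b_ols, b_anchor)
```

With gamma = 1 the transformation is the identity, so the result coincides with OLS; larger gamma puts more weight on the part of the residuals explained by the anchor.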

    What is Distributional Anchor Regression?

    [3] present ideas on how to generalize anchor regression from OLS to GLMs. In particular, if $f$ are raw scores, they propose to use residuals

    $$ r = \frac{d}{d f} \ell(f, y). $$

    For $f = X^T \beta$ and $\ell(f, y) = \frac{1}{2}(y - f)^2$ this reduces to anchor regression. For logistic regression, with $Y \in \{-1, 1\}$ and

    $$ \ell(f, y) = - \sum_i \log(1 + \exp(-y_i f_i)), $$

    this yields residuals

    $$ r_i = \frac{d}{d f_i} \ell(f, y) = y_i (1 + \exp(y_i f_i))^{-1} = \tilde y_i - p_i, $$

    where $\tilde y_i = \frac{y_i}{2} + 0.5 \in \{0, 1\}$ and $p_i = (1 + \exp(-f_i))^{-1}$.

    Define $\ell^\gamma(f, y) := \ell(f, y) + (\gamma - 1) \| \Pi_A r \|_2^2$. The gradient of the anchor loss is given as

    $$ \frac{d}{d f_i} \ell^\gamma(f, y) = y_i (1 + \exp(y_i f_i))^{-1} - 2 (\gamma - 1) (\Pi_A r)_i p_i (1 - p_i). $$

    The Hessian is (not pretty)

    $$ \frac{d^2}{d f_i \, d f_j} \ell^\gamma(f, y) = -\mathbb{1}_{\{i = j\}} p_i (1 - p_i) \left(1 + 2(\gamma - 1) (1 - 2p_i) (\Pi_A r)_i \right) + 2 (\gamma - 1) p_i (1 - p_i) p_j (1 - p_j) (\Pi_A)_{i, j} $$

    If $f = X^T \beta$, then (here, $\cdot$ is matrix multiplication and products of vectors are elementwise)

    $$ \frac{d}{d \beta} \ell^\gamma(X^T\beta, y) = \left( y (1 + \exp(y f))^{-1} - 2(\gamma - 1) (\Pi_A r) \, p (1 - p) \right) \cdot X $$

    and

    $$ \frac{d^2}{d \beta^2} \ell^\gamma(X^T\beta, y) = - X^T \cdot \mathrm{diag}\left(p (1 - p) (1 + 2(\gamma - 1)(1 - 2p) \Pi_A r)\right) \cdot X + 2(\gamma - 1) X^T \cdot \mathrm{diag}(p (1-p)) \cdot \Pi_A \cdot \mathrm{diag}(p (1-p)) \cdot X $$

    Computational considerations

    Here is some numpy code calculating and testing the above derivatives:

    import numpy as np
    import pytest
    from scipy.optimize import approx_fprime
    
    
    def predictions(f):
        return 1 / (1 + np.exp(-f))
    
    
    def proj(A, f):
        return np.dot(A, np.linalg.lstsq(A, f, rcond=None)[0])
    
    
    def proj_matrix(A):
        return np.dot(np.dot(A, np.linalg.inv(A.T @ A)), A.T)
    
    
    def loss(X, beta, y, A, gamma):
        f = X @ beta
        r = (y / 2 + 0.5) - predictions(f)
        return -np.sum(np.log1p(np.exp(-y * f))) + (gamma - 1) * np.sum(proj(A, r) ** 2)
    
    
    def grad(X, beta, y, A, gamma):
        f = X @ beta
        p = predictions(f)
        r = (y / 2 + 0.5) - p
    
        return (r - 2 * (gamma - 1) * proj(A, r) * p * (1 - p)) @ X
    
    
    def hess(X, beta, y, A, gamma):
        f = X @ beta
        p = predictions(f)
        r = (y / 2 + 0.5) - p
        w = p * (1 - p)  # derivative of the predictions with respect to f
        # Diagonal part of the Hessian with respect to f.
        diag = -np.diag(w * (1 + 2 * (gamma - 1) * (1 - 2 * p) * proj(A, r)))
        # Dense part: entry (i, j) is w_i * w_j * (Pi_A)_{i, j}.
        dense = proj_matrix(A) * w[np.newaxis, :] * w[:, np.newaxis]
    
        return X.T @ (diag + 2 * (gamma - 1) * dense) @ X
    
    
    @pytest.mark.parametrize("gamma", [0, 0.1, 0.8, 1, 5])
    def test_grad_hess(gamma):
        rng = np.random.default_rng(0)
        n = 100
        p = 10
        q = 3
    
        X = rng.normal(size=(n, p))
        beta = rng.normal(size=p)
    
        y = 2 * rng.binomial(1, 0.5, n) - 1
    
        A = rng.normal(size=(n, q))
    
        approx_grad = approx_fprime(beta, lambda b: loss(X, b, y, A, gamma))
        np.testing.assert_allclose(approx_grad, grad(X, beta, y, A, gamma), 1e-5)
    
        approx_hess = approx_fprime(beta, lambda b: grad(X, b, y, A, gamma), 1e-7)
        np.testing.assert_allclose(approx_hess, hess(X, beta, y, A, gamma), 1e-5)
    

    I understand that glum implements different solvers. As $\ell_1$-regularization is popular in the robustness community, the irls solver is most interesting.

    To my understanding, the computation of the full projection matrix above can be skipped using a QR decomposition of $A$. However, in your implementation, you never actually compute the Hessian, but rather an approximation. And your implementation appears to depend heavily on the Hessian being of the form $X^T D X$ for some diagonal $D$, which is no longer the case here.

    Summary

    Anchor regression interpolates between OLS and IV regression to improve the models' robustness to distribution shifts. Distributional anchor regression is a generalization to GLMs. To my knowledge, no efficient solver for distributional anchor regression exists.

    Is this something you would be interested to integrate into glum? How complex would this be? Are there any hurdles (e.g., dense Hessian) that prohibit the use of existing methods?

    References

    [1] Rothenhäusler, D., N. Meinshausen, P. Bühlmann, and J. Peters (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B (Statistical Methodology) 83(2), 215–246.

    [2] Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science 35(3), 404–426.

    [3] Kook, L., B. Sick, and P. Bühlmann (2022). Distributional anchor regression. Statistics and Computing 32(3), 1–19.

    opened by mlondschien 0
  • Support for quasibinomial, quasipoisson, negative binomial, multinomial & Dirichlet multinomial families?

    Are there any plans to also support additional GLM families like quasibinomial, quasipoisson, negative binomial, multinomial, Dirichlet-multinomial (overdispersed multinomial) & ordinal GLMs, some of which are now supported by e.g. h2o? (Multinomial & Dirichlet-multinomial can, I believe, be recast as a Poisson or quasipoisson GLM via the Poisson trick, but that is not computationally efficient.)

    opened by tomwenseleers 1
  • Request for a force_finite flag for score function

    The r2_score method in sklearn has a force_finite flag which defaults to True in order to avoid infinite and NaN values when the TSS happens to be 0. The analogous quantity when computing D^2 is the null deviance, which can also sometimes be 0. It would be great if, in glum, there was also a force_finite flag that can gracefully handle the case where the null deviance happens to be 0. Right now, I get a ZeroDivisionError in glum 2.1.2 running in Python 3.6.
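A rough sketch of the requested behaviour, following the convention documented for sklearn's r2_score (a perfect prediction on constant data gives 1.0, any other prediction gives 0.0). The function `d2_score` below is hypothetical and takes the two deviances as plain numbers; it is not part of glum.

```python
import math


def d2_score(deviance, null_deviance, force_finite=True):
    """D^2 = 1 - deviance / null_deviance, guarding against a zero null deviance."""
    if null_deviance != 0:
        return 1.0 - deviance / null_deviance
    if deviance == 0:
        # 0 / 0: the model is perfect on data the null model also fits perfectly.
        return 1.0 if force_finite else math.nan
    # Non-zero deviance divided by zero null deviance.
    return 0.0 if force_finite else -math.inf


print(d2_score(2.0, 8.0))                      # 0.75
print(d2_score(0.0, 0.0))                      # 1.0
print(d2_score(3.0, 0.0))                      # 0.0
print(d2_score(3.0, 0.0, force_finite=False))  # -inf
```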

    new feature 
    opened by thobanster 0
Releases(2.2.1)
  • 2.2.1(Nov 25, 2022)

  • 2.2.0(Nov 25, 2022)

    2.2.0 - 2022-11-25

    New features:

    • Add an argument to GeneralizedLinearRegressorBase to drop the first category in a Categorical column using implementation in tabmat
    • One may now request the Tweedie loss by setting the 'family' parameter of GeneralizedLinearRegressor and GeneralizedLinearRegressorCV to 'tweedie'.

    Bug fixes:

    • Setting bounds for constant columns was not working (bounds were internally modified to 0). A similar issue was preventing inequalities from working with constant columns. This is now fixed.

    Other changes:

    • No more builds for 32-bit systems with Python >= 3.8. This is due to scipy not supporting it anymore.
  • 2.1.2(Jul 1, 2022)

  • 2.1.1(Jul 1, 2022)

  • 2.1.0(Jun 27, 2022)

    2.1.0 - 2022-06-27

    New features:

    • Added aic, aicc and bic attributes to GeneralizedLinearRegressor. These attributes provide the information criteria based on the training data and the effective degrees of freedom of the maximum likelihood estimate for the model's parameters.
    • GeneralizedLinearRegressor.std_errors and GeneralizedLinearRegressor.covariance_matrix now accept data frames with categorical data.

    Bug fixes:

    • The score method of GeneralizedLinearRegressor and GeneralizedLinearRegressorCV now accepts offsets.
    • Fixed the calculation of the information matrix for the Binomial distribution with logit link, which affected non-robust standard errors.

    Other:

    • The CI now runs daily unit tests against the nightly builds of numpy, pandas and scikit-learn.
    • The minimally required version of tabmat is now 3.1.0.
  • 2.0.3(Nov 5, 2021)

    2.0.3 - 2021-11-05

    Other:

    • We are now specifying the run time dependencies in setup.py, so that missing dependencies are automatically installed from PyPI when installing glum via pip.
  • 2.0.2(Nov 3, 2021)

    Bug fix:

    • Fixed the sign of the log likelihood of the Gaussian distribution (not used for fitting coefficients).
    • Fixed the wide benchmarks which had duplicated columns (categorical and numerical).

    Other:

    • The CI now builds the wheels and uploads them to PyPI with every new release.
    • Renamed functions checking for qc.matrix compliance to refer to tabmat.
  • 2.0.1(Oct 11, 2021)

  • 2.0.0(Oct 8, 2021)

    Breaking changes:

    • Renamed the package to glum!!! Hurray! Celebration.
    • GeneralizedLinearRegressor and GeneralizedLinearRegressorCV lose the fit_dispersion parameter. Please use the dispersion method of the appropriate family instance instead.
    • All functions now use sample_weight as a keyword instead of weights, in line with scikit-learn.
    • All functions now use dispersion as a keyword instead of phi.
    • Several methods of GeneralizedLinearRegressor and GeneralizedLinearRegressorCV that should have been private have had an underscore prefixed on their names: _tear_down_from_fit, _set_up_for_fit, _set_up_and_check_fit_args, _get_start_coef, _solve and _solve_regularization_path.
    • glum.GeneralizedLinearRegressor.report_diagnostics and glum.GeneralizedLinearRegressor.get_formatted_diagnostics are now public.

    New features:

    • P1 and P2 now accept 1d arrays with the same number of elements as the unexpanded design matrix. In this case, the penalty associated with a categorical feature will be expanded to as many elements as there are levels, all with the same value.
    • ExponentialDispersionModel gains a dispersion method.
    • BinomialDistribution and TweedieDistribution gain a log_likelihood method.
    • The fit method of GeneralizedLinearRegressor and GeneralizedLinearRegressorCV now saves the column types of pandas data frames.
    • GeneralizedLinearRegressor and GeneralizedLinearRegressorCV gain two properties: family_instance and link_instance.
    • GeneralizedLinearRegressor.std_errors and GeneralizedLinearRegressor.covariance_matrix have been added and support non-robust, robust (HC-1), and clustered covariance matrices.
    • GeneralizedLinearRegressor and GeneralizedLinearRegressorCV now accept family='gaussian' as an alternative to family='normal'.
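To illustrate the P1/P2 item above: a 1d penalty given per unexpanded column is repeated once per level of each categorical column. The helper below is a hypothetical sketch of that expansion, not the code used inside glum or tabmat; `n_levels` is assumed to hold the number of expanded columns for each input column (1 for a numeric column).

```python
def expand_penalty(penalty, n_levels):
    """Repeat each per-column penalty value once per expanded column."""
    if len(penalty) != len(n_levels):
        raise ValueError("penalty and n_levels must have the same length")
    expanded = []
    for value, levels in zip(penalty, n_levels):
        expanded.extend([value] * levels)
    return expanded


# Columns: numeric, categorical with three levels, numeric.
print(expand_penalty([1.0, 0.5, 0.0], [1, 3, 1]))
# [1.0, 0.5, 0.5, 0.5, 0.0]
```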

    Bug fix:

    • The score method of GeneralizedLinearRegressor and GeneralizedLinearRegressorCV now accepts data frames.
    • Upgraded the code to use tabmat 3.0.0.

    Other:

    • A major overhaul of the documentation. Everything is better!
    • The methods of the link classes will now return scalars when given scalar inputs. Under certain circumstances, they'd return zero-dimensional arrays.
    • There is a new benchmark available glm_benchmarks_run based on the Boston housing dataset. See here.
    • glm_benchmarks_analyze now includes offset in the index. See here.
    • glmnet_python was removed from the benchmarks suite.
    • The innermost coordinate descent was optimized. This speeds up coordinate descent dominated problems like LASSO by about 1.5-2x. See here.
  • 1.5.1(Jul 22, 2021)

    1.5.1 - 2021-07-22

    Bug fix:

    • Have the linear_predictor and predict methods of GeneralizedLinearRegressor and GeneralizedLinearRegressorCV honor the offset when alpha is None.
  • 1.5.0(Jul 15, 2021)

    1.5.0 - 2021-07-15

    New features:

    • The linear_predictor and predict methods of quantcore.glm.GeneralizedLinearRegressor and quantcore.glm.GeneralizedLinearRegressorCV gain an alpha parameter (in complement to alpha_index). Moreover, they are now able to predict for multiple penalties.

    Other:

    • Methods of Link now consistently return NumPy arrays, whereas they used to preserve pandas series in special cases.
    • Don't list sparse_dot_mkl as a runtime requirement from the conda recipe.
    • The minimal NumPy pin should be dependent on the NumPy version in host and not fixed to 1.16.
  • 1.4.3(Jun 25, 2021)

    1.4.3 - 2021-06-25

    Bug fix:

    • copy_X = False will now raise a value error when X has dtype int32 or int64. Previously, it would only raise for dtype int64.
  • 1.4.2(Jun 15, 2021)

    1.4.2 - 2021-06-15

    Tutorials and documentation improvements:

    • Adding tutorials to the documentation
    • Additional documentation improvements

    Bug fix:

    • Verbose progress bar now working again.

    Other:

    • Small improvement in documentation for the alpha_index argument to :func:quantcore.glm.GeneralizedLinearRegressor.predict.
    • Pinned pre-commit hooks versions.
  • 1.4.1(May 1, 2021)

  • 1.4.0(Apr 13, 2021)

    1.4.0 - 2021-04-13

    Deprecations:

    • Fusing the alpha and alphas arguments for quantcore.glm.GeneralizedLinearRegressor. alpha now also accepts array-like inputs. alphas is now deprecated but can still be used for backward compatibility. The alphas argument will be removed with the next major version.

    Other:

    • We removed entry points to functions in quantcore.glm_benchmarks from the conda package.
  • 1.3.1(Apr 13, 2021)

    1.3.1 - 2021-04-12

    Bug fix:

    • quantcore.glm._distribution.unit_variance_derivative is evaluating a proper numexpr expression again (regression in 1.3.0).
  • 1.3.0(Apr 13, 2021)

    1.3.0 - 2021-04-12

    New features:

    • We added a new solver based on scipy.optimize.minimize(method='trust-constr').
    • We added support for linear inequality constraints of type A_ineq.dot(coef_) <= b_ineq.
  • 1.2.0(Feb 4, 2021)

  • 1.1.1(Jan 11, 2021)

  • 1.1.0(Nov 23, 2020)

    1.1.0 - 2020-11-23

    New features:

    • Direct support for pandas categorical types in fit and predict. These will be converted into a CategoricalMatrix.

  • 1.0.1(Nov 12, 2020)

  • 1.0.0(Nov 11, 2020)

    Breaking change:

    • Renamed alpha_level attribute of quantcore.glm.GeneralizedLinearRegressor and quantcore.glm.GeneralizedLinearRegressorCV to alpha_index.

    Other:

    • Clarified behavior of scale_predictors.
  • 0.0.15(Nov 11, 2020)
