Formulae is a Python library that implements Wilkinson's formulas for mixed-effects models.

Overview

PyPI version codecov Code style: black

formulae

formulae is a Python library that implements Wilkinson's formulas for mixed-effects models. The main difference with other implementations like Patsy or formulaic is that formulae can work with formulas describing a model with both common and group specific effects (a.k.a. fixed and random effects, respectively).

This package has been written to make it easier to specify models with group effects in Bambi, a package that makes it easy to work with Bayesian GLMMs in Python, but it could be used independently as a backend for another library. The approach in this library is to extend classical statistical formulas in a similar way than in R package lme4.

Installation

formulae requires a working Python interpreter (3.7+) and the libraries numpy, scipy and pandas with versions specified in the requirements.txt file.

Assuming a standard Python environment is installed on your machine (including pip), the latest release of formulae can be installed in one line using pip:

pip install formulae

Alternatively, if you want the development version of the package you can install from GitHub:

pip install git+https://github.com/bambinos/formulae.git

Documentation

The official documentation can be found here

Notes

  • The data argument only accepts objects of class pandas.DataFrame.
  • y ~ . is not implemented and won't be implemented in a first version. However, it is planned to be included in the future.
Comments
  • Feature Request: Splines

    Feature Request: Splines

    Hi all. I would like to start with my appreciation for your work on Bambi and formulae! I think both of these are very valuable additions to the scientific python stack. I saw a thread on twitter that made me think formulae might become the new patsy, so I thought I would describe an affordance in patsy that would be really cool to add to formulae: splines: https://patsy.readthedocs.io/en/latest/spline-regression.html

    I would love to be able to include bs(x) in one line, e.g.

    model = bmb.Model("test_positive ~ vaccinated + bs(age,df=3) + (vaccinated+vaccinated:day|state)",
                      df, family="bernoulli")
    

    Since this is already implemented in patsy maybe it will not be too hard to add here, as well! Thanks again for all of your work!

    opened by aflaxman 6
  • model definition

    model definition

    I would like to fit a fully specified model. In lme I would write this as:

    f = "choice ~ 1 + stimulus*reward + (1+stimulus*reward|subject_id)"`
    

    In formulae this gives the following error:

    import formulae as fm
    fm.model_description(f)
    
        891                 return NotImplemented
        892         else:
    --> 893             raise ValueError("LHS of group specific term cannot have more than one term.")
        894 
        895     def __repr__(self):
    

    The following does seem to work:

    f = "choice ~ 1 + stimulus*reward + (1|subject_id) + (stimulus|subject_id) + (reward|subject_id) + (stimulus:reward|subject_id)"
    fm.model_description(f)
    

    Is this the syntax for a fully specified model in formulae? Output:

    ModelTerms(
      ResponseTerm(
        Term(
          name= choice
          variable= choice
          kind= None
        )
      ),
      InterceptTerm(),
      Term(
        name= stimulus
        variable= stimulus
        kind= None
      ),
      Term(
        name= reward
        variable= reward
        kind= None
      ),
      InteractionTerm(
        name= stimulus:reward
        variables= {'stimulus', 'reward'}
      ),
      GroupSpecTerm(
        expr= InterceptTerm(),
        factor= Term(
          name= subject_id
          variable= subject_id
          kind= None
        )
      ),
      GroupSpecTerm(
        expr= Term(
          name= stimulus
          variable= stimulus
          kind= None
        ),
        factor= Term(
          name= subject_id
          variable= subject_id
          kind= None
        )
      ),
      GroupSpecTerm(
        expr= Term(
          name= reward
          variable= reward
          kind= None
        ),
        factor= Term(
          name= subject_id
          variable= subject_id
          kind= None
        )
      ),
      GroupSpecTerm(
        expr= InteractionTerm(
          name= stimulus:reward
          variables= {'stimulus', 'reward'}
        ),
        factor= Term(
          name= subject_id
          variable= subject_id
          kind= None
        )
      )
    )
    
    opened by jwdegee 6
  • Correlated Random Effects

    Correlated Random Effects

    Hi.

    In Ch3 of the lme4 book manuscript, Bates showed two cases of random effects, one is correlated and the other is uncorrelated (independent):

    Screen Shot 2021-11-19 at 16 01 07

    The correlated one is from formula Reaction ~ 1 + Days + (1 + Days | Subject), the uncorrelated on is from Reaction ~ 1 + Days + (1|Subject) + (0 + Days|Subject).

    I tried the same formulae with formulae, the two design matrices are the same:

    Screen Shot 2021-11-19 at 16 04 12
    from formulae import design_matrices
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    
    data = pd.read_csv('../datasets/sleepstudy.csv')
    
    f1 = 'Reaction ~ 1 + Days + (1 + Days | Subject)'
    f2 = 'Reaction ~ 1 + Days + (1|Subject) + (0 + Days|Subject)'
    
    fig, ax = plt.subplots(2, 1, figsize=(12, 6))
    for i, formula in enumerate([f1, f2]):
        
        dm = design_matrices(formula, data)
        terms = dm.group.terms_info.keys()
    
        Zs = [dm.group[key] for key in dm.group.terms_info.keys()]
        Z = np.hstack(Zs)
        Z[Z ==0] = np.nan
        ax[i].imshow(Z.T)
        ax[i].set_title(formula)
    

    Is this an intended behavior for formulae or is it a bug?

    opened by huangziwei 4
  • create MANIFEST.in

    create MANIFEST.in

    The source tarball is broken b/c a required file in setup.py is not shipped with the source. This PR adds a MANIFEST.in file to ensure the necessary files will be packaged.

    See https://github.com/conda-forge/staged-recipes/pull/14193 for an example of the problem with the current tarball.

    opened by ocefpaf 4
  • Update documentation

    Update documentation

    This PR attempts to improve existing online documentation. Many classes, methods, and functions are currently documented, but this documentation is not shown on the webpage. The general aim is to have three broad sections.

    • Examples: A collection of examples using formulae to obtain design matrices and fit statistical models.
    • API Reference: The documentation for the objects that we expect the users to interact with.
    • Internals: A documentation made for people wanting to use formulae for their own libraries.

    These tasks may be split into several PRs. The goal of this PR is to improve the Internals reference. Examples will come later.

    opened by tomicapretto 3
  • Adding support for survival functions #79

    Adding support for survival functions #79

    This PR adds a transformation for survival type responses. It is similar to R's Surv.

    Surv(time, status) ~ returns a 2xn matrix with the time and the event status. These can then be used in the model to handle censored and uncensored data differently.

    #79

    opened by ipa 2
  • Add contrasts

    Add contrasts

    • Add general contrasts so we can use other encodings apart from reference encoding.
    • Stop reshaping 1d arrays to column vectors. It's not really necessary.
    opened by tomicapretto 2
  • Add basis spline stateful transformation

    Add basis spline stateful transformation

    Adds an equivalent of the bs() function in the splines package in R. This is a stateful transformation by the way (it remember knots, bounds, and whether it has intercept or not).

    Using the engine dataset from the R package gamair, we have:

    import pandas as pd
    from formulae import  design_matrices
    
    data = pd.DataFrame({
        "size": [
            2.78, 2.43, 2.32, 2.43, 2.98, 2.32, 2.32, 2.32, 2.32, 2.13, 
            2.13, 2.13, 2.98, 1.99, 1.99, 1.99, 1.78, 1.58, 1.42
        ],
        "wear": [
            3, 3.1, 4.8, 3.3, 2.8, 2.9, 3.8, 3, 2.7, 3.2, 2.4, 2.6, 1.7,
            2.6, 2.8, 2.4, 2.5, 4.2, 4
        ]
    })
    design_matrices("wear ~ bs(size, 4)", data).common.design_matrix
    
    array([[1.        , 0.00498077, 0.09481583, 0.56163869, 0.33856471],
           [1.        , 0.10358454, 0.429405  , 0.46238083, 0.00462963],
           [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
           [1.        , 0.10358454, 0.429405  , 0.46238083, 0.00462963],
           [1.        , 0.        , 0.        , 0.        , 1.        ],
           [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
           [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
           [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
           [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
           [1.        , 0.36011331, 0.46706614, 0.16341177, 0.        ],
           [1.        , 0.36011331, 0.46706614, 0.16341177, 0.        ],
           [1.        , 0.36011331, 0.46706614, 0.16341177, 0.        ],
           [1.        , 0.        , 0.        , 0.        , 1.        ],
           [1.        , 0.48758651, 0.37856345, 0.08455375, 0.        ],
           [1.        , 0.48758651, 0.37856345, 0.08455375, 0.        ],
           [1.        , 0.48758651, 0.37856345, 0.08455375, 0.        ],
           [1.        , 0.56530178, 0.19739645, 0.02130178, 0.        ],
           [1.        , 0.39454797, 0.04771909, 0.00187011, 0.        ],
           [1.        , 0.        , 0.        , 0.        , 0.        ]])
    

    EDIT

    By mistake, I ended up including here commits related to other topics. Now, it is also possible to pass response levels with index notation without having to use an explicit string. This doesn't work when the level contains spaces (must use a string there) or when the level is a number (use binary() function there). Examples:

    This works:

    • "y['A'] ~ x1 + x2"
    • "y[A] ~ x1 + x2"
    • "y['My level contains spaces'] ~ x1 + x2"
    • "binary(y, 2)" ~ x1 + x2"

    This does not work

    • "y[My level contains spaces] ~ x1 + x2"
    • "y[1] ~ x1 + x2"
    opened by tomicapretto 2
  • Factor of group specific effect can be an interaction now

    Factor of group specific effect can be an interaction now

    Originally, the group of a group-specific term could only be a single term, not an interaction. This PR aims to enable groups to be made of interactions. It looks like so far it is working, but there are some edge cases with the evaluation with new data that should be addressed.

    This feature (problem) was requested (reported) here.

    opened by tomicapretto 2
  • Nested stateful transformations don't work

    Nested stateful transformations don't work

    We fail at parsing and resolving terms that are nested function calls. All the following alternatives work without any problem

    design_matrices("y ~ f(x1)", data)
    design_matrices("y ~ f(arg = x1)", data)
    design_matrices("y ~ f(arg = 'blabla')", data)
    design_matrices("y ~ f(arg1= a, arg2 = "blabla")", data)
    

    But nested function calls, like the following, don't work

    design_matrices("y ~ f(g(x1))", data)
    design_matrices("y ~ f(arg = g(x1))", data)
    # ...etc
    

    To solve this we need to update the Resolver.visitCallExpr() method as well as how the call term handles Expr.Call() instances, and maybe something else that I'm missing now.

    bug 
    opened by tomicapretto 2
  • Parser generators

    Parser generators

    It looks like you're doing a lot of low-level parser generator work by hand. Have you considered using an existing pure Python parser package like rply? We've had a lot of success with that package in hy.

    opened by brandonwillard 2
  • Enable deepcopy of bmb.Model

    Enable deepcopy of bmb.Model

    Hello !

    I tried to deepcopy a bmb.Model and find out that we face a cannot pickle '_thread.lock' object. After investigation, this is due to saving of thread locked objects within formulae.Environment objects.

    Indeed many, locked variables are fetched while calling Environment.capture(). My dummy workaround was to just replace the return cls([frame.f_locals, frame.f_globals])

            if isinstance(env, cls):
                return env
            elif isinstance(env, numbers.Integral):
                depth = env + reference
            else:
                raise TypeError("'env' must be either an integer or an instance of Environment.")
            frame = inspect.currentframe()
            try:
                for _ in range(depth + 1):
                    if frame is None:
                        raise ValueError("call-stack is not that deep!")
                    frame = frame.f_back
                # return cls([frame.f_locals, frame.f_globals])
                return cls([])
            finally:
                del frame
    

    I'm not experienced enough on Bambi formulae, a good workaround would be to keep only what is necessary for formulae to work in output of the capture() function

    I hope it would be helpful

    opened by lt-brs 1
  • Support unknown test time group categories

    Support unknown test time group categories

    I just tried a model with a different train/test split for the fit() and predict() functionality and hit the following error in the predict() call in bambi and hit the following error:

    https://github.com/bambinos/formulae/blob/b926883a4ab744367953095a8540c8bf64ec2449/formulae/terms/variable.py#L223-L226

    For a lot of real world use cases it's quite common for unknown categories to show up at inference time (e.g. new users, new customers etc) and it would be great to try and handle this natively in the predict() call such that a 'default' prediction can be made based on the mean across all groups rather than a specific group.

    opened by markgoodhead 6
  • Version 0.3.3 crashes kernel with large dataset

    Version 0.3.3 crashes kernel with large dataset

    Hi all,

    Thank you for the work on this awesome package!

    I am using bambi to fit a fairly complex model on a large dataset with a single group effect, which has many individuals. There are around 183,000 rows and 30,542 unique groups. Version 0.3.3 crashes Jupyter reliably when instantiating this design matrix. Interestingly, version 0.2.0 will instantiate it, with the caveat I can't include an interaction term I can in 0.3.3 within bambi (see bambi issue 495).

    I've tried a few different approaches, and have found it will instantiate with about 25-50% of the data (using DataFrame.sample(frac=.25)), so it seems more an issue of sheer scale than anything else. I've also tried with Spyder, getting the same issue.

    The code below will grab a modified dataset of the same size and structure I am using and set up the model design, which kills my kernel after a minute or two.

    import pandas as pd
    from formulae import design_matrices
    
    trouble = pd.read_feather('https://osf.io/kw2xh/download')
    md = design_matrices('response ~ bs(time, degree=2, df=10) * state + state * level * trait + (1|pid)', data=trouble)
    

    0.2.0 will return this after a little while, however. Any help is greatly appreciated, hopefully this issue isn't localised to my machine!

    opened by alexjonesphd 7
  • Improve tests

    Improve tests

    Tests are quite complete, but they are messy and redundant as well. Also, there are a couple of things we're missing yet (see https://app.codecov.io/gh/bambinos/formulae/blob/master/formulae/terms/terms.py).

    enhancement 
    opened by tomicapretto 0
Releases(v0.3.4)
  • v0.3.4(May 18, 2022)

    Release done because we'll have a new Bambi release.

    • Fixed a bug in the levels of interaction terms involving numeric terms with multiple columns (b4a1f73)
    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Apr 14, 2022)

  • v0.3.2(Apr 13, 2022)

  • v0.3.1(Apr 12, 2022)

    Maintenance and fixes

    • Renamed ResponseVector to ResponseMatrix (#71)
    • Renamed design_vector to design_matrix in ResponseMatrix(#71)
    • Updated docstrings in formulae/matrices.py (#71)

    Deprecation

    • Removed binary and success attributes from ResponseMatrix (#71)
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Mar 11, 2022)

    Changes

    New features

    • We can create our own encodings such as Treatment or Sum encoding. These are subclasses of Encoding.
    • Added two aliases T and S that are shorthands of C(var, Treatment) and C(var, Sum) respectively.
    • DesignVector, CommonEffectsMatrix and GroupEffectsMatrix now retrieve their values when passed to np.array() and np.asarray().
    • Add poly() stateful transform.
    • na_action in design_matrices() can be "pass" which means not to drop or raise error about missing values. It just keeps them (#69)
    • design_matrices() gains a new argument extra_namespace. It allows us to pass a dictionary of transformations that are made available in the environment where the formula is evaluated (#70)

    Maintenance and fixes

    • Fixed a bug in the addition of lower order terms when the original higher order term wasn't full-rank.
    • Columns for categorical terms are integer by default. They are casted to float only if there are other float-valued columns in the design matrix.
    • Updated str and repr methods of ResponseVector, CommonEffectsMatrix, and GroupEffectsMatrix.
    • Added str and repr methods for DesignMatrices.
    • Added get_item method for DesignMatrices.
    • Added support for comparison operators within function calls.
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Sep 15, 2021)

  • v0.1.4(Aug 9, 2021)

  • v0.1.3(Aug 2, 2021)

  • v0.1.2(Jul 30, 2021)

  • v0.1.1(Jun 18, 2021)

  • v0.1.0(Jun 12, 2021)

    Making a new release. I'm using 0.1.0 because, as far as I'm aware, there are no hidden bugs now. Also, I need this new release to merge a PR in Bambi that does fix tricky bugs.

    Source code(tar.gz)
    Source code(zip)
  • v0.0.10(May 10, 2021)

WAGMA-SGD is a decentralized asynchronous SGD for distributed deep learning training based on model averaging.

WAGMA-SGD is a decentralized asynchronous SGD based on wait-avoiding group model averaging. The synchronization is relaxed by making the collectives externally-triggerable, namely, a collective can b

Shigang Li 6 Jun 18, 2022
This is a curated list of medical data for machine learning

Medical Data for Machine Learning This is a curated list of medical data for machine learning. This list is provided for informational purposes only,

Andrew L. Beam 5.4k Dec 26, 2022
Fourier-Bayesian estimation of stochastic volatility models

fourier-bayesian-sv-estimation Fourier-Bayesian estimation of stochastic volatility models Code used to run the numerical examples of "Bayesian Approa

15 Jun 20, 2022
Dragonfly is an open source python library for scalable Bayesian optimisation.

Dragonfly is an open source python library for scalable Bayesian optimisation. Bayesian optimisation is used for optimising black-box functions whose

744 Jan 02, 2023
Both social media sentiment and stock market data are crucial for stock price prediction

Relating-Social-Media-to-Stock-Movement-Public - We explore the application of Machine Learning for predicting the return of the stock by using the information of stock returns. A trading strategy ba

Vishal Singh Parmar 15 Oct 29, 2022
30 Days Of Machine Learning Using Pytorch

Objective of the repository is to learn and build machine learning models using Pytorch. 30DaysofML Using Pytorch

Mayur 119 Nov 24, 2022
AutoX是一个高效的自动化机器学习工具,它主要针对于表格类型的数据挖掘竞赛。 它的特点包括: 效果出色、简单易用、通用、自动化、灵活。

English | 简体中文 AutoX是什么? AutoX一个高效的自动化机器学习工具,它主要针对于表格类型的数据挖掘竞赛。 它的特点包括: 效果出色: AutoX在多个kaggle数据集上,效果显著优于其他解决方案(见效果对比)。 简单易用: AutoX的接口和sklearn类似,方便上手使用。

4Paradigm 431 Dec 28, 2022
This is the code repository for Interpretable Machine Learning with Python, published by Packt.

Interpretable Machine Learning with Python, published by Packt

Packt 299 Jan 02, 2023
Azure Cloud Advocates at Microsoft are pleased to offer a 12-week, 24-lesson curriculum all about Machine Learning

Azure Cloud Advocates at Microsoft are pleased to offer a 12-week, 24-lesson curriculum all about Machine Learning

Microsoft 43.4k Jan 04, 2023
A simple machine learning package to cluster keywords in higher-level groups.

Simple Keyword Clusterer A simple machine learning package to cluster keywords in higher-level groups. Example: "Senior Frontend Engineer" -- "Fronte

Andrea D'Agostino 10 Dec 18, 2022
Predicting India’s COVID-19 Third Wave with LSTM

Predicting India’s COVID-19 Third Wave with LSTM Complete project of predicting new COVID-19 cases in the next 90 days with LSTM India is seeing a ste

Samrat Dutta 4 Jan 27, 2022
MiniTorch - a diy teaching library for machine learning engineers

This repo is the full student code for minitorch. It is designed as a single repo that can be completed part by part following the guide book. It uses

1.1k Jan 07, 2023
#30DaysOfStreamlit is a 30-day social challenge for you to build and deploy Streamlit apps.

30 Days Of Streamlit 🎈 This is the official repo of #30DaysOfStreamlit — a 30-day social challenge for you to learn, build and deploy Streamlit apps.

Streamlit 53 Jan 02, 2023
Diabetes Prediction with Logistic Regression

Diabetes Prediction with Logistic Regression Exploratory Data Analysis Data Preprocessing Model & Prediction Model Evaluation Model Validation: Holdou

AZİZE SULTAN PALALI 2 Oct 23, 2021
Data Version Control or DVC is an open-source tool for data science and machine learning projects

Continuous Machine Learning project integration with DVC Data Version Control or DVC is an open-source tool for data science and machine learning proj

Azaria Gebremichael 2 Jul 29, 2021
MLFlow in a Dockercontainer based on Azurite and Postgres

mlflow-azurite-postgres docker This is a MLFLow image which works with a postgres DB and a local Azure Blob Storage Instance (Azurite). This image is

2 May 29, 2022
About Solve CTF offline disconnection problem - based on python3's small crawler

About Solve CTF offline disconnection problem - based on python3's small crawler, support keyword search and local map bed establishment, currently support Jianshu, xianzhi,anquanke,freebuf,seebug

天河 32 Oct 25, 2022
A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

KXY Technologies, Inc. 35 Jan 02, 2023
Magenta: Music and Art Generation with Machine Intelligence

Magenta is a research project exploring the role of machine learning in the process of creating art and music. Primarily this involves developing new

Magenta 18.1k Dec 30, 2022
Built various Machine Learning algorithms (Logistic Regression, Random Forest, KNN, Gradient Boosting and XGBoost. etc)

Built various Machine Learning algorithms (Logistic Regression, Random Forest, KNN, Gradient Boosting and XGBoost. etc). Structured a custom ensemble model and a neural network. Found a outperformed

Chris Yuan 1 Feb 06, 2022