A pure Python implementation of a mixed effects random forest (MERF) algorithm

Related tags

Algorithmsmerf
Overview

Mixed Effects Random Forest

This repository contains a pure Python implementation of a mixed effects random forest (MERF) algorithm. It can be used, out of the box, to fit a MERF model and predict with it.

MERF Model

The MERF model is:

y_i = f(X_i) + Z_i * b_i + e_i

b_i ~ N(0, D)

e_i ~ N(0, R_i)

for each cluster i out of n total clusters.

In the above:

  • y_i -- the (n_i x 1) vector of responses for cluster i. These are given at at training.
  • X_i -- the (n_i x p) fixed effects covariates that are associated with the y_i. These are given at training.
  • Z_i -- the (n_i x q) random effects covariates that are associated with the y_i. These are given at training.
  • e_i -- the (n_i x 1) vector of errors for cluster i. This is unknown.
  • i is the cluster_id. This is given at training.

The learned parameters in MERF are:

  • f() -- which is a random forest that models the, potentially nonlinear, mapping from the fixed effect covariates to the response. It is common across all clusters.
  • D -- which is the covariance of the normal distribution from which each of the b_i are drawn. It is common across all clusters.
  • sigma^2 -- which is the variance of e_i, which is assumed to be white. It is common across all clusters.

Note that one key assumption of the MERF model is that the random effect is linear. Though, this is limiting in some regards, it is still broadly useful for many problems. It is better than not modelling the random effect at all.

The algorithms implemented in this repo were developed by Ahlem Hajjem, Francois Bellavance, and Denis Larocque and published in a paper here. Many thanks to Ahlem and Denis for providing an R reference and aiding in the debugging of this code. Quick note, the published paper has a small typo in the update equation for sigma^2 which is corrected in the source code here.

Using the Code

The MERF code is modelled after scikit-learn estimators. To use, you instantiate a MERF object. As of 1.0, you can pass any non-linear estimator for the fixed effect. By default this is a scikit-learn random forest, but you can pass any model you wish that conforms to the scikit-learn estimator API, e.g. LightGBM, XGBoost, a properly wrapped PyTorch neural net,

Then you fit the model using training data. As of 1.0, you can also pass a validation set to see the validation performance on it. This is meant to feel similar to PyTorch where you can view the validation loss after each epoch of training. After fitting you can predict responses from data, either from known (cluster in training set) or new (cluster not in training set) clusters.

For example:

> from merf import MERF
> merf = MERF()
> merf.fit(X_train, Z_train, clusters_train, y_train)
> y_hat = merf.predict(X_test, Z_test, clusters_test)

Alternatively:

> from lightgbm import LGBMRegressor
> lgbm = LGBMRegressor()
> mrf_lgbm = MERF(lgbm, max_iterations=15)
> mrf_lgbm.fit(X_train, Z_train, clusters_train, y_train, X_val, Z_val, clusters_val, y_val)
> y_hat = merf.predict(X_test, Z_test, clusters_test)

Note that training is slow because the underlying expectation-maximization (EM) algorithm requires many calls to the non-linear fixed effects model, e.g. random forest. That being said, this implemtataion has early stopping which aborts the EM algorithm if the generalized log-likelihood (GLL) stops significantly improving.

Tour of the Source Code

The \merf directory contains all the source code:

  • merf.py is the key module that contains the MERF class. It is imported at the package level.
  • merf_test.py contain some simple unit tests.
  • utils.py contains a class for generating synthetic data that can be used to test the accuracy of MERF. The process implemented is the same as that in this paper.
  • viz.py contains a plotting function that takes in a trained MERF object and plots various metrics of interest.

The \notebooks directory contains some useful notebooks that show you how to use the code and evaluate MERF performance. Most of the techniques implemented are the same as those in this paper.

Comments
  • Can not run examples given by notebooks

    Can not run examples given by notebooks

    Greetings,

    I can not reproduce notebooks given in the ./notebooks. For example, run code given in the Real World MERF Examples notebook thrower out two errors.

    When I tried to import the modules, the first error popped out

    No module named 'merf.evaluator'

    When I tried to call mrf.fit(X_train, Z_train, clusters_train, y_train), the error popped out

    ValueError: operands could not be broadcast together with shapes (10,) (162,)

    I also attached the full information below FYI. BTW, I installed the merf using pip install merf. It should be the latest version.

    Any suggestion is appreciated.

    ValueError Traceback (most recent call last) in () 13 clusters_train = train['Subject'] 14 y_train = train['Reaction'] ---> 15 mrf.fit(X_train, Z_train, clusters_train, y_train) 16 17 # Mixed Effects Random Forest Test

    ~/anaconda/lib/python3.6/site-packages/merf/merf.py in fit(self, X, Z, clusters, y) 142 143 # Compute y_star for this cluster and put back in right place --> 144 y_star_i = y_i - Z_i.dot(b_hat_i) 145 y_star[indices_i] = y_star_i 146

    ~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op) 713 lvalues = lvalues.values 714 --> 715 result = wrap_results(safe_na_op(lvalues, rvalues)) 716 return construct_result( 717 left,

    ~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues) 674 try: 675 with np.errstate(all='ignore'): --> 676 return na_op(lvalues, rvalues) 677 except Exception: 678 if isinstance(rvalues, ABCSeries):

    ~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y) 650 try: 651 result = expressions.evaluate(op, str_rep, x, y, --> 652 raise_on_error=True, **eval_kwargs) 653 except TypeError: 654 if isinstance(y, (np.ndarray, ABCSeries, pd.Index)):

    ~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs) 208 if use_numexpr: 209 return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error, --> 210 **eval_kwargs) 211 return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error) 212

    ~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs) 119 120 if result is None: --> 121 result = _evaluate_standard(op, op_str, a, b, raise_on_error) 122 123 return result

    ~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs) 61 _store_test_result(False) 62 with np.errstate(all='ignore'): ---> 63 return op(a, b) 64 65

    ValueError: operands could not be broadcast together with shapes (10,) (162,)

    documentation 
    opened by Yutong-Dai 5
  • TypeError: __init__() got an unexpected keyword argument 'n_estimators'

    TypeError: __init__() got an unexpected keyword argument 'n_estimators'

    When trying to set arguments like n_estimators for MERF, MERF.fit returns the above initialisation error. It works fine with default setting. This seems to suggest that MERF to sklearn random forest API is broken. Is this a version issue? I'm using: Python 3.9.5 scikit-learn==0.24.2

    opened by jessie831024 3
  • Add partial dependency plotting function

    Add partial dependency plotting function

    It would be nice if there was a function for creating partial dependency plot data or plots. This would help with translating information to consumers and clients

    enhancement 
    opened by OliverFishCode 3
  • Random forest can now take user defined parameters.

    Random forest can now take user defined parameters.

    The current implementation of MERF only allows the user to set the number of trees in the random forest. This pull request allows the user to set any parameter they want but prevents them from changing important params like having the out of bag score on.

    I hope you accept this it has been useful for me.

    opened by arose13 3
  • Why not expose all the parameters of the RandomForestRegressor?

    Why not expose all the parameters of the RandomForestRegressor?

    Also, would using the ExtraTreesRegressor also in this framework? It seems simple to just add a base_estimator parameter, allowing the user to use whatever he/she wants?

    I can submit a pull request if it makes sense to add these features.

    opened by jonathanng 3
  • Typechecks

    Typechecks

    Thank you for this wonderful package and your efforts!

    For future users, would it be possible to add type-checks and more verbose error statements to the fit func of the MERF class when the input type deviates from the expected input type? This led to a bit of reverse engineering to figure out why the underlying linear alg was failing when, for example, inputting a NUMPY array in place of an expected pandas series.

    (Also happy to contribute...when I have the time).

    Thanks!

    opened by JCSyng 3
  • Regarding Merf Plot by Cluster

    Regarding Merf Plot by Cluster

    Screen Shot 2020-03-17 at 7 39 40 PM

    I am using Python 3.6.5 and Pandas version 1.0.1. I receive the above when attempting to plot a specific cluster by its respective id.

    I noticed here in the example notebook that Panel is deprecated. I welcome feedback on the best way to address this.

    opened by ahlusar1989 2
  • Add options

    Add options

    Add ability to specify additional parameters to RandomForestRegressor, with some minor error checking and two additional tests.

    Options are passed in as dictionary argument rf_opts. n_estimators is ignored in favor of the argument to the constructor; oob_score is ignored and always set to True; warm_start is ignored (assumed to be False).

    Two reasons this is helpful for me, at least: the random forest fit is often faster (particularly on my laptop) with n_jobs set to 1. I'm also hoping to investigate the shap interpretability package, but the underlying algorithms perform extremely poorly on deep trees, so the ability to specify max_depth would make that a bit easier.

    This is a repeat of another pull request, closed by me, attempting to fix formatting issues that cause continuous integration to fail.

    opened by cunningjames 2
  • Translation from MERF random effects nomenclature to align with other implementations

    Translation from MERF random effects nomenclature to align with other implementations

    First, thank you for creating this package! It is kind of exactly what I am looking for. However, I have some questions that stem from my prior experience with mixed models on other platforms (notably the lme4 package for R), and the documentation and examples aren't helping me translate my knowledge of how random effects are specified/named in MERF.

    For example, in lme4, random effects are designated as having slopes and intercepts (which I think correspond to "clusters" and "covariates" in MERF? With 1s in the covariates matrix indicating the intercepts?).

    So if "subject" (in an experiment) or "county") like in the radon example are grouping variables over which there are multiple observations, one could specify a random intercept for subject (or county). Then, one could additionally specify a random slope for some variable by the grouping variable (such as to specify a random slope for experimental condition by subject, to allow the model to estimate the variance of how much subjects differ in their response to the experimental manipulation, or that counties could have a random slope for floor, allowing counties to differ in how much each floor impacts random levels in the model). I'm not quite sure if I'm translating these to the MERF nomenclature correctly.

    Moreover, lme4 allows the researcher to specify multiple, crossed random effects (e.g., random intercepts for both experimental subjects and experimental items, as well as random slopes for variables by both subject and item).

    Getting to the point:

    1. My looking through the examples leads me to think that the clusters argument is the column containing the IDs for which random intercepts are generated: Is this the case?
    2. I get the inclination that the Z matrix includes 1s for the random intercepts, but can include a second column for random slopes (i.e., the covariates): Is this the case? Can there be more columns?

    More generally, some comments in the notebooks or documentation about what these variables are (concretely, and what form they must/can/cannot take, and possibly relationships to how random effects are specified in other mixed modeling packages) would be very helpful.

    1. Can MERF handle crossed random effects structures?
    2. Finally, does MERF provide variable importance measures from the fit forest (analogous to those produced in sklearn, and from randomForest and party::cforest in R)? I couldn't find mention of that in the readme or the notebooks.

    Thanks!

    documentation 
    opened by dstanner 2
  • Issue with git cloning and pip installing from github

    Issue with git cloning and pip installing from github

    Thanks for an excellent package! I am very keen to try out your latest updates from the last few days (including using lighgbm, SHAP values). However, I cannot seem to pip install directly from github or clone because of this error:

    Clone failed
    				Invalid path 'data/Rossmann Store Sales | Kaggle eval.pdf'
    				unable to checkout working tree
    				warning: Clone succeeded, but checkout failed.
    				You can inspect what was checked out with 'git status'
    				and retry with 'git restore --source=HEAD :/'
    

    It seems like this error is because of the file name 'Rossmann Store Sales | Kaggle eval.pdf' in the data directory. Is it possible to rename the two pdf files in that directory?

    opened by ritviksahajpal 1
  • saving merf as a pickle

    saving merf as a pickle

    Hi, it took me a long time to train the MERF on my dataset. After training, I tried saving it as a pickle object and reload it to make prediction on test data. However, I got an error saying the pkl is a dictionary that doesn't have a predict method. Any ideas on how to get around this? I don't want to train the model every time I pass new test data. Thanks.

    AttributeError                            Traceback (most recent call last)
    
    <ipython-input-55-0bd2bd5ab064> in <module>()
    ----> 1 y_hat_known_merf = loaded_model.predict(X_known, Z_known, clusters_known)
          2 y_hat_known_merf
    
    AttributeError: 'dict' object has no attribute 'predict'
    
    
    opened by jyu-theartofml 1
  • Bump numpy from 1.18.4 to 1.22.0 in /docker

    Bump numpy from 1.18.4 to 1.22.0 in /docker

    Bumps numpy from 1.18.4 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Line # 96 in merf.merf.py might be better when modifying

    Line # 96 in merf.merf.py might be better when modifying "len(indices_i)" to "sum(indices_i)"

    Hello, Thanks for your great work on merf! When I debug merf, I found that there is one line that does not work in any case:

    
    63   def predict(self, X: np.ndarray, Z: np.ndarray, clusters: pd.Series):
               ...
              for cluster_id in self.cluster_counts.index:
                    indices_i = clusters == cluster_id
    
                   # If cluster doesn't exist in test data that's ok. Just move on.
    96           if len(indices_i) == 0:  < ------------------ might revise to: if sum(indices_i) == 0
                        continue
    
                   # If cluster does exist, apply the correction.
                    ...
    

    I marked Line # 96 and suggested changing "len()" to "sum()", otw, this if will never run in any case because indices_i is a pd.Series has the same shape with the input cluster.

    And if possible, I would like to ask another question about the random effect matrix Z.

    In the given example notebook, I noticed that when considering one variance as a random effect, the Z is not simply composed of one column but also has another column of ones. Why are the ones necessary?

    Thanks in advance! meng

    opened by CCChen-Menggg 0
  • Incompatibility with scikit-learn: MERF does not implement get_params

    Incompatibility with scikit-learn: MERF does not implement get_params

    MERF does not implement get_params. This means that scikit-learn cannot clone a MERF predictor as needed for e.g., RandomSearchCV to optimize the hyper parameter of the fixed effects model.

    opened by NegatedObjectIdentity 0
  • An issue about the random state

    An issue about the random state

    Hi Dey, Thank you so much for providing this useful python pachage. I watched your video that is very enlighting, and is curretly using the merf package to handle my data. I'm just wondering if it is possible to specify the random_state for merf for repetition. I tried to specify random_state for RandomForestRegressor, but it still returns different results every time I run the code. Below are my codes, mrf = MERF(fixed_effects_model=RandomForestRegressor(n_estimators=100, n_jobs=-1,random_state = 2),max_iterations=20)

    Do you have any idea? Thanks, Zixiao

    opened by zixiao-yin 0
  • will the MERF select best model based on validation set?

    will the MERF select best model based on validation set?

    mrf.fit(X_train, Z_train, clusters_train, y_train, X_val, Z_val, clusters_val, y_val) y_hat_known = mrf.predict(X_known, Z_known, clusters_known)

    is the mrf model the best performance model based on X_val, Z_val, clusters_val, y_val?

    BW, Yang

    opened by baiyang4 0
Releases(1.0)
  • 1.0(May 19, 2020)

    This is major release that modifies the MERF API in a backward-incompatible way. There are a number of things added to this release. The highlights:

    • Added ability to add any fixed effects learner, e.g. gradient boosting, deep learning
    • Added ability to add validation set and track validation loss during training process
    • Updated code to work with pandas 1.0+
    • Added a plotting utility to look at results of training
    • Added documentation built using Sphinx
    • Added more unit tests for pickling, typing, plotting, etc.
    • Small bug fixes
    • Moved CI over to GitHub Actions
    • Published documentation using GitHub pages
    • Moved over to Orbyter 2.0
    • Moved to Makefile for builds, CI, linting
    Source code(tar.gz)
    Source code(zip)
  • 0.3(Mar 28, 2019)

    This release contain some additions from the community:

    • Ability to pass all random forest parameters
    • Early stopping
    • Addition of CircleCI
    • Removal of some unecessary files
    Source code(tar.gz)
    Source code(zip)
Owner
Manifold
Manifold is the applied AI company. Global companies partner with Manifold through joint discovery and development programs.
Manifold
CLI Eight Puzzle mini-game featuring BFS, DFS, Greedy and A* searches as solver algorithms.

🕹 Eight Puzzle CLI Jogo do quebra-cabeças de 8 peças em linha de comando desenvolvido para a disciplina de Inteligência Artificial. Escrito em python

Lucas Nakahara 1 Jun 30, 2021
A tictactoe where you never win, implemented using minimax algorithm

Unbeatable_TicTacToe A tictactoe where you never win, implemented using minimax algorithm Requirements Make sure you have the pygame module along with

Jessica Jolly 3 Jul 28, 2022
This application solves sudoku puzzles using a backtracking recursive algorithm

This application solves sudoku puzzles using a backtracking recursive algorithm. The user interface is coded with Pygame to allow users to easily input puzzles.

Glenda T 0 May 17, 2022
Algorithmic Trading with Python

Source code for Algorithmic Trading with Python (2020) by Chris Conlan

Chris Conlan 1.3k Jan 03, 2023
Implements (high-dimenstional) clustering algorithm

Description Implements (high-dimenstional) clustering algorithm described in https://arxiv.org/pdf/1804.02624.pdf Dependencies python3 pytorch (=0.4)

Eric Elmoznino 5 Dec 27, 2022
Gnat - GNAT is NOT Algorithmic Trading

GNAT GNAT is NOT Algorithmic Trading! GNAT is a financial tool with two goals in

Sher Shah 2 Jan 09, 2022
Genius Square puzzle solver in Python

Genius Square puzzle solver in Python

James 3 Dec 15, 2022
PickMush - A mini study/project on boosting algorithm

PickMush A mini project implementing Boosting Author Shashwat Vaibhav What does it do? Classifies whether Mushroom is edible or is non-edible (binary

Shashwat Vaibahav 3 Nov 08, 2022
Algorithms and utilities for SAR sensors

WARNING: THIS CODE IS NOT READY FOR USE Sarsen Algorithms and utilities for SAR sensors Objectives Be faster and simpler than ESA SNAP and cloud nativ

B-Open 201 Dec 27, 2022
Zipline, a Pythonic Algorithmic Trading Library

Zipline, a Pythonic Algorithmic Trading Library

Stefan Jansen 463 Jan 08, 2023
Given a list of tickers, this algorithm generates a recommended portfolio for high-risk investors.

RiskyPortfolioGenerator Given a list of tickers, this algorithm generates a recommended portfolio for high-risk investors. Working in a group, we crea

Victoria Zhao 2 Jan 13, 2022
A Python Package for Portfolio Optimization using the Critical Line Algorithm

A Python Package for Portfolio Optimization using the Critical Line Algorithm

19 Oct 11, 2022
A GUI visualization of QuickSort algorithm

QQuickSort A simple GUI visualization of QuickSort algorithm. It only uses PySide6, it does not have any other external dependency. How to run Install

Jaime R. 2 Dec 24, 2021
Python based framework providing a simple and intuitive framework for algorithmic trading

Harvest is a Python based framework providing a simple and intuitive framework for algorithmic trading. Visit Harvest's website for details, tutorials

100 Jan 03, 2023
Pathfinding visualizer in pygame: A*

Pathfinding Visualizer A* What is this A* algorithm ? Simply put, it is an algorithm that aims to find the shortest possible path between two location

0 Feb 26, 2022
This repository explores an implementation of Grover's Algorithm for knights on a chessboard.

Grover Knights Welcome to my Knights project! Project Description: I explore an implementation of a quantum oracle for knights on a chessboard.

Will Sun 8 Feb 22, 2022
Our implementation of Gillespie's Stochastic Simulation Algorithm (SSA)

SSA Our implementation of Gillespie's Stochastic Simulation Algorithm (SSA) Requirements python =3.7 numpy pandas matplotlib pyyaml Command line usag

Anoop Lab 1 Jan 27, 2022
Genetic algorithm which evolves aoe2 DE ai scripts

AlphaScripter Use the power of genetic algorithms to evolve AI scripts for Age of Empires II : Definitive Edition. For now this package runs in AOC Us

6 Nov 04, 2022
BCI datasets and algorithms

Brainda Welcome! First and foremost, Welcome! Thank you for visiting the Brainda repository which was initially released at this repo and reorganized

52 Jan 04, 2023
A Python project for optimizing the 8 Queens Puzzle using the Genetic Algorithm implemented in PyGAD.

8QueensGenetic A Python project for optimizing the 8 Queens Puzzle using the Genetic Algorithm implemented in PyGAD. The project uses the Kivy cross-p

Ahmed Gad 16 Nov 13, 2022