Python module for performing linear regression for data with measurement errors and intrinsic scatter

Overview

Linear regression for data with measurement errors and intrinsic scatter (BCES)

Python module for performing robust linear regression on (X,Y) data points where both X and Y have measurement errors.

The fitting method is bivariate correlated errors and intrinsic scatter (BCES), following the description given in Akritas & Bershady 1996, ApJ. Some of the advantages of BCES regression over ordinary least squares fitting (quoted from Akritas & Bershady 1996):

  • it allows for measurement errors on both variables
  • it permits the measurement errors for the two variables to be dependent
  • it permits the magnitudes of the measurement errors to depend on the measurements
  • other "symmetric" lines such as the bisector and the orthogonal regression can be constructed.

In order to understand how to perform and interpret the regression results, please read the paper.
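As a rough illustration of the idea, the BCES(y|x) point estimate subtracts the measurement-error contributions from the sample covariance and variance before taking their ratio. The sketch below is not the package's code: the function name `bces_yx` is hypothetical, it returns only the slope and intercept (no uncertainties), and it assumes the y|x estimator as described in Akritas & Bershady 1996.

```python
import numpy as np

def bces_yx(x, xerr, y, yerr, cov):
    """Hypothetical helper: BCES(y|x) slope and intercept, point estimates only.

    yerr does not enter the y|x point estimate; it is kept only to mirror
    the (x, xerr, y, yerr, cov) argument order used in this README.
    """
    x, xerr, y, cov = (np.asarray(v, dtype=float) for v in (x, xerr, y, cov))
    dx = x - x.mean()
    dy = y - y.mean()
    # Remove the error covariance from the numerator and the
    # x error variance from the denominator.
    slope = (np.sum(dx * dy) - np.sum(cov)) / (np.sum(dx**2) - np.sum(xerr**2))
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# With zero errors the estimate reduces to ordinary least squares.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
zeros = np.zeros_like(x)
slope, intercept = bces_yx(x, zeros, y, zeros, zeros)
```

With all errors set to zero, as in the usage lines above, the result is the ordinary least squares line of the noise-free data.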

Installation

Using pip:

pip install bces

If that does not work, you can install it using the setup.py script:

python setup.py install

You may need to run the last command with sudo.

Alternatively, if you plan to modify the source, install the package with a symlink so that changes to the source files are immediately available:

python setup.py develop

Usage

import bces.bces as BCES
a, b, aerr, berr, covab = BCES.bcesp(x, xerr, y, yerr, cov)

Arguments:

  • x,y : 1D data arrays
  • xerr,yerr: measurement errors affecting x and y, 1D arrays
  • cov : covariance between the measurement errors, 1D array

If you have no reason to believe that your measurement errors are correlated (which is usually the case), you can provide an array of zeroes as input for cov:

cov = numpy.zeros_like(x)
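If the errors are correlated, one way to build cov is from a correlation coefficient between the x and y errors, since the covariance of two errors equals their correlation times the product of their standard deviations. The following is only a sketch: the value of `rho` and the error arrays are made-up placeholders, and in practice the correlation must come from your own knowledge of the measurement process.

```python
import numpy as np

# Placeholder measurement errors (illustrative values, not real data).
xerr = np.array([0.1, 0.2, 0.15])
yerr = np.array([0.3, 0.25, 0.2])

# Hypothetical correlation coefficient between the x and y errors.
# It may also be an array with one value per data point.
rho = 0.5

# Covariance per data point: cov_i = rho * sigma_x,i * sigma_y,i
cov = rho * xerr * yerr
```

The resulting array has the same shape as x and y, and its magnitude can never exceed xerr*yerr.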

Output:

  • a,b : best-fit parameters a,b of the linear regression such that y = ax + b.
  • aerr,berr : the standard deviations in a,b
  • covab : the covariance between a and b (e.g. for plotting confidence bands)

Each element of the arrays a, b, aerr, berr and covab corresponds to the result of one of the different BCES lines: y|x, x|y, bisector and orthogonal, as detailed in the table below. Please read the original BCES paper to understand what these different lines mean.

Element  Method      Description
0        y|x         Assumes x as the independent variable
1        x|y         Assumes y as the independent variable
2        bisector    Line that bisects the y|x and x|y lines. This approach is self-inconsistent; do not use this method (cf. Hogg, D. et al. 2010, arXiv:1008.4686).
3        orthogonal  Orthogonal least squares: line that minimizes orthogonal distances. Should be used when it is not clear which variable should be treated as the independent one.
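To avoid hard-coding indices, the element numbers from the table above can be wrapped in a small lookup. This is only a convenience sketch: `METHOD_INDEX` and `select_fit` are hypothetical names, not part of the package, and the arrays below are made-up placeholders standing in for the output of bcesp.

```python
import numpy as np

# Index of each BCES line in the output arrays, as listed in the table.
METHOD_INDEX = {"y|x": 0, "x|y": 1, "bisector": 2, "orthogonal": 3}

def select_fit(a, b, aerr, berr, covab, method="orthogonal"):
    """Hypothetical helper: pick the results of one BCES line by name."""
    i = METHOD_INDEX[method]
    return {"a": a[i], "b": b[i], "aerr": aerr[i], "berr": berr[i], "covab": covab[i]}

# Placeholder arrays with four elements each (not real fit results).
a = np.array([1.00, 1.10, 1.05, 1.08])
b = np.array([0.50, 0.40, 0.45, 0.42])
aerr = np.array([0.05, 0.06, 0.05, 0.05])
berr = np.array([0.10, 0.12, 0.11, 0.11])
covab = np.array([-0.004, -0.006, -0.005, -0.005])

fit = select_fit(a, b, aerr, berr, covab, method="orthogonal")
```

Here fit["a"] and fit["b"] hold the parameters of the orthogonal line only.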

By default, bcesp runs in parallel with bootstrapping.
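For readers unfamiliar with bootstrapping, the sketch below shows the general idea in serial form: resample the data points with replacement, refit each resample, and use the spread of the refitted slopes as the uncertainty. It is a generic nonparametric bootstrap written for illustration, assuming the y|x estimator of Akritas & Bershady 1996; the function names and the synthetic data are hypothetical, and the package's own implementation may differ in its details.

```python
import numpy as np

def yx_slope(x, xerr, y, cov):
    """Hypothetical helper: BCES(y|x) slope, point estimate only."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (np.sum(dx * dy) - np.sum(cov)) / (np.sum(dx**2) - np.sum(xerr**2))

def bootstrap_slope(x, xerr, y, cov, nsim=1000, seed=0):
    """Mean and standard deviation of the slope over nsim bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(nsim)
    for k in range(nsim):
        idx = rng.integers(0, n, size=n)  # draw n points with replacement
        slopes[k] = yx_slope(x[idx], xerr[idx], y[idx], cov[idx])
    return slopes.mean(), slopes.std(ddof=1)

# Synthetic data around y = 2x + 1 (made up for this example).
rng = np.random.default_rng(42)
x_true = np.linspace(0.0, 10.0, 50)
xerr = np.full_like(x_true, 0.1)
yerr = np.full_like(x_true, 0.2)
x = x_true + rng.normal(0.0, xerr)
y = 2.0 * x_true + 1.0 + rng.normal(0.0, yerr)
cov = np.zeros_like(x)

slope_mean, slope_std = bootstrap_slope(x, xerr, y, cov, nsim=500)
```

Because each resample is independent of the others, the loop is easy to distribute across processes.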

Examples

bces-example.ipynb is a jupyter notebook including a practical, step-by-step example of how to use BCES to perform regression on data with uncertainties on x and y. It also illustrates how to plot the confidence band for a fit.

If you have suggestions for more examples, feel free to add them.

Running Tests

To test your installation, run the following command inside the BCES directory:

pytest -v

Requirements

See requirements.txt.

Citation

If you end up using this code in your paper, you are morally obliged to cite the following works

I spent considerable time writing this code, making sure it is correct and user-friendly, so I would appreciate your citation of the second paper in the above list as a token of gratitude.

If you are really happy with the code, you can buy me a beer.

Misc.

This Python module is inspired by the (much faster) Fortran routine originally written by Akritas et al. I wrote it because I wanted something more portable and easier to use, at the cost of speed.

For a general tutorial on how to (and how not to) perform linear regression, please read this paper: Hogg, D. et al. 2010, arXiv:1008.4686. In particular, please refrain from using the bisector method.

If you want to plot confidence bands for your fits, have a look at the nmmn package (in particular, the modules nmmn.plots.fitconf and stats).
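As a lightweight alternative, an approximate band can be computed directly from the fit outputs with first-order error propagation: for y = ax + b, the variance of the fitted line at x is x²·aerr² + berr² + 2x·covab. The sketch below is not the method used by nmmn; `fit_band` is a hypothetical name and the parameter values are made-up placeholders for one element of the bcesp output.

```python
import numpy as np

def fit_band(x, a, b, aerr, berr, covab, nsigma=1.0):
    """Hypothetical helper: lower/upper band around y = a*x + b.

    Uses first-order error propagation, assuming covab is the covariance
    between the slope a and the intercept b.
    """
    x = np.asarray(x, dtype=float)
    y = a * x + b
    var = (x * aerr) ** 2 + berr**2 + 2.0 * x * covab
    sigma = np.sqrt(np.clip(var, 0.0, None))  # guard against tiny negative values
    return y - nsigma * sigma, y + nsigma * sigma

# Placeholder fit parameters (not real results).
xgrid = np.array([0.0, 1.0, 2.0])
lower, upper = fit_band(xgrid, a=2.0, b=1.0, aerr=0.1, berr=0.2, covab=-0.01)
```

The two returned arrays can be passed to a plotting routine such as matplotlib's fill_between to shade the band.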

Bayesian linear regression

There are a couple of Bayesian approaches to perform linear regression which can be more powerful than BCES, some of which are described below.

A Gibbs Sampler for Multivariate Linear Regression: R code, arXiv:1509.00908. Linear regression in the fairly general case of errors in both X and Y, which may be correlated, plus intrinsic scatter. The prior distribution of covariates is modeled by a flexible mixture of Gaussians. This is an extension of the very nice work by Brandon Kelly (Kelly, B. 2007, ApJ).

LIRA: A Bayesian approach to linear regression in astronomy: R code, arXiv:1509.05778. Bayesian hierarchical modelling of data with heteroscedastic and possibly correlated measurement errors and intrinsic scatter. The method fully accounts for time evolution. The slope, the normalization, and the intrinsic scatter of the relation can evolve with redshift. The intrinsic distribution of the independent variable is approximated using a mixture of Gaussian distributions whose means and standard deviations depend on time. The method can address scatter in the measured independent variable (a kind of Eddington bias), selection effects in the response variable (Malmquist bias), and departure from linearity in the form of a knee.

AstroML: Machine Learning and Data Mining for Astronomy. Python example of a linear fit to data with correlated errors in x and y using AstroML. In the literature, this is often referred to as total least squares or errors-in-variables fitting.

Todo

If you have improvements to the code, suggestions for examples, ways to speed up the code, etc., feel free to submit a pull request.

  • implement weighted least squares (WLS)
  • implement unit testing: bces
  • unit testing: bootstrap

Visit the author's web page and/or follow him on twitter (@nemmen).


Copyright (c) 2021, Rodrigo Nemmen. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
