Statistical package in Python based on Pandas

Overview

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. For a full list of available functions, please refer to the API documentation.

  1. ANOVAs: N-way, repeated measures, mixed, ANCOVA
  2. Pairwise post-hoc tests (parametric and non-parametric) and pairwise correlations
  3. Robust, partial, distance and repeated measures correlations
  4. Linear/logistic regression and mediation analysis
  5. Bayes Factors
  6. Multivariate tests
  7. Reliability and consistency
  8. Effect sizes and power analysis
  9. Parametric/bootstrapped confidence intervals around an effect size or a correlation coefficient
  10. Circular statistics
  11. Chi-squared tests
  12. Plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation...

Pingouin is designed for users who want simple yet exhaustive statistical functions.

For example, the ttest_ind function of SciPy returns only the T-value and the p-value. By contrast, the ttest function of Pingouin returns the T-value, the p-value, the degrees of freedom, the effect size (Cohen's d), the 95% confidence intervals of the difference in means, the statistical power and the Bayes Factor (BF10) of the test.
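
To make the difference concrete, the sketch below computes three of the extra quantities listed above (T-value, degrees of freedom and Cohen's d) by hand, using only the Python standard library. It assumes two independent samples of equal size and a pooled standard deviation. The sample values are made up for illustration, and this is not Pingouin's implementation.

```python
import math
import statistics

# Hypothetical samples (illustration only), equal group sizes
x = [4.1, 3.8, 5.2, 4.7, 3.9, 4.4]
y = [5.0, 5.6, 4.9, 6.1, 5.3, 5.8]
nx, ny = len(x), len(y)

# Pooled standard deviation (Student's t-test, equal variances assumed)
pooled_var = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
pooled_sd = math.sqrt(pooled_var)

mean_diff = statistics.mean(x) - statistics.mean(y)
t_value = mean_diff / (pooled_sd * math.sqrt(1 / nx + 1 / ny))
dof = nx + ny - 2
cohen_d = mean_diff / pooled_sd  # signed here; some tools report |d|

print(f"T = {t_value:.3f}, dof = {dof}, Cohen's d = {cohen_d:.3f}")
```

With equal group sizes n, the two statistics are linked by T = d * sqrt(n / 2), which is a quick way to sanity-check a reported effect size against its T-value.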

Documentation

Chat

If you have questions, please ask them in the public Gitter chat

Installation

Dependencies

The main dependencies of Pingouin are:

In addition, some functions require:

Pingouin is a Python 3 package and is currently tested for Python 3.6 and 3.7. Pingouin does not work with Python 2.7.

User installation

Pingouin can be easily installed using pip

pip install pingouin

or conda

conda install -c conda-forge pingouin

New releases are frequent so always make sure that you have the latest version:

pip install --upgrade pingouin

Quick start

Click on the link below and navigate to the notebooks/ folder to run a collection of interactive Jupyter notebooks showing the main features of Pingouin. There is no need to install Pingouin beforehand: the notebooks run in a Binder environment.

10 minutes to Pingouin

1. T-test

import numpy as np
import pingouin as pg

np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# T-test
pg.ttest(x, y)
Output
T       dof  tail       p-val  CI95%          cohen-d  BF10    power
-3.401  58   two-sided  0.001  [-1.68 -0.43]  0.878    26.155  0.917

2. Pearson's correlation

pg.corr(x, y)
Output
n   r      CI95%       r2     adj_r2  p-val  BF10    power
30  0.595  [0.3 0.79]  0.354  0.306   0.001  69.723  0.95
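
As a rough check of the CI95%, r2 and adj_r2 columns, the sketch below recomputes them from the rounded r and n shown above. It assumes the confidence interval comes from the Fisher z-transform and that the adjusted r-squared uses n - 3 in the denominator. Both are assumptions chosen because they reproduce the reported values, not a statement about Pingouin's internals.

```python
import math
from statistics import NormalDist

r, n = 0.595, 30  # values reported in the output above

# Fisher z-transform, with standard error 1 / sqrt(n - 3)
z = math.atanh(r)
se = 1 / math.sqrt(n - 3)
crit = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

ci_low = math.tanh(z - crit * se)
ci_high = math.tanh(z + crit * se)

# Adjusted r-squared, using n - 3 in the denominator (assumption)
r2 = r ** 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 3)

print(round(ci_low, 2), round(ci_high, 2))  # 0.3 0.79
print(round(r2, 3), round(adj_r2, 3))       # 0.354 0.306
```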

3. Robust correlation

# Introduce an outlier
x[5] = 18
# Use the robust Shepherd's pi correlation
pg.corr(x, y, method="shepherd")
Output
n   outliers  r      CI95%        r2     adj_r2  p-val  power
30  1         0.561  [0.25 0.77]  0.315  0.264   0.002  0.917

4. Test the normality of the data

The pingouin.normality function works with lists, arrays, or a pandas DataFrame in wide or long format.

print(pg.normality(x))                                    # Univariate normality
print(pg.multivariate_normality(np.column_stack((x, y)))) # Multivariate normality
Output
W      pval   normal
0.615  0.000  False
(False, 0.00018)

5. One-way ANOVA using a pandas DataFrame

# Read an example dataset
df = pg.read_dataset('mixed_anova')

# Run the ANOVA
aov = pg.anova(data=df, dv='Scores', between='Group', detailed=True)
print(aov)
Output
Source  SS       DF   MS     F      p-unc  np2
Group   5.460    1    5.460  5.244  0.023  0.029
Within  185.343  178  1.041  nan    nan    nan
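
The columns of this table are related by simple arithmetic. The sketch below recomputes MS, F and np2 (partial eta-squared) from the rounded SS and DF values shown above. It is only an illustration of how the columns fit together, assuming the usual definitions MS = SS / DF and np2 = SS_effect / (SS_effect + SS_error).

```python
# Values copied from the one-way ANOVA table above
ss_between, df_between = 5.460, 1
ss_within, df_within = 185.343, 178

# Mean squares: sum of squares divided by degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F statistic: ratio of the two mean squares
f_value = ms_between / ms_within

# Partial eta-squared: effect SS over effect SS plus error SS
np2 = ss_between / (ss_between + ss_within)

print(round(ms_within, 3), round(f_value, 3), round(np2, 3))  # 1.041 5.244 0.029
```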

6. Repeated measures ANOVA

pg.rm_anova(data=df, dv='Scores', within='Time', subject='Subject', detailed=True)
Output
Source  SS       DF   MS     F      p-unc  np2    eps
Time    7.628    2    3.814  3.913  0.023  0.062  0.999
Error   115.027  118  0.975  nan    nan    nan    nan

7. Post-hoc tests corrected for multiple-comparisons

# FDR-corrected post hocs with Hedges' g effect size
posthoc = pg.pairwise_ttests(data=df, dv='Scores', within='Time', subject='Subject',
                             parametric=True, padjust='fdr_bh', effsize='hedges')

# Pretty printing of table
pg.print_table(posthoc, floatfmt='.3f')
Output
Contrast  A        B        Paired  Parametric  T       dof     Tail       p-unc  p-corr  p-adjust  BF10   hedges
Time      August   January  True    True        -1.740  59.000  two-sided  0.087  0.131   fdr_bh    0.582  -0.328
Time      August   June     True    True        -2.743  59.000  two-sided  0.008  0.024   fdr_bh    4.232  -0.485
Time      January  June     True    True        -1.024  59.000  two-sided  0.310  0.310   fdr_bh    0.232  -0.170
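
To show what the 'fdr_bh' adjustment does to the p-unc column, here is a minimal pure-Python sketch of the Benjamini-Hochberg procedure. The helper name `fdr_bh` is chosen for this illustration and is not a Pingouin function. Because the inputs are the rounded p-values from the table, the results agree with the p-corr column only up to rounding.

```python
def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_unc = [0.087, 0.008, 0.310]  # p-unc column from the table above
p_corr = fdr_bh(p_unc)
print([round(p, 4) for p in p_corr])  # [0.1305, 0.024, 0.31]
```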

8. Two-way mixed ANOVA

# Compute the two-way mixed ANOVA
aov = pg.mixed_anova(data=df, dv='Scores', between='Group', within='Time',
                     subject='Subject', correction=False, effsize="np2")
pg.print_table(aov)
Output
Source       SS     DF1  DF2  MS     F      p-unc  np2    eps
Group        5.460  1    58   5.460  5.052  0.028  0.080  nan
Time         7.628  2    116  3.814  4.027  0.020  0.065  0.999
Interaction  5.167  2    116  2.584  2.728  0.070  0.045  nan

9. Pairwise correlations between columns of a dataframe

import pandas as pd
np.random.seed(123)
z = np.random.normal(5, 1, 30)
data = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
pg.pairwise_corr(data, columns=['X', 'Y', 'Z'], method='pearson')
Output
X  Y  method   tail       n   r      CI95%         r2     adj_r2  z      p-unc  BF10   power
X  Y  pearson  two-sided  30  0.366  [0.01 0.64]   0.134  0.070   0.384  0.047  1.500  0.525
X  Z  pearson  two-sided  30  0.251  [-0.12 0.56]  0.063  -0.006  0.257  0.181  0.534  0.272
Y  Z  pearson  two-sided  30  0.020  [-0.34 0.38]  0.000  -0.074  0.020  0.916  0.228  0.051

10. Convert between effect sizes

# Convert from Cohen's d to Hedges' g
pg.convert_effsize(0.4, 'cohen', 'hedges', nx=10, ny=12)
0.384
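
For reference, the conversion can be reproduced with the commonly used approximate small-sample correction g = d * (1 - 3 / (4 * (nx + ny) - 9)). The sketch below assumes this is the correction being applied. The helper name is hypothetical and not part of Pingouin.

```python
def cohen_d_to_hedges_g(d, nx, ny):
    """Apply the approximate small-sample bias correction to Cohen's d."""
    return d * (1 - 3 / (4 * (nx + ny) - 9))


g = cohen_d_to_hedges_g(0.4, nx=10, ny=12)
print(round(g, 4))  # 0.3848
```

The result is consistent with the value shown above, which appears to be cut to three decimals.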

11. Multiple linear regression

pg.linear_regression(data[['X', 'Z']], data['Y'])
Linear regression summary
names      coef    se     T       pval   r2     adj_r2  CI[2.5%]  CI[97.5%]
Intercept  4.650   0.841  5.530   0.000  0.139  0.076   2.925     6.376
X          0.143   0.068  2.089   0.046  0.139  0.076   0.003     0.283
Z          -0.069  0.167  -0.416  0.681  0.139  0.076   -0.412    0.273

12. Mediation analysis

pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)
Mediation summary
path      coef    se     pval   CI[2.5%]  CI[97.5%]  sig
Z ~ X     0.103   0.075  0.181  -0.051    0.256      No
Y ~ Z     0.018   0.171  0.916  -0.332    0.369      No
Total     0.136   0.065  0.047  0.002     0.269      Yes
Direct    0.143   0.068  0.046  0.003     0.283      Yes
Indirect  -0.007  0.025  0.898  -0.069    0.029      No
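
The rows of this summary can be cross-checked against the multiple regression in example 11. The sketch below assumes the standard product-of-coefficients view of mediation, in which the indirect effect is a * b and the total effect is the sum of the direct and indirect effects. The numbers are the rounded coefficients from the two tables, so the agreement is only approximate.

```python
# Coefficients copied from the regression and mediation tables above
a = 0.103        # path Z ~ X
b = -0.069       # coefficient of Z in the regression Y ~ X + Z
direct = 0.143   # coefficient of X in the regression Y ~ X + Z
total = 0.136    # coefficient of X in the regression Y ~ X

# Product-of-coefficients estimate of the indirect effect
indirect = a * b
print(round(indirect, 3))           # -0.007
print(round(direct + indirect, 3))  # 0.136
```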

13. Contingency analysis

data = pg.read_dataset('chi2_independence')
expected, observed, stats = pg.chi2_independence(data, x='sex', y='target')
stats
Chi-squared tests summary
test                lambda  chi2    dof    p      cramer  power
pearson             1.000   22.717  1.000  0.000  0.274   0.997
cressie-read        0.667   22.931  1.000  0.000  0.275   0.998
log-likelihood      0.000   23.557  1.000  0.000  0.279   0.998
freeman-tukey       -0.500  24.220  1.000  0.000  0.283   0.998
mod-log-likelihood  -1.000  25.071  1.000  0.000  0.288   0.999
neyman              -2.000  27.458  1.000  0.000  0.301   0.999

Integration with Pandas

Several functions of Pingouin can be used directly as pandas DataFrame methods. Try for yourself with the code below:

import pingouin as pg

# Example 1 | ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)

# Example 2 | Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])

# Example 3 | Partial correlation matrix
data.pcorr()

The functions that are currently supported as pandas methods are:

Development

Pingouin was created and is maintained by Raphael Vallat, mostly during his spare time. Contributions are more than welcome so feel free to contact me, open an issue or submit a pull request!

To see the code or report a bug, please visit the GitHub repository.

Note that this program is provided with NO WARRANTY OF ANY KIND. If you can, always double check the results with another statistical software.

Contributors

How to cite Pingouin?

If you want to cite Pingouin, please use the publication in JOSS:

Acknowledgement

Several functions of Pingouin were inspired by R or Matlab toolboxes, including:

Comments
  • ValueError with large dataset

    I have a large dataset of 500k rows and 74 columns. Whenever I try to use a pairwise partial correlation I get the following error:

    ----> 1 data2.pairwise_corr(covar = 'SEX')
    
    ~/.local/lib/python3.7/site-packages/pandas_flavor/register.py in __call__(self, *args, **kwargs)
         27             @wraps(method)
         28             def __call__(self, *args, **kwargs):
    ---> 29                 return method(self._obj, *args, **kwargs)
         30 
         31         register_dataframe_accessor(method.__name__)(AccessorMethod)
    
    ~/.local/lib/python3.7/site-packages/pingouin/pairwise.py in pairwise_corr(data, columns, covar, tail, method, padjust, nan_policy)
       1229         else:
       1230             cor_st = partial_corr(data=data, x=col1, y=col2, covar=covar,
    -> 1231                                   tail=tail, method=method)
       1232         cor_st_keys = cor_st.columns.tolist()
       1233 
    
    ~/.local/lib/python3.7/site-packages/pingouin/correlation.py in partial_corr(data, x, y, covar, x_covar, y_covar, tail, method, **kwargs)
        783         # PARTIAL CORRELATION
        784         cvar = np.atleast_2d(C[covar].to_numpy())
    --> 785         beta_x = np.linalg.lstsq(cvar, C[x].to_numpy(), rcond=None)[0]
        786         beta_y = np.linalg.lstsq(cvar, C[y].to_numpy(), rcond=None)[0]
        787         res_x = C[x].to_numpy() - cvar @ beta_x
    
    <__array_function__ internals> in lstsq(*args, **kwargs)
    
    /curc/sw/anaconda3/2019.07/envs/jupyterlab2/lib/python3.7/site-packages/numpy/linalg/linalg.py in lstsq(a, b, rcond)
       2257         # lapack can't handle n_rhs = 0 - so allocate the array one larger in that axis
       2258         b = zeros(b.shape[:-2] + (m, n_rhs + 1), dtype=b.dtype)
    -> 2259     x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)
       2260     if m == 0:
       2261         x[...] = 0
    
    ValueError: On entry to DLASCL parameter number 4 had an illegal value
    

    This is even with 12 cores running, is there any way to resolve this or is the pairwise correlation with a covariate just not compatible with a large dataset? pcorr() and pairwise_corr() work as methods on the dataframe.

    Labels: feature request, invalid, question
    opened by jackransomlovell 23
  • Unexpected results in semi-partial correlation with Spearman/Kendall

    As a test, I generated correlated variables x and y. I then set x_test to x and ran a semi-partial correlation of x_test against y, controlling for the effect of x on y. As one would expect, the Pearson semi-partial correlation is zero. However, when applying Spearman and Kendall, the result is non-zero.

    The Spearman and Kendall result was 0.14 and 0.11 respectively. Although it is low using the following dummy dataset, the data I am working with produces a very high semi-partial correlation around 0.8 which does not make sense.

    import pandas as pd
    from pingouin import partial_corr
    
    df = pd.DataFrame({"y": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                       "x": [1, 3, 5, 8, 7, 10, 13, 14, 18, 21]})
    print(df.corr())
    df['x_test'] = df['x']
    partial_corr(data=df, x='x_test', y='y', y_covar=['x'],
                 method='spearman')['r'].values
    
    Out:
    
              y         x
    y  1.000000  0.985318
    x  0.985318  1.000000
    
    array([0.13939394])
    
    Labels: invalid
    opened by nrcjea001 20
  • Shapiro-Wilk test should have option to return W statistic

    Currently pingouin.normality() silently discards the W statistic returned by scipy.stats.shapiro(). I would love to have an option to retrieve these stats, as I need to present them together with the p-values when drafting a scientific publication. I wonder whether the function could be adjusted to return a DataFrame with more information, like most of the other pingouin functions?

    Labels: feature request
    opened by hoechenberger 16
  • Add axis title to plot_shift

    The entire library is absolutely amazing; thank you for developing it. I came up with a minor improvement for the "Shift plot". No matter what data array we provide, this graphic shows X and Y. There may be disagreement about its use in academic papers. I believed that by defining X and Y, this conflict would be addressed. I created an edit for this where the user can configure the X and Y variables as well as the plot title. The labels being on the Y axis creates problems with the size of the figure, thus I located them as a legend in the plot.

    I have ideas for other modules and would like to add them when I have the time. My initial contribution would be adding the odds ratio and CI-range values to the chi-sqr module.

    Here is my test code: plotting_test.zip

    opened by turkalpmd 15
  • multiple regression: catch rank deficient design matrices and warn

    closes #130

    For now I opted to only return the coefficients in case a rank deficient design matrix is detected.

    The basic error of pingouin is its inability to handle the residues return parameter of lstsq whenever the design matrix is rank deficient. In that case, residues is an empty array array([], dtype=float64) instead of a float.
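
The empty-residues behaviour is easy to reproduce in isolation. The sketch below uses NumPy's `np.linalg.lstsq` rather than the SciPy routine mentioned in this PR, on a small made-up design matrix whose third column is a multiple of the second. It illustrates the symptom only and is not the patch itself.

```python
import numpy as np

# Design matrix whose third column is exactly twice the second (rank 2, not 3)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

coef, residues, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank, residues.size)  # rank is below X.shape[1] and residues is empty

# Fallback: compute the residual sum of squares from the fitted values
ss_res = ((y - X @ coef) ** 2).sum()
print(ss_res)
```

Any code that expects `residues` to hold a single float therefore needs a fallback like the last two lines whenever `rank < X.shape[1]`.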

    One option apart from warning and simply returning the coefficients only could be to "revive" the currently commented line from line 413(old) / 422(new): # ss_res = (resid ** 2).sum()

    Something like:

        # FIT (WEIGHTED) LEAST SQUARES REGRESSION USING SCIPY.LINALG.LSTSQ
        coef, ss_res, rank, _ = lstsq(Xw, yw)
        if coef_only:
            return coef
        calc_ss_res = False
        if rank < Xw.shape[1]:
            warnings.warn('Design matrix supplied with `X` parameter is rank '
                          f'deficient (rank {rank} with {Xw.shape[1]} columns). '
                          'That means that one or more of the columns in `X` '
                          'are a linear combination of one or more of the '
                          'other columns.')
            calc_ss_res = True
    
        # Degrees of freedom
        df_model = rank - constant
        df_resid = n - p
        # Calculate predicted values and (weighted) residuals
        pred = Xw @ coef
        resid = yw - pred
        if calc_ss_res:
            ss_res = (resid ** 2).sum()
    

    I played around with that, and it seems to be producing the same results as statsmodels. Try for yourself if you'd like:

    import numpy as np 
    import pingouin 
    import statsmodels.api as sm 
    
    # make up some random data and a design matrix for multiple linear regression 
    n = 100  
    y = np.random.randn(n) 
    X = np.vstack([np.random.permutation([0, 1, 0, 0, 0]) for i in range(n)]) 
    
    # try with pingouin (use patch from code example above)
    results_pingouin = pingouin.linear_regression(X, y, add_intercept=True) 
    print(results_pingouin)
    
    # try with statsmodels
    X_with_intercept = sm.add_constant(X) 
    model = sm.OLS(endog=y, exog=X_with_intercept) 
    results_sm = model.fit() 
    print(results_sm.summary())
    
    # compare residuals
    np.allclose(results_pingouin.residuals_, results_sm.resid)  # True
    np.allclose(results_pingouin.coef.to_numpy(), results_sm.params)  # True
    

    Let me know what you think :-)

    Labels: feature request
    opened by sappelhoff 14
  • pairwise_corr doc updated and extended functionality to automatically control for all remaining covariates

    I updated the doc of pairwise_corr to reflect more clearly that a partial correlation is only calculated if covariates are specified.

    Furthermore I implemented that combinations of columns, which are also present in covars, will not be dropped. Instead the x and y columns will be removed from covar only for this specific combination. This allows for automatically controlling for all other columns in the dataframe while calculating the pairwise partial correlation for all combinations.

    If this means that there are no covars for a combination, an empty list will be returned for stats['covar'] and, via partial_corr line 744, a standard non-partial corr will be returned. Is this clear enough for the user or should "empty" covar rows be dropped instead?

    Closes #123

    Some basic examples of the functionality (tests are included in the respective test file):

    import numpy as np
    import pandas as pd
    
    import pingouin as pg
    
    df = pd.DataFrame(
        data=np.random.rand(200, 4), columns=['a', 'b', 'c', 'd'])
    df['e'] = df['a']**2
    df['f'] = df['a'] * df['b']
    df['g'] = df['c'] + np.random.randint(0, 5, 200)
    df['h'] = np.sqrt(df['d'] + 5) - df['b']**3
    
    # Get pairwise corr for all available covars per combination, also try nan policy
    pwc = pg.pairwise_corr(data=df, covar=df.columns)
    pwc_lwnan = pg.pairwise_corr(data=df, covar=df.columns, nan_policy='listwise')
    # assert that both are equal
    assert (
        pwc.drop(columns=['CI95%']) == pwc_lwnan.drop(columns=['CI95%'])
    ).all().all()
    
    # Now the same with a MultiIndex:
    mi = pd.MultiIndex.from_product((
        ('a', 'b', 'c', 'd'), ('u', 'v'),
    ))
    df_mi = df.copy()
    df_mi.columns = mi
    
    pwc_mi = pg.pairwise_corr(data=df_mi, covar=df_mi.columns)
    # Assert that the results are equal to non-multiindex tests
    assert (
        pwc_mi.drop(
            columns=['X', 'Y', 'covar', 'CI95%']) == pwc.drop(
                columns=['X', 'Y', 'covar', 'CI95%'])
    ).all().all()
    # and compare to nan policy listwise
    pwc_mi_lwnan = pg.pairwise_corr(
        data=df_mi, covar=df_mi.columns, nan_policy='listwise'
    )
    assert (
        pwc_mi.drop(
            columns=['X', 'Y', 'covar', 'CI95%']) == pwc_mi_lwnan.drop(
                columns=['X', 'Y', 'covar', 'CI95%'])
    ).all().all()
    

    And to show what I mean with "empty covars" (taken from test_pairwise.py):

    import numpy as np
    from pingouin import pairwise_corr, read_dataset
    
    data = read_dataset('pairwise_corr').iloc[:, 1:]
    n = data.shape[0]
    data['Age'] = np.random.randint(18, 65, n)
    data['IQ'] = np.random.normal(105, 1, n)
    data['One'] = 1
    data['Gender'] = np.repeat(['M', 'F'], int(n / 2))
    
    # see row 4 for empty covar
    pwc_empty = pairwise_corr(data, columns='Neuroticism', covar='Age')
    print(pwc_empty.covar)
    # or fully empty covar
    pwc_fully_empty = pairwise_corr(data, columns=['Neuroticism', 'Age'], covar='Age')
    
    Labels: feature request, docs/testing
    opened by JoElfner 13
  • Assertion failure (-1 <= r <= 1) in power.py when running rm_corr

    Hi there, Thanks for providing this awesome library. I picked it up to run an rm_corr test on some data but I'm having issues.

    I have 1000 individuals for each of whom I have 8 paired measurements. I have uploaded a pickled dataframe that causes the issue to GDrive (https://drive.google.com/open?id=1gkvGKBth2J8EhmhblsWRdK-eCDlk8zJk).

    Here's a snapshot of what the dataframe looks like: image

    You can reproduce the issue by running:

    import pandas as pd
    from pingouin import rm_corr
    df = pd.read_pickle('bad_df.pkl')
    rm_corr(data=df, x='m2', y='m1', subject='id')
    

    which yields the error

    AssertionError                            Traceback (most recent call last)
    <ipython-input-156-86b16d57ef4b> in <module>
    ----> 1 rm_corr(data=df, x='m2', y='m1', subject='id')
    
    .../lib/python3.7/site-packages/pingouin/correlation.py in rm_corr(data, x, y, subject, tail)
        981     pval = pval * 0.5 if tail == 'one-sided' else pval
        982     ci = compute_esci(stat=rm, nx=n, eftype='pearson').tolist()
    --> 983     pwr = power_corr(r=rm, n=n, tail=tail)
        984     # Convert to Dataframe
        985     stats = pd.DataFrame({"r": round(rm, 3), "dof": int(dof),
    
    .../lib/python3.7/site-packages/pingouin/power.py in power_corr(r, n, power, alpha, tail)
        902     # Safety checks
        903     if r is not None:
    --> 904         assert -1 <= r <= 1
        905         r = abs(r)
        906     if alpha is not None:
    
    

    If I drop into a debugger I can see that r = 1.0000184067367273, perhaps this is due to numerical stability issues?
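
If the overshoot is indeed only floating-point error, one possible workaround is to clamp the coefficient to the valid range before it reaches the assertion. The sketch below shows the idea in plain Python. It is an assumption about how the symptom could be handled, not necessarily the fix adopted in the library.

```python
r = 1.0000184067367273  # value observed in the debugger above

# Clamp to the valid correlation range before any downstream check
r_clipped = max(-1.0, min(1.0, r))

assert -1 <= r_clipped <= 1
print(r_clipped)  # 1.0
```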

    Labels: feature request, invalid
    opened by willprice 13
  • Partial correlations out of range when variance = 0

    The following reproducer:

    from pandas import DataFrame
    from random import seed, sample
    
    import pingouin as pg
    
    seed(0, 1)
    sample_size = 100
    random_columns = 60
    
    n = [0.0000001 for _ in range(sample_size)]
    df = DataFrame(dict(
        n1b=n, n2b=n, n3b=n,
        **{
            f'x{i}': sample(range(2000), sample_size)
            for i in range(random_columns)
        }
    ))
    
    pg.pcorr(df).max().max()
    

    gives a value of 6.754645467173336 with pingouin 0.4.0. I encountered this behaviour (giving out of range coefficients) in the wild when I accidentally forgot to remove variables with no variance.

    To ensure that incorrect results like that don't get published, I would:

    • add a check on the data verifying that var() != 0 prior to running computations; not sure if this should raise or warn
    • add a check on the results; if any of the coefficients are out of range it would warn and provide name of the column those come from and a helpful message like "your data may have close-to-zero variance variables or be otherwise ill-conditioned".

    It might be also worth investigating if there is a way to workaround these numerical issues, but the two points above should address the most urgent worries.
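
The first check suggested above could look roughly like the following sketch, written with the standard library only. The function name `find_low_variance_columns`, the dict-of-lists input and the `tol` default are hypothetical choices for illustration. A real check would operate on the DataFrame passed to pcorr.

```python
import statistics
import warnings


def find_low_variance_columns(columns, tol=1e-12):
    """Return names of columns whose population variance is at most `tol`.

    `columns` maps a column name to a sequence of numbers.
    """
    return [name for name, values in columns.items()
            if statistics.pvariance(values) <= tol]


data = {
    "n1b": [0.0000001] * 100,  # constant column, as in the reproducer
    "x0": list(range(100)),    # column with real variance
}

flagged = find_low_variance_columns(data)
if flagged:
    warnings.warn(f"Near-zero variance columns: {flagged}; "
                  "correlation coefficients may be unreliable.")
```

Whether such a check should warn or raise is the open question from the first bullet point. The sketch warns so that existing pipelines keep running.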

    Labels: invalid, URGENT
    opened by krassowski 12
  • Skipped correlation gives different results than the Pernet implementation in Matlab

    I think this line in correlation.py:

    dis[i, :] = np.linalg.norm(B * B2[i, :] / bot[i], axis=1)
    

    is functionally different compared to the Matlab implementation and the equation in Wilcox, R. (2004)

    Labels: invalid
    opened by adamnarai 11
  • adding option to return stats as float, rather than str

    This PR is a work in progress.

    I'm not sure whether there is a general guideline about returning statistics as formatted strings, rather than as float, but I find it very useful to use the raw float values, rather than get strings. This PR adds this functionality, at least to the bayesian module (not sure what the situation is in other modules).

    This PR leaves the default behaviour of returning strings untouched. It could be argued that returning the raw float values should be default, and formatting left to the user in general (or some utility function perhaps). I'm interested to hear thoughts on this as well.

    If you think in principle this PR is a good idea, I'll update and add proper documentation etc.

    Labels: feature request
    opened by Spaak 11
  • TestParametric::test_pandas fails with dtype mismatch on 32-bit platforms

    I’m helping to update the python-pingouin package in Fedora Linux. One test, TestParametric::test_pandas, is failing on 32-bit platforms (i686 and armv7hl). I can’t easily tell if the actual problem here is in pingouin, pandas, or somewhere else (scipy, pandas-flavor, …).

    =================================== FAILURES ===================================
    __________________________ TestParametric.test_pandas __________________________
    self = <pingouin.tests.test_pandas.TestParametric testMethod=test_pandas>
        def test_pandas(self):
            """Test pandas method.
            """
            # Test the ANOVA (Pandas)
            aov = df.anova(dv='Scores', between='Group', detailed=True)
            assert aov.equals(pg.anova(dv='Scores', between='Group', detailed=True,
                                       data=df))
            aov3_ss1 = df_aov3.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                     ss_type=1)
            aov3_ss2 = df_aov3.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                     ss_type=2)
            aov3_ss2_pg = pg.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                   data=df_aov3, ss_type=2)
            assert not aov3_ss1.equals(aov3_ss2)
            assert aov3_ss2.round(3).equals(aov3_ss2_pg.round(3))
        
            # Test the Welch ANOVA (Pandas)
            aov = df.welch_anova(dv='Scores', between='Group')
            assert aov.equals(pg.welch_anova(dv='Scores', between='Group',
                                             data=df))
        
            # Test the ANCOVA
            aov = df_anc.ancova(dv='Scores', covar='Income',
                                between='Method').round(3)
            assert aov.equals(pg.ancova(data=df_anc, dv='Scores', covar='Income',
                              between='Method').round(3))
        
            # Test the repeated measures ANOVA (Pandas)
            aov = df.rm_anova(dv='Scores', within='Time', subject='Subject',
                              detailed=True)
            assert aov.equals(pg.rm_anova(dv='Scores', within='Time',
                                          subject='Subject',
                                          detailed=True, data=df))
        
            # FDR-corrected post hocs with Hedges'g effect size
            ttests = df.pairwise_ttests(dv='Scores', within='Time',
                                        subject='Subject', padjust='fdr_bh',
                                        effsize='hedges')
            assert ttests.equals(pg.pairwise_ttests(dv='Scores', within='Time',
                                                    subject='Subject',
                                                    padjust='fdr_bh',
                                                    effsize='hedges', data=df))
        
            # Pairwise Tukey
            tukey = df.pairwise_tukey(dv='Scores', between='Group')
            assert tukey.equals(pg.pairwise_tukey(data=df, dv='Scores',
                                                  between='Group'))
        
            # Test two-way mixed ANOVA
            aov = df.mixed_anova(dv='Scores', between='Group', within='Time',
                                 subject='Subject', correction=False)
            assert aov.equals(pg.mixed_anova(dv='Scores', between='Group',
                                             within='Time',
                                             subject='Subject', correction=False,
                                             data=df))
        
            # Test parwise correlations
            corrs = data.pairwise_corr(columns=['X', 'M', 'Y'], method='spearman')
            corrs2 = pg.pairwise_corr(data=data, columns=['X', 'M', 'Y'],
                                      method='spearman')
            assert corrs['r'].equals(corrs2['r'])
        
            # Test partial correlation
            corrs = data.partial_corr(x='X', y='Y', covar='M', method='spearman')
            corrs2 = pg.partial_corr(x='X', y='Y', covar='M', method='spearman',
                                     data=data)
            assert corrs['r'].equals(corrs2['r'])
        
            # Test partial correlation matrix (compare with the ppcor package)
            corrs = data.iloc[:, :5].pcorr().round(3)
            np.testing.assert_array_equal(corrs.iloc[0, :].to_numpy(),
                                          [1, 0.392, 0.06, -0.014, -0.149])
            # Now compare against Pingouin's own partial_corr function
            corrs = data[['X', 'Y', 'M']].pcorr()
            corrs2 = data.partial_corr(x='X', y='Y', covar='M')
            assert np.isclose(corrs.at['X', 'Y'], corrs2.at['pearson', 'r'])
        
            # Test rcorr (correlation matrix with p-values)
            # We compare against Pingouin pairwise_corr function
            corrs = df_corr.rcorr(padjust='holm', decimals=4)
            corrs2 = df_corr.pairwise_corr(padjust='holm').round(4)
            assert corrs.at['Neuroticism', 'Agreeableness'] == '*'
            assert (corrs.at['Agreeableness', 'Neuroticism'] ==
                    str(corrs2.at[2, 'r']))
            corrs = df_corr.rcorr(padjust='holm', stars=False, decimals=4)
            assert (corrs.at['Neuroticism', 'Agreeableness'] ==
                    str(corrs2.at[2, 'p-corr'].round(4)))
            corrs = df_corr.rcorr(upper='n', decimals=5)
            corrs2 = df_corr.pairwise_corr().round(5)
            assert corrs.at['Extraversion', 'Openness'] == corrs2.at[4, 'n']
            assert corrs.at['Openness', 'Extraversion'] == str(corrs2.at[4, 'r'])
            # Method = spearman does not work with Python 3.5 on Travis?
            # Instead it seems to return the Pearson correlation!
    >       df_corr.rcorr(method='spearman')
    aov        =         Source        SS  DF1  DF2  ...         F     p-unc       np2       eps
    0        Group  5.459963    1   58  ......064929  0.998751
    2  Interaction  5.167192    2  116  ...  2.727996  0.069545  0.044922       NaN
    [3 rows x 9 columns]
    aov3_ss1   =        Source         SS    DF        MS         F     p-unc       np2
    0         Sex   3.362769   1.0  3.362769  3.461...62   2.0  0.561181  0.577636  0.564168  0.018007
    3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
    aov3_ss2   =        Source         SS    DF        MS         F     p-unc       np2
    0         Sex   3.419789   1.0  3.419789  3.520...62   2.0  0.561181  0.577636  0.564168  0.018007
    3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
    aov3_ss2_pg =        Source         SS    DF        MS         F     p-unc       np2
    0         Sex   3.419789   1.0  3.419789  3.520...62   2.0  0.561181  0.577636  0.564168  0.018007
    3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
    corrs      =                   Neuroticism Extraversion  ... Agreeableness Conscientiousness
    Neuroticism                 -         ...              500
    Conscientiousness    -0.36801      0.06459  ...       0.15867                 -
    [5 rows x 5 columns]
    corrs2     =                X                  Y   method  ...    p-unc       BF10    power
    0    Neuroticism       Extraversion  pe...  0.059  0.06035
    9  Agreeableness  Conscientiousness  pearson  ...  0.00037     31.243  0.94638
    [10 rows x 10 columns]
    self       = <pingouin.tests.test_pandas.TestParametric testMethod=test_pandas>
    ttests     =   Contrast        A        B  Paired  ...    p-corr  p-adjust   BF10    hedges
    0     Time   August  January    True  ....  4.232 -0.482547
    2     Time  January     June    True  ...  0.310194    fdr_bh  0.232 -0.169520
    [3 rows x 13 columns]
    tukey      =          A           B   mean(A)  ...         T   p-tukey    hedges
    0  Control  Meditation  5.567851  ... -2.289903  0.023202 -0.339918
    [1 rows x 9 columns]
    pingouin/tests/test_pandas.py:114: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    /usr/lib/python3.10/site-packages/pandas_flavor/register.py:29: in __call__
        return method(self._obj, *args, **kwargs)
            args       = ()
            kwargs     = {'method': 'spearman'}
            method     = <function rcorr at 0xbb3880b8>
            self       = <pandas_flavor.register.register_dataframe_method.<locals>.inner.<locals>.AccessorMethod object at 0xa8881f10>
    pingouin/correlation.py:1050: in rcorr
        mat = self.corr(method=method).round(decimals)
            decimals   = 3
            ffp        = <function format_float_positional at 0xc3ab3028>
            method     = 'spearman'
            padjust    = None
            pearsonr   = <function pearsonr at 0xbbb59e38>
            pval_stars = {0.001: '***', 0.01: '**', 0.05: '*'}
            self       =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
    0        2.47917       4.20833   3.93750   ...3            3.39583
    499      2.54167       3.56250   3.14583        3.45833            2.89583
    [500 rows x 5 columns]
            spearmanr  = <function spearmanr at 0xbbb5f388>
            stars      = True
            tif        = <function triu_indices_from at 0xc39efcd0>
            upper      = 'pval'
    /usr/lib/python3.10/site-packages/pandas/core/frame.py:9376: in corr
        correl = libalgos.nancorr_spearman(mat, minp=min_periods)
            cols       = Index(['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness',
           'Conscientiousness'],
          dtype='object')
            idx        = Index(['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness',
           'Conscientiousness'],
          dtype='object')
            mat        = array([[2.47917, 4.20833, 3.9375 , 3.95833, 3.45833],
           [2.60417, 3.1875 , 3.95833, 3.39583, 3.22917],
           [2.... 3.     ],
           [2.08333, 3.66667, 3.58333, 3.45833, 3.39583],
           [2.54167, 3.5625 , 3.14583, 3.45833, 2.89583]])
            method     = 'spearman'
            min_periods = 1
            numeric_df =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
    0        2.47917       4.20833   3.93750   ...3            3.39583
    499      2.54167       3.56250   3.14583        3.45833            2.89583
    [500 rows x 5 columns]
            self       =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
    0        2.47917       4.20833   3.93750   ...3            3.39583
    499      2.54167       3.56250   3.14583        3.45833            2.89583
    [500 rows x 5 columns]
    pandas/_libs/algos.pyx:415: in pandas._libs.algos.nancorr_spearman
        ???
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    >   ???
    E   ValueError: Buffer dtype mismatch, expected 'const intp_t' but got 'long long'
    pandas/_libs/algos.pyx:938: ValueError
    

    On 32-bit Linux platforms, long long is normally 64-bit but intp_t is 32-bit, which explains what is happening but not why it is happening.

    I can easily do any tests that are needed on these platforms to try to figure out where the root issue is, but some ideas regarding what to look for would be very helpful.
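
One quick check on an affected machine is to compare the width of the platform's pointer-sized integer with that of `long long`. The sketch below uses only the standard library. It assumes that `ctypes.c_ssize_t` has the same width as NumPy's `intp` on the platform, which is the usual case but is not confirmed by the report above.

```python
import ctypes
import struct

# Width of a pointer-sized signed integer (assumed to match NumPy's intp).
intp_bytes = ctypes.sizeof(ctypes.c_ssize_t)
# Width of C "long long", normally 8 bytes on both 32-bit and 64-bit Linux.
long_long_bytes = ctypes.sizeof(ctypes.c_longlong)
# Pointer width as reported by the struct module, as a cross-check.
pointer_bytes = struct.calcsize("P")

print(f"intp-like: {intp_bytes * 8} bit, long long: {long_long_bytes * 8} bit")
if intp_bytes != long_long_bytes:
    print("Mismatch: a 64-bit integer buffer does not fit an intp-typed argument.")
```

On a 64-bit build both widths agree, so the mismatch message only appears on platforms such as the 32-bit ones described above.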

    docs/testing:book: 
    opened by musicinmybrain 9
  • description in pg.pairwise_tests for non-parametrics

    description in pg.pairwise_tests for non-parametrics

    The pairwise_tests function has a return_desc option that is suitable for parametric data. For non-parametric data, medians and IQRs could be returned instead when the parametric parameter is set to False. I am willing to implement this feature in the relevant file if it is desired.

    feature request :construction: 
    opened by turkalpmd 2
  • Reporting Extra GG Corrected Values in rm_anova

    Reporting Extra GG Corrected Values in rm_anova

    I believe it would be a great addition to the package! Generally, when reporting GG corrected p-values, the corrected F value and ddofs are also reported. It is also a relatively easy addition.

    (...)
    if correction:
        (...)
        corr_fval = f(corr_ddof1, corr_ddof2).ppf(1-p_corr)

    then during the formation of dataframe:

    if not detailed:
        (....)
        aov["F"] = corr_fval
        aov["DF"] = [corr_ddof1,corr_ddof2]
    else:
        (...)
        aov["DF"] = [corr_ddof1,corr_ddof2]
        aov["F"] = [corr_fval,np.nan]
        (...)
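
As a rough illustration of what reporting the corrected degrees of freedom involves, the sketch below scales both degrees of freedom of a one-way repeated measures ANOVA by a sphericity epsilon. The function name `corrected_dof` and the example numbers are hypothetical and not part of Pingouin. In the usual Greenhouse-Geisser procedure, the F statistic itself is left unchanged and only the degrees of freedom (and hence the p-value) are adjusted.

```python
def corrected_dof(n_subjects, n_levels, eps):
    """Return (ddof1, ddof2) of a one-way RM-ANOVA scaled by epsilon."""
    # Epsilon is bounded below by 1 / (k - 1) and above by 1.
    if not 1.0 / (n_levels - 1) <= eps <= 1.0:
        raise ValueError("eps must lie between 1 / (k - 1) and 1")
    ddof1 = n_levels - 1
    ddof2 = (n_levels - 1) * (n_subjects - 1)
    return eps * ddof1, eps * ddof2


# Hypothetical design: 12 subjects, 4 repeated measurements, epsilon = 0.75.
corr_ddof1, corr_ddof2 = corrected_dof(n_subjects=12, n_levels=4, eps=0.75)
print(corr_ddof1, corr_ddof2)  # 2.25 24.75
```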

    feature request :construction: 
    opened by Esmerald1no 1
  • Add Z tests (proportions & means)

    Add Z tests (proportions & means)

    Hi,

    1. Intro

    As mentioned in discussion #296, it would be convenient to have support for both proportion and mean z tests within pingouin.

    statsmodels already provides several methods for those tests, but they are scattered across modules, so you often end up writing a wrapper. Moreover, some of those methods have unusual signatures, which makes them less than straightforward to use. A single, richer pandas output would be far more helpful. We are talking about basics here: in my opinion it would strengthen pingouin's position as a user-friendly yet powerful and complete statistical package.

    2. Resources

    • statsmodels

      • Proportions
        • z_stat and p_value: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportions_ztest.html
        • diff CI: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.confint_proportions_2indep.html
        • power: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.power_proportions_2indep.html
        • effect size: https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportion_effectsize.html
      • Means
        • z_stat and p_value: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html?highlight=ztest#statsmodels.stats.weightstats.ztest OR https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.ztest_ind.html#statsmodels.stats.weightstats.CompareMeans.ztest_ind
        • diff CI: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.zconfint.html?highlight=ztest OR https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.zconfint_diff.html#statsmodels.stats.weightstats.CompareMeans.zconfint_diff
    • R

      • Means
        • Paired Z test: https://rpubs.com/nguyenminhsang/paired_z-test

    3. Feature

    • proportions_ztest()

      • parameters

        • x1: 2-column array_like, 1st column: number of trials, 2nd column: number of successes
        • x2: same or proportion value
        • alternative
        • paired
        • r
        • confidence
        • method: method for computing confidence interval, ‘newcomb’ (default), ‘wald’, ‘agresti-caffo’, ‘score’
      • returns

        • Z
        • alternative
        • p_val
        • CI95%: diff in proportion
        • cohen-d
        • BF10
        • power
    • means_ztest()

      • parameters
        • x1: array_like
        • x2: same or mean value
        • alternative
        • paired
        • r
        • confidence
      • returns
        • Z
        • alternative
        • p_val
        • CI95%: diff in means
        • cohen-d
        • BF10
        • power
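
To make the proposal above more concrete, here is a minimal sketch of a two-sided z-test for two independent proportions that uses only the standard library. The name `proportions_ztest_sketch` and its count-based arguments are hypothetical simplifications of the signature proposed above. The confidence interval uses the plain Wald method, and the effect size, Bayes Factor and power are left out.

```python
from math import sqrt
from statistics import NormalDist


def proportions_ztest_sketch(count1, nobs1, count2, nobs2):
    """Two-sided z-test for two independent proportions (pooled variance)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    p_val = 2 * (1 - NormalDist().cdf(abs(z)))
    # Wald 95% confidence interval for the difference (unpooled standard error).
    se_diff = sqrt(p1 * (1 - p1) / nobs1 + p2 * (1 - p2) / nobs2)
    crit = NormalDist().inv_cdf(0.975)
    ci95 = (p1 - p2 - crit * se_diff, p1 - p2 + crit * se_diff)
    return {"Z": z, "alternative": "two-sided", "p_val": p_val, "CI95%": ci95}


# Hypothetical data: 45/100 successes in group 1 versus 30/100 in group 2.
res = proportions_ztest_sketch(count1=45, nobs1=100, count2=30, nobs2=100)
print(res)
```

Returning a dictionary keeps the sketch dependency-free; in a pandas-based package the same keys would naturally become the columns of a one-row DataFrame.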

    Thanks! Aurélien

    feature request :construction: 
    opened by aurel-p 1
  • Roadmap for release 0.6.0

    Roadmap for release 0.6.0

    The following issues should be addressed:

    • [x] #308 ❗
    • [ ] #153
    • [ ] #253
    • [ ] #208
    • [ ] #225
    • [ ] #118

    The following PR should be merged:

    • [ ] #226
    • [x] #291

    The following dependencies should be updated:

    • [x] Scikit-learn <1.1.2 (revert temporary fix in #278)

    The following deprecation should be made:

    help wanted :bell: IMPORTANT❗ 
    opened by raphaelvallat 3
  • Support for Bayesian Credible Intervals

    Support for Bayesian Credible Intervals

    Hello, is there an implementation for Bayesian credible intervals as there is for frequentist confidence intervals?

    https://en.wikipedia.org/wiki/Credible_interval

    Helpful Literature:

    1. https://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/
    feature request :construction: 
    opened by filipmarkoski 1
Releases(v0.5.3)
  • v0.5.3(Dec 29, 2022)

    This is a minor release with a few bugfixes, several improvements and one new function/pandas.DataFrame method. Read the changelog at https://pingouin-stats.org/changelog.html

    What's Changed

    • Fix numerical stability issue in multivariate_normality by @gkanwar in https://github.com/raphaelvallat/pingouin/pull/292
    • Add new function for pairwise T-tests between columns of a dataframe (pingouin.ptests) by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/291
    • Handle single-sample comparison in pairwise_test by @George3d6 in https://github.com/raphaelvallat/pingouin/pull/299
    • Change TestRegression class test methods to fix victim flakiness by @blazyy in https://github.com/raphaelvallat/pingouin/pull/303
    • Add aesthetic flexibility to plot_rm_corr by @remrama in https://github.com/raphaelvallat/pingouin/pull/312
    • Update distribution.py by @ALL-SPACE-Rob in https://github.com/raphaelvallat/pingouin/pull/310
    • Plotting seaborn.FacetGrid compatibility by @remrama in https://github.com/raphaelvallat/pingouin/pull/314
    • Use scikit-learn>=1.1.2 by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/300
    • Plot shift documentation PR by @turkalpmd in https://github.com/raphaelvallat/pingouin/pull/320
    • Fix pandas warning by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/323
    • Deal with small sample size in pingouin.normality when using long-format by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/324
    • Renamed 'r' with 'pointbiserialr' in convert_effsize by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/325
    • Exact calculation of effect sizes in pairwise_tukey and pairwise_gameshowell by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/328

    New Contributors

    • @gkanwar made their first contribution in https://github.com/raphaelvallat/pingouin/pull/292
    • @George3d6 made their first contribution in https://github.com/raphaelvallat/pingouin/pull/299
    • @blazyy made their first contribution in https://github.com/raphaelvallat/pingouin/pull/303
    • @remrama made their first contribution in https://github.com/raphaelvallat/pingouin/pull/312
    • @ALL-SPACE-Rob made their first contribution in https://github.com/raphaelvallat/pingouin/pull/310
    • @turkalpmd made their first contribution in https://github.com/raphaelvallat/pingouin/pull/320
  • v0.5.2(Jun 24, 2022)

    Bugfixes

    a. The eta-squared (n2) effect size was not properly calculated in one-way and two-way repeated measures ANOVAs. Specifically, Pingouin followed the same behavior as JASP, i.e. the eta-squared was the same as the partial eta-squared. However, as explained in #251, this behavior is not valid. In one-way ANOVA design, the eta-squared should be equal to the generalized eta-squared. As of March 2022, this bug is also present in JASP. We have therefore updated the unit tests to use JAMOVI instead.

    Please double check any effect sizes previously obtained with the pingouin.rm_anova function!

    b. Fixed invalid resampling behavior for bivariate functions in pingouin.compute_bootci when x and y were not paired. #281
    c. Fixed bug where confidence (previously ci) was ignored when calculating the bootstrapped confidence intervals in pingouin.plot_shift. #282

    Enhancements

    a. The pingouin.pairwise_ttests function has been renamed to pingouin.pairwise_tests. Non-parametric tests are also supported in this function with the parametric=False argument, and thus the name "ttests" was misleading. #209
    b. Allow pingouin.bayesfactor_binom to take a Beta alternative model. #252
    c. Allow keyword arguments for logistic regression in pingouin.mediation_analysis. #245
    d. Speed improvements for the Holm and FDR correction in pingouin.multicomp. #271
    e. Speed improvements for univariate functions in pingouin.compute_bootci (e.g. func="mean" is now vectorized).
    f. Rename eta to eta_squared in pingouin.power_anova and pingouin.power_rm_anova to avoid any confusion. #280
    g. Add support for DataMatrix objects. #286
    h. Use black for code formatting.

  • v0.5.1(Feb 20, 2022)

    Pingouin 0.5.1

    This is a minor release, with several bugfixes and improvements. This release is compatible with SciPy 1.8 and Pandas 1.4.

    Bugfixes

    • Added support for SciPy 1.8 and Pandas 1.4. https://github.com/raphaelvallat/pingouin/pull/234
    • Fixed bug where pingouin.rm_anova() and pingouin.mixed_anova() changed the dtypes of categorical columns in-place https://github.com/raphaelvallat/pingouin/issues/224

    Enhancements

    • Faster implementation of pingouin.gzscore(), adding all options available in zscore: axis, ddof and nan_policy. Warning: this function is deprecated and will be removed in the next version of Pingouin (use scipy.stats.gzscore() instead). https://github.com/raphaelvallat/pingouin/pull/210.
    • Replace use of statsmodels’ studentized range distribution functions with SciPy’s more accurate scipy.stats.studentized_range(). https://github.com/raphaelvallat/pingouin/pull/229.
    • Add support for optional keyword arguments in the pingouin.homoscedasticity() function https://github.com/raphaelvallat/pingouin/issues/218
    • Add support for the Jarque-Bera test in pingouin.normality() https://github.com/raphaelvallat/pingouin/issues/216.

    Lastly, we have also deprecated the Gitter forum in favor of GitHub Discussions. Please use Discussions to ask questions, share ideas / tips and engage with the Pingouin community!

  • v0.5.0(Oct 28, 2021)

    This is a major release with several important bugfixes. We recommend that all users upgrade to this new version.

    See the full changelog at: https://pingouin-stats.org/changelog.html#v0-5-0-october-2021

  • v0.4.0(Aug 13, 2021)

    This is a major release with an important upgrade of the dependencies (requires Python 3.7+ and SciPy 1.7+), several enhancements in existing functions and a new function to test the equality of covariance matrices (pingouin.box_m). We recommend that all users upgrade to the latest version of Pingouin.

    See the full changelog at: https://pingouin-stats.org/changelog.html#v0-4-0-august-2021

  • v0.3.12(May 27, 2021)

    This release fixes a critical error in pingouin.partial_corr: the number of covariates was not taken into account when calculating the degrees of freedom of the partial correlation, thus leading to incorrect results (except for the correlation coefficient which remained unaffected). For more details, please see https://github.com/raphaelvallat/pingouin/issues/171.
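
To see why the number of covariates matters, the sketch below applies the textbook T-statistic of a partial correlation, in which the degrees of freedom are n - 2 - k for n observations and k covariates. The function name `partial_corr_tstat` and the input values are hypothetical, and this is not a copy of Pingouin's implementation.

```python
from math import sqrt


def partial_corr_tstat(r, n, k):
    """Return (dof, t) for a partial correlation r, n samples and k covariates."""
    dof = n - 2 - k
    if dof <= 0:
        raise ValueError("Not enough observations for this number of covariates")
    t = r * sqrt(dof / (1 - r ** 2))
    return dof, t


# Hypothetical values: r = 0.5 with 30 observations.
dof_ignored, t_ignored = partial_corr_tstat(0.5, n=30, k=0)  # covariates ignored
dof_correct, t_correct = partial_corr_tstat(0.5, n=30, k=3)  # three covariates
print(dof_ignored, dof_correct)  # 28 25
```

Ignoring the covariates inflates both the degrees of freedom and the T-value, while the correlation coefficient itself stays the same.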

    For the full changelog, please see https://pingouin-stats.org/changelog.html

  • v0.3.11(Apr 14, 2021)

  • v0.3.10(Feb 16, 2021)

    This release fixes an error in the calculation of the p-values in the pg.pairwise_tukey() and pg.pairwise_gameshowell() functions (https://github.com/raphaelvallat/pingouin/pull/156). Old versions of Pingouin used an incorrect algorithm for the studentized range approximation, which resulted in (slightly) incorrect p-values. In most cases, the error did not seem to affect the significance of the p-values. The new version of Pingouin uses statsmodels to estimate the p-values.

  • v0.3.9(Jan 19, 2021)

  • v0.3.8(Sep 7, 2020)

    • Important bugfix in pingouin.ttest() in which the 95% confidence intervals for one-sample T-test with y != 0 were invalid.
    • Added an "options" module to control global rounding/display behavior.
    • Several enhancements / new features in existing functions.

    See full changelog at: https://pingouin-stats.org/changelog.html

  • v0.3.7(Jul 29, 2020)

  • v0.3.6(Jul 2, 2020)

  • v0.3.5(Jun 14, 2020)

  • v0.3.4(May 7, 2020)

  • v0.3.3(Feb 5, 2020)

    Minor release:

    Bugfixes

    • Fixed a bug in pingouin.pairwise_corr caused by the deprecation of pandas.core.index in the new version of Pandas (1.0). For now, both Pandas 0.25 and Pandas 1.0 are supported.
    • The standard deviation in pingouin.pairwise_ttests when using return_desc=True is now calculated with np.nanstd(ddof=1) to be consistent with Pingouin/Pandas default unbiased standard deviation.
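
The difference between the two conventions can be illustrated with the standard library alone. This is only a sketch with made-up values: `statistics.stdev` divides by n - 1 (the ddof=1 behavior described above), whereas `statistics.pstdev` divides by n, and missing values are dropped by hand to mimic what a NaN-aware function does.

```python
import math
import statistics

# Hypothetical sample with one missing value.
values = [2.0, 4.0, float("nan"), 4.0, 5.0]
clean = [v for v in values if not math.isnan(v)]

sd_sample = statistics.stdev(clean)       # divides by n - 1 (ddof=1)
sd_population = statistics.pstdev(clean)  # divides by n (ddof=0)
print(sd_sample, sd_population)
```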

    New functions

    • Added the pingouin.plot_circmean function to plot the circular mean and circular vector length of a set of angles (in radians) on the unit circle. Note that this function is still in beta and some parameters may change without warnings in the next releases.
  • v0.3.2(Jan 18, 2020)

    Hotfix release to fix a critical issue with pingouin.pairwise_ttests() (see below). We strongly recommend that you update to the newest version of Pingouin and double-check your previous results if you’ve ever used the pairwise T-tests with more than one factor (e.g. mixed, factorial or 2-way repeated measures design).

    Bugfixes

    • MAJOR: Fixed a bug in pingouin.pairwise_ttests() when using mixed or two-way repeated measures design. Specifically, the T-tests were performed without averaging over repeated measurements first (i.e. without calculating the marginal means). Note that for mixed design, this only impacts the between-subject T-test(s). Practically speaking, this led to higher degrees of freedom (because they were conflated with the number of repeated measurements) and ultimately incorrect T and p-values because the assumption of independence was violated. Pingouin now averages over repeated measurements in mixed and two-way repeated measures design, which is the same behavior as JASP or JAMOVI. As a consequence, and when the data has only two groups, the between-subject p-value of the pairwise T-test should be (almost) equal to the p-value of the same factor in the pingouin.mixed_anova() function. The old behavior of Pingouin can still be obtained using the marginal=False argument.

    • Minor: Added a check in pingouin.mixed_anova() to ensure that the subject variable has a unique set of values for each between-subject group defined in the between variable. For instance, the subject IDs for group1 are [1, 2, 3, 4, 5] and for group2 [6, 7, 8, 9, 10]. The function will throw an error if there are one or more overlapping subject IDs between groups (e.g. the subject IDs for group1 AND group2 are both [1, 2, 3, 4, 5]).

    • Minor: Fixed a bug which caused the pingouin.plot_rm_corr() and pingouin.ancova() (with >1 covariates) to throw an error if any of the input variables started with a number (because of statsmodels / Patsy formula formatting).
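
The effect of averaging over repeated measurements, described in the first bugfix above, can be sketched on a toy mixed design. The records and variable names below are made up for illustration; the point is only that each subject contributes a single value once the marginal means are computed, which lowers the degrees of freedom of the between-subject T-test.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format records: (subject, group, time, score).
records = [
    ("s1", "A", "t1", 5.0), ("s1", "A", "t2", 7.0),
    ("s2", "A", "t1", 6.0), ("s2", "A", "t2", 6.0),
    ("s3", "B", "t1", 8.0), ("s3", "B", "t2", 9.0),
    ("s4", "B", "t1", 7.0), ("s4", "B", "t2", 10.0),
]

# Average over the repeated measurements so each subject contributes one value.
per_subject = defaultdict(list)
for subject, group, _time, score in records:
    per_subject[(subject, group)].append(score)
marginal = {key: mean(scores) for key, scores in per_subject.items()}

n_rows, n_subjects = len(records), len(marginal)
# Degrees of freedom of a two-group Student T-test: n - 2.
print("dof without averaging:", n_rows - 2)        # 6
print("dof with marginal means:", n_subjects - 2)  # 2
```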

    Enhancements

    • Upon loading, Pingouin will now use the outdated package to check and warn the user if a newer stable version is available.

    • Globally removed the export_filename parameter, which allowed exporting the output table to a .csv file. This helps simplify the API and testing. As an alternative, one can simply use pandas.DataFrame.to_csv() to export the output dataframe generated by Pingouin.

    • Added the correction argument to pingouin.pairwise_ttests() to enable or disable Welch’s correction for independent T-tests.

  • v0.3.1(Dec 3, 2019)

    Minor release with some bugfixes

    • Fixed a bug in which missing values were removed from all columns in the dataframe in pingouin.kruskal(), even columns that were unrelated. See https://github.com/raphaelvallat/pingouin/issues/74.

    • The pingouin.power_corr() function now throws a warning and returns np.nan when the sample size is too low (instead of raising an error as in previous versions). This is to improve compatibility with the pingouin.pairwise_corr() function.

    • Fixed quantile direction in the pingouin.plot_shift() function. In v0.3.0, the quantile subplot was incorrectly labelled as Y - X, but it was in fact calculating X - Y. See https://github.com/raphaelvallat/pingouin/issues/73

  • v0.3.0(Nov 14, 2019)

    New functions

    Enhancements

    • Added the relimp argument to pingouin.linear_regression() to return the relative importance (= contribution) of each individual predictor to the R^2 of the full model.
    • Complete refactoring of pingouin.intraclass_corr() to closely match the R implementation in the psych package. Pingouin now returns the 6 types of ICC, together with F values, p-values, degrees of freedom and confidence intervals.
    • The pingouin.plot_shift() function now 1) uses the Harrell-Davis robust quantile estimator in conjunction with bias-corrected bootstrap confidence intervals, and 2) supports paired samples.
    • Added the axis argument to pingouin.harrelldavis() to support 2D arrays.
  • v0.2.9(Sep 4, 2019)

    Minor release with mostly internal code refactoring.

    See full changelog at: https://pingouin-stats.org/changelog.html#v0-2-9-september-2019

  • v0.2.8(Jul 22, 2019)

  • v0.2.7(Jun 25, 2019)

    This is a minor release, mainly to fix dependency issues between scipy and statsmodels.

    Dependencies

    a. Pingouin now requires statsmodels>=0.10.0 (latest release June 2019) and is compatible with SciPy 1.3.0.

    Enhancements

    a. Added support for long-format dataframe in pingouin.sphericity and pingouin.epsilon.
    b. Added support for two within-factors interaction in pingouin.sphericity and pingouin.epsilon (for the former, granted that at least one of them has no more than two levels).

    New functions

    a. Added pingouin.power_rm_anova function.

  • v0.2.6(Jun 3, 2019)

    Bugfixes

    • Fixed an error in the two-sided p-value of the Wilcoxon test (pingouin.wilcoxon()): the p-values were accidentally squared, and therefore smaller. Make sure to always use the latest release of Pingouin.
    • pingouin.wilcoxon() now uses the continuity correction by default (the documentation said that the correction was applied, but it was not applied in the code).
    • The show_median argument of the pingouin.plot_shift() function was not working properly when the percentiles were different from the default parameters.

    Dependencies

    • The current release of statsmodels (0.9.0) is not compatible with the newest release of Scipy (1.3.0). In order to avoid compatibility issues in the pingouin.ancova() and pingouin.anova() functions (which rely on statsmodels for certain cases), Pingouin will require SciPy < 1.3.0 until a new stable version of statsmodels is released.

    New functions

    • Added pingouin.chi2_independence() tests.
    • Added pingouin.chi2_mcnemar() tests.
    • Added pingouin.power_chi2() function.
    • Added pingouin.bayesfactor_binom() function.

    Enhancements

    • pingouin.linear_regression() now returns the residuals.
    • Completely rewrote the pingouin.normality() function, which now supports pandas DataFrame (wide & long format), multiple normality tests (scipy.stats.shapiro(), scipy.stats.normaltest()), and an automatic casewise removal of missing values.
    • Completely rewrote the pingouin.homoscedasticity() function, which now supports pandas DataFrame (wide & long format).
    • Faster and more accurate algorithm in pingouin.bayesfactor_pearson() (same algorithm as JASP).
    • Support for one-sided Bayes Factors in pingouin.bayesfactor_pearson().
    • Better handling of required parameters in pingouin.qqplot().
    • The epsilon value for the interaction term in pingouin.rm_anova() is now computed using the Greenhouse-Geisser method instead of the lower bound. A warning message has been added to the documentation to alert the user that the value might differ slightly from R or JASP.

    Contributors

    • Raphael Vallat
    • Arthur Paulino
  • v0.2.5(Apr 29, 2019)

    Major release with several bugfixes, new functions, and many internal improvements:

    MAJOR BUG FIXES

    • Fixed an error in the p-values of the one-sample one-sided T-test (pingouin.ttest()): the two-sided p-value was divided by 4 and not by 2, resulting in inaccurate (smaller) one-sided p-values.
    • Fixed a global error for unbalanced two-way ANOVA (pingouin.anova()): the sums of squares were wrong, and as a consequence so were the F and p-values. In case of unbalanced design, Pingouin now computes type II sums of squares via a call to the statsmodels package.
    • The epsilon factor for the interaction term in two-way repeated measures ANOVA (pingouin.rm_anova()) is now computed using the lower bound approach. This is more conservative than the Greenhouse-Geisser approach and therefore gives (slightly) higher p-values. The reason for choosing this is that the Greenhouse-Geisser values for the interaction term differ from the ones returned by R and JASP. This will hopefully be fixed in future releases.

    New functions

    • Added pingouin.multivariate_ttest() (Hotelling T-squared) test.
    • Added pingouin.cronbach_alpha() function.
    • Added pingouin.plot_shift() function.
    • Several functions of Pingouin can now be directly used as pandas.DataFrame methods.
    • Added pingouin.pcorr() method to compute the partial Pearson correlation matrix of a pandas.DataFrame (similar to the pcor function in the ppcor package).
    • The pingouin.partial_corr() now supports semi-partial correlation.

    Enhancements

    • The pingouin.rm_corr() function now returns a pandas.DataFrame with the r-value, degrees of freedom, p-value, confidence intervals and power.
    • pingouin.compute_esci() now works for paired and one-sample Cohen d.
    • pingouin.bayesfactor_ttest() and pingouin.bayesfactor_pearson() now return a formatted str and not a float.
    • pingouin.pairwise_ttests() now returns the degrees of freedom (dof).
    • Better rounding of float in pingouin.pairwise_ttests().
    • Support for wide-format data in pingouin.rm_anova()
    • pingouin.ttest() now returns the confidence intervals around the T-values.

    Missing values

    • pingouin.remove_na() and pingouin.remove_rm_na() are now external functions documented in the API.
    • pingouin.remove_rm_na() now works with multiple within-factors.
    • pingouin.remove_na() now works with 2D arrays.
    • Removed the remove_na argument in pingouin.rm_anova() and pingouin.mixed_anova(); an automatic listwise deletion of missing values is now applied (same behavior as JASP). Note that this was already the default behavior of Pingouin, but the user could also choose not to remove the missing values, which most likely returned inaccurate results.
    • The pingouin.ancova() function now applies an automatic listwise deletion of missing values.
    • Added remove_na argument (default = False) in pingouin.linear_regression() and pingouin.logistic_regression() functions
    • Missing values are automatically removed in the pingouin.anova() function.

    Contributors

    • Raphael Vallat
    • Nicolas Legrand
  • v0.2.4(Apr 5, 2019)

    Major release with several new functions as well as many internal improvements.

    Correlation

    • Added pingouin.distance_corr() (distance correlation) function.
    • pingouin.rm_corr() now requires at least 3 unique subjects (same behavior as the original R package).
    • The pingouin.pairwise_corr() function is faster and returns the number of outliers if a robust correlation is used.
    • Added support for 2D level in the pingouin.pairwise_corr(). See Jupyter notebooks for examples.
    • Added support for partial correlation in the pingouin.pairwise_corr() function.
    • Greatly improved execution speed of pingouin.correlation.skipped() function.
    • Added default random state to compute the Min Covariance Determinant in the pingouin.correlation.skipped() function.
    • The default number of bootstrap samples for the pingouin.correlation.shepherd() function is now set to 200 (previously 2000) to increase computation speed.
    • pingouin.partial_corr() now automatically drops rows with missing values.

    Datasets

    • Renamed pingouin.read_dataset() and pingouin.list_dataset() (previously, these functions had to be called through pingouin.datasets).

    Pairwise T-tests and multi-comparisons

    • Added support for non-parametric pairwise tests in pingouin.pairwise_ttests() function.
    • Common language effect size (CLES) is now reported by default in pingouin.pairwise_ttests() function.
    • CLES is now implemented in the pingouin.compute_effsize() function.
    • Better code, doc and testing for the functions in multicomp.py.
    • P-values adjustment methods now do not take into account NaN values (same behavior as the R function p.adjust)
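
The NaN handling described in the last item can be sketched with a Holm step-down correction written in plain Python. The function name `holm_ignore_nan` is hypothetical and this is not Pingouin's code; it only illustrates the idea that missing p-values are neither adjusted nor counted in the number of tests.

```python
import math


def holm_ignore_nan(pvals):
    """Holm step-down adjustment; NaN entries are skipped and returned as NaN."""
    idx = [i for i, p in enumerate(pvals) if not math.isnan(p)]
    m = len(idx)  # only non-missing p-values count as tests
    order = sorted(idx, key=lambda i: pvals[i])
    adjusted = [float("nan")] * len(pvals)
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, value)  # keep adjusted values monotonic
        adjusted[i] = running_max
    return adjusted


# The NaN stays in place and the multiplier starts at 3, not 4.
holm_out = holm_ignore_nan([0.01, float("nan"), 0.04, 0.03])
print(holm_out)
```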

    Plotting

    • Added pingouin.plot_paired() function.

    Regression

    • NaN are now automatically removed in pingouin.mediation_analysis().
    • The pingouin.linear_regression() and pingouin.logistic_regression() functions now fail if NaN / Inf are present in the target or predictor variables. The user must remove them before running these functions.
    • Added support for multiple parallel mediators in pingouin.mediation_analysis().
    • Added support for covariates in pingouin.mediation_analysis().
    • Added seed argument to pingouin.mediation_analysis() for reproducible results.
    • pingouin.mediation_analysis() now returns two-sided p-values computed with a permutation test.
    • Added pingouin.utils._perm_pval() to compute p-value from a permutation test.
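
For readers unfamiliar with permutation p-values, a generic two-sided version can be sketched as the proportion of permuted statistics that are at least as extreme as the observed one. The function `perm_pval_two_sided` and the toy null distribution are hypothetical; the exact behavior of pingouin.utils._perm_pval() is not described in these notes and may differ.

```python
def perm_pval_two_sided(observed, null_distribution):
    """Proportion of permuted statistics at least as extreme as the observed one."""
    extreme = sum(abs(stat) >= abs(observed) for stat in null_distribution)
    return extreme / len(null_distribution)


# Hypothetical null distribution obtained from 10 permutations.
null = [-0.8, -0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 0.5, 0.9, 1.2]
perm_p = perm_pval_two_sided(0.85, null)
print(perm_p)  # 0.2
```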

    Bugs and tests

    • Travis and AppVeyor test for Python 3.5, 3.6 and 3.7.
    • Better doctest & improved examples for many functions.
    • Fixed bug with pingouin.mad() when axis was not 0.
  • v0.2.3(Feb 12, 2019)

  • v0.2.2(Dec 15, 2018)

  • v0.2.1(Nov 20, 2018)

    MINOR:

    • JOSS paper citation
    • Better confidence intervals

    See full changelog: https://raphaelvallat.github.io/pingouin/build/html/changelog.html#v0-2-1-november-2018

  • v0.2.0(Nov 19, 2018)

    MAJOR changes in code and documentation, see full changelog here: https://raphaelvallat.github.io/pingouin/build/html/changelog.html#v0-2-0-november-2018

  • v0.1.10(Oct 10, 2018)

    Minor release:

    • Fixed dataset names in MANIFEST.in (.csv files were not copy-pasted with pip)
    • Added circ_vtest function
    • Added multivariate_normality function (Henze-Zirkler’s Multivariate Normality Test)
    • Renamed functions test_normality, test_sphericity and test_homoscedasticity to normality, sphericity and homoscedasticity to avoid bugs with pytest.
    • Moved distribution tests from parametric.py to distribution.py
  • v0.1.9(Oct 5, 2018)

Owner
Raphael Vallat
French research scientist specialized in sleep and dreaming | Strong interest in stats and signal processing | Python lover

WhiteBox 3 Oct 03, 2022
pipeline for migrating lichess data into postgresql

How Long Does It Take Ordinary People To "Get Good" At Chess? TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, mo

Joseph Wong 182 Nov 11, 2022
MotorcycleParts DataAnalysis python

We work with the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

NASEEM A P 1 Jan 12, 2022
Statistical Rethinking course winter 2022

Statistical Rethinking (2022 Edition) Instructor: Richard McElreath Lectures: Uploaded Playlist and pre-recorded, two per week Discussion: Online, F

Richard McElreath 3.9k Dec 31, 2022
CINECA molecular dynamics tutorial set

High Performance Molecular Dynamics Logging into CINECA's computer systems To logon to the M100 system use the following command from an SSH client ss

J. W. Dell 0 Mar 13, 2022
Data Analysis for First Year Laboratory at Imperial College, London.

Data Analysis for First Year Laboratory at Imperial College, London. For personal reference only, and to reference in lab reports and lab books.

Martin He 0 Aug 29, 2022
CS50 pset9: Using flask API to create a web application to exchange stocks' shares.

C$50 Finance In this guide we want to implement a website via which users can “register”, “login” “buy” and “sell” stocks, like below: Background If y

1 Jan 24, 2022
Open source platform for Data Science Management automation

Hydrosphere examples This repo contains demo scenarios and pre-trained models to show Hydrosphere capabilities. Data and artifacts management Some mod

hydrosphere.io 6 Aug 10, 2021
Efficient matrix representations for working with tabular data

Efficient matrix representations for working with tabular data

QuantCo 70 Dec 14, 2022
Functional tensors for probabilistic programming

Funsor Funsor is a tensor-like library for functions and distributions. See Functional tensors for probabilistic programming for a system description.

208 Dec 29, 2022
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
Python package to transfer data in a fast, reliable, and packetized form.

pySerialTransfer Python package to transfer data in a fast, reliable, and packetized form.

PB2 101 Dec 07, 2022