Overview

rliable

rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learning benchmarks.

Desideratum: Uncertainty in aggregate performance
  Current evaluation approach — Point estimates:
    • Ignore statistical uncertainty
    • Hinder reproducibility of results
  Our recommendation: Interval estimates using stratified bootstrap confidence intervals (CIs)

Desideratum: Performance variability across tasks and runs
  Current evaluation approach — Tables with per-task mean scores:
    • Overwhelming beyond a few tasks
    • Standard deviations frequently omitted
    • Incomplete picture for multimodal and heavy-tailed distributions
  Our recommendation: Score distributions (performance profiles):
    • Show the tail distribution of scores on combined runs across tasks
    • Allow qualitative comparisons
    • Make any score percentile easy to read off

Desideratum: Aggregate metrics for summarizing benchmark performance
  Current evaluation approach — Mean:
    • Often dominated by performance on outlier tasks
  Median:
    • Statistically inefficient (requires a large number of runs to claim improvements)
    • Poor indicator of overall performance: zero scores on nearly half the tasks do not change it
  Our recommendation: Interquartile Mean (IQM) across all runs:
    • Mean score of the middle 50% of combined runs
    • Robust to outlier scores, yet more statistically efficient than the median
  To show other aspects of performance gains, also report the probability of improvement and the optimality gap.

rliable provides support for:

  • Stratified Bootstrap Confidence Intervals (CIs)
  • Performance Profiles (with plotting functions)
  • Aggregate metrics
    • Interquartile Mean (IQM) across all runs
    • Optimality Gap
    • Probability of Improvement
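To build intuition for why IQM is recommended, note that the IQM over pooled run-task scores is equivalent to a 25% trimmed mean, which `scipy` computes directly. A minimal sketch with made-up scores (not from any benchmark), showing how the mean is dragged up by a single outlier task while IQM is not:

```python
import numpy as np
from scipy import stats

# Hypothetical normalized scores: 5 runs x 4 tasks, with one outlier task score.
scores = np.array([
    [0.1, 0.2, 0.3, 10.0],
    [0.2, 0.3, 0.4, 0.5],
    [0.1, 0.4, 0.2, 0.3],
    [0.3, 0.2, 0.5, 0.4],
    [0.2, 0.1, 0.3, 0.2],
])
flat = scores.flatten()

mean = flat.mean()                 # dominated by the outlier score
median = np.median(flat)           # ignores most of the data
iqm = stats.trim_mean(flat, 0.25)  # mean of the middle 50% of pooled scores

print(mean, median, iqm)
```

Here `trim_mean(flat, 0.25)` discards the lowest and highest 25% of the pooled scores before averaging, which is exactly the IQM over combined runs.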

Interactive Colab

We provide a Colab at bit.ly/statistical_precipice_colab, which shows how to use the library with examples of published algorithms on widely used benchmarks, including Atari 100k, ALE, DM Control, and Procgen.

Paper

For more details, refer to the accompanying NeurIPS 2021 paper (Oral): Deep Reinforcement Learning at the Edge of the Statistical Precipice.

Installation

To install rliable, run:

pip install -U rliable

To install the latest version of rliable from source, run:

pip install git+https://github.com/google-research/rliable

To import rliable (the examples below also use numpy, matplotlib, and seaborn), we suggest:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rliable import library as rly
from rliable import metrics
from rliable import plot_utils

Aggregate metrics with 95% Stratified Bootstrap CIs

IQM, Optimality Gap, Median, Mean
algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size `(num_runs x num_games)`.
atari_200m_normalized_score_dict = ...
aggregate_func = lambda x: np.array([
  metrics.aggregate_median(x),
  metrics.aggregate_iqm(x),
  metrics.aggregate_mean(x),
  metrics.aggregate_optimality_gap(x)])
aggregate_scores, aggregate_score_cis = rly.get_interval_estimates(
  atari_200m_normalized_score_dict, aggregate_func, reps=50000)
fig, axes = plot_utils.plot_interval_estimates(
  aggregate_scores, aggregate_score_cis,
  metric_names=['Median', 'IQM', 'Mean', 'Optimality Gap'],
  algorithms=algorithms, xlabel='Human Normalized Score')
Probability of Improvement
# Load ProcGen scores as a dictionary containing pairs of normalized score
# matrices for pairs of algorithms we want to compare
procgen_algorithm_pairs = {.. , 'x,y': (score_x, score_y), ..}
average_probabilities, average_prob_cis = rly.get_interval_estimates(
  procgen_algorithm_pairs, metrics.probability_of_improvement, reps=50000)
plot_utils.plot_probability_of_improvement(average_probabilities, average_prob_cis)
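For intuition, the probability of improvement of algorithm X over Y is the fraction of run pairs on which X scores higher than Y (ties counted as half), averaged across tasks. A standalone sketch of the idea (this is not the library's implementation; `prob_of_improvement` is a hypothetical helper):

```python
import numpy as np

def prob_of_improvement(scores_x, scores_y):
    """Average over tasks of P(X > Y), counting ties as half.

    scores_x: (num_runs_x, num_tasks); scores_y: (num_runs_y, num_tasks).
    """
    num_tasks = scores_x.shape[1]
    probs = []
    for t in range(num_tasks):
        x, y = scores_x[:, t], scores_y[:, t]
        # Compare every run of X against every run of Y on this task.
        greater = (x[:, None] > y[None, :]).mean()
        ties = (x[:, None] == y[None, :]).mean()
        probs.append(greater + 0.5 * ties)
    return np.mean(probs)
```

A value of 0.5 means the two algorithms are indistinguishable on this criterion; values near 1.0 mean X beats Y on almost every run pair.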

Sample Efficiency Curve

algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices across all 200 million frames, each of which is of size
# `(num_runs x num_games x 200)` where scores are recorded every million frames.
ale_all_frames_scores_dict = ...
frames = np.array([1, 10, 25, 50, 75, 100, 125, 150, 175, 200]) - 1
ale_frames_scores_dict = {algorithm: score[:, :, frames] for algorithm, score
                          in ale_all_frames_scores_dict.items()}
iqm = lambda scores: np.array([metrics.aggregate_iqm(scores[..., frame])
                               for frame in range(scores.shape[-1])])
iqm_scores, iqm_cis = rly.get_interval_estimates(
  ale_frames_scores_dict, iqm, reps=50000)
plot_utils.plot_sample_efficiency_curve(
    frames+1, iqm_scores, iqm_cis, algorithms=algorithms,
    xlabel=r'Number of Frames (in millions)',
    ylabel='IQM Human Normalized Score')
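The stratified bootstrap underlying `get_interval_estimates` resamples runs with replacement independently within each task (stratum). A simplified percentile-CI sketch of that idea (assumes nothing about rliable's internals; `stratified_bootstrap_ci` is a hypothetical helper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def stratified_bootstrap_ci(scores, metric, reps=2000, alpha=0.05):
    """Percentile bootstrap CI, resampling runs independently per task.

    scores: (num_runs, num_tasks) matrix; metric maps a matrix to a scalar.
    """
    num_runs, num_tasks = scores.shape
    estimates = np.empty(reps)
    for r in range(reps):
        # Draw run indices separately within every task column.
        idx = rng.integers(num_runs, size=(num_runs, num_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        estimates[r] = metric(resampled)
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# IQM as a 25% trimmed mean over the pooled run-task scores.
iqm = lambda m: stats.trim_mean(m.flatten(), 0.25)
```

Stratifying by task preserves each task's score distribution in every bootstrap replicate, rather than mixing scores across tasks.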

Performance Profiles

# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size `(num_runs x num_games)`.
atari_200m_normalized_score_dict = ...
# Human normalized score thresholds
atari_200m_thresholds = np.linspace(0.0, 8.0, 81)
score_distributions, score_distributions_cis = rly.create_performance_profile(
    atari_200m_normalized_score_dict, atari_200m_thresholds)
# Plot score distributions
fig, ax = plt.subplots(ncols=1, figsize=(7, 5))
plot_utils.plot_performance_profiles(
  score_distributions, atari_200m_thresholds,
  performance_profile_cis=score_distributions_cis,
  colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
  xlabel=r'Human Normalized Score $(\tau)$',
  ax=ax)
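Conceptually, a performance profile reports, for each threshold tau, the fraction of runs across all tasks whose score exceeds tau. A minimal sketch of that computation (the hypothetical `performance_profile` helper is independent of rliable's `create_performance_profile`, which additionally returns bootstrap CIs):

```python
import numpy as np

def performance_profile(score_dict, thresholds):
    """For each algorithm, the fraction of run-task scores above each tau.

    score_dict maps algorithm name -> (num_runs, num_tasks) score matrix.
    """
    return {
        alg: np.array([(scores > tau).mean() for tau in thresholds])
        for alg, scores in score_dict.items()
    }
```

Each curve is non-increasing in tau, so a curve lying above another indicates stochastic dominance over the compared score distribution.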

The above profile can also be plotted with non-linear scaling as follows:

plot_utils.plot_performance_profiles(
  score_distributions, atari_200m_thresholds,
  performance_profile_cis=score_distributions_cis,
  use_non_linear_scaling=True,
  xticks=[0.0, 0.5, 1.0, 2.0, 4.0, 8.0],
  colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
  xlabel=r'Human Normalized Score $(\tau)$',
  ax=ax)

Dependencies

The code was tested under Python>=3.7 and uses these packages:

  • arch >= 4.19
  • scipy >= 1.7.0
  • numpy >= 1.16.4
  • absl-py >= 0.9.0

Citing

If you find this open-source release useful, please cite it in your paper:

@article{agarwal2021deep,
  title={Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author={Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
          and Courville, Aaron and Bellemare, Marc G},
  journal={Advances in Neural Information Processing Systems},
  year={2021}
}

Disclaimer: This is not an official Google product.

Comments
  • RAD results may be incorrect.

    Hi @agarwl. I found that the 'step' in RAD's 'eval.log' refers to the policy step, but the 'step' in 'xxx--eval_scores.npy' refers to the environment step. We know that environment step = policy step * action_repeat.

    Here comes a problem: if you use the results at 100k steps in 'eval.log', then you are actually evaluating the scores at 100k*action_repeat environment steps. This will lead to overestimation of RAD. I wonder whether you did such incorrect evaluations, or whether you took the results in 'xxx--eval_scores.npy', which are correct in terms of steps. You may refer to a similar question in https://github.com/MishaLaskin/rad/issues/15.

    I reproduced the results of RAD locally, and I found my results are much worse than the ones reported in your paper. I list them in the following figure.

    I compared the means on each task. Obviously, there is a huge gap, and my results are close to the ones reported by the DrQ authors (see the table in https://github.com/MishaLaskin/rad/issues/1). I suspect you may have evaluated scores at incorrect environment steps. Could you please offer more details on how RAD was evaluated? Thanks :)

    opened by TaoHuang13 19
  • Installation fails on MacBook Pro with M1 chip

    The installation fails on my MacBook Pro with M1 chip.

    I also tried on a MacBook Pro with an Intel chip (and the same OS version) and on a Linux system: the installation was successful on both configurations.

    $ cd rliable
    $ pip install -e .
    Obtaining file:///Users/quentingallouedec/rliable
      Preparing metadata (setup.py) ... done
    Collecting arch==5.0.1
      Using cached arch-5.0.1.tar.gz (937 kB)
      Installing build dependencies ... error
      error: subprocess-exited-with-error
    
    ... # Log too long for GitHub issue
    
    error: subprocess-exited-with-error
    
    × pip subprocess to install build dependencies did not run successfully.
    │ exit code: 1
    ╰─> See above for output.
    
    note: This error originates from a subprocess, and is likely not a problem with pip.
    

    System info

    • Python version: 3.9
    • System Version: macOS 12.4 (21F79)
    • Kernel Version: Darwin 21.5.0

    What I've tried

    Install only arch 5.0.1

    It seems to be related to the installation of arch: I tried pip install arch==5.0.1 on its own and it also failed with the same logs.

    Install the last version of arch

    I've tried to pip install arch (current version: 5.2.0), and it worked.

    Use rliable with the last version of arch

    Since I can install arch==5.2.0, I've tried to make rliable work with arch 5.2.0 (by manually modifying setup.py). Pytest failed. Here are the logs for one of the failing unit tests:

    _____________________________________________ LibraryTest.test_stratified_bootstrap_runs_and_tasks _____________________________________________
    
    self = <library_test.LibraryTest testMethod=test_stratified_bootstrap_runs_and_tasks>, task_bootstrap = True
    
        @parameterized.named_parameters(
            dict(testcase_name="runs_only", task_bootstrap=False),
            dict(testcase_name="runs_and_tasks", task_bootstrap=True))
        def test_stratified_bootstrap(self, task_bootstrap):
          """Tests StratifiedBootstrap."""
          bs = rly.StratifiedBootstrap(
              self._x, y=self._y, z=self._z, task_bootstrap=task_bootstrap)
    >     for data, kwdata in bs.bootstrap(5):
    
    tests/rliable/library_test.py:40: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    env/lib/python3.9/site-packages/arch/bootstrap/base.py:694: in bootstrap
        yield self._resample()
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = Stratified Bootstrap(no. pos. inputs: 1, no. keyword inputs: 2, ID: 0x15b353a00)
    
        def _resample(self) -> Tuple[Tuple[ArrayLike, ...], Dict[str, ArrayLike]]:
            """
            Resample all data using the values in _index
            """
            indices = self._index
    >       assert isinstance(indices, np.ndarray)
    E       AssertionError
    
    env/lib/python3.9/site-packages/arch/bootstrap/base.py:1294: AssertionError
    _______________________________________________ LibraryTest.test_stratified_bootstrap_runs_only ________________________________________________
    
    self = <library_test.LibraryTest testMethod=test_stratified_bootstrap_runs_only>, task_bootstrap = False
    
        @parameterized.named_parameters(
            dict(testcase_name="runs_only", task_bootstrap=False),
            dict(testcase_name="runs_and_tasks", task_bootstrap=True))
        def test_stratified_bootstrap(self, task_bootstrap):
          """Tests StratifiedBootstrap."""
          bs = rly.StratifiedBootstrap(
              self._x, y=self._y, z=self._z, task_bootstrap=task_bootstrap)
    >     for data, kwdata in bs.bootstrap(5):
    
    tests/rliable/library_test.py:40: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    env/lib/python3.9/site-packages/arch/bootstrap/base.py:694: in bootstrap
        yield self._resample()
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = Stratified Bootstrap(no. pos. inputs: 1, no. keyword inputs: 2, ID: 0x15b2ff1f0)
    
        def _resample(self) -> Tuple[Tuple[ArrayLike, ...], Dict[str, ArrayLike]]:
            """
            Resample all data using the values in _index
            """
            indices = self._index
    >       assert isinstance(indices, np.ndarray)
    E       AssertionError
    
    env/lib/python3.9/site-packages/arch/bootstrap/base.py:1294: AssertionError
    

    It seems like there are breaking changes between arch 5.0.1 and arch 5.2.0. Maybe this issue can be solved by updating this dependency to its current version.

    opened by qgallouedec 10
  • Bug in plot_utils.py

    Hi,

    In plot_utils.py, I think this line ought to be algorithms = list(point_estimates.keys()): https://github.com/google-research/rliable/blob/72fc16c31c4021b72e7b21f3ba915e1b38cff481/rliable/plot_utils.py#L245. Otherwise, algorithms cannot be indexed on the next line.

    opened by zhefan 2
  • Question about documentation in probability_of_improvement

    Hi, I wonder if the documentation in probability_of_improvement function in metrics.py is wrong? Specifically,

    scores_x: A matrix of size (num_runs_x x num_tasks) where scores_x[m][n] represent the score on run n of task m for algorithm X. https://github.com/google-research/rliable/blob/cc5eff51cab488b34cfeb5c5e37eae7a6b4a92b2/rliable/metrics.py#L77)

    Should scores_x[n][m] be the score on run n of task m for algorithm X?

    Thanks.

    opened by zhefan 2
  • Downloading data set always stuck

    Thanks for sharing the repo. There is a problem: every time I download the dataset, it always gets stuck somewhere at 9X%. Do you know what might cause this?

    ...
    Copying gs://rl-benchmark-data/atari_100k/SimPLe.json...
    Copying gs://rl-benchmark-data/atari_100k/OTRainbow.json...
    [55/59 files][  2.9 MiB/  3.0 MiB]  98% Done
    
    opened by HYDesmondLiu 2
  • Fix dict_keys object -> list

    This fixes a downstream task where algorithms[0] in the following line fails because point_estimates.keys() returns a dict_keys object, not a subscriptable list.
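    For context, dict.keys() returns a view object that supports iteration but not indexing, which is why the cast to list is needed. A quick illustration with a hypothetical dictionary:

```python
point_estimates = {'DQN': 0.8, 'Rainbow': 1.1}

keys = point_estimates.keys()  # a dict_keys view, not a list
try:
    keys[0]  # views do not support indexing
except TypeError:
    print("dict_keys is not subscriptable")

algorithms = list(point_estimates.keys())  # casting to list fixes it
print(algorithms[0])  # prints 'DQN'
```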

    opened by jjshoots 1
  • How can I access the data directly without using gsutil?

    I haven't got gsutil set up on my M1 MacBook and I'm not sure the steps are super streamlined. Can I somehow access the data from my browser or download it another way?

    documentation 
    opened by slerman12 1
  • Add installation of compatible arch version to notebook

    Latest arch version raises an exception when calling create_performance_profile. Adding !pip install arch==5.0.1 to the notebook file resolves the issue. This change should be reflected in the hosted colab notebook.

    opened by Aladoro 1
  • Customisable linestyles in performance profile plots

    The primary reason for this PR is an added option for customising linestyles in performance profile plots. It works in exactly the same way as the colors parameter the function already had: a map, None by default (meaning all methods are plotted as solid lines), which can be passed in to change the linestyle of every method's plot.

    Here you can see, as an example, a plot I'm currently working on where I'm using this functionality to have some methods plotted as dotted lines instead of solid ones:

    [image]

    Additionally, I have added a .gitignore file to ignore some files that were automatically created when I installed rliable with pip from local source code in my own fork of the repo, and files created by working with rliable source code in the PyCharm IDE.

    opened by DennisSoemers 1
  • README image link broken: ale_score_distributions_new.png

    It seems that the file images/ale_score_distributions_new.png pointed to in the README (https://github.com/google-research/rliable#performance-profiles) was deleted in one of the recent commits.

    opened by nirbhayjm 1
  • Urgent question about data aggregates

    Hi, we compiled the Atari 100k results from DrQ, CURL, and DER, and the mean/median human-norm scores are well below those reported in prior works, including from co-authors of the rliable paper.

    We have median human-norm scores all around 0.10 - 0.12.

    Is this accurate? Of all of these, DER (the oldest of the algs) has the highest mean human-norm score.

    opened by slerman12 1