Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".

Overview

SLM Lab

Documentation:
https://slm-lab.gitbook.io/slm-lab/

Example agents: PPO on BeamRider, Breakout, KungFuMaster, MsPacman, Pong, Qbert, Seaquest, SpaceInvaders; SAC on Ant, HalfCheetah, Hopper, Humanoid, InvertedDoublePendulum, InvertedPendulum, Reacher, Walker.
Comments
  • All 'search' examples end with error

    Describe the bug

    I'm enjoying the book a lot. It's the best book on the subject, and I've read Sutton & Barto, but I'm an empiricist, not an academic. Anyway, I can run all the examples in the book in 'dev' and 'train' modes, but not in 'search' mode: they all end with an error. I don't see anybody else complaining about this, so it must be a rookie mistake on my part. I hope you can help so I can continue enjoying the book to its fullest.

    To Reproduce

    1. OS and environment: Ubuntu 18.04
    2. SLM Lab git SHA (run git rev-parse HEAD to get it): What?
    3. spec file used: benchmark/reinforce/reinforce_cartpole.json
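Item 2 asks for the git SHA of the SLM Lab checkout. Running `git rev-parse HEAD` inside the repository directory prints it. The minimal Python sketch below wraps that same command, assuming `git` is on the PATH; the helper name `get_git_sha` is hypothetical and not part of SLM Lab.

```python
import subprocess
from typing import Optional


def get_git_sha(repo_dir: str = ".") -> Optional[str]:
    """Return the full commit SHA of repo_dir, or None if it cannot be read.

    Equivalent to running `git rev-parse HEAD` inside repo_dir.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, repo_dir does not exist,
        # or repo_dir is not inside a git repository
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    # Run this from the root of the SLM Lab checkout
    print(get_git_sha() or "not a git checkout")
```

The printed 40-character hash is what goes after "SLM Lab git SHA" in the issue template.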

    Additional context

    I'm showing the error logs for Code 2.15 on page 50, but I get similar error logs for all the other examples run in 'search' mode. There are 32 files in the 'data' folder and no plots. All the folders in the 'data' folder are empty except for 'log', which contains a file with this line:

    [2020-01-30 11:03:56,907 PID:3351 INFO search.py run_ray_search] Running ray search for spec reinforce_cartpole
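In 'search' mode the lab hands the spec's search settings to Ray Tune (the `run_ray_search` and `tune.py` lines in the log), which runs one trial per configuration. The trial names later in the log, `agent.0.algorithm.center_return=True,trial_index=0` and `agent.0.algorithm.center_return=False,trial_index=1`, show a two-point grid over `center_return`. The sketch below illustrates that grid expansion in plain Python. It is a simplified illustration, not SLM Lab's or Ray's code: the dotted-key format and the helpers `expand_grid` and `trial_name` are assumptions made to mirror the log.

```python
import itertools
from typing import Any, Dict, List


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Expand {dotted_key: [values]} into one config per value combination.

    Keys are dotted paths such as "agent.0.algorithm.center_return"
    (an assumed format that mirrors the trial names in the Ray log).
    """
    keys = sorted(grid)
    trials = []
    combos = itertools.product(*(grid[k] for k in keys))
    for trial_index, values in enumerate(combos):
        config = dict(zip(keys, values))
        config["trial_index"] = trial_index  # appended last, as in the log
        trials.append(config)
    return trials


def trial_name(config: Dict[str, Any]) -> str:
    """Render a config as 'key=value,...', similar to the Ray trial labels."""
    return ",".join(f"{k}={v}" for k, v in config.items())


if __name__ == "__main__":
    grid = {"agent.0.algorithm.center_return": [True, False]}
    for config in expand_grid(grid):
        print(trial_name(config))
    # agent.0.algorithm.center_return=True,trial_index=0
    # agent.0.algorithm.center_return=False,trial_index=1
```

Each extra searched key multiplies the number of trials, which is why a search can occupy many CPUs at once (here 4/8 CPUs for 2 trials).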
    

    NVIDIA driver version: 440.33.01, CUDA version: 10.2
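The crash in the log is raised as `CUDA error: initialization error` inside a Ray worker, even though Ray reports `0/1 GPUs` requested and the CartPole spec uses a small MLP. One experiment worth trying, offered as an untested hypothesis rather than a confirmed fix, is to hide the GPU from the whole search so the workers never initialize CUDA. Setting the `CUDA_VISIBLE_DEVICES` environment variable to an empty string is the standard CUDA way to do that. The helper `cpu_only_command` below is hypothetical; it only rebuilds the command from the log with that variable set.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Tuple


def cpu_only_command(
    spec_file: str,
    spec_name: str,
    mode: str = "search",
    base_env: Mapping[str, str] = os.environ,
) -> Tuple[List[str], Dict[str, str]]:
    """Build the run_lab.py command and an environment with all GPUs hidden."""
    cmd = [sys.executable, "run_lab.py", spec_file, spec_name, mode]
    env = dict(base_env)  # copy, so the caller's environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty string: CUDA sees no devices
    return cmd, env


# Only launch when run from the SLM Lab root, where run_lab.py lives
if __name__ == "__main__" and os.path.exists("run_lab.py"):
    cmd, env = cpu_only_command(
        "slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json",
        "reinforce_baseline_cartpole",
    )
    subprocess.run(cmd, env=env, check=True)
```

From a shell, the equivalent is prefixing the command from the log with `CUDA_VISIBLE_DEVICES=""`. If the search then completes, that points toward a CUDA-initialization problem in the worker processes rather than a mistake in the spec.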

    Error logs

    python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_baseline_cartpole search
    [2020-01-30 11:38:57,177 PID:4355 INFO run_lab.py read_spec_and_run] Running lab spec_file:slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json spec_name:reinforce_baseline_cartpole in mode:search
    [2020-01-30 11:38:57,183 PID:4355 INFO search.py run_ray_search] Running ray search for spec reinforce_baseline_cartpole
    2020-01-30 11:38:57,183	WARNING worker.py:1341 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
    2020-01-30 11:38:57,183	INFO node.py:497 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2020-01-30_11-38-57_183527_4355/logs.
    2020-01-30 11:38:57,288	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:59003 to respond...
    2020-01-30 11:38:57,409	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:55931 to respond...
    2020-01-30 11:38:57,414	INFO services.py:806 -- Starting Redis shard with 3.35 GB max memory.
    2020-01-30 11:38:57,435	INFO node.py:511 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2020-01-30_11-38-57_183527_4355/logs.
    2020-01-30 11:38:57,435	INFO services.py:1441 -- Starting the Plasma object store with 5.02 GB memory using /dev/shm.
    2020-01-30 11:38:57,543	INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
    2020-01-30 11:38:57,543	INFO tune.py:223 -- Starting a new experiment.
    == Status ==
    Using FIFO scheduling algorithm.
    Resources requested: 0/8 CPUs, 0/1 GPUs
    Memory usage on this node: 2.1/16.7 GB
    
    2020-01-30 11:38:57,572	WARNING logger.py:130 -- Couldn't import TensorFlow - disabling TensorBoard logging.
    2020-01-30 11:38:57,573	WARNING logger.py:224 -- Could not instantiate <class 'ray.tune.logger.TFLogger'> - skipping.
    == Status ==
    Using FIFO scheduling algorithm.
    Resources requested: 4/8 CPUs, 0/1 GPUs
    Memory usage on this node: 2.2/16.7 GB
    Result logdir: /home/joe/ray_results/reinforce_baseline_cartpole
    Number of trials: 2 ({'RUNNING': 1, 'PENDING': 1})
    PENDING trials:
     - ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1:	PENDING
    RUNNING trials:
     - ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0:	RUNNING
    
    2020-01-30 11:38:57,596	WARNING logger.py:130 -- Couldn't import TensorFlow - disabling TensorBoard logging.
    2020-01-30 11:38:57,607	WARNING logger.py:224 -- Could not instantiate <class 'ray.tune.logger.TFLogger'> - skipping.
    (pid=4389) [2020-01-30 11:38:58,297 PID:4389 INFO logger.py info] Running sessions
    (pid=4388) [2020-01-30 11:38:58,292 PID:4388 INFO logger.py info] Running sessions
    (pid=4388) terminate called after throwing an instance of 'c10::Error'
    (pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4388) 
    (pid=4388) Fatal Python error: Aborted
    (pid=4388) 
    (pid=4388) Stack (most recent call first):
    (pid=4389) [2020-01-30 11:38:58,326 PID:4456 INFO openai.py __init__] OpenAIEnv:
    (pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4389) - eval_frequency = 2000
    (pid=4389) - log_frequency = 10000
    (pid=4389) - frame_op = None
    (pid=4389) - frame_op_len = None
    (pid=4389) - image_downsize = (84, 84)
    (pid=4389) - normalize_state = False
    (pid=4389) - reward_scale = None
    (pid=4389) - num_envs = 1
    (pid=4389) - name = CartPole-v0
    (pid=4389) - max_t = 200
    (pid=4389) - max_frame = 100000
    (pid=4389) - to_render = False
    (pid=4389) - is_venv = False
    (pid=4389) - clock_speed = 1
    (pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
    (pid=4389) - done = False
    (pid=4389) - total_reward = nan
    (pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4389) - observation_space = Box(4,)
    (pid=4389) - action_space = Discrete(2)
    (pid=4389) - observable_dim = {'state': 4}
    (pid=4389) - action_dim = 2
    (pid=4389) - is_discrete = True
    (pid=4389) [2020-01-30 11:38:58,327 PID:4453 INFO openai.py __init__] OpenAIEnv:
    (pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4389) - eval_frequency = 2000
    (pid=4389) - log_frequency = 10000
    (pid=4389) - frame_op = None
    (pid=4389) - frame_op_len = None
    (pid=4389) - image_downsize = (84, 84)
    (pid=4389) - normalize_state = False
    (pid=4389) - reward_scale = None
    (pid=4389) - num_envs = 1
    (pid=4389) - name = CartPole-v0
    (pid=4389) - max_t = 200
    (pid=4389) - max_frame = 100000
    (pid=4389) - to_render = False
    (pid=4389) - is_venv = False
    (pid=4389) - clock_speed = 1
    (pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
    (pid=4389) - done = False
    (pid=4389) - total_reward = nan
    (pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4389) - observation_space = Box(4,)
    (pid=4389) - action_space = Discrete(2)
    (pid=4389) - observable_dim = {'state': 4}
    (pid=4389) - action_dim = 2
    (pid=4389) - is_discrete = True
    (pid=4389) [2020-01-30 11:38:58,328 PID:4450 INFO openai.py __init__] OpenAIEnv:
    (pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4389) - eval_frequency = 2000
    (pid=4389) - log_frequency = 10000
    (pid=4389) - frame_op = None
    (pid=4389) - frame_op_len = None
    (pid=4389) - image_downsize = (84, 84)
    (pid=4389) - normalize_state = False
    (pid=4389) - reward_scale = None
    (pid=4389) - num_envs = 1
    (pid=4389) - name = CartPole-v0
    (pid=4389) - max_t = 200
    (pid=4389) - max_frame = 100000
    (pid=4389) - to_render = False
    (pid=4389) - is_venv = False
    (pid=4389) - clock_speed = 1
    (pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
    (pid=4389) - done = False
    (pid=4389) - total_reward = nan
    (pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4389) - observation_space = Box(4,)
    (pid=4389) - action_space = Discrete(2)
    (pid=4389) - observable_dim = {'state': 4}
    (pid=4389) - action_dim = 2
    (pid=4389) - is_discrete = True
    (pid=4389) [2020-01-30 11:38:58,335 PID:4458 INFO openai.py __init__] OpenAIEnv:
    (pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4389) - eval_frequency = 2000
    (pid=4389) - log_frequency = 10000
    (pid=4389) - frame_op = None
    (pid=4389) - frame_op_len = None
    (pid=4389) - image_downsize = (84, 84)
    (pid=4389) - normalize_state = False
    (pid=4389) - reward_scale = None
    (pid=4389) - num_envs = 1
    (pid=4389) - name = CartPole-v0
    (pid=4389) - max_t = 200
    (pid=4389) - max_frame = 100000
    (pid=4389) - to_render = False
    (pid=4389) - is_venv = False
    (pid=4389) - clock_speed = 1
    (pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
    (pid=4389) - done = False
    (pid=4389) - total_reward = nan
    (pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4389) - observation_space = Box(4,)
    (pid=4389) - action_space = Discrete(2)
    (pid=4389) - observable_dim = {'state': 4}
    (pid=4389) - action_dim = 2
    (pid=4389) - is_discrete = True
    (pid=4388) [2020-01-30 11:38:58,313 PID:4440 INFO openai.py __init__] OpenAIEnv:
    (pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4388) - eval_frequency = 2000
    (pid=4388) - log_frequency = 10000
    (pid=4388) - frame_op = None
    (pid=4388) - frame_op_len = None
    (pid=4388) - image_downsize = (84, 84)
    (pid=4388) - normalize_state = False
    (pid=4388) - reward_scale = None
    (pid=4388) - num_envs = 1
    (pid=4388) - name = CartPole-v0
    (pid=4388) - max_t = 200
    (pid=4388) - max_frame = 100000
    (pid=4388) - to_render = False
    (pid=4388) - is_venv = False
    (pid=4388) - clock_speed = 1
    (pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
    (pid=4388) - done = False
    (pid=4388) - total_reward = nan
    (pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4388) - observation_space = Box(4,)
    (pid=4388) - action_space = Discrete(2)
    (pid=4388) - observable_dim = {'state': 4}
    (pid=4388) - action_dim = 2
    (pid=4388) - is_discrete = True
    (pid=4388) [2020-01-30 11:38:58,318 PID:4445 INFO openai.py __init__] OpenAIEnv:
    (pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4388) - eval_frequency = 2000
    (pid=4388) - log_frequency = 10000
    (pid=4388) - frame_op = None
    (pid=4388) - frame_op_len = None
    (pid=4388) - image_downsize = (84, 84)
    (pid=4388) - normalize_state = False
    (pid=4388) - reward_scale = None
    (pid=4388) - num_envs = 1
    (pid=4388) - name = CartPole-v0
    (pid=4388) - max_t = 200
    (pid=4388) - max_frame = 100000
    (pid=4388) - to_render = False
    (pid=4388) - is_venv = False
    (pid=4388) - clock_speed = 1
    (pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
    (pid=4388) - done = False
    (pid=4388) - total_reward = nan
    (pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4388) - observation_space = Box(4,)
    (pid=4388) - action_space = Discrete(2)
    (pid=4388) - observable_dim = {'state': 4}
    (pid=4388) - action_dim = 2
    (pid=4388) - is_discrete = True
    (pid=4388) [2020-01-30 11:38:58,319 PID:4449 INFO openai.py __init__] OpenAIEnv:
    (pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4388) - eval_frequency = 2000
    (pid=4388) - log_frequency = 10000
    (pid=4388) - frame_op = None
    (pid=4388) - frame_op_len = None
    (pid=4388) - image_downsize = (84, 84)
    (pid=4388) - normalize_state = False
    (pid=4388) - reward_scale = None
    (pid=4388) - num_envs = 1
    (pid=4388) - name = CartPole-v0
    (pid=4388) - max_t = 200
    (pid=4388) - max_frame = 100000
    (pid=4388) - to_render = False
    (pid=4388) - is_venv = False
    (pid=4388) - clock_speed = 1
    (pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
    (pid=4388) - done = False
    (pid=4388) - total_reward = nan
    (pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4388) - observation_space = Box(4,)
    (pid=4388) - action_space = Discrete(2)
    (pid=4388) - observable_dim = {'state': 4}
    (pid=4388) - action_dim = 2
    (pid=4388) - is_discrete = True
    (pid=4388) [2020-01-30 11:38:58,323 PID:4452 INFO openai.py __init__] OpenAIEnv:
    (pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
    (pid=4388) - eval_frequency = 2000
    (pid=4388) - log_frequency = 10000
    (pid=4388) - frame_op = None
    (pid=4388) - frame_op_len = None
    (pid=4388) - image_downsize = (84, 84)
    (pid=4388) - normalize_state = False
    (pid=4388) - reward_scale = None
    (pid=4388) - num_envs = 1
    (pid=4388) - name = CartPole-v0
    (pid=4388) - max_t = 200
    (pid=4388) - max_frame = 100000
    (pid=4388) - to_render = False
    (pid=4388) - is_venv = False
    (pid=4388) - clock_speed = 1
    (pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
    (pid=4388) - done = False
    (pid=4388) - total_reward = nan
    (pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
    (pid=4388) - observation_space = Box(4,)
    (pid=4388) - action_space = Discrete(2)
    (pid=4388) - observable_dim = {'state': 4}
    (pid=4388) - action_dim = 2
    (pid=4388) - is_discrete = True
    (pid=4389) [2020-01-30 11:38:58,339 PID:4453 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
    (pid=4389) [2020-01-30 11:38:58,340 PID:4450 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
    (pid=4389) [2020-01-30 11:38:58,343 PID:4456 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
    (pid=4389) [2020-01-30 11:38:58,345 PID:4450 INFO base.py __init__][2020-01-30 11:38:58,345 PID:4453 INFO base.py __init__] Reinforce:
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddcc0>
    (pid=4389) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4389)  'action_policy': 'default',
    (pid=4389)  'center_return': False,
    (pid=4389)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                        'end_val': 0.001,
    (pid=4389)                        'name': 'linear_decay',
    (pid=4389)                        'start_step': 0,
    (pid=4389)                        'start_val': 0.01},
    (pid=4389)  'explore_var_spec': None,
    (pid=4389)  'gamma': 0.99,
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'training_frequency': 1}
    (pid=4389) - name = Reinforce
    (pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4389) - net_spec = {'clip_grad_val': None,
    (pid=4389)  'hid_layers': [64],
    (pid=4389)  'hid_layers_activation': 'selu',
    (pid=4389)  'loss_spec': {'name': 'MSELoss'},
    (pid=4389)  'lr_scheduler_spec': None,
    (pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)  'type': 'MLPNet'}
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddcc0>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bcb710>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bddd68>"
    (pid=4389) }
    (pid=4389) - action_pdtype = default
    (pid=4389) - action_policy = <function default at 0x7fcc21560620>
    (pid=4389) - center_return = False
    (pid=4389) - explore_var_spec = None
    (pid=4389) - entropy_coef_spec = {'end_step': 20000,
    (pid=4389)  'end_val': 0.001,
    (pid=4389)  'name': 'linear_decay',
    (pid=4389)  'start_step': 0,
    (pid=4389)  'start_val': 0.01}
    (pid=4389) - policy_loss_coef = 1.0
    (pid=4389) - gamma = 0.99
    (pid=4389) - training_frequency = 1
    (pid=4389) - to_train = 0
    (pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bddd30>
    (pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdda20>
    (pid=4389) - net = MLPNet(
    (pid=4389)   (model): Sequential(
    (pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4389)     (1): SELU()
    (pid=4389)   )
    (pid=4389)   (model_tail): Sequential(
    (pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4389)   )
    (pid=4389)   (loss_fn): MSELoss()
    (pid=4389) )
    (pid=4389) - net_names = ['net']
    (pid=4389) - optim = Adam (
    (pid=4389) Parameter Group 0
    (pid=4389)     amsgrad: False
    (pid=4389)     betas: (0.9, 0.999)
    (pid=4389)     eps: 1e-08
    (pid=4389)     lr: 0.002
    (pid=4389)     weight_decay: 0
    (pid=4389) )
    (pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10ba20b8>
    (pid=4389) - global_net = None
    (pid=4389)  Reinforce:
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdddd8>
    (pid=4389) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4389)  'action_policy': 'default',
    (pid=4389)  'center_return': False,
    (pid=4389)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4388) [2020-01-30 11:38:58,330 PID:4445 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
    (pid=4388) [2020-01-30 11:38:58,330 PID:4449 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
    (pid=4388) [2020-01-30 11:38:58,335 PID:4452 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
    (pid=4388) [2020-01-30 11:38:58,335 PID:4449 INFO base.py __init__] Reinforce:
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e097f60>
    (pid=4388) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4388)  'action_policy': 'default',
    (pid=4388)  'center_return': True,
    (pid=4388)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4388)                        'end_val': 0.001,
    (pid=4388)                        'name': 'linear_decay',
    (pid=4388)                        'start_step': 0,
    (pid=4388)                        'start_val': 0.01},
    (pid=4388)  'explore_var_spec': None,
    (pid=4388)  'gamma': 0.99,
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'training_frequency': 1}
    (pid=4388) - name = Reinforce
    (pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4388) - net_spec = {'clip_grad_val': None,
    (pid=4388)  'hid_layers': [64],
    (pid=4388)  'hid_layers_activation': 'selu',
    (pid=4388)  'loss_spec': {'name': 'MSELoss'},
    (pid=4388)  'lr_scheduler_spec': None,
    (pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)  'type': 'MLPNet'}
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e097f60>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e097fd0>"
    (pid=4388) }
    (pid=4388) - action_pdtype = default
    (pid=4388) - action_policy = <function default at 0x7fce304ad620>
    (pid=4388) - center_return = True
    (pid=4388) - explore_var_spec = None
    (pid=4388) - entropy_coef_spec = {'end_step': 20000,
    (pid=4388)  'end_val': 0.001,
    (pid=4388)  'name': 'linear_decay',
    (pid=4388)  'start_step': 0,
    (pid=4388)  'start_val': 0.01}
    (pid=4388) - policy_loss_coef = 1.0
    (pid=4388) - gamma = 0.99
    (pid=4388) - training_frequency = 1
    (pid=4388) - to_train = 0
    (pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e097c88>
    (pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e083940>
    (pid=4388) - net = MLPNet(
    (pid=4388)   (model): Sequential(
    (pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4388)     (1): SELU()
    (pid=4388)   )
    (pid=4388)   (model_tail): Sequential(
    (pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4388)   )
    (pid=4388)   (loss_fn): MSELoss()
    (pid=4388) )
    (pid=4388) - net_names = ['net']
    (pid=4388) - optim = Adam (
    (pid=4388) Parameter Group 0
    (pid=4388)     amsgrad: False
    (pid=4388)     betas: (0.9, 0.999)
    (pid=4388)     eps: 1e-08
    (pid=4388)     lr: 0.002
    (pid=4388)     weight_decay: 0
    (pid=4388) )
    (pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e0562e8>
    (pid=4388) - global_net = None
    (pid=4388) [2020-01-30 11:38:58,335 PID:4445 INFO base.py __init__] Reinforce:
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e098da0>
    (pid=4388) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4388)  'action_policy': 'default',
    (pid=4388)  'center_return': True,
    (pid=4388)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                        'end_val': 0.001,
    (pid=4389)                        'name': 'linear_decay',
    (pid=4389)                        'start_step': 0,
    (pid=4389)                        'start_val': 0.01},
    (pid=4389)  'explore_var_spec': None,
    (pid=4389)  'gamma': 0.99,
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'training_frequency': 1}
    (pid=4389) - name = Reinforce
    (pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4389) - net_spec = {'clip_grad_val': None,
    (pid=4389)  'hid_layers': [64],
    (pid=4389)  'hid_layers_activation': 'selu',
    (pid=4389)  'loss_spec': {'name': 'MSELoss'},
    (pid=4389)  'lr_scheduler_spec': None,
    (pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)  'type': 'MLPNet'}
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdddd8>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc5828>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdde80>"
    (pid=4389) }
    (pid=4389) - action_pdtype = default
    (pid=4389) - action_policy = <function default at 0x7fcc21560620>
    (pid=4389) - center_return = False
    (pid=4389) - explore_var_spec = None
    (pid=4389) - entropy_coef_spec = {'end_step': 20000,
    (pid=4389)  'end_val': 0.001,
    (pid=4389)  'name': 'linear_decay',
    (pid=4389)  'start_step': 0,
    (pid=4389)  'start_val': 0.01}
    (pid=4389) - policy_loss_coef = 1.0
    (pid=4389) - gamma = 0.99
    (pid=4389) - training_frequency = 1
    (pid=4389) - to_train = 0
    (pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdde48>
    (pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bddb38>
    (pid=4389) - net = MLPNet(
    (pid=4389)   (model): Sequential(
    (pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4389)     (1): SELU()
    (pid=4389)   )
    (pid=4389)   (model_tail): Sequential(
    (pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4389)   )
    (pid=4389)   (loss_fn): MSELoss()
    (pid=4389) )
    (pid=4389) - net_names = ['net']
    (pid=4389) - optim = Adam (
    (pid=4389) Parameter Group 0
    (pid=4389)     amsgrad: False
    (pid=4389)     betas: (0.9, 0.999)
    (pid=4389)     eps: 1e-08
    (pid=4389)     lr: 0.002
    (pid=4389)     weight_decay: 0
    (pid=4389) )
    (pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10ba11d0>
    (pid=4389) - global_net = None
    (pid=4389) [2020-01-30 11:38:58,347 PID:4453 INFO __init__.py __init__][2020-01-30 11:38:58,347 PID:4450 INFO __init__.py __init__] Agent:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4389)                'action_policy': 'default',
    (pid=4389)                'center_return': False,
    (pid=4389)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                                      'end_val': 0.001,
    (pid=4389)                                      'name': 'linear_decay',
    (pid=4389)                                      'start_step': 0,
    (pid=4389)                                      'start_val': 0.01},
    (pid=4389)                'explore_var_spec': None,
    (pid=4389)                'gamma': 0.99,
    (pid=4389)                'name': 'Reinforce',
    (pid=4389)                'training_frequency': 1},
    (pid=4389)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4388)                        'end_val': 0.001,
    (pid=4388)                        'name': 'linear_decay',
    (pid=4388)                        'start_step': 0,
    (pid=4388)                        'start_val': 0.01},
    (pid=4388)  'explore_var_spec': None,
    (pid=4388)  'gamma': 0.99,
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'training_frequency': 1}
    (pid=4388) - name = Reinforce
    (pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4388) - net_spec = {'clip_grad_val': None,
    (pid=4388)  'hid_layers': [64],
    (pid=4388)  'hid_layers_activation': 'selu',
    (pid=4388)  'loss_spec': {'name': 'MSELoss'},
    (pid=4388)  'lr_scheduler_spec': None,
    (pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)  'type': 'MLPNet'}
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e098da0>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e098e48>"
    (pid=4388) }
    (pid=4388) - action_pdtype = default
    (pid=4388) - action_policy = <function default at 0x7fce304ad620>
    (pid=4388) - center_return = True
    (pid=4388) - explore_var_spec = None
    (pid=4388) - entropy_coef_spec = {'end_step': 20000,
    (pid=4388)  'end_val': 0.001,
    (pid=4388)  'name': 'linear_decay',
    (pid=4388)  'start_step': 0,
    (pid=4388)  'start_val': 0.01}
    (pid=4388) - policy_loss_coef = 1.0
    (pid=4388) - gamma = 0.99
    (pid=4388) - training_frequency = 1
    (pid=4388) - to_train = 0
    (pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e098e10>
    (pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e098f28>
    (pid=4388) - net = MLPNet(
    (pid=4388)   (model): Sequential(
    (pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4388)     (1): SELU()
    (pid=4388)   )
    (pid=4388)   (model_tail): Sequential(
    (pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4388)   )
    (pid=4388)   (loss_fn): MSELoss()
    (pid=4388) )
    (pid=4388) - net_names = ['net']
    (pid=4388) - optim = Adam (
    (pid=4388) Parameter Group 0
    (pid=4388)     amsgrad: False
    (pid=4388)     betas: (0.9, 0.999)
    (pid=4388)     eps: 1e-08
    (pid=4388)     lr: 0.002
    (pid=4388)     weight_decay: 0
    (pid=4388) )
    (pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e05b1d0>
    (pid=4388) - global_net = None
    (pid=4388) [2020-01-30 11:38:58,336 PID:4449 INFO __init__.py __init__] Agent:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4388)                'action_policy': 'default',
    (pid=4388)                'center_return': True,
    (pid=4388)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4388)                                      'end_val': 0.001,
    (pid=4388)                                      'name': 'linear_decay',
    (pid=4388)                                      'start_step': 0,
    (pid=4388)                                      'start_val': 0.01},
    (pid=4388)                'explore_var_spec': None,
    (pid=4388)                'gamma': 0.99,
    (pid=4388)                'name': 'Reinforce',
    (pid=4388)                'training_frequency': 1},
    (pid=4388)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'net': {'clip_grad_val': None,
    (pid=4389)          'hid_layers': [64],
    (pid=4389)          'hid_layers_activation': 'selu',
    (pid=4389)          'loss_spec': {'name': 'MSELoss'},
    (pid=4389)          'lr_scheduler_spec': None,
    (pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)          'type': 'MLPNet'}}
    (pid=4389) - name = Reinforce
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdddd8>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc5828>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdde80>"
    (pid=4389) }
    (pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10bdde10>
    (pid=4389)  Agent:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4389)                'action_policy': 'default',
    (pid=4389)                'center_return': False,
    (pid=4389)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                                      'end_val': 0.001,
    (pid=4389)                                      'name': 'linear_decay',
    (pid=4389)                                      'start_step': 0,
    (pid=4389)                                      'start_val': 0.01},
    (pid=4389)                'explore_var_spec': None,
    (pid=4389)                'gamma': 0.99,
    (pid=4389)                'name': 'Reinforce',
    (pid=4389)                'training_frequency': 1},
    (pid=4389)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'net': {'clip_grad_val': None,
    (pid=4389)          'hid_layers': [64],
    (pid=4389)          'hid_layers_activation': 'selu',
    (pid=4389)          'loss_spec': {'name': 'MSELoss'},
    (pid=4389)          'lr_scheduler_spec': None,
    (pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)          'type': 'MLPNet'}}
    (pid=4389) - name = Reinforce
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddcc0>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bcb710>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bddd68>"
    (pid=4389) }
    (pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10bddcf8>
    (pid=4389) [2020-01-30 11:38:58,347 PID:4458 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search[2020-01-30 11:38:58,347 PID:4450 INFO logger.py info][2020-01-30 11:38:58,347 PID:4453 INFO logger.py info]
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'net': {'clip_grad_val': None,
    (pid=4388)          'hid_layers': [64],
    (pid=4388)          'hid_layers_activation': 'selu',
    (pid=4388)          'loss_spec': {'name': 'MSELoss'},
    (pid=4388)          'lr_scheduler_spec': None,
    (pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)          'type': 'MLPNet'}}
    (pid=4388) - name = Reinforce
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e097f60>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e097fd0>"
    (pid=4388) }
    (pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e097f98>
    (pid=4388) [2020-01-30 11:38:58,337 PID:4449 INFO logger.py info] Session:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - index = 2
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e097f60>
    (pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>
    (pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>
    (pid=4388) [2020-01-30 11:38:58,337 PID:4449 INFO logger.py info] Running RL loop for trial 0 session 2
    (pid=4388) [2020-01-30 11:38:58,337 PID:4445 INFO __init__.py __init__] Agent:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4388)                'action_policy': 'default',
    (pid=4388)                'center_return': True,
    (pid=4388)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4388)                                      'end_val': 0.001,
    (pid=4388)                                      'name': 'linear_decay',
    (pid=4388)                                      'start_step': 0,
    (pid=4388)                                      'start_val': 0.01},
    (pid=4388)                'explore_var_spec': None,
    (pid=4388)                'gamma': 0.99,
    (pid=4388)                'name': 'Reinforce',
    (pid=4388)                'training_frequency': 1},
    (pid=4388)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'net': {'clip_grad_val': None,
    (pid=4388)          'hid_layers': [64],
    (pid=4388)          'hid_layers_activation': 'selu',
    (pid=4388)          'loss_spec': {'name': 'MSELoss'},
    (pid=4388)          'lr_scheduler_spec': None,
    (pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)          'type': 'MLPNet'}}
    (pid=4388) - name = Reinforce
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e098da0>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4389)  Session:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - index = 0
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddcc0>
    (pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0>
    (pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0> Session:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - index = 1
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdddd8>
    (pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>
    (pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>
    (pid=4389) 
    (pid=4389) [2020-01-30 11:38:58,347 PID:4450 INFO logger.py info] Running RL loop for trial 1 session 0[2020-01-30 11:38:58,347 PID:4453 INFO logger.py info]
    (pid=4389)  Running RL loop for trial 1 session 1
    (pid=4389) [2020-01-30 11:38:58,348 PID:4456 INFO base.py __init__] Reinforce:
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdcf28>
    (pid=4389) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4389)  'action_policy': 'default',
    (pid=4389)  'center_return': False,
    (pid=4389)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                        'end_val': 0.001,
    (pid=4389)                        'name': 'linear_decay',
    (pid=4389)                        'start_step': 0,
    (pid=4389)                        'start_val': 0.01},
    (pid=4389)  'explore_var_spec': None,
    (pid=4389)  'gamma': 0.99,
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'training_frequency': 1}
    (pid=4389) - name = Reinforce
    (pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4389) - net_spec = {'clip_grad_val': None,
    (pid=4389)  'hid_layers': [64],
    (pid=4389)  'hid_layers_activation': 'selu',
    (pid=4389)  'loss_spec': {'name': 'MSELoss'},
    (pid=4389)  'lr_scheduler_spec': None,
    (pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)  'type': 'MLPNet'}
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdcf28>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc7940>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdcfd0>"
    (pid=4389) }
    (pid=4389) - action_pdtype = default
    (pid=4389) - action_policy = <function default at 0x7fcc21560620>
    (pid=4389) - center_return = False
    (pid=4389) - explore_var_spec = None
    (pid=4389) - entropy_coef_spec = {'end_step': 20000,
    (pid=4389)  'end_val': 0.001,
    (pid=4389)  'name': 'linear_decay',
    (pid=4389)  'start_step': 0,
    (pid=4389)  'start_val': 0.01}
    (pid=4389) - policy_loss_coef = 1.0
    (pid=4389) - gamma = 0.99
    (pid=4389) - training_frequency = 1
    (pid=4389) - to_train = 0
    (pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdcf98>
    (pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdcc50>
    (pid=4389) - net = MLPNet(
    (pid=4389)   (model): Sequential(
    (pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4389)     (1): SELU()
    (pid=4389)   )
    (pid=4389)   (model_tail): Sequential(
    (pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4389)   )
    (pid=4389)   (loss_fn): MSELoss()
    (pid=4389) )
    (pid=4389) - net_names = ['net']
    (pid=4389) - optim = Adam (
    (pid=4389) Parameter Group 0
    (pid=4389)     amsgrad: False
    (pid=4389)     betas: (0.9, 0.999)
    (pid=4389)     eps: 1e-08
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e098e48>"
    (pid=4388) }
    (pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e098dd8>
    (pid=4388) [2020-01-30 11:38:58,338 PID:4445 INFO logger.py info] Session:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - index = 1
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e098da0>
    (pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>
    (pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>
    (pid=4388) [2020-01-30 11:38:58,338 PID:4445 INFO logger.py info] Running RL loop for trial 0 session 1
    (pid=4388) [2020-01-30 11:38:58,340 PID:4449 INFO __init__.py log_summary] Trial 0 session 2 reinforce_baseline_cartpole_t0_s2 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4388) [2020-01-30 11:38:58,340 PID:4452 INFO base.py __init__] Reinforce:
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e082a58>
    (pid=4388) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4388)  'action_policy': 'default',
    (pid=4388)  'center_return': True,
    (pid=4388)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4388)                        'end_val': 0.001,
    (pid=4388)                        'name': 'linear_decay',
    (pid=4388)                        'start_step': 0,
    (pid=4388)                        'start_val': 0.01},
    (pid=4388)  'explore_var_spec': None,
    (pid=4388)  'gamma': 0.99,
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'training_frequency': 1}
    (pid=4388) - name = Reinforce
    (pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4388) - net_spec = {'clip_grad_val': None,
    (pid=4388)  'hid_layers': [64],
    (pid=4388)  'hid_layers_activation': 'selu',
    (pid=4388)  'loss_spec': {'name': 'MSELoss'},
    (pid=4388)  'lr_scheduler_spec': None,
    (pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)  'type': 'MLPNet'}
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e082a58>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e0540b8>"
    (pid=4388) }
    (pid=4388) - action_pdtype = default
    (pid=4388) - action_policy = <function default at 0x7fce304ad620>
    (pid=4388) - center_return = True
    (pid=4388) - explore_var_spec = None
    (pid=4388) - entropy_coef_spec = {'end_step': 20000,
    (pid=4388)  'end_val': 0.001,
    (pid=4388)  'name': 'linear_decay',
    (pid=4388)  'start_step': 0,
    (pid=4388)  'start_val': 0.01}
    (pid=4388) - policy_loss_coef = 1.0
    (pid=4388) - gamma = 0.99
    (pid=4388) - training_frequency = 1
    (pid=4388) - to_train = 0
    (pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e054080>
    (pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e054160>
    (pid=4388) - net = MLPNet(
    (pid=4388)   (model): Sequential(
    (pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4388)     (1): SELU()
    (pid=4388)   )
    (pid=4388)   (model_tail): Sequential(
    (pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4388)   )
    (pid=4388)   (loss_fn): MSELoss()
    (pid=4388) )
    (pid=4388) - net_names = ['net']
    (pid=4388) - optim = Adam (
    (pid=4388) Parameter Group 0
    (pid=4388)     amsgrad: False
    (pid=4388)     betas: (0.9, 0.999)
    (pid=4388)     eps: 1e-08
    (pid=4389)     lr: 0.002
    (pid=4389)     weight_decay: 0
    (pid=4389) )
    (pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10b9a2e8>
    (pid=4389) - global_net = None
    (pid=4389) [2020-01-30 11:38:58,350 PID:4456 INFO __init__.py __init__] Agent:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4389)                'action_policy': 'default',
    (pid=4389)                'center_return': False,
    (pid=4389)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                                      'end_val': 0.001,
    (pid=4389)                                      'name': 'linear_decay',
    (pid=4389)                                      'start_step': 0,
    (pid=4389)                                      'start_val': 0.01},
    (pid=4389)                'explore_var_spec': None,
    (pid=4389)                'gamma': 0.99,
    (pid=4389)                'name': 'Reinforce',
    (pid=4389)                'training_frequency': 1},
    (pid=4389)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'net': {'clip_grad_val': None,
    (pid=4389)          'hid_layers': [64],
    (pid=4389)          'hid_layers_activation': 'selu',
    (pid=4389)          'loss_spec': {'name': 'MSELoss'},
    (pid=4389)          'lr_scheduler_spec': None,
    (pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)          'type': 'MLPNet'}}
    (pid=4389) - name = Reinforce
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdcf28>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc7940>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdcfd0>"
    (pid=4389) }
    (pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10bdcf60>
    (pid=4389) [2020-01-30 11:38:58,351 PID:4456 INFO logger.py info] Session:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - index = 2
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdcf28>
    (pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>
    (pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>
    (pid=4389) [2020-01-30 11:38:58,351 PID:4456 INFO logger.py info] Running RL loop for trial 1 session 2
    (pid=4389) [2020-01-30 11:38:58,351 PID:4450 INFO __init__.py log_summary] Trial 1 session 0 reinforce_baseline_cartpole_t1_s0 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4389) [2020-01-30 11:38:58,351 PID:4453 INFO __init__.py log_summary] Trial 1 session 1 reinforce_baseline_cartpole_t1_s1 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4389) [2020-01-30 11:38:58,352 PID:4458 INFO base.py __init__] Reinforce:
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddd68>
    (pid=4389) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4389)  'action_policy': 'default',
    (pid=4389)  'center_return': False,
    (pid=4389)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                        'end_val': 0.001,
    (pid=4389)                        'name': 'linear_decay',
    (pid=4389)                        'start_step': 0,
    (pid=4389)                        'start_val': 0.01},
    (pid=4389)  'explore_var_spec': None,
    (pid=4389)  'gamma': 0.99,
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'training_frequency': 1}
    (pid=4389) - name = Reinforce
    (pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4389) - net_spec = {'clip_grad_val': None,
    (pid=4389)  'hid_layers': [64],
    (pid=4389)  'hid_layers_activation': 'selu',
    (pid=4389)  'loss_spec': {'name': 'MSELoss'},
    (pid=4389)  'lr_scheduler_spec': None,
    (pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)  'type': 'MLPNet'}
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddd68>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4388)     lr: 0.002
    (pid=4388)     weight_decay: 0
    (pid=4388) )
    (pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e054400>
    (pid=4388) - global_net = None
    (pid=4388) [2020-01-30 11:38:58,342 PID:4445 INFO __init__.py log_summary] Trial 0 session 1 reinforce_baseline_cartpole_t0_s1 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4388) [2020-01-30 11:38:58,342 PID:4452 INFO __init__.py __init__] Agent:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4388)                'action_policy': 'default',
    (pid=4388)                'center_return': True,
    (pid=4388)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4388)                                      'end_val': 0.001,
    (pid=4388)                                      'name': 'linear_decay',
    (pid=4388)                                      'start_step': 0,
    (pid=4388)                                      'start_val': 0.01},
    (pid=4388)                'explore_var_spec': None,
    (pid=4388)                'gamma': 0.99,
    (pid=4388)                'name': 'Reinforce',
    (pid=4388)                'training_frequency': 1},
    (pid=4388)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'net': {'clip_grad_val': None,
    (pid=4388)          'hid_layers': [64],
    (pid=4388)          'hid_layers_activation': 'selu',
    (pid=4388)          'loss_spec': {'name': 'MSELoss'},
    (pid=4388)          'lr_scheduler_spec': None,
    (pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)          'type': 'MLPNet'}}
    (pid=4388) - name = Reinforce
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e082a58>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e0540b8>"
    (pid=4388) }
    (pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e054048>
    (pid=4388) [2020-01-30 11:38:58,342 PID:4452 INFO logger.py info] Session:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - index = 3
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e082a58>
    (pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>
    (pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>
    (pid=4388) [2020-01-30 11:38:58,342 PID:4452 INFO logger.py info] Running RL loop for trial 0 session 3
    (pid=4388) [2020-01-30 11:38:58,343 PID:4440 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
    (pid=4388) [2020-01-30 11:38:58,346 PID:4452 INFO __init__.py log_summary] Trial 0 session 3 reinforce_baseline_cartpole_t0_s3 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4388) [2020-01-30 11:38:58,348 PID:4440 INFO base.py __init__] Reinforce:
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e09ac88>
    (pid=4388) - algorithm_spec = {'action_pdtype': 'default',
    (pid=4388)  'action_policy': 'default',
    (pid=4388)  'center_return': True,
    (pid=4388)  'entropy_coef_spec': {'end_step': 20000,
    (pid=4388)                        'end_val': 0.001,
    (pid=4388)                        'name': 'linear_decay',
    (pid=4388)                        'start_step': 0,
    (pid=4388)                        'start_val': 0.01},
    (pid=4388)  'explore_var_spec': None,
    (pid=4388)  'gamma': 0.99,
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'training_frequency': 1}
    (pid=4388) - name = Reinforce
    (pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
    (pid=4388) - net_spec = {'clip_grad_val': None,
    (pid=4388)  'hid_layers': [64],
    (pid=4388)  'hid_layers_activation': 'selu',
    (pid=4388)  'loss_spec': {'name': 'MSELoss'},
    (pid=4388)  'lr_scheduler_spec': None,
    (pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)  'type': 'MLPNet'}
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e09ac88>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388) terminate called after throwing an instance of 'c10::Error'
    (pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4388) 
    (pid=4388) Fatal Python error: Aborted
    (pid=4388) 
    (pid=4388) Stack (most recent call first):
    (pid=4388) terminate called after throwing an instance of 'c10::Error'
    (pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4388) 
    (pid=4388) Fatal Python error: Aborted
    (pid=4388) 
    (pid=4388) Stack (most recent call first):
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc6a58>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10b9a0b8>"
    (pid=4389) }
    (pid=4389) - action_pdtype = default
    (pid=4389) - action_policy = <function default at 0x7fcc21560620>
    (pid=4389) - center_return = False
    (pid=4389) - explore_var_spec = None
    (pid=4389) - entropy_coef_spec = {'end_step': 20000,
    (pid=4389)  'end_val': 0.001,
    (pid=4389)  'name': 'linear_decay',
    (pid=4389)  'start_step': 0,
    (pid=4389)  'start_val': 0.01}
    (pid=4389) - policy_loss_coef = 1.0
    (pid=4389) - gamma = 0.99
    (pid=4389) - training_frequency = 1
    (pid=4389) - to_train = 0
    (pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10b9a080>
    (pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10b9a160>
    (pid=4389) - net = MLPNet(
    (pid=4389)   (model): Sequential(
    (pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4389)     (1): SELU()
    (pid=4389)   )
    (pid=4389)   (model_tail): Sequential(
    (pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4389)   )
    (pid=4389)   (loss_fn): MSELoss()
    (pid=4389) )
    (pid=4389) - net_names = ['net']
    (pid=4389) - optim = Adam (
    (pid=4389) Parameter Group 0
    (pid=4389)     amsgrad: False
    (pid=4389)     betas: (0.9, 0.999)
    (pid=4389)     eps: 1e-08
    (pid=4389)     lr: 0.002
    (pid=4389)     weight_decay: 0
    (pid=4389) )
    (pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10b9a400>
    (pid=4389) - global_net = None
    (pid=4389) [2020-01-30 11:38:58,354 PID:4458 INFO __init__.py __init__] Agent:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4389)                'action_policy': 'default',
    (pid=4389)                'center_return': False,
    (pid=4389)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4389)                                      'end_val': 0.001,
    (pid=4389)                                      'name': 'linear_decay',
    (pid=4389)                                      'start_step': 0,
    (pid=4389)                                      'start_val': 0.01},
    (pid=4389)                'explore_var_spec': None,
    (pid=4389)                'gamma': 0.99,
    (pid=4389)                'name': 'Reinforce',
    (pid=4389)                'training_frequency': 1},
    (pid=4389)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4389)  'name': 'Reinforce',
    (pid=4389)  'net': {'clip_grad_val': None,
    (pid=4389)          'hid_layers': [64],
    (pid=4389)          'hid_layers_activation': 'selu',
    (pid=4389)          'loss_spec': {'name': 'MSELoss'},
    (pid=4389)          'lr_scheduler_spec': None,
    (pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4389)          'type': 'MLPNet'}}
    (pid=4389) - name = Reinforce
    (pid=4389) - body = body: {
    (pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddd68>",
    (pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>",
    (pid=4389)   "a": 0,
    (pid=4389)   "e": 0,
    (pid=4389)   "b": 0,
    (pid=4389)   "aeb": "(0, 0, 0)",
    (pid=4389)   "explore_var": NaN,
    (pid=4389)   "entropy_coef": 0.01,
    (pid=4389)   "loss": NaN,
    (pid=4389)   "mean_entropy": NaN,
    (pid=4389)   "mean_grad_norm": NaN,
    (pid=4389)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "total_reward_ma": NaN,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e09ad30>"
    (pid=4388) }
    (pid=4388) - action_pdtype = default
    (pid=4388) - action_policy = <function default at 0x7fce304ad620>
    (pid=4388) - center_return = True
    (pid=4388) - explore_var_spec = None
    (pid=4388) - entropy_coef_spec = {'end_step': 20000,
    (pid=4388)  'end_val': 0.001,
    (pid=4388)  'name': 'linear_decay',
    (pid=4388)  'start_step': 0,
    (pid=4388)  'start_val': 0.01}
    (pid=4388) - policy_loss_coef = 1.0
    (pid=4388) - gamma = 0.99
    (pid=4388) - training_frequency = 1
    (pid=4388) - to_train = 0
    (pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e09acf8>
    (pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e09ae10>
    (pid=4388) - net = MLPNet(
    (pid=4388)   (model): Sequential(
    (pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
    (pid=4388)     (1): SELU()
    (pid=4388)   )
    (pid=4388)   (model_tail): Sequential(
    (pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
    (pid=4388)   )
    (pid=4388)   (loss_fn): MSELoss()
    (pid=4388) )
    (pid=4388) - net_names = ['net']
    (pid=4388) - optim = Adam (
    (pid=4388) Parameter Group 0
    (pid=4388)     amsgrad: False
    (pid=4388)     betas: (0.9, 0.999)
    (pid=4388)     eps: 1e-08
    (pid=4388)     lr: 0.002
    (pid=4388)     weight_decay: 0
    (pid=4388) )
    (pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e05c0b8>
    (pid=4388) - global_net = None
    (pid=4388) [2020-01-30 11:38:58,350 PID:4440 INFO __init__.py __init__] Agent:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
    (pid=4388)                'action_policy': 'default',
    (pid=4388)                'center_return': True,
    (pid=4388)                'entropy_coef_spec': {'end_step': 20000,
    (pid=4388)                                      'end_val': 0.001,
    (pid=4388)                                      'name': 'linear_decay',
    (pid=4388)                                      'start_step': 0,
    (pid=4388)                                      'start_val': 0.01},
    (pid=4388)                'explore_var_spec': None,
    (pid=4388)                'gamma': 0.99,
    (pid=4388)                'name': 'Reinforce',
    (pid=4388)                'training_frequency': 1},
    (pid=4388)  'memory': {'name': 'OnPolicyReplay'},
    (pid=4388)  'name': 'Reinforce',
    (pid=4388)  'net': {'clip_grad_val': None,
    (pid=4388)          'hid_layers': [64],
    (pid=4388)          'hid_layers_activation': 'selu',
    (pid=4388)          'loss_spec': {'name': 'MSELoss'},
    (pid=4388)          'lr_scheduler_spec': None,
    (pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
    (pid=4388)          'type': 'MLPNet'}}
    (pid=4388) - name = Reinforce
    (pid=4388) - body = body: {
    (pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e09ac88>",
    (pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>",
    (pid=4388)   "a": 0,
    (pid=4388)   "e": 0,
    (pid=4388)   "b": 0,
    (pid=4388)   "aeb": "(0, 0, 0)",
    (pid=4388)   "explore_var": NaN,
    (pid=4388)   "entropy_coef": 0.01,
    (pid=4388)   "loss": NaN,
    (pid=4388)   "mean_entropy": NaN,
    (pid=4388)   "mean_grad_norm": NaN,
    (pid=4388)   "best_total_reward_ma": -Infinity,
    (pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc6a58>",
    (pid=4389)   "tb_actions": [],
    (pid=4389)   "tb_tracker": {},
    (pid=4389)   "observation_space": "Box(4,)",
    (pid=4389)   "action_space": "Discrete(2)",
    (pid=4389)   "observable_dim": {
    (pid=4389)     "state": 4
    (pid=4389)   },
    (pid=4389)   "state_dim": 4,
    (pid=4389)   "action_dim": 2,
    (pid=4389)   "is_discrete": true,
    (pid=4389)   "action_type": "discrete",
    (pid=4389)   "action_pdtype": "Categorical",
    (pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10b9a0b8>"
    (pid=4389) }
    (pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10b9a048>
    (pid=4389) [2020-01-30 11:38:58,354 PID:4458 INFO logger.py info] Session:
    (pid=4389) - spec = reinforce_baseline_cartpole
    (pid=4389) - index = 3
    (pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddd68>
    (pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>
    (pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>
    (pid=4389) [2020-01-30 11:38:58,354 PID:4458 INFO logger.py info] Running RL loop for trial 1 session 3
    (pid=4389) [2020-01-30 11:38:58,355 PID:4456 INFO __init__.py log_summary] Trial 1 session 2 reinforce_baseline_cartpole_t1_s2 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4389) [2020-01-30 11:38:58,358 PID:4458 INFO __init__.py log_summary] Trial 1 session 3 reinforce_baseline_cartpole_t1_s3 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4388)   "total_reward_ma": NaN,
    (pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
    (pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
    (pid=4388)   "tb_actions": [],
    (pid=4388)   "tb_tracker": {},
    (pid=4388)   "observation_space": "Box(4,)",
    (pid=4388)   "action_space": "Discrete(2)",
    (pid=4388)   "observable_dim": {
    (pid=4388)     "state": 4
    (pid=4388)   },
    (pid=4388)   "state_dim": 4,
    (pid=4388)   "action_dim": 2,
    (pid=4388)   "is_discrete": true,
    (pid=4388)   "action_type": "discrete",
    (pid=4388)   "action_pdtype": "Categorical",
    (pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
    (pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e09ad30>"
    (pid=4388) }
    (pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e09acc0>
    (pid=4388) [2020-01-30 11:38:58,350 PID:4440 INFO logger.py info] Session:
    (pid=4388) - spec = reinforce_baseline_cartpole
    (pid=4388) - index = 0
    (pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e09ac88>
    (pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>
    (pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>
    (pid=4388) [2020-01-30 11:38:58,350 PID:4440 INFO logger.py info] Running RL loop for trial 0 session 0
    (pid=4388) [2020-01-30 11:38:58,354 PID:4440 INFO __init__.py log_summary] Trial 0 session 0 reinforce_baseline_cartpole_t0_s0 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
    (pid=4388) terminate called after throwing an instance of 'c10::Error'
    (pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4388) 
    (pid=4388) Fatal Python error: Aborted
    (pid=4388) 
    (pid=4388) Stack (most recent call first):
    (pid=4389) terminate called after throwing an instance of 'c10::Error'
    (pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4389) 
    (pid=4389) Fatal Python error: Aborted
    (pid=4389) 
    (pid=4389) Stack (most recent call first):
    (pid=4389) terminate called after throwing an instance of 'c10::Error'
    (pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4389) 
    (pid=4389) Fatal Python error: Aborted
    (pid=4389) 
    (pid=4389) Stack (most recent call first):
    (pid=4389) terminate called after throwing an instance of 'c10::Error'
    (pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4389) 
    (pid=4389) Fatal Python error: Aborted
    (pid=4389) 
    (pid=4389) Stack (most recent call first):
    (pid=4389) terminate called after throwing an instance of 'c10::Error'
    (pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
    (pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
    (pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    (pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
    (pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    (pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
    (pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
    (pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
    (pid=4389) 
    (pid=4389) Fatal Python error: Aborted
    (pid=4389) 
    (pid=4389) Stack (most recent call first):
    (pid=4388) 2020-01-30 11:38:58,550	ERROR function_runner.py:96 -- Runner Thread raised error.
    (pid=4388) Traceback (most recent call last):
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
    (pid=4388)     self._entrypoint()
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
    (pid=4388)     return self._trainable_func(config, self._status_reporter)
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
    (pid=4388)     output = train_func(config, reporter)
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
    (pid=4388)     metrics = Trial(spec).run()
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
    (pid=4388)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
    (pid=4388)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
    (pid=4388)     frames = session_metrics_list[0]['local']['frames']
    (pid=4388) IndexError: list index out of range
    (pid=4388) Exception in thread Thread-1:
    (pid=4388) Traceback (most recent call last):
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
    (pid=4388)     self._entrypoint()
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
    (pid=4388)     return self._trainable_func(config, self._status_reporter)
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
    (pid=4388)     output = train_func(config, reporter)
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
    (pid=4388)     metrics = Trial(spec).run()
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
    (pid=4388)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
    (pid=4388)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
    (pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
    (pid=4388)     frames = session_metrics_list[0]['local']['frames']
    (pid=4388) IndexError: list index out of range
    (pid=4388) 
    (pid=4388) During handling of the above exception, another exception occurred:
    (pid=4388) 
    (pid=4388) Traceback (most recent call last):
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    (pid=4388)     self.run()
    (pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 102, in run
    (pid=4388)     err_tb = err_tb.format_exc()
    (pid=4388) AttributeError: 'traceback' object has no attribute 'format_exc'
    (pid=4388) 
    (pid=4389) 2020-01-30 11:38:58,570	ERROR function_runner.py:96 -- Runner Thread raised error.
    (pid=4389) Traceback (most recent call last):
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
    (pid=4389)     self._entrypoint()
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
    (pid=4389)     return self._trainable_func(config, self._status_reporter)
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
    (pid=4389)     output = train_func(config, reporter)
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
    (pid=4389)     metrics = Trial(spec).run()
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
    (pid=4389)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
    (pid=4389)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
    (pid=4389)     frames = session_metrics_list[0]['local']['frames']
    (pid=4389) IndexError: list index out of range
    (pid=4389) Exception in thread Thread-1:
    (pid=4389) Traceback (most recent call last):
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
    (pid=4389)     self._entrypoint()
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
    (pid=4389)     return self._trainable_func(config, self._status_reporter)
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
    (pid=4389)     output = train_func(config, reporter)
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
    (pid=4389)     metrics = Trial(spec).run()
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
    (pid=4389)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
    (pid=4389)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
    (pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
    (pid=4389)     frames = session_metrics_list[0]['local']['frames']
    (pid=4389) IndexError: list index out of range
    (pid=4389) 
    (pid=4389) During handling of the above exception, another exception occurred:
    (pid=4389) 
    (pid=4389) Traceback (most recent call last):
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    (pid=4389)     self.run()
    (pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 102, in run
    (pid=4389)     err_tb = err_tb.format_exc()
    (pid=4389) AttributeError: 'traceback' object has no attribute 'format_exc'
    (pid=4389) 
    2020-01-30 11:38:59,690	ERROR trial_runner.py:497 -- Error processing event.
    Traceback (most recent call last):
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 446, in _process_trial
        result = self.trial_executor.fetch_result(trial)
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 316, in fetch_result
        result = ray.get(trial_future[0])
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
        raise value
    ray.exceptions.RayTaskError: ray_worker (pid=4388, host=Gauss)
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trainable.py", line 151, in train
        result = self._train()
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 203, in _train
        ("Wrapped function ran until completion without reporting "
    ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.
    
    2020-01-30 11:38:59,694	INFO ray_trial_executor.py:180 -- Destroying actor for trial ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
    2020-01-30 11:38:59,705	ERROR trial_runner.py:497 -- Error processing event.
    Traceback (most recent call last):
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 446, in _process_trial
        result = self.trial_executor.fetch_result(trial)
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 316, in fetch_result
        result = ray.get(trial_future[0])
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
        raise value
    ray.exceptions.RayTaskError: ray_worker (pid=4389, host=Gauss)
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trainable.py", line 151, in train
        result = self._train()
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 203, in _train
        ("Wrapped function ran until completion without reporting "
    ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.
    
    2020-01-30 11:38:59,707	INFO ray_trial_executor.py:180 -- Destroying actor for trial ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
    == Status ==
    Using FIFO scheduling algorithm.
    Resources requested: 0/8 CPUs, 0/1 GPUs
    Memory usage on this node: 2.5/16.7 GB
    Result logdir: /home/joe/ray_results/reinforce_baseline_cartpole
    Number of trials: 2 ({'ERROR': 2})
    ERROR trials:
     - ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0:	ERROR, 1 failures: /home/joe/ray_results/reinforce_baseline_cartpole/ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0_2020-01-30_11-38-57n2qc80ke/error_2020-01-30_11-38-59.txt
     - ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1:	ERROR, 1 failures: /home/joe/ray_results/reinforce_baseline_cartpole/ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1_2020-01-30_11-38-57unqmlqvg/error_2020-01-30_11-38-59.txt
    
    Traceback (most recent call last):
      File "run_lab.py", line 80, in <module>
        main()
      File "run_lab.py", line 72, in main
        read_spec_and_run(*args)
      File "run_lab.py", line 56, in read_spec_and_run
        run_spec(spec, lab_mode)
      File "run_lab.py", line 35, in run_spec
        Experiment(spec).run()
      File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 203, in run
        trial_data_dict = search.run_ray_search(self.spec)
      File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 124, in run_ray_search
        server_port=util.get_port(),
      File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/tune.py", line 265, in run
        raise TuneError("Trials did not complete", errored_trials)
    ray.tune.error.TuneError: ('Trials did not complete', [ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0, ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1])
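    In every worker, the first error is "IndexError: list index out of range" at session_metrics_list[0]. That means the trial received no session metrics at all. The AttributeError ('traceback' object has no attribute 'format_exc') and the "Wrapped function ran until completion" TuneError that follow look like secondary failures in Ray's own error handling rather than the root cause. Below is a minimal sketch of a defensive check. It assumes per-session dicts shaped like {'local': {'frames': ...}}, as the traceback implies. first_session_frames is a hypothetical helper, not SLM Lab's actual code:

```python
from typing import Any, Dict, List


def first_session_frames(session_metrics_list: List[Dict[str, Any]]) -> Any:
    """Return the 'frames' entry of the first session's 'local' metrics.

    Hypothetical helper: fails with a descriptive error instead of a bare
    IndexError when no session reported any metrics.
    """
    if not session_metrics_list:
        raise RuntimeError(
            'No session metrics were returned for this trial; '
            'check the session logs for the original failure.'
        )
    return session_metrics_list[0]['local']['frames']
```

    Failing early with a clear message keeps the real question (why the sessions returned nothing) visible. Otherwise it gets buried under Ray's follow-up errors.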
    
    dependency 
    opened by xombio 16
  • ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command

    Describe the bug: After successfully installing SLM-Lab and proceeding to the "Quick Start" section, which runs DQN on the CartPole environment, everything works well (i.e. final_return_ma increases).

    Command entered: python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev

    After several log summaries and metrics are printed, an OpenGL error occurs:

    [101017:1015/191313.594764:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command

    and then the process seems to end without showing any graphs.

    To Reproduce

    1. OS and environment: Ubuntu 20.04 LTS

    2. SLM Lab git SHA (run git rev-parse HEAD to get it): dda02d00031553aeda4c49c5baa7d0706c53996b

    3. spec file used: slm_lab/spec/demo.json

    Error logs

    [2020-10-15 19:13:09,800 PID:100781 INFO __init__.py log_summary] Trial 0 session 0 dqn_cartpole_t0_s0 [train_df] epi: 123  t: 120  wall_t: 153  opt_step: 398720  frame: 10000  fps: 65.3595  total_reward: 200  total_reward_ma: 142.7  loss: 5.46846  lr: 0.00774841  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: 0.230459
    [2020-10-15 19:13:09,821 PID:100781 INFO __init__.py log_metrics] Trial 0 session 0 dqn_cartpole_t0_s0 [train_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.00019783  training_efficiency: 5.02079e-06  stability: 0.926742
    [100946:1015/191310.923076:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
    [2020-10-15 19:13:12,794 PID:100781 INFO __init__.py log_metrics] Trial 0 session 0 dqn_cartpole_t0_s0 [eval_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.00019783  training_efficiency: 5.02079e-06  stability: 0.926742
    [2020-10-15 19:13:12,798 PID:100781 INFO logger.py info] Session 0 done
    [101017:1015/191313.594764:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
    [2020-10-15 19:13:15,443 PID:100781 INFO logger.py info] Trial 0 done
    
    
    
    
    opened by Nick-Kou 11
  • Error at end the execution

    Hi, I get stuck at the end of the trial: when it finishes, it can't create the corresponding graphs, and I get the following traceback. What could it be?

    Traceback (most recent call last):
      File "run_lab.py", line 63, in <module>
        main()
      File "run_lab.py", line 59, in main
        run_by_mode(spec_file, spec_name, lab_mode)
      File "run_lab.py", line 38, in run_by_mode
        Trial(spec).run()
      File "/home/kelo/librerias/SLM-Lab/slm_lab/experiment/control.py", line 122, in run
        session_datas = util.parallelize_fn(self.init_session_and_run, info_spaces, num_cpus)
      File "/home/kelo/librerias/SLM-Lab/slm_lab/lib/util.py", line 533, in parallelize_fn
        results = pool.map(fn, args)
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
        raise self._value
    ValueError: Invalid property specified for object of type plotly.graph_objs.Layout: 'yaxis2'

    opened by angel-ayala 10
  • Arch Install

    Hi, I'm having trouble with the installation because of my Linux distro. Can you list the packages required to run the "yarn install" command successfully?

    It looks like a great framework and I'd like to test it. Thanks and regards.

    opened by angel-ayala 8
  • How to add a non-gym environment?

    Hi kengz, how can I add a non-gym environment, such as the Mahjong or Poker environments in the rlcard project (https://github.com/datamllab/rlcard)? Could you provide a simple demo of adding a new non-gym env, or give some suggestions on how to add one quickly?
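    A common way to plug in a non-gym environment is to wrap it in an adapter that exposes the classic Gym-style interface. In that interface, reset() returns an initial state and step(action) returns (state, reward, done, info). The self-contained sketch below is hypothetical. ToyCardGame stands in for an external game such as an rlcard environment, and neither class uses rlcard's or SLM Lab's actual APIs. A real integration would also need to describe the observation and action spaces in whatever form the framework's env class expects.

```python
import random
from typing import Any, Dict, Tuple


class ToyCardGame:
    """Hypothetical external game with its own, non-gym API."""

    def __init__(self, limit: int = 21, seed: int = 0):
        self.limit = limit
        self.rng = random.Random(seed)
        self.total = 0

    def new_game(self) -> int:
        self.total = 0
        return self.total

    def play(self, draw: bool) -> Tuple[int, bool]:
        """Optionally draw a card; the game ends on a stop or a bust."""
        if draw:
            self.total += self.rng.randint(1, 10)
        busted = self.total > self.limit
        return self.total, busted or not draw


class GymStyleWrapper:
    """Adapter exposing Gym-style reset()/step() over ToyCardGame."""

    n_actions = 2  # 0 = stop, 1 = draw

    def __init__(self, game: ToyCardGame):
        self.game = game

    def reset(self) -> int:
        return self.game.new_game()

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        state, done = self.game.play(draw=action == 1)
        reward = 0.0
        if done:
            # Terminal reward: the final total if within the limit, else -1.
            reward = float(state) if state <= self.game.limit else -1.0
        return state, reward, done, {}


if __name__ == '__main__':
    env = GymStyleWrapper(ToyCardGame(seed=1))
    state, done = env.reset(), False
    while not done:
        # Simple fixed policy: keep drawing while the total is below 15.
        state, reward, done, info = env.step(1 if state < 15 else 0)
    print('final total:', state, 'reward:', reward)
```

    The adapter keeps the game's own code untouched. Only the thin wrapper has to know how game calls map onto reset()/step().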

    opened by Jzhou0 6
  • Why do I get "terminating"?

    Hi!

    I get "terminating" when training in search mode with the env connected over gRPC. The log shows "(pid=2023) terminating" and nothing else about it, and my process is killed at the same time. Why does this happen? @kengz @lgraesser

    opened by lidongke 6
  • missing module cv2

    /SLM-Lab/slm_lab/lib/util.py", line 5, in import cv2 ModuleNotFoundError: No module named 'cv2'

    To Reproduce

    1. OS used: Ubuntu 18 LTS
    2. SLM-Lab git: git cloned
    3. demo.json not working

    Additional context: I had to add cmake and libgcc manually.

    Error logs:

    (base) l*@l*-HP-Pavilion-dv7-PC:~/SLM-Lab$ python3 run_lab.py slm_lab/spec/demo.json dqn_cartpole dev
    Traceback (most recent call last):
      File "run_lab.py", line 10, in <module>
        from slm_lab.experiment import analysis, retro_analysis
      File "/home/l*/SLM-Lab/slm_lab/experiment/analysis.py", line 5, in <module>
        from slm_lab.agent import AGENT_DATA_NAMES
      File "/home/lr/SLM-Lab/slm_lab/agent/__init__.py", line 21, in <module>
        from slm_lab.agent import algorithm, memory
      File "/home/l/SLM-Lab/slm_lab/agent/algorithm/__init__.py", line 8, in <module>
        from .actor_critic import *
      File "/home/l*/SLM-Lab/slm_lab/agent/algorithm/actor_critic.py", line 1, in <module>
        from slm_lab.agent import net
      File "/home/l*/SLM-Lab/slm_lab/agent/net/__init__.py", line 6, in <module>
        from slm_lab.agent.net.conv import *
      File "/home/l*/SLM-Lab/slm_lab/agent/net/conv.py", line 1, in <module>
        from slm_lab.agent.net import net_util
      File "/home/l*/SLM-Lab/slm_lab/agent/net/net_util.py", line 3, in <module>
        from slm_lab.lib import logger, util
      File "/home/lr/SLM-Lab/slm_lab/lib/logger.py", line 1, in <module>
        from slm_lab.lib import util
      File "/home/l/SLM-Lab/slm_lab/lib/util.py", line 5, in <module>
        import cv2
    ModuleNotFoundError: No module named 'cv2'
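    cv2 is the import name of OpenCV's Python bindings, which on PyPI are usually distributed as opencv-python. A prompt like "(base)" suggests the conda base environment rather than the lab environment may be active, though that is only an inference from the log. The minimal sketch below checks which interpreter is running and whether it can see a few modules. Run it with the same python3 used to launch run_lab.py. module_available is a hypothetical helper name.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` is importable here."""
    return importlib.util.find_spec(name) is not None


if __name__ == '__main__':
    # Shows which interpreter is actually running this check.
    print('interpreter:', sys.executable)
    for mod in ('cv2', 'torch', 'gym'):
        print(f'{mod}: {"found" if module_available(mod) else "MISSING"}')
```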

    opened by LodeVanB 6
  • Undefined names

    Undefined names have the potential to raise NameError at runtime.

    flake8 testing of https://github.com/kengz/SLM-Lab on Python 3.6.3

    $ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

    ./slm_lab/agent/algorithm/base.py:73:16: F821 undefined name 'action'
            return action
                   ^
    ./slm_lab/agent/algorithm/base.py:99:16: F821 undefined name 'batch'
            return batch
                   ^
    ./slm_lab/agent/algorithm/policy_util.py:43:13: F821 undefined name 'new_prob'
                new_prob[torch.argmax(probs, dim=0)] = 1.0
                ^
    ./slm_lab/env/__init__.py:97:49: F821 undefined name 'nvec'
            setattr(gym_space, 'low', np.zeros_like(nvec))
                                                    ^
    ./slm_lab/experiment/search.py:131:9: F821 undefined name 'config'
            config['trial_index'] = self.experiment.info_space.tick('trial')['trial']
            ^
    ./slm_lab/experiment/search.py:133:16: F821 undefined name 'config'
            return config
                   ^
    ./slm_lab/experiment/search.py:146:16: F821 undefined name 'trial_data_dict'
            return trial_data_dict
                   ^
    ./test/agent/net/test_nn.py:83:25: F821 undefined name 'net_util'
            before_params = net_util.copy_trainable_params(net)
                            ^
    ./test/agent/net/test_nn.py:88:24: F821 undefined name 'net_util'
            after_params = net_util.copy_trainable_params(net)
                           ^
    ./test/agent/net/test_nn.py:114:25: F821 undefined name 'net_util'
            before_params = net_util.copy_fixed_params(net)
                            ^
    ./test/agent/net/test_nn.py:118:24: F821 undefined name 'net_util'
            after_params = net_util.copy_fixed_params(net)
                           ^
    11    F821 undefined name 'action'
    11
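    F821 flags a name that is used but never assigned in any enclosing scope. If such a line is ever reached, it raises NameError at runtime. Placeholder methods like the ones reported in base.py ("return action", "return batch") look like methods meant to be overridden. One common fix is to make that intent explicit by raising NotImplementedError. Below is a minimal sketch with hypothetical class names, not SLM Lab's actual code:

```python
from typing import List


class BaseAlgorithm:
    """Hypothetical base class whose act() must be overridden."""

    def act(self, state: List[float]) -> int:
        # A placeholder like `return action` would raise NameError if reached,
        # since `action` is never assigned; raising NotImplementedError makes
        # the "override me" intent explicit and satisfies flake8 F821.
        raise NotImplementedError('subclasses must implement act()')


class GreedyAlgorithm(BaseAlgorithm):
    """Picks the index of the largest value in `state` (e.g. Q-values)."""

    def act(self, state: List[float]) -> int:
        # `action` is now assigned before it is returned.
        action = max(range(len(state)), key=state.__getitem__)
        return action
```

    Python's abc module (ABC with @abstractmethod) is a stricter alternative. It prevents instantiating the base class at all, rather than failing only when the method is called.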
    
    opened by cclauss 6
  • docker gotchas

    Hi. I tried running this through Docker, and ran into a few gotchas following the gitbook instructions:

    • The files in bin gave me permission errors despite running as root. Pasting these manually helped as a workaround.
    • The setup script uses sudo a lot, but the Docker container did not recognize it. Removing those calls helped; FWIW, installing sudo helped as well.
    • "source activate lab" errored, saying source was not recognized. I then tried:
    # conda config --add channels anaconda
    # conda activate lab
    # conda env update
    (lab) # python3 --version
    Python 3.6.4
    (lab) # yarn start
    $ python3 run_lab.py
    Traceback (most recent call last):
      File "run_lab.py", line 6, in <module>
        from slm_lab.experiment.control import Session, Trial, Experiment
      File "/opt/SLM-Lab/slm_lab/__init__.py", line 12
        with open(os.path.join(ROOT_DIR, 'config', f'{config_name}.json')) as f:
                                                                       ^
    SyntaxError: invalid syntax
    error Command failed with exit code 1
    

    Running this line directly in that python3 did not raise a syntax error, though, so f-strings do seem to be supported. Weird.
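    Python reports an f-string SyntaxError when it compiles the file, before any code runs. One plausible reading of the log is that yarn start launched a different, pre-3.6 python3 than the one checked interactively, but this is an inference, not a confirmed diagnosis. The minimal sketch below shows a version guard that an entry-point script could run first. check_python and MIN_PYTHON are hypothetical names. The function deliberately avoids f-strings so that it still compiles on older interpreters.

```python
import sys

MIN_PYTHON = (3, 6)  # f-strings were added in Python 3.6


def check_python(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Exit with a readable message if the running interpreter is too old.

    Written with str.format (no f-strings) so that this file still compiles
    on interpreters that predate f-string support.
    """
    if tuple(version_info[:2]) < tuple(minimum):
        sys.exit('Python {}.{}+ is required, but {} is Python {}.{}'.format(
            minimum[0], minimum[1], sys.executable, version_info[0], version_info[1]))
    return True
```

    To help, the guard must run before importing any module that contains f-strings. Python compiles each file in full when it is imported, so a guard placed in the same file as the f-string would never get the chance to run. The error message also prints sys.executable, which shows exactly which interpreter yarn start picked up.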

    I haven't fully gotten this to work, but hopefully some of this may be useful for the tutorial. I tried looking for the gitbook source in case I could add to the installation instructions based on this, but couldn't find it.

    opened by KiaraGrouwstra 6
  • Potential Memory Leak

    Hello,

    I am currently using SLM lab as the learning component of my custom Unity environments. I am using a modified UnityEnv wrapper and I run my experiments using a modified version of the starter code here.

    When running both PPO and SAC, I noticed that my Unix kernel kills the job after a while because it runs out of memory (RAM/swap).

    Given the custom nature of this bug, I don't expect you to replicate it; rather, I'm asking whether you have ever faced a similar problem on your end.

    Some more detail:

    1. Initially, I assumed it was due to the size of the replay buffer, but even after the buffer was capped at a small size (1000) and filled up, the problem persisted.
    2. Memory grows at roughly 1 MB/s, which is relatively high.
    3. I traced it to the "train step" in SAC. I can't confirm that the memory is allocated there, but when the training steps are skipped, there is no problem.
    4. I tested with the default Unity envs to rule out my custom env; it does not seem to be the cause.
    5. We will be testing with the provided Cartpole env to see if the problem persists.
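    One way to check whether Python-level allocations grow with each training step is to compare tracemalloc snapshots taken before and after a batch of steps. In the minimal sketch below, train_step is a hypothetical stand-in that deliberately leaks by appending to a list. This mimics a well-known PyTorch pitfall: accumulating loss tensors, which keep their autograd history alive, instead of loss.item(). tracemalloc only sees allocations made through Python's allocators, so memory held in PyTorch's native tensor storage may not show up in it.

```python
import tracemalloc
from typing import List

_kept: List[bytes] = []  # module-level container that is never cleared


def train_step(step: int) -> float:
    """Hypothetical training step that leaks by keeping every step's buffer."""
    buffer = bytes(10_000)  # stands in for a tensor kept alive by a reference
    _kept.append(buffer)    # the leak: this list only ever grows
    return float(step)


def measure_growth(n_steps: int, top: int = 3) -> int:
    """Run n_steps steps and return the net bytes of traced memory gained."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for step in range(n_steps):
        train_step(step)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    diffs = after.compare_to(before, 'lineno')
    # Entries are sorted by the size of the change; the first ones point at
    # the source lines worth inspecting.
    for stat in diffs[:top]:
        print(stat)
    return sum(stat.size_diff for stat in diffs)
```

    If traced growth stays flat while the process's resident memory keeps rising, the growth is more likely in native (tensor) memory than in Python objects.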

    Any guidance or tips would be appreciated! And once again thank you for the great library!

    question 
    opened by batu 5
  • Fail to save graphs

    I am following the book "Foundations of Deep Reinforcement Learning" to run the REINFORCE experiments. Although the algorithm runs successfully, its graphs fail to save, with an error from orca: "Service Unavailable".

    1. OS and environment: Ubuntu 16.04
    2. spec file used: reinforce_cartpole.json


    Error logs: Failed to generate graph. Run retro-analysis to generate graphs later. The image request was rejected by the orca conversion utility with the following error: 503:

    503 Service Unavailable

    Service Unavailable

    The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

    dependency 
    opened by alessandroweiliu 5
  • Exception: pyglet 2.0.0 requires Python 3.8 or newer

    After launching the demo with "python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev", I get the error "Exception: pyglet 2.0.0 requires Python 3.8 or newer." I've checked that the Python version in the lab conda environment is 3.7.3.

    To Reproduce

    1. OS: Linux Mint 21
    2. python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev

    Error logs:

    Traceback (most recent call last):
      File "run_lab.py", line 99, in <module>
        main()
      File "run_lab.py", line 91, in main
        get_spec_and_run(*args)
      File "run_lab.py", line 75, in get_spec_and_run
        run_spec(spec, lab_mode)
      File "run_lab.py", line 58, in run_spec
        Trial(spec).run()
      File "/home/javi/Code/AI/RL/foundation/SLM-Lab/slm_lab/experiment/control.py", line 179, in run
        session_metrics_list = self.run_sessions()
      File "/home/javi/Code/AI/RL/foundation/SLM-Lab/slm_lab/experiment/control.py", line 157, in run_sessions
        session_metrics_list = [Session(spec).run()]
      File "/home/javi/Code/AI/RL/foundation/SLM-Lab/slm_lab/experiment/control.py", line 118, in run
        self.run_rl()
      File "/home/javi/Code/AI/RL/foundation/SLM-Lab/slm_lab/experiment/control.py", line 90, in run_rl
        state = self.env.reset()
      File "/home/javi/Code/AI/RL/foundation/SLM-Lab/slm_lab/env/openai.py", line 62, in reset
        self.u_env.render()
      File "/home/javi/anaconda3/envs/lab/lib/python3.7/site-packages/gym/core.py", line 249, in render
        return self.env.render(mode, **kwargs)
      File "/home/javi/anaconda3/envs/lab/lib/python3.7/site-packages/gym/core.py", line 249, in render
        return self.env.render(mode, **kwargs)
      File "/home/javi/anaconda3/envs/lab/lib/python3.7/site-packages/gym/envs/classic_control/cartpole.py", line 150, in render
        from gym.envs.classic_control import rendering
      File "/home/javi/anaconda3/envs/lab/lib/python3.7/site-packages/gym/envs/classic_control/rendering.py", line 17, in <module>
        import pyglet
      File "/home/javi/anaconda3/envs/lab/lib/python3.7/site-packages/pyglet/__init__.py", line 54, in <module>
        raise Exception(f"pyglet {version} requires Python {MIN_PYTHON_VERSION_STR} or newer.")
    Exception: pyglet 2.0.0 requires Python 3.8 or newer.
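    The traceback shows gym's classic_control rendering module importing pyglet. The installed pyglet 2.0.0 then refuses to load on the environment's Python 3.7.3. One likely workaround is to install a pyglet 1.x release in the lab environment (for example with pip install "pyglet<2"). Which 1.x version matches this gym release is an assumption you would need to verify. The sketch below is a hypothetical pre-flight check that encodes only the constraint stated in the error message:

```python
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release like '2.0.0' (no pre-release tags)."""
    return tuple(int(part) for part in text.split('.'))


def pyglet_supported(pyglet_version: str,
                     python_version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return False for pyglet 2.x on Python older than 3.8.

    Only the 2.x constraint from the error message is encoded here; any
    requirements of 1.x releases are not checked.
    """
    if parse_version(pyglet_version)[0] >= 2:
        return tuple(python_version) >= (3, 8)
    return True
```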

    opened by jefalcon 0
  • How to add a custom gym environment in json spec file.

    Hi, I am interested in creating my own OpenAI Gym environment and training and evaluating different SLM Lab algorithms on it. Could you kindly guide me on how to add a custom Gym environment to the spec files? I am new to this, so I would highly appreciate an explanation in layman's terms.
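    Here is one possible route. It assumes SLM Lab creates OpenAI envs by passing the spec's env name to gym.make; the bundled CartPole specs use the registered id "CartPole-v0". First, register the custom environment with gym under its own id. Second, make sure that registration code runs (for example via an import) in the same process, before the env is created. Third, use that id as the env name in the spec. The id, entry point and field values below are hypothetical. Check the exact env fields against an existing spec in slm_lab/spec/, such as demo.json.

```python
import json
from typing import Any, Dict

# Hypothetical gym id and entry point ("module.path:ClassName") for your env.
ENV_ID = 'MyCustomEnv-v0'
ENTRY_POINT = 'my_package.envs:MyCustomEnv'


def register_custom_env() -> None:
    """Register the env so gym.make(ENV_ID) can find it (requires gym).

    This must run in the process that creates the env, before it is created.
    """
    from gym.envs.registration import register  # lazy import: needs gym installed
    register(id=ENV_ID, entry_point=ENTRY_POINT)


def env_spec_entry(max_frame: int = 10000) -> Dict[str, Any]:
    """Build one entry for the spec's "env" list (field names assumed)."""
    return {'name': ENV_ID, 'max_t': None, 'max_frame': max_frame}


if __name__ == '__main__':
    # Prints the JSON fragment to paste into a spec file (None becomes null).
    print(json.dumps({'env': [env_spec_entry()]}, indent=2))
```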

    opened by abdullahbm09 0
  • Docker build fails on environment.yml installation

    Describe the bug: running docker build hits an error during the build process.

    To Reproduce

    1. OS and environment: Windows 10, Docker for Windows
    2. SLM Lab git SHA (run git rev-parse HEAD to get it): 2890277c8d499dbc925a16bda40acd8c29cb6819
    3. spec file used: unknown

    Additional context: this appears to be caused by a problem earlier in the Dockerfile, where the python-pyglet package fails to install.

    Error logs

     > [7/9] RUN . ~/miniconda3/etc/profile.d/conda.sh &&     conda create -n lab python=3.7.3 -y &&     conda activate lab &&     conda env update -f environment.yml &&     conda clean -y --all &&     rm -rf ~/.cache/pip:
    #11 1.493 Collecting package metadata (current_repodata.json): ...working... done
    #11 5.422 Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
    #11 5.424 Collecting package metadata (repodata.json): ...working... done
    #11 15.34 Solving environment: ...working... done
    #11 15.85
    #11 15.85
    #11 15.85 ==> WARNING: A newer version of conda exists. <==
    #11 15.85   current version: 4.12.0
    #11 15.85   latest version: 4.14.0
    #11 15.85
    #11 15.85 Please update conda by running
    #11 15.85
    #11 15.85     $ conda update -n base -c defaults conda
    #11 15.85
    #11 15.85
    #11 15.93
    #11 15.93 ## Package Plan ##
    #11 15.93
    #11 15.93   environment location: /root/miniconda3/envs/lab
    #11 15.93
    #11 15.93   added / updated specs:
    #11 15.93     - python=3.7.3
    #11 15.93
    #11 15.93
    #11 15.93 The following packages will be downloaded:
    #11 15.93
    #11 15.93     package                    |            build
    #11 15.93     ---------------------------|-----------------
    #11 15.93     _openmp_mutex-5.1          |            1_gnu          21 KB
    #11 15.93     ca-certificates-2022.07.19 |       h06a4308_0         124 KB
    #11 15.93     certifi-2022.6.15          |   py37h06a4308_0         153 KB
    #11 15.93     libedit-3.1.20210910       |       h7f8727e_0         166 KB
    #11 15.93     libffi-3.2.1               |    hf484d3e_1007          48 KB
    #11 15.93     libgcc-ng-11.2.0           |       h1234567_1         5.3 MB
    #11 15.93     libgomp-11.2.0             |       h1234567_1         474 KB
    #11 15.93     libstdcxx-ng-11.2.0        |       h1234567_1         4.7 MB
    #11 15.93     ncurses-6.3                |       h5eee18b_3         781 KB
    #11 15.93     openssl-1.1.1q             |       h7f8727e_0         2.5 MB
    #11 15.93     pip-22.1.2                 |   py37h06a4308_0         2.4 MB
    #11 15.93     python-3.7.3               |       h0371630_0        32.1 MB
    #11 15.93     readline-7.0               |       h7b6447c_5         324 KB
    #11 15.93     setuptools-63.4.1          |   py37h06a4308_0         1.1 MB
    #11 15.93     sqlite-3.33.0              |       h62c20be_0         1.1 MB
    #11 15.93     tk-8.6.12                  |       h1ccaba5_0         3.0 MB
    #11 15.93     xz-5.2.5                   |       h7f8727e_1         339 KB
    #11 15.93     zlib-1.2.12                |       h7f8727e_2         106 KB
    #11 15.93     ------------------------------------------------------------
    #11 15.93                                            Total:        54.8 MB
    #11 15.93
    #11 15.93 The following NEW packages will be INSTALLED:
    #11 15.93
    #11 15.93   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
    #11 15.93   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
    #11 15.93   ca-certificates    pkgs/main/linux-64::ca-certificates-2022.07.19-h06a4308_0
    #11 15.93   certifi            pkgs/main/linux-64::certifi-2022.6.15-py37h06a4308_0
    #11 15.93   libedit            pkgs/main/linux-64::libedit-3.1.20210910-h7f8727e_0
    #11 15.93   libffi             pkgs/main/linux-64::libffi-3.2.1-hf484d3e_1007
    #11 15.93   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
    #11 15.93   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
    #11 15.93   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
    #11 15.93   ncurses            pkgs/main/linux-64::ncurses-6.3-h5eee18b_3
    #11 15.93   openssl            pkgs/main/linux-64::openssl-1.1.1q-h7f8727e_0
    #11 15.93   pip                pkgs/main/linux-64::pip-22.1.2-py37h06a4308_0
    #11 15.93   python             pkgs/main/linux-64::python-3.7.3-h0371630_0
    #11 15.93   readline           pkgs/main/linux-64::readline-7.0-h7b6447c_5
    #11 15.93   setuptools         pkgs/main/linux-64::setuptools-63.4.1-py37h06a4308_0
    #11 15.93   sqlite             pkgs/main/linux-64::sqlite-3.33.0-h62c20be_0
    #11 15.93   tk                 pkgs/main/linux-64::tk-8.6.12-h1ccaba5_0
    #11 15.93   wheel              pkgs/main/noarch::wheel-0.37.1-pyhd3eb1b0_0
    #11 15.93   xz                 pkgs/main/linux-64::xz-5.2.5-h7f8727e_1
    #11 15.93   zlib               pkgs/main/linux-64::zlib-1.2.12-h7f8727e_2
    #11 15.93
    #11 15.93
    #11 15.93
    #11 15.93 Downloading and Extracting Packages
    zlib-1.2.12          | 106 KB    | ########## | 100%
    xz-5.2.5             | 339 KB    | ########## | 100%
    libedit-3.1.20210910 | 166 KB    | ########## | 100%
    _openmp_mutex-5.1    | 21 KB     | ########## | 100%
    sqlite-3.33.0        | 1.1 MB    | ########## | 100%
    libstdcxx-ng-11.2.0  | 4.7 MB    | ########## | 100%
    ncurses-6.3          | 781 KB    | ########## | 100%
    python-3.7.3         | 32.1 MB   | ########## | 100%
    certifi-2022.6.15    | 153 KB    | ########## | 100%
    tk-8.6.12            | 3.0 MB    | ########## | 100%
    libgomp-11.2.0       | 474 KB    | ########## | 100%
    libffi-3.2.1         | 48 KB     | ########## | 100%
    ca-certificates-2022 | 124 KB    | ########## | 100%
    setuptools-63.4.1    | 1.1 MB    | ########## | 100%
    pip-22.1.2           | 2.4 MB    | ########## | 100%
    openssl-1.1.1q       | 2.5 MB    | ########## | 100%
    readline-7.0         | 324 KB    | ########## | 100%
    libgcc-ng-11.2.0     | 5.3 MB    | ########## | 100%
    #11 24.41 Preparing transaction: ...working... done
    #11 24.74 Verifying transaction: ...working... done
    #11 25.93 Executing transaction: ...working... done
    #11 28.11 #
    #11 28.11 # To activate this environment, use
    #11 28.11 #
    #11 28.11 #     $ conda activate lab
    #11 28.11 #
    #11 28.11 # To deactivate an active environment, use
    #11 28.11 #
    #11 28.11 #     $ conda deactivate
    #11 28.11
    #11 29.82 Collecting package metadata (repodata.json): ...working... done
    #11 101.5 Solving environment: ...working... done
    #11 148.2
    #11 148.2
    #11 148.2 ==> WARNING: A newer version of conda exists. <==
    #11 148.2   current version: 4.12.0
    #11 148.2   latest version: 4.14.0
    #11 148.2
    #11 148.2 Please update conda by running
    #11 148.2
    #11 148.2     $ conda update -n base -c defaults conda
    #11 148.2
    #11 148.2
    #11 148.3
    #11 148.3 Downloading and Extracting Packages
    libgfortran-ng-7.5.0 | 23 KB     | ########## | 100%
    colorlog-4.0.2       | 19 KB     | ########## | 100%
    lz4-c-1.9.3          | 179 KB    | ########## | 100%
    jdcal-1.4.1          | 9 KB      | ########## | 100%
    scipy-1.3.0          | 18.8 MB   | ########## | 100%
    ujson-1.35           | 28 KB     | ########## | 100%
    mkl-2022.0.1         | 127.7 MB  | ########## | 100%
    xlrd-1.2.0           | 108 KB    | ########## | 100%
    libopenblas-0.3.12   | 8.2 MB    | ########## | 100%
    regex-2019.05.25     | 365 KB    | ########## | 100%
    pytest-4.5.0         | 354 KB    | ########## | 100%
    libgcc-7.2.0         | 304 KB    | ########## | 100%
    libwebp-base-1.2.2   | 824 KB    | ########## | 100%
    six-1.16.0           | 14 KB     | ########## | 100%
    zipp-3.8.1           | 13 KB     | ########## | 100%
    cffi-1.14.4          | 224 KB    | ########## | 100%
    et_xmlfile-1.0.1     | 11 KB     | ########## | 100%
    liblapack-3.9.0      | 11 KB     | ########## | 100%
    olefile-0.46         | 32 KB     | ########## | 100%
    importlib-metadata-4 | 33 KB     | ########## | 100%
    cudatoolkit-10.1.243 | 427.6 MB  | ########## | 100%
    py-1.11.0            | 74 KB     | ########## | 100%
    backports.functools_ | 9 KB      | ########## | 100%
    wcwidth-0.2.5        | 33 KB     | ########## | 100%
    pydash-4.2.1         | 60 KB     | ########## | 100%
    retrying-1.3.3       | 11 KB     | ########## | 100%
    libgfortran4-7.5.0   | 1.2 MB    | ########## | 100%
    flaky-3.5.3          | 19 KB     | ########## | 100%
    ca-certificates-2022 | 149 KB    | ########## | 100%
    pluggy-0.13.1        | 29 KB     | ########## | 100%
    python-3.7.3         | 35.7 MB   | ########## | 100%
    libtiff-4.2.0        | 590 KB    | ########## | 100%
    typing_extensions-4. | 28 KB     | ########## | 100%
    autopep8-1.4.4       | 38 KB     | ########## | 100%
    psutil-5.6.2         | 320 KB    | ########## | 100%
    openssl-1.1.1o       | 2.1 MB    | ########## | 100%
    importlib_metadata-4 | 4 KB      | ########## | 100%
    libcblas-3.9.0       | 11 KB     | ########## | 100%
    pytorch-1.3.1        | 428.0 MB  | ########## | 100%
    python-dateutil-2.8. | 240 KB    | ########## | 100%
    zstd-1.5.0           | 490 KB    | ########## | 100%
    yaml-0.2.5           | 87 KB     | ########## | 100%
    libpng-1.6.37        | 306 KB    | ########## | 100%
    ninja-1.11.0         | 2.8 MB    | ########## | 100%
    attrs-22.1.0         | 48 KB     | ########## | 100%
    coverage-4.5.3       | 216 KB    | ########## | 100%
    pytest-cov-2.7.1     | 17 KB     | ########## | 100%
    certifi-2022.6.15    | 155 KB    | ########## | 100%
    pillow-6.2.0         | 634 KB    | ########## | 100%
    bzip2-1.0.8          | 484 KB    | ########## | 100%
    pyyaml-5.1.2         | 184 KB    | ########## | 100%
    numpy-1.16.3         | 4.3 MB    | ########## | 100%
    atomicwrites-1.4.1   | 12 KB     | ########## | 100%
    jpeg-9e              | 268 KB    | ########## | 100%
    pytz-2022.2.1        | 224 KB    | ########## | 100%
    libblas-3.9.0        | 11 KB     | ########## | 100%
    intel-openmp-2022.0. | 4.2 MB    | ########## | 100%
    pycparser-2.21       | 100 KB    | ########## | 100%
    pandas-0.24.2        | 8.6 MB    | ########## | 100%
    pip-19.1.1           | 1.8 MB    | ########## | 100%
    more-itertools-8.14. | 45 KB     | ########## | 100%
    plotly-4.9.0         | 5.8 MB    | ########## | 100%
    pycodestyle-2.5.0    | 36 KB     | ########## | 100%
    pytest-timeout-1.3.3 | 12 KB     | ########## | 100%
    freetype-2.10.4      | 890 KB    | ########## | 100%
    python_abi-3.7       | 4 KB      | ########## | 100%
    backports-1.0        | 4 KB      | ########## | 100%
    openpyxl-2.6.1       | 152 KB    | ########## | 100%
    #11 419.8 Preparing transaction: ...working... done
    #11 422.4 Verifying transaction: ...working... done
    #11 425.0 Executing transaction: ...working... By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html
    #11 436.6
    #11 436.6 done
    #11 437.1 Installing pip dependencies: ...working... Ran pip subprocess with arguments:
    #11 818.8 ['/root/miniconda3/envs/lab/bin/python', '-m', 'pip', 'install', '-U', '-r', '/root/SLM-Lab/condaenv.u9_zu190.requirements.txt']
    #11 818.8 Pip subprocess output:
    #11 818.8 Collecting box2d-py==2.3.8 (from -r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 1))
    #11 818.8   Downloading https://files.pythonhosted.org/packages/87/34/da5393985c3ff9a76351df6127c275dcb5749ae0abbe8d5210f06d97405d/box2d_py-2.3.8-cp37-cp37m-manylinux1_x86_64.whl (448kB)
    #11 818.8 Collecting cloudpickle==0.5.2 (from -r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 2))
    #11 818.8   Downloading https://files.pythonhosted.org/packages/aa/18/514b557c4d8d4ada1f0454ad06c845454ad438fd5c5e0039ba51d6b032fe/cloudpickle-0.5.2-py2.py3-none-any.whl
    #11 818.8 Collecting colorlover==0.3.0 (from -r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 3))
    #11 818.8   Downloading https://files.pythonhosted.org/packages/9a/53/f696e4480b1d1de3b1523991dea71cf417c8b19fe70c704da164f3f90972/colorlover-0.3.0-py3-none-any.whl
    #11 818.8 Collecting future==0.18.2 (from -r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 4))
    #11 818.8   Downloading https://files.pythonhosted.org/packages/45/0b/38b06fd9b92dc2b68d58b75f900e97884c45bedd2ff83203d933cf5851c9/future-0.18.2.tar.gz (829kB)
    ...
    ...
    
    #11 818.8 Requirement already satisfied, skipping upgrade: zipp>=0.5 in /root/miniconda3/envs/lab/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->click->ray==0.7.0->-r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 8)) (3.8.1)
    #11 818.8 Requirement already satisfied, skipping upgrade: typing-extensions>=3.6.4; python_version < "3.8" in /root/miniconda3/envs/lab/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->click->ray==0.7.0->-r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 8)) (4.3.0)
    #11 818.8 Collecting pyasn1>=0.1.3 (from rsa<5,>=3.1.4; python_version >= "3.6"->google-auth<2,>=1.6.3->tensorboard==2.1.1->-r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 10))
    #11 818.8   Downloading https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77kB)
    #11 818.8 Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard==2.1.1->-r /root/SLM-Lab/condaenv.u9_zu190.requirements.txt (line 10))
    #11 818.8   Downloading https://files.pythonhosted.org/packages/1d/46/5ee2475e1b46a26ca0fa10d3c1d479577fde6ee289f8c6aa6d7ec33e31fd/oauthlib-3.2.0-py3-none-any.whl (151kB)
    #11 818.8 Building wheels for collected packages: future, pyopengl, xvfbwrapper, gym, typing, grpcio, MarkupSafe
    #11 818.8   Building wheel for future (setup.py): started
    #11 818.8   Building wheel for future (setup.py): finished with status 'done'
    #11 818.8   Stored in directory: /root/.cache/pip/wheels/8b/99/a0/81daf51dcd359a9377b110a8a886b3895921802d2fc1b2397e
    #11 818.8   Building wheel for pyopengl (setup.py): started
    #11 818.8   Building wheel for pyopengl (setup.py): finished with status 'done'
    #11 818.8   Stored in directory: /root/.cache/pip/wheels/6c/00/7f/1dd736f380848720ad79a1a1de5272e0d3f79c15a42968fb58
    #11 818.8   Building wheel for xvfbwrapper (setup.py): started
    #11 818.8   Building wheel for xvfbwrapper (setup.py): finished with status 'done'
    #11 818.8   Stored in directory: /root/.cache/pip/wheels/10/f2/61/cacfaf84b352c223761ea8d19616e3b5ac5c27364da72863f0
    #11 818.8   Building wheel for gym (setup.py): started
    #11 818.8   Building wheel for gym (setup.py): finished with status 'done'
    #11 818.8   Stored in directory: /root/.cache/pip/wheels/57/b0/13/4153e1acab826fbe612c95b1336a63a3fa6416902a8d74a1b7
    #11 818.8   Building wheel for typing (setup.py): started
    #11 818.8   Building wheel for typing (setup.py): finished with status 'done'
    #11 818.8   Stored in directory: /root/.cache/pip/wheels/2d/04/41/8e1836e79581989c22eebac3f4e70aaac9af07b0908da173be
    #11 818.8   Building wheel for grpcio (setup.py): started
    #11 818.8   Building wheel for grpcio (setup.py): still running...
    #11 818.8   Building wheel for grpcio (setup.py): still running...
    #11 818.8   Building wheel for grpcio (setup.py): finished with status 'error'
    #11 818.8   Running setup.py clean for grpcio
    #11 818.8   Building wheel for MarkupSafe (setup.py): started
    #11 818.8   Building wheel for MarkupSafe (setup.py): finished with status 'done'
    #11 818.8   Stored in directory: /root/.cache/pip/wheels/f5/40/34/d60ef965622011684037ea53e53fd44ef58ed2062f26878ce2
    #11 818.8 Successfully built future pyopengl xvfbwrapper gym typing MarkupSafe
    #11 818.8 Failed to build grpcio
    #11 818.8 Installing collected packages: box2d-py, cloudpickle, colorlover, future, kaleido, opencv-python, pyopengl, typing, funcsigs, click, colorama, flatbuffers, redis, filelock, ray, absl-py, pyasn1, rsa, pyasn1-modules, cachetools, google-auth, markdown, MarkupSafe, werkzeug, charset-normalizer, idna, urllib3, requests, oauthlib, requests-oauthlib, google-auth-oauthlib, grpcio, protobuf, tensorboard, xvfbwrapper, pyglet, gym, pybullet, roboschool, atari-py
    #11 818.8   Running setup.py install for grpcio: started
    #11 818.8     Running setup.py install for grpcio: still running...
    #11 818.8     Running setup.py install for grpcio: still running...
    #11 818.8     Running setup.py install for grpcio: finished with status 'error'
    #11 818.8 Pip subprocess error:
    #11 818.8   ERROR: Complete output from command /root/miniconda3/envs/lab/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-n1qdzi5c/grpcio/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-gt9n7xut --python-tag cp37:
    #11 818.8   ERROR: Found cython-generated files...
    #11 818.8   running bdist_wheel
    #11 818.8   running build
    #11 818.8   running build_py
    #11 818.8   running build_project_metadata
    #11 818.8   creating python_build
    #11 818.8   creating python_build/lib.linux-x86_64-cpython-37
    #11 818.8   creating python_build/lib.linux-x86_64-cpython-37/grpc
    #11 818.8   copying src/python/grpcio/grpc/_channel.py -> python_build/lib.linux-x86_64-cpython-37/grpc
    #11 818.8   copying src/python/grpcio/grpc/_utilities.py -> python_build/lib.linux-x86_64-cpython-37/grpc
    
    ...
    ...
                             ^
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/boringssl-with-bazel/src/crypto/base64/base64.c -o python_build/temp.linux-x86_64-cpython-37/third_party/boringssl-with-bazel/src/crypto/base64/base64.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1: warning: command line option ‘-std=c++14’ is valid for C++/ObjC++ but not for C
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/abseil-cpp/absl/strings/internal/str_format/bind.cc -o python_build/temp.linux-x86_64-cpython-37/third_party/abseil-cpp/absl/strings/internal/str_format/bind.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/abseil-cpp/absl/time/internal/cctz/src/time_zone_lookup.cc -o python_build/temp.linux-x86_64-cpython-37/third_party/abseil-cpp/absl/time/internal/cctz/src/time_zone_lookup.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/boringssl-with-bazel/src/crypto/cpu-aarch64-fuchsia.c -o python_build/temp.linux-x86_64-cpython-37/third_party/boringssl-with-bazel/src/crypto/cpu-aarch64-fuchsia.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1: warning: command line option ‘-std=c++14’ is valid for C++/ObjC++ but not for C
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/boringssl-with-bazel/src/crypto/cpu-aarch64-linux.c -o python_build/temp.linux-x86_64-cpython-37/third_party/boringssl-with-bazel/src/crypto/cpu-aarch64-linux.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1: warning: command line option ‘-std=c++14’ is valid for C++/ObjC++ but not for C
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/boringssl-with-bazel/src/crypto/cpu-aarch64-win.c -o python_build/temp.linux-x86_64-cpython-37/third_party/boringssl-with-bazel/src/crypto/cpu-aarch64-win.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1: warning: command line option ‘-std=c++14’ is valid for C++/ObjC++ but not for C
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/boringssl-with-bazel/src/crypto/cpu-arm-linux.c -o python_build/temp.linux-x86_64-cpython-37/third_party/boringssl-with-bazel/src/crypto/cpu-arm-linux.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1: warning: command line option ‘-std=c++14’ is valid for C++/ObjC++ but not for C
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/boringssl-with-bazel/src/crypto/cpu-arm.c -o python_build/temp.linux-x86_64-cpython-37/third_party/boringssl-with-bazel/src/crypto/cpu-arm.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   cc1: warning: command line option ‘-std=c++14’ is valid for C++/ObjC++ but not for C
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m -c third_party/boringssl-with-bazel/src/crypto/cpu-intel.c -o python_build/temp.linux-x86_64-cpython-37/third_party/boringssl-with-bazel/src/crypto/cpu-intel.o -std=c++14 -fvisibility=hidden -fno-wrapv -fno-exceptions -pthread
    #11 818.8   gcc -pthread -B /root/miniconda3/envs/lab/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_WIN32_WINNT=1536 -DGRPC_XDS_USER_AGENT_NAME_SUFFIX=\"Python\" -DGRPC_XDS_USER_AGENT_VERSION_SUFFIX=\"1.48.0\" -DGPR_BACKWARDS_COMPATIBILITY_MODE=1 -DHAVE_CONFIG_H=1 -DGRPC_ENABLE_FORK_SUPPORT=1 "-DPyMODINIT_FUNC=extern \"C\" __attribute__((visibility (\"default\"))) PyObject*" -DGRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK=1 -Isrc/python/grpcio -Iinclude -I. -Ithird_party/abseil-cpp -Ithird_party/address_sorting/include -Ithird_party/cares/cares/include -Ithird_party/cares -Ithird_party/cares/cares -Ithird_party/cares/config_linux -Ithird_party/re2 -Ithird_party/boringssl-with-bazel/src/include -Ithird_party/upb -Isrc/core/ext/upb-generated -Isrc/core/ext/upbdefs-generated -Ithird_party/xxhash -Ithird_party/zlib -I/root/miniconda3/envs/lab/include/python3.7m
    #11 818.8 [output clipped, log limit 1MiB reached]
    #11 818.8
    #11 818.8 failed
    ------
    executor failed running [/bin/bash -c . ~/miniconda3/etc/profile.d/conda.sh &&     conda create -n lab python=3.7.3 -y &&     conda activate lab &&     conda env update -f environment.yml &&     conda clean -y --all &&     rm -rf ~/.cache/pip]: exit code: 1
    
    opened by jtruxon
  • `optimizer.step()` before `lr_scheduler.step()` Warning Occurred

    I really appreciate your book; it's a great help for getting started with RL. ^^

    Describe the bug When executing example code 4.7 (vanilla_dqn, without any changes), the following warning message appears:

    To Reproduce

    1. OS and environment: Ubuntu 20.04
    2. SLM Lab git SHA (run git rev-parse HEAD to get it): 5fa5ee3d034a38d5644f6f96b4c02ec366c831d0 (from the file "SLM-lab/data/vanilla_dqn_boltzmann_cartpole_2022_07_15_092012/vanilla_dqn_boltzmann_cartpole_t0_spec.json")
    3. spec file used: SLM-lab/slm_lab/spec/benchmark/dqn/dqn_cartpole.json

    Additional context After the warning appeared, training ran much more slowly than with other methods (over an hour, versus 15 minutes for SARSA), and the result is also strange: mean_returns_ma gradually decreases to about 50 after 30k frames. I wonder whether this result is related to the warning.

    Error logs

    [2022-07-15 09:20:14,002 PID:245693 INFO logger.py info] Running RL loop for trial 0 session 3
    [2022-07-15 09:20:14,006 PID:245693 INFO __init__.py log_summary] Trial 0 session 3 vanilla_dqn_boltzmann_cartpole_t0_s3 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.01  explore_var: 5  entropy_coef: nan  entropy: nan  grad_norm: nan
    /home/eric/miniconda3/envs/lab/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning:
    
    Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    
    /home/eric/miniconda3/envs/lab/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning:
    
    Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    
    /home/eric/miniconda3/envs/lab/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning:
    
    Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
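    The warning names the call order PyTorch expects since 1.1.0. A minimal sketch of that order in a plain PyTorch loop, assuming a toy `nn.Linear` model, random data, and a `StepLR` schedule with illustrative settings (all hypothetical; this is not SLM Lab's training code):

```python
import warnings

import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical toy model and batch standing in for an agent's network and data.
model = nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randn(8, 2)

optimizer = optim.SGD(model.parameters(), lr=0.01)
# Illustrative schedule: halve the learning rate every 2 scheduler steps.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lrs = []
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for _ in range(4):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()  # 1) update the parameters first
        scheduler.step()  # 2) then advance the learning-rate schedule
        lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

    With `optimizer.step()` called first, the schedule starts from its first value (0.01 here) instead of skipping it, and the ordering warning is not raised.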
    
    opened by younghwa-hong
  • how to improve the convergence performance of training loss?

    Hi kengz, I find that the training loss (= value loss + policy loss) of the PPO algorithm applied to Pong converges poorly (see Fig. 1), but the corresponding mean_returns shows a clear upward trend and converges (see Fig. 2). Why is that, and how can I improve the convergence of the training loss? I have tried many improvement tricks for PPO, but none of them worked.

    [Fig. 1: ppo_pong_t0_s0_session_graph_eval_loss_vs_frame]
    [Fig. 2: ppo_pong_t0_s0_session_graph_eval_mean_returns_vs_frames]
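    A minimal NumPy sketch of the standard PPO objective (clipped surrogate policy loss plus a squared-error value loss, following Schulman et al., 2017). The batch values, `clip_eps`, and `val_coef` are hypothetical, the entropy term is omitted, and SLM Lab's own loss may differ in these details. It illustrates one general property, not a diagnosis of the run above: the value loss regresses onto returns that change as the policy improves, so its magnitude tracks the scale of the returns instead of shrinking steadily the way a supervised loss would.

```python
import numpy as np


def ppo_loss(new_logp, old_logp, advantages, values, returns, clip_eps=0.2, val_coef=0.5):
    """Clipped-surrogate PPO loss plus a squared-error value loss (entropy bonus omitted)."""
    ratio = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Policy loss: negative mean of the pessimistic (element-wise min) surrogate.
    policy_loss = -np.mean(np.minimum(ratio * advantages, clipped * advantages))
    # Value loss: regression of V(s) onto the returns, so its size follows the return scale.
    value_loss = np.mean((values - returns) ** 2)
    total = policy_loss + val_coef * value_loss
    return total, policy_loss, value_loss


# Hypothetical batch of three transitions.
old_logp = np.log([0.5, 0.4, 0.3])
new_logp = np.log([0.6, 0.3, 0.45])
adv = np.array([1.0, -0.5, 2.0])
values = np.array([1.0, 2.0, 3.0])
returns = np.array([2.0, 2.0, 5.0])

_, pol, val = ppo_loss(new_logp, old_logp, adv, values, returns)
# Same relative fit, but returns and value estimates 10x larger.
_, pol10, val10 = ppo_loss(new_logp, old_logp, adv, 10 * values, 10 * returns)
print(pol, val, pol10, val10)
```

    Scaling the returns and value estimates by 10 leaves the policy term unchanged but multiplies the value term by 100.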

    opened by Jzhou0
Releases (v4.2.4)
  • v4.2.4 (Dec 18, 2021)

    What's Changed

    • upgrade plotly, replace orca with kaleido by @kengz in https://github.com/kengz/SLM-Lab/pull/501

    Full Changelog: https://github.com/kengz/SLM-Lab/compare/v4.2.3...v4.2.4

  • v4.2.3 (Dec 6, 2021)

    What's Changed

    • Added Algorithms config files for VideoPinball-v0 game by @dd-iuonac in https://github.com/kengz/SLM-Lab/pull/488
    • fix build for new RTX GPUs by @kengz and @Karl-Grantham in https://github.com/kengz/SLM-Lab/pull/496
    • remove the reinforce_pong.json spec to prevent confusion in https://github.com/kengz/SLM-Lab/pull/499

    New Contributors

    • @dd-iuonac made their first contribution in https://github.com/kengz/SLM-Lab/pull/488
    • @Karl-Grantham for help with debugging #496

    Full Changelog: https://github.com/kengz/SLM-Lab/compare/v4.2.2...v4.2.3

  • v4.2.2 (May 25, 2021)

    Improve Installation Stability

    :raised_hands: Thanks to @Nickfagiano for help with debugging.

    • #487 update installation to work with MacOS BigSur
    • #487 improve setup with Conda path guard
    • #487 lock atari-py version to 0.2.6 for safety

    Google Colab/Jupyter

    :raised_hands: Thanks to @piosif97 for helping.

    Windows setup

    :raised_hands: Thanks to @vladimirnitu and @steindaian for providing the PDF.

  • v4.2.1 (May 17, 2021)

    Update installation

    Dependencies and systems around SLM Lab have changed and caused some breakages. This release fixes these installation issues.

    • #461, #476 update to homebrew/cask (thanks @ben-e, @amjadmajid )
    • #463 add pybullet to dependencies (thanks @rafapi)
    • #483 fix missing install command in Arch Linux setup (thanks @sebimarkgraf)
    • #485 update GitHub Actions CI to v2
    • #485 fix demo spec to use strict json
  • v4.2.0 (Apr 14, 2020)

    Resume mode

    • #455 adds train@ resume mode and refactors the enjoy mode. See PR for detailed info.

    train@ usage example

    Specify train mode as train@{predir}, where {predir} is the data directory of the last training run, or simply use `latest` to use the latest run, e.g.:

    python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole train
    # terminate run before its completion
    # optionally edit the spec file in a past-future-consistent manner
    
    # run resume with either of the commands:
    python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole train@latest
    # or to use a specific run folder
    python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole train@data/reinforce_cartpole_2020_04_13_232521
    

    enjoy mode refactor

    The train@ resume mode API allows the enjoy mode to be refactored; both now share similar syntax. Continuing with the example above, to enjoy a trained model, we now use:

    python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole enjoy@data/reinforce_cartpole_2020_04_13_232521/reinforce_cartpole_t0_s0_spec.json
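    Both modes pack a mode name and an optional target into one `mode@target` argument. A minimal sketch of how such a string could be resolved; `resolve_lab_mode` is a hypothetical helper, not SLM Lab's code, and it assumes run folders end in a `YYYY_MM_DD_HHMMSS` timestamp (as in `reinforce_cartpole_2020_04_13_232521`) so that `latest` can be found by sorting on that suffix:

```python
def resolve_lab_mode(lab_mode, run_dirs, data_dir="data"):
    """Split a lab mode such as 'train', 'train@latest' or 'enjoy@<path>'.

    Hypothetical helper for illustration only; `run_dirs` lists run folder
    names under `data_dir`.
    """
    mode, sep, target = lab_mode.partition("@")
    if not sep:
        return mode, None  # no '@': a fresh run such as plain 'train' or 'search'
    if target == "latest":
        # Assumes names end in a 17-char YYYY_MM_DD_HHMMSS timestamp,
        # which sorts chronologically as a string.
        latest = max(run_dirs, key=lambda name: name[-17:])
        return mode, f"{data_dir}/{latest}"
    return mode, target  # an explicit run folder or spec file path


runs = ["reinforce_cartpole_2020_04_13_232521", "reinforce_cartpole_2020_04_14_081500"]
print(resolve_lab_mode("train@latest", runs))
```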
    

    Plotly and PyTorch update

    • #453 updates Plotly to 4.5.4 and PyTorch to 1.3.1.
    • #454 explicitly shuts down Plotly orca server after plotting to prevent zombie processes

    PPO batch size optimization

    • #453 adds chunking so PPO can run on larger batch sizes by breaking up the forward loop.
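    Chunking a forward pass means splitting the batch into slices, running the network on each slice, and concatenating the results. A minimal NumPy sketch, assuming a hypothetical one-layer `forward` in place of the real network (the PR's actual PyTorch implementation is not reproduced here):

```python
import numpy as np


def forward(weights, states):
    """Stand-in for a policy/value network: one hypothetical linear layer + tanh."""
    return np.tanh(states @ weights)


def chunked_forward(weights, states, chunk_size):
    """Apply `forward` to consecutive slices of the batch and stitch the outputs together.

    Intermediate arrays are only built for one chunk at a time, so their peak
    size scales with `chunk_size` rather than with the full batch size.
    """
    outputs = [
        forward(weights, states[start:start + chunk_size])
        for start in range(0, len(states), chunk_size)
    ]
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))
S = rng.normal(size=(10, 4))
out = chunked_forward(W, S, chunk_size=3)  # chunks of 3, 3, 3 and 1 rows
print(out.shape)
```

    The chunked output matches the unchunked one; only how much is computed at once changes.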

    New OnPolicyCrossEntropy memory

    • #446 adds a new OnPolicyCrossEntropy memory class. See PR for details. Credits to @ingambe.
  • v4.1.1 (Nov 13, 2019)

    Discrete SAC benchmark update

    | Env. \ Alg. | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO | SAC |
    |:---:|:---:|:---:|:---:|:---:|:---:|:---:|
    | Breakout | 80.88 | 182 | 377 | 398 | 443 | 3.51* |
    | Pong | 18.48 | 20.5 | 19.31 | 19.56 | 20.58 | 19.87* |
    | Seaquest | 1185 | 4405 | 1070 | 1684 | 1715 | 171* |
    | Qbert | 5494 | 11426 | 12405 | 13590 | 13460 | 923* |
    | LunarLander | 192 | 233 | 25.21 | 68.23 | 214 | 276 |
    | UnityHallway | -0.32 | 0.27 | 0.08 | -0.96 | 0.73 | 0.01 |
    | UnityPushBlock | 4.88 | 4.93 | 4.68 | 4.93 | 4.97 | -0.70 |

    Episode score at the end of training attained by SLM Lab implementations on discrete-action control problems. Reported episode scores are the average over the last 100 checkpoints, and then averaged over 4 Sessions. A Random baseline with score averaged over 100 episodes is included. Results marked with * were trained using the hybrid synchronous/asynchronous version of SAC to parallelize and speed up training time. For SAC, Breakout, Pong and Seaquest were trained for 2M frames instead of 10M frames.
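    The two-stage averaging in the caption can be written out directly. A minimal sketch with hypothetical checkpoint scores (not benchmark data); `benchmark_score` is an illustrative name, not an SLM Lab function:

```python
from statistics import mean


def benchmark_score(sessions, window=100):
    """Mean of each session's last `window` checkpoint scores, then mean over sessions."""
    per_session = [mean(scores[-window:]) for scores in sessions]
    return mean(per_session)


# Hypothetical data: 4 sessions, 150 checkpoints each; the last 100 of session i score i.
sessions = [[0.0] * 50 + [float(i)] * 100 for i in range(4)]
print(benchmark_score(sessions))  # 1.5
```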

    For the full Atari benchmark, see Atari Benchmark

  • v4.1.0 (Oct 29, 2019)

    This marks a stable release of SLM Lab with full benchmark results.

    RAdam+Lookahead optimizer

    • Lookahead + RAdam optimizer significantly improves the performance of some RL algorithms (A2C (n-step), PPO) on continuous domain problems, but does not improve others (A2C (GAE), SAC). #416
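    Lookahead (Zhang et al., 2019) wraps an inner optimizer: it takes k "fast" steps, then moves a set of "slow" weights a fraction alpha of the way toward the fast weights and restarts from there. A minimal scalar sketch, assuming plain SGD as the inner optimizer in place of RAdam and illustrative values for `lr`, `k` and `alpha`; this is not SLM Lab's optimizer code:

```python
def lookahead_sgd(grad_fn, theta0, lr=0.1, k=5, alpha=0.5, outer_steps=20):
    """Lookahead wrapped around plain SGD for a single scalar parameter.

    Plain SGD stands in for RAdam to keep the sketch dependency-free.
    """
    slow = theta0
    for _ in range(outer_steps):
        fast = slow
        for _ in range(k):
            fast -= lr * grad_fn(fast)  # k "fast" steps with the inner optimizer
        slow += alpha * (fast - slow)   # pull the slow weights part-way toward the fast ones
    return slow


# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
result = lookahead_sgd(lambda x: 2 * (x - 3), theta0=0.0)
print(result)
```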

    TensorBoard

    • Add TensorBoard in body to auto-log summary variables, the graph, network parameter histograms, and action histograms. To launch TensorBoard, run tensorboard --logdir=data after a session/trial is completed. Example screenshot:
    Screen Shot 2019-10-14 at 10 41 36 PM

    Full Benchmark Upload


    Discrete Benchmark

    | Env. \ Alg. | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO | SAC |
    |:---:|:---:|:---:|:---:|:---:|:---:|:---:|
    | Breakout | 80.88 | 182 | 377 | 398 | 443 | - |
    | Pong | 18.48 | 20.5 | 19.31 | 19.56 | 20.58 | 19.87* |
    | Seaquest | 1185 | 4405 | 1070 | 1684 | 1715 | - |
    | Qbert | 5494 | 11426 | 12405 | 13590 | 13460 | 214* |
    | LunarLander | 192 | 233 | 25.21 | 68.23 | 214 | 276 |
    | UnityHallway | -0.32 | 0.27 | 0.08 | -0.96 | 0.73 | - |
    | UnityPushBlock | 4.88 | 4.93 | 4.68 | 4.93 | 4.97 | - |

    Episode score at the end of training attained by SLM Lab implementations on discrete-action control problems. Reported episode scores are the average over the last 100 checkpoints, and then averaged over 4 Sessions. Results marked with * were trained using the hybrid synchronous/asynchronous version of SAC to parallelize and speed up training time.

    For the full Atari benchmark, see Atari Benchmark

    Continuous Benchmark

    | Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | SAC |
    |:---:|:---:|:---:|:---:|:---:|
    | RoboschoolAnt | 787 | 1396 | 1843 | 2915 |
    | RoboschoolAtlasForwardWalk | 59.87 | 88.04 | 172 | 800 |
    | RoboschoolHalfCheetah | 712 | 439 | 1960 | 2497 |
    | RoboschoolHopper | 710 | 285 | 2042 | 2045 |
    | RoboschoolInvertedDoublePendulum | 996 | 4410 | 8076 | 8085 |
    | RoboschoolInvertedPendulum | 995 | 978 | 986 | 941 |
    | RoboschoolReacher | 12.9 | 10.16 | 19.51 | 19.99 |
    | RoboschoolWalker2d | 280 | 220 | 1660 | 1894 |
    | RoboschoolHumanoid | 99.31 | 54.58 | 2388 | 2621* |
    | RoboschoolHumanoidFlagrun | 73.57 | 178 | 2014 | 2056* |
    | RoboschoolHumanoidFlagrunHarder | -429 | 253 | 680 | 280* |
    | Unity3DBall | 33.48 | 53.46 | 78.24 | 98.44 |
    | Unity3DBallHard | 62.92 | 71.92 | 91.41 | 97.06 |

    Episode score at the end of training attained by SLM Lab implementations on continuous control problems. Reported episode scores are the average over the last 100 checkpoints, and then averaged over 4 Sessions. Results marked with * require 50M-100M frames, so we use the hybrid synchronous/asynchronous version of SAC to parallelize and speed up training time.

    Atari Benchmark

    | Env. \ Alg. | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO |
    |:---:|:---:|:---:|:---:|:---:|:---:|
    | Adventure | -0.94 | -0.92 | -0.77 | -0.85 | -0.3 |
    | AirRaid | 1876 | 3974 | 4202 | 3557 | 4028 |
    | Alien | 822 | 1574 | 1519 | 1627 | 1413 |
    | Amidar | 90.95 | 431 | 577 | 418 | 795 |
    | Assault | 1392 | 2567 | 3366 | 3312 | 3619 |
    | Asterix | 1253 | 6866 | 5559 | 5223 | 6132 |
    | Asteroids | 439 | 426 | 2951 | 2147 | 2186 |
    | Atlantis | 68679 | 644810 | 2747371 | 2259733 | 2148077 |
    | BankHeist | 131 | 623 | 855 | 1170 | 1183 |
    | BattleZone | 6564 | 6395 | 4336 | 4533 | 13649 |
    | BeamRider | 2799 | 5870 | 2659 | 4139 | 4299 |
    | Berzerk | 319 | 401 | 1073 | 763 | 860 |
    | Bowling | 30.29 | 39.5 | 24.51 | 23.75 | 31.64 |
    | Boxing | 72.11 | 90.98 | 1.57 | 1.26 | 96.53 |
    | Breakout | 80.88 | 182 | 377 | 398 | 443 |
    | Carnival | 4280 | 4773 | 2473 | 1827 | 4566 |
    | Centipede | 1899 | 2153 | 3909 | 4202 | 5003 |
    | ChopperCommand | 1083 | 4020 | 3043 | 1280 | 3357 |
    | CrazyClimber | 46984 | 88814 | 106256 | 109998 | 116820 |
    | Defender | 281999 | 313018 | 665609 | 657823 | 534639 |
    | DemonAttack | 1705 | 19856 | 23779 | 19615 | 121172 |
    | DoubleDunk | -21.44 | -22.38 | -5.15 | -13.3 | -6.01 |
    | ElevatorAction | 32.62 | 17.91 | 9966 | 8818 | 6471 |
    | Enduro | 437 | 959 | 787 | 0.0 | 1926 |
    | FishingDerby | -88.14 | -1.7 | 16.54 | 1.65 | 36.03 |
    | Freeway | 24.46 | 30.49 | 30.97 | 0.0 | 32.11 |
    | Frostbite | 98.8 | 2497 | 277 | 261 | 1062 |
    | Gopher | 1095 | 7562 | 929 | 1545 | 2933 |
    | Gravitar | 87.34 | 258 | 313 | 433 | 223 |
    | Hero | 1051 | 12579 | 16502 | 19322 | 17412 |
    | IceHockey | -14.96 | -14.24 | -5.79 | -6.06 | -6.43 |
    | Jamesbond | 44.87 | 702 | 521 | 453 | 561 |
    | JourneyEscape | -4818 | -2003 | -921 | -2032 | -1094 |
    | Kangaroo | 1965 | 8897 | 67.62 | 554 | 4989 |
    | Krull | 5522 | 6650 | 7785 | 6642 | 8477 |
    | KungFuMaster | 2288 | 16547 | 31199 | 25554 | 34523 |
    | MontezumaRevenge | 0.0 | 0.02 | 0.08 | 0.19 | 1.08 |
    | MsPacman | 1175 | 2215 | 1965 | 2158 | 2350 |
    | NameThisGame | 3915 | 4474 | 5178 | 5795 | 6386 |
    | Phoenix | 2909 | 8179 | 16345 | 13586 | 30504 |
    | Pitfall | -68.83 | -73.65 | -101 | -31.13 | -35.93 |
    | Pong | 18.48 | 20.5 | 19.31 | 19.56 | 20.58 |
    | Pooyan | 1958 | 2741 | 2862 | 2531 | 6799 |
    | PrivateEye | 784 | 303 | 93.22 | 78.07 | 50.12 |
    | Qbert | 5494 | 11426 | 12405 | 13590 | 13460 |
    | Riverraid | 953 | 10492 | 8308 | 7565 | 9636 |
    | RoadRunner | 15237 | 29047 | 30152 | 31030 | 32956 |
    | Robotank | 3.43 | 9.05 | 2.98 | 2.27 | 2.27 |
    | Seaquest | 1185 | 4405 | 1070 | 1684 | 1715 |
    | Skiing | -14094 | -12883 | -19481 | -14234 | -24713 |
    | Solaris | 612 | 1396 | 2115 | 2236 | 1892 |
    | SpaceInvaders | 451 | 670 | 733 | 750 | 797 |
    | StarGunner | 3565 | 38238 | 44816 | 48410 | 60579 |
    | Tennis | -23.78 | -10.33 | -22.42 | -19.06 | -11.52 |
    | TimePilot | 2819 | 1884 | 3331 | 3440 | 4398 |
    | Tutankham | 35.03 | 159 | 161 | 175 | 211 |
    | UpNDown | 2043 | 11632 | 89769 | 18878 | 262208 |
    | Venture | 4.56 | 9.61 | 0.0 | 0.0 | 11.84 |
    | VideoPinball | 8056 | 79730 | 35371 | 40423 | 58096 |
    | WizardOfWor | 869 | 328 | 1516 | 1247 | 4283 |
    | YarsRevenge | 5816 | 15698 | 27097 | 11742 | 10114 |
    | Zaxxon | 442 | 54.28 | 64.72 | 24.7 | 641 |

    The table above presents results for 62 Atari games. All agents were trained for 10M frames (40M including skipped frames). Reported results are the episode score at the end of training, averaged over the previous 100 evaluation checkpoints with each checkpoint averaged over 4 Sessions. Agents were checkpointed every 10k training frames.
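    The aggregation and frame bookkeeping in the note above can be made concrete with a short sketch. The function name and data layout are assumptions for illustration. Averaging each checkpoint over sessions first and then over the window gives the same number as the reverse order used in the earlier captions, provided every session has the same checkpoints.

```python
def reported_score(session_returns, window=100):
    """Average each eval checkpoint over sessions, then average the last `window` checkpoints.

    session_returns: one list per session, holding the eval score at each checkpoint.
    """
    num_sessions = len(session_returns)
    num_ckpts = len(session_returns[0])
    per_ckpt = [sum(s[i] for s in session_returns) / num_sessions for i in range(num_ckpts)]
    last = per_ckpt[-window:]
    return sum(last) / len(last)


# Frame bookkeeping: a frame skip of 4 is implied by "10M frames (40M including skipped frames)".
train_frames = 10_000_000
frame_skip = 4
ckpt_every = 10_000
emulator_frames = train_frames * frame_skip  # 40M frames including skipped ones
num_ckpts = train_frames // ckpt_every  # 1000 checkpoints; the last 100 span the final 1M training frames
```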

  • v4.0.1(Aug 11, 2019)

    This release adds a new algorithm: Soft Actor-Critic (SAC).

    Soft Actor-Critic

    • implement the original paper: "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" https://arxiv.org/abs/1801.01290 #398
    • implement the improvements from the follow-up paper: "Soft Actor-Critic Algorithms and Applications" https://arxiv.org/abs/1812.05905 #399
    • extend SAC to work directly on discrete environments using a GumbelSoftmax distribution (custom)
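    The discrete SAC extension relies on a Gumbel-Softmax (relaxed categorical) sample. The sketch below shows the standard sampling step in plain Python; it is an illustration, not SLM Lab's custom distribution class, and the function name is hypothetical. In PyTorch, torch.nn.functional.gumbel_softmax provides the same operation on tensors.

```python
import math
import random


def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel(0, 1) noise) / tau)."""
    # Gumbel(0, 1) noise by inverse transform: g = -log(-log(U)), U ~ Uniform(0, 1).
    # `or 1e-12` guards against U == 0.0, which random() can return.
    gumbels = [-math.log(-math.log(rng.random() or 1e-12)) for _ in logits]
    z = [(logit + g) / tau for logit, g in zip(logits, gumbels)]
    m = max(z)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = gumbel_softmax_sample([1.0, 2.0, 0.5], tau=0.5, rng=random.Random(0))
# The environment still needs a discrete action, e.g. the argmax of the relaxed sample.
action = max(range(len(probs)), key=probs.__getitem__)
```

    Lower tau pushes the sample closer to one-hot, while the softmax keeps it differentiable with respect to the logits, which is what lets the SAC policy loss backpropagate through a discrete action choice.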

    Roboschool (continuous control) Benchmark

    Note that the Roboschool reward scales are different from MuJoCo's.

    | Env. \ Alg. | SAC |
    |:---|---|
    | RoboschoolAnt | 2451.55 |
    | RoboschoolHalfCheetah | 2004.27 |
    | RoboschoolHopper | 2090.52 |
    | RoboschoolWalker2d | 1711.92 |

    LunarLander (discrete control) Benchmark

    Figures: SAC LunarLander trial graph of mean returns vs. frames (sac_lunar_t0_trial_graph_mean_returns_vs_frames) and its moving average (sac_lunar_t0_trial_graph_mean_returns_ma_vs_frames).

  • v4.0.0(Jul 31, 2019)

    This release corrects and optimizes all the algorithms based on benchmarking on Atari. New metrics are introduced, and the lab's API is redesigned for simplicity.

    Benchmark

    • full algorithm benchmark on 4 core Atari environments #396
    • LunarLander benchmark #388 and BipedalWalker benchmark #377

    This benchmark table is pulled from PR #396. See the full benchmark results here.

    | Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | DQN | DDQN+PER |
    |:---|---|---|---|---|---|
    | Breakout | 389.99 | 391.32 | 425.89 | 65.04 | 181.72 |
    | Pong | 20.04 | 19.66 | 20.09 | 18.34 | 20.44 |
    | Qbert | 13,328.32 | 13,259.19 | 13,691.89 | 4,787.79 | 11,673.52 |
    | Seaquest | 892.68 | 1,686.08 | 1,583.04 | 1,118.50 | 3,751.34 |

    Algorithms

    • correct and optimize all algorithms with benchmarking #315 #327 #328 #361
    • introduce "shared" and "synced" Hogwild modes for distributed training #337 #340
    • streamline and optimize agent components too

    The full list of algorithms is now:

    • SARSA
    • DQN, distributed-DQN
    • Double-DQN, Dueling-DQN, PER-DQN
    • REINFORCE
    • A2C, A3C (N-step & GAE)
    • PPO, distributed-PPO
    • SIL (A2C, PPO)

    All the algorithms can also be run in distributed mode; in some cases the distributed variant has its own name (listed above).

    Environments

    • implement vector environments #302
    • implement more environment wrappers for preprocessing. Some replay memories are retired. #303 #330 #331 #342
    • make Lab Env wrapper interface identical to gym #304, #305, #306, #307
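    The vector-environment idea from #302 can be sketched in plain Python: several envs are stepped in lockstep and any env whose episode ends is reset automatically, so a batch of states is always available. CounterEnv and SyncVectorEnv are hypothetical names, not SLM Lab's classes; the sketch assumes the old gym 4-tuple step API, and gym also ships its own vector-env utilities under gym.vector.

```python
class CounterEnv:
    """Hypothetical gym-style env: state is the step count, episode ends after `episode_len` steps."""

    def __init__(self, episode_len):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_len
        return float(self.t), reward, done, {}


class SyncVectorEnv:
    """Step several envs in lockstep; an env whose episode ends is reset automatically."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        states, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            state, reward, done, info = env.step(action)
            if done:
                state = env.reset()  # auto-reset so the batch never stalls on a finished episode
            states.append(state)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return states, rewards, dones, infos


venv = SyncVectorEnv([CounterEnv(2), CounterEnv(3)])
venv.reset()
venv.step([1, 0])
states, rewards, dones, _ = venv.step([1, 1])
```

    After the second step, the first env has finished its 2-step episode and already been reset, while the second is still mid-episode.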

    API

    • all the Space objects (AgentSpace, EnvSpace, AEBSpace, InfoSpace) are retired in favor of a much simpler interface. #335 #348
    • major API simplification throughout

    Analysis

    • rework analysis, introduce new metrics: strength, sample efficiency, training efficiency, stability, consistency #347 #349
    • fast evaluation using a vectorized env for rigorous_eval #390, and using inference for fast eval #391

    Search

    • update and rework Ray search #350 #351
  • v3.2.1(Apr 17, 2019)

    Improve installation

    • #288 split out yarn installation as extra step

    Improve functions

    • #283 #284 redesign fitness slightly
    • #281 simplify PER sample index
    • #287 #290 improve DQN polyak and network switching
    • #291 refactor advantage functions
    • #295 #296 refactor various utils, fix PyTorch inplace ops
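    #287 and #290 touch DQN's Polyak averaging and network switching. The sketch below shows the two usual ways to update a target network, with plain dicts of floats standing in for network parameters; the function names are hypothetical, and with PyTorch the same arithmetic would be applied to each parameter tensor. Note the direction: weights always flow from the online network into the target, which is the direction that #169 (v2.1.1) corrected.

```python
def polyak_update(online, target, beta):
    """Soft update: target <- beta * target + (1 - beta) * online, parameter by parameter."""
    return {name: beta * target[name] + (1 - beta) * online[name] for name in target}


def hard_replace(online):
    """Hard update: copy the online network's weights into the target (online -> target)."""
    return dict(online)


online = {"w": 1.0, "b": 0.0}
target = {"w": 2.0, "b": 4.0}
soft = polyak_update(online, target, beta=0.75)  # target moves a quarter of the way toward online
hard = hard_replace(online)
```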

    Add out layer activation

    • #300 add out layer activation
  • v3.2.0(Feb 5, 2019)

    Eval rework

    #275 #278 #279 #280

    This release adds an eval mode that works the same way as in OpenAI Baselines: spawn 2 environments, 1 for training and 1 for eval. In the same (blocking) process, run training as usual, then at each checkpoint, run an episode on the eval env and update stats.

    The logic for the stats is the same as before, except that the original body.df is now split into two: body.train_df and body.eval_df. The eval df uses the main env stats except for t and reward, which reflect progress on the eval env. Correspondingly, session analysis also produces both versions of the data.

    Data from body.eval_df is used to generate session_df, session_graph, session_fitness_df, whereas the data from body.train_df is used to generate a new set of trainsession_df, trainsession_graph, trainsession_fitness_df for debugging.

    The previous process-based eval functionality is kept, but is now called parallel_eval. This can be useful for more robust checkpointing and eval.
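    The blocking train-then-eval loop described for this release can be sketched as follows. CountdownEnv, train_with_eval, and run_eval_episode are hypothetical names, the agent is reduced to a fixed action function, and the env uses the old gym 4-tuple step API; this is not SLM Lab's Session code.

```python
class CountdownEnv:
    """Hypothetical env: reward 1 per step, episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length, {}


def run_eval_episode(env, act):
    """Run one full episode on the eval env and return its total reward."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done, _ = env.step(act(state))
        total += reward
    return total


def train_with_eval(train_env, eval_env, act, total_steps, ckpt_every):
    """Train on train_env; at every checkpoint, block and run one eval episode on eval_env."""
    eval_returns = []
    state = train_env.reset()
    for step in range(1, total_steps + 1):
        state, reward, done, _ = train_env.step(act(state))
        # ... the agent's training update on this transition would go here ...
        if done:
            state = train_env.reset()
        if step % ckpt_every == 0:
            eval_returns.append(run_eval_episode(eval_env, act))
    return eval_returns


eval_returns = train_with_eval(CountdownEnv(5), CountdownEnv(3), lambda s: 0, total_steps=10, ckpt_every=5)
```

    Keeping a separate eval env means evaluation never disturbs the state of the training episode in progress.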

    Refactoring

    #279

    • purge useless computations
    • properly and efficiently gather and organize all update variable computations.

    This also speeds up runs by 2x. For Atari BeamRider with DQN on a V100 GPU, a manual benchmark measures 110 FPS when training every 4 frames, while eval achieves 160 FPS. At 110 FPS, 10M frames take roughly a day (10M / 110 ≈ 91,000 s, about 25 hours).

  • v3.1.1(Jan 20, 2019)

    Docker image kengz/slm_lab:v3.0.0 released

    Add Retro Eval

    • #270 add retro eval mode to rerun failed online eval sessions. Use command yarn retro_eval data/reinforce_cartpole_2018_01_22_211751
    • #272 #273 fix eval saving 0 index to eval_session_df causing trial analysis to break; add reset_index for safety

    fix Boltzmann spec

    • #271 change Boltzmann spec to use Categorical instead of the wrong Argmax
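    Why Argmax was wrong for a Boltzmann policy: softmax is monotonic, so the argmax of softmax(Q / tau) is simply the argmax of Q, and the temperature never matters. Sampling from the categorical distribution is what makes the policy explore. A plain-Python sketch with hypothetical function names:

```python
import math
import random


def boltzmann_probs(q_values, tau):
    """softmax(Q / tau), computed stably by subtracting the max Q first."""
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, tau, rng=random):
    """Sample an action from Categorical(softmax(Q / tau)) instead of taking the argmax."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


rng = random.Random(0)
# With equal Q-values, sampling picks both actions; an argmax would always pick action 0.
actions = [boltzmann_action([1.0, 1.0], tau=1.0, rng=rng) for _ in range(1000)]
```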

    misc

    • #273 update the colorlover package to the proper pip release after upstream fixed a division error
    • #274 remove unused torchvision package to lighten build
  • v3.1.0(Jan 9, 2019)

    v3.1.0: L1 fitness norm, code and spec refactor, online eval

    Docker image kengz/slm_lab:v3.1.0 released

    L1 fitness norm (breaking change)

    • change fitness vector norm from L2 to L1 for intuitiveness and non-extreme values

    code and spec refactor

    • #254 PPO cleanup: remove hack and restore minimization scheme
    • #255 remove the use_gae and use_nstep params; infer them from lam and num_step_returns instead
    • #260 fix decay start_step offset, add unit tests for rate decay methods
    • #262 make epi start from 0 instead of 1 for code logic consistency
    • #264 switch max_total_t, max_epi to max_tick and max_tick_unit for directness. retire graph_x for the unit above
    • #266 add Atari fitness std, fix CUDA coredump issue
    • #269 update gym, remove box2d hack
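    With use_gae and use_nstep gone, the advantage estimator follows from which of lam and num_step_returns is set. A plain-Python sketch of both estimators and that dispatch is below; the function names, argument order, and the None convention for an unset param are assumptions, not SLM Lab's API. The n-step version bootstraps once from the value after the rollout segment, so it assumes segments are num_step_returns steps long.

```python
def gae_advantages(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """GAE: A_t = delta_t + gamma * lam * (1 - d_t) * A_{t+1},
    with delta_t = r_t + gamma * (1 - d_t) * V(s_{t+1}) - V(s_t)."""
    advs = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * not_done * v_next - values[t]
        gae = delta + gamma * lam * not_done * gae
        advs[t] = gae
    return advs


def nstep_advantages(rewards, values, next_value, dones, gamma=0.99):
    """Bootstrapped discounted returns over the rollout segment, minus the value baseline."""
    advs = [0.0] * len(rewards)
    ret = next_value
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * (1.0 - dones[t]) * ret
        advs[t] = ret - values[t]
    return advs


def calc_advantages(lam, num_step_returns, *args):
    """Hypothetical dispatch: GAE if lam is set, otherwise n-step if num_step_returns is set."""
    if lam is not None:
        return gae_advantages(*args, lam=lam)
    if num_step_returns is not None:
        return nstep_advantages(*args)
    raise ValueError("set either lam (GAE) or num_step_returns (n-step)")
```

    With gamma = lam = 1 and no episode ends, GAE reduces to the bootstrapped return minus the baseline, i.e. the same numbers as the n-step estimator, which is a handy sanity check.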

    Online Eval mode

    #252 #257 #261 #267 Run evaluation sessions during training in subprocesses. This does not interfere with the training process: it spawns multiple subprocesses that evaluate independently and append their results to an eval file. At the end, a final eval finishes, plots all the graphs, and saves all the eval data.

    • enabled by meta spec 'training_eval'
    • configure NUM_EVAL_EPI in analysis.py
    • update enjoy and eval mode syntax. see README.
    • change ckpt behavior to use e.g. tag ckpt-epi10-totalt1000
    • add new eval mode to lab. runs on a checkpoint file. see below

    Eval Session

    • add a proper eval Session which loads from the ckpt as above and does not interfere with existing files. It can be run from the terminal, and it is also used by the internal eval logic, e.g. the command python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole [email protected]_cartpole_t0_s2_ckpt-epi10-totalt1000
    • when an eval session is done, it averages all the episodes it ran and appends a row to eval_session_df.csv
    • after that, it deletes the ckpt files it had just used (to avoid excessive storage use)
    • then, it runs a trial analysis to update eval_trial_graph.png and an accompanying trial_df as the average of all session_dfs

    How eval mode works

    • checkpoint will save the models using the scheme which records its epi and total_t. This allows one to eval using the ckpt model
    • after creating ckpt files, if spec.meta.training_eval is enabled in train mode, a subprocess launches using the ckpt prepath to run an eval Session, in the same way as above: python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole [email protected]_cartpole_t0_s2_ckpt-epi10-totalt1000
    • the eval session runs as above. Checkpointing now happens at the starting timestep, at each ckpt timestep, and at the end
    • the main Session waits for the final eval session and its final eval trial to finish before closing, to ensure that other processes like zipping wait for them.

    Figure: example eval trial graph (dqn_cartpole_t0_ckpt-eval_trial_graph).

  • v3.0.0(Dec 3, 2018)

    V3: PyTorch 1.0, faster Neural Network, Variable Scheduler

    Docker image kengz/slm_lab:v3.0.0 released

    PRs included #240 #241 #239 #238 #244 #248

    PyTorch 1.0 and parallel CUDA

    • switch to PyTorch 1.0 with various improvements and parallel CUDA fix

    new Neural Network API (breaking changes)

    To accommodate more advanced features and improvements, all the networks have been improved with better spec and code design, faster operations, and added features:

    • single-tail networks no longer use a list but a single tail, for faster output computation (the for loop is slow)
    • use PyTorch optim.lr_scheduler for learning rate decay. retire old methods.
    • more efficient spec format for network, clip_grad, lr_scheduler_spec
    • fix and add proper generalization for ConvNet and RecurrentNet
    • add full basic network unit tests

    DQN

    • rewrite DQN loss for 2x speedup and code simplicity. extend to SARSA
    • retire MultitaskDQN for HydraDQN
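    One loss rewrite can serve both DQN and SARSA because the two TD targets differ only in how the next-state value is chosen. A minimal per-transition sketch with hypothetical function names (it illustrates the targets, not the batched code behind the 2x speedup):

```python
def dqn_target(reward, done, next_q_target, gamma=0.99):
    """Q-learning target: r + gamma * (1 - done) * max_a' Q_target(s', a')."""
    return reward + gamma * (1.0 - done) * max(next_q_target)


def sarsa_target(reward, done, next_q, next_action, gamma=0.99):
    """SARSA target: r + gamma * (1 - done) * Q(s', a'), using the action actually taken next."""
    return reward + gamma * (1.0 - done) * next_q[next_action]


def td_loss(q_preds, targets):
    """Mean squared TD error over a batch."""
    return sum((q - y) ** 2 for q, y in zip(q_preds, targets)) / len(q_preds)
```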

    Memory

    • add OnpolicyConcatReplay
    • standardize preprocess_state logic in onpolicy memories

    Variable Scheduler (breaking spec changes)

    • implement variable decay class VarScheduler similar to pytorch's LR scheduler. use clock with flexible scheduling units epi or total_t
    • unify VarScheduler to use standard clock.max_tick_unit specified from env
    • retire action_policy_update, update agent spec to explore_var_spec
    • replace entropy_coef with entropy_coef_spec
    • replace clip_eps with clip_eps_spec (PPO)
    • update all specs
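    VarScheduler anneals a variable such as an exploration rate or an entropy coefficient over the clock. A minimal sketch of a linear decay schedule follows; the class name and the start_val/end_val/start_step/end_step parameters are illustrative assumptions, not SLM Lab's exact spec keys.

```python
class LinearDecay:
    """Hypothetical scheduler: anneal linearly from start_val to end_val between start_step and end_step.

    `step` is whatever clock unit the spec chooses (episodes or total timesteps).
    """

    def __init__(self, start_val, end_val, start_step, end_step):
        self.start_val = start_val
        self.end_val = end_val
        self.start_step = start_step
        self.end_step = end_step

    def value(self, step):
        if step <= self.start_step:
            return self.start_val
        if step >= self.end_step:
            return self.end_val
        frac = (step - self.start_step) / (self.end_step - self.start_step)
        return self.start_val + frac * (self.end_val - self.start_val)


epsilon = LinearDecay(start_val=1.0, end_val=0.1, start_step=0, end_step=100)
```

    Reading the step from a shared clock, rather than from a counter inside the policy, is what lets the same schedule run in either epi or total_t units.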

    Math util

    • move decay methods to math_util.py
    • move math_util.py from algorithm/ to lib/

    env max tick (breaking spec changes)

    • spec/variable renamings:
      • max_episode to max_epi
      • max_timestep to max_t
      • save_epi_frequency to save_frequency
      • training_min_timestep to training_start_step
    • allow env to stop based on max_epi as well as max_total_t. propagate clock unit usage
    • introduce max_tick, max_tick_unit properties to env and clock from above
    • allow save_frequency to use the same units accordingly
    • update Pong and Beamrider to use max_total_t as end-condition

    Update Ray to reenable CUDA in search

    • update ray from 0.3.1 to 0.5.3 to address broken GPU with pytorch 1.0.0
    • to fix CUDA not being discovered in Ray workers, CUDA devices have to be set manually in the Ray remote function, due to poor design.

    Improved logging and Enjoy mode

    #243 #245

    • Best models checkpointing measured using the reward_ma
    • Early termination if the environment is solved
    • method for logging learning rate to session data frame needed to be updated after move to PyTorch lr_scheduler
    • Also removed training_net from the mean learning rate reported in the session dataframe since the learning rate doesn't change
    • update naming scheme to work with enjoy mode
    • unify and simplify prepath methods
    • info_space now uses a ckpt for loading ckpt model. Example usage: yarn start pong.json dqn_pong enjoy@data/dqn_cartpole_2018_12_02_124127/dqn_cartpole_t0_s0_ckptbest
    • update agent load and policy to properly set variables to end_val in enjoy mode
    • random-seed env as well

    Working Atari

    #242 The Atari benchmark had been failing, but the root cause was finally discovered and fixed: wrong image preprocessing. This can be due to several factors, and we are doing ablation studies to check against the old code:

    • image normalization causes the input values to be scaled down by ~255, and the resulting loss is too small for the optimizer
    • black frames in stacking at the beginning timesteps
    • wrong image permutation

    PR #242 introduces:

    • global environment preprocessor in the form of env wrapper borrowed from OpenAI baselines, in env/wrapper.py
    • a TransformImage to do the proper image transform: grayscale, downsize, and shape from w,h,c to PyTorch format c,h,w
    • a FrameStack which uses LazyFrames for efficiency to replace the agent-specific Atari stack frame preprocessing. This simplifies the Atari memories
    • update convnet to use honest shape (c,h,w) without extra transform, and remove its expensive image axis permutation since input now is in the right shape
    • update Vizdoom to produce (c,h,w) shape consistent with convnet input expectation
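    One of the causes listed above is black frames in the stack at the start of an episode. A frame stack avoids them by filling the stack with copies of the first observation on reset, which is what the OpenAI Baselines FrameStack wrapper does. The sketch below is a hypothetical plain-Python version with strings standing in for image frames; unlike LazyFrames, it simply returns a list of frame references rather than deferring the concatenation of image arrays.

```python
from collections import deque


class FrameStack:
    """Keep the last k frames; on reset, fill the stack with copies of the first frame."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)  # no black frames at the start of an episode
        return list(self.frames)

    def step(self, frame):
        self.frames.append(frame)  # maxlen drops the oldest frame automatically
        return list(self.frames)


stack = FrameStack(k=4)
first = stack.reset("f0")
after_one = stack.step("f1")
```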

    Tuned parameters will be obtained and released in the next version.

    Figure: a quick DQN training curve on Pong, where the solution average is +18 (fast_dqn_pong_t0_s0_session_graph).

  • v2.2.0(Nov 3, 2018)

    Add VizDoom environment

    #222 #224

    • add new OnPolicyImageReplay and ImageReplay memories
    • add VizDoom environment, thanks to @joelouismarino

    Add NN Weight Initialization functionality

    #223 #225

    • allow specification of NN weight init function in spec, thanks to @mwcvitkovic

    Update Plotly to v3

    #221

    • move to v3 to allow Python based (instead of bash) image saving for stability

    Fixes

    • #207 fix PPO loss function broken during refactoring
    • #217 fix multi-device CUDA parallelization in grad assignment
  • v2.1.2(Oct 2, 2018)

    Benchmark

    • #177 #183 zip experiment data file for easy upload
    • #178 #186 #188 #194 add benchmark spec files
    • #193 add benchmark standard data to compute fitness
    • #196 add benchmark mode

    Reward scaling

    • #175 add environment-specific reward scaling

    HydraDQN

    • #175 HydraDQN works on cartpole and 2dball using reward scaling. spec committed

    Add code of conduct

    • #199 add a code of conduct file for community

    Misc

    • #172 add MA reward to dataframe
    • #174 refactor session parallelization
    • #196 add sys args to run lab
    • #198 add train@ mode
  • v2.1.1(Sep 15, 2018)

    Enable Distributed CUDA

    #170 Fix the long-standing issue with PyTorch + distributed training using spawn multiprocessing, caused by Lab classes not being picklable. The class is now wrapped in an mp_runner passed as mp.Process(target=mp_runner, args), so the classes don't get cloned from memory when spawning a process, since they are now passed in from outside.

    DQN replace method fix

    #169 DQN target network replacement was in the wrong direction. Fix that.

    AtariPrioritizedReplay

    #170 #171 Add a quick AtariPrioritizedReplay via some multi-inheritance black magic with PrioritizedReplay, AtariReplay
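    For reference, proportional prioritized replay samples index i with probability p_i^alpha / sum_j p_j^alpha and corrects the resulting bias with importance-sampling weights (N * P(i))^(-beta). The sketch below is a hypothetical plain-Python version, not SLM Lab's PrioritizedReplay; efficient implementations usually back the sampling with a sum tree. The alpha and beta defaults are common choices assumed here, and weights are normalized by the batch maximum.

```python
import random


def per_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Proportional prioritized sampling with importance-sampling (IS) weights."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]  # P(i) = p_i^alpha / sum_j p_j^alpha
    idxs = rng.choices(range(len(priorities)), weights=probs, k=batch_size)
    n = len(priorities)
    weights = [(n * probs[i]) ** (-beta) for i in idxs]  # w_i = (N * P(i))^(-beta)
    max_w = max(weights)
    return idxs, [w / max_w for w in weights]  # normalize so the largest weight is 1


idxs, weights = per_sample([0.0, 1.0, 3.0], batch_size=5, rng=random.Random(0))
```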

  • v2.1.0(Sep 9, 2018)

    This release optimizes the RAM consumption and memory sampling speed after stress-testing with Atari. RAM growth is curbed, and replay memory RAM usage is now near theoretical optimality.

    Thanks to @mwcvitkovic for providing major help with this release.

    Remove DataSpace history

    #163

    • debug and fix memory growth (cause: data space saving history)
    • remove history saving altogether, and mdp data. remove aeb add_single. This changes the API.
    • create body.df to track data efficiently as a replacement. This is the API replacement for above.

    Optimize Replay Memory RAM

    #163 first optimization, halves replay RAM

    • make memory state numpy storage float16 to accommodate big memory size. half a million max_size virtual memory goes from 200GB to 50GB
    • memory index sampling for training with large size is very slow. add a method fast_uniform_sampling to speed up

    #165 second optimization, halves replay RAM again to the theoretical minimum

    • do not save next_states for replay memories due to redundancy
    • replace with sentinel self.latest_next_states during sampling
    • 1 mil max_size for Atari replay now consumes 50GB instead of 100GB (was 200GB before float16 downcasting in #163)
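    The #165 trick can be sketched as a ring buffer that stores each state once and reconstructs next states on demand, keeping only the newest next state separately. This is a hypothetical plain-Python illustration of the idea, not SLM Lab's Replay class.

```python
class Replay:
    """Ring-buffer replay that does not store next_states (hypothetical sketch).

    The next state of slot i is the state in slot i + 1, except for the newest
    transition, whose next state is kept in `latest_next_state`. This assumes
    transitions are added in trajectory order. At an episode boundary, slot i + 1
    holds the reset state rather than the true terminal state, which is harmless
    because done = 1 masks the bootstrap term in the TD target.
    """

    def __init__(self, max_size):
        self.max_size = max_size
        self.states = [None] * max_size
        self.actions = [None] * max_size
        self.rewards = [None] * max_size
        self.dones = [None] * max_size
        self.head = -1  # index of the newest transition
        self.size = 0
        self.latest_next_state = None

    def add(self, state, action, reward, next_state, done):
        self.head = (self.head + 1) % self.max_size
        self.states[self.head] = state
        self.actions[self.head] = action
        self.rewards[self.head] = reward
        self.dones[self.head] = done
        self.latest_next_state = next_state  # only the newest next_state is kept
        self.size = min(self.size + 1, self.max_size)

    def next_state(self, idx):
        if idx == self.head:
            return self.latest_next_state
        return self.states[(idx + 1) % self.max_size]


buf = Replay(max_size=3)
for t in range(4):  # the 4th add wraps around and overwrites slot 0
    buf.add(f"s{t}", 0, 0.0, f"s{t + 1}", False)
```

    Storing states once instead of twice is where the second halving of replay RAM comes from.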

    Add OnPolicyAtariReplay

    #164

    • add OnPolicyAtariReplay memory so that policy based algorithms can be applied to the Atari suite.

    Misc

    • #157 allow usage as a python module via pip install -e . or python setup.py install
    • #160 guard lab default.json creation on first install
    • #161 fix agent save method, improve logging
    • #162 split logger by session for easier debugging
    • #164 fix N-Step-returns calculation
    • #166 fix a pandas casting issue that caused the process to hang
    • #167 uninstall unused tensorflow and tensorboard that come with Unity ML-Agents. rebuild Docker image.
    • #168 rebuild Docker and CI images
  • v2.0.0(Sep 3, 2018)

    This major v2.0.0 release addresses user feedback on usability and feature requests:

    • makes the singleton case (single-agent-env) the default
    • adds CUDA GPU support for all algorithms (except for distributed)
    • adds distributed training to all algorithms (A3C style)
    • optimizes compute, fixes some computation bugs

    Note that this release is backward-incompatible with v1.x and earlier.

    v2.0.0: make components independent of the framework so it can be used outside of SLM-Lab for development and production, and improve usability. Backward-incompatible with v1.x.

    Singleton Mode as Default

    #153

    • singleton case (single-agent-env-body) is now the default. Any implementations need only to worry about singleton. Uses the Session in lab.
    • space case (multi-agent-env-body) is now an extension from singleton case. Simply add space_{method} to handle the space logic. Uses the SpaceSession in lab.
    • make components more independent from framework
    • major logic simplification to improve usability. Simplify the AEB and init sequences. remove post_body_init()
    • make network update and grad norm check more robust

    CUDA support

    #153

    • add attribute Net.cuda_id for device assignment (per network basis), and auto-calculate the cuda_id by trial and session index to distribute jobs
    • enable CUDA and add GPU support for all algorithms, except for distributed (A3C, DPPO etc.)
    • properly assign tensors to CUDA automatically, depending on whether a GPU is available and desired
    • run unit tests on machine with GTX 1070

    Distributed Training

    #153 #148

    • add distributed key to meta spec
    • enable distributed training using pytorch multiprocessing. Create new DistSession class which acts as the worker.
    • In distributed training, Trial creates the global networks for agents, then passes to and spawns DistSession. Effectively, the semantics of a session changes from being a disjoint copy to being a training worker.
    • make distributed usable for both singleton (single agent) and space (multiagent) cases.
    • add distributed cases to unit tests

    State Normalization

    #155

    • add state normalization using running mean and std: state = (state - mean) / std
    • apply to all algorithms
    • TODO conduct a large-scale systematic study of the effect of state normalization vs. without it
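    The normalization above needs a running mean and std. One standard way to maintain them is Welford's online algorithm, sketched below in plain Python; the class name, the population-std choice, and the small eps guarding against division by zero are assumptions, not necessarily what #155 implements.

```python
import math


class RunningNorm:
    """Per-dimension running mean/std via Welford's algorithm; normalize as (x - mean) / std."""

    def __init__(self, dim, eps=1e-8):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def std(self):
        if self.n < 2:
            return [1.0] * len(self.mean)
        return [math.sqrt(v / self.n) for v in self.m2]  # population std

    def normalize(self, x):
        return [(xi - m) / (s + self.eps) for xi, m, s in zip(x, self.mean, self.std())]


norm = RunningNorm(dim=2)
for state in ([1.0, 10.0], [2.0, 20.0], [3.0, 30.0]):
    norm.update(state)
normalized = norm.normalize([2.0, 20.0])  # the running mean maps to [0.0, 0.0]
```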

    Bug Fixes and Improvements

    #153

    • save() and load() now include network optimizers
    • refactor set_manual_seed to util
    • rename StackReplay to ConcatReplay for clarity
    • improve network training check of weights and grad norms
    • introduce BaseEnv as base class to OpenAIEnv and UnityEnv
    • optimize computations, major refactoring
    • update Dockerfile and release

    Misc

    • #155 add state normalization using running mean and std
    • #154 fix A2C advantage calculation for Nstep returns
    • #152 refactor SIL implementation using multi-inheritance
    • #151 refactor Memory module
    • #150 refactor Net module
    • #147 update grad clipping, norm check, multicategorical API
    • #156 fix multiprocessing for device with cuda, without using cuda
    • #156 fix multi policy arguments to be consistent, and add missing state append logic
  • v1.1.2(Aug 8, 2018)

    This release adds PPOSIL, fixes some small issues with continuous actions, and fixes the PPO ratio computation.

    Implementations

    • #145 implement PPOSIL; improve debug logging
    • #143 add Arch installer, thanks to @angel-ayala

    Bug Fixes

    • #138 kill hanging Electron processes used for plotting
    • #145 fix PPO wrong graph update sequence causing the ratio to be 1; fix continuous action output construction; add guards
    • #146 fix continuous actions and add full tests

  • v1.1.1(Jun 28, 2018)

    This release adds some new implementations, and fixes some bugs from first benchmark runs.

    Implementations

    • #127 Self-Imitation Learning
    • #128 Checkpointing for saving models
    • #129 Dueling Networks

    Bug Fixes

    • #132 GPU test-run fixes
    • #133 fix ActorCritic family loss compute getting detached, and linux plotting issues, add SHA to generated specs

  • v1.1.0(Jun 19, 2018)

    Canonical Algorithms and Components

    This release is research-ready.

    Finishes the implementation of all canonical algorithms and components. The design is fully refactored, and components are reusable across algorithms where suitable. Read the updated doc

    SLM Lab implements most of the recent canonical algorithms and various extensions. These are used as the base of research.

    Algorithm

    code: slm_lab/agent/algorithm

    Various algorithms are in fact extensions of some simpler ones, and they are implemented as such. This makes the code very concise.

    Policy Gradient:

    • REINFORCE
    • AC (Vanilla Actor-Critic)
      • shared or separate actor critic networks
      • plain TD
      • entropy term control
    • A2C (Advantage Actor-Critic)
      • extension of AC with an advantage function
      • N-step returns as advantage
      • GAE (Generalized Advantage Estimate) as advantage
    • PPO (Proximal Policy Optimization)
      • extension of A3C with PPO loss function

    Value-based:

    • SARSA
    • DQN (Deep Q Learning)
      • boltzmann or epsilon-greedy policy
    • DRQN (Recurrent DQN)
    • Double DQN
    • Double DRQN
    • Multitask DQN (multi-environment DQN)
    • Hydra DQN (multi-environment DQN)

    Below are the modular building blocks for the algorithms. They are designed to be general, and are reused extensively.

    Memory

    code: slm_lab/agent/memory

    For on-policy algorithms (policy gradient):

    • OnPolicyReplay
    • OnPolicySeqReplay
    • OnPolicyBatchReplay
    • OnPolicyBatchSeqReplay

    For off-policy algorithms (value-based):

    • Replay
    • SeqReplay
    • StackReplay
    • AtariReplay
    • PrioritizedReplay

    Neural Network

    code: slm_lab/agent/net

    These networks are usable for all algorithms.

    • MLPNet (Multi Layer Perceptron)
    • MLPHeterogenousTails (multi-tails)
    • HydraMLPNet (multi-heads, multi-tails)
    • RecurrentNet
    • ConvNet

    Policy

    code: slm_lab/agent/algorithm/policy_util.py

    • different probability distributions for sampling actions
    • default policy
    • Boltzmann policy
    • Epsilon-greedy policy
    • numerous rate decay methods
  • v1.0.3(May 16, 2018)

    New features and improvements

    • some code cleanup to prepare for the next version
    • DQN Atari working, not optimized yet
    • Dockerfile finished, ready to run lab at scale on server
    • implemented PPO in tensorflow from OpenAI, along with the utils
  • v1.0.2(Mar 4, 2018)

    New features and improvements

    • add EvolutionarySearch for hyperparameter search
    • rewrite and simplify the underlying Ray logic
    • fix categorical error in a2c
    • improve experiment graph: wider, add opacity
  • v1.0.1 (Feb 17, 2018)

    New features and improvements

    • improve fitness computation after usage
    • add a retro analysis script, run via yarn analyze <dir>
    • improve plotly renderings
    • improve CNN and RNN architectures and bring them to REINFORCE
    • fine tune A2C and Reinforce specs
  • v1.0.0 (Feb 4, 2018)

    This is the first stable release of the lab, with the core API and features finalized.

    Refer to the docs: GitHub Repo | Lab Documentation | Experiment Log Book

    Features

    All the crucial features of the lab are stable and tested:

    • baseline algorithms
    • OpenAI Gym and Unity environments
    • modular reusable components
    • multi-agents, multi-environments
    • scalable hyperparameter search with Ray
    • useful graphs and analytics
    • fitness vector for universal benchmarking of agents and environments

    Baselines

    The first release includes the following algorithms, with more to come later.

    • DQN
    • Double DQN
    • REINFORCE
      • Option to add entropy to encourage exploration
    • Actor-Critic
      • Batch or episodic training
      • Shared or separate actor and critic params
      • Advantage calculated using n-step returns or generalized advantage estimation
      • Option to add entropy to encourage exploration
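The entropy option mentioned above for REINFORCE and Actor-Critic can be sketched as a per-sample loss in plain Python. This assumes a discrete policy given as a probability list and a shared-parameter actor-critic. The coefficients and function names are hypothetical, and there is no autograd here, only the scalar value a framework would differentiate.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    # H(pi) = -sum_a pi(a) log pi(a); zero-probability actions contribute nothing
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_critic_loss(probs: Sequence[float], action: int, advantage: float,
                      value: float, value_target: float,
                      value_coef: float = 0.5, entropy_coef: float = 0.01) -> float:
    """Per-sample loss for a shared-parameter actor-critic (scalars only)."""
    policy_loss = -math.log(probs[action]) * advantage  # policy gradient term
    value_loss = (value_target - value) ** 2  # critic regression term
    # subtracting the entropy lowers the loss for less deterministic policies,
    # which encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy(probs)
```

With separate actor and critic parameters, the policy and value terms would typically be minimized by their own optimizers instead of being summed. Dropping the value term and using the discounted return as `advantage` gives a REINFORCE loss with the same entropy bonus.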
Owner
Wah Loon Keng
Engineer by day, rock climber by night. Mathematician at heart.