rl: An efficient framework for reinforcement learning
Requirements
| name | version |
| --- | --- |
| Python | >=3.7 |
| numpy | >=1.19 |
| torch | >=1.7 |
| tensorboard | >=2.5 |
| tensorboardX | >=2.4 |
| gym | >=0.18.3 |
Make sure your Python environment is activated before installing the following requirements:
pip install -U gym tensorboard tensorboardx
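If you want to verify the installation, a minimal check (not part of the framework) is to import the required packages and print their versions:

```python
# Minimal installation check: import the required packages and print their
# versions to confirm they meet the table above.
import numpy
import torch
import gym
import tensorboard
import tensorboardX

for pkg in (numpy, torch, gym, tensorboard, tensorboardX):
    print(pkg.__name__, pkg.__version__)
```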
Introduction
Quick Start
Enter the following command in the terminal to start training CartPole-v0:
python demo.py
Enter the following command in the terminal to start training Pendulum-v0:
python demo.py --env_name Pendulum-v0 --target_reward -250.0
To use a recurrent neural network (GRU):
python demo.py --env_name Pendulum-v0 --target_reward -250.0 --use_rnn --log_dir Pendulum-v0_RNN
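The flags can be combined. For example, assuming demo.py accepts the same arguments as above, CartPole-v0 can also be trained with the recurrent policy:
python demo.py --env_name CartPole-v0 --use_rnn --log_dir CartPole-v0_RNN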
Open a new terminal:
tensorboard --logdir=result
Then you can view the training information by visiting http://localhost:6006/ in a browser.
Structure
- core/ Reinforcement Learning core module
  - log.py logging
  - ppo.py Proximal Policy Optimization algorithm
  - network.py definition of the actor and critic networks
- env/ environments and multiprocessing wrappers
  - test_env.py test environment
  - vec_env.py wrapped vectorized environment
- result/ training curves and models
- demo.py demonstration script
Proximal Policy Optimization
PPO is an on-policy and model-free reinforcement learning algorithm.
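The heart of PPO is the clipped surrogate objective, controlled by the clipping parameter eps. The following is a minimal PyTorch sketch of that loss for reference, not the exact code in core/ppo.py:

```python
import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, eps=0.2):
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the minimum of the two terms; negate to use as a loss.
    return -torch.min(unclipped, clipped).mean()
```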
Components
- Generalized Advantage Estimation (GAE), see the sketch after this list
- Gated Recurrent Unit (GRU)
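As a reference for how GAE trades off TD and Monte Carlo estimates (the gae_lambda hyperparameter below), here is a minimal sketch, assuming per-step rewards, value estimates, done flags, and a bootstrap value for the final state; the actual implementation in core/ppo.py may differ:

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    # rewards, values, dones: 1-D tensors of length T collected from a rollout.
    # last_value: critic estimate for the state following the final step.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # targets for the critic
    return advantages, returns
```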
Hyperparameters
| hyperparameter | note | value |
| --- | --- | --- |
| env_num | number of parallel environment processes | 16 |
| chunk_len | BPTT chunk length for the GRU | 10 |
| eps | clipping parameter | 0.2 |
| gamma | discount factor | 0.99 |
| gae_lambda | trade-off between TD and Monte Carlo | 0.95 |
| entropy_coef | entropy bonus coefficient | 0.05 |
| ppo_epoch | number of update epochs per batch | 5 |
| adv_norm | advantage normalization | 1 (True) |
| max_norm | gradient clipping (L2 norm) | 20.0 |
| weight_decay | weight decay (L2 regularization) | 1e-6 |
| lr_actor | learning rate of the actor network | 1e-3 |
| lr_critic | learning rate of the critic network | 1e-3 |
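The table above can also be summarized as a plain config dict (hypothetical names chosen to mirror the table; the actual argument names in demo.py may differ):

```python
# Hypothetical config mirroring the hyperparameter table; the real flag
# names in demo.py / core/ppo.py may differ.
config = dict(
    env_num=16,          # parallel environment processes
    chunk_len=10,        # BPTT chunk length for the GRU
    eps=0.2,             # PPO clipping parameter
    gamma=0.99,          # discount factor
    gae_lambda=0.95,     # GAE trade-off between TD and Monte Carlo
    entropy_coef=0.05,   # entropy bonus coefficient
    ppo_epoch=5,         # update epochs per batch
    adv_norm=True,       # normalize advantages
    max_norm=20.0,       # gradient clipping (L2 norm)
    weight_decay=1e-6,   # L2 weight decay
    lr_actor=1e-3,       # actor learning rate
    lr_critic=1e-3,      # critic learning rate
)
```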
Test Environment
A simple test environment for verifying that the algorithm works (you can also use it to check an algorithm you implement yourself). The logic is simple and the code is short.
Mechanism
At every step, the environment randomly chooses a number in {0, 1, 2} and returns it as a one-hot vector.
The action encodes three guesses for the numbers chosen in the last 3 steps (the 9*, 3*, and 1* terms in the transcript below, from the most recent to the oldest). Guessing all three correctly gives the full reward of 1.0; otherwise the reward is the fraction of correct guesses.
>>> from env.test_env import TestEnv
>>> env = TestEnv()
>>> env.seed(0)
>>> env.reset()
array([1., 0., 0.], dtype=float32)
>>> env.step(9 * 0 + 3 * 0 + 1 * 0)
(array([0., 1., 0.], dtype=float32), 1.0, False, {'str': 'Completely correct.'})
>>> env.step(9 * 1 + 3 * 0 + 1 * 0)
(array([1., 0., 0.], dtype=float32), 1.0, False, {'str': 'Completely correct.'})
>>> env.step(9 * 0 + 3 * 1 + 1 * 0)
(array([0., 1., 0.], dtype=float32), 1.0, False, {'str': 'Completely correct.'})
>>> env.step(9 * 0 + 3 * 1 + 1 * 0)
(array([0., 1., 0.], dtype=float32), 0.0, False, {'str': 'Completely wrong.'})
>>> env.step(9 * 0 + 3 * 1 + 1 * 0)
(array([0., 0., 1.], dtype=float32), 0.6666666666666666, False, {'str': 'Partially correct.'})
>>> env.step(9 * 2 + 3 * 0 + 1 * 0)
(array([1., 0., 0.], dtype=float32), 0.3333333333333333, False, {'str': 'Partially correct.'})
>>> env.step(9 * 0 + 3 * 2 + 1 * 1)
(array([0., 0., 1.], dtype=float32), 1.0, False, {'str': 'Completely correct.'})
>>>
Convergence Reward
- General (memoryless) RL algorithms will converge to an average reward of 55.5, since without memory the policy can only recover the most recently observed number, for an expected reward of about 5/9 per step (a rough check is sketched below).
- RNN-based RL algorithms keep past observations in their hidden state, so they can reach the goal of 100.0.
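A rough check of the 55.5 figure (a sketch, assuming the action encoding inferred from the transcript above): a memoryless policy can read the most recent number off the observation and must guess the other two at random, giving an expected reward of 1/3 + 2 * (1/3 * 1/3) = 5/9 ≈ 0.555 per step:

```python
# Sketch: evaluate a memoryless policy on TestEnv. It always guesses the
# most recently observed number and picks the other two guesses at random,
# so the average per-step reward should approach 5/9.
import numpy as np
from env.test_env import TestEnv

env = TestEnv()
env.seed(0)
obs = env.reset()
rewards = []
for _ in range(10000):
    recent = int(np.argmax(obs))                # number just observed
    prev, prev2 = np.random.randint(3, size=2)  # blind guesses for older steps
    obs, reward, done, info = env.step(int(9 * recent + 3 * prev + prev2))
    rewards.append(reward)
    if done:
        obs = env.reset()
print(np.mean(rewards))  # close to 5/9
```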
2021, ICCD Lab, Dalian University of Technology. Author: Jingcheng Jiang.