Apollo Optimizer in TensorFlow 2.x
Notes:
- Warmup is important with the Apollo optimizer, so be sure to pass a learning rate schedule rather than a constant value for `learning_rate`. A one-cycle scheduler is provided as an example in one_cycle_lr_schedule.py (see the usage sketch below).
- To clip gradient norms as in the paper, add either `clipnorm` (parameter-wise clipping by norm) or `global_clipnorm` to the arguments (for example `clipnorm=0.1`).
- Decoupled weight decay is used by default.