Keras implementation of AdaBound

Overview

AdaBound for Keras

Keras port of AdaBound Optimizer for PyTorch, from the paper Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Usage

Add the adabound.py script to your project, and import it. Can be a dropin replacement for Adam Optimizer.

Also supports AMSBound variant of the above, equivalent to AMSGrad from Adam.

from adabound import AdaBound

optm = AdaBound(lr=1e-03,
                final_lr=0.1,
                gamma=1e-03,
                weight_decay=0.,
                amsbound=False)

Results

With a wide ResNet 34 and horizontal flips data augmentation, and 100 epochs of training with batchsize 128, it hits 92.16% (called v1).

Weights are available inside the Releases tab

NOTE

  • The smaller ResNet 20 models have been removed as they did not perform as expected and were depending on a flaw during the initial implementation. The ResNet 32 shows the actual performance of this optimizer.

With a small ResNet 20 and width + height data + horizontal flips data augmentation, and 100 epochs of training with batchsize 1024, it hits 89.5% (called v1).

On a small ResNet 20 with only width and height data augmentations, with batchsize 1024 trained for 100 epochs, the model gets close to 86% on the test set (called v3 below).

Train Set Accuracy

Train Set Loss

Test Set Accuracy

Test Set Loss

Requirements

  • Keras 2.2.4+ & Tensorflow 1.12+ (Only supports TF backend for now).
  • Numpy
Comments
  • suggestion: allow to train x2 or x3 bigger networks on same vram with TF backend

    suggestion: allow to train x2 or x3 bigger networks on same vram with TF backend

    same as my PR https://github.com/keras-team/keras-contrib/pull/478 works only with TF backend

    class AdaBound(Optimizer):
        """AdaBound optimizer.
        Default parameters follow those provided in the original paper.
        # Arguments
            lr: float >= 0. Learning rate.
            final_lr: float >= 0. Final learning rate.
            beta_1: float, 0 < beta < 1. Generally close to 1.
            beta_2: float, 0 < beta < 1. Generally close to 1.
            gamma: float >= 0. Convergence speed of the bound function.
            epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
            decay: float >= 0. Learning rate decay over each update.
            weight_decay: Weight decay weight.
            amsbound: boolean. Whether to apply the AMSBound variant of this
                algorithm.
            tf_cpu_mode: only for tensorflow backend
                  0 - default, no changes.
                  1 - allows to train x2 bigger network on same VRAM consuming RAM
                  2 - allows to train x3 bigger network on same VRAM consuming RAM*2
                      and CPU power.
        # References
            - [Adaptive Gradient Methods with Dynamic Bound of Learning Rate]
              (https://openreview.net/forum?id=Bkg3g2R9FX)
            - [Adam - A Method for Stochastic Optimization]
              (https://arxiv.org/abs/1412.6980v8)
            - [On the Convergence of Adam and Beyond]
              (https://openreview.net/forum?id=ryQu7f-RZ)
        """
    
        def __init__(self, lr=0.001, final_lr=0.1, beta_1=0.9, beta_2=0.999, gamma=1e-3,
                     epsilon=None, decay=0., amsbound=False, weight_decay=0.0, tf_cpu_mode=0, **kwargs):
            super(AdaBound, self).__init__(**kwargs)
    
            if not 0. <= gamma <= 1.:
                raise ValueError("Invalid `gamma` parameter. Must lie in [0, 1] range.")
    
            with K.name_scope(self.__class__.__name__):
                self.iterations = K.variable(0, dtype='int64', name='iterations')
                self.lr = K.variable(lr, name='lr')
                self.beta_1 = K.variable(beta_1, name='beta_1')
                self.beta_2 = K.variable(beta_2, name='beta_2')
                self.decay = K.variable(decay, name='decay')
    
            self.final_lr = final_lr
            self.gamma = gamma
    
            if epsilon is None:
                epsilon = K.epsilon()
            self.epsilon = epsilon
            self.initial_decay = decay
            self.amsbound = amsbound
    
            self.weight_decay = float(weight_decay)
            self.base_lr = float(lr)
            self.tf_cpu_mode = tf_cpu_mode
    
        def get_updates(self, loss, params):
            grads = self.get_gradients(loss, params)
            self.updates = [K.update_add(self.iterations, 1)]
    
            lr = self.lr
            if self.initial_decay > 0:
                lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                          K.dtype(self.decay))))
    
            t = K.cast(self.iterations, K.floatx()) + 1
    
            # Applies bounds on actual learning rate
            step_size = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                              (1. - K.pow(self.beta_1, t)))
    
            final_lr = self.final_lr * lr / self.base_lr
            lower_bound = final_lr * (1. - 1. / (self.gamma * t + 1.))
            upper_bound = final_lr * (1. + 1. / (self.gamma * t))
    
            e = K.tf.device("/cpu:0") if self.tf_cpu_mode > 0 else None
            if e: e.__enter__()
            ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            if self.amsbound:
                vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            else:
                vhats = [K.zeros(1) for _ in params]
            if e: e.__exit__(None, None, None)
            
            self.weights = [self.iterations] + ms + vs + vhats
    
            for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
                # apply weight decay
                if self.weight_decay != 0.:
                    g += self.weight_decay * K.stop_gradient(p)
    
                e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
                if e: e.__enter__()                    
                m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
                v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
                if self.amsbound:
                    vhat_t = K.maximum(vhat, v_t)
                    self.updates.append(K.update(vhat, vhat_t))
                if e: e.__exit__(None, None, None)
                
                if self.amsbound:
                    denom = (K.sqrt(vhat_t) + self.epsilon)
                else:
                    denom = (K.sqrt(v_t) + self.epsilon)                        
    
                # Compute the bounds
                step_size_p = step_size * K.ones_like(denom)
                step_size_p_bound = step_size_p / denom
                bounded_lr_t = m_t * K.minimum(K.maximum(step_size_p_bound,
                                                         lower_bound), upper_bound)
    
                p_t = p - bounded_lr_t
    
                self.updates.append(K.update(m, m_t))
                self.updates.append(K.update(v, v_t))
                new_p = p_t
    
                # Apply constraints.
                if getattr(p, 'constraint', None) is not None:
                    new_p = p.constraint(new_p)
    
                self.updates.append(K.update(p, new_p))
            return self.updates
    
        def get_config(self):
            config = {'lr': float(K.get_value(self.lr)),
                      'final_lr': float(self.final_lr),
                      'beta_1': float(K.get_value(self.beta_1)),
                      'beta_2': float(K.get_value(self.beta_2)),
                      'gamma': float(self.gamma),
                      'decay': float(K.get_value(self.decay)),
                      'epsilon': self.epsilon,
                      'weight_decay': self.weight_decay,
                      'amsbound': self.amsbound}
            base_config = super(AdaBound, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))
    
    opened by iperov 13
  • AdaBound.iterations

    AdaBound.iterations

    this param is not saved.

    I looked at official pytorch implementation from original paper. https://github.com/Luolc/AdaBound/blob/master/adabound/adabound.py

    it has

    # State initialization
    if len(state) == 0:
        state['step'] = 0
    

    state is saved with the optimizer.

    also it has

    # Exponential moving average of gradient values
    state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of squared gradient values
    state['exp_avg_sq'] = torch.zeros_like(p.data)
    

    these values should also be saved

    So your keras implementation is wrong.

    opened by iperov 10
  • Using SGDM with lr=0.1 leads to not learning

    Using SGDM with lr=0.1 leads to not learning

    Thanks for sharing your keras version of adabound and I found that when changing optimizer from adabound to SGDM (lr=0.1), the resnet doesn't learn at all like the fig below. image

    I remember that in the original paper it uses SGDM (lr=0.1) for comparisons and I'm wondering how this could be.

    opened by syorami 10
  • clip by value

    clip by value

    https://github.com/CyberZHG/keras-adabound/blob/master/keras_adabound/optimizers.py

    K.minimum(K.maximum(step, lower_bound), upper_bound)

    will not work?

    opened by iperov 2
  • Unexpected keyword argument passed to optimizer: amsbound

    Unexpected keyword argument passed to optimizer: amsbound

    I installed with pip install keras-adabound imported with: from keras_adabound import AdaBound and declared the optimizer as: opt = AdaBound(lr=1e-03,final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False) Then, I'm getting the error: TypeError: Unexpected keyword argument passed to optimizer: amsbound

    changing the pip install to adabound (instead of keras-adabound) and the import to from adabound import AdaBound, the keyword amsbound is recognized, but then I get the error: TypeError: __init__() missing 1 required positional argument: 'params'

    Am I mixing something up here or missing something?

    opened by stabilus 0
  • Unclear how to import and use tf.keras version

    Unclear how to import and use tf.keras version

    I have downloaded the files and placed them in a folder in the site packages for my virtual environment but I can't get this to work. I have added the folder path to sys.path and verified it is listed. I'm running Tensorflow 2.1.0. What am I doing wrong?

    opened by mnweaver1 0
  • about lr

    about lr

    Thanks for a good optimizer According to usage optm = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False) Does the learning rate gradually increase by the number of steps?


    final lr is described as Final learning rate. but it actually is leaning rate relative to base lr and current klearning rate? https://github.com/titu1994/keras-adabound/blob/5ce819b6ca1cd95e32d62e268bd2e0c99c069fe8/adabound.py#L72

    opened by tanakataiki 1
Releases(0.1)
Owner
Somshubra Majumdar
Interested in Machine Learning, Deep Learning and Data Science in general
Somshubra Majumdar
MaskTrackRCNN for video instance segmentation based on mmdetection

MaskTrackRCNN for video instance segmentation Introduction This repo serves as the official code release of the MaskTrackRCNN model for video instance

411 Jan 05, 2023
Benchmark for Answering Existential First Order Queries with Single Free Variable

EFO-1-QA Benchmark for First Order Query Estimation on Knowledge Graphs This repository contains an entire pipeline for the EFO-1-QA benchmark. EFO-1

HKUST-KnowComp 14 Oct 24, 2022
Easy-to-use micro-wrappers for Gym and PettingZoo based RL Environments

SuperSuit introduces a collection of small functions which can wrap reinforcement learning environments to do preprocessing ('microwrappers'). We supp

Farama Foundation 357 Jan 06, 2023
Parameter Efficient Deep Probabilistic Forecasting

PEDPF Parameter Efficient Deep Probabilistic Forecasting (PEDPF) is a repository containing code to run experiments for several deep learning based pr

Olivier Sprangers 10 Jun 13, 2022
A large dataset of 100k Google Satellite and matching Map images, resembling pix2pix's Google Maps dataset.

Larger Google Sat2Map dataset This dataset extends the aerial ⟷ Maps dataset used in pix2pix (Isola et al., CVPR17). The provide script download_sat2m

34 Dec 28, 2022
Show-attend-and-tell - TensorFlow Implementation of "Show, Attend and Tell"

Show, Attend and Tell Update (December 2, 2016) TensorFlow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attent

Yunjey Choi 902 Nov 29, 2022
The official implementation of EIGNN: Efficient Infinite-Depth Graph Neural Networks (NeurIPS 2021)

EIGNN: Efficient Infinite-Depth Graph Neural Networks The official implementation of EIGNN: Efficient Infinite-Depth Graph Neural Networks (NeurIPS 20

Juncheng Liu 14 Nov 22, 2022
This is an official source code for implementation on Extensive Deep Temporal Point Process

Extensive Deep Temporal Point Process This is an official source code for implementation on Extensive Deep Temporal Point Process, which is composed o

Haitao Lin 8 Aug 15, 2022
A Review of Deep Learning Techniques for Markerless Human Motion on Synthetic Datasets

HOW TO USE THIS PROJECT A Review of Deep Learning Techniques for Markerless Human Motion on Synthetic Datasets Based on DeepLabCut toolbox, we run wit

1 Jan 10, 2022
Unet network with mean teacher for altrasound image segmentation

Unet network with mean teacher for altrasound image segmentation

5 Nov 21, 2022
YOLOX-RMPOLY

本算法为适应robomaster比赛,而改动自矩形识别的yolox算法。 基于旷视科技YOLOX,实现对不规则四边形的目标检测 TODO 修改onnx推理模型 更改/添加标注: 1.yolox/models/yolox_polyhead.py: 1.1继承yolox/models/yolo_

3 Feb 25, 2022
Si Adek Keras is software VR dangerous object detection.

Si Adek Python Keras Sistem Informasi Deteksi Benda Berbahaya Keras Python. Version 1.0 Developed by Ananda Rauf Maududi. Developed date: 24 November

Ananda Rauf 1 Dec 21, 2021
QQ Browser 2021 AI Algorithm Competition Track 1 1st Place Program

QQ Browser 2021 AI Algorithm Competition Track 1 1st Place Program

249 Jan 03, 2023
Attendance Monitoring with Face Recognition using Python

Attendance Monitoring with Face Recognition using Python A python GUI integrated attendance system using face recognition to take attendance. In this

Vaibhav Rajput 2 Jun 21, 2022
StyleGAN2-ada for practice

This version of the newest PyTorch-based StyleGAN2-ada is intended mostly for fellow artists, who rarely look at scientific metrics, but rather need a working creative tool. Tested on Python 3.7 + Py

vadim epstein 170 Nov 16, 2022
EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers

EntityQuestions This repository contains the EntityQuestions dataset as well as code to evaluate retrieval results from the the paper Simple Entity-ce

Princeton Natural Language Processing 119 Sep 28, 2022
Losslandscapetaxonomy - Taxonomizing local versus global structure in neural network loss landscapes

Taxonomizing local versus global structure in neural network loss landscapes Int

Yaoqing Yang 8 Dec 30, 2022
Simulation of moving particles under microscopic imaging

Simulation of moving particles under microscopic imaging Install scipy numpy scikit-image tiffile Run python simulation.py Read result https://imagej

Zehao Wang 2 Dec 14, 2021
Train emoji embeddings based on emoji descriptions.

emoji2vec This is my attempt to train, visualize and evaluate emoji embeddings as presented by Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko

Miruna Pislar 17 Sep 03, 2022
A check for whether the dependency jobs are all green.

alls-green A check for whether the dependency jobs are all green. Why? Do you have more than one job in your GitHub Actions CI/CD workflows setup? Do

Re:actors 33 Jan 03, 2023