SAGE: Sensitivity-guided Adaptive Learning Rate for Transformers

Last update: Nov 07, 2022

Overview

This repo contains the code for the paper "No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models" (ICLR 2022).

Getting Started

Pull and run the Docker image:
pytorch/pytorch:1.5.1-cuda10.1-cudnn7-devel
Install the requirements:
pip install -r requirements.txt

Data and Model

Download the data and pre-trained models:
./download.sh
Please refer to the GLUE benchmark website for details on the benchmark.
Preprocess the data:
./experiments/glue/prepro.sh
For the most up-to-date data processing details, please refer to the mt-dnn repo.

Fine-tuning Pre-trained Models using SAGE

We provide an example script for fine-tuning a pre-trained BERT-base model on MNLI using Adamax-SAGE:

./scripts/train_mnli_usadamax.sh GPUID

A few notes:

learning_rate and beta3 are two of the most important hyper-parameters. The learning_rate that works well for Adamax/AdamW-SAGE is usually 2 to 5 times larger than the one that works well for Adamax/AdamW, depending on the task. The beta3 that works well for Adamax/AdamW-SAGE is usually between 0.6 and 0.9, depending on the task.
To use AdamW-SAGE, set the argument --optim=usadamw. The current codebase only contains implementations of Adamax-SAGE and AdamW-SAGE; see module/bert_optim.py for details. Please refer to our paper for integrating SAGE into other optimizers.
To fine-tune a pre-trained RoBERTa-base model, set --init_checkpoint to the model path and --encoder_type to 2. Other supported models are listed in pretrained_models.py.
To fine-tune on other tasks, set --train_datasets and --test_datasets to the corresponding task names.
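
As a starting point for tuning, the ranges in the first note can be turned into a small search grid. The snippet below is only a sketch and is not part of this codebase: the helper name sage_search_grid, the baseline learning rate of 5e-5, and the particular grid points are assumptions chosen for illustration. Only the 2 to 5 times multiplier range and the 0.6 to 0.9 beta3 range come from the notes above.

```python
from itertools import product


def sage_search_grid(baseline_lr,
                     lr_multipliers=(2, 3, 4, 5),
                     beta3_values=(0.6, 0.7, 0.8, 0.9)):
    """Return (learning_rate, beta3) candidates for Adamax/AdamW-SAGE.

    baseline_lr is a learning rate that already works well for plain
    Adamax/AdamW on the task. The default multipliers and beta3 values
    are hypothetical grid points inside the ranges suggested above.
    """
    if baseline_lr <= 0:
        raise ValueError("baseline_lr must be positive")
    return [(baseline_lr * m, b)
            for m, b in product(lr_multipliers, beta3_values)]


# Hypothetical baseline: 5e-5 is an assumed value, not one from this repo.
grid = sage_search_grid(5e-5)
for lr, beta3 in grid:
    print(f"learning_rate={lr:.1e} beta3={beta3}")
```

Each pair can then be passed to the training script as the learning_rate and beta3 values. A coarse grid like this is usually enough to find a reasonable region before refining around the best candidate.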

Citation

@inproceedings{
liang2022no,
title={No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models},
author={Chen Liang and Haoming Jiang and Simiao Zuo and Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen and Tuo Zhao},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=cuvga_CiVND}
}

Contact Information

For help or issues related to this package, please submit a GitHub issue. For personal questions related to this paper, please contact Chen Liang ([email protected]).

