SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

Last update: Dec 25, 2022

Related tags

Overview

SmallInitEmb

LayerNorm(SmallInit(Embedding)) in a Transformer

I find that when training a transformer, the embedding matrix moves slowly, hence it's difficult for the model to jump out of the initial noisy embedding.

(initial embedding)
[[-0.0073  0.0062 -0.0261 ...  0.0086  0.0107 -0.008 ] ... ]
 (after 1 step, the directions of the embedding vectors are not moved much because the numbers change by ~LR = ~4e-4)
[[-0.0069  0.0066 -0.0265 ...  0.009   0.0111 -0.0084] ... ]

So I propose initializing the embedding matrix to tiny values, and put another LayerNorm after it (before all the SA & FFN layers):

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
if self.config.USE_SMALL_EMB and self.layer_id == 0:
    x = self.lnPre(x) # LN(SmallInit(Emb))
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))

And then you get improved convergence (especially for BPE models) because the model can quickly jump out of the tiny initial embedding (small changes after 1 step -> significant changes of directions -> significant changes after LayerNorm).

Loss curve comparison: https://wandb.ai/blinkdl/SmallEmbTest

(the gap between LayerNorm(SmallEmb)) and baseline persists after more training)

Moreover, you can directly train PostLN models without warmup with SmallInit(Emb)

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
x = self.ln1(x) # this plays the same role as the lnPre in the above PreLN code
x = x + self.att(x)
x = self.ln2(x)
x = x + self.ffn(x)
(note you shall have another LN after the final ffn)

SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

Related tags

Overview

SmallInitEmb

Moreover, you can directly train PostLN models without warmup with SmallInit(Emb)

Owner

PENG Bo

Repository relating to the CVPR21 paper TimeLens: Event-based Video Frame Interpolation

The goal of the exercises below is to evaluate the candidate knowledge and problem solving expertise regarding the main development focuses for the iFood ML Platform team: MLOps and Feature Store development.

Unofficial implementation of Proxy Anchor Loss for Deep Metric Learning

PyTorch implementation of Decoupling Value and Policy for Generalization in Reinforcement Learning

EMNLP 2021: Single-dataset Experts for Multi-dataset Question-Answering

Code repository for the paper: Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild (ICCV 2021)

This repository attempts to replicate the SqueezeNet architecture and implement the same on an image classification task.

A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning

Real-Time High-Resolution Background Matting

HODEmu, is both an executable and a python library that is based on Ragagnin 2021 in prep.

BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting

Real-time analysis of intracranial neurophysiology recordings.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

chainladder - Property and Casualty Loss Reserving in Python

PyTorch code for training MM-DistillNet for multimodal knowledge distillation

Pytorch implementation of our method for regularizing nerual radiance fields for few-shot neural volume rendering.

The code for our CVPR paper PISE: Person Image Synthesis and Editing with Decoupled GAN, Project Page, supp.

WSDM2022 Challenge - Large scale temporal graph link prediction

Official PyTorch implementation of "BlendGAN: Implicitly GAN Blending for Arbitrary Stylized Face Generation" (NeurIPS 2021)

LAnguage Model Analysis