PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

Overview

MAE for Self-supervised ViT

Introduction

This is an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

This repo is mainly based on moco-v3, pytorch-image-models, and BEiT.
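
At its core, MAE masks a large random fraction of image patches (75% by default) and trains the encoder on the visible patches only, with a lightweight decoder reconstructing the missing pixels. Below is a minimal sketch of the per-sample random masking step, following the shuffle-and-slice trick from the paper; it is an illustration, not this repo's exact code.

import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """x: (batch, num_patches, dim) patch embeddings."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)        # uniform noise per token
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of tokens
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]             # indices of visible tokens
    x_masked = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)         # 1 = masked, 0 = kept
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original token order
    return x_masked, mask, ids_restore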

TODO

  • visualization of reconstructed images
  • linear probing
  • more results
  • transfer learning
  • ...

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning.

ViT-Base

pretrain epochs | with pixel-norm | linear acc | fine-tuning acc
100             | False           | --         | 75.58 [1]
100             | True            | --         | 77.19
800             | True            | --         | --

On 8 NVIDIA GeForce RTX 3090 GPUs, pre-training for 100 epochs takes about 9 hours; a batch size of 4096 needs about 24 GB of GPU memory.

[1] Fine-tuned for 50 epochs.
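
Here "pixel-norm" refers to the normalized-pixel reconstruction target from the MAE paper: each patch's raw pixels are normalized by that patch's own mean and standard deviation before the reconstruction loss is computed. A minimal sketch of this target (an illustration based on the paper, not this repo's exact code):

import torch

def normalize_target(patches: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """patches: (batch, num_patches, patch_pixels) flattened pixel patches."""
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps) ** 0.5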

ViT-Large

pretrain epochs | with pixel-norm | linear acc | fine-tuning acc
100             | False           | --         | --
100             | True            | --         | --

On 8 NVIDIA A40 GPUs, pre-training for 100 epochs takes about 34 hours; a batch size of 4096 needs about xx GB of GPU memory.

Usage: Preparation

The code has been tested with CUDA 11.4 and PyTorch 1.8.2.

Notes:

  1. The batch size specified by -b is the total batch size across all GPUs on all nodes.
  2. The learning rate specified by --lr is the base lr (corresponding to a batch size of 256) and is adjusted by the linear lr scaling rule; see the sketch after this list.
  3. Only multi-GPU DistributedDataParallel training is supported; single-GPU and DataParallel training are not. The code has been adapted to better suit the multi-node setting, and uses automatic mixed precision for pre-training by default.
  4. Only pre-training and fine-tuning have been tested.
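
The linear lr scaling rule grows the effective learning rate in proportion to the total batch size, relative to a base batch size of 256. A minimal sketch (the base lr value below is an example, not this repo's default):

def scale_lr(base_lr: float, total_batch_size: int) -> float:
    """Linear lr scaling: effective lr = base lr * total batch size / 256."""
    return base_lr * total_batch_size / 256

# e.g. a base lr of 1.5e-4 with a total batch size of 4096 gives 2.4e-3
effective_lr = scale_lr(1.5e-4, 4096)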

Usage: Self-supervised Pre-Training

Below are examples of MAE pre-training.

ViT-Base with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch 4096

python main_mae.py \
  -c cfgs/ViT-B16_ImageNet1K_pretrain.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

or

sh train_mae.sh

ViT-Large with 1-node (8-GPU, NVIDIA A40) pre-training, batch 2048

python main_mae.py \
  -c cfgs/ViT-L16_ImageNet1K_pretrain.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

Usage: End-to-End Fine-tuning ViT

Below are examples of MAE fine-tuning.

ViT-Base with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch 1024

python main_fintune.py \
  -c cfgs/ViT-B16_ImageNet1K_finetune.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Large with 2-node (16-GPU, 8 NVIDIA GeForce RTX 3090 + 8 NVIDIA A40) training, batch 512

python main_fintune.py \
  -c cfgs/ViT-L16_ImageNet1K_finetune.yaml \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On another node, run the same command with --rank 1.

Note:

  1. We use --resume rather than the --finetune option from the DeiT repo, because that option trains in eval mode. When loading the pre-trained model, pass strict=False to model_without_ddp.load_state_dict(checkpoint['model']).
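
A minimal sketch of the non-strict loading step (the model name and checkpoint filename below are placeholders, not this repo's exact code): decoder-only weights in the checkpoint are ignored, and the new classification head stays randomly initialized.

import torch
import timm  # pytorch-image-models, which this repo builds on

model = timm.create_model('vit_base_patch16_224', num_classes=1000)
checkpoint = torch.load('mae_pretrain.pth.tar', map_location='cpu')  # placeholder path
msg = model.load_state_dict(checkpoint['model'], strict=False)
print('missing keys:', msg.missing_keys)        # e.g. the new classification head
print('unexpected keys:', msg.unexpected_keys)  # e.g. decoder weights, ignored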

[TODO] Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]
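
For reference, a minimal sketch of the freezing logic behind linear classification (an illustration with a timm ViT, not this repo's exact code): everything except the linear head is frozen, and only the head's parameters are passed to the optimizer.

import torch
import timm

model = timm.create_model('vit_base_patch16_224', num_classes=1000)
for name, p in model.named_parameters():
    if not name.startswith('head'):
        p.requires_grad = False  # freeze the backbone; train only the linear head
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9, weight_decay=0.0)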

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

If you use the code of this repo, please cite the original paper and this repo:

@Article{he2021mae,
  author  = {Kaiming He* and Xinlei Chen* and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  journal = {arXiv preprint arXiv:2111.06377},
  year    = {2021},
}
@misc{yang2021maepriv,
  author       = {Lu Yang* and Pu Cao* and Yang Nie and Qing Song},
  title        = {MAE-priv},
  howpublished = {\url{https://github.com/BUPT-PRIV/MAE-priv}},
  year         = {2021},
}