PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners

Last update: Jan 04, 2023

Related tags

Overview

Masked Autoencoders: A PyTorch Implementation

This is a PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners:

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}

The original implementation was in TensorFlow+TPU. This re-implementation is in PyTorch+GPU.
This repo is a modification on the DeiT repo. Installation and preparation follow that repo.
This repo is based on timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.

Catalog

Visualization demo
Pre-trained checkpoints + fine-tuning code
Pre-training code

Visualization demo

Run our interactive visualization demo using Colab notebook (no GPU needed):

Fine-tuning with pre-trained checkpoints

The following table provides the pre-trained checkpoints used in the paper, converted from TF/TPU to PT/GPU:

	ViT-Base	ViT-Large	ViT-Huge
pre-trained checkpoint	download	download	download
md5	`8cad7c`	`b8b06e`	`9bdbb0`

The fine-tuning instruction is in FINETUNE.md.

By fine-tuning these pre-trained models, we rank #1 in these classification tasks (detailed in the paper):

	ViT-B	ViT-L	ViT-H	ViT-H₄₄₈	prev best
ImageNet-1K (no external data)	83.6	85.9	86.9	87.8	87.1
following are evaluation of the same model weights (fine-tuned in original ImageNet-1K):
ImageNet-Corruption (error rate)	51.7	41.8	33.8	36.8	42.5
ImageNet-Adversarial	35.9	57.1	68.2	76.7	35.8
ImageNet-Rendition	48.3	59.9	64.4	66.5	48.7
ImageNet-Sketch	34.5	45.3	49.6	50.9	36.0
following are transfer learning by fine-tuning the pre-trained MAE on the target dataset:
iNaturalists 2017	70.5	75.7	79.3	83.4	75.4
iNaturalists 2018	75.4	80.1	83.0	86.8	81.2
iNaturalists 2019	80.5	83.4	85.7	88.3	84.1
Places205	63.9	65.8	65.9	66.8	66.0
Places365	57.9	59.4	59.8	60.3	58.0

Pre-training

The pre-training instruction is in PRETRAIN.md.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners

Related tags

Overview

Masked Autoencoders: A PyTorch Implementation

Catalog

Visualization demo

Fine-tuning with pre-trained checkpoints

Pre-training

License

Owner

Meta Research

IhoneyBakFileScan Modify - 批量网站备份文件扫描器，增加文件规则，优化内存占用

3D-aware GANs based on NeRF (arXiv).

Implementation for "Domain-Specific Bias Filtering for Single Labeled Domain Generalization"

Official code for paper "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight"

This is a classifier which basically predicts whether there is a gun law in a state or not, depending on various things like murder rates etc.

These are the materials for the paper "Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations"

Official Pytorch implementation of 'GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network' (NeurIPS 2020)

EZ graph is an easy to use AI solution that allows you to make and train your neural networks without a single line of code.

Mix3D: Out-of-Context Data Augmentation for 3D Scenes (3DV 2021)

Code for "Causal autoregressive flows" - AISTATS, 2021

Supervised forecasting of sequential data in Python.

OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.

Repository of our paper 'Refer-it-in-RGBD' in CVPR 2021

LIMEcraft: Handcrafted superpixel selectionand inspection for Visual eXplanations

ObjectDrawer-ToolBox: a graphical image annotation tool to generate ground plane masks for a 3D object reconstruction system

Codes for NeurIPS 2021 paper "Adversarial Neuron Pruning Purifies Backdoored Deep Models"

PocketNet: Extreme Lightweight Face Recognition Network using Neural Architecture Search and Multi-Step Knowledge Distillation

FLAVR is a fast, flow-free frame interpolation method capable of single shot multi-frame prediction

Papers about explainability of GNNs

Official Repsoitory for "Activate or Not: Learning Customized Activation." [CVPR 2021]