ConvMAE: Masked Convolution Meets Masked Autoencoders

Overview

ConvMAE

ConvMAE: Masked Convolution Meets Masked Autoencoders

Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1,

1 Shanghai AI Laboratory, 2 MMLab, CUHK, 3 Sensetime Research.

This repo is the official implementation of ConvMAE: Masked Convolution Meets Masked Autoencoders. It currently concludes codes and models for the following tasks:

ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.

Updates

16/May/2022

The supported codes and models for COCO object detection and instance segmentation are available.

11/May/2022

  1. Pretrained models on ImageNet-1K for ConvMAE.
  2. The supported codes and models for ImageNet-1K finetuning and linear probing are provided.

08/May/2022

The preprint version is public at arxiv.

Introduction

ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme.

  • We present the strong and efficient self-supervised framework ConvMAE, which is easy to implement but show outstanding performances on downstream tasks.
  • ConvMAE naturally generates hierarchical representations and exhibit promising performances on object detection and segmentation.
  • ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask-RCNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (48.1 vs. 51.7).

tenser

Pretrain on ImageNet-1K

The following table provides pretrained checkpoints and logs used in the paper.

ConvMAE-Base
pretrained checkpoints download
logs download

Main Results on ImageNet-1K

Models #Params(M) Supervision Encoder Ratio Pretrain Epochs FT [email protected](%) LIN [email protected](%) FT logs/weights LIN logs/weights
BEiT 88 DALLE 100% 300 83.0 37.6 - -
MAE 88 RGB 25% 1600 83.6 67.8 - -
SimMIM 88 RGB 100% 800 84.0 56.7 - -
MaskFeat 88 HOG 100% 300 83.6 N/A - -
data2vec 88 RGB 100% 800 84.2 N/A - -
ConvMAE-B 88 RGB 25% 1600 85.0 70.9 log/weight

Main Results on COCO

Mask R-CNN

Models Pretrain Pretrain Epochs Finetune Epochs #Params(M) FLOPs(T) box AP mask AP logs/weights
Swin-B IN21K w/ labels 300 36 109 0.7 51.4 45.4 -
Swin-L IN21K w/ labels 300 36 218 1.1 52.4 46.2 -
MViTv2-B IN21K w/ labels 300 36 73 0.6 53.1 47.4 -
MViTv2-L IN21K w/ labels 300 36 239 1.3 53.6 47.5 -
Benchmarking-ViT-B IN1K w/o labels 1600 100 118 0.9 50.4 44.9 -
Benchmarking-ViT-L IN1K w/o labels 1600 100 340 1.9 53.3 47.2 -
ViTDet IN1K w/o labels 1600 100 111 0.8 51.2 45.5 -
MIMDet-ViT-B IN1K w/o labels 1600 36 127 1.1 51.5 46.0 -
MIMDet-ViT-L IN1K w/o labels 1600 36 345 2.6 53.3 47.5 -
ConvMAE-B IN1K w/o lables 1600 25 104 0.9 53.2 47.1 log/weight

Main Results on ADE20K

UperNet

Models Pretrain Pretrain Epochs Finetune Iters #Params(M) FLOPs(T) mIoU logs/weights
DeiT-B IN1K w/ labels 300 16K 163 0.6 45.6 -
Swin-B IN1K w/ labels 300 16K 121 0.3 48.1 -
MoCo V3 IN1K 300 16K 163 0.6 47.3 -
DINO IN1K 400 16K 163 0.6 47.2 -
BEiT IN1K+DALLE 1600 16K 163 0.6 47.1 -
PeCo IN1K 300 16K 163 0.6 46.7 -
CAE IN1K+DALLE 800 16K 163 0.6 48.8 -
MAE IN1K 1600 16K 163 0.6 48.1 -
ConvMAE-B IN1K 1600 16K 153 0.6 51.7 soon

Main Results on Kinetics-400

Models Pretrain Epochs Finetune Epochs #Params(M) Top1 Top5 logs/weights
VideoMAE-B 200 100 87 77.8
VideoMAE-B 800 100 87 79.4
VideoMAE-B 1600 100 87 79.8
VideoMAE-B 1600 100 (w/ Repeated Aug) 87 80.7 94.7
SpatioTemporalLearner-B 800 150 (w/ Repeated Aug) 87 81.3 94.9
VideoConvMAE-B 200 100 86 80.1 94.3 Soon
VideoConvMAE-B 800 100 86 81.7 95.1 Soon
VideoConvMAE-B-MSD 800 100 86 82.7 95.5 Soon

Main Results on Something-Something V2

Models Pretrain Epochs Finetune Epochs #Params(M) Top1 Top5 logs/weights
VideoMAE-B 200 40 87 66.1
VideoMAE-B 800 40 87 69.3
VideoMAE-B 2400 40 87 70.3
VideoConvMAE-B 200 40 86 67.7 91.2 Soon
VideoConvMAE-B 800 40 86 69.9 92.4 Soon
VideoConvMAE-B-MSD 800 40 86 70.7 93.0 Soon

Getting Started

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Training and evaluation

Acknowledgement

The pretraining and finetuning of our project are based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation respectively. Thanks for their wonderful work.

License

ConvMAE is released under the MIT License.

Citation

@article{gao2022convmae,
  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  year={2022}
}
Owner
Alpha VL Team of Shanghai AI Lab
Alpha VL Team of Shanghai AI Lab
YOLOv5 in PyTorch > ONNX > CoreML > TFLite

This repository represents Ultralytics open-source research into future object detection methods, and incorporates lessons learned and best practices evolved over thousands of hours of training and e

Ultralytics 34.1k Dec 31, 2022
Video lie detector using xgboost - A video lie detector using OpenFace and xgboost

video_lie_detector_using_xgboost a video lie detector using OpenFace and xgboost

2 Jan 11, 2022
Joint Detection and Identification Feature Learning for Person Search

Person Search Project This repository hosts the code for our paper Joint Detection and Identification Feature Learning for Person Search. The code is

712 Dec 17, 2022
Neural Dynamic Policies for End-to-End Sensorimotor Learning

This is a PyTorch based implementation for our NeurIPS 2020 paper on Neural Dynamic Policies for end-to-end sensorimotor learning.

Shikhar Bahl 47 Dec 11, 2022
PyTorch implementation of DCT fast weight RNNs

DCT based fast weights This repository contains the official code for the paper: Training and Generating Neural Networks in Compressed Weight Space. T

Kazuki Irie 4 Dec 24, 2022
SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

SmallInitEmb LayerNorm(SmallInit(Embedding)) in a Transformer I find that when t

PENG Bo 11 Dec 25, 2022
Set of models for classifcation of 3D volumes

Classification models 3D Zoo - Keras and TF.Keras This repository contains 3D variants of popular CNN models for classification like ResNets, DenseNet

69 Dec 28, 2022
This repository contains the code for the ICCV 2019 paper "Occupancy Flow - 4D Reconstruction by Learning Particle Dynamics"

Occupancy Flow This repository contains the code for the project Occupancy Flow - 4D Reconstruction by Learning Particle Dynamics. You can find detail

189 Dec 29, 2022
PyTorch implementation of PSPNet

PSPNet with PyTorch Unofficial implementation of "Pyramid Scene Parsing Network" (https://arxiv.org/abs/1612.01105). This repository is just for caffe

Kazuto Nakashima 52 Nov 16, 2022
Code for "Single-view robot pose and joint angle estimation via render & compare", CVPR 2021 (Oral).

Single-view robot pose and joint angle estimation via render & compare Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic CVPR: Conference on C

Yann Labbé 51 Oct 14, 2022
Line-level Handwritten Text Recognition (HTR) system implemented with TensorFlow.

Line-level Handwritten Text Recognition with TensorFlow This model is an extended version of the Simple HTR system implemented by @Harald Scheidl and

Hoàng Tùng Lâm (Linus) 72 May 07, 2022
Joint learning of images and text via maximization of mutual information

mutual_info_img_txt Joint learning of images and text via maximization of mutual information. This repository incorporates the algorithms presented in

Ruizhi Liao 10 Dec 22, 2022
Learning a mapping from images to psychological similarity spaces with neural networks.

LearningPsychologicalSpaces v0.1: v1.1: v1.2: v1.3: v1.4: v1.5: The code in this repository explores learning a mapping from images to psychological s

Lucas Bechberger 8 Dec 12, 2022
PyTorch code for our paper "Gated Multiple Feedback Network for Image Super-Resolution" (BMVC2019)

Gated Multiple Feedback Network for Image Super-Resolution This repository contains the PyTorch implementation for the proposed GMFN [arXiv]. The fram

Qilei Li 66 Nov 03, 2022
AI that generate music

PianoGPT ai that generate music try it here https://share.streamlit.io/annasajkh/pianogpt/main/main.py or here https://huggingface.co/spaces/Annas/Pia

Annas 28 Nov 27, 2022
This repository is for our EMNLP 2021 paper "Automated Generation of Accurate & Fluent Medical X-ray Reports"

Introduction: X-Ray Report Generation This repository is for our EMNLP 2021 paper "Automated Generation of Accurate & Fluent Medical X-ray Reports". O

no name 36 Dec 16, 2022
Crowd-Kit is a powerful Python library that implements commonly-used aggregation methods for crowdsourced annotation and offers the relevant metrics and datasets

Crowd-Kit: Computational Quality Control for Crowdsourcing Documentation Crowd-Kit is a powerful Python library that implements commonly-used aggregat

Toloka 125 Dec 30, 2022
(SIGIR2020) “Asymmetric Tri-training for Debiasing Missing-Not-At-Random Explicit Feedback’’

Asymmetric Tri-training for Debiasing Missing-Not-At-Random Explicit Feedback About This repository accompanies the real-world experiments conducted i

yuta-saito 19 Dec 01, 2022
Clockwork Variational Autoencoder

Clockwork Variational Autoencoders (CW-VAE) Vaibhav Saxena, Jimmy Ba, Danijar Hafner If you find this code useful, please reference in your paper: @ar

Vaibhav Saxena 35 Nov 06, 2022
Python program that works as a contact list

Lista de Contatos Programa em Python que funciona como uma lista de contatos. Features Adicionar novo contato Remover contato Atualizar contato Pesqui

Victor B. Lino 3 Dec 16, 2021