This is an official implementation for "Self-Supervised Learning with Swin Transformers".

Overview

Self-Supervised Learning with Vision Transformers

By Zhenda Xie*, Yutong Lin*, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao and Han Hu

An important feature of this codebase is that it includes Swin Transformer as one of the backbones, so that we can evaluate the transfer performance of the learnt representations on the down-stream tasks of object detection and semantic segmentation. Previous works usually omit this evaluation because they use ViT/DeiT, which has not been well tamed for down-stream tasks.

It currently includes code and models for the following tasks:

Self-Supervised Learning and Linear Evaluation: Included in this repo. See get_started.md for a quick start.

Transferring Performance on Object Detection/Instance Segmentation: See Swin Transformer for Object Detection.

Transferring Performance on Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Highlights

  • Includes down-stream evaluation: the first work to evaluate the transfer performance of SSL with Transformers on down-stream tasks
  • Fewer tricks: significantly fewer tricks than previous works such as MoCo v3 and DINO
  • High accuracy on ImageNet-1K linear evaluation: 72.8 vs 72.5 (MoCo v3) vs 72.5 (DINO) using DeiT-S/16 and 300-epoch pre-training

Updates

05/13/2021

  1. Self-Supervised models with DeiT-Small on ImageNet-1K (MoBY-DeiT-Small-300Ep-Pretrained, MoBY-DeiT-Small-300Ep-Linear) are provided.
  2. The supporting code and config for self-supervised learning with DeiT-Small are provided.

05/11/2021

Initial Commits:

  1. Self-Supervised Pre-training models on ImageNet-1K (MoBY-Swin-T-300Ep-Pretrained, MoBY-Swin-T-300Ep-Linear) are provided.
  2. The supporting code and models for self-supervised pre-training and ImageNet-1K linear evaluation, COCO object detection and ADE20K semantic segmentation are provided.

Introduction

MoBY: a self-supervised learning approach by combining MoCo v2 and BYOL

MoBY (the name stands for MoCo v2 with BYOL), initially described on arXiv, combines two popular self-supervised learning approaches: MoCo v2 and BYOL. It inherits the momentum design, the key queue, and the contrastive loss from MoCo v2, and the asymmetric encoders, asymmetric data augmentations, and momentum scheduler from BYOL.
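The pieces named above can be illustrated with a minimal, dependency-free sketch. It is not the code from this repo: the function names, the toy 2-D features, the base momentum of 0.99, the temperature of 0.2, and the queue length are all assumptions chosen for illustration. The sketch shows a BYOL-style cosine momentum schedule, the momentum (EMA) update of a key encoder, and a MoCo-style contrastive loss in which the positive key competes against keys stored in a queue.

```python
import math
from collections import deque


def momentum_at(step, total_steps, base=0.99):
    """BYOL-style cosine schedule: `base` at step 0, rising to 1.0 at the end."""
    return 1.0 - (1.0 - base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0


def ema_update(target, online, m):
    """Momentum update: move each target (key-encoder) weight toward the online weight."""
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_loss(query, positive_key, queue, temperature=0.2):
    """InfoNCE loss: the positive key competes against every key in the queue."""
    q = l2_normalize(query)
    keys = [l2_normalize(positive_key)] + [l2_normalize(k) for k in queue]
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]  # equals -log softmax(logits)[0]


# Toy usage: a small fixed-length queue of past keys (hypothetical values).
queue = deque(maxlen=4)
for past_key in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    queue.append(past_key)

loss_close = contrastive_loss([0.9, 0.1], [1.0, 0.1], queue)   # positive agrees with query
loss_far = contrastive_loss([0.9, 0.1], [-1.0, 0.2], queue)    # positive disagrees with query
```

In this toy setup the loss is lower when the positive key points in the same direction as the query than when it points away from it, which is the behaviour the contrastive objective rewards. A real training loop would compute queries and keys with two encoders over two augmented views and push new keys into the queue after every step.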

MoBY achieves reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.3% top-1 accuracy using DeiT-S and Swin-T, respectively, with 300-epoch training. The performance is on par with the recent MoCo v3 and DINO, which adopt DeiT as the backbone, but with much lighter tricks.

[Figure: teaser_moby]

Swin Transformer as a backbone

Swin Transformer (the name Swin stands for Shifted window), initially described on arXiv, capably serves as a general-purpose backbone for computer vision. It achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.

We include Swin Transformer as one of the backbones to evaluate the transfer performance on down-stream tasks such as object detection. This differentiates this codebase from other approaches that study SSL on Transformer architectures.

ImageNet-1K linear evaluation

| Method | Architecture | Epochs | Params | FLOPs | img/s | Top-1 Accuracy (%) | Pre-trained Checkpoint | Linear Checkpoint |
|---|---|---|---|---|---|---|---|---|
| Supervised | Swin-T | 300 | 28M | 4.5G | 755.2 | 81.2 | Here | |
| MoBY | Swin-T | 100 | 28M | 4.5G | 755.2 | 70.9 | TBA | |
| MoBY¹ | Swin-T | 100 | 28M | 4.5G | 755.2 | 72.0 | TBA | |
| MoBY | DeiT-S | 300 | 22M | 4.6G | 940.4 | 72.8 | GoogleDrive/GitHub/Baidu | GoogleDrive/GitHub/Baidu |
| MoBY | Swin-T | 300 | 28M | 4.5G | 755.2 | 75.3 | GoogleDrive/GitHub/Baidu | GoogleDrive/GitHub/Baidu |

  • ¹ denotes the result of MoBY with a trick adopted from MoCo v3 that replaces the LayerNorm layers before the MLP blocks with BatchNorm.

  • The access code for Baidu is moby.

Transferring to Downstream Tasks

COCO Object Detection (2017 val)

| Backbone | Method | Model | Schd. | box mAP | mask mAP | Params | FLOPs |
|---|---|---|---|---|---|---|---|
| Swin-T | Mask R-CNN | Sup. | 1x | 43.7 | 39.8 | 48M | 267G |
| Swin-T | Mask R-CNN | MoBY | 1x | 43.6 | 39.6 | 48M | 267G |
| Swin-T | Mask R-CNN | Sup. | 3x | 46.0 | 41.6 | 48M | 267G |
| Swin-T | Mask R-CNN | MoBY | 3x | 46.0 | 41.7 | 48M | 267G |
| Swin-T | Cascade Mask R-CNN | Sup. | 1x | 48.1 | 41.7 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | MoBY | 1x | 48.1 | 41.5 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | Sup. | 3x | 50.4 | 43.7 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | MoBY | 3x | 50.2 | 43.5 | 86M | 745G |

ADE20K Semantic Segmentation (val)

| Backbone | Method | Model | Crop Size | Schd. | mIoU | mIoU (ms+flip) | Params | FLOPs |
|---|---|---|---|---|---|---|---|---|
| Swin-T | UPerNet | Sup. | 512x512 | 160K | 44.51 | 45.81 | 60M | 945G |
| Swin-T | UPerNet | MoBY | 512x512 | 160K | 44.06 | 45.58 | 60M | 945G |

Citing MoBY and Swin

MoBY

@article{xie2021moby,
  title={Self-Supervised Learning with Swin Transformers}, 
  author={Zhenda Xie and Yutong Lin and Zhuliang Yao and Zheng Zhang and Qi Dai and Yue Cao and Han Hu},
  journal={arXiv preprint arXiv:2105.04553},
  year={2021}
}

Swin Transformer

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Getting Started
