Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

Overview

Focal Transformer

PWC PWC PWC PWC PWC PWC

This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transformers", by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.

Introduction

focal-transformer-teaser

Our Focal Transfomer introduced a new self-attention mechanism called focal self-attention for vision transformers. In this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively.

With our Focal Transformers, we achieved superior performance over the state-of-the-art vision Transformers on a range of public benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.6 and 84.0 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art methods for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation.

Benchmarking

Image Classification on ImageNet-1K

Model Pretrain Use Conv Resolution [email protected] [email protected] #params FLOPs Checkpoint Config
Focal-T IN-1K No 224 82.2 95.9 28.9M 4.9G download yaml
Focal-T IN-1K Yes 224 82.7 96.1 30.8M 4.9G download yaml
Focal-S IN-1K No 224 83.6 96.2 51.1M 9.4G download yaml
Focal-B IN-1K No 224 84.0 96.5 89.8M 16.4G download yaml

Object Detection and Instance Segmentation on COCO

Mask R-CNN

Backbone Pretrain Lr Schd #params FLOPs box mAP mask mAP
Focal-T ImageNet-1K 1x 49M 291G 44.8 41.0
Focal-T ImageNet-1K 3x 49M 291G 47.2 42.7
Focal-S ImageNet-1K 1x 71M 401G 47.4 42.8
Focal-S ImageNet-1K 3x 71M 401G 48.8 43.8
Focal-B ImageNet-1K 1x 110M 533G 47.8 43.2
Focal-B ImageNet-1K 3x 110M 533G 49.0 43.7

RetinaNet

Backbone Pretrain Lr Schd #params FLOPs box mAP
Focal-T ImageNet-1K 1x 39M 265G 43.7
Focal-T ImageNet-1K 3x 39M 265G 45.5
Focal-S ImageNet-1K 1x 62M 367G 45.6
Focal-S ImageNet-1K 3x 62M 367G 47.3
Focal-B ImageNet-1K 1x 101M 514G 46.3
Focal-B ImageNet-1K 3x 101M 514G 46.9

Other detection methods

Backbone Pretrain Method Lr Schd #params FLOPs box mAP
Focal-T ImageNet-1K Cascade Mask R-CNN 3x 87M 770G 51.5
Focal-T ImageNet-1K ATSS 3x 37M 239G 49.5
Focal-T ImageNet-1K RepPointsV2 3x 45M 491G 51.2
Focal-T ImageNet-1K Sparse R-CNN 3x 111M 196G 49.0

Semantic Segmentation on ADE20K

Backbone Pretrain Method Resolution Iters #params FLOPs mIoU mIoU (MS)
Focal-T ImageNet-1K UPerNet 512x512 160k 62M 998G 45.8 47.0
Focal-S ImageNet-1K UPerNet 512x512 160k 85M 1130G 48.0 50.0
Focal-B ImageNet-1K UPerNet 512x512 160k 126M 1354G 49.0 50.5
Focal-L ImageNet-22K UPerNet 640x640 160k 240M 3376G 54.0 55.4

Getting Started

Citation

If you find this repo useful to your project, please consider to cite it with following bib:

@misc{yang2021focal,
    title={Focal Self-attention for Local-Global Interactions in Vision Transformers}, 
    author={Jianwei Yang and Chunyuan Li and Pengchuan Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao},
    year={2021},
    eprint={2107.00641},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

Our codebase is built based on Swin-Transformer. We thank the authors for the nicely organized code!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
A library for efficient similarity search and clustering of dense vectors.

Faiss Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any

Meta Research 18.8k Jan 08, 2023
A Transformer-Based Siamese Network for Change Detection

ChangeFormer: A Transformer-Based Siamese Network for Change Detection (Under review at IGARSS-2022) Wele Gedara Chaminda Bandara, Vishal M. Patel Her

Wele Gedara Chaminda Bandara 214 Dec 29, 2022
Official PyTorch Implementation of Hypercorrelation Squeeze for Few-Shot Segmentation, arXiv 2021

Hypercorrelation Squeeze for Few-Shot Segmentation This is the implementation of the paper "Hypercorrelation Squeeze for Few-Shot Segmentation" by Juh

Juhong Min 165 Dec 28, 2022
Code for the Active Speakers in Context Paper (CVPR2020)

Active Speakers in Context This repo contains the official code and models for the "Active Speakers in Context" CVPR 2020 paper. Before Training The c

43 Oct 14, 2022
WSDM2022 Challenge - Large scale temporal graph link prediction

WSDM 2022 Large-scale Temporal Graph Link Prediction - Baseline and Initial Test Set WSDM Cup Website link Link to this challenge This branch offers A

Deep Graph Library 34 Dec 29, 2022
Implementation of average- and worst-case robust flatness measures for adversarial training.

Relating Adversarially Robust Generalization to Flat Minima This repository contains code corresponding to the MLSys'21 paper: D. Stutz, M. Hein, B. S

David Stutz 13 Nov 27, 2022
Fully Convolutional DenseNet (A.K.A 100 layer tiramisu) for semantic segmentation of images implemented in TensorFlow.

FC-DenseNet-Tensorflow This is a re-implementation of the 100 layer tiramisu, technically a fully convolutional DenseNet, in TensorFlow (Tiramisu). Th

Hasnain Raza 121 Oct 12, 2022
A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering.

DeepFilterNet A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering. libDF contains Rust code used for dat

Hendrik Schröter 292 Dec 25, 2022
Simple Baselines for Human Pose Estimation and Tracking

Simple Baselines for Human Pose Estimation and Tracking News Our new work High-Resolution Representations for Labeling Pixels and Regions is available

Microsoft 2.7k Jan 05, 2023
Unofficial implementation of Pix2SEQ

Unofficial-Pix2seq: A Language Modeling Framework for Object Detection Unofficial implementation of Pix2SEQ. Please use this code with causion. Many i

159 Dec 12, 2022
From the basics to slightly more interesting applications of Tensorflow

TensorFlow Tutorials You can find python source code under the python directory, and associated notebooks under notebooks. Source code Description 1 b

Parag K Mital 5.6k Jan 09, 2023
Real-time Joint Semantic Reasoning for Autonomous Driving

MultiNet MultiNet is able to jointly perform road segmentation, car detection and street classification. The model achieves real-time speed and state-

Marvin Teichmann 518 Dec 12, 2022
PyTorch for Semantic Segmentation

PyTorch for Semantic Segmentation This repository contains some models for semantic segmentation and the pipeline of training and testing models, impl

Zijun Deng 1.7k Jan 06, 2023
AISTATS 2019: Confidence-based Graph Convolutional Networks for Semi-Supervised Learning

Confidence-based Graph Convolutional Networks for Semi-Supervised Learning Source code for AISTATS 2019 paper: Confidence-based Graph Convolutional Ne

MALL Lab (IISc) 56 Dec 03, 2022
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
This repository contains all source code, pre-trained models related to the paper "An Empirical Study on GANs with Margin Cosine Loss and Relativistic Discriminator"

An Empirical Study on GANs with Margin Cosine Loss and Relativistic Discriminator This is a Pytorch implementation for the paper "An Empirical Study o

Cuong Nguyen 3 Nov 15, 2021
A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion

A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion This repo intends to release code for our work: Zhaoyang Lyu*, Zhifeng

Zhaoyang Lyu 68 Jan 03, 2023
Official PyTorch implementation of DD3D: Is Pseudo-Lidar needed for Monocular 3D Object detection? (ICCV 2021), Dennis Park*, Rares Ambrus*, Vitor Guizilini, Jie Li, and Adrien Gaidon.

DD3D: "Is Pseudo-Lidar needed for Monocular 3D Object detection?" Install // Datasets // Experiments // Models // License // Reference Full video Offi

Toyota Research Institute - Machine Learning 364 Dec 27, 2022
ScriptProfilerPy - Module to visualize where your python script is slow

ScriptProfiler helps you track where your code is slow It provides: Code lines t

Lucas BLP 3 Jun 02, 2022
The first dataset of composite images with rationality score indicating whether the object placement in a composite image is reasonable.

Object-Placement-Assessment-Dataset-OPA Object-Placement-Assessment (OPA) is to verify whether a composite image is plausible in terms of the object p

BCMI 53 Nov 15, 2022