You Only 👀 One Sequence

Overview

You Only 👀 One Sequence

  • TL;DR: We study the transferability of the vanilla ViT pre-trained on mid-sized ImageNet-1k to the more challenging COCO object detection benchmark.

  • This project is under active development.


You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

by Yuxin Fang1 *, Bencheng Liao1 *, Xinggang Wang1 ✉️ , Jiemin Fang2, 1, Jiyang Qi1, Rui Wu3, Jianwei Niu3, Wenyu Liu1.

1 School of EIC, HUST, 2 Institute of AI, HUST, 3 Horizon Robotics.

(*) equal contribution, ( ✉️ ) corresponding author.

arXiv technical report (arXiv 2106.00666)


You Only Look at One Sequence (YOLOS)

The Illustration of YOLOS

yolos

Highlights

Directly inherited from ViT (DeiT), YOLOS is not designed to be yet another high-performance object detector, but to unveil the versatility and transferability of Transformer from image recognition to object detection. Concretely, our main contributions are summarized as follows:

  • We use the mid-sized ImageNet-1k as the sole pre-training dataset, and show that a vanilla ViT (DeiT) can be successfully transferred to perform the challenging object detection task and produce competitive COCO results with the fewest possible modifications, i.e., by only looking at one sequence (YOLOS).

  • We demonstrate that 2D object detection can be accomplished in a pure sequence-to-sequence manner by taking a sequence of fixed-sized non-overlapping image patches as input. Among existing object detectors, YOLOS utilizes minimal 2D inductive biases. Moreover, it is feasible for YOLOS to perform object detection in any dimensional space unaware the exact spatial structure or geometry.

  • For ViT (DeiT), we find the object detection results are quite sensitive to the pre-train scheme and the detection performance is far from saturating. Therefore the proposed YOLOS can be used as a challenging benchmark task to evaluate different pre-training strategies for ViT (DeiT).

  • We also discuss the impacts as wel as the limitations of prevalent pre-train schemes and model scaling strategies for Transformer in vision through transferring to object detection.

Results

Model Pre-train Epochs ViT (DeiT) Weight / Log Fine-tune Epochs Eval Size YOLOS Checkpoint / Log AP @ COCO val
YOLOS-Ti 300 FB 300 512 Baidu Drive, Google Drive / Log 28.7
YOLOS-S 200 Baidu Drive, Google Drive / Log 150 800 Baidu Drive, Google Drive / Log 36.1
YOLOS-S 300 FB 150 800 Baidu Drive, Google Drive / Log 36.1
YOLOS-S (dWr) 300 Baidu Drive, Google Drive / Log 150 800 Baidu Drive, Google Drive / Log 37.6
YOLOS-B 1000 FB 150 800 Baidu Drive, Google Drive / Log 42.0

Notes:

  • The access code for Baidu Drive is yolo.
  • The FB stands for model weights provided by DeiT (paper, code). Thanks for their wonderful works.
  • We will update other models in the future, please stay tuned :)

Requirement

This codebase has been developed with python version 3.6, PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images

Training

Before finetuning on COCO, you need download the ImageNet pretrained model to the /path/to/YOLOS/ directory

To train the YOLOS-Ti model in the paper, run this command:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco
    --batch_size 2 \
    --lr 5e-5 \
    --epochs 300 \
    --backbone_name tiny \
    --pre_trained /path/to/deit-tiny.pth\
    --eval_size 512 \
    --init_pe_size 800 1333 \
    --output_dir /output/path/box_model
To train the YOLOS-S model with 200 epoch pretrained Deit-S in the paper, run this command:

python -m torch.distributed.launch
--nproc_per_node=8
--use_env main.py
--coco_path /path/to/coco --batch_size 1
--lr 2.5e-5
--epochs 150
--backbone_name small
--pre_trained /path/to/deit-small-200epoch.pth
--eval_size 800
--init_pe_size 512 864
--mid_pe_size 512 864
--output_dir /output/path/box_model

To train the YOLOS-S model with 300 epoch pretrained Deit-S in the paper, run this command:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco
    --batch_size 1 \
    --lr 2.5e-5 \
    --epochs 150 \
    --backbone_name small \
    --pre_trained /path/to/deit-small-300epoch.pth\
    --eval_size 800 \
    --init_pe_size 512 864 \
    --mid_pe_size 512 864 \
    --output_dir /output/path/box_model

To train the YOLOS-S (dWr) model in the paper, run this command:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco
    --batch_size 1 \
    --lr 2.5e-5 \
    --epochs 150 \
    --backbone_name small_dWr \
    --pre_trained /path/to/deit-small-dWr-scale.pth\
    --eval_size 800 \
    --init_pe_size 512 864 \
    --mid_pe_size 512 864 \
    --output_dir /output/path/box_model
To train the YOLOS-B model in the paper, run this command:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco
    --batch_size 1 \
    --lr 2.5e-5 \
    --epochs 150 \
    --backbone_name base \
    --pre_trained /path/to/deit-base.pth\
    --eval_size 800 \
    --init_pe_size 800 1344 \
    --mid_pe_size 800 1344 \
    --output_dir /output/path/box_model

Evaluation

To evaluate YOLOS-Ti model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 2 --backbone_name tiny --eval --eval_size 512 --init_pe_size 800 1333 --resume /path/to/YOLOS-Ti

To evaluate YOLOS-S model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume /path/to/YOLOS-S

To evaluate YOLOS-S (dWr) model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 1 --backbone_name small_dWr --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume /path/to/YOLOS-S(dWr)

To evaluate YOLOS-B model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 800 1344 --mid_pe_size 800 1344 --resume /path/to/YOLOS-B

Visualization

We have observed some intriguing properties of YOLOS, and we are working on a notebook to better demonstrate them, please stay tuned :)

Visualize box prediction and object categories distribution

  1. To Get visualization in the paper, you need the finetuned YOLOS models on COCO, run following command to get 100 Det-Toks prediction on COCO val split, then it will generate /path/to/YOLOS/visualization/modelname-eval-800-eval-pred.json
python cocoval_predjson_generation.py --coco_path /path/to/coco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume /path/to/yolos-s-model.pth --output_dir ./visualization
  1. To get all ground truth object categories on all images from COCO val split, run following command to generate /path/to/YOLOS/visualization/coco-valsplit-cls-dist.json
python cocoval_gtclsjson_generation.py --coco_path /path/to/coco --batch_size 1 --output_dir ./visualization
  1. To visualize the distribution of Det-Toks' bboxs and categories, run following command to generate .png files in /path/to/YOLOS/visualization/
 python visualize_dettoken_dist.py --visjson /path/to/YOLOS/visualization/modelname-eval-800-eval-pred.json --cococlsjson /path/to/YOLOS/visualization/coco-valsplit-cls-dist.json

cls cls

Visualize self-attention of the [DetTok] token on the different heads of the last layer:

we are working on a notebook to better demonstrate them, please stay tuned :)

Acknowledgement ❤️

This project is based on DETR (paper, code), DeiT (paper, code) and timm. Thanks for their wonderful works.

Citation

If you find our paper and code useful in your research, please consider giving a star and citation 📝 :

@article{YOLOS,
  title={You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection},
  author={Fang, Yuxin and Liao, Bencheng and Wang, Xinggang and Fang, Jiemin and Qi, Jiyang and Wu, Rui and Niu, Jianwei and Liu, Wenyu},
  journal={arXiv preprint arXiv:2106.00666},
  year={2021}
}
Owner
Hust Visual Learning Team
Hust Visual Learning Team belongs to the Artificial Intelligence Research Institute in the School of EIC in HUST
Hust Visual Learning Team
Scalable Optical Flow-based Image Montaging and Alignment

SOFIMA SOFIMA (Scalable Optical Flow-based Image Montaging and Alignment) is a tool for stitching, aligning and warping large 2d, 3d and 4d microscopy

Google Research 16 Dec 21, 2022
Unofficial Implementation of MLP-Mixer, Image Classification Model

MLP-Mixer Unoffical Implementation of MLP-Mixer, easy to use with terminal. Train and test easly. https://arxiv.org/abs/2105.01601 MLP-Mixer is an arc

Oğuzhan Ercan 6 Dec 05, 2022
PyTorch implementation for Convolutional Networks with Adaptive Inference Graphs

Convolutional Networks with Adaptive Inference Graphs (ConvNet-AIG) This repository contains a PyTorch implementation of the paper Convolutional Netwo

Andreas Veit 176 Dec 07, 2022
Advances in Neural Information Processing Systems (NeurIPS), 2020.

What is being transferred in transfer learning? This repo contains the code for the following paper: Behnam Neyshabur*, Hanie Sedghi*, Chiyuan Zhang*.

Google Research 36 Aug 26, 2022
Performant, differentiable reinforcement learning

deluca Performant, differentiable reinforcement learning Notes This is pre-alpha software and is undergoing a number of core changes. Updates to follo

Google 114 Dec 27, 2022
sense-py-AnishaBaishya created by GitHub Classroom

Compute Statistics Here we compute statistics for a bunch of numbers. This project uses the unittest framework to test functionality. Pass the tests T

1 Oct 21, 2021
Automatic number plate recognition using tech: Yolo, OCR, Scene text detection, scene text recognation, flask, torch

Automatic Number Plate Recognition Automatic Number Plate Recognition (ANPR) is the process of reading the characters on the plate with various optica

Meftun AKARSU 52 Dec 22, 2022
AFLFast (extends AFL with Power Schedules)

AFLFast Power schedules implemented by Marcel Böhme [email protected]

Marcel Böhme 380 Jan 03, 2023
An open source machine learning library for performing regression tasks using RVM technique.

Introduction neonrvm is an open source machine learning library for performing regression tasks using RVM technique. It is written in C programming la

Siavash Eliasi 33 May 31, 2022
A framework for annotating 3D meshes using the predictions of a 2D semantic segmentation model.

Semantic Meshes A framework for annotating 3D meshes using the predictions of a 2D semantic segmentation model. Paper If you find this framework usefu

Florian 40 Dec 09, 2022
Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition The official code of ABINet (CVPR 2021, Oral).

334 Dec 31, 2022
A tiny, friendly, strong baseline code for Person-reID (based on pytorch).

Pytorch ReID Strong, Small, Friendly A tiny, friendly, strong baseline code for Person-reID (based on pytorch). Strong. It is consistent with the new

Zhedong Zheng 3.5k Jan 08, 2023
The source code and data of the paper "Instance-wise Graph-based Framework for Multivariate Time Series Forecasting".

IGMTF The source code and data of the paper "Instance-wise Graph-based Framework for Multivariate Time Series Forecasting". Requirements The framework

Wentao Xu 24 Dec 05, 2022
Pytorch and Torch testing code of CartoonGAN

CartoonGAN-Test-Pytorch-Torch Pytorch and Torch testing code of CartoonGAN [Chen et al., CVPR18]. With the released pretrained models by the authors,

Yijun Li 642 Dec 27, 2022
Voxel-based Network for Shape Completion by Leveraging Edge Generation (ICCV 2021, oral)

Voxel-based Network for Shape Completion by Leveraging Edge Generation This is the PyTorch implementation for the paper "Voxel-based Network for Shape

10 Dec 04, 2022
Code for the upcoming CVPR 2021 paper

The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel J. Brostow and Michael

Niantic Labs 496 Dec 30, 2022
Normalizing Flows with a resampled base distribution

Resampling Base Distributions of Normalizing Flows Normalizing flows are a popular class of models for approximating probability distributions. Howeve

Vincent Stimper 24 Nov 03, 2022
Aydin is a user-friendly, feature-rich, and fast image denoising tool

Aydin is a user-friendly, feature-rich, and fast image denoising tool that provides a number of self-supervised, auto-tuned, and unsupervised image denoising algorithms.

Royer Lab 99 Dec 14, 2022
unet-family: Ultimate version

unet-family: Ultimate version 基于之前my-unet代码,我整理出来了这一份终极版本unet-family,方便其他人阅读。 相比于之前的my-unet代码,代码分类更加规范,有条理 对于clone下来的代码不需要修改各种复杂繁琐的路径问题,直接就可以运行。 并且代码有

2 Sep 19, 2022
Pytorch domain adaptation package

DomainAdaptation This package is created to tackle the problem of domain shifts when dealing with two domains of different feature distributions. In d

Institute of Computational Perception 7 Oct 22, 2022