Multi-modal Vision Transformers Excel at Class-agnostic Object Detection

Overview

MViTs Excel at Class-agnostic Object Detection

PWC PWC PWC PWC PWC PWC PWC PWC PWC PWC PWC

Multi-modal Vision Transformers Excel at Class-agnostic Object Detection

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer and Ming-Hsuan Yang

Paper: https://arxiv.org/abs/2111.11430


main figure

Abstract: What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and for unseen objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. To bridge this gap, we explore recent Multi-modal Vision Transformers (MViT) that have been trained with aligned image-text pairs. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on these findings, we develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention that can adaptively generate proposals given a specific language query. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs offer enhanced interactability with intelligible text queries.


Architecture overview of MViTs used in this work

Architecture overview


Results


Class-agnostic OD performance of MViTs in comparison with uni-modal detector (RetinaNet) on several datasets. MViTs show consistently good results on all datasets.

Results


Enhanced Interactability: Effect of using different intuitive text queries on the MDef-DETR class-agnostic OD performance. Combining detections from multiple queries captures varying aspects of objectness.

Results


Generalization to Rare/Novel Classes: MDef-DETR class-agnostic OD performance on rarely and frequently occurring categories in the pretraining captions. The numbers on top of the bars indicate occurrences of the corresponding category in the training dataset. The MViT achieves good recall values even for the classes with no or very few occurrences.

Results


Open-world Object Detection: Effect of using class-agnostic OD proposals from MDef-DETR for pseudo labelling of unknowns in Open World Detector (ORE).

Results


Pretraining for Class-aware Object Detection: Effect of using MDef-DETR proposals for pre-training of DETReg instead of Selective Search proposals.

Results


Evaluation

The provided codebase contains the pre-computed detections for all datasets using ours MDef-DETR model. The provided directory structure is as follows,

-> README.md
-> LICENSE
-> get_eval_metrics.py
-> get_multi_dataset_eval_metrics.py
-> data
    -> voc2007
        -> combined.pkl
    -> coco
        -> combined.pkl
    -> kitti
        -> combined.pkl
    -> kitchen
        -> combined.pkl
    -> cliaprt
        -> combined.pkl
    -> comic
        -> combined.pkl
    -> watercolor
        -> combined.pkl
    -> dota
        -> combined.pkl

Where combined.pkl contains the combined detections from multiple intutive text queries for corresponding datasets. (Refer Section 5.1: Enhanced Interactability for more details)

Download the annotations for all datasets and arrange them as shown below. Note that the script expect COCO annotations in standard COCO format & annotations of all other datasets in VOC format.

...
...
-> data
    -> voc2007
        -> combined.pkl
        -> Annotations
    -> coco
        -> combined.pkl
        -> instances_val2017_filt.json
    -> kitti
        -> combined.pkl
        -> Annotations
        ...
    -> kitchen
        -> combined.pkl
        -> Annotations
    -> cliaprt
        -> combined.pkl
        -> Annotations
    -> comic
        -> combined.pkl
        -> Annotations
    -> watercolor
        -> combined.pkl
        -> Annotations
    -> dota
        -> combined.pkl
        -> Annotations

Once the above mentioned directory structure is created, follow the following steps to calculate the metrics.

  1. Install numpy
$ pip install numpy
  1. Calculate metrics
$ python get_multi_dataset_eval_metrics.py

The calculated metrics will be stored in a data.csv file in the same directory.


Citation

If you use our work, please consider citing:

@article{Maaz2021Multimodal,
    title={Multi-modal Transformers Excel at Class-agnostic Object Detection},
    author={Muhammad Maaz and Hanoona Rasheed and Salman Khan and Fahad Shahbaz Khan and Rao Muhammad Anwer and Ming-Hsuan Yang},
    journal={ArXiv 2111.11430},
    year={2021}
}

Contact

Should you have any question, please contact [email protected] or [email protected]

🚀 Note: The repository contains the minimum evaluation code. The complete training and inference scripts along with pretrained models will be released soon. Stay Tuned!

Comments
  • aligning image text pairs

    aligning image text pairs

    I have a question on the paper: you train on aligned image-text pairs. How do you create this alignment? is it the same way as in MDeTr? I did not fully understand from the paper, especially for non-natural images like satellite images or medical images.

    opened by nikky4D 6
  • Loading checkpoints for inference

    Loading checkpoints for inference

    Which checkpoints in drive link you provided will load correctly in default MDefDETR model without any errors? Im getting missing/unexpected keys errors.

    documentation 
    opened by KaleemW 4
  • Is EMA used in this work?

    Is EMA used in this work?

    Hello author, thanks for your great work. I raise a question about the usage of Exponential Moving Average (EMA) in this paper, hoping you can provide me with some clues. It seems that this paper does not detail in this part. As far as I know, MDETR uses it and evaluate use the EMA model. So I wonder is it used in this work? If it is actually used, why should we evaluate by the EMA model rather than the original one?

    opened by JacobYuan7 4
  • one of the variables needed for gradient computation has been modified by an inplace operation

    one of the variables needed for gradient computation has been modified by an inplace operation

    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 20]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

    This error will terminate the training procedure when training mdef_detr using the PyTorch environment as you advise(torch==1.8.0+cu111).

    And I found the variables of 'transformer.text_encoder.pooler.dense.weight' does not have grad. This may be the main reason for this error.

    opened by xushilin1 2
  • Loading the Faster RCNN checkpoint

    Loading the Faster RCNN checkpoint

    Greetings

    The readme states: (Feb 01, 2022) Training codes for MDef-DETR and MDef-DETR minus Language models are released -> training/README.md Instructions to use class-agnostic object detection behavior of MDef-DETR on different applications are released -> applications/README.md All the pretrained models (MDef-DETR, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with the instructions to reproduce the results are released -> this link

    Following the link to the google drive, only provides me with the model weight for the Faster-RCNN, but not with instructions on how to load it and which framework to use. I have tried creating a Faster-RCNN-resnet101 model with pytorch, but when I load the model weight, it states that the layer names does not match. Any guidance would be much appreciated.

    Best regards Martin

    uni-modal-detectors 
    opened by MartinPedersenpp 2
  • Need to understand how to import weights

    Need to understand how to import weights

    Hello,

    Firstly, I'd like to congratulate you for bringing this amazing work. Class agnostic object detection is much needed currently in the industry and this would be a great way to solve the problem.

    I wanted to test your model on some custom data. However, I cannot import pre-trained weights from the link you have provided. I can see the zip file but I couldn't find a way to import them. I'm using OpenCV to import weights. It is asking me to have a config file as well as .weights file.

    Could you please help me which library to use to import weights when I'm working on a jupyter notebook?

    Thank you,

    opened by abhi-vellala 2
  • pretrain data download

    pretrain data download

    if is it possible to split pretrain data into multiple seperate zip files。 I download data from google drive : https://drive.google.com/drive/folders/1-3kAsyZIVFbNelRXrF93Y5tMgOypv2jV i cannot download this data because of google drive time limit(less than 1 hours) and my limit network bandwidth。

    documentation 
    opened by zhouxingguang 1
  • Training code release

    Training code release

    This pull request adds

    • Training codes for MDef-DETR and MDef-DETR minus Language models
    • Instructions to use class-agnostic object detection behavior of MDef-DETR on different applications
    • All the pre-trained models (MDef-DETR, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with the instructions to reproduce the results
    opened by mmaaz60 0
  • Questions about your training procedure?

    Questions about your training procedure?

    To my understanding, I think you use image-text pairs as inputs and only bbox annotations as supervision signals without any class labels, does it right?

    opened by GYslchen 1
  • Questions about your pretrained model

    Questions about your pretrained model

    Does the pre-trained model you provide cover the categories on LVIS data? If I want to do open-world object detection on the LVIS dataset, can I directly use your pre-trained model to generate the proposals or should I need to filter the dataset so that it doesn't contain any object in the LVIS dataset?

    opened by chengsilin 1
  • how to generate 'tokens_positive'  ann from detector dataset like object365?

    how to generate 'tokens_positive' ann from detector dataset like object365?

    I found 'tokens_positive' was used in your ann file. could you please release the code of how to process detect data like coco to get the 'tokens_positive' ann results?

    documentation 
    opened by zhouxingguang 1
Releases(v1.0)
  • v1.0(Feb 1, 2022)

    • Training codes for MDef-DETR and MDef-DETR minus Language models are released -> training/README.md
    • Instructions to use class-agnostic object detection behavior of MDef-DETR on different applications are released -> applications/README.md
    • All the pretrained models (MDef-DETR, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with the instructions to reproduce the results are released -> this link
    Source code(tar.gz)
    Source code(zip)
  • v0.1(Nov 25, 2021)

    Evaluation Code & Pre-trained Models

    • Releases evaluation code for MDef-DETR and 'MDef-DETR w/o Language Branch' model
    • Releases the pre-trained weights for both models
    • Releases the pre-computed predictions for both the models
    Source code(tar.gz)
    Source code(zip)
Owner
Muhammad Maaz
An Electrical Engineer with experience in Computer Vision software development. Skilled in Machine Learning, Deep Learning and Computer Vision.
Muhammad Maaz
Heterogeneous Deep Graph Infomax

Heterogeneous-Deep-Graph-Infomax Parameter Setting: HDGI-A: Node-level dimension: 16 Attention head: 4 Semantic-level attention vector: 8 learning rat

52 Oct 31, 2022
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"

Transparency-by-Design networks (TbD-nets) This repository contains code for replicating the experiments and visualizations from the paper Transparenc

David Mascharka 351 Nov 18, 2022
Single-Shot Motion Completion with Transformer

Single-Shot Motion Completion with Transformer 👉 [Preprint] 👈 Abstract Motion completion is a challenging and long-discussed problem, which is of gr

FuxiCV 78 Dec 29, 2022
Testing and Estimation of structural breaks in Stata

xtbreak estimating and testing for many known and unknown structural breaks in time series and panel data. For an overview of xtbreak test see xtbreak

Jan Ditzen 13 Jun 19, 2022
object detection; robust detection; ACM MM21 grand challenge; Security AI Challenger Phase VII

赛题背景 在商品知识产权领域,知识产权体现为在线商品的设计和品牌。不幸的是,在每一天,存在着非法商户通过一些对抗手段干扰商标识别来逃避侵权,这带来了很高的知识产权风险和财务损失。为了促进先进的多媒体人工智能技术的发展,以保护企业来之不易的创作和想法免受恶意使用和剽窃,因此提出了鲁棒性标识检测挑战赛

65 Dec 22, 2022
Kaggle-titanic - A tutorial for Kaggle's Titanic: Machine Learning from Disaster competition. Demonstrates basic data munging, analysis, and visualization techniques. Shows examples of supervised machine learning techniques.

Kaggle-titanic This is a tutorial in an IPython Notebook for the Kaggle competition, Titanic Machine Learning From Disaster. The goal of this reposito

Andrew Conti 800 Dec 15, 2022
VQMIVC - Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech

Disong Wang 262 Dec 31, 2022
Deep learning toolbox based on PyTorch for hyperspectral data classification.

Deep learning toolbox based on PyTorch for hyperspectral data classification.

Nicolas 304 Dec 28, 2022
Official code for "InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization" (ICLR 2020, spotlight)

InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization Authors: Fan-yun Sun, Jordan Hoffm

Fan-Yun Sun 232 Dec 28, 2022
The pytorch implementation of SOKD (BMVC2021).

Semi-Online Knowledge Distillation Implementations of SOKD. Requirements This repo was tested with Python 3.8, PyTorch 1.5.1, torchvision 0.6.1, CUDA

4 Dec 19, 2021
OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

OCTIS : Optimizing and Comparing Topic Models is Simple! OCTIS (Optimizing and Comparing Topic models Is Simple) aims at training, analyzing and compa

MIND 478 Jan 01, 2023
PyExplainer: A Local Rule-Based Model-Agnostic Technique (Explainable AI)

PyExplainer PyExplainer is a local rule-based model-agnostic technique for generating explanations (i.e., why a commit is predicted as defective) of J

AI Wizards for Software Management (AWSM) Research Group 14 Nov 13, 2022
Code for "Unsupervised Source Separation via Bayesian inference in the latent domain"

LQVAE-separation Code for "Unsupervised Source Separation via Bayesian inference in the latent domain" Paper Samples GT Compressed Separated Drums GT

Michele Mancusi 30 Oct 25, 2022
Interactive Visualization to empower domain experts to align ML model behaviors with their knowledge.

An interactive visualization system designed to helps domain experts responsibly edit Generalized Additive Models (GAMs). For more information, check

InterpretML 83 Jan 04, 2023
TF Image Segmentation: Image Segmentation framework

TF Image Segmentation: Image Segmentation framework The aim of the TF Image Segmentation framework is to provide/provide a simplified way for: Convert

Daniil Pakhomov 546 Dec 17, 2022
Pytorch domain adaptation package

DomainAdaptation This package is created to tackle the problem of domain shifts when dealing with two domains of different feature distributions. In d

Institute of Computational Perception 7 Oct 22, 2022
Using Machine Learning to Create High-Res Fine Art

BIG.art: Using Machine Learning to Create High-Res Fine Art How to use GLIDE and BSRGAN to create ultra-high-resolution paintings with fine details By

Robert A. Gonsalves 13 Nov 27, 2022
Implementation of E(n)-Transformer, which extends the ideas of Welling's E(n)-Equivariant Graph Neural Network to attention

E(n)-Equivariant Transformer (wip) Implementation of E(n)-Equivariant Transformer, which extends the ideas from Welling's E(n)-Equivariant G

Phil Wang 132 Jan 02, 2023
Official PyTorch Implementation of Mask-aware IoU and maYOLACT Detector [BMVC2021]

The official implementation of Mask-aware IoU and maYOLACT detector. Our implementation is based on mmdetection. Mask-aware IoU for Anchor Assignment

Kemal Oksuz 46 Sep 29, 2022
DECAF: Deep Extreme Classification with Label Features

DECAF DECAF: Deep Extreme Classification with Label Features @InProceedings{Mittal21, author = "Mittal, A. and Dahiya, K. and Agrawal, S. and Sain

46 Nov 06, 2022