Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".

Last update: Jan 04, 2023

Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
arXiv technical report (arXiv 2201.02605)

Features

Detects any class given class names (using CLIP).
We train the detector on the ImageNet-21K dataset with 21K classes.
Cross-dataset generalization to OpenImages and Objects365 without finetuning.
State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.
Works for DETR-style detectors.
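To illustrate the first feature, here is a minimal sketch of how class names can drive detection: each region feature is compared with a text embedding per class name, and the closest name wins. This is not Detic's actual code. The function `classify_regions` is hypothetical, and the embeddings are toy vectors standing in for real CLIP text embeddings and detector region features.

```python
import numpy as np

def classify_regions(region_feats, text_embeds, class_names):
    """Assign each region the class whose text embedding is most similar.

    region_feats: (num_regions, dim) array of region features (toy values here).
    text_embeds:  (num_classes, dim) array, one embedding per class name
                  (in a real system these would come from a text encoder such as CLIP).
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T                      # (num_regions, num_classes)
    best = sims.argmax(axis=1)          # index of the closest class per region
    return [(class_names[i], float(sims[n, i])) for n, i in enumerate(best)]

# Toy vocabulary: changing this list changes what can be detected,
# without retraining anything in this sketch.
class_names = ["headphone", "webcam", "paper"]
text_embeds = np.eye(3)                 # assumed stand-in for text embeddings
region_feats = np.array([[0.9, 0.1, 0.0],
                         [0.0, 0.2, 0.8]])

predictions = classify_regions(region_feats, text_embeds, class_names)
labels = [name for name, _ in predictions]
print(labels)
```

Because the classifier is just a set of text embeddings, swapping the vocabulary only means swapping the rows of `text_embeds`.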

Installation

See installation instructions.

Demo

Integrated into Huggingface Spaces 🤗 using Gradio, where a web demo is available.

The demo can also be run using Colab (no GPU needed).

We use the default detectron2 demo interface. For example, to run our 21K model on a messy desk image (image credit David Fouhey) with the lvis vocabulary, run

mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
wget https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg
python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out.jpg --vocabulary lvis --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

If everything is set up correctly, the visualized detections are saved to out.jpg.

The same model can run with other vocabularies (COCO, OpenImages, or Objects365), or a custom vocabulary. For example:

python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out2.jpg --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffe --confidence-threshold 0.3 --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

The visualized detections are saved to out2.jpg.

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, our detector can produce a reasonable detection for coffe.
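When trying several custom vocabularies, it can help to assemble the command programmatically. The sketch below builds the same argument list as the custom-vocabulary command above; `build_demo_command` is a hypothetical helper, not part of the repository, and it only uses the flags and paths that appear in the commands in this section.

```python
import shlex

# Config and weight paths copied from the demo commands in this section.
CONFIG = "configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml"
WEIGHTS = "models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth"

def build_demo_command(image, output, class_names, threshold=0.3):
    """Hypothetical helper: assemble the demo.py argument list for a custom vocabulary."""
    names = [n.strip() for n in class_names if n.strip()]
    if not names:
        raise ValueError("need at least one class name")
    # The vocabulary is passed as one comma-separated string,
    # so a comma inside a name would split it into two classes.
    if any("," in n for n in names):
        raise ValueError("class names must not contain commas")
    return [
        "python", "demo.py",
        "--config-file", CONFIG,
        "--input", image,
        "--output", output,
        "--vocabulary", "custom",
        "--custom_vocabulary", ",".join(names),
        "--confidence-threshold", str(threshold),
        "--opts", "MODEL.WEIGHTS", WEIGHTS,
    ]

cmd = build_demo_command("desk.jpg", "out2.jpg",
                         ["headphone", "webcam", "paper", "coffe"])
print(shlex.join(cmd))  # shell-quoted form, for copy and paste
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.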

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

Open-vocabulary LVIS

	mask mAP	mask mAP_novel
Box-Supervised	30.2	16.4
Detic	32.4	24.9

Standard LVIS

	Detector/ Backbone	mask mAP	mask mAP_rare
Box-Supervised	CenterNet2-ResNet50	31.5	25.6
Detic	CenterNet2-ResNet50	33.2	29.7
Box-Supervised	CenterNet2-SwinB	40.7	35.9
Detic	CenterNet2-SwinB	41.7	41.7

	Detector/ Backbone	box mAP	box mAP_rare
Box-Supervised	DeformableDETR-ResNet50	31.7	21.4
Detic	DeformableDETR-ResNet50	32.5	26.2

Cross-dataset generalization

	Backbone	Objects365 box mAP	OpenImages box mAP50
Box-Supervised	SwinB	19.1	46.2
Detic	SwinB	21.4	55.2
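As a reading aid for the tables in this section, the short script below computes the absolute gain of Detic over the Box-Supervised baseline. All numbers are copied from the tables above; the dictionary layout and key names are just one convenient arrangement chosen for this sketch.

```python
# (setting, metric) -> (Box-Supervised, Detic), values copied from the tables above.
results = {
    ("Open-vocabulary LVIS", "mask mAP"): (30.2, 32.4),
    ("Open-vocabulary LVIS", "mask mAP_novel"): (16.4, 24.9),
    ("Standard LVIS, CenterNet2-ResNet50", "mask mAP_rare"): (25.6, 29.7),
    ("Standard LVIS, CenterNet2-SwinB", "mask mAP_rare"): (35.9, 41.7),
    ("Standard LVIS, DeformableDETR-ResNet50", "box mAP_rare"): (21.4, 26.2),
    ("Cross-dataset, SwinB", "Objects365 box mAP"): (19.1, 21.4),
    ("Cross-dataset, SwinB", "OpenImages box mAP50"): (46.2, 55.2),
}

# Absolute gain in points, rounded to one decimal to hide float noise.
gains = {key: round(detic - base, 1) for key, (base, detic) in results.items()}

for (setting, metric), gain in gains.items():
    print(f"{setting:40s} {metric:22s} +{gain}")
```

The largest gains in this selection are on novel and rare classes and on cross-dataset transfer, which is where image-level supervision is expected to matter most.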

License

The majority of Detic is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license (https://github.com/lvis-dataset/lvis-api/blob/master/LICENSE). If you later add other third-party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.

Ethical Considerations

Detic's wide range of detection capabilities may introduce similar challenges to many other visual recognition and open-set recognition methods. As the user can define arbitrary detection classes, class design and semantics may impact the model output.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2021detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={arXiv preprint arXiv:2201.02605},
  year={2021}
}

Owner

Meta Research
