ZeroVL - The official implementation of ZeroVL

Last update: Nov 04, 2022

Overview

This repository contains the source code necessary to reproduce the results presented in the paper ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources.

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing and extending them. To this end, we provide comprehensive training guidance that makes dual-encoder multi-modal representation alignment feasible with limited resources. We also provide a reproducible strong baseline with competitive results, namely ZeroVL, trained on publicly accessible academic datasets in a popular experimental environment.

Performance

Image-text retrieval RSUM scores on the MSCOCO and Flickr30K datasets:

method	computation	data	COCO(zs.)	COCO(ft.)	F30K(zs.)	F30K(ft.)
CLIP	256 V100	400M	400.2	-	540.6	-
ALIGN	1024 TPUv3	1800M	425.3	500.4	553.3	576.0
baseline	8 V100	14.2M	363.5	471.9	476.8	553.0
ZeroVL	8 V100	14.2M	425.0	485.0	536.2	561.6
ZeroVL	8 V100	100M	442.1	500.5	546.5	573.6

zs.: zero-shot setting, ft.: fine-tuned setting.
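
The README does not define RSUM. In the image-text retrieval literature it usually denotes the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image), giving a maximum of 600. The sketch below illustrates that assumed definition on a toy similarity matrix. The function names and the input layout are hypothetical and are not taken from the ZeroVL codebase.

```python
def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@k in percent for each k.

    sim: list of rows, sim[q][c] = similarity of query q to candidate c.
    gt:  list of sets, gt[q] = indices of the correct candidates for query q.
    A query counts as a hit at k if any correct candidate is in its top k.
    """
    recalls = []
    for k in ks:
        hits = 0
        for q, row in enumerate(sim):
            # Candidate indices sorted by descending similarity.
            ranked = sorted(range(len(row)), key=lambda c: -row[c])
            if gt[q] & set(ranked[:k]):
                hits += 1
        recalls.append(100.0 * hits / len(sim))
    return recalls


def rsum(sim_i2t, gt_i2t, gt_t2i):
    """Sum of R@1, R@5, R@10 for both directions (assumed RSUM definition)."""
    # Transpose the image-by-text matrix to score text-to-image retrieval.
    sim_t2i = [list(col) for col in zip(*sim_i2t)]
    return sum(recall_at_k(sim_i2t, gt_i2t)) + sum(recall_at_k(sim_t2i, gt_t2i))


# Toy data: 3 images, 3 captions, caption i belongs to image i.
toy_sim = [
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.1],
    [0.0, 0.1, 0.7],
]
toy_gt = [{0}, {1}, {2}]
toy_score = rsum(toy_sim, toy_gt, toy_gt)
print(round(toy_score, 2))
```

On real benchmarks each image has several captions, so gt_i2t holds multiple indices per image while gt_t2i holds a single image index per caption.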

Installation

Requirements:

Python 3.7
PyTorch 1.8.1
torchvision 0.9.1
CUDA 11.1

Install requirements:

pip3 install -r requirements.txt
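
After installation, it can be useful to confirm that the environment matches the versions listed under Requirements. The following is a minimal sketch of such a check, not a script shipped with the repository. The helper names are hypothetical, and the check reports a missing package rather than failing if PyTorch or torchvision is not installed.

```python
import sys

# Versions copied from the Requirements list above.
EXPECTED = {"python": "3.7", "torch": "1.8.1", "torchvision": "0.9.1", "cuda": "11.1"}


def matches(installed, expected):
    """Return True if `installed` starts with the components of `expected`."""
    # Drop local build suffixes such as "+cu111" before comparing.
    core = installed.split("+")[0].split(".")
    want = expected.split(".")
    return core[:len(want)] == want


def collect_versions():
    """Gather installed versions; missing packages are reported as such."""
    found = {"python": "{}.{}".format(*sys.version_info[:2])}
    try:
        import torch
        found["torch"] = torch.__version__
        # torch.version.cuda is None for CPU-only builds.
        found["cuda"] = torch.version.cuda or "not available"
    except ImportError:
        found["torch"] = "not installed"
        found["cuda"] = "unknown"
    try:
        import torchvision
        found["torchvision"] = torchvision.__version__
    except ImportError:
        found["torchvision"] = "not installed"
    return found


if __name__ == "__main__":
    versions = collect_versions()
    for name, expected in EXPECTED.items():
        status = "ok" if matches(versions[name], expected) else "MISMATCH"
        print("{:<12} expected {:<6} found {:<16} {}".format(
            name, expected, versions[name], status))
```

A mismatch does not necessarily mean the code will fail; the listed versions are simply the ones the README names.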

Getting Started

Check GETTING_STARTED.md for codebase usage.

Model Zoo

We will release pre-trained models soon.

Citing ZeroVL

If you use ZeroVL in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@article{cui2021zerovl,
  title={ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources},
  author={Cui, Quan and Zhou, Boyan and Guo, Yu and Yin, Weidong and Wu, Hao and Yoshie, Osamu},
  journal={arXiv preprint arXiv:2112.09331},
  year={2021}
}

License

ZeroVL is released under the MIT license. See LICENSE for details.
