Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021

Overview

Learning the Best Pooling Strategy for Visual Semantic Embedding

License: MIT

Official PyTorch implementation of the paper Learning the Best Pooling Strategy for Visual Semantic Embedding (CVPR 2021 Oral).

Please use the following bib entry to cite this paper if you are using any resources from the repo.

@inproceedings{chen2021vseinfty,
     title={Learning the Best Pooling Strategy for Visual Semantic Embedding},
     author={Chen, Jiacheng and Hu, Hexiang and Wu, Hao and Jiang, Yuning and Wang, Changhu},
     booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
     year={2021}
} 

We referred to the implementations of VSE++ and SCAN to build up our codebase.

Introduction

Illustration of the standard Visual Semantic Embedding (VSE) framework with the proposed pooling-based aggregator, i.e., Generalized Pooling Operator (GPO). It is simple and effective, which automatically adapts to the appropriate pooling strategy given different data modality and feature extractor, and improves VSE models at negligible extra computation cost.

Image-text Matching Results

The following tables show partial results of image-to-text retrieval on COCO and Flickr30K datasets. In these experiments, we use BERT-base as the text encoder for our methods. This branch provides our code and pre-trained models for using BERT as the text backbone, please check out to the bigru branch for the code and pre-trained models for using BiGRU as the text backbone.

Note that the VSE++ entries in the following tables are the VSE++ model with the specified feature backbones, thus the results are different from the original VSE++ paper.

Results of 5-fold evaluation on COCO 1K Test Split

Visual Backbone Text Backbone R1 R5 R1 R5 Link
VSE++ BUTD region BERT-base 67.9 91.9 54.0 85.6 -
VSEInfty BUTD region BERT-base 79.7 96.4 64.8 91.4 Here
VSEInfty BUTD grid BERT-base 80.4 96.8 66.4 92.1 Here
VSEInfty WSL grid BERT-base 84.5 98.1 72.0 93.9 Here

Results on Flickr30K Test Split

Visual Backbone Text Backbone R1 R5 R1 R5 Link
VSE++ BUTD region BERT-base 63.4 87.2 45.6 76.4 -
VSEInfty BUTD region BERT-base 81.7 95.4 61.4 85.9 Here
VSEInfty BUTD grid BERT-base 81.5 97.1 63.7 88.3 Here
VSEInfty WSL grid BERT-base 88.4 98.3 74.2 93.7 Here

Result (in [email protected]) on Crisscrossed Caption benchmark (trained on COCO)

Visual Backbone Text Backbone I2T T2I T2T I2I
VSRN BUTD region BiGRU 52.4 40.1 41.0 44.2
DE EfficientNet-B4 grid BERT-base 55.9 41.7 42.6 38.5
VSEInfty BUTD grid BERT-base 60.6 46.2 45.9 44.4
VSEInfty WSL grid BERT-base 67.9 53.6 46.7 51.3

Preparation

Environment

We trained and evaluated our models with the following key dependencies:

  • Python 3.7.3

  • Pytorch 1.2.0

  • Transformers 2.1.0

Run pip install -r requirements.txt to install the exactly same dependencies as our experiments. However, we also verified that using the latest Pytorch 1.8.0 and Transformers 4.4.2 can also produce similar results.

Data

We organize all data used in the experiments in the following manner:

data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── images   # raw coco images
│   │      ├── train2014
│   │      └── val2014
│   │
│   ├── cxc_annots # annotations for evaluating COCO-trained models on the CxC benchmark
│   │
│   └── id_mapping.json  # mapping from coco-id to image's file name
│   
│
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── flickr30k-images   # raw coco images
│   │      ├── xxx.jpg
│   │      └── ...
│   └── id_mapping.json  # mapping from f30k index to image's file name
│   
├── weights
│      └── original_updown_backbone.pth # the BUTD CNN weights
│
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)

The download links for original COCO/F30K images, precomputed BUTD features, and corresponding vocabularies are from the offical repo of SCAN. The precomp folders contain pre-computed BUTD region features, data/coco/images contains raw MS-COCO images, and data/f30k/flickr30k-images contains raw Flickr30K images.

The id_mapping.json files are the mapping from image index (ie, the COCO id for COCO images) to corresponding filenames, we generated these mappings to eliminate the need of the pycocotools package.

weights/original_updowmn_backbone.pth is the pre-trained ResNet-101 weights from Bottom-up Attention Model, we converted the original Caffe weights into Pytorch. Please download it from this link.

The data/coco/cxc_annots directory contains the necessary data files for running the Criscrossed Caption (CxC) evaluation. Since there is no official evaluation protocol in the CxC repo, we processed their raw data files and generated these data files to implement our own evaluation. We have verified our implementation by aligning the evaluation results of the official VSRN model with the ones reported by the CxC paper Please download the data files at this link.

Please download all necessary data files and organize them in the above manner, the path to the data directory will be the argument to the training script as shown below.

Training

Assuming the data root is /tmp/data, we provide example training scripts for:

  1. Grid feature with BUTD CNN for the image feature, BERT-base for the text feature. See train_grid.sh

  2. BUTD Region feature for the image feature, BERT-base for the text feature. See train_region.sh

To use other CNN initializations for the grid image feature, change the --backbone_source argument to different values:

  • (1). the default detector is to use the BUTD ResNet-101, we have adapted the original Caffe weights into Pytorch and provided the download link above;
  • (2). wsl is to use the backbones from large-scale weakly supervised learning;
  • (3). imagenet_res152 is to use the ResNet-152 pre-trained on ImageNet.

Evaluation

Run eval.py to evaluate specified models on either COCO and Flickr30K. For evaluting pre-trained models on COCO, use the following command (assuming there are 4 GPUs, and the local data path is /tmp/data):

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco

For evaluting pre-trained models on Flickr-30K, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset f30k --data_path /tmp/data/f30k

For evaluating pre-trained COCO models on the CxC dataset, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco --evaluate_cxc

For evaluating two-model ensemble, first run single-model evaluation commands above with the argument --save_results, and then use eval_ensemble.py to get the results (need to manually specify the paths to the saved results).

Owner
Jiacheng Chen
Jiacheng Chen
PyTorch implementation of federated learning framework based on the acceleration of global momentum

Federated Learning with Acceleration of Global Momentum PyTorch implementation of federated learning framework based on the acceleration of global mom

0 Dec 23, 2021
Official PyTorch implementation of paper: Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation (ICCV 2021 Oral Presentation)

SML (ICCV 2021, Oral) : Official Pytorch Implementation This repository provides the official PyTorch implementation of the following paper: Standardi

SangHun 61 Dec 27, 2022
以孤立语假设和宽度优先搜索为基础,构建了一种多通道堆叠注意力Transformer结构的斗地主ai

ddz-ai 介绍 斗地主是一种扑克游戏。游戏最少由3个玩家进行,用一副54张牌(连鬼牌),其中一方为地主,其余两家为另一方,双方对战,先出完牌的一方获胜。 ddz-ai以孤立语假设和宽度优先搜索为基础,构建了一种多通道堆叠注意力Transformer结构的系统,使其经过大量训练后,能在实际游戏中获

freefuiiismyname 88 May 15, 2022
Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO)

V-MPO Simple code to demonstrate Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) in Pyt

Nugroho Dewantoro 9 Jun 06, 2022
NLU Dataset Diagnostics

NLU Dataset Diagnostics This repository contains data and scripts to reproduce the results from our paper: Aarne Talman, Marianna Apidianaki, Stergios

Language Technology at the University of Helsinki 1 Jul 20, 2022
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

CALVIN CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks Oier Mees, Lukas Hermann, Erick Rosete,

Oier Mees 107 Dec 26, 2022
A new test set for ImageNet

ImageNetV2 The ImageNetV2 dataset contains new test data for the ImageNet benchmark. This repository provides associated code for assembling and worki

186 Dec 18, 2022
An implementation of Deep Graph Infomax (DGI) in PyTorch

DGI Deep Graph Infomax (Veličković et al., ICLR 2019): https://arxiv.org/abs/1809.10341 Overview Here we provide an implementation of Deep Graph Infom

Petar Veličković 491 Jan 03, 2023
Custom Implementation of Non-Deep Networks

ParNet Custom Implementation of Non-deep Networks arXiv:2110.07641 Ankit Goyal, Alexey Bochkovskiy, Jia Deng, Vladlen Koltun Official Repository https

Pritama Kumar Nayak 20 May 27, 2022
Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Yolov5 running on TorchServe (GPU compatible) ! This is a dockerfile to run TorchServe for Yolo v5 object detection model. (TorchServe (PyTorch librar

82 Nov 29, 2022
Code and data accompanying our SVRHM'21 paper.

Code and data accompanying our SVRHM'21 paper. Requires tensorflow 1.13, python 3.7, scikit-learn, and pytorch 1.6.0 to be installed. Python scripts i

5 Nov 17, 2021
Facial Action Unit Intensity Estimation via Semantic Correspondence Learning with Dynamic Graph Convolution

FAU Implementation of the paper: Facial Action Unit Intensity Estimation via Semantic Correspondence Learning with Dynamic Graph Convolution. Yingruo

Evelyn 78 Nov 29, 2022
Generating Band-Limited Adversarial Surfaces Using Neural Networks

Generating Band-Limited Adversarial Surfaces Using Neural Networks This is the official repository of the technical report that was published on arXiv

3 Jul 26, 2022
an implementation of Video Frame Interpolation via Adaptive Separable Convolution using PyTorch

This work has now been superseded by: https://github.com/sniklaus/revisiting-sepconv sepconv-slomo This is a reference implementation of Video Frame I

Simon Niklaus 985 Jan 08, 2023
Code release for ICCV 2021 paper "Anticipative Video Transformer"

Anticipative Video Transformer Ranked first in the Action Anticipation task of the CVPR 2021 EPIC-Kitchens Challenge! (entry: AVT-FB-UT) [project page

Facebook Research 123 Dec 13, 2022
Code and data to accompany the camera-ready version of "Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation" in EMNLP 2021

Code and data to accompany the camera-ready version of "Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation" in EMNLP 2021

Mozhdeh Gheini 16 Jul 16, 2022
Anagram Generator in Python

Anagrams Generator This is a program for computing multiword anagrams. It makes no effort to come up with sentences that make sense; it only finds ana

Day Fundora 5 Nov 17, 2022
A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation (ICCV 2021)

A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation (ICCV 2021) This repository contains the official implemen

81 Dec 14, 2022
[PyTorch] Official implementation of CVPR2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency". https://arxiv.org/abs/2103.05465

PointDSC repository PyTorch implementation of PointDSC for CVPR'2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency",

153 Dec 14, 2022
View model summaries in PyTorch!

torchinfo (formerly torch-summary) Torchinfo provides information complementary to what is provided by print(your_model) in PyTorch, similar to Tensor

Tyler Yep 1.5k Jan 05, 2023