[CVPR2021] Look before you leap: learning landmark features for one-stage visual grounding.

Last update: Dec 12, 2022

Overview

LBYL-Net

This repo implements paper Look Before You Leap: Learning Landmark Features For One-Stage Visual Grounding CVPR 2021.

Getting Started

Prerequisites

python 3.7
pytorch 10.0
cuda 10.0
gcc 4.92 or above

Installation

Then clone the repo and install dependencies.

git clone https://github.com/svip-lab/LBYLNet.git
cd LBYLNet
pip install requirements.txt

You also need to install our landmark feature convolution:

cd ext
git clone https://github.com/hbb1/landmarkconv.git
cd landmarkconv/lib/layers
python setup.py install --user

We follow dataset structure DMS and FAOA. For convience, we have pack them togather, including ReferitGame, RefCOCO, RefCOCO+, RefCOCOg.
```
bash data/refer/download_data.sh ./data/refer
```
download the generated index files and place them in ./data/refer. Available at [Gdrive], [One Drive] .

download the pretained model of YOLOv3.

wget -P ext https://pjreddie.com/media/files/yolov3.weights

Training and Evaluation

By default, we use 2 gpus and batchsize 64 with DDP (distributed data-parallel). We have provided several configurations and training log for reproducing our results. If you want to use different hyperparameters or models, you may create configs for yourself. Here are examples:

For distributed training with gpus :

CUDA_VISIBLE_DEVICES=0,1 python train.py lbyl_lstm_referit_batch64  --workers 8 --distributed --world_size 1  --dist_url "tcp://127.0.0.1:60006"

If you use single gpu or won't use distributed training (make sure to adjust the batchsize in the corresponding config file to match your devices):
```
CUDA_VISIBLE_DEVICES=0, python train.py lbyl_lstm_referit_batch64  --workers 8
```

For evaluation:

CUDA_VISIBLE_DEVICES=0, python evaluate.py lbyl_lstm_referit_batch64 --testiter 100 --split val

Trained Models

We provide the our retrained models with this re-organized codebase and provide their checkpoints and logs for reproducing the results. To use our trained models, download them from the [Gdrive] and save them into directory cache. Then the file path is expected to be <LBYLNet dir>/cache/nnet/<config>/<dataset>/<config>_100.pkl

Notice: The reproduced performances are occassionally higher or lower (within a reasonable range) than the results reported in the paper.

In this repo, we provide the peformance of our LBYL-Nets below. You can also find the details on <LBYLNet dir>/results and <LBYLNet dir>/logs.

Performance on ReferitGame ([email protected]).

Dataset Langauge Split Papar Reproduce

ReferitGame LSTM test 65.48 65.98

BERT test 67.47 68.48
Performance on RefCOCO ([email protected]).

Dataset Langauge Split Papar Reproduce

RefCOCO LSTM

testA 82.18 82.48

testB 71.91 71.76

BERT

testA 82.91 82.82

testB 74.15 72.82
Performance on RefCOCO+ ([email protected]).

Dataset Langauge Split Papar Reproduce

RefCOCO+ LSTM val 66.64 66.71

testA 73.21 72.63

testB 56.23 55.88

BERT val 68.64 68.76

testA 73.38 73.73

testB 59.49 59.62
Performance on RefCOCOg ([email protected]).

Dataset Langauge Split Papar Reproduce

RefCOCOg LSTM val 58.72 60.03

BERT val 62.70 63.20

Dataset	Langauge	Split	Papar	Reproduce
RefCOCO	LSTM
testA	82.18	82.48
testB	71.91	71.76
BERT
testA	82.91	82.82
testB	74.15	72.82

Dataset	Langauge	Split	Papar	Reproduce
RefCOCO+	LSTM	val	66.64	66.71
testA	73.21	72.63
testB	56.23	55.88
BERT	val	68.64	68.76
testA	73.38	73.73
testB	59.49	59.62

Dataset	Langauge	Split	Papar	Reproduce
RefCOCOg	LSTM	val	58.72	60.03
BERT	val	62.70	63.20

Demo

We also provide demo scripts to test if the repo is corretly installed. After installing the repo and download the pretained weights, you should be able to use the LBYL-Net to ground your own images.

python demo.py

you can change the model, image or phrase in the demo.py. You will see the output image in imgs/demo_out.jpg.

#!/usr/bin/env python
import cv2
import torch
from core.test.test import _visualize
from core.groundors import Net 
# pick one model
cfg_file = "lbyl_bert_unc+_batch64"
detector = Net(cfg_file, iter=100)
# inference
image = cv2.imread('imgs/demo.jpeg')
phrase = 'the green gaint'
bbox = detector(image, phrase)
_visualize(image, pred_bbox=bbox, phrase=phrase, save_path='imgs/demo_out.jpg', color=(1, 174, 245), draw_phrase=True)

Input:

Output:

Acknowledgements

This repo is organized as CornerNet-Lite and the code is partially from FAOA (e.g. data preparation) and MAttNet (e.g. LSTM). We thank for their great works.

Citations:

If you use any part of this repo in your research, please cite our paper:

@InProceedings{huang2021look,
      title={Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding}, 
      author={Huang, Binbin and Lian, Dongze and Luo, Weixin and Gao, Shenghua},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year={2021},
}

[CVPR2021] Look before you leap: learning landmark features for one-stage visual grounding.

Related tags

Overview

LBYL-Net

Getting Started

Prerequisites

Installation

Training and Evaluation

Trained Models

Demo

Acknowledgements

Citations:

Owner

SVIP Lab

MASS (Mueen's Algorithm for Similarity Search) - a python 2 and 3 compatible library used for searching time series sub-sequences under z-normalized Euclidean distance for similarity.

SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data (AAAI 2021)

This is a official repository of SimViT.

Event queue (Equeue) dialect is an MLIR Dialect that models concurrent devices in terms of control and structure.

A Collection of LiDAR-Camera-Calibration Papers, Toolboxes and Notes

Disturbing Target Values for Neural Network regularization: attacking the loss layer to prevent overfitting

SparseML is a libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models

Fast and robust clustering of point clouds generated with a Velodyne sensor.

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

CoMoGAN: continuous model-guided image-to-image translation. CVPR 2021 oral.

Repository for MeshTalk supplemental material and code once the (already approved) 16 GHS captures our lab will make publicly available are released.

A Python library created to assist programmers with complex mathematical functions

ScaleNet: A Shallow Architecture for Scale Estimation

ShinRL: A Library for Evaluating RL Algorithms from Theoretical and Practical Perspectives

Repository for the Bias Benchmark for QA dataset.

Consecutive-Subsequence - Simple software to calculate susequence with highest sum

Experiments on Flood Segmentation on Sentinel-1 SAR Imagery with Cyclical Pseudo Labeling and Noisy Student Training

Pathdreamer: A World Model for Indoor Navigation

2021 National Underwater Robotics Vision Optics

PyTorch implementation of "A Simple Baseline for Low-Budget Active Learning".