Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Last update: Dec 14, 2022

Related tags

Deep Learning vln-bert

Overview

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra

Paper: https://arxiv.org/abs/2004.14973

Model Zoo

A variety of pre-trained VLN-BERT weights can accessed through the following links:

	Pre-training Stages	Job ID	Val Unseen SR	URL
0	no pre-training	174631	30.52%	TBD
1	1	175134	45.17%	TBD
3	1 and 2	221943	49.64%	download
2	1 and 3	220929	50.02%	download
4	1, 2, and 3 (Full Model)	220825	59.26%	download

Usage Instructions

Follow the instructions in INSTALL.md to setup this codebase. The instructions walk you through several steps including preprocessing the Matterport3D panoramas by extracting regions with a pretrained object detector.

Training

To preform stage 3 of pre-training, first download ViLBERT weights from here. Then, run:

python \
-m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
train.py \
--from_pretrained <path/to/vilbert_pytorch_model_9.bin> \
--save_name [pre_train_run_id] \
--num_epochs 50 \
--warmup_proportion 0.08 \
--cooldown_factor 8 \
--masked_language \
--masked_vision \
--no_ranking

To fine-tune VLN-BERT for the path selection task, run:

python \
-m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
train.py \
--from_pretrained <path/to/pytorch_model_50.bin> \
--save_name [fine_tune_run_id]

Evaluation

To evaluate a pre-trained model, run:

python test.py \
--split [val_seen|val_unseen] \
--from_pretrained <path/to/run_[run_id]_pytorch_model.bin> \
--save_name [run_id]

followed by:

python scripts/calculate-metrics.py <path/to/results_[val_seen|val_unseen].json>

Citation

If you find this code useful, please consider citing:

@inproceedings{majumdar2020improving,
  title={Improving Vision-and-Language Navigation with Image-Text Pairs from the Web},
  author={Arjun Majumdar and Ayush Shrivastava and Stefan Lee and Peter Anderson and Devi Parikh and Dhruv Batra},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}

Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Related tags

Overview

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Model Zoo

Usage Instructions

Training

Evaluation

Citation

Owner

Arjun Majumdar

PyTorch implementation of "Image-to-Image Translation Using Conditional Adversarial Networks".

Using contrastive learning and OpenAI's CLIP to find good embeddings for images with lossy transformations

Recurrent Scale Approximation (RSA) for Object Detection

Sharpness-Aware Minimization for Efficiently Improving Generalization

PyTorch code accompanying our paper on Maximum Entropy Generators for Energy-Based Models

Open-source implementation of Google Vizier for hyper parameters tuning

Inferred Model-based Fuzzer

Logistic Bandit experiments. Official code for the paper "Jointly Efficient and Optimal Algorithms for Logistic Bandits".

Learning to Estimate Hidden Motions with Global Motion Aggregation

"NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search".

Exploring Simple 3D Multi-Object Tracking for Autonomous Driving (ICCV 2021)

[NeurIPS 2021] Official implementation of paper "Learning to Simulate Self-driven Particles System with Coordinated Policy Optimization".

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Implementation of "Semi-supervised Domain Adaptive Structure Learning"

Repository For Programmers Seeking a platform to show their skills

VGGVox models for Speaker Identification and Verification trained on the VoxCeleb (1 & 2) datasets

DeepOBS: A Deep Learning Optimizer Benchmark Suite

This is RFA-Toolbox, a simple and easy-to-use library that allows you to optimize your neural network architectures using receptive field analysis (RFA) and create graph visualizations of your architecture.

codebase for "A Theory of the Inductive Bias and Generalization of Kernel Regression and Wide Neural Networks"

Federated Deep Reinforcement Learning for the Distributed Control of NextG Wireless Networks.