History Aware Multimodal Transformer for Vision-and-Language Navigation

Overview

This repository is the official implementation of History Aware Multimodal Transformer for Vision-and-Language Navigation. Project webpage: https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. In this work, we introduce the History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all past panoramic observations via a hierarchical vision transformer, then jointly combines the text, the history and the current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN), as well as long-horizon VLN (R4R, R2R-Back).

Figure: overview of the HAMT framework.

Installation

  1. Install the Matterport3D simulator: follow the instructions at https://github.com/peteanderson80/Matterport3DSimulator. We use the latest version, in which all inputs and outputs are batched.
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Install the Python requirements:
conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
  3. Download data from Dropbox, including processed annotations, features and pretrained models. Put the data in the `datasets` directory.

  4. (Optional) If you want to train HAMT end-to-end, you should also download the original Matterport3D data. A quick sanity check of the installation is sketched below.
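
After the steps above, a minimal smoke test (not part of the repository) can confirm that the simulator build is visible on PYTHONPATH and that the CUDA-enabled PyTorch install works:

# Smoke test, assuming PYTHONPATH was exported as in step 1.
python -c "import MatterSim; print('MatterSim OK')"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"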

Extracting features (optional)

Scripts to extract visual features are in the `preprocess` directory:

CUDA_VISIBLE_DEVICES=0 python preprocess/precompute_img_features_vit.py \
    --model_name vit_base_patch16_224 --out_image_logits \
    --connectivity_dir datasets/R2R/connectivity \
    --scan_dir datasets/Matterport3D/v1_unzip_scans \
    --num_workers 4 \
    --output_file datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5
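
To spot-check the result, the HDF5 file can be inspected as below. This is a sketch: the layout of one dataset per `<scanId>_<viewpointId>` key, with 36 panoramic views per viewpoint, is an assumption about the extractor's output.

# Hypothetical inspection of the feature file; adjust the path if you changed --output_file.
python -c "
import h5py
with h5py.File('datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5', 'r') as f:
    key = next(iter(f))        # assumed key format: '<scanId>_<viewpointId>'
    print(key, f[key].shape)   # expected roughly (36 views, feature dim [+ image logits])
"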

Training with proxy tasks

Stage 1: Pretrain with fixed ViT features

NODE_RANK=0
NUM_GPUS=4
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks

Stage 2: Train ViT in an end-to-end manner

Use the same launch command as in stage 1, but change the config file to `pretrain_r2r_e2e.json`.
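
A sketch of the stage 2 launch, mirroring the stage 1 command (the `-e2e` suffix on the output directory is an illustrative choice, not a name from the repository):

NODE_RANK=0
NUM_GPUS=4
# Same launcher as stage 1; only --config (and, here, --output_dir) differ.
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r_e2e.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks-e2e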

Fine-tuning for sequential action prediction

cd finetune_src
bash scripts/run_r2r.bash
bash scripts/run_r2r_back.bash
bash scripts/run_r2r_last.bash
bash scripts/run_r4r.bash
bash scripts/run_reverie.bash
bash scripts/run_cvdn.bash

Citation

If you find this work useful, please consider citing:

@InProceedings{chen2021hamt,
  author    = {Chen, Shizhe and Guhur, Pierre-Louis and Schmid, Cordelia and Laptev, Ivan},
  title     = {History Aware Multimodal Transformer for Vision-and-Language Navigation},
  booktitle = {NeurIPS},
  year      = {2021},
}

Acknowledgement

Parts of the code are built upon pytorch-image-models, UNITER and Recurrent-VLN-BERT. Thanks to their authors for the great work!
