VD-BERT: A Unified Vision and Dialog Transformer with BERT

Last update: Nov 01, 2022

Overview

PyTorch code for the following paper at EMNLP 2020:
Title: VD-BERT: A Unified Vision and Dialog Transformer with BERT [pdf]
Authors: Yue Wang, Shafiq Joty, Michael R. Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi
Institute: Salesforce Research and CUHK
Abstract
Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need of pretraining on external vision-language data, our model yields new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard.

Framework illustration

Installation

Package: PyTorch 1.1. We also provide our Dockerfile and YAML file for setting up experiments on Google Cloud Platform (GCP).
Data: you can obtain the VisDial data from here
Visual features: we provide bottom-up attention visual features of VisDial v1.0 in data/img_feats1.0/. To extract visual features for other images, please refer to this docker image. We provide the extraction script at data/visual_extract_code.py, which should be run inside the provided bottom-up-attention image.

Code explanation

vdbert: stores the main training and testing Python files, the data loader code, the metrics, and the ensemble code;

pytorch_pretrained_bert: mainly borrowed from Hugging Face's pytorch-transformers v0.4.0;

modeling.py: we modify or add two classes: BertForPreTrainingLossMask and BertForVisDialGen;
rank_loss.py: three ranking methods: ListNet, ListMLE, approxNDCG;

sh: shell scripts to run the experiments

pred: stores two JSON files for the best single model (74.54 NDCG) and the ensemble model (75.35 NDCG)

model: You can download a pretrained model from https://storage.cloud.google.com/sfr-vd-bert-research/v1.0_from_BERT_e30.bin
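
To give a feel for the ranking objectives listed under rank_loss.py, here is a minimal sketch of a ListNet-style (top-one) loss in plain Python. It is not the code from rank_loss.py: the function names, the use of a softmax over the relevance labels as the target distribution, and the example numbers are all assumptions made for illustration.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_sum = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum for v in values]


def softmax(values):
    """Softmax computed from the stable log-softmax."""
    return [math.exp(v) for v in log_softmax(values)]


def listnet_loss(scores, relevance):
    """ListNet top-one loss (illustrative, not the repository's code).

    Cross entropy between the top-one distribution induced by the
    relevance labels and the one induced by the model scores.
    """
    target = softmax(relevance)      # assumed target distribution
    log_pred = log_softmax(scores)   # model distribution, in log space
    return -sum(t * lp for t, lp in zip(target, log_pred))


# Hypothetical example: three answer candidates with dense relevance labels.
relevance = [1.0, 0.5, 0.0]
well_ordered = listnet_loss([2.0, 1.0, 0.0], relevance)
badly_ordered = listnet_loss([0.0, 1.0, 2.0], relevance)
print(well_ordered < badly_ordered)
```

Scores that order the candidates the same way as the relevance labels yield the lower loss, which is the behavior a listwise ranking objective is meant to encourage.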

Running experiments

Below are example scripts for pretraining, finetuning (including dense annotation), and testing.

Pretraining: bash sh/pretrain_v1.0_mlm_nsp_g4.sh
Finetuning for discriminative: bash sh/finetune_v1.0_disc_g4.sh
Finetuning for discriminative, specifically on dense annotation: bash sh/finetune_v1.0_disc_dense_g4.sh
Finetuning for generative: bash sh/finetune_v1.0_gen_g4.sh
Testing for discriminative on validation: bash sh/test_v1.0_disc_val.sh
Testing for generative on validation: bash sh/test_v1.0_gen_val.sh
Testing for discriminative on test: bash sh/test_v1.0_disc_test.sh

Notation: mlm: masked language modeling, nsp: next sentence prediction, disc: discriminative, gen: generative, g4: 4 GPUs, dense: dense annotation
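
The naming convention can be read off mechanically. The short sketch below expands the abbreviations in a script name using the notation listed above; the helper name describe_script and the NOTATION dictionary are hypothetical and not part of the repository.

```python
# Abbreviations taken from the notation line of this README.
NOTATION = {
    "mlm": "masked language modeling",
    "nsp": "next sentence prediction",
    "disc": "discriminative",
    "gen": "generative",
    "g4": "4 GPUs",
    "dense": "dense annotation",
}


def describe_script(path):
    """Return the expanded meaning of each known token in a script name.

    Hypothetical helper: tokens that are not in NOTATION (for example
    'pretrain' or 'v1.0') are skipped.
    """
    name = path.rsplit("/", 1)[-1]
    if name.endswith(".sh"):
        name = name[: -len(".sh")]
    return [NOTATION[token] for token in name.split("_") if token in NOTATION]


for script in (
    "sh/pretrain_v1.0_mlm_nsp_g4.sh",
    "sh/finetune_v1.0_disc_dense_g4.sh",
):
    print(script, "->", ", ".join(describe_script(script)))
```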

Citation

If you find the code useful in your research, please consider citing our paper:

@inproceedings{
    wang2020vdbert,
    title={VD-BERT: A Unified Vision and Dialog Transformer with BERT},
    author={Yue Wang and Shafiq Joty and Michael R. Lyu and Irwin King and Caiming Xiong and Steven C.H. Hoi},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020},
    year={2020},
}

License

This project is licensed under the terms of the MIT license.
