The code repository for EMNLP 2021 paper "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization".

Last update: Jan 07, 2023

Overview

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

[Paper] accepted at EMNLP 2021:

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization, by Tiezheng Yu *, Wenliang Dai *, Zihan Liu, Pascale Fung.

Paper Abstract

Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pre-trained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs' powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability; and 2) where is the optimal place in GPLMs to inject the visual information? In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability. Results show that our best model significantly surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores on the How2 dataset, and our visual guidance method contributes 83.6% of the overall improvement. Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.
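The abstract describes attention-based add-on layers that inject visual information into a text model. As a rough illustration only, the sketch below shows one common way such a layer can work: each text hidden state attends over the video features with scaled dot-product attention, and the result is added back to the text state as a residual. This is not the repository's implementation. The function name `cross_modal_attention`, the absence of learned projections, and the assumption that text and video vectors share the same dimension are all simplifications made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_states, video_feats):
    """Hypothetical add-on layer: text queries attend over video features.

    text_states: list of text vectors (queries), each of length dim.
    video_feats: list of video vectors (keys and values), each of length dim.
    Returns one fused vector per text position: text + attended video context.
    """
    dim = len(text_states[0])
    fused = []
    for query in text_states:
        # Scaled dot-product score between this text position and every video frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in video_feats
        ]
        weights = softmax(scores)
        # Weighted average of the video vectors.
        context = [
            sum(w * value[i] for w, value in zip(weights, video_feats))
            for i in range(dim)
        ]
        # Residual connection keeps the original text representation intact.
        fused.append([t + c for t, c in zip(query, context)])
    return fused


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    video = [[0.5, 0.5]]
    print(cross_modal_attention(text, video))
```

The residual connection is the point of the sketch: the text representation passes through unchanged and the visual context is only added on top, which is one way to read "maintaining their original text generation ability".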

If your work is inspired by our paper or code, please cite it. Thanks!

TODO

Evaluation

We release the generated summaries from different models in ./evaluation/results. All the evaluation metrics can be computed following ./evaluation/README.md.
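For readers unfamiliar with the ROUGE scores reported in the abstract, here is a minimal sketch of how ROUGE-N can be computed from n-gram overlap. It is an illustration only, not the evaluation script referenced in ./evaluation/README.md: the function names `ngrams` and `rouge_n` are made up for this example, tokenization is plain whitespace splitting, and no stemming is applied, so the numbers will not match an official ROUGE toolkit exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two strings."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

To reproduce the reported numbers, follow ./evaluation/README.md rather than this sketch.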

Prepare dataset

The dataset is available from the How2 dataset GitHub repository. We recommend option 1 there: download a pre-packaged version.

Run fine-tuning

Make a directory for saving the Lightning logs: mkdir lightning_logs
Example of fine-tuning the BART text-only model: ./scripts/Bart_text_only.sh
Example of fine-tuning the BART multimodal model: ./scripts/Bart_multimodal.sh

Run inference

Example of running inference with the BART multimodal model: ./scripts/test_Bart_multimodal.sh

Owner

CAiRE
