End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

Overview

PDVC

PWC PWC

Official implementation for End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

[paper] [valse论文速递(Chinese)]

This repo supports:

  • two video captioning tasks: dense video captioning and video paragraph captioning
  • two datasets: ActivityNet Captions and YouCook2
  • video features containing C3D, TSN, and TSP.
  • visualization of the generated captions of your own videos

Table of Contents:

Updates

  • (2021.11.19) add code for running PDVC on raw videos and visualize the generated captions (support Chinese and other non-English languages)
  • (2021.11.19) add pretrained models with TSP features. It achieves 9.03 METEOR(2021) and 6.05 SODA_c, a very competitive results on ActivityNet Captions without self-critical sequence training.
  • (2021.08.29) add TSN pretrained models and support YouCook2

Introduction

PDVC is a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), by formulating the dense caption generation as a set prediction task. Without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results, surpassing the state-of-the-art methods when its localization accuracy is on par with them. pdvc.jpg

Preparation

Environment: Linux, GCC>=5.4, CUDA >= 9.2, Python>=3.7, PyTorch>=1.5.1

  1. Clone the repo
git clone --recursive https://github.com/ttengwang/PDVC.git
  1. Create vitual environment by conda
conda create -n PDVC python=3.7
source activate PDVC
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg
pip install -r requirement.txt
  1. Compile the deformable attention layer (requires GCC >= 5.4).
cd pdvc/ops
sh make.sh

Running PDVC on Your Own Videos

Download a pretrained model (GoogleDrive) with TSP features and put it into ./save. Then run:

video_folder=visualization/videos
output_folder=visualization/output
pdvc_model_path=save/anet_tsp_pdvc/model-best.pth
output_language=en
bash test_and_visualize.sh $video_folder $output_folder $pdvc_model_path $output_language

check the $output_folder, you will see a new video with embedded captions. Note that we generate non-English captions by translating the English captions by GoogleTranslate. To produce chinese captions, set output_language=zh-cn. For other language support, find the abbreviation of your language at this url, and you also may need to download a font supporting your language and put it into ./visualization.

demo.gifdemo.gif

Training and Validation

Download Video Features

cd data/anet/features
bash download_anet_c3d.sh
# bash download_anet_tsn.sh
# bash download_i3d_vggish_features.sh
# bash download_tsp_features.sh

Dense Video Captioning

  1. PDVC with learnt proposals
# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}
# The script will evaluate the model for every epoch. The results and logs are saved in `./save`.

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
  1. PDVC with ground-truth proposals
# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}

Video Paragraph Captioning

  1. PDVC with learnt proposals
# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID} 

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
  1. PDVC with ground-truth proposals
# Training
config_path=cfgs/anet_c3d_pdvc_gt.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}

Performance

Dense video captioning

Model Features config_path Url Recall Precision BLEU4 METEOR2018 METEOR2021 CIDEr SODA_c
PDVC_light C3D cfgs/anet_c3d_pdvcl.yml Google Drive 55.30 58.42 1.55 7.13 7.66 24.80 5.23
PDVC C3D cfgs/anet_c3d_pdvc.yml Google Drive 55.20 57.36 1.82 7.48 8.09 28.16 5.47
PDVC_light TSN cfgs/anet_tsn_pdvcl.yml Google Drive 55.34 57.97 1.66 7.41 7.97 27.23 5.51
PDVC TSN cfgs/anet_tsn_pdvc.yml Google Drive 56.21 57.46 1.92 8.00 8.63 29.00 5.68
PDVC_light TSP cfgs/anet_tsp_pdvcl.yml Google Drive 55.24 57.78 1.77 7.94 8.55 28.25 5.95
PDVC TSP cfgs/anet_tsp_pdvc.yml Google Drive 55.79 57.39 2.17 8.37 9.03 31.14 6.05

Notes:

Video paragraph captioning

Model Features config_path BLEU4 METEOR CIDEr
PDVC C3D cfgs/anet_c3d_pdvc.yml 9.67 14.74 16.43
PDVC TSN cfgs/anet_tsn_pdvc.yml 10.18 15.96 20.66
PDVC TSP cfgs/anet_tsp_pdvc.yml 10.46 16.42 20.91

Notes:

  • Paragraph-level scores are evaluated on the ActivityNet Entity ae-val set.

Citation

If you find this repo helpful, please consider citing:

@inproceedings{wang2021end,
  title={End-to-End Dense Video Captioning with Parallel Decoding},
  author={Wang, Teng and Zhang, Ruimao and Lu, Zhichao and Zheng, Feng and Cheng, Ran and Luo, Ping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6847--6857},
  year={2021}
}
@ARTICLE{wang2021echr,
  author={Wang, Teng and Zheng, Huicheng and Yu, Mingjing and Tian, Qian and Hu, Haifeng},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={Event-Centric Hierarchical Representation for Dense Video Captioning}, 
  year={2021},
  volume={31},
  number={5},
  pages={1890-1900},
  doi={10.1109/TCSVT.2020.3014606}}

Acknowledgement

The implementation of Deformable Transformer is mainly based on Deformable DETR. The implementation of the captioning head is based on ImageCaptioning.pytorch. We thanks the authors for their efforts.

Owner
Teng Wang
My research interests focus on deep learning and computer vision.
Teng Wang
PFLD pytorch Implementation

PFLD-pytorch Implementation of PFLD A Practical Facial Landmark Detector by pytorch. 1. install requirements pip3 install -r requirements.txt 2. Datas

zhaozhichao 669 Jan 02, 2023
Code for NeurIPS 2020 article "Contrastive learning of global and local features for medical image segmentation with limited annotations"

Contrastive learning of global and local features for medical image segmentation with limited annotations The code is for the article "Contrastive lea

Krishna Chaitanya 152 Dec 22, 2022
Audio Source Separation is the process of separating a mixture into isolated sounds from individual sources

Audio Source Separation is the process of separating a mixture into isolated sounds from individual sources (e.g. just the lead vocals).

Victor Basu 14 Nov 07, 2022
The Hailo Model Zoo includes pre-trained models and a full building and evaluation environment

Hailo Model Zoo The Hailo Model Zoo provides pre-trained models for high-performance deep learning applications. Using the Hailo Model Zoo you can mea

Hailo 50 Dec 07, 2022
Implementation of a Transformer that Ponders, using the scheme from the PonderNet paper

Ponder(ing) Transformer Implementation of a Transformer that learns to adapt the number of computational steps it takes depending on the difficulty of

Phil Wang 65 Oct 04, 2022
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021.

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning This repository contains the code for our ICCV 202

sangho.lee 28 Nov 08, 2022
DyStyle: Dynamic Neural Network for Multi-Attribute-Conditioned Style Editing

DyStyle: Dynamic Neural Network for Multi-Attribute-Conditioned Style Editing Figure: Joint multi-attribute edits using DyStyle model. Great diversity

74 Dec 03, 2022
An end-to-end machine learning library to directly optimize AUC loss

LibAUC An end-to-end machine learning library for AUC optimization. Why LibAUC? Deep AUC Maximization (DAM) is a paradigm for learning a deep neural n

Andrew 75 Dec 12, 2022
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers Authors: Jaemin Cho, Abhay Zala, and Mohit Bansal (

Jaemin Cho 98 Dec 15, 2022
Awesome Transformers in Medical Imaging

This repo supplements our Survey on Transformers in Medical Imaging Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat,

Fahad Shamshad 666 Jan 06, 2023
Implementation of Deep Deterministic Policy Gradiet Algorithm in Tensorflow

ddpg-aigym Deep Deterministic Policy Gradient Implementation of Deep Deterministic Policy Gradiet Algorithm (Lillicrap et al.arXiv:1509.02971.) in Ten

Steven Spielberg P 247 Dec 07, 2022
Measuring and Improving Consistency in Pretrained Language Models

ParaRel 🤘 This repository contains the code and data for the paper: Measuring and Improving Consistency in Pretrained Language Models as well as the

Yanai Elazar 26 Dec 02, 2022
The Empirical Investigation of Representation Learning for Imitation (EIRLI)

The Empirical Investigation of Representation Learning for Imitation (EIRLI)

Center for Human-Compatible AI 31 Nov 06, 2022
Code for "AutoMTL: A Programming Framework for Automated Multi-Task Learning"

AutoMTL: A Programming Framework for Automated Multi-Task Learning This is the website for our paper "AutoMTL: A Programming Framework for Automated M

Ivy Zhang 40 Dec 04, 2022
🕵 Artificial Intelligence for social control of public administration

Non-tech crash course into Operação Serenata de Amor Tech crash course into Operação Serenata de Amor Contributing with code and tech skills Supportin

Open Knowledge Brasil - Rede pelo Conhecimento Livre 4.4k Dec 31, 2022
Block-wisely Supervised Neural Architecture Search with Knowledge Distillation (CVPR 2020)

DNA This repository provides the code of our paper: Blockwisely Supervised Neural Architecture Search with Knowledge Distillation. Illustration of DNA

Changlin Li 215 Dec 19, 2022
Code for our TKDE paper "Understanding WeChat User Preferences and “Wow” Diffusion"

wechat-wow-analysis Understanding WeChat User Preferences and “Wow” Diffusion. Fanjin Zhang, Jie Tang, Xueyi Liu, Zhenyu Hou, Yuxiao Dong, Jing Zhang,

18 Sep 16, 2022
Understanding Convolution for Semantic Segmentation

TuSimple-DUC by Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Introduction This repository is for Under

TuSimple 585 Dec 31, 2022
Ratatoskr: Worcester Tech's conference scheduling system

Ratatoskr: Worcester Tech's conference scheduling system In Norse mythology, Ratatoskr is a squirrel who runs up and down the world tree Yggdrasil to

4 Dec 22, 2022
A style-based Quantum Generative Adversarial Network

Style-qGAN A style based Quantum Generative Adversarial Network (style-qGAN) model for Monte Carlo event generation. Tutorial We have prepared a noteb

9 Nov 24, 2022