Pytorch Implementation for (STANet+ and STANet)

Related tags

Deep LearningSTANet
Overview

Pytorch Implementation for (STANet+ and STANet)

V2-Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception (arxiv), pdf:V2

V1-From Semantic Categories to Fixations: A Novel Weakly-supervised Visual-auditory Saliency Detection Approach (CVPR2021), pdf:V1


Introduction

  • This repository contains the source code, results, and evaluation toolbox of STANet+ (V2), which are the journal extension version of our paper STANet (V1) published at CVPR-2021.
  • Compared our conference version STANet (V2), which has been extended in two distinct aspects.
    First on the basis of multisource and multiscale perspectives which have been adopted by the CVPR version (V1), we have provided a deep insight into the relationship between multigranularity perception (Fig.2) and real human attention behaved in visual-auditory environment.
    Second without using any complex networks, we have provided an elegant framework to complementary integrate multisource, multiscale, and multigranular information (Fig.1) to formulate pseudofixations which are very consistent with the real ones. Apart from achieving significant performance gain, this work also provides a comprehensive solution for mimicking multimodality attention.

Figure 1: STANet+ mainly focuses on devising a weakly supervised approach for the spatial-temporal-audio (STA) fixation prediction task, where the key innovation is that, as one of the first attempts, we automatically convert semantic category tags to pseudofixations via the newly proposed selective class activation mapping (SCAM) and the upgraded version SCAM+ that has been additionally equipped with the multigranularity perception ability. The obtained pseudofixations can be used as the learning objective to guide knowledge distillation to teach two individual fixation prediction networks (i.e., STA and STA+), which jointly enable generic video fixation prediction without requiring any video tags.

Figure 2: Some representative ’fixation shifting’ cases, additional multigranularity information (i.e., long/crossterm information) has been shown before collecting fixations in A_SRC. Clearly, by comparing A_FIX0, A_FIX1, and A _FIX2, we can easily notice that the multigranularity information could draw human attention to the most meaningful objects and make the fixations to be more focused.

Dependencies

  • Windows10
  • NVIDIA GeForce RTX 2070 SUPER & NVIDIA GeForce RTX 1080Ti
  • python 3.6.4
  • Matlab R2016b
  • pytorch 1.8.0
  • soundmodel

Preparation

Downloading the official pretrained visual and audio model

Visual:resnext101_32x8d, vgg16
Audio: vggsound, net = torch.load('vggsound_netvlad').

Downloading the training dataset and testing dataset:

Training dataset: AVE(Audio Visual Event Location).
Testing dataset: AVAD, DIEM, SumMe, ETMD, Coutrot.

Training

Note
We use Fourier-transform to transform audio features as audio stream input, therefore, you firstly need to use the function audiostft.py to convert the audio files (.wav) to get the audio features(.h5).

Step 1. SCAM training

Coarse: Separately training branches of Scoarse, SAcoarse, STcoarse ,it should be noted that the coarse stage is coarse location, so the size is set to 256 to ensure object-wise location accuracy.
Fine: Separately re-training branches of Sfine, SAfine, STfine,it should be noted that the fine stage is a fine location, so the size is set to 356 to ensure regional location exactness.

Step2. SCAM+ training

S+: Separately training branches of S+short, S+long, S+cross, because it is frame-wise relational reasoning network, the network is the same, so we only need to change the source of the input data.
SA+: Separately training branches of SA+long, SA+cross.
ST+: Separately training branches of ST+short, ST+long, ST+cross.

Step 3. pseudoGT generation

In order to facilitate the display of matrix data processing, Matlab2016b was performed in coarse location of inter-frame smoothing and pseudo GT data post-processing.

Step 4. STA and STA+ training

Training the model of STA and STA+ using the AVE video frames with the generated pseudoGT.

Testing

Step 1. Using the function audiostft.py to convert the audio files (.wav) to get the audio features (.h5).
Step 2. Testing STA, STA+ network, fusing the test results to generate final saliency results.(STANet+)

The model weight file STANet+, STANet, AudioSwitch:
(Baidu Netdisk, code:6afo).

Evaluation

We use the evaluation code in the paper of STAVIS for fair comparisons.
You may need to revise the algorithms, data_root, and maps_root defined in the main.m.
We provide the saliency maps of the SOTA:

(STANet+, STANet, ITTI, GBVS, SCLI, AWS-D, SBF, CAM, GradCAM, GradCAMpp, SGradCAMpp, xGradCAM, SSCAM, ScoCAM, LCAM, ISCAM, ACAM, EGradCAM, ECAM, SPG, VUNP, WSS, MWS, WSSA).
(Baidu Netdisk, code:6afo).

Quantitative comparisons:

Qualitative results of our method and eight representative saliency models: ITTI, GBVS, SCLI, SBF, AWS-D, WSS, MWS, WSSA. It can be observed that our method is able to handle various challenging scenes well and produces more accurate results than other competitors.

Qualitative comparisons:

Quantitative comparisons between our method with other fully-/weakly-/un-supervised methods on 6 datasets. Bold means the best result, " denotes the higher the score, the better the performance.

References

[1][Tsiami, A., Koutras, P., Maragos, P.STAViS: Spatio-Temporal AudioVisual Saliency Network. (CVPR 2020).] (https://openaccess.thecvf.com/content_CVPR_2020/papers/Tsiami_STAViS_Spatio-Temporal_AudioVisual_Saliency_Network_CVPR_2020_paper.pdf)
[2][Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C. Audio-Visual Event Localization in Unconstrained Videos. (ECCV 2018)] (https://openaccess.thecvf.com/content_ECCV_2018/papers/Yapeng_Tian_Audio-Visual_Event_Localization_ECCV_2018_paper.pdf)
[3][Chen, H., Xie, W., Vedaldi, A., & Zisserman, A. Vggsound: A Large-Scale Audio-Visual Dataset. (ICASSP 2020)] (https://www.robots.ox.ac.uk/~vgg/publications/2020/Chen20/chen20.pdf)

Citation

If you find this work useful for your research, please consider citing the following paper:

@InProceedings{Wang_2021_CVPR,  
    author    = {Wang, Guotao and Chen, Chenglizhao and Fan, Deng-Ping and Hao, Aimin and Qin, Hong},
    title     = {From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach},  
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},  
    month     = {June},  
    year      = {2021},  
    pages     = {15119-15128}  
}  


@misc{wang2021weakly,
    title={Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception}, 
    author={Guotao Wang and Chenglizhao Chen and Dengping Fan and Aimin Hao and Hong Qin},
    year={2021},
    eprint={2112.13697},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
Owner
GuotaoWang
GuotaoWang
Codes for "Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier"

Deep-RTC [project page] This repository contains the source code accompanying our ECCV 2020 paper. Solving Long-tailed Recognition with Deep Realistic

Gina Wu 16 May 26, 2022
A flag generation AI created using DeepAIs API

Vex AI or Vexiology AI is an Artifical Intelligence created to generate custom made flag design texts. It uses DeepAIs API. Please be aware that you must include your own DeepAI API key. See instruct

Bernie 10 Apr 06, 2022
Implementation of the state-of-the-art vision transformers with tensorflow

ViT Tensorflow This repository contains the tensorflow implementation of the state-of-the-art vision transformers (a category of computer vision model

Mohammadmahdi NouriBorji 2 Mar 16, 2022
PAWS 🐾 Predicting View-Assignments with Support Samples

This repo provides a PyTorch implementation of PAWS (predicting view assignments with support samples), as described in the paper Semi-Supervised Learning of Visual Features by Non-Parametrically Pre

Facebook Research 437 Dec 23, 2022
A curated (most recent) list of resources for Learning with Noisy Labels

A curated (most recent) list of resources for Learning with Noisy Labels

Jiaheng Wei 321 Jan 09, 2023
[NeurIPS'21] Shape As Points: A Differentiable Poisson Solver

Shape As Points (SAP) Paper | Project Page | Short Video (6 min) | Long Video (12 min) This repository contains the implementation of the paper: Shape

394 Dec 30, 2022
A Pytorch implement of paper "Anomaly detection in dynamic graphs via transformer" (TADDY).

TADDY: Anomaly detection in dynamic graphs via transformer This repo covers an reference implementation for the paper "Anomaly detection in dynamic gr

Yue Tan 21 Nov 24, 2022
BlueFog Tutorials

BlueFog Tutorials Welcome to the BlueFog tutorials! In this repository, we've put together a collection of awesome Jupyter notebooks. These notebooks

4 Oct 27, 2021
This code is part of the reproducibility package for the SANER 2022 paper "Generating Clarifying Questions for Query Refinement in Source Code Search".

Clarifying Questions for Query Refinement in Source Code Search This code is part of the reproducibility package for the SANER 2022 paper "Generating

Zachary Eberhart 0 Dec 04, 2021
PyTorch Live is an easy to use library of tools for creating on-device ML demos on Android and iOS.

PyTorch Live is an easy to use library of tools for creating on-device ML demos on Android and iOS. With Live, you can build a working mobile app ML demo in minutes.

559 Jan 01, 2023
Official implementation of the NeurIPS'21 paper 'Conditional Generation Using Polynomial Expansions'.

Conditional Generation Using Polynomial Expansions Official implementation of the conditional image generation experiments as described on the NeurIPS

Grigoris 4 Aug 07, 2022
Code for the paper "Balancing Training for Multilingual Neural Machine Translation, ACL 2020"

Balancing Training for Multilingual Neural Machine Translation Implementation of the paper Balancing Training for Multilingual Neural Machine Translat

Xinyi Wang 21 May 18, 2022
Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

flownet2-pytorch Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. Multiple GPU training is supported, a

NVIDIA Corporation 2.8k Dec 27, 2022
GNN-based Recommendation Benchmark

GRecX A Fair Benchmark for GNN-based Recommendation Homepage and Documentation Homepage: Documentation: Paper: GRecX: An Efficient and Unified Benchma

73 Oct 17, 2022
Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather

LiDAR fog simulation Created by Martin Hahner at the Computer Vision Lab of ETH Zurich. This is the official code release of the paper Fog Simulation

Martin Hahner 110 Dec 30, 2022
내가 보려고 정리한 <프로그래밍 기초 Ⅰ> / organized for me

Programming-Basics 프로그래밍 기초 Ⅰ 아카이브 Do it! 점프 투 파이썬 주차 강의주제 비고 1주차 Syllabus 2주차 자료형 - 숫자형 3주차 자료형 - 문자열형 4주차 입력과 출력 5주차 제어문 - 조건문 if 6주차 제어문 - 반복문 whil

KIMMINSEO 1 Mar 07, 2022
A FAIR dataset of TCV experimental results for validating edge/divertor turbulence models.

TCV-X21 validation for divertor turbulence simulations Quick links Intro Welcome to TCV-X21. We're glad you've found us! This repository is designed t

0 Dec 18, 2021
QHack—the quantum machine learning hackathon

Official repo for QHack—the quantum machine learning hackathon

Xanadu 72 Dec 21, 2022
Various operations like path tracking, counting, etc by using yolov5

Object-tracing-with-YOLOv5 Various operations like path tracking, counting, etc by using yolov5

Pawan Valluri 5 Nov 28, 2022
A Dataset for Direct Quotation Extraction and Attribution in News Articles.

DirectQuote - A Dataset for Direct Quotation Extraction and Attribution in News Articles DirectQuote is a corpus containing 19,760 paragraphs and 10,3

THUNLP-MT 9 Sep 23, 2022