PyTorch Implementation of STANet+ and STANet

V2: Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception (arXiv), pdf: V2

V1: From Semantic Categories to Fixations: A Novel Weakly-supervised Visual-auditory Saliency Detection Approach (CVPR 2021), pdf: V1


Introduction

  • This repository contains the source code, results, and evaluation toolbox of STANet+ (V2), the journal extension of our paper STANet (V1), published at CVPR 2021.
  • Compared with our conference version STANet (V1), STANet+ has been extended in two distinct aspects.
    First, building on the multisource and multiscale perspectives adopted by the CVPR version (V1), we provide a deeper insight into the relationship between multigranularity perception (Fig. 2) and real human attention behavior in visual-auditory environments.
    Second, without using any complex networks, we provide an elegant framework that complementarily integrates multisource, multiscale, and multigranular information (Fig. 1) to formulate pseudofixations that are highly consistent with the real ones. Apart from achieving a significant performance gain, this work also provides a comprehensive solution for mimicking multimodal attention.

Figure 1: STANet+ mainly focuses on devising a weakly supervised approach for the spatial-temporal-audio (STA) fixation prediction task, where the key innovation is that, as one of the first attempts, we automatically convert semantic category tags to pseudofixations via the newly proposed selective class activation mapping (SCAM) and the upgraded version SCAM+ that has been additionally equipped with the multigranularity perception ability. The obtained pseudofixations can be used as the learning objective to guide knowledge distillation to teach two individual fixation prediction networks (i.e., STA and STA+), which jointly enable generic video fixation prediction without requiring any video tags.

Figure 2: Some representative "fixation shifting" cases. Additional multigranularity information (i.e., long/cross-term information) was shown before collecting the fixations in A_SRC. By comparing A_FIX0, A_FIX1, and A_FIX2, we can see that the multigranularity information draws human attention to the most meaningful objects and makes the fixations more focused.

Dependencies

  • Windows 10
  • NVIDIA GeForce RTX 2070 SUPER & NVIDIA GeForce GTX 1080 Ti
  • Python 3.6.4
  • MATLAB R2016b
  • PyTorch 1.8.0
  • soundmodel

Preparation

Download the official pretrained visual and audio models:

Visual: resnext101_32x8d, vgg16
Audio: vggsound, loaded with net = torch.load('vggsound_netvlad').

Downloading the training dataset and testing dataset:

Training dataset: AVE (Audio-Visual Event Localization).
Testing dataset: AVAD, DIEM, SumMe, ETMD, Coutrot.

Training

Note
We use a Fourier transform to turn the audio into the input of the audio stream. Therefore, you first need to run audiostft.py to convert the audio files (.wav) into audio features (.h5).
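
To make the conversion concrete, here is a minimal sketch of a short-time Fourier transform over a .wav file. It is not the repository's audiostft.py: the helper names `read_wav_mono` and `stft_magnitude`, the frame length, the hop size, and the Hann window are all assumptions chosen for illustration, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math
import struct
import wave


def read_wav_mono(path):
    """Read a 16-bit PCM .wav file; return (samples in [-1, 1], sample_rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wf.getnchannels()
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Keep the first channel only and scale to [-1, 1].
    return [v / 32768.0 for v in values[::n_channels]], rate


def stft_magnitude(samples, frame_len=256, hop=128):
    """Return one magnitude spectrum (frame_len // 2 + 1 bins) per frame.

    frame_len and hop are illustrative defaults, not the repository's values.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        # Naive DFT for readability; an FFT library is the practical choice.
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(chunk))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

The resulting frames could then be written to an .h5 file (for example with h5py). That step is left out here because the dataset layout expected by the training code is not described in this README.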

Step 1. SCAM training

Coarse: Train the Scoarse, SAcoarse, and STcoarse branches separately. Because the coarse stage performs coarse localization, the size is set to 256 to ensure object-wise localization accuracy.
Fine: Retrain the Sfine, SAfine, and STfine branches separately. Because the fine stage performs fine localization, the size is set to 356 to ensure region-wise localization exactness.

Step 2. SCAM+ training

S+: Train the S+short, S+long, and S+cross branches separately. Because they all use the same frame-wise relational reasoning network, only the source of the input data needs to change.
SA+: Train the SA+long and SA+cross branches separately.
ST+: Train the ST+short, ST+long, and ST+cross branches separately.
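
Steps 1 and 2 amount to a list of separately trained branches. The sketch below simply enumerates them so the full schedule is visible at a glance. `BranchJob` and `build_training_plan` are hypothetical names, not part of the repository, and `size` is left as `None` for the SCAM+ branches because no value is stated for them above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BranchJob:
    step: str             # "SCAM" or "SCAM+"
    branch: str           # branch label as written in this README
    size: Optional[int]   # 256 / 356 for SCAM; None where no size is stated


def build_training_plan():
    """List every branch from Step 1 and Step 2 in training order."""
    jobs = []
    # Step 1: coarse stage first, then the fine stage is retrained.
    for stage, size in (("coarse", 256), ("fine", 356)):
        for source in ("S", "SA", "ST"):
            jobs.append(BranchJob("SCAM", f"{source}{stage}", size))
    # Step 2: same network per source; only the input data source changes.
    plus_terms = {
        "S+": ("short", "long", "cross"),
        "SA+": ("long", "cross"),
        "ST+": ("short", "long", "cross"),
    }
    for source, terms in plus_terms.items():
        for term in terms:
            jobs.append(BranchJob("SCAM+", f"{source}{term}", None))
    return jobs


for job in build_training_plan():
    print(job.step, job.branch, job.size)
```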

Step 3. pseudoGT generation

To make the matrix data processing easier to inspect, the inter-frame smoothing of the coarse localization results and the post-processing of the pseudoGT data were carried out in MATLAB R2016b.
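
The exact smoothing and post-processing are implemented in the MATLAB scripts and are not specified here. Purely as an illustration of the idea, the sketch below assumes a moving average over neighbouring frames followed by min-max normalization. The function names and the `radius` parameter are hypothetical, and plain Python lists stand in for the matrices.

```python
def smooth_over_time(maps, radius=1):
    """Average each map with its neighbours within `radius` frames.

    maps: list of 2D lists (one localization map per frame, same shape).
    """
    smoothed = []
    for t in range(len(maps)):
        lo, hi = max(0, t - radius), min(len(maps), t + radius + 1)
        window = maps[lo:hi]
        rows, cols = len(maps[t]), len(maps[t][0])
        smoothed.append([
            [sum(m[r][c] for m in window) / len(window) for c in range(cols)]
            for r in range(rows)
        ])
    return smoothed


def normalize_01(single_map):
    """Rescale one map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in single_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in single_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in single_map]
```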

Step 4. STA and STA+ training

Train the STA and STA+ models using the AVE video frames together with the generated pseudoGT.

Testing

Step 1. Run audiostft.py to convert the audio files (.wav) into audio features (.h5).
Step 2. Test the STA and STA+ networks, then fuse their test results to generate the final saliency results (STANet+).

Model weight files for STANet+, STANet, and AudioSwitch:
(Baidu Netdisk, code: 6afo).

Evaluation

We use the evaluation code from the STAViS paper [1] for fair comparisons.
You may need to revise algorithms, data_root, and maps_root defined in main.m.
We provide the saliency maps of the SOTA methods:

(STANet+, STANet, ITTI, GBVS, SCLI, AWS-D, SBF, CAM, GradCAM, GradCAMpp, SGradCAMpp, xGradCAM, SSCAM, ScoCAM, LCAM, ISCAM, ACAM, EGradCAM, ECAM, SPG, VUNP, WSS, MWS, WSSA).
(Baidu Netdisk, code:6afo).
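
For readers who want a quick sanity check outside MATLAB, the sketch below shows standard definitions of three metrics commonly reported for fixation prediction: CC, NSS, and SIM. It is not the evaluation code referenced above, and which metrics that toolbox reports should be checked in main.m. The function names are illustrative, and edge cases such as constant maps or empty fixation maps are not handled.

```python
import math


def _flatten(m):
    """Flatten a 2D list into a list of floats."""
    return [float(v) for row in m for v in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient between two maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, mg = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g))
    var_p = sum((a - mp) ** 2 for a in p)
    var_g = sum((b - mg) ** 2 for b in g)
    return cov / math.sqrt(var_p * var_g)


def nss(pred, fixations):
    """Mean standardized prediction value at fixated pixels.

    fixations: binary map, nonzero where a fixation was recorded.
    """
    p, f = _flatten(pred), _flatten(fixations)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((a - mean) ** 2 for a in p) / len(p))
    scores = [(a - mean) / std for a, fx in zip(p, f) if fx > 0]
    return sum(scores) / len(scores)


def sim(pred, gt):
    """Histogram intersection after normalizing each map to sum to 1."""
    p, g = _flatten(pred), _flatten(gt)
    sp, sg = sum(p), sum(g)
    return sum(min(a / sp, b / sg) for a, b in zip(p, g))
```

For CC, NSS, and SIM, higher values indicate better agreement with the ground truth.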

Qualitative comparisons:

Qualitative results of our method and eight representative saliency models: ITTI, GBVS, SCLI, SBF, AWS-D, WSS, MWS, WSSA. Our method handles various challenging scenes well and produces more accurate results than the other competitors.

Quantitative comparisons:

Quantitative comparisons between our method and other fully-/weakly-/un-supervised methods on 6 datasets. Bold marks the best result; "↑" denotes that a higher score means better performance.

References

[1] Tsiami, A., Koutras, P., Maragos, P. STAViS: Spatio-Temporal AudioVisual Saliency Network. CVPR 2020. https://openaccess.thecvf.com/content_CVPR_2020/papers/Tsiami_STAViS_Spatio-Temporal_AudioVisual_Saliency_Network_CVPR_2020_paper.pdf
[2] Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C. Audio-Visual Event Localization in Unconstrained Videos. ECCV 2018. https://openaccess.thecvf.com/content_ECCV_2018/papers/Yapeng_Tian_Audio-Visual_Event_Localization_ECCV_2018_paper.pdf
[3] Chen, H., Xie, W., Vedaldi, A., Zisserman, A. VGGSound: A Large-Scale Audio-Visual Dataset. ICASSP 2020. https://www.robots.ox.ac.uk/~vgg/publications/2020/Chen20/chen20.pdf

Citation

If you find this work useful for your research, please consider citing the following paper:

@InProceedings{Wang_2021_CVPR,  
    author    = {Wang, Guotao and Chen, Chenglizhao and Fan, Deng-Ping and Hao, Aimin and Qin, Hong},
    title     = {From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach},  
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},  
    month     = {June},  
    year      = {2021},  
    pages     = {15119-15128}  
}  


@misc{wang2021weakly,
    title={Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception}, 
    author={Guotao Wang and Chenglizhao Chen and Dengping Fan and Aimin Hao and Hong Qin},
    year={2021},
    eprint={2112.13697},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}