Code for Discriminative Sounding Objects Localization (NeurIPS 2020)

Last update: Dec 11, 2022

Overview

Discriminative Sounding Objects Localization

Code for our NeurIPS 2020 paper Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching (previously titled Learning to Discriminatively Localize Sounding Objects in a Cocktail-party Scenario). The code is implemented in PyTorch with Python 3.

Requirements

PyTorch 1.1
torchvision
scikit-learn
librosa
Pillow
opencv

Running Procedure

The training and evaluation procedures for MUSIC and AudioSet-instrument are similar; the code is in the folders music-exp and audioset-instrument, respectively. Here, we take the experiments on the MUSIC dataset as an example.

Data Preparation

Download a dataset, e.g., MUSIC, and split it into training/validation/testing sets. Specifically, for stage-one training, please use solo_training_1.txt. For stage-two training, we use the music clips in solo_training_2.txt to synthesize the cocktail-party scenarios.
Extract frames at 4 fps by running
```
python3 data/cut_video.py
```
Extract 1-second audio clips and convert them into log-mel spectrograms by running
```
python3 data/cut_audio.py
```
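
The two extraction steps above come down to simple index arithmetic. The following is a minimal sketch of that arithmetic, not the code in `data/cut_video.py` or `data/cut_audio.py`; the function names and the example values (a 30 fps source video, a 16 kHz sample rate) are illustrative assumptions.

```python
def frame_indices(num_frames, video_fps, target_fps=4):
    """Indices of the source frames to keep when resampling to target_fps."""
    if video_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    duration = num_frames / video_fps
    n_out = int(duration * target_fps)
    # For each target timestamp i / target_fps, take the source frame at or
    # just before it (floor), clamped to the last available frame.
    return [min(num_frames - 1, int(i * video_fps / target_fps)) for i in range(n_out)]


def audio_clip_bounds(num_samples, sample_rate, clip_seconds=1.0):
    """(start, end) sample offsets of consecutive, non-overlapping clips."""
    clip_len = int(sample_rate * clip_seconds)
    # A trailing remainder shorter than one clip is dropped.
    return [(s, s + clip_len) for s in range(0, num_samples - clip_len + 1, clip_len)]


# Example: a 2-second video at 30 fps and 35000 audio samples at 16 kHz (assumed values)
print(frame_indices(60, 30))
print(audio_clip_bounds(35000, 16000))
```

Each 1-second audio clip would then be converted to a log-mel spectrogram (the requirements list librosa for this) and paired with the frames that fall inside the same second.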

The sounding-object bounding box annotations for solo and duet are stored in music-exp/solotest.json and music-exp/duettest.json, and the data and annotations of the synthetic set are available at https://zenodo.org/record/4079386#.X4PFodozbb2 . The bounding box annotations for the AudioSet-instrument balanced subset are in audioset-instrument/audioset_box.json

Training

Stage one

training_stage_one.py [-h]
optional arguments:
[--batch_size] training batch size
[--learning_rate] learning rate
[--epoch] total number of training epochs
[--evaluate] whether to only run testing or also train
[--use_pretrain] whether to initialize from a checkpoint
[--ckpt_file] path of the checkpoint file to resume from
[--use_class_task] whether to alternate between localization and classification training
[--class_iter] classification training iterations per epoch
[--mask] mask threshold that separates object from background
[--cluster] number of clusters for discrimination

python3 training_stage_one.py
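
For orientation, the options listed above could be declared with `argparse` roughly as follows. This is a sketch, not the parser in `training_stage_one.py`: the option names come from the list above, but every type and default value here is an assumption, so check `python3 training_stage_one.py -h` for the real ones.

```python
import argparse


def build_parser():
    # Option names mirror the list above; all types and defaults are assumed.
    p = argparse.ArgumentParser(description="Stage-one training (illustrative sketch)")
    p.add_argument("--batch_size", type=int, default=32, help="training batch size")
    p.add_argument("--learning_rate", type=float, default=1e-4, help="learning rate")
    p.add_argument("--epoch", type=int, default=100, help="total number of training epochs")
    p.add_argument("--evaluate", type=int, default=0, help="1: only test, 0: also train")
    p.add_argument("--use_pretrain", type=int, default=0, help="1: initialize from a checkpoint")
    p.add_argument("--ckpt_file", type=str, default="", help="checkpoint file to resume from")
    p.add_argument("--use_class_task", type=int, default=1,
                   help="1: alternate localization and classification training")
    p.add_argument("--class_iter", type=int, default=3,
                   help="classification iterations per epoch")
    p.add_argument("--mask", type=float, default=0.05,
                   help="object/background mask threshold")
    p.add_argument("--cluster", type=int, default=11, help="number of clusters")
    return p


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--batch_size", "16", "--cluster", "11"])
print(args.batch_size, args.learning_rate, args.cluster)
```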

After stage-one training, the cluster pseudo labels and the object dictionary of the different classes are saved in the folder ./obj_features. They are then used in stage-two training as the category-aware object representation reference.

Stage two

training_stage_two.py [-h]
optional arguments:
[--batch_size] training batch size
[--learning_rate] learning rate
[--epoch] total number of training epochs
[--evaluate] whether to only run testing or also train
[--use_pretrain] whether to initialize from a checkpoint
[--ckpt_file] path of the checkpoint file to resume from

python3 training_stage_two.py

Evaluation

Stage one

We first generate the localization results and save them as a pkl file, then calculate the metrics (IoU and AUC) and generate visualizations by running

python3 test.py
python3 tools.py
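
As a rough illustration of the two metrics named above, the sketch below computes a box IoU and an AUC defined as the area under the success-rate curve over IoU thresholds. Both the `(x1, y1, x2, y2)` box representation and this particular AUC definition are assumptions; `tools.py` may compute them differently, for example on thresholded localization maps rather than boxes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_rate(ious, threshold):
    """Fraction of samples whose IoU is at least the threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)


def auc(ious, steps=20):
    """Trapezoidal area under success rate vs. IoU threshold on [0, 1]."""
    thresholds = [i / steps for i in range(steps + 1)]
    rates = [success_rate(ious, t) for t in thresholds]
    return sum((rates[i] + rates[i + 1]) / 2 / steps for i in range(steps))


# Toy example with hypothetical predicted and annotated boxes
predicted = [(0, 0, 2, 2), (1, 1, 4, 4)]
annotated = [(1, 1, 3, 3), (1, 1, 4, 4)]
ious = [box_iou(p, g) for p, g in zip(predicted, annotated)]
print(ious, success_rate(ious, 0.5), auc(ious))
```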

Stage two

To evaluate stage two, i.e., class-aware sounding object localization in multi-source scenes, we first match the cluster pseudo labels generated in stage one with the ground-truth labels, so that each center representation in the object dictionary is assigned one object category, by running

python3 match_cluster.py

It is necessary to manually ensure there is one-to-one matching between object category and each center representation.
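
The matching step can be pictured as a majority vote followed by a one-to-one check. The sketch below is not `match_cluster.py`; the function names and the toy labels are hypothetical, and it only flags conflicts, which would then need the manual resolution mentioned above.

```python
from collections import Counter, defaultdict


def match_clusters(cluster_ids, gt_labels):
    """Map each cluster id to the ground-truth label it co-occurs with most."""
    votes = defaultdict(Counter)
    for c, g in zip(cluster_ids, gt_labels):
        votes[c][g] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def conflicting_labels(mapping):
    """Ground-truth labels claimed by more than one cluster (empty if one-to-one)."""
    claimed = Counter(mapping.values())
    return sorted(label for label, n in claimed.items() if n > 1)


# Toy data: per-sample cluster pseudo labels and ground-truth categories (hypothetical)
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["guitar", "guitar", "violin", "violin", "violin", "flute", "flute", "guitar"]
mapping = match_clusters(clusters, labels)
print(mapping)
print(conflicting_labels(mapping))
```

If `conflicting_labels` returns a non-empty list, two or more center representations were assigned the same category and the assignment has to be corrected by hand.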

Then we generate the localization results and calculate the metrics (CIoU, AUC and NSA) by running

python3 test_stage_two.py
python3 eval.py

Results

The two tables respectively show our model's performance on single-source and multi-source scenarios.

The following figures show the category-aware localization results in multi-source scenes. Green boxes mark sounding objects; red boxes mark silent ones.
