Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

Related tags

Deep Learningast
Overview

AST: Audio Spectrogram Transformer

Introduction

Illustration of AST.

This repository contains the official implementation (in PyTorch) of the Audio Spectrogram Transformer (AST) proposed in the Interspeech 2021 paper AST: Audio Spectrogram Transformer (Yuan Gong, Yu-An Chung, James Glass).

AST is the first convolution-free, purely attention-based model for audio classification which supports variable length input and can be applied to various tasks. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2. For details, please refer to the paper and the ISCA SIGML talk.

Please have a try! AST can be used with a few lines of code, and we also provide recipes to reproduce the SOTA results on AudioSet, ESC-50, and Speechcommands with almost one click.

The AST model file is in src/models/ast_models.py, the recipes are in egs/[audioset,esc50,speechcommands]/run.sh, when you run run.sh, it will call /src/run.py, which will then call /src/dataloader.py and /src/traintest.py, which will then call /src/models/ast_models.py.

Citing

Please cite our paper(s) if you find this repository useful. The first paper proposes the Audio Spectrogram Transformer while the second paper describes the training pipeline that we applied on AST to achieve the new state-of-the-art on AudioSet.

@article{gong2021ast,  
 title={Ast: Audio spectrogram transformer}, 
 author={Gong, Yuan and Chung, Yu-An and Glass, James}, 
 journal={arXiv preprint arXiv:2104.01778}, 
 year={2021}}  
@article{gong2021psla,  
 title={PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation}, 
 author={Gong, Yuan and Chung, Yu-An and Glass, James}, 
 journal={arXiv preprint arXiv:2102.01243}, 
 year={2021}}  

Getting Started

Step 1. Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.

cd ast/ 
python3 -m venv venvast
source venvast/bin/activate
pip install -r requirements.txt 

Step 2. Test the AST model.

ASTModel(label_dim=527, \
         fstride=10, tstride=10, \
         input_fdim=128, input_tdim=1024, \
         imagenet_pretrain=True, audioset_pretrain=False, \
         model_size='base384')

Parameters:
label_dim : The number of classes (default:527).
fstride: The stride of patch spliting on the frequency dimension, for 16*16 patchs, fstride=16 means no overlap, fstride=10 means overlap of 6 (used in the paper). (default:10)
tstride: The stride of patch spliting on the time dimension, for 16*16 patchs, tstride=16 means no overlap, tstride=10 means overlap of 6 (used in the paper). (default:10)
input_fdim: The number of frequency bins of the input spectrogram. (default:128)
input_tdim: The number of time frames of the input spectrogram. (default:1024, i.e., 10.24s)
imagenet_pretrain: If True, use ImageNet pretrained model. (default: True, we recommend to set it as True for all tasks.)
audioset_pretrain: IfTrue, use full AudioSet And ImageNet pretrained model. Currently only support base384 model with fstride=tstride=10. (default: False, we recommend to set it as True for all tasks except AudioSet.)
model_size: The model size of AST, should be in [tiny224, small224, base224, base384] (default: base384).

cd ast/src
python
import os 
import torch
from models import ASTModel 
# download pretrained model in this directory
os.environ['TORCH_HOME'] = '../pretrained_models'  
# assume each input spectrogram has 100 time frames
input_tdim = 100
# assume the task has 527 classes
label_dim = 527
# create a pseudo input: a batch of 10 spectrogram, each with 100 time frames and 128 frequency bins 
test_input = torch.rand([10, input_tdim, 128]) 
# create an AST model
ast_mdl = ASTModel(label_dim=label_dim, input_tdim=input_tdim, imagenet_pretrain=True)
test_output = ast_mdl(test_input) 
# output should be in shape [10, 527], i.e., 10 samples, each with prediction of 527 classes. 
print(test_output.shape)  

ESC-50 Recipe

The ESC-50 recipe is in ast/egs/esc50/run_esc.sh, the script will automatically download the ESC-50 dataset and resample it to 16kHz, then run standard 5-cross validation and report the result. The recipe was tested on 4 GTX TITAN GPUs with 12GB memory. The result is saved in ast/egs/esc50/exp/yourexpname/acc_fold.csv (the accuracy of fold 1-5 and the averaged accuracy), you can also check details in result.csv and best_result.csv (accuracy, AUC, loss, etc of each epoch / best epoch). We attached our log file in ast/egs/esc50/test-esc50-f10-t10-p-b48-lr1e-5, the model achieves 95.75% accuracy.

To run the recipe, simply comment out . /data/sls/scratch/share-201907/slstoolchainrc in ast/egs/esc50/run_esc.sh, adjust the path if needed, and run:

cd ast/egs/esc50
(slurm user) sbatch run_esc50.sh
(local user) ./run_esc50.sh

Speechcommands V2 Recipe

The Speechcommands recipe is in ast/egs/speechcommands/run_sc.sh, the script will automatically download the Speechcommands V2 dataset, train an AST model on the training set, validate it on the validation set, and evaluate it on the test set. The recipe was tested on 4 GTX TITAN GPUs with 12GB memory. The result is saved in ast/egs/speechcommands/exp/yourexpname/eval_result.csv in format [val_acc, val_AUC, eval_acc, eval_AUC], you can also check details in result.csv (accuracy, AUC, loss, etc of each epoch). We attached our log file in ast/egs/speechcommends/test-speechcommands-f10-t10-p-b128-lr2.5e-4-0.5-false, the model achieves 98.12% accuracy.

To run the recipe, simply comment out . /data/sls/scratch/share-201907/slstoolchainrc in ast/egs/esc50/run_sc.sh, adjust the path if needed, and run:

cd ast/egs/speechcommands
(slurm user) sbatch run_sc.sh
(local user) ./run_sc.sh

Audioset Recipe

Audioset is a little bit more complex, you will need to prepare your data json files (i.e., train_data.json and eval_data.json) by your self. The reason is that the raw wavefiles of Audioset is not released and you need to download them by yourself. We have put a sample json file in ast/egs/audioset/data/datafiles, please generate files in the same format (You can also refer to ast/egs/esc50/prep_esc50.py and ast/egs/speechcommands/prep_sc.py.). Please keep the label code consistent with ast/egs/audioset/data/class_labels_indices.csv.

Once you have the json files, you will need to generate the sampling weight file of your training data (please check our PSLA paper to see why it is needed).

cd ast/egs/audioset
python gen_weight_file.py ./data/datafiles/train_data.json

Then you just need to change the tr_data and te_data in /ast/egs/audioset/run.sh and then

cd ast/egs/audioset
(slurm user) sbatch run.sh
(local user) ./run.sh

You should get a model achieves 0.448 mAP (without weight averaging) and 0.459 (with weight averaging). This is the best single model reported in the paper. The result of each epoch is saved in ast/egs/audioset/exp/yourexpname/result.csv in format [mAP, mAUC, precision, recall, d_prime, train_loss, valid_loss, cum_mAP, cum_mAUC, lr] , where cum_ results are the checkpoint ensemble results (i.e., averaging the prediction of checkpoint models of each epoch, please check our PSLA paper for details). The result of weighted averaged model is saved in wa_result.csv in format [mAP, AUC, precision, recall, d-prime]. We attached our log file in ast/egs/audioset/test-full-f10-t10-pTrue-b12-lr1e-5/, the model achieves 0.459 mAP.

In order to reproduce ensembe results of 0.475 mAP and 0.485 mAP, please train 3 models use the same setting (i.e., repeat above three times) and train 6 models with different tstride and fstride, and average the output of the models. Please refer to ast/egs/audioset/ensemble.py. We attached our ensemble log in /ast/egs/audioset/exp/ensemble-s.log and ensemble-m.log. You can use our pretrained models (see below) to test ensemble result.

Pretrained Models

We provide full AudioSet pretrained models.

  1. Full AudioSet, 10 tstride, 10 fstride, with Weight Averaging (0.459 mAP)
  2. Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 1 (0.450 mAP)
  3. Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 2 (0.448 mAP)
  4. Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 3 (0.448 mAP)
  5. Full AudioSet, 12 tstride, 12 fstride, without Weight Averaging, Model (0.447 mAP)
  6. Full AudioSet, 14 tstride, 14 fstride, without Weight Averaging, Model (0.443 mAP)
  7. Full AudioSet, 16 tstride, 16 fstride, without Weight Averaging, Model (0.442 mAP)

Ensemble model 2-4 achieves 0.475 mAP, Ensemble model 2-7 achieves and 0.485 mAP. You can download these models at one click using ast/egs/audioset/download_models.sh. Once you download the model, you can try ast/egs/audioset/ensemble.py, you need to change the eval_data_path and mdl_list to run it. We attached our ensemble log in /ast/egs/audioset/exp/ensemble-s.log and ensemble-m.log.

If you want to finetune AudioSet-pretrained AST model on your task, you can simply set the audioset_pretrain=True when you create the AST model, it will automatically download model 1 (0.459 mAP). In our ESC-50 recipe, AudioSet pretraining is used.

Contact

If you have a question, please bring up an issue (preferred) or send me an email [email protected].

Owner
Yuan Gong
Ph.D in CS
Yuan Gong
Testability-Aware Low Power Controller Design with Evolutionary Learning, ITC2021

Testability-Aware Low Power Controller Design with Evolutionary Learning This repo contains the source code of Testability-Aware Low Power Controller

Lee Man 1 Dec 26, 2021
Code for Neurips2021 Paper "Topology-Imbalance Learning for Semi-Supervised Node Classification".

Topology-Imbalance Learning for Semi-Supervised Node Classification Introduction Code for NeurIPS 2021 paper "Topology-Imbalance Learning for Semi-Sup

Victor Chen 40 Nov 23, 2022
Winners of the Facebook Image Similarity Challenge

Winners of the Facebook Image Similarity Challenge

DrivenData 111 Jan 05, 2023
FedScale: Benchmarking Model and System Performance of Federated Learning

FedScale: Benchmarking Model and System Performance of Federated Learning (Paper) This repository contains scripts and instructions of building FedSca

268 Jan 01, 2023
CVPR 2021 - Official code repository for the paper: On Self-Contact and Human Pose.

TUCH This repo is part of our project: On Self-Contact and Human Pose. [Project Page] [Paper] [MPI Project Page] License Software Copyright License fo

Lea Müller 45 Jan 07, 2023
Robotics environments

Robotics environments Details and documentation on these robotics environments are available in OpenAI's blog post and the accompanying technical repo

Farama Foundation 121 Dec 28, 2022
PyTorch implementation of residual gated graph ConvNets, ICLR’18

Residual Gated Graph ConvNets April 24, 2018 Xavier Bresson http://www.ntu.edu.sg/home/xbresson https://github.com/xbresson https://twitter.com/xbress

Xavier Bresson 112 Aug 10, 2022
A small tool to joint picture including gif

README 做设计的时候遇到拼接长图的情况,但是发现没有什么好用的能拼接gif的工具。 于是自己写了个gif拼接小工具。 可以自动拼接gif、png和jpg等常见格式。 效果 从上至下 从下至上 从左至右 从右至左 使用 克隆仓库 git clone https://github.com/Dels

3 Dec 15, 2021
Download & Install mods for your favorit game with a few simple clicks

Husko's SteamWorkshop Downloader 🔴 IMPORTANT ❗ 🔴 The Tool is currently being rewritten so updates will be slow and only on the dev branch until it i

Husko 67 Nov 25, 2022
Source code for the paper: Variance-Aware Machine Translation Test Sets (NeurIPS 2021 Datasets and Benchmarks Track)

Variance-Aware-MT-Test-Sets Variance-Aware Machine Translation Test Sets License See LICENSE. We follow the data licensing plan as the same as the WMT

NLP2CT Lab, University of Macau 5 Dec 21, 2021
LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≥10x.

LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≥10x.

Simon Boehm 183 Jan 02, 2023
Meandering In Networks of Entities to Reach Verisimilar Answers

MINERVA Meandering In Networks of Entities to Reach Verisimilar Answers Code and models for the paper Go for a Walk and Arrive at the Answer - Reasoni

Shehzaad Dhuliawala 271 Dec 13, 2022
Implementation for the paper 'YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs'

YOLO-ReT This is the original implementation of the paper: YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs. Prakhar Ganesh, Ya

69 Oct 19, 2022
DeepLab resnet v2 model in pytorch

pytorch-deeplab-resnet DeepLab resnet v2 model implementation in pytorch. The architecture of deepLab-ResNet has been replicated exactly as it is from

Isht Dwivedi 601 Dec 22, 2022
Release of SPLASH: Dataset for semantic parse correction with natural language feedback in the context of text-to-SQL parsing

SPLASH: Semantic Parsing with Language Assistance from Humans SPLASH is dataset for the task of semantic parse correction with natural language feedba

Microsoft Research - Language and Information Technologies (MSR LIT) 35 Oct 31, 2022
Pre-trained BERT Models for Ancient and Medieval Greek, and associated code for LaTeCH 2021 paper titled - "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek"

Ancient Greek BERT The first and only available Ancient Greek sub-word BERT model! State-of-the-art post fine-tuning on Part-of-Speech Tagging and Mor

Pranaydeep Singh 22 Dec 08, 2022
Hippocampal segmentation using the UNet network for each axis

Hipposeg Hippocampal segmentation using the UNet network for each axis, inspired by https://github.com/MICLab-Unicamp/e2dhipseg Red: False Positive Gr

Juan Carlos Aguirre Arango 0 Sep 02, 2021
As-ViT: Auto-scaling Vision Transformers without Training

As-ViT: Auto-scaling Vision Transformers without Training [PDF] Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou In ICLR 2

VITA 68 Sep 05, 2022
An efficient 3D semantic segmentation framework for Urban-scale point clouds like SensatUrban, Campus3D, etc.

An efficient 3D semantic segmentation framework for Urban-scale point clouds like SensatUrban, Campus3D, etc.

Zou 33 Jan 03, 2023
Dataloader tools for language modelling

Installation: pip install lm_dataloader Design Philosophy A library to unify lm dataloading at large scale Simple interface, any tokenizer can be inte

5 Mar 25, 2022