Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Last update: Jan 02, 2023


Overview

AudioCLIP

Extending CLIP to Image, Text and Audio

This repository contains the implementation of the models described in the paper arXiv:2106.13043. This work is based on our previous works:

ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021).
ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020).

Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, out-performing other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further it sets new baselines in the zero-shot ESC-task on the same datasets (68.78% and 69.40%, respectively).

Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
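
To make the zero-shot idea from the abstract concrete, here is a minimal sketch under stated assumptions: an audio clip and a set of text labels have already been embedded into a shared space, and the predicted label is the one whose text embedding has the highest cosine similarity to the audio embedding. The vectors below are made-up toy values and the function names are hypothetical; this is not the repository's code and does not call the AudioCLIP model.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_classify(audio_emb: Sequence[float],
                       text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text embedding is closest to the audio embedding."""
    return max(text_embs, key=lambda label: cosine(audio_emb, text_embs[label]))


# Toy 3-dimensional embeddings (assumed values, for illustration only).
label_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.2, 0.9],
    "rain": [0.1, 0.9, 0.1],
}
audio_embedding = [0.1, 0.3, 0.8]

print(zero_shot_classify(audio_embedding, label_embeddings))  # prints "siren"
```

Because the label set is just a dictionary of text embeddings, new classes can be added at inference time without retraining, which is what makes the setting zero-shot.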

Downloading Pre-Trained Weights

The pre-trained model can be downloaded from the releases.

# AudioCLIP trained on AudioSet (text-, image- and audio-head simultaneously)
wget https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/AudioCLIP-Full-Training.pt
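
Where wget is not available, the same download can be sketched in Python with the standard library only. The URL layout is copied from the command above; the helper names and the `assets` target directory are hypothetical, and the assumption that other release assets follow the same URL pattern is not confirmed by this document.

```python
import shutil
import urllib.request
from pathlib import Path

# Base URL taken from the wget command above.
RELEASE_BASE = "https://github.com/AndreyGuzhov/AudioCLIP/releases/download"


def release_url(asset: str, tag: str = "v0.1") -> str:
    """Build the download URL for a release asset (assumed URL layout)."""
    return f"{RELEASE_BASE}/{tag}/{asset}"


def download(asset: str, dest_dir: str = "assets") -> Path:
    """Download one asset into dest_dir, skipping files that already exist."""
    target = Path(dest_dir) / asset
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so the large file is never held in memory.
    with urllib.request.urlopen(release_url(asset)) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download("AudioCLIP-Full-Training.pt")` would fetch the fully trained checkpoint; the call is left out of the block so that importing it does not trigger a large download.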

How to Run the Model

The required Python version is >= 3.7.

AudioCLIP

On the ESC-50 dataset

python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50

On the UrbanSound8K dataset

python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K
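
The two commands above differ only in the protocol file and the dataset root, so they can be driven from one small wrapper. The sketch below assumes it is run from the repository root (where `main.py` lives); the function names and the `PROTOCOLS` mapping are hypothetical conveniences, not part of the repository.

```python
import subprocess
import sys
from typing import List

# Protocol files named in the commands above.
PROTOCOLS = {
    "esc50": "protocols/audioclip-esc50.json",
    "us8k": "protocols/audioclip-us8k.json",
}


def build_command(dataset: str, root: str) -> List[str]:
    """Return the argument list for one of the two documented runs."""
    if sys.version_info < (3, 7):
        raise RuntimeError("AudioCLIP requires Python >= 3.7")
    return [
        sys.executable, "main.py",
        "--config", PROTOCOLS[dataset],
        "--Dataset.args.root", root,
    ]


def run(dataset: str, root: str) -> int:
    """Run main.py with the chosen protocol and return its exit code."""
    return subprocess.run(build_command(dataset, root), check=False).returncode
```

For example, `run("esc50", "/path/to/ESC50")` launches the same process as the first command, using the interpreter that is executing the wrapper.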

Cite Us

@misc{guzhov2021audioclip,
      title={AudioCLIP: Extending CLIP to Image, Text and Audio}, 
      author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
      year={2021},
      eprint={2106.13043},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}


Comments

Make project usable by other Python projects: remove Git LFS and move files into an audioclip folder

Git LFS was causing problems, so I removed all asset files from it; the files can be found under "Releases" anyway.

The project was also awkward to use from other projects because of its flat folder structure. I moved all files into an "audioclip" folder so that Python imports resolve correctly in external projects.

I renamed master to main, but I doubt that this change is going to stay once this pull request is merged.

opened by NotNANtoN

Releases (v0.1)

v0.1 (Jun 29, 2021)
Text-embedding vocabulary and PyTorch state_dicts containing the weights of the AudioCLIP model trained on AudioSet:

bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)

CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)

ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)

AudioCLIP trained on AudioSet (+ video frames)

AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)

AudioCLIP-Partial-Training.pt – training of the audio-head only

AudioCLIP-Full-Training.pt(512.41 MB)
AudioCLIP-Partial-Training.pt(512.41 MB)
bpe_simple_vocab_16e6.txt.gz(1.29 MB)
CLIP.pt(389.49 MB)
ESRNXFBSP.pt(119.01 MB)
