AudioCLIP Extending CLIP to Image, Text and Audio

Last update: Jan 02, 2023

AudioCLIP

Extending CLIP to Image, Text and Audio

This repository contains the implementation of the models described in the paper arXiv:2106.13043. This work is based on our previous works:

ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021).
ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020).

Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Furthermore, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively).

Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
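The zero-shot classification mentioned in the abstract can be illustrated with a minimal sketch. It assumes that an audio clip and each candidate label text have already been mapped into a shared embedding space, and it picks the label whose embedding is most similar to the audio embedding. The function names, the toy three-dimensional vectors, and the `scale` value are illustrative assumptions and are not taken from this repository's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(audio_embedding, label_embeddings, scale=100.0):
    """Return the best-matching label and a probability per label.

    `label_embeddings` maps a label text to its (hypothetical) text embedding.
    `scale` is an assumed temperature applied to the similarities before softmax.
    """
    labels = list(label_embeddings)
    logits = [
        scale * cosine_similarity(audio_embedding, label_embeddings[label])
        for label in labels
    ]
    probs = softmax(logits)
    best_index = max(range(len(labels)), key=probs.__getitem__)
    return labels[best_index], dict(zip(labels, probs))


# Toy embeddings standing in for real model outputs (assumed values).
toy_audio = [0.9, 0.1, 0.0]
toy_labels = {
    "dog bark": [1.0, 0.0, 0.0],
    "siren": [0.0, 1.0, 0.0],
    "rain": [0.0, 0.0, 1.0],
}
predicted, probabilities = zero_shot_classify(toy_audio, toy_labels)
print(predicted)
```

No class-specific training is involved in this step: adding a new class only requires embedding a new label text, which is what makes the approach applicable to unseen datasets.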

How to Run the Model

The required Python version is >= 3.7.

AudioCLIP

On the ESC-50 dataset

python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50

On the UrbanSound8K dataset

python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K
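When the two commands above need to be launched from a script, the argument list can be assembled programmatically. The sketch below only reuses the config paths and the `--config` and `--Dataset.args.root` flags shown above; the `build_command` helper and the short dataset keys are hypothetical names introduced for illustration.

```python
import shlex
import sys

# Config files as they appear in the commands above.
PROTOCOLS = {
    "esc50": "protocols/audioclip-esc50.json",
    "us8k": "protocols/audioclip-us8k.json",
}


def build_command(dataset, dataset_root, python=sys.executable):
    """Build the argument list for one run (hypothetical helper)."""
    if dataset not in PROTOCOLS:
        raise ValueError(f"unknown dataset key: {dataset!r}")
    return [
        python,
        "main.py",
        "--config",
        PROTOCOLS[dataset],
        "--Dataset.args.root",
        str(dataset_root),
    ]


# Print the command in a shell-safe form without running it.
command = build_command("esc50", "/path/to/ESC50", python="python")
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root once the dataset path points at a real copy of the data.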

Cite Us

@misc{guzhov2021audioclip,
      title={AudioCLIP: Extending CLIP to Image, Text and Audio}, 
      author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
      year={2021},
      eprint={2106.13043},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Comments

Make project usable by other Python projects: remove Git LFS and move files into an audioclip folder

Git LFS was causing problems, so I removed all asset files from it; the files can be found in the release anyway.

It was also somewhat problematic to use this project from other projects because the folder structure was lacking. I moved all files into an "audioclip" folder to fix Python import paths for external projects.

I renamed master to main, but I doubt that this change is going to stay once this pull request is merged.

opened by NotNANtoN

Releases (v0.1)

v0.1 (Jun 29, 2021)
Text embeddings' vocabulary and PyTorch state_dicts containing the weights of the AudioCLIP model trained on AudioSet:

bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)

CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)

ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)

AudioCLIP trained on AudioSet (+ video frames)

AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)

AudioCLIP-Partial-Training.pt – training of the audio-head only

Source code (tar.gz)
Source code (zip)
AudioCLIP-Full-Training.pt (512.41 MB)
AudioCLIP-Partial-Training.pt (512.41 MB)
bpe_simple_vocab_16e6.txt.gz (1.29 MB)
CLIP.pt (389.49 MB)
ESRNXFBSP.pt (119.01 MB)
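Since the release files are described as PyTorch state_dicts, a downloaded checkpoint can be inspected before it is used. The following is a minimal sketch under that assumption: `load_state_dict_file` relies on `torch.load` and imports PyTorch lazily, while `count_parameters_by_prefix` groups parameter counts by the first component of each key. Both helper names are hypothetical, and the key prefixes that actually occur in the released files are not documented here.

```python
from collections import Counter
from pathlib import Path


def load_state_dict_file(path):
    """Load a checkpoint onto the CPU (requires PyTorch to be installed)."""
    import torch  # imported lazily so the rest of the sketch works without it

    return torch.load(Path(path), map_location="cpu")


def count_parameters_by_prefix(state_dict):
    """Sum the number of elements per top-level key prefix.

    Each value is expected to expose a `.shape`; entries without one
    are counted as a single element.
    """
    counts = Counter()
    for name, tensor in state_dict.items():
        prefix = name.split(".", 1)[0]
        elements = 1
        for dim in getattr(tensor, "shape", ()):
            elements *= int(dim)
        counts[prefix] += elements
    return dict(counts)
```

For example, `count_parameters_by_prefix(load_state_dict_file("AudioCLIP-Full-Training.pt"))` would give a rough per-component size breakdown, assuming the file was downloaded into the working directory.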


