Fluency ENhanced Sentence-bert Evaluation (FENSE), a metric for audio caption evaluation, together with the benchmark datasets AudioCaps-Eval and Clotho-Eval.

Overview

FENSE

Fluency ENhanced Sentence-bert Evaluation (FENSE) is a metric for audio caption evaluation, proposed in the paper "Can Audio Captions Be Evaluated with Image Caption Metrics?"

The main branch contains an easy-to-use interface for fast evaluation of an audio captioning system.

An online demo is available at https://share.streamlit.io/blmoistawinde/fense/main/streamlit_demo/app.py .

To get the datasets (AudioCaps-Eval and Clotho-Eval) and the code to reproduce the paper's experiments, please refer to the experiment-code branch.

Installation

Clone the repository and pip install it.

git clone https://github.com/blmoistawinde/fense.git
cd fense
pip install -e .

Usage

Single Sentence

To get the detailed scores of each component for a single sentence:

from fense.evaluator import Evaluator

print("----Using tiny models----")
evaluator = Evaluator(device='cpu', sbert_model='paraphrase-MiniLM-L6-v2', echecker_model='echecker_clotho_audiocaps_tiny')

eval_cap = "An engine in idling and a man is speaking and then"
ref_cap = "A machine makes stitching sounds while people are talking in the background"

score, error_prob, penalized_score = evaluator.sentence_score(eval_cap, [ref_cap], return_error_prob=True)

print("Cand:", eval_cap)
print("Ref:", ref_cap)
print(f"SBERT sim: {score:.4f}, Error Prob: {error_prob:.4f}, Penalized score: {penalized_score:.4f}")

System Score

To get a system's overall score on a dataset (the average of sentence-level FENSE scores), use eval_system.py with your system outputs prepared in the same format as test_data/audiocaps_cands.csv or test_data/clotho_cands.csv.

For the AudioCaps test set:

python eval_system.py --device cuda --dataset audiocaps --cands_dir ./test_data/audiocaps_cands.csv

For the Clotho evaluation set:

python eval_system.py --device cuda --dataset clotho --cands_dir ./test_data/clotho_cands.csv
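
If you would rather compute a system score directly in Python, you can average sentence-level FENSE yourself, which is what the system score amounts to. Below is a minimal sketch built only on the sentence_score call shown above; the candidate and reference lists are placeholders rather than the benchmark files.

from fense.evaluator import Evaluator

evaluator = Evaluator(device='cpu', sbert_model='paraphrase-MiniLM-L6-v2', echecker_model='echecker_clotho_audiocaps_tiny')

# Placeholder system outputs: one candidate caption and a list of references per audio clip
cands = ["An engine is idling while a man speaks"]
list_refs = [["A machine makes stitching sounds while people are talking in the background"]]

penalized_scores = []
for cand, refs in zip(cands, list_refs):
    _, _, penalized = evaluator.sentence_score(cand, refs, return_error_prob=True)
    penalized_scores.append(penalized)

print("System FENSE:", sum(penalized_scores) / len(penalized_scores))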

Performance Benchmark

We benchmark the performance of FENSE with different choices of SBERT model and error detector on the two benchmark datasets, AudioCaps-Eval and Clotho-Eval. (*) marks the combination reported in the paper.

AudioCaps-Eval

| SBERT                     | echecker | HC   | HI   | HM   | MM   | total    |
|---------------------------|----------|------|------|------|------|----------|
| paraphrase-MiniLM-L6-v2   | none     | 62.1 | 98.8 | 93.7 | 75.4 | 80.4     |
| paraphrase-MiniLM-L6-v2   | tiny     | 57.6 | 94.7 | 89.5 | 82.6 | 82.3     |
| paraphrase-MiniLM-L6-v2   | base     | 62.6 | 98   | 82.5 | 85.4 | 85.5     |
| paraphrase-TinyBERT-L6-v2 | none     | 64   | 99.2 | 92.5 | 73.6 | 79.6     |
| paraphrase-TinyBERT-L6-v2 | tiny     | 58.6 | 95.1 | 88.3 | 82.2 | 82.1     |
| paraphrase-TinyBERT-L6-v2 | base     | 64.5 | 98.4 | 91.6 | 84.6 | 85.3 (*) |
| paraphrase-mpnet-base-v2  | none     | 63.1 | 98.8 | 94.1 | 74.1 | 80.1     |
| paraphrase-mpnet-base-v2  | tiny     | 58.1 | 94.3 | 90   | 83.2 | 82.7     |
| paraphrase-mpnet-base-v2  | base     | 63.5 | 98   | 92.5 | 85.9 | 85.9     |

Clotho-Eval

| SBERT                     | echecker | HC   | HI   | HM   | MM   | total    |
|---------------------------|----------|------|------|------|------|----------|
| paraphrase-MiniLM-L6-v2   | none     | 59.5 | 95.1 | 76.3 | 66.2 | 71.3     |
| paraphrase-MiniLM-L6-v2   | tiny     | 56.7 | 90.6 | 79.3 | 70.9 | 73.3     |
| paraphrase-MiniLM-L6-v2   | base     | 60   | 94.3 | 80.6 | 72.3 | 75.3     |
| paraphrase-TinyBERT-L6-v2 | none     | 60   | 95.5 | 75.9 | 66.9 | 71.8     |
| paraphrase-TinyBERT-L6-v2 | tiny     | 59   | 93   | 79.7 | 71.5 | 74.4     |
| paraphrase-TinyBERT-L6-v2 | base     | 60.5 | 94.7 | 80.2 | 72.8 | 75.7 (*) |
| paraphrase-mpnet-base-v2  | none     | 56.2 | 96.3 | 77.6 | 65.2 | 70.7     |
| paraphrase-mpnet-base-v2  | tiny     | 54.8 | 91.8 | 80.6 | 70.1 | 73       |
| paraphrase-mpnet-base-v2  | base     | 57.1 | 95.5 | 81.9 | 71.6 | 74.9     |

Reference

If you use FENSE in your research, please cite:

@misc{zhou2021audio,
      title={Can Audio Captions Be Evaluated with Image Caption Metrics?}, 
      author={Zelin Zhou and Zhiling Zhang and Xuenan Xu and Zeyu Xie and Mengyue Wu and Kenny Q. Zhu},
      year={2021},
      eprint={2110.04684},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}