Unified MultiWOZ evaluation scripts for the context-to-response task.

Overview

MultiWOZ Context-to-Response Evaluation

Standardized and easy to use Inform, Success, BLEU

~ See the paper ~

 


Easy-to-use scripts for standardized evaluation of response generation on the MultiWOZ benchmark. This repository contains an implementation of the MultiWOZ database with fuzzy matching, functions for normalization of slot names and values, and a careful implementation of the BLEU score and Inform & Succes rates.

🚀 Usage

Install the repository:

pip install git+https://github.com/Tomiinek/[email protected]

Use it directly from your code. Instantiate an evaluator and then call the evaluate method with dictionary of your predictions with a specific format (described later). Set bleu to evaluate the BLEU score, success to get the Success & Inform rate, and use richness for getting lexical richness metrics such as the number of unique unigrams, trigrams, token entropy, bigram conditional entropy, corpus MSTTR-50, and average turn length. Pseudo-code:

from mwzeval.metrics import Evaluator
...

e = Evaluator(bleu=True, success=False, richness=False)
my_predictions = {}
for item in data:
    my_predictions[item.dialog_id] = model.predict(item)
    ...
    
results = e.evaluate(my_predictions)
print(f"Epoch {epoch} BLEU: {results}")

Alternative usage:

git clone https://github.com/Tomiinek/MultiWOZ_Evaluation.git && cd MultiWOZ_Evaluation
pip install -r requirements.txt

And evaluate you predictions from the input file:

python evaluate.py [--bleu] [--success] [--richness] --input INPUT.json [--output OUTPUT.json]

Set the options --bleu, --success, and --richness as you wish.

Input format:

{
    "xxx0000" : [
        {
            "response": "Your generated delexicalized response.",
            "state": {
                "restaurant" : {
                    "food" : "eatable"
                }, ...
            }, 
            "active_domains": ["restaurant"]
        }, ...
    ], ...
}

The input to the evaluator should be a dictionary (or a .json file) with keys matching dialogue ids in the xxx0000 format (e.g. sng0073 instead of SNG0073.json), and values containing a list of turns. Each turn is a dictionary with keys:

  • response – Your generated delexicalized response. You can use either the slot names with domain names, e.g. restaurant_food, or the domain adaptive delexicalization scheme, e.g. food.

  • stateOptional, the predicted dialog state. If not present (for example in the case of policy optimization models), the ground truth dialog state from MultiWOZ 2.2 is used during the Inform & Success computation. Slot names and values are normalized prior the usage.

  • active_domainsOptional, list of active domains for the corresponding turn. If not present, the active domains are estimated from changes in the dialog state during the Inform & Success rate computation. If your model predicts the domain for each turn, place them here. If you use domains in slot names, run the following command to extract the active domains from slot names automatically:

    python add_slot_domains.py [-h] -i INPUT.json -o OUTPUT.json

See the predictions folder with examples.

Output format:

{
    "bleu" : {'damd': … , 'uniconv': … , 'hdsa': … , 'lava': … , 'augpt': … , 'mwz22': … },
    "success" : {
        "inform"  : {'attraction': … , 'hotel': … , 'restaurant': … , 'taxi': … , 'total': … , 'train': … },
        "success" : {'attraction': … , 'hotel': … , 'restaurant': … , 'taxi': … , 'total': … , 'train': … },
    },
    "richness" : {
        'entropy': … , 'cond_entropy': … , 'avg_lengths': … , 'msttr': … , 
        'num_unigrams': … , 'num_bigrams': … , 'num_trigrams': … 
    }
}

The evaluation script outputs a dictionary with keys bleu, success, and richness corresponding to BLEU, Inform & Success rates, and lexical richness metrics, respectively. Their values can be None if not evaluated, otherwise:

  • BLEU results contain multiple scores corresponding to different delexicalization styles and refernces. Currently included references are DAMD, HDSA, AuGPT, LAVA, UniConv, and MultiWOZ 2.2 whitch we consider to be the canonical one that should be reported in the future.
  • Inform & Succes rates are reported for each domain (i.e. attraction, restaurant, hotel, taxi, and train in case of the test set) separately and in total.
  • Lexical richness contains the number of distinct uni-, bi-, and tri-grams, average number of tokens in generated responses, token entropy, conditional bigram entropy, and MSTTR-50 calculated on concatenated responses.

Secret feature

You can use this code even for evaluation of dialogue state tracking (DST) on MultiWOZ 2.2. Set dst=True during initialization of the Evaluator to get joint state accuracy, slot precision, recall, and F1. Note that the resulting numbers are very different from the DST results in the original MultiWOZ evaluation. This is because we use slot name and value normalization, and careful fuzzy slot value matching.

🏆 Results

Please see the orginal MultiWOZ repository for the benchmark results.

👏 Contributing

  • If you would like to add your results, modify the particular table in the original reposiotry via a pull request, add the file with predictions into the predictions folder in this repository, and create another pull request here.
  • If you need to update the slot name mapping because of your different delexicalization style, feel free to make the changes, and create a pull request.
  • If you would like to improve normalization of slot values, add your new rules, and create a pull request.

💭 Citation

@inproceedings{nekvinda-dusek-2021-shades,
    title = "Shades of {BLEU}, Flavours of Success: The Case of {M}ulti{WOZ}",
    author = "Nekvinda, Tom{\'a}{\v{s}} and Du{\v{s}}ek, Ond{\v{r}}ej",
    booktitle = "Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.gem-1.4",
    doi = "10.18653/v1/2021.gem-1.4",
    pages = "34--46"
}

Owner
Tomáš Nekvinda
Wisdom giver, bacon & eggs master, ant lover
Tomáš Nekvinda
Human pose estimation from video plays a critical role in various applications such as quantifying physical exercises, sign language recognition, and full-body gesture control.

Pose Detection Project Description: Human pose estimation from video plays a critical role in various applications such as quantifying physical exerci

Hassan Shahzad 2 Jan 17, 2022
Official Pytorch implementation of "CLIPstyler:Image Style Transfer with a Single Text Condition"

CLIPstyler Official Pytorch implementation of "CLIPstyler:Image Style Transfer with a Single Text Condition" Environment Pytorch 1.7.1, Python 3.6 $ c

201 Dec 29, 2022
A framework for attentive explainable deep learning on tabular data

🧠 kendrite A framework for attentive explainable deep learning on tabular data 💨 Quick start kedro run 🧱 Built upon Technology Description Links ke

Marnix Koops 3 Nov 06, 2021
Finetuner allows one to tune the weights of any deep neural network for better embeddings on search tasks

Finetuner allows one to tune the weights of any deep neural network for better embeddings on search tasks

Jina AI 794 Dec 31, 2022
PyTorch implementation of Wide Residual Networks with 1-bit weights by McDonnell (ICLR 2018)

1-bit Wide ResNet PyTorch implementation of training 1-bit Wide ResNets from this paper: Training wide residual networks for deployment using a single

Sergey Zagoruyko 122 Dec 07, 2022
Official implementation of paper "Query2Label: A Simple Transformer Way to Multi-Label Classification".

Introdunction This is the official implementation of the paper "Query2Label: A Simple Transformer Way to Multi-Label Classification". Abstract This pa

Shilong Liu 274 Dec 28, 2022
Mixed Transformer UNet for Medical Image Segmentation

MT-UNet Update 2022/01/05 By another round of training based on previous weights, our model also achieved a better performance on ACDC (91.61% DSC). W

dotman 92 Dec 25, 2022
Investigating automatic navigation towards standard US views integrating MARL with the virtual US environment developed in CT2US simulation

AutomaticUSnavigation Investigating automatic navigation towards standard US views integrating MARL with the virtual US environment developed in CT2US

Cesare Magnetti 6 Dec 05, 2022
SCNet: Learning Semantic Correspondence

SCNet Code Region matching code is contributed by Kai Han ([email protected]). Dense

Kai Han 34 Sep 06, 2022
EFENet: Reference-based Video Super-Resolution with Enhanced Flow Estimation

EFENet EFENet: Reference-based Video Super-Resolution with Enhanced Flow Estimation Code is a bit messy now. I woud clean up soon. For training the EF

Yaping Zhao 19 Nov 05, 2022
OpenDelta - An Open-Source Framework for Paramter Efficient Tuning.

OpenDelta is a toolkit for parameter efficient methods (we dub it as delta tuning), by which users could flexibly assign (or add) a small amount parameters to update while keeping the most paramters

THUNLP 386 Dec 26, 2022
YKKDetector For Python

YKKDetector OpenCVを利用した機械学習データをもとに、VRChatのスクリーンショットなどからYKKさん(もとい「幽狐族のお姉様」)を検出できるソフトウェアです。 マニュアル こちらから実行環境のセットアップから解説する詳細なマニュアルをご覧いただけます。 ライセンス 本ソフトウェア

あんふぃとらいと 5 Dec 07, 2021
Data and extra materials for the food safety publications classifier

Data and extra materials for the food safety publications classifier The subdirectories contain detailed descriptions of their contents in the README.

1 Jan 20, 2022
[ICCV'21] Neural Radiance Flow for 4D View Synthesis and Video Processing

NeRFlow [ICCV'21] Neural Radiance Flow for 4D View Synthesis and Video Processing Datasets The pouring dataset used for experiments can be download he

44 Dec 20, 2022
Perform Linear Classification with Multi-way Data

MultiwayClassification This is an R package to perform linear classification for data with multi-way structure. The distance-weighted discrimination (

Eric F. Lock 2 Dec 15, 2020
[ECCVW2020] Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DiMP)

Feel free to visit my homepage Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DIMP) [ECCVW2020 paper] Presentation

Seokeon Choi 35 Oct 26, 2022
A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models with machine learning (NeurIPS 2021 Datasets and Benchmarks Track)

ClimART - A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models Official PyTorch Implementation Using deep le

21 Dec 31, 2022
A toolkit for making real world machine learning and data analysis applications in C++

dlib C++ library Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real worl

Davis E. King 11.6k Jan 01, 2023
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark We propose a benchmark to evaluate different quantization algorithms on vari

494 Dec 29, 2022
Simple tutorials using Google's TensorFlow Framework

TensorFlow-Tutorials Introduction to deep learning based on Google's TensorFlow framework. These tutorials are direct ports of Newmu's Theano Tutorial

Nathan Lintz 6k Jan 06, 2023