Assessing Dialogue Systems with Distribution Distances
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
To appear in Findings of ACL 2021.
Note that this is not an officially supported Tencent product.
1. Configuration
This repository requires the following packages:
- pytorch
- huggingface/transformers
2. Usage
To evaluate the system-level human correlations of metrics:
python eval_metric.py \
  --data_path ./datasets/convai2_annotation.json \
  --metric fbd \
  --sample_num 10 \
  --model_type roberta-base \
  --batch_size 32
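System-level correlation compares one score per dialogue system with the mean human rating of that system. The sketch below illustrates the idea in plain Python; it is not the code in eval_metric.py, and the system names, ratings and metric values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: per-sample human ratings and one metric score per system.
human_ratings = {
    "system_a": [4, 5, 4],
    "system_b": [2, 3, 3],
    "system_c": [1, 2, 1],
}
metric_scores = {"system_a": 0.8, "system_b": 1.9, "system_c": 3.5}

systems = sorted(human_ratings)
human_means = [sum(human_ratings[s]) / len(human_ratings[s]) for s in systems]
metric_vals = [metric_scores[s] for s in systems]

print("pearson:", pearson(metric_vals, human_means))
print("spearman:", spearman(metric_vals, human_means))
```

For a distance-style metric, where lower means closer to real conversations, a strongly negative correlation with human ratings indicates good agreement.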
Currently, our repo supports the metrics commonly used in text generation, including bleu, meteor, rouge, greedy, average, extrema, bert_score, fbd and prd.
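As a rough illustration of a distribution-wise distance, the sketch below computes the squared Fréchet distance between Gaussians fitted to two sets of feature vectors. It assumes that fbd follows the recipe of FID [1]. The function name frechet_distance and the random arrays standing in for sentence features (e.g. from an encoder such as roberta-base) are hypothetical and not taken from this repository.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    """
    feats_a = np.asarray(feats_a, dtype=np.float64)
    feats_b = np.asarray(feats_b, dtype=np.float64)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in features: random vectors instead of real sentence embeddings.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 4))
generated_feats = real_feats + 3.0  # same spread, shifted mean

print(frechet_distance(real_feats, real_feats))
print(frechet_distance(real_feats, generated_feats))
```

Because only the mean differs in the second call, the covariance terms cancel and the result reduces to the squared distance between the two means.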
Here are some details of the six corpora compared in the main paper:
| File Name | Dataset Name | Num. of Samples | Reference | 
|---|---|---|---|
| personam_annotation.json | Persona(M) | 60 | Shikib/usr | 
| dailyh_annotation.json | Daily(H) | 150 | li3cmz/GRADE | 
| convai2_annotation.json | Convai2 | 150 | li3cmz/GRADE | 
| empathetic_annotation.json | Empathetic | 150 | li3cmz/GRADE | 
| dailyz_annotation.json | Daily(Z) | 100 | ZHAOTING/dialog-processing | 
| personaz_annotation.json | Persona(Z) | 150 | ZHAOTING/dialog-processing | 
Citation
If you use this research/codebase/dataset, please cite our paper:
@article{xiang2021assessing,
  title={Assessing Dialogue Systems with Distribution Distances},
  author={Xiang, Jiannan and Liu, Yahui and Cai, Deng and Li, Huayang and Lian, Defu and Liu, Lemao},
  journal={arXiv preprint arXiv:2105.02573},
  year={2021}
}
Other related papers:
- [1] FID, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS 2017
- [2] PRD, Assessing Generative Models via Precision and Recall, NIPS 2018
- [3] BERTScore, BERTScore: Evaluating Text Generation with BERT, ICLR 2020