japanese-dialog-transformers

The Japanese version of this description is available here.

This repository provides the information needed to evaluate, on fairseq, the Japanese Transformer encoder-decoder dialogue models provided by NTT.


Table of contents
Update log
Notice for using the codes
Model download
Quick start
LICENSE

Update log

  • 2021/09/17 Published dialogue models (fairseq version japanese-dialog-transformer-1.6B) and evaluation codes.

Notice for using the codes

The dialogue models provided here are intended for evaluating and verifying model performance. Before downloading them, please read the LICENSE and CAUTION documents. You may download and use these models only if you agree to all three of the following points.

  1. LICENSE
  2. Use the models only for evaluating and verifying their performance, not for providing a dialogue service itself.
  3. Take all possible care and measures to prevent harm caused by the generated text, and take responsibility for any text you generate, whether appropriate or inappropriate.

BibTeX

When publishing results using this model, please cite the following paper.

@misc{sugiyama2021empirical,
      title={Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems}, 
      author={Hiroaki Sugiyama and Masahiro Mizukami and Tsunehiro Arimoto and Hiromi Narimatsu and Yuya Chiba and Hideharu Nakajima and Toyomi Meguro},
      year={2021},
      eprint={2109.05217},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model download


Quick start

The models published on this page can be used for utterance generation and additional fine-tuning using the scripts included in fairseq.

Install dependent libraries

The verification environment is as follows.

  • Python 3.8.10 on miniconda
  • CUDA 11.1/10.2
  • PyTorch 1.8.2 (for the installation command, be sure to check the official page; we recommend using pip)
  • fairseq 1.0.0 (validated commit ID: 8adff65ab30dd5f3a3589315bbc1fafad52943e7)
  • sentencepiece 0.1.96

When installing fairseq, follow the official page and install the latest version from source; a plain pip install only provides the older 0.10.2 release. If you want to fine-tune with your own data, you also need the standalone (command-line) version of sentencepiece.
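
For reference, a minimal installation sketch is shown below. It assumes a Linux environment with git, pip, and cmake available; install PyTorch separately according to the official instructions for your CUDA version.

# Install fairseq from source at the validated commit.
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 8adff65ab30dd5f3a3589315bbc1fafad52943e7
pip install --editable ./

# Optional: build the standalone sentencepiece tools (spm_encode etc.),
# needed only when preparing your own fine-tuning data.
git clone https://github.com/google/sentencepiece
cd sentencepiece && mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install && sudo ldconfig -v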

fairseq-interactive

Because fairseq-interactive has no way to keep track of dialogue context, it generates a response from the input sentence alone. This differs from the fine-tuning setup and the paper's experiments, which use context, so inappropriate utterances are generated more easily.

In the following command, a small value (10) is used for beam and nbest (the number of output candidates) to keep the output easy to read. In actual use, setting these to 20 or more gives better results.

fairseq-interactive data/sample/bin/ \
 --path checkpoints/persona50k-flat_1.6B_33avog1i_4.16.pt \
 --beam 10 \
 --seed 0 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 10 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0
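
fairseq-interactive reads utterances from standard input, so for a quick non-interactive check you can also pipe a single sentence into the same command. The input sentence below is only an illustration; the options are the same as above. Generated candidates are printed on the H-/D- lines of the fairseq output.

echo "こんにちは。" | fairseq-interactive data/sample/bin/ \
 --path checkpoints/persona50k-flat_1.6B_33avog1i_4.16.pt \
 --source-lang src --target-lang dst \
 --tokenizer space --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --beam 10 --nbest 10 --sampling --sampling-topp 0.9 \
 --no-repeat-ngram-size 3 --min-len 10 --temperature 1.0 --seed 0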

dialog.py

dialog.py keeps a context of about four utterances, which matches the settings used for fine-tuning and in the paper's experiments.

python scripts/dialog.py data/sample/bin/ \
 --path checkpoints/dials5_1e-4_1li20zh5_tw5.143_step85.pt \
 --beam 80 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 80 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0 \
 --show-nbest 5

Perplexity calculation on a specific data set

This computes the perplexity (ppl) on a particular dataset. The lower the ppl, the better the model represents the dialogue in that dataset.

fairseq-validate $DATA_PATH \
 --path $MODEL_PATH \
 --task translation \
 --source-lang src \
 --target-lang dst \
 --batch-size 2 \
 --ddp-backend no_c10d \
 --valid-subset test \
 --skip-invalid-size-inputs-valid-test

Finetuning with Persona-chat and EmpatheticDialogues

By fine-tuning the pretrained model on PersonaChat or EmpatheticDialogues, you can create a model that is almost identical to the fine-tuned models provided.

If you have your own dialogue data, you can place it in the same format under data/*/raw and fine-tune on it. Note, however, that the LICENSE does not allow the release or distribution of fine-tuned models. Instead, you can release your own data and let a third party run the fine-tuning from this model themselves.

Downloading and converting datasets

The script converts the Excel data into a simple input-sentence (src) / output-sentence (dst) format, where the same row in src and dst forms a corresponding input/output pair. The data is split into 50,000-row chunks and output as training data.

python scripts/extract_ed.py japanese_empathetic_dialogues.xlsx data/empdial/raw/
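
Where the repository provides its own preprocessing and fine-tuning scripts, those should be followed. Purely as orientation, the sketch below shows what the fairseq side of such a pipeline typically looks like: the split and dictionary file names, the pretrained checkpoint name, and all hyperparameters are placeholders, and the architecture arguments must match the pretrained checkpoint.

# Tokenize the extracted raw text with the provided sentencepiece model
# (split and directory names are placeholders).
mkdir -p data/empdial/sp
for f in train valid; do
  for l in src dst; do
    spm_encode --model=data/dicts/sp_oall_32k.model \
      < data/empdial/raw/$f.$l > data/empdial/sp/$f.$l
  done
done

# Binarize for fairseq (dictionary file names are assumptions).
fairseq-preprocess --source-lang src --target-lang dst \
  --trainpref data/empdial/sp/train --validpref data/empdial/sp/valid \
  --destdir data/empdial/bin/ \
  --srcdict data/dicts/dict.src.txt --tgtdict data/dicts/dict.dst.txt \
  --workers 4

# Fine-tune from the pretrained checkpoint (checkpoint name, architecture
# settings, and hyperparameters are illustrative only).
fairseq-train data/empdial/bin/ \
  --task translation --source-lang src --target-lang dst \
  --finetune-from-model checkpoints/japanese-dialog-transformer-1.6B.pt \
  --arch transformer \
  --optimizer adam --lr 1e-4 --lr-scheduler inverse_sqrt --warmup-updates 3000 \
  --max-tokens 1024 --update-freq 8 --fp16 \
  --save-dir checkpoints/finetuned/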

License

LICENSE

Owner
NTT Communication Science Laboratories