SimulEval: A General Evaluation Toolkit for Simultaneous Translation

Overview

SimulEval is a general evaluation framework for simultaneous translation on text and speech.

Requirements

  • python >= 3.7.0

Installation

git clone git@github.com:fairinternal/SimulEval.git
cd SimulEval
pip install -e .

Quick Start

The following runs the evaluation of a dummy agent that follows a wait-k (k = 3) policy and generates random words until the number of generated words matches the number of source words. A tutorial can be found here.

cd examples
simuleval \
  --agent dummy/dummy_waitk_text_agent.py \
  --source data/src.txt \
  --target data/tgt.txt
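
To make the policy concrete, here is a sketch of the wait-k read/write rule (illustrative only; the function and argument names below are not part of SimulEval, and the bundled agent lives in examples/dummy/dummy_waitk_text_agent.py):

# Sketch of the wait-k read/write rule (k = 3): read until k more source
# words have been seen than target words written; once the source is
# finished, keep writing.
def waitk_decision(num_source_read, num_target_written, source_finished, k=3):
    if num_source_read - num_target_written < k and not source_finished:
        return "READ"
    return "WRITE"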

License

SimulEval is licensed under Creative Commons BY-SA 4.0.

Citation

Please cite as:

@inproceedings{simuleval2020,
  title = {Simuleval: An evaluation toolkit for simultaneous translation},
  author = {Xutai Ma and Mohammad Javad Dousti and Changhan Wang and Jiatao Gu and Juan Pino},
  booktitle = {Proceedings of the EMNLP},
  year = {2020},
}
Comments
  • Length Adaptive Average Lagging (LAAL)

    Length Adaptive Average Lagging (LAAL)

    Implementation of the Length Adaptive Average Lagging (LAAL) as proposed in CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022 (https://arxiv.org/abs/2204.06028). The name was suggested in Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation (https://arxiv.org/abs/2206.05807).
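
    For orientation, a rough sketch of the relationship to AL under the usual formulation (illustrative code only, not the toolkit's implementation): LAAL replaces the reference length in the oracle delay with the maximum of the reference and hypothesis lengths, so over-generation is not rewarded.

    # Illustrative sketch of AL vs. LAAL (not SimulEval's actual code).
    # `delays` holds, for each target token, how much source had been read
    # when it was emitted (source words, or ms of audio for speech);
    # `src_len` is the total source length in the same unit.
    def average_lagging(delays, src_len, ref_len, hyp_len, length_adaptive=False):
        tgt_len = max(ref_len, hyp_len) if length_adaptive else ref_len
        gamma = src_len / tgt_len  # oracle delay added per target position
        # Only sum up to the first token emitted after the whole source was read.
        tau = next((i for i, d in enumerate(delays, 1) if d >= src_len), len(delays))
        return sum(d - (i - 1) * gamma for i, d in enumerate(delays[:tau], 1)) / tau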

    CLA Signed Merged 
    opened by pe-trik 7
  • Several bug and format fixes

    Several bug and format fixes

    • Change the GitHub test plan to run pytest on each individual file.
    • Fix a bug when submitting Slurm jobs.
    • Fix an infinite loop that occurs when an empty segment is produced.
    • Formatting with black.
    CLA Signed Merged 
    opened by xutaima 6
  • Length Adaptive Average Lagging (LAAL)

    Length Adaptive Average Lagging (LAAL)

    Implementation of the Length Adaptive Average Lagging (LAAL) as proposed in CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022 (https://arxiv.org/abs/2204.06028). The name was suggested in Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation (https://arxiv.org/abs/2206.05807).

    opened by pe-trik 3
  • Error during Simul ST(MuST-C) Evaluation

    Error during Simul ST(MuST-C) Evaluation

    Following the steps in this guide, I got this error when trying to evaluate the simultaneous speech-to-text model.

    2021-03-12 13:03:29 | ERROR    | simuleval.util.agent_builder | No 'Agent' class found in /project_scratch/nmt_shdata/speech_translation/simultaneous/fairseq-master/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py
    

    This commit in the fairseq repo changed the superclass of the ST Agent, hence this error (due to the following if statement):

    https://github.com/facebookresearch/SimulEval/blob/927072ebb7df83023adcc206b7458bf9c9aad3a2/simuleval/utils/agent_finder.py#L71


    Environment

    • python 3.8.5
    • torch - 1.7.0
    • OS: Linux
    • fairseq and SimulEval - master codes
    • Scripts: Provided in the guide
    opened by mzaidi59 3
  • Commit 1753363 changes AL scores for audio input

    Commit 1753363 changes AL scores for audio input

    In commit 1753363, AudioInstance was changed to use the base class function Instance.sentence_level_eval(). This introduced the (possible) bug that 1 is added to self.source_length() for the audio input case as well. (For text input, this 1 is added to account for the sentence end.)

    This corresponds to only 1 ms; however, it changes the lagging_padding_mask in AverageLagging: all words predicted at the end of the sentence, i.e. after all audio has been read, are no longer taken into account in the AL computation. I observed the AL value change from 5598 ms to 3267 ms for a very high-latency system where most of the words are predicted at sentence end.

    See the line:

    lagging_padding_mask = delays >= src_lens.unsqueeze(1)

    src_lens is now 1 greater than delays[-1], i.e. the audio length.

    I think the old behaviour was correct, but please decide yourself 😬

    opened by patrick-wilken 2
  • questions about AL and DAL

    questions about AL and DAL

    Hi! I was looking at the code SimulEval/simuleval/metrics/latency.py to understand how you implemented the latency metrics defined in the SimulEval paper. I have a few questions: can you confirm that, in the code, the AL calculation uses the reference length rather than the prediction length, as described in equation 8 of the paper? If so, why was the same correction not applied to the DAL calculation (both in the code and in the paper)?

    Thank you very much in advance!
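
    For readers following along, the definition in question can be written as follows (a reconstruction consistent with the discussion above, not a quote from the paper):

    \mathrm{AL} = \frac{1}{\tau}\sum_{i=1}^{\tau}\bigl(d_i - d_i^*\bigr),
    \qquad d_i^* = (i-1)\,\frac{|\mathbf{x}|}{|\mathbf{y}^*|},
    \qquad \tau = \min\{\, i : d_i \ge |\mathbf{x}| \,\}

    where $d_i$ is the delay of the $i$-th target token, $|\mathbf{x}|$ the source length (tokens, or duration for speech), and $|\mathbf{y}^*|$ the reference length; the question is why the reference length appears here but the same correction is not applied to DAL.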

    opened by Onoratofra 2
  • no module named

    no module named

    # Copyright (c) Facebook, Inc. and its affiliates.
    # All rights reserved.
    #
    # This source code is licensed under the license found in the
    # LICENSE file in the root directory of this source tree.
    
    from simuleval.agents import TextAgent
    from simuleval import READ_ACTION, WRITE_ACTION, DEFAULT_EOS
    from utils.fun import function
    
    
    class DummyWaitkTextAgent(TextAgent):
    
        data_type = "text"
    
        def __init__(self, args):
            super().__init__(args)
            self.waitk = args.waitk
            # Initialize your agent here, for example load model, vocab, etc
    
        @staticmethod
        def add_args(parser):
            # Add additional command line arguments here
            parser.add_argument("--waitk", type=int, default=3)
    
        def policy(self, states):
            # Make decision here
            print(states)
            if len(states.source) - len(states.target) < self.waitk and not states.finish_read():
                return READ_ACTION
            else:
                return WRITE_ACTION
    
        def predict(self, states):
            # predict token here
            if states.finish_read():
                if states.target.length() == states.source.length():
                    return DEFAULT_EOS
    
            # return f"word_{len(states.target)}"
            return function(len(states.target))
    

    And the content of the utils/fun.py is

    def function(number):
        return 'word_' + str(number)
    

    The directory looks like

    ├── data
    │   ├── src.txt
    │   └── tgt.txt
    └── dummy
        ├── dummy_waitk_text_agent.py
        └── utils
            ├── fun.py
            └── __init__.py
    

    When I run the command simuleval --agent dummy_waitk_text_agent.py --source ../data/src.txt --target ../data/tgt.txt, it failed to load the utils module.
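
    A common workaround for this kind of path-based agent loading (an assumption, not an official SimulEval feature) is to put the agent file's own directory on sys.path before importing local helpers:

    # Hypothetical workaround inside dummy_waitk_text_agent.py: make the
    # agent file's directory importable before loading the local `utils`
    # package.
    import os
    import sys

    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    from utils.fun import function  # noqa: E402 (import after sys.path tweak)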

    opened by Cppowboy 2
  • Fails to evaluate wait-k model on SimulEval if sequence is shorter than k

    Fails to evaluate wait-k model on SimulEval if sequence is shorter than k

    (Repost of the issue: https://github.com/pytorch/fairseq/issues/3445)

    I got an error while evaluating the wait-k model, following the instructions in README.md.

    2021-04-03 10:36:28 | INFO     | simuleval.scorer | Evaluating on text
    2021-04-03 10:36:28 | INFO     | simuleval.scorer | Source: /userdir/iwslt17.test0.en
    2021-04-03 10:36:28 | INFO     | simuleval.scorer | Target: /userdir/iwslt17.test0.ja
    2021-04-03 10:36:28 | INFO     | simuleval.scorer | Number of sentences: 1549
    2021-04-03 10:36:28 | INFO     | simuleval.server | Evaluation Server Started (process id 132622). Listening to port 12321
    2021-04-03 10:36:31 | WARNING  | simuleval.scorer | Resetting scorer
    2021-04-03 10:36:31 | INFO     | simuleval.cli    | Output dir: /userdir/iwslt17.test0
    2021-04-03 10:36:31 | INFO     | simuleval.cli    | Start data writer (process id 132639)
    2021-04-03 10:36:31 | INFO     | simuleval.cli    | Evaluating SimulTransTextAgentJA (process id 132556) on instances from 0 to 1548
    2021-04-03 10:36:31 | INFO     | fairseq.tasks.translation | [en] dictionary: 16004 types
    2021-04-03 10:36:31 | INFO     | fairseq.tasks.translation | [ja] dictionary: 16004 types
    Traceback (most recent call last):
      File "/userdir/.venv/bin/simuleval", line 11, in <module>
        load_entry_point('simuleval', 'console_scripts', 'simuleval')()
      File "/userdir/SimulEval/simuleval/cli.py", line 165, in main
        _main(args.client_only)
      File "/userdir/SimulEval/simuleval/cli.py", line 192, in _main
        evaluate(args, client, server_process)
      File "/userdir/SimulEval/simuleval/cli.py", line 145, in evaluate
        decode(args, client, result_queue, indices)
      File "/userdir/SimulEval/simuleval/cli.py", line 108, in decode
        action = agent.policy(states)
      File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/eval/agents/simul_t2t_enja.py", line 196, in policy
        x, outputs = self.model.decoder.forward(
      File "/userdir/fairseq-v0.10.2/fairseq/models/transformer.py", line 817, in forward
        x, extra = self.extract_features(
      File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/models/transformer_monotonic_attention.py", line 219, in extract_features
        x, attn, _ = layer(
      File "/userdir/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/modules/monotonic_transformer_layer.py", line 160, in forward
        x, attn = self.encoder_attn(
      File "/userdir/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/modules/monotonic_multihead_attention.py", line 667, in forward
        alpha = self.expected_alignment_infer(
      File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/modules/monotonic_multihead_attention.py", line 528, in expected_alignment_infer
        assert tgt_len == 1
    AssertionError
    

    This problem occurs when the sequence is shorter than k.

    Environment

    • fairseq Version: 0.10.2 (commit 14807a361202ba34dbbd3a533899db57a0ebda19)
    • SimulEval Version: latest (commit 1753363071f989ea3b79fdf5a21b96089a002f36)
    opened by fury00812 1
  • bug in AL calculation

    bug in AL calculation

    https://github.com/facebookresearch/SimulEval/blob/927072ebb7df83023adcc206b7458bf9c9aad3a2/simuleval/metrics/latency.py#L105

    Dear developers, I just updated simuleval from 0.1.0 to the newest version and got a significantly lower AL score compared to the older version. I found a small bug here: oracle_latency is generated with range(1, ref_len+1), but based on the AL definition in your paper, $d_i^*$ should be $(i-1)\gamma$. Thanks a lot.
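
    Put concretely (an illustrative reading of the report, not the repository's code), the two conventions differ by one step of $\gamma$ per position:

    # Illustrative only: oracle delays d_i* under the two conventions.
    gamma = 1.0   # placeholder value for the source/target length ratio
    ref_len = 5
    as_reported = [i * gamma for i in range(1, ref_len + 1)]       # i * gamma
    per_paper = [(i - 1) * gamma for i in range(1, ref_len + 1)]   # (i - 1) * gamma
    # as_reported: [1.0, 2.0, 3.0, 4.0, 5.0]
    # per_paper:   [0.0, 1.0, 2.0, 3.0, 4.0]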

    opened by danliu2 1
  • Re-sync with internal repository

    Re-sync with internal repository

    The internal and external repositories are out of sync. This attempts to bring them back in sync by patching the GitHub repository. Please carefully review this patch. You must disable ShipIt for your project in order to merge this pull request. DO NOT IMPORT this pull request. Instead, merge it directly on GitHub using the MERGE BUTTON. Re-enable ShipIt after merging.

    CLA Signed fh:direct-merge-enabled 
    opened by facebook-github-bot 0
  • Add ATDScore

    Add ATDScore

    Added Average Token Delay (ATD) as a latency metric. Paper: Average Token Delay: A Latency Metric for Simultaneous Translation (https://arxiv.org/abs/2211.13173).

    CLA Signed 
    opened by master-possible 5
  • Connection refused when testing set is large

    Connection refused when testing set is large

    Hi, I noticed that when using a dev/test set with a large number of sentences (e.g. CoVoST2's dev/test sets generally have ~15k sentences), the evaluation hangs due to the following error:

    2021-11-27 03:49:51 | INFO     | simuleval.scorer | Number of sentences: 15492
    HTTPConnectionPool(host='localhost', port=12347): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fec33199ca0>: Failed to establish a new connection: [Errno 111] Connection refused'))
    

    After some printing here and there, I suspect this error comes from here. However, I'm not familiar with web apps, so I have no idea how to correct or avoid this error.

    Is there any easy solution to this? Much appreciated. @xutaima

    EDIT: Forgot to mention that this does not happen if I use a subset of sentences e.g. 1500 sentences.

    opened by George0828Zhang 5
  • Character level latency for speech-to-text

    Character level latency for speech-to-text

    Hello, will this feature ever be updated?

    2021-11-14 19:58:00 | ERROR    | simuleval.scorer | Character level latency for speech-to-text model is not supported at the moment. We will update this feature very soon.
    

    Also, I'm curious why this combination was not implemented; does it require additional handling? AFAIK, the only difference between word and char is in how the reference length is calculated here. I'm trying to implement it myself but am struggling to see the difference; it would be great if you could provide some explanation, or if this feature gets updated. Thanks.
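
    If that reading is right (treated here as an assumption rather than a statement about the toolkit), the character-level variant would only change how the reference length is counted, e.g.:

    # Hypothetical helper: reference length in words vs. characters.
    # Excluding whitespace from the character count is an assumption,
    # not something the toolkit specifies.
    def reference_length(reference, level="word"):
        if level == "char":
            return len(reference.replace(" ", ""))
        return len(reference.split())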

    opened by George0828Zhang 1
  • Pre- and post-processing text in Simuleval

    Pre- and post-processing text in Simuleval

    I am playing with the MMA-hard model to replicate WMT15 DE-EN experiments reported in the paper and my question is about preprocessing and postprocessing data. The paper says that:

    For each dataset, we apply tokenization with the Moses (Koehn et al., 2007) tokenizer and preserve casing. We apply byte pair encoding (BPE) (Sennrich et al., 2016) jointly on the source and target to construct a shared vocabulary with 32K symbols

    Following what is said above, I applied the Moses scripts to tokenize the raw files and then applied BPE to the tokenized files. The tokenized, BPE-applied train, valid and test files were then binarized using the following fairseq-preprocess command:

    fairseq-preprocess --source-lang de --target-lang en \
        --trainpref ~/wmt15_de_en_32k/train --validpref ~/wmt15_de_en_32k/valid --testpref ~/wmt15_de_en_32k/test \
        --destdir ~/wmt15_de_en_32k/data-bin/ \
        --workers 20
    
    

    After that, I trained an MMA-hard model using the binarized data. Now I would like to evaluate a checkpoint (w.r.t. latency and BLEU) using SimulEval. My first question is about the file format: in which format should I provide the test files as --source and --target to the simuleval command? There are three options as far as I can see:

    1. Using raw files.
    2. Using tokenized files.
    3. Using tokenized and BPE-applied files.

    I am following the EN-JA wait-k model's agent file to understand what should be done. However, the difference between the experiment I'd like to replicate and the EN-JA experiment is that EN-JA uses a SentencePiece model for tokenization, whereas in my case Moses tokenization and BPE are used.

    So, I tried following:

    I provided the paths of the TOKENIZED files as --source and --target to simuleval. I also implemented the segment_to_units and build_word_splitter functions, but I couldn't figure out how I should implement units_to_segment.

    I tried to test this implementation as follows:

    $ head -n 1 ~/wmt15_de_en_32k/tmp/test.de
    Die Premierminister Indiens und Japans trafen sich in Tokio .
    $ head -n 1 ~/wmt15_de_en_32k/tmp/test.en
    India and Japan prime ministers meet in Tokyo
    
    simuleval --agent mma-dummy/mmaAgent.py --source ~/wmt15_de_en_32k/tmp/test.de  \
    --target  ~/wmt15_de_en_32k/tmp/test.en  --data-bin ~/wmt15_de_en_32k/data-bin/  \
    --model-path ~/checkpoints/checkpoint_best.pt --bpe_code ~/wmt15_de_en_32k/code
    

    So, my questions are:

    1. Is it correct to provide tokenized but not BPE-applied test files as --source and --target to simuleval?
    2. Do the implementations of segment_to_units and build_word_splitter seem correct?
    3. Could you please explain how units_to_segment and update_states_write should be implemented? (A rough sketch of one possibility for units_to_segment is given below.)
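
    One plausible shape for units_to_segment with "@@"-style BPE (purely a sketch under that assumption, not the official agent code) is to buffer subword units until one arrives without the continuation marker, then emit the merged word:

    # Hypothetical units_to_segment for "@@"-style BPE: buffer subword units
    # until a unit without the continuation marker arrives, then return the
    # merged word (None means "keep waiting for more units").
    def units_to_segment(unit, buffer):
        buffer.append(unit)
        if unit.endswith("@@"):
            return None
        word = "".join(u[:-2] if u.endswith("@@") else u for u in buffer)
        buffer.clear()
        return word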

    Edit: When I evaluate the best checkpoint on a subset of the test set using the above command, I get the following output:

    2021-09-19 22:10:08 | WARNING | sacrebleu | That's 100 lines that end in a tokenized period ('.')
    2021-09-19 22:10:08 | WARNING | sacrebleu | It looks like you forgot to detokenize your test data, which may hurt your score.
    2021-09-19 22:10:08 | WARNING | sacrebleu | If you insist your data is detokenized, or don't care, you can suppress this message with '--force'.
    2021-09-19 22:10:08 | INFO | simuleval.cli | Evaluation results: {
        "Quality": { "BLEU": 6.068334932433579 },
        "Latency": { "AL": 7.8185020314753055, "AP": 0.833324143320322, "DAL": 11.775593814849854 }
    }

    opened by kurtisxx 22
  • Moses tokenizer/detokenizer in cli options

    Moses tokenizer/detokenizer in cli options

    Hi, I'm wondering if you could add the Moses detokenizer to the CLI options? More specifically, when post-processing the prediction and reference, the " ".join() call could optionally be replaced by sacremoses's detokenizer.

    Thanks.
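
    For context, the kind of replacement being requested looks like this with sacremoses (where exactly it would hook into SimulEval's post-processing is an assumption):

    # Detokenize with sacremoses instead of a plain " ".join(tokens).
    from sacremoses import MosesDetokenizer

    detok = MosesDetokenizer(lang="en")
    tokens = ["Hello", ",", "world", "!"]
    print(" ".join(tokens))           # Hello , world !
    print(detok.detokenize(tokens))   # Hello, world!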

    opened by George0828Zhang 0
  • visualization tool

    visualization tool

    Hi, I'm using SimulEval to evaluate my system and everything works. I read in your paper that there is a visualization tool that can be run via simuleval server --visual --log-dir $DIR, but when I try to run this command it raises an error asking for an agent. Looking at your code, I don't see any reference to this visualization tool; is it available at the moment? Thanks

    opened by sarapapi 1
Releases (v1.0.2)

Owner: Facebook Research