Open Source Neural Machine Translation in PyTorch

Last update: Jan 04, 2023

Overview

OpenNMT-py: Open-Source Neural Machine Translation

OpenNMT-py is the PyTorch version of the OpenNMT project, an open-source (MIT) neural machine translation framework. It is designed to be research friendly, making it easy to try out new ideas in translation, summarization, morphology, and many other domains. Some companies have proven the code to be production ready.

We love contributions! Please look at issues marked with the contributions welcome tag.

Before raising an issue, make sure you read the requirements and the documentation examples.

Unless there is a bug, please use the forum or Gitter to ask questions.

Announcement - OpenNMT-py 2.0

We're happy to announce the upcoming release v2.0 of OpenNMT-py.

The major idea behind this release is the almost complete makeover of the data loading pipeline. A new 'dynamic' paradigm is introduced, which allows transforms to be applied to the data on the fly.

This has a few advantages, amongst which:

remove or drastically reduce the preprocessing required to train a model;
increase the possibilities of data augmentation and manipulation through on-the-fly transforms.

These transforms can be specific tokenization methods, filters, noising, or any custom transform users may want to implement. Custom transform implementation is quite straightforward thanks to the existing base class and example implementations.
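As a rough illustration of what a custom transform can look like, the sketch below defines a standalone base class, a length filter, and a lowercasing step, then chains them lazily. The names `Transform`, `MaxLengthFilter`, `Lowercase`, `run_pipeline`, and the `apply` signature are hypothetical stand-ins written for this example. They do not import or reproduce the actual OpenNMT-py classes, so refer to the docs for the real interface.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Example = Dict[str, List[str]]


class Transform:
    """Hypothetical base class: one example in, one example (or None) out."""

    def apply(self, example: Example, is_train: bool = False) -> Optional[Example]:
        raise NotImplementedError


class MaxLengthFilter(Transform):
    """Drop examples whose source or target exceeds a token limit."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len

    def apply(self, example: Example, is_train: bool = False) -> Optional[Example]:
        if len(example["src"]) > self.max_len or len(example["tgt"]) > self.max_len:
            return None  # None means "filter this example out"
        return example


class Lowercase(Transform):
    """Lowercase every token on both sides."""

    def apply(self, example: Example, is_train: bool = False) -> Optional[Example]:
        return {side: [tok.lower() for tok in toks] for side, toks in example.items()}


def run_pipeline(pairs: Iterable[Tuple[str, str]], transforms: List[Transform]) -> Iterator[Example]:
    """Apply each transform in order, on the fly, skipping filtered examples."""
    for src, tgt in pairs:
        example: Optional[Example] = {"src": src.split(), "tgt": tgt.split()}
        for transform in transforms:
            example = transform.apply(example, is_train=True)
            if example is None:
                break
        if example is not None:
            yield example
```

Because `run_pipeline` is a generator, nothing is written to disk beforehand: each example is transformed only when the training loop asks for it.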

You can check out how to use this new data loading pipeline in the updated docs.

All the readily available transforms are described here.

Performance

Given sufficient CPU resources relative to the GPU computing power, most of the transforms should not slow training down. (Note: for now, one producer process is spawned per GPU, meaning you would ideally need 2N CPU threads for N GPUs.)

Breaking changes

For now, the new data loading paradigm drops a few features:

audio, image and video inputs;
source word features.

For any user who still needs these features, the previous codebase will be retained as legacy in a separate branch. It will no longer receive extensive development from the core team but PRs may still be accepted.

Feel free to check it out and let us know what you think of the new paradigm!

Setup
Features
Quickstart
Alternative: Run on FloydHub
Pretrained embeddings
Pretrained models
Acknowledgements
Citation

Setup

OpenNMT-py requires:

Python >= 3.6
PyTorch == 1.6.0

Install OpenNMT-py from pip:

pip install OpenNMT-py

or from the sources:

git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip install -e .

Note: if you encounter a MemoryError during installation, try to use pip with --no-cache-dir.

(Optional) Some advanced features (e.g. working pretrained models or specific transforms) require extra packages. You can install them with:

pip install -r requirements.opt.txt

Features

⚠️ New in OpenNMT-py 2.0: On the fly data processing
Encoder-decoder models with multiple RNN cells (LSTM, GRU) and attention types (Luong, Bahdanau)
Transformer models
Copy and Coverage Attention
Pretrained Embeddings
Source word features
TensorBoard logging
Multi-GPU training
Data preprocessing
Inference (translation) with batching and beam search
Inference time loss functions
Conv2Conv convolution model
SRU "RNNs faster than CNN" paper
Mixed-precision training with APEX, optimized on Tensor Cores
Model export to CTranslate2, a fast and efficient inference engine

Quickstart

Full Documentation

Step 1: Prepare the data

To get started, we suggest downloading a toy English-German dataset for machine translation containing 10k tokenized sentences:

wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xf toy-ende.tar.gz
cd toy-ende

The data consists of parallel source (src) and target (tgt) data containing one sentence per line with tokens separated by a space:

src-train.txt
tgt-train.txt
src-val.txt
tgt-val.txt

Validation files are used to evaluate the convergence of the training. They usually contain no more than 5k sentences.
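Because source and target files are aligned line by line, a quick sanity check before training is to confirm that both sides have the same number of lines and no empty sentences. The helper below is a small sketch written for this guide; `check_parallel` is a hypothetical name and is not part of OpenNMT-py.

```python
from typing import Iterable, List, Tuple


def check_parallel(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> List[Tuple[List[str], List[str]]]:
    """Pair up source/target lines and split each into space-separated tokens."""
    src = [line.rstrip("\n") for line in src_lines]
    tgt = [line.rstrip("\n") for line in tgt_lines]
    if len(src) != len(tgt):
        raise ValueError(f"line count mismatch: {len(src)} source vs {len(tgt)} target")
    pairs = []
    for i, (s, t) in enumerate(zip(src, tgt), start=1):
        if not s.strip() or not t.strip():
            raise ValueError(f"empty sentence on line {i}")
        pairs.append((s.split(), t.split()))
    return pairs


# Usage (paths taken from the toy dataset above):
# with open("toy-ende/src-train.txt", encoding="utf-8") as f_src, \
#      open("toy-ende/tgt-train.txt", encoding="utf-8") as f_tgt:
#     print(len(check_parallel(f_src, f_tgt)), "aligned sentence pairs")
```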

$ head -n 3 toy-ende/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament &apos;s legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
&quot; Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .

We need to build a YAML configuration file to specify the data that will be used:

# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
...

From this configuration, we can build the vocab(s) that will be necessary to train the model:

onmt_build_vocab -config toy_en_de.yaml -n_sample 10000

Notes:

-n_sample is required here -- it represents the number of lines sampled from each corpus to build the vocab.
This configuration is the simplest possible, without any tokenization or other transforms. See other example configurations for more complex pipelines.
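To make the role of -n_sample concrete, here is a simplified sketch of frequency counting over the first n lines of a corpus. It is an illustration only: `count_vocab` and `format_vocab` are hypothetical helpers, the treatment of a negative value as "use every line" is an assumption, and the real onmt_build_vocab also applies the configured transforms and may sample differently.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, List, Tuple


def count_vocab(lines: Iterable[str], n_sample: int) -> List[Tuple[str, int]]:
    """Count token frequencies over at most n_sample lines.

    Assumption for this sketch: a negative n_sample means "use every line".
    """
    sampled = lines if n_sample < 0 else islice(lines, n_sample)
    counts: Counter = Counter()
    for line in sampled:
        counts.update(line.split())
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def format_vocab(entries: List[Tuple[str, int]]) -> str:
    """Render one 'token<TAB>count' pair per line."""
    return "\n".join(f"{tok}\t{freq}" for tok, freq in entries)
```

With a small sample, rare tokens that only occur late in the corpus never reach the vocabulary, which is why the sample size matters for larger datasets.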

Step 2: Train the model

To train a model, we need to add the following to the YAML configuration file:

the vocabulary path(s) that will be used (these can be the files generated by onmt_build_vocab);
training specific parameters.

# toy_en_de.yaml

...

# Vocabulary files that were just created
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 1000
valid_steps: 500

Then you can simply run:

onmt_train -config toy_en_de.yaml

This configuration will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and decoder. It will run on a single GPU (world_size 1 & gpu_ranks [0]).

Before the training process actually starts, the *.vocab.pt and *.transforms.pt files will be dumped to -save_data with the configuration specified in the -config YAML file. Transformed samples are also generated to simplify any visual inspection that may be required. The number of sample lines to dump per corpus is set with the -n_sample flag.

For more advanced models and parameters, see other example configurations or the FAQ.

Step 3: Translate

onmt_translate -model toy-ende/run/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose

Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into toy-ende/pred_1000.txt.
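For readers unfamiliar with beam search, the following toy version shows the core idea: keep the `beam_size` best partial hypotheses at every step instead of only the single best one. It is a sketch under simplifying assumptions. `beam_search` and `step_fn` are names invented for this example, and the code omits batching, length penalties, and the other options that onmt_translate exposes.

```python
from typing import Callable, Dict, List, Tuple

# step_fn maps a prefix of tokens to {next_token: log_probability}.
StepFn = Callable[[List[str]], Dict[str, float]]
Hypothesis = Tuple[List[str], float]


def beam_search(step_fn: StepFn, beam_size: int, max_len: int, eos: str = "</s>") -> List[Hypothesis]:
    """Return up to beam_size hypotheses as (tokens, total log-prob), best first."""
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Hypotheses that never produced eos are returned as they are.
    finished.extend(beams)
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]
```

With `beam_size=1` this reduces to greedy decoding. A wider beam can recover a sequence whose first token is not the locally most probable one.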

Note:

The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example you can download millions of parallel sentences for translation or summarization.

(Optional) Step 4: Release

When you are satisfied with your trained model, you can release it for inference. The release process will remove training-only parameters from the checkpoint:

onmt_release_model -model toy-ende/run/model_step_1000.pt -output toy-ende/run/model_step_1000_release.pt

The release script can also export checkpoints to CTranslate2, a fast inference engine for Transformer models. See the -format command line option.

Alternative: Run on FloydHub

Click this button to open a Workspace on FloydHub for training/testing your code.

Pretrained embeddings (e.g. GloVe)

Please see the FAQ: How to use GloVe pre-trained embeddings in OpenNMT-py

Pretrained models

Several pretrained models can be downloaded and used with onmt_translate:

http://opennmt.net/Models-py/

Acknowledgements

OpenNMT-py is run as a collaborative open-source project. The original code was written by Adam Lerer (NYC) to reproduce OpenNMT-Lua using PyTorch.

Major contributors are:

Sasha Rush (Cambridge, MA)
Vincent Nguyen (Ubiqus)
Ben Peters (Lisbon)
Sebastian Gehrmann (Harvard NLP)
Yuntian Deng (Harvard NLP)
Guillaume Klein (Systran)
Paul Tardy (Ubiqus / Lium)
François Hernandez (Ubiqus)
Linxiao Zeng (Ubiqus)
Jianyu Zhan (Shanghai)
Dylan Flaute (University of Dayton)
... and more!

OpenNMT-py is part of the OpenNMT project.

Citation

If you are using OpenNMT-py for academic work, please cite the initial system demonstration paper published in ACL 2017:

@inproceedings{klein-etal-2017-opennmt,
    title = "{O}pen{NMT}: Open-Source Toolkit for Neural Machine Translation",
    author = "Klein, Guillaume  and
      Kim, Yoon  and
      Deng, Yuntian  and
      Senellart, Jean  and
      Rush, Alexander",
    booktitle = "Proceedings of {ACL} 2017, System Demonstrations",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-4012",
    pages = "67--72",
}

Comments

Abstractive Summarization Results

Hey guys, looking at recent pull requests and issues, it looks like a common interest of contributors (On top of NMT obv) is Abstractive Summarization.

Any suggestions on how to train a model that will get results close to those of recent papers on the CNN-Daily Mail dataset? Any additional preprocessing?

Thanks?
type:performance type:question

opened by mataney 82

How to reproduce your results on WMT'14 ENDE datasets of "Attention is All You Need"?
Hi, I want to reproduce the results on WMT'14 ENDE datasets of "Attention is All You Need" paper. I have read OpenNMT-FAQ and I want to know the exact details about your experiments:

Did you use BPE or WordPiece?

What's the exact BLEU score of your experiments on WMT'14 ENDE dataset?

Are there any other differences between the README's steps and the Transformer experiments? If so, please provide a complete tutorial to help us reproduce your results.

Thank you very much! @srush
opened by SkyAndCloud 47

Pytorch 0.4 support: python2/3 issue with detach()

This occurs on systems with either cuda8/cudnn6 or cuda9/cudnn7, Ubuntu 16.04. Pytorch built from source.

python3 train.py -data data/demo -save_model demo-model -gpuid 1

    main()
  File "train.py", line 299, in main
    train_model(model, train, valid, fields, optim)
  File "train.py", line 159, in train_model
    train_stats = trainer.train(epoch, report_func)
  File "/home/levinth/OpenNMT-py/onmt/Trainer.py", line 133, in train
    dec_state.detach()
  File "/home/levinth/OpenNMT-py/onmt/Models.py", line 444, in detach
    h.detach_()
RuntimeError: Can't detach views in-place. Use detach() instead

type:bug

opened by David-Levinthal 44

[WIP] A Deep Reinforced Model for Abstractive Summarization
Update: I made a post dedicated to this implementation: http://forum.opennmt.net/t/about-a-deep-reinforced-model-for-abstractive-summarization/1347

Working to implement Paulus et al (2017), A Deep Reinforced Model for Abstractive Summarization.

What's in the model (and what is implemented):

[x] temporal attention on source
    attention between decoding h and encoder outputs
    normalizing w.r.t. previous decoding step => penalizing previous high attention scores

[x] intra-decoder attention
    attention between hd_t and [hd_1; hd_t-1] (for t>0)

[x] pointer-generator
    [x] using a soft-switch
    [x] using tgt_embedding, i.e. pointer_generator.linear = proj(tgt_embedding) (helps learning by using semantic information; requires partially shared embeddings)

[x] Reducing Exposure Bias
    [x] output predictions at each decoding step
    [x] using predicted token with 0.25 probability

[x] partially sharing embeddings
    [x] sharing tgt & src embeddings while having different vocabulary sizes (src_vocab > tgt_vocab); src_embedding = linear(tgt_embedding)

[x] Reinforcement Policy
    [x] Using ROUGE score to adjust learning
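One way to read the temporal attention item in the list above: at each decoding step, the exponentiated score of a source position is divided by the sum of its exponentiated scores at all earlier steps, and the result is then normalized over source positions. The pure-Python sketch below follows that reading of Paulus et al. (2017). It is independent of the code in this PR, and `temporal_attention` is a name chosen for the example.

```python
import math
from typing import List


def temporal_attention(scores: List[List[float]]) -> List[List[float]]:
    """scores[t][i] is the raw attention score of decoding step t on source position i.

    Returns attention weights in which positions that received high scores at
    earlier steps are penalized at the current step.
    """
    n_src = len(scores[0])
    history = [0.0] * n_src  # running sum of exp(score) per source position
    weights = []
    for t, row in enumerate(scores):
        exp_row = [math.exp(s) for s in row]
        if t == 0:
            penalized = exp_row  # no history yet at the first step
        else:
            penalized = [e / h for e, h in zip(exp_row, history)]
        total = sum(penalized)
        weights.append([p / total for p in penalized])
        history = [h + e for h, e in zip(history, exp_row)]
    return weights
```

If the raw scores are identical at two consecutive steps, the second step ends up with uniform weights, because every position is divided by exactly its own previous score.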

Implementations

Most of my edits are in onmt.Reinforced

I added the -reinforced flag to use this model.

I tried to use as many existing functions as possible.

Note that, because we need a prediction at each time step, the training process is quite different (.backward() is run at each time step). This is why I use a slightly modified Trainer class. It seems to have a bad impact on performance though: GPU load is below 100% most of the time.

I currently use a different generator that, in its current version, just uses the CopyGenerator. It will then be upgraded to use the embedding matrix as part of its linear projection.

Discussion

I first tested on really simple cases like sequence copy, i.e. using src = tgt = ptb.train.txt. The model learns it really easily (more than 90% in 1 epoch on PTB). I specifically tested the copy mechanism by setting a very low vocab size and looking at the translations, which are correct even with a bad source (lots of unk).

I am still finding lots of repetitions, even on copy tasks. The model may continue to generate the last word for some time, i.e. it probably doesn't manage to generate <eos> for some reason.

I am currently investigating it. I ran some not very promising trainings on the summarization task (CNN/DM). The ppl goes as low as 30 (with ~30% acc), but translation gives really poor results (lots of repetitions, etc.), so there may be some issues with it.

What's next

I would like to reproduce the "ML part" (without RL) result of the paper before moving to Reinforcement learning.

Small implementation improvements like partial embeddings.

If people here are interested in this, running tests and suggesting improvements would help a lot :)
type:feature
opened by pltrdy 35

Report for Chinese Abstractive summarization performance

This is the report on Chinese abstractive summarization performance. Discussion is welcome.

Result

LCSTS dataset
ROUGE-1 / ROUGE-2 / ROUGE-L
34.8 / 22.5 / 32.3

Gigaword Chinese dataset
ROUGE-1 / ROUGE-2 / ROUGE-L
51.92 / 38.39 / 49.12

Preprocessing script

python3 preprocess.py \
      -train_src $ORIGIN_DIR/train.source \
      -train_tgt $ORIGIN_DIR/train.target \
      -valid_src $ORIGIN_DIR/valid.source \
      -valid_tgt $ORIGIN_DIR/valid.target \
      -src_vocab_size 8000 \
      -tgt_vocab_size 8000 \
      -src_seq_length 400 \
      -tgt_seq_length 30 \
      -src_seq_length_trunc 400 \
      -tgt_seq_length_trunc 100 \
      -max_shard_size 20000000 \
      -save_data $DATA_DIR/processed

Training script

 python3 train.py \
      -data $DATA_DIR/processed \
      -word_vec_size 500 \
      -encoder_type brnn \
      -epochs 30 \
      -enc_layers 1 \
      -dec_layers 1 \
      -rnn_size 300 \
      -gpuid 0 \
      -save_model $MODEL_DIR/ \
      > $MODEL_DIR/log.txt

Generating script

 python3 translate.py \
      -model $MODEL_DIR/$BEST_MODEL \
      -beam_size 5 \
      -verbose \
      -batch_size 1 \
      -tgt $GOLD \
      -output $MODEL_DIR/$PRED \
      -src $TEST

opened by playma 34

Update checkpoint vocabulary
Draft PR to add new vocabulary to existing checkpoint by reusing checkpoint's embeddings

This is what the new training procedure would be:

Run build_vocab as usual. This would generate the vocabulary files for the new corpora.

Then in the training procedure:

Load checkpoint (fields would be populated with the checkpoint’s vocab)

Build fields for the new vocabulary as usual from the vocabulary files generated in step1.

Extend checkpoint fields vocabulary with the new vocabulary fields (only appending new words at the end, as stated in Torch docs)

Assign the extended vocabulary to the new models fields

In build_base_model, after loading the checkpoint’s state_dicts, replace the new model’s embeddings from 0 to len(checkpoint.vocab_size) with the checkpoint’s embeddings. The order matches the extended vocabulary because new words were appended at the end.

Remove embeddings parameters from checkpoint as new embeddings have been already added to the model

Continue as usual
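The core of the procedure above (append new words at the end, then reuse the checkpoint's embedding rows for the indices that already existed) can be sketched with plain Python lists. This is an illustration only: `extend_vocab` and `extend_embeddings` are hypothetical helpers, and the actual PR works on the model's fields and tensors rather than on lists.

```python
from typing import Callable, List, Sequence


def extend_vocab(ckpt_vocab: Sequence[str], new_vocab: Sequence[str]) -> List[str]:
    """Keep checkpoint tokens at their original indices; append unseen tokens at the end."""
    seen = set(ckpt_vocab)
    extended = list(ckpt_vocab)
    for tok in new_vocab:
        if tok not in seen:
            seen.add(tok)
            extended.append(tok)
    return extended


def extend_embeddings(
    ckpt_emb: Sequence[Sequence[float]],
    extended_size: int,
    init_row: Callable[[], List[float]],
) -> List[List[float]]:
    """Copy checkpoint rows into positions [0, len(ckpt_emb)); initialise the rest."""
    rows = [list(row) for row in ckpt_emb]
    while len(rows) < extended_size:
        rows.append(init_row())
    return rows
```

Since old tokens never change index, row i of the checkpoint's embedding matrix still belongs to token i of the extended vocabulary, and only the appended rows start from a fresh initialisation.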
opened by anderleich 33

Option -gpuid not working as it should

When I use -gpuid option in current master of OpenNMT, regardless of what gpu I choose, the training script always chooses gpu 0. If I use multigpu, for example, -gpuid 2 3, the training goes to gpus 0 and 1.

Is this a known issue?
type:bug contributions welcome

opened by goncalomcorreia 27

Upgrade to torchtext 0.3

pytorch: 0.4 torchtext: 0.3

Hello, I've got one error when doing the first preprocessing step from here.

Traceback (most recent call last):
  File "preprocess.py", line 204, in <module>
    main()
  File "preprocess.py", line 191, in main
    fields = onmt.io.get_fields(opt.data_type, src_nfeats, tgt_nfeats)
  File "/home/lr/yukun/OpenNMT-py/onmt/io/IO.py", line 44, in get_fields
    return TextDataset.get_fields(n_src_features, n_tgt_features)
  File "/home/lr/yukun/OpenNMT-py/onmt/io/TextDataset.py", line 229, in get_fields
    postprocessing=make_src, sequential=False)
TypeError: __init__() got an unexpected keyword argument 'tensor_type'

It seems that torchtext 0.3 has changed the parameters of torchtext.data.Field().

With torchtext 0.2.3, it works well except for the following warning:

xxx/torchtext/data/field.py:321: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.

It seems that the new version of torchtext has fixed this from here. I am new to pytorch and opennmt, and I don't know whether this warning matters.

type:feature

opened by yukunfeng 27

Update transformer based on opennmt-tf
forward() in Transformer is not compatible with Translator.py (line 153), which requires context_lengths.

Transformer actually doesn't converge using default options. (PPL is just ridiculous)

Too many parameters are reported.

Some of these are already reported in other threads. It can hardly be used in its current state.
type:bug
opened by helson73 27

Massive memory use because entire file is slurped in IO.py

Is there a strong reason for slurping the entire source file in translate.py? On a source file of 2M single-character-per-token sentences, memory usage exceeds 256 GB. I figure that since evaluation is sentence-by-sentence (or batch-by-batch), this can be avoided.

Lab-mates and I have written an itertools.islice-based workaround. I'm happy to convert it into a PR if I have more information about the design goals and decisions, so I don't step on toes.
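The workaround itself is not shown in the issue, so the following is only a guess at its general shape: a generator that uses itertools.islice to pull one batch of lines at a time from an open file, which keeps memory use proportional to the batch size rather than the file size. `read_batches` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size lines without loading the whole input."""
    iterator = iter(lines)
    while True:
        batch = [line.rstrip("\n") for line in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


# Usage with a file handle, which Python already iterates lazily line by line:
# with open("src-test.txt", encoding="utf-8") as f:
#     for batch in read_batches(f, batch_size=30):
#         ...  # translate this batch, then move on to the next one
```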
type:bug

opened by aryamccarthy 27

Fixed embeddings

This pull request adds support for fixed word embeddings (that is, embeddings that are not updated during training) on the source or target side. In order to make this go more smoothly, I also refactored Embeddings: the constructor now takes the parameters it needs explicitly, instead of just taking the relevant dictionaries and opt. If opt were passed, the model would somehow need to decide whether -fix_word_vecs_enc or -fix_word_vecs_dec is relevant. This refactoring also allows pretrained embeddings to be loaded when the object is instantiated, instead of later on in train.py. Future work could also allow pretrained and/or fixed feature vectors.

opened by bpopeters 27

TransformerDecoder class layer cache

In v3.0, the layer cache in the TransformerDecoder class is no longer reset. This leads to an error when switching between training and inference. During inference, the keys, etc. of the last tokens are saved in the layer cache, but they need to be removed when switching back to training.

opened by MaxThFe 6

Error with train_from using multiple gpus

Hi,

When training from a checkpoint, I consistently get an error when using multiple GPUs. The error doesn't occur if I use a single GPU.

I'm using version 2.2.0 but have verified the same behaviour in 2.1.2 as well.

See below example:

/usr/bin/python3 -m onmt.bin.train -config <path_to_train.yaml> --tensorboard --tensorboard_log_dir <path_to_log_dir> --world_size 2 --gpu_ranks 0 1 --train_from <path_to_checkpoint>
[2022-12-06 15:43:17,044 INFO] Missing transforms field for valid data, set to default: [].
[2022-12-06 15:43:17,048 INFO] Parsed 3 corpora from -data.
[2022-12-06 15:43:17,049 INFO] Loading checkpoint from en_es_step_285000.pt
[2022-12-06 15:43:19,364 INFO] Loading fields from checkpoint...
[2022-12-06 15:43:19,364 INFO]  * src vocab size = 38000
[2022-12-06 15:43:19,365 INFO]  * tgt vocab size = 38000
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
  len(cache))
Bus error (core dumped)

The training will succeed either by using only one GPU, or by removing the train_from argument altogether.

Thanks, Rob

opened by robertBrnnn 3

Handle bad lines in vocab file gracefully
If a line in the vocab file doesn't have a token and a frequency correctly separated by whitespace, skip the line instead of failing.

Example vocab file with broken last line:

В 118
ισχύουν 118
ΓΙΚΑ 118
ÁČ 117
✓ 117
117
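A tolerant parser along these lines might look like the sketch below, which accepts only lines with exactly two whitespace-separated fields (a token and an integer frequency) and counts everything else as skipped. `parse_vocab` is a hypothetical function written for illustration; it is not the code from the linked commit.

```python
from typing import Iterable, List, Tuple


def parse_vocab(lines: Iterable[str]) -> Tuple[List[Tuple[str, int]], int]:
    """Return (entries, n_skipped); a valid line is 'token<whitespace>count'."""
    entries: List[Tuple[str, int]] = []
    skipped = 0
    for line in lines:
        parts = line.split()
        try:
            if len(parts) != 2:
                raise ValueError("expected exactly two fields")
            entries.append((parts[0], int(parts[1])))
        except ValueError:
            # Missing token, missing count, or a non-integer count: skip the line.
            skipped += 1
    return entries, skipped
```

Returning the number of skipped lines lets the caller log a warning, so a badly damaged vocab file does not go unnoticed.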

https://github.com/OpenNMT/OpenNMT-py/issues/2242

https://github.com/OpenNMT/OpenNMT-py/commit/baa3d73f645f10e8f3a94b334fc217afdfc4e550
opened by argosopentech 3

[WIP] Support target features
This PR intends to add target features support to OpenNMT-py v3.0. All the code has been adapted for this new version.

Both source and target features support has been refactored for a more simplified handling of features. The way features are passed to the system has been changed and now features are appended to the actual textual data instead of providing a separate file. This also simplifies the way features are passed during inference and to the server. It uses the special character ￨ as a feature separator, as in the previous versions of the OpenNMT framework. For instance:

I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1

I've also added a way to provide default values for features. This can be really useful when mixing task specific data (with features) with general data which has not been annotated. Additionally, the filterfeats transform is no longer required and features are checked in the corpus loading process.
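To illustrate the inline format and the idea of default values, the sketch below splits a line such as the example above into a token sequence plus one parallel stream per feature. It is a simplified stand-in, not the PR's implementation: `split_features` is a hypothetical helper, and applying the defaults only to tokens that carry no features at all is an assumption.

```python
from typing import List, Optional, Tuple

SEP = "￨"  # feature separator used in the example above


def split_features(line: str, n_feats: int, defaults: Optional[str] = None) -> Tuple[List[str], List[List[str]]]:
    """Split 'tok￨f1￨f2 ...' into tokens and n_feats parallel feature streams.

    Tokens without any feature fall back to `defaults` (e.g. "0￨1") when given.
    """
    default_vals = defaults.split(SEP) if defaults is not None else None
    tokens: List[str] = []
    feats: List[List[str]] = [[] for _ in range(n_feats)]
    for item in line.split():
        tok, *vals = item.split(SEP)
        if not vals and default_vals is not None:
            vals = default_vals
        if len(vals) != n_feats:
            raise ValueError(f"expected {n_feats} features for {tok!r}, got {len(vals)}")
        tokens.append(tok)
        for stream, val in zip(feats, vals):
            stream.append(val)
    return tokens, feats
```

This is what makes it possible to mix annotated and unannotated corpora: plain text simply receives the default feature values.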

A YAML configuration file would look like this:

data:
    train:
        path_src: src_with_features.txt # I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1
        path_tgt: tgt_with_features.txt # Me￨1 gusta￨0 comer￨0 pizza￨0
        transforms: [onmt_tokenize, inferfeats, filtertoolong]
    valid:
        path_src: src_with_features.txt
        path_tgt: tgt_with_features.txt
        transforms: [onmt_tokenize, inferfeats]

save_data: ./data
n_sample: -1

# Vocab opts
src_vocab: data.vocab.src
tgt_vocab: data.vocab.tgt

n_src_feats: 2
n_tgt_feats: 1
src_feats_defaults: "0￨1"
tgt_feats_defaults: "1"
feat_merge: "sum"

For now, I've made the necessary changes in the code for vocabulary generation. That is, to make onmt_build_vocab work.

@vince62s, do you think this is a good starting point?
opened by anderleich 2

Allow overriding the checkpoint's dropout settings

This PR provides the ability to override the checkpoint's dropout settings (dropout, attention_dropout) when finetuning a Transformer model. This is not possible at the moment because the checkpoint's model_opts override those set in the config. A new flag, override_opts, is added; the overriding can be disabled with --override_opts False. See the following discussion about this topic: https://forum.opennmt.net/t/tranformer-change-dropout-when-finetuning/4625

opened by l-k-11235 0

Translation outputs differ with different batch sizes
Hi,

I've recently noticed that I get slightly different translation results when translating with different batch sizes. I guess this is not expected...

For example,

Batch size 100 --> PRED SCORE: -6.0490, PRED PPL: 423.68 NB SENTENCES: 5000
Batch size 150 --> PRED SCORE: -5.2232, PRED PPL: 185.54 NB SENTENCES: 5000
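As a side note on reading these numbers: in both runs the reported perplexity is close to exp(-score), which suggests that the score is an average log-probability per token. That interpretation is an assumption based only on the two figures quoted above, not on the logging code. The snippet below simply redoes the arithmetic.

```python
import math

# Values copied from the two runs reported above: batch size -> (score, perplexity).
runs = {100: (-6.0490, 423.68), 150: (-5.2232, 185.54)}

for batch_size, (pred_score, reported_ppl) in runs.items():
    ppl = math.exp(-pred_score)  # perplexity implied by the score
    print(f"batch size {batch_size}: exp(-score) = {ppl:.2f}, reported PPL = {reported_ppl}")
```

Under that reading, the gap between the two runs is large: a difference of about 0.83 in average log-probability more than halves the perplexity.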

I'm using the latest version of OpenNMT-py

Thanks
opened by anderleich 4

Releases(v3.0.3)

v3.0.3(Dec 19, 2022)
fix loss normalization when using accum or nb GPU > 1

use native CrossEntropyLoss with Label Smoothing. reported loss/ppl impacted by LS

fix long-time coverage loss bug thanks Sanghyuk-Choi

fix detok at scoring / fix tokenization Subword_nmt + Sentencepiece

various small bugs fixed

Source code(tar.gz)
Source code(zip)
v3.0.2(Dec 7, 2022)
3.0.2 (2022-12-07)

pyonmttok.Vocab is now picklable. dataloader switched to spawn. (MacOS/Windows compatible)

fix scoring with specific metrics (BLEU, TER)

fix tensorboard logging

fix dedup in batch iterator (only for TRAIN, was happening at inference also)

New: Change: tgt_prefix renamed to tgt_file_prefix

New: tgt_prefix / src_prefix used for "prefix" Transform (onmt/transforms/misc.py)

New: process transforms of buckets in batches (vs per example) / faster

v3.0.1(Nov 23, 2022)
fix dynamic scoring

reinstate apex.amp level O1/O2 for benchmarking

New: LM distillation for NMT training

New: bucket_size ramp-up to avoid slow start

fix special tokens order

remove Library and add link to Yasmin's Tuto

v3.0.0(Nov 3, 2022)
v3.0 !

Removed completely torchtext. Use Vocab object of pyonmttok instead

Dataloading changed accordingly with the use of pytorch Dataloader (num_workers)

queue_size / pool_factor no longer needed. bucket_size optimal value > 64K

options renamed: rnn_size => hidden_size (enc/dec_rnn_size => enc/dec_hid_size)

new tools/convertv2_v3.py to upgrade v2 models.pt

inference with length_penalty=avg is now the default

add_qkvbias (default false, but true for old model)

2.3.0(Sep 14, 2022)
New features

BLEU/TER (& custom) scoring during training and validation (#2198)

LM related tools (#2197)

Allow encoder/decoder freezing (#2176)

Dynamic data loading for inference (#2145)

Sentence-level scores at inference (#2196)

MBR and oracle reranking scoring tools (#2196)

Fixes and improvements

Updated beam exit condition (#2190)

Improve scores reporting (#2191)

Fix dropout scheduling (#2194)

Better catch CUDA ooms when training (#2195)

Fix source features support in inference and REST server (#2109)

Make REST server more flexible with dictionaries (#2104)

Fix target prefixing in LM decoding (#2099)

2.2.0(Sep 14, 2021)
New features

Support source features (thanks @anderleich !)

Fixes and improvements

Adaptations to relax torch version

Customizable transform statistics (#2059)

Adapt release code for ctranslate2 2.0

2.1.2(Apr 30, 2021)
Fixes and improvements

Fix update_vocab for LM (#2056)

2.1.1(Apr 30, 2021)
Fixes and improvements

Fix potential deadlock (b1a4615)

Add more CT2 conversion checks (e4ab06c)

2.1.0(Apr 16, 2021)
New features

Allow vocab update when training from a checkpoint (cec3cc8, 2f70dfc)

Fixes and improvements

Various transforms related bug fixes

Fix beam warning and buffers reuse

Handle invalid lines in vocab file gracefully

2.0.1(Jan 27, 2021)
Fixes and improvements

Support embedding layer for larger vocabularies with GGNN (e8065b7)

Reorganize some inference options (9fb5f30)

2.0.0(Jan 20, 2021)
First official release of the OpenNMT-py major update to 2.0!

New features

Language Model (GPT-2 style) training and inference

Nucleus (top-p) sampling decoding

Fixes and improvements

Fix some BART default values

2.0.0rc2(Nov 10, 2020)
Fixes and improvements

Parallelize onmt_build_vocab (422d824)

Some fixes to the on-the-fly transforms

Some CTranslate2 related updates

Some fixes to the docs

This will be the first release to be automatically deployed via GitHub Actions.
2.0.0rc1(Sep 25, 2020)
This is the first release candidate for the OpenNMT-py major update to 2.0.0!

The major idea behind this release is the almost complete makeover of the data loading pipeline. A new 'dynamic' paradigm is introduced, which allows transforms to be applied to the data on the fly.

This has a few advantages, amongst which:

remove or drastically reduce the preprocessing required to train a model;

increase and simplify the possibilities of data augmentation and manipulation through on-the-fly transforms.

These transforms can be specific tokenization methods, filters, noising, or any custom transform users may want to implement. Custom transform implementation is quite straightforward thanks to the existing base class and example implementations.

You can check out how to use this new data loading pipeline in the updated docs and examples.

All the readily available transforms are described here.

Performance

Given sufficient CPU resources relative to the GPU computing power, most of the transforms should not slow training down. (Note: for now, one producer process is spawned per GPU, meaning you would ideally need 2N CPU threads for N GPUs.)

Breaking changes

A few features are dropped, at least for now:

audio, image and video inputs;

source word features.

Some very old checkpoints with previous fields and vocab structure are also incompatible with this new version.

For any user who still needs some of these features, the previous codebase will be retained as legacy in a separate branch. It will no longer receive extensive development from the core team but PRs may still be accepted.
1.2.0(Aug 17, 2020)
Fixes and improvements

Support pytorch 1.6 (e813f4d, eaaae6a)

Support official torch 1.6 AMP for mixed precision training (2ac1ed0)

Flag to override batch_size_multiple in FP16 mode, useful in some memory constrained setups (23e5018)

Pass a dict and allow custom options in preprocess/postprocess functions of REST server (41f0c02, 8ec54d2)

Allow different tokenization for source and target in REST server (bb2d045, 4659170)

Various bug fixes

New features

Gated Graph Sequence Neural Networks encoder (11e8d0), thanks @SteveKommrusch

Decoding with a target prefix (95aeefb, 0e143ff, 91ab592), thanks @Zenglinxiao

1.1.1(Mar 20, 2020)
Fixes and improvements

Fix backcompatibility when no 'corpus_id' field (c313c28)

1.1.0(Mar 19, 2020)
New features

Support CTranslate2 models in REST server (91d5d57)

Extend support for custom preprocessing/postprocessing function in REST server by using return dictionaries (d14613d, 9619ac3, 92a7ba5)

Experimental: BART-like source noising (5940dcf)

Fixes and improvements

Add options to CTranslate2 release (e442f3f)

Fix dataset shard order (458fc48)

Rotate only the server logs, not training (189583a)

Fix alignment error with empty prediction (91287eb)

1.0.2(Mar 5, 2020)
Fixes and improvements

Enable CTranslate2 conversion of Transformers with relative position (db11135)

Adapt -replace_unk to use with learned alignments if they exist (7625b53)

1.0.1(Feb 17, 2020)
Fixes and improvements

Ctranslate2 conversion handled in release script (1b50e0c)

Use attention_dropout properly in MHA (f5c9cd4)

Update apex FP16_Optimizer path (d3e2268)

Some REST server optimizations

Fix and add some docs

1.0.0(Dec 13, 2019)
New features

Implementation of "Jointly Learning to Align & Translate with Transformer" (@Zenglinxiao)

Fixes and improvements

Add nbest support to REST server (@Zenglinxiao)

Merge greedy and beam search codepaths (@Zenglinxiao)

Fix "block ngram repeats" (@KaijuML, @pltrdy)

Small fixes, some more docs
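For readers unfamiliar with the "block ngram repeats" option fixed in this release: the idea is to forbid any candidate token that would complete an n-gram already present in the hypothesis. The sketch below shows that check on plain token lists. It is an illustration only; the function names are hypothetical, and the real implementation operates on batched beam-search state.

```python
from typing import List, Sequence


def would_repeat_ngram(tokens: Sequence[str], candidate: str, n: int) -> bool:
    """Return True if appending candidate creates an n-gram already in tokens."""
    if n <= 0 or len(tokens) < n - 1:
        return False
    # The n-gram that would be formed by the last n-1 tokens plus the candidate.
    new_ngram = tuple(tokens[len(tokens) - (n - 1):]) + (candidate,)
    # All n-grams already present in the hypothesis.
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen


def allowed_candidates(tokens: Sequence[str], vocab: Sequence[str], n: int) -> List[str]:
    """Filter vocab down to the tokens that do not repeat an n-gram."""
    return [w for w in vocab if not would_repeat_ngram(tokens, w, n)]


hyp = ["a", "b", "c", "a", "b"]
print(would_repeat_ngram(hyp, "c", 3))                 # True: "a b c" already occurred
print(allowed_candidates(hyp, ["a", "b", "c", "d"], 3))  # ['a', 'b', 'd']
```

In a real decoder the blocked candidates would typically have their scores set to a very low value rather than being removed from a list.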

1.0.0.rc1(Oct 1, 2019)
We have now reached some good stability of the code base.

This is the 1.0.0 release candidate.

Fix Apex / FP16 training (Apex new API is buggy)

Multithreaded preprocessing, much faster (thanks François Hernandez)

Pip Installation v1.0.0.rc1 (thanks Paul Tardy)

Enjoy and feel free to report issues.
0.9.2(Sep 5, 2019)
Switch to PyTorch 1.2

Pre/post processing on the translation server (useful for Chinese), thanks @Zenglinxiao

Option to remove the FFN layer in AAN, plus AAN optimization (faster)

Coverage loss implementation (per the Abisee 2017 paper), thanks @pltrdy

Video Captioning task: Thanks @flauted !

Token batch at inference

Small fixes and add-ons
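The coverage loss mentioned in this release penalizes attending again to source positions that were already attended to. In the 2017 pointer-generator paper, the loss at decoder step t is the sum over source positions i of min(a_i^t, c_i^t), where c^t is the sum of all attention distributions from earlier steps. A minimal plain-Python sketch of that formula follows; the function name and the list-of-lists input are assumptions for illustration, and the library computes this on tensors.

```python
from typing import List, Sequence


def coverage_loss(attentions: Sequence[Sequence[float]]) -> List[float]:
    """Per-step coverage loss: sum_i min(a_i^t, c_i^t).

    attentions[t][i] is the attention on source position i at decoder step t.
    The coverage vector c^t accumulates attention from steps before t.
    """
    if not attentions:
        return []
    coverage = [0.0] * len(attentions[0])
    losses: List[float] = []
    for attn in attentions:
        losses.append(sum(min(a, c) for a, c in zip(attn, coverage)))
        # Update coverage only after the loss for this step is computed.
        coverage = [c + a for c, a in zip(coverage, attn)]
    return losses


# Step 1 attends to position 0 again, so it is penalized; step 2 moves on.
print(coverage_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))  # [0.0, 1.0, 0.0]
```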

0.9.1(Jun 13, 2019)
New mechanism for multi-GPU training, "1 batch producer / multi batch consumers", resulting in big memory savings when handling huge datasets, thanks @pltrdy @francoishernandez

New APEX AMP (mixed precision) API, thanks @francoishernandez. NB: you need to reinstall Nvidia/Apex

Option to overwrite shards when preprocessing

Small fixes and add-ons

0.9.0(May 16, 2019)
Updated Travis to PyTorch 1.1

Faster vocab building when processing shards (no reloading), thanks @francoishernandez

New data weighting feature, thanks @francoishernandez; see the FAQ doc for more information

New dropout scheduler, with the same logic as accum_count / accum_steps; see opts.py

Fix Gold Scores

Small fixes and add-ons.

Unrelated, but new website online ! thanks @guillaumekln

Enjoy !
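The dropout scheduler in this release pairs a list of values with a list of training steps, in the same way accum_count is paired with accum_steps. One way to read such a schedule is sketched below: each value applies from its step threshold until the next one. The function and argument names are hypothetical, and the exact boundary handling (inclusive thresholds here) is an assumption; opts.py and the trainer remain the reference for the real behavior.

```python
from bisect import bisect_right
from typing import Sequence


def scheduled_value(step: int, values: Sequence[float], steps: Sequence[int]) -> float:
    """Return the value in effect at a given training step.

    values[i] applies from steps[i] (inclusive, an assumption of this
    sketch) until the next threshold is reached.
    """
    if len(values) != len(steps):
        raise ValueError("values and steps must have the same length")
    if not steps or list(steps) != sorted(steps):
        raise ValueError("steps must be a non-empty, ascending sequence")
    idx = bisect_right(steps, step) - 1
    # Before the first threshold, fall back to the first value.
    return values[max(idx, 0)]


dropout_values = [0.3, 0.1]   # hypothetical schedule
dropout_starts = [0, 10000]

print(scheduled_value(0, dropout_values, dropout_starts))      # 0.3
print(scheduled_value(9999, dropout_values, dropout_starts))   # 0.3
print(scheduled_value(10000, dropout_values, dropout_starts))  # 0.1
```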
0.8.2(Feb 17, 2019)
Update documentation and Library example (thanks @flauted @elisemicho )

Revamp args

Bug fixes, save moving average in FP32 (thanks @francoishernandez )

Allow FP32 inference for FP16 models

0.8.1(Feb 12, 2019)

Mostly bug fixes.
0.8.0(Feb 9, 2019)

Many fixes and code cleaning thanks @flauted, @guillaumekln

Datasets code refactor (thanks @flauted); you need to re-preprocess datasets

New features

FP16 Support: Experimental, using Apex, Checkpoints may break in future version.

Continuous exponential moving average (thanks @francoishernandez, and Marian)

Relative positions encoding (thanks @francoishernandez, and Google T2T)

Deprecate the old beam search, fast batched beam search supports all options
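An exponential moving average of model parameters keeps a smoothed copy of the weights that is updated after every training step and can be used at save or inference time. The sketch below shows the basic update rule on plain Python lists. It assumes a fixed decay, which may differ from what the library does, and `ema_update` is a hypothetical name.

```python
from typing import List


def ema_update(average: List[float], params: List[float], decay: float) -> List[float]:
    """One EMA step: average <- decay * average + (1 - decay) * params."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    if len(average) != len(params):
        raise ValueError("average and params must have the same length")
    return [decay * a + (1.0 - decay) * p for a, p in zip(average, params)]


# Two training steps with (toy) parameters that stay at 1.0.
average = [0.0, 1.0]
for params in ([1.0, 1.0], [1.0, 1.0]):
    average = ema_update(average, params, decay=0.5)

print(average)  # [0.75, 1.0]
```

A decay close to 1 makes the average change slowly; a decay of 0 simply copies the current parameters.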
0.7.2(Jan 31, 2019)

Multi-level text fields for better handling of embeddings, thanks @flauted

Code cleaning and bug fixing, thanks @bpopeters @guillaumekln @pltrdy

NB: you cannot train on 0.7.2 with data preprocessed on a prior version; you need to re-preprocess.
0.7.1(Jan 24, 2019)

Many fixes and code refactoring thanks @bpopeters, @flauted, @guillaumekln

New features

Random sampling thanks @daphnei

Enable sharding for huge files at translation
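Random sampling replaces the deterministic "pick the best token" step with a draw from the model's output distribution, usually restricted to the k most likely tokens and sharpened or flattened by a temperature. The following is a minimal stdlib sketch of top-k sampling with temperature. It is not the library's code, and `sample_top_k` and its parameters are hypothetical names.

```python
import math
import random
from typing import Optional, Sequence


def sample_top_k(
    logits: Sequence[float],
    k: int,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from the k highest-scoring logits."""
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    if rng is None:
        rng = random.Random()
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.0, -1.0]
print(sample_top_k(logits, k=1))  # 0, since k=1 reduces to greedy choice
print(sample_top_k(logits, k=2, rng=random.Random(0)) in (0, 2))  # True
```

With k=1 the sampler always returns the argmax; larger k and higher temperature increase output diversity.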
0.7.0(Jan 2, 2019)

Many fixes and code refactoring thanks @benopeters

Migrated to PyTorch 1.0
0.6.0(Nov 28, 2018)

Mostly fixes and code improvements.

New: yml config files. See the config folder

Owner

OpenNMT

Open source ecosystem for neural machine translation and neural sequence learning

GitHub Repository https://opennmt.net/

