Training open neural machine translation models

Last update: Jan 03, 2023

Overview

Train Opus-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. Also, the targets require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
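The three steps above can also be chained so that a failing step stops the run. The following is only a minimal sketch: the function name `run_pipeline` and the `MAKE_CMD` variable are hypothetical helpers introduced for illustration and are not part of the OPUS-MT-train Makefile. It assumes that each step should run only if the previous one succeeded.

```shell
#!/bin/sh
# Hypothetical wrapper around the quickstart commands (not part of OPUS-MT-train).
# Language sets are the same as in the example above.
SRCLANGS="fi et"
TRGLANGS="da sv en"

# MAKE_CMD is an assumption for illustration: it lets you substitute
# another command (e.g. "echo") to preview the calls without running them.
MAKE_CMD="${MAKE_CMD:-make}"

run_pipeline() {
    # Run train, eval and release in order; stop at the first failure.
    for target in train eval release; do
        "$MAKE_CMD" SRCLANGS="$SRCLANGS" TRGLANGS="$TRGLANGS" "$target" || return 1
    done
}
```

Calling `run_pipeline` from the repository root runs the three targets in sequence. Setting `MAKE_CMD=echo` beforehand prints the commands instead of executing them.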

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software, including:

GNU/Linux tools
Marian-NMT
eflomal

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki, CSC (IT Center for Science), the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to the open collection of parallel corpora OPUS.


Owner

Language Technology at the University of Helsinki
