Translate - a PyTorch Language Library

Overview

NOTE

PyTorch Translate is now deprecated, please use fairseq instead.


Translate - a PyTorch Language Library

Translate is a library for machine translation written in PyTorch. It provides training for sequence-to-sequence models. Translate relies on fairseq, a general sequence-to-sequence library, which means that models implemented in both Translate and Fairseq can be trained. Translate also provides the ability to export some models to Caffe2 graphs via ONNX and to load and run these models from C++ for production purposes. Currently, we export components (encoder, decoder) to Caffe2 separately and beam search is implemented in C++. In the near future, we will be able to export the beam search as well. We also plan to add export support to more models.

Quickstart

If you are just interested in training/evaluating MT models, and not in exporting the models to Caffe2 via ONNX, you can install Translate for Python 3 by following these few steps:

  1. Install pytorch
  2. Install fairseq
  3. Clone this repository git clone https://github.com/pytorch/translate.git pytorch-translate && cd pytorch-translate
  4. Run python setup.py install

Provided you have CUDA installed you should be good to go.

Requirements and Full Installation

Translate Requires:

  • A Linux operating system with a CUDA compatible card
  • GNU C++ compiler version 4.9.2 and above
  • A CUDA installation. We recommend CUDA 8.0 or CUDA 9.0

Use Our Docker Image:

Install Docker and nvidia-docker, then run

sudo docker pull pytorch/translate
sudo nvidia-docker run -i -t --rm pytorch/translate /bin/bash
. ~/miniconda/bin/activate
cd ~/translate

You should now be able to run the sample commands in the Usage Examples section below. You can also see the available image versions under https://hub.docker.com/r/pytorch/translate/tags/.

Install Translate from Source:

These instructions were mainly tested on Ubuntu 16.04.5 LTS (Xenial Xerus) with a Tesla M60 card and a CUDA 9 installation. We highly encourage you to report an issue if you are unable to install this project for your specific configuration.

  • If you don't already have an existing Anaconda environment with Python 3.6, you can install one via Miniconda3:

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    chmod +x miniconda.sh
    ./miniconda.sh -b -p ~/miniconda
    rm miniconda.sh
    . ~/miniconda/bin/activate
    
  • Clone the Translate repo:

    git clone https://github.com/pytorch/translate.git
    pushd translate
    
  • Install the PyTorch conda package:

    # Set to 8 or 9 depending on your CUDA version.
    TMP_CUDA_VERSION="9"
    
    # Uninstall previous versions of PyTorch. Doing this twice is intentional.
    # Error messages about torch not being installed are benign.
    pip uninstall -y torch
    pip uninstall -y torch
    
    # This may not be necessary if you already have the latest cuDNN library.
    conda install -y cudnn
    
    # Add LAPACK support for the GPU.
    conda install -y -c pytorch "magma-cuda${TMP_CUDA_VERSION}0"
    
    # Install the combined PyTorch nightly conda package.
    conda install pytorch-nightly cudatoolkit=${TMP_CUDA_VERSION}.0 -c pytorch
    
    # Install NCCL2.
    wget "https://s3.amazonaws.com/pytorch/nccl_2.1.15-1%2Bcuda${TMP_CUDA_VERSION}.0_x86_64.txz"
    TMP_NCCL_VERSION="nccl_2.1.15-1+cuda${TMP_CUDA_VERSION}.0_x86_64"
    tar -xvf "${TMP_NCCL_VERSION}.txz"
    rm "${TMP_NCCL_VERSION}.txz"
    
    # Set some environmental variables needed to link libraries correctly.
    export CONDA_PATH="$(dirname $(which conda))/.."
    export NCCL_ROOT_DIR="$(pwd)/${TMP_NCCL_VERSION}"
    export LD_LIBRARY_PATH="${CONDA_PATH}/lib:${NCCL_ROOT_DIR}/lib:${LD_LIBRARY_PATH}"
    
  • Install ONNX:

    git clone --recursive https://github.com/onnx/onnx.git
    yes | pip install ./onnx 2>&1 | tee ONNX_OUT
    

If you get a Protobuf compiler not found error, you need to install it:

conda install -c anaconda protobuf

Then, try to install ONNX again:

yes | pip install ./onnx 2>&1 | tee ONNX_OUT
  • Build Translate:

    pip uninstall -y pytorch-translate
    python3 setup.py build develop
    

Now you should be able to run the example scripts below!

Usage Examples

Note: the example commands given assume that you are the root of the cloned GitHub repository or that you're in the translate directory of the Docker or Amazon image. You may also need to make sure you have the Anaconda environment activated.

Training

We provide an example script to train a model for the IWSLT 2014 German-English task. We used this command to obtain a pretrained model:

bash pytorch_translate/examples/train_iwslt14.sh

The pretrained model actually contains two checkpoints that correspond to training twice with random initialization of the parameters. This is useful to obtain ensembles. This dataset is relatively small (~160K sentence pairs), so training will complete in a few hours on a single GPU.

Training with tensorboard visualization

We provide support for visualizing training stats with tensorboard. As a dependency, you will need tensorboard_logger installed.

pip install tensorboard_logger

Please also make sure that tensorboard is installed. It also comes with tensorflow installation.

You can use the above example script to train with tensorboard, but need to change line 10 from :

CUDA_VISIBLE_DEVICES=0 python3 pytorch_translate/train.py

to

CUDA_VISIBLE_DEVICES=0 python3 pytorch_translate/train_with_tensorboard.py

The event log directory for tensorboard can be specified by option --tensorboard_dir with a default value: run-1234. This directory is appended to your --save_dir argument.

For example in the above script, you can visualize with:

tensorboard --logdir checkpoints/runs/run-1234

Multiple runs can be compared by specifying different --tensorboard_dir. i.e. run-1234 and run-2345. Then

tensorboard --logdir checkpoints/runs

can visualize stats from both runs.

Pretrained Model

A pretrained model for IWSLT 2014 can be evaluated by running the example script:

bash pytorch_translate/examples/generate_iwslt14.sh

Note the improvement in performance when using an ensemble of size 2 instead of a single model.

Exporting a Model with ONNX

We provide an example script to export a PyTorch model to a Caffe2 graph via ONNX:

bash pytorch_translate/examples/export_iwslt14.sh

This will output two files, encoder.pb and decoder.pb, that correspond to the computation of the encoder and one step of the decoder. The example exports a single checkpoint (--checkpoint model/averaged_checkpoint_best_0.pt but is also possible to export an ensemble (--checkpoint model/averaged_checkpoint_best_0.pt --checkpoint model/averaged_checkpoint_best_1.pt). Note that during export, you can also control a few hyperparameters such as beam search size, word and UNK rewards.

Using the Model

To use the sample exported Caffe2 model to translate sentences, run:

echo "hallo welt" | bash pytorch_translate/examples/translate_iwslt14.sh

Note that the model takes in BPE inputs, so some input words need to be split into multiple tokens. For instance, "hineinstopfen" is represented as "hinein@@ stop@@ fen".

PyTorch Translate Research

We welcome you to explore the models we have in the pytorch_translate/research folder. If you use them and encounter any errors, please paste logs and a command that we can use to reproduce the error. Feel free to contribute any bugfixes or report your experience, but keep in mind that these models are a work in progress and thus are currently unsupported.

Join the Translate Community

We welcome contributions! See the CONTRIBUTING.md file for how to help out.

License

Translate is BSD-licensed, as found in the LICENSE file.

A collection of GNN-based fake news detection models.

This repo includes the Pytorch-Geometric implementation of a series of Graph Neural Network (GNN) based fake news detection models. All GNN models are implemented and evaluated under the User Prefere

SafeGraph 251 Jan 01, 2023
✨Fast Coreference Resolution in spaCy with Neural Networks

✨ NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks. NeuralCoref is a pipeline extension for spaCy 2.1+ which annotates and resolv

Hugging Face 2.6k Jan 04, 2023
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classifi

186 Dec 24, 2022
Common Voice Dataset explorer

Common Voice Dataset Explorer Common Voice Dataset is by Mozilla Made during huggingface finetuning week Usage pip install -r requirements.txt streaml

Ceyda Cinarel 22 Nov 16, 2022
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

THUNLP 3.9k Jan 03, 2023
Python library for Serbian Natural language processing (NLP)

SrbAI - Python biblioteka za procesiranje srpskog jezika SrbAI je projekat prikupljanja algoritama i modela za procesiranje srpskog jezika u jedinstve

Serbian AI Society 3 Nov 22, 2022
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
Repositório da disciplina no semestre 2021-2

Avisos! Nenhum aviso! Compiladores 1 Este é o Git da disciplina Compiladores 1. Aqui ficará o material produzido em sala de aula assim como tarefas, w

6 May 13, 2022
Beyond the Imitation Game collaborative benchmark for enormous language models

BIG-bench 🪑 The Beyond the Imitation Game Benchmark (BIG-bench) will be a collaborative benchmark intended to probe large language models, and extrap

Google 1.3k Jan 01, 2023
Linear programming solver for paper-reviewer matching and mind-matching

Paper-Reviewer Matcher A python package for paper-reviewer matching algorithm based on topic modeling and linear programming. The algorithm is impleme

Titipat Achakulvisut 66 Jul 05, 2022
Let Xiao Ai speakers control third-party devices

A stupid way to extend miot/xiaoai. Demo for Panasonic Bath Bully FV-RB20VL1 逆向 Panasonic Smart China,获得控制浴霸的请求信息(HTTP 请求),详见 apps/panasonic.py; 2. 通过

bin 14 Jul 07, 2022
Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

Maluuba Inc. 309 Oct 19, 2022
English loanwords in the world's languages

Wiktionary as CLDF Content cldf1 and cldf2 contain cldf-conform data sets with a total of 2 377 756 entries about the vocabulary of all 1403 languages

Viktor Martinović 3 Jan 14, 2022
Tokenizer - Module python d'analyse syntaxique et de grammaire, tokenization

Tokenizer Le Tokenizer est un analyseur lexicale, il permet, comme Flex and Yacc par exemple, de tokenizer du code, c'est à dire transformer du code e

Manolo 1 Aug 15, 2022
This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

private-transformers This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers. What is this? Why

Xuechen Li 73 Dec 28, 2022
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
Mysticbbs-rjam - rJAM splitscreen message reader for MysticBBS A46+

rJAM splitscreen message reader for MysticBBS A46+

Robbert Langezaal 4 Nov 22, 2022
Natural Language Processing Best Practices & Examples

NLP Best Practices In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive bus

Microsoft 6.1k Dec 31, 2022
Simple text to phones converter for multiple languages

Phonemizer -- foʊnmaɪzɚ The phonemizer allows simple phonemization of words and texts in many languages. Provides both the phonemize command-line tool

CoML 762 Dec 29, 2022