Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Overview

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

We provide reference implementations of various sequence modeling papers:

List of implemented papers

What's New:

Previous updates

Features:

We also provide pre-trained models for translation and language modeling with a convenient torch.hub interface:

import torch

en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model')
en2de.translate('Hello world', beam=5)
# 'Hallo Welt'

See the PyTorch Hub tutorials for translation and RoBERTa for more examples.
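
For example, RoBERTa checkpoints are exposed through the same torch.hub interface. The snippet below is a minimal sketch; the 'roberta.large' entry name and the encode/extract_features calls follow the RoBERTa example README, so treat the exact output shapes as illustrative:

import torch

# Load a pre-trained RoBERTa model and switch to evaluation mode (disables dropout)
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()

# Apply the model's BPE and extract final-layer features
tokens = roberta.encode('Hello world!')        # 1-D tensor of token ids, with BOS/EOS added
features = roberta.extract_features(tokens)    # shape: (batch, num_tokens, hidden_dim)
print(features.shape)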

Requirements and Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install fairseq and develop locally:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

# on macOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./

# to install the latest stable release (0.10.x)
# pip install fairseq
  • For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker, make sure to increase the shared memory size, either with --ipc=host or --shm-size as command-line options to nvidia-docker run.
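
After installing, a quick sanity check from Python can confirm that fairseq imports and that a GPU is visible for training. This is only a convenience sketch; it assumes the installed package exposes a __version__ attribute, as recent releases do:

import torch
import fairseq

print(fairseq.__version__)        # e.g. '0.10.2'
print(torch.cuda.is_available())  # True if an NVIDIA GPU and CUDA are usable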

Getting Started

The full documentation contains instructions for getting started, training new models and extending fairseq with new model types and tasks.
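
As a rough illustration of the extension mechanism described in the documentation, new models are added by registering a class with fairseq's model registry and pairing it with a named architecture. Everything below uses hypothetical names (my_toy_model, MyToyModel, --toy-hidden-dim) and omits the actual encoder/decoder construction, so it is a structural sketch rather than a working model; see the docs tutorial for a complete example:

from fairseq.models import (
    FairseqEncoderDecoderModel,
    register_model,
    register_model_architecture,
)

@register_model('my_toy_model')  # hypothetical model name
class MyToyModel(FairseqEncoderDecoderModel):
    @staticmethod
    def add_args(parser):
        # Declare model-specific command-line arguments
        parser.add_argument('--toy-hidden-dim', type=int, default=128)

    @classmethod
    def build_model(cls, args, task):
        # A real model would construct and return cls(encoder, decoder) here
        raise NotImplementedError('encoder/decoder construction omitted in this sketch')

@register_model_architecture('my_toy_model', 'my_toy_model_base')
def my_toy_model_base(args):
    # Fill in defaults for any arguments the user did not set
    args.toy_hidden_dim = getattr(args, 'toy_hidden_dim', 128)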

Pre-trained models and examples

We provide pre-trained models and pre-processed, binarized test sets for several tasks listed below, as well as example training and evaluation commands.
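
Pre-trained language models can likewise be used interactively through torch.hub for sampling and scoring. The following sketch assumes the 'transformer_lm.wmt19.en' checkpoint name from the language modeling example README and requires the sacremoses and fastBPE packages for the tokenizer/bpe options:

import torch

# Load an English language model (assumed hub entry name)
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
en_lm.eval()

# Sample a continuation with top-k sampling
print(en_lm.sample('Machine translation is', beam=1, sampling=True, sampling_topk=10, temperature=0.8))

# Score a sentence; positional_scores holds per-token log-probabilities
out = en_lm.score('Machine translation is a useful tool .')
print(out['positional_scores'].mean().neg().exp())  # rough per-token perplexity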

We also have more detailed READMEs to reproduce results from specific papers:

Join the fairseq community

License

fairseq(-py) is MIT-licensed. The license applies to the pre-trained models as well.

Citation

Please cite as:

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}

Releases (v0.10.2)
  • v0.10.2 (Jan 5, 2021)

  • v0.10.0 (Nov 12, 2020)

    It's been a long time since our last release (0.9.0) nearly a year ago! There have been numerous changes and new features added since then, which we've tried to summarize below. While this release carries the same major version as our previous release (0.x.x), if you have code that relies on 0.9.0, it is likely you'll need to adapt it before updating to 0.10.0.

    Looking forward, this will also be the last significant release with the 0.x.x numbering. The next release will be 1.0.0 and will include a major migration to the Hydra configuration system, with an eye towards modularizing fairseq to be more usable as a library.

    Changelog:

    New papers:

    Major new features:

    • TorchScript support for Transformer and SequenceGenerator (PyTorch 1.6+ only)
    • Model parallel training support (see Megatron-11b)
    • TPU support via --tpu and --bf16 options (775122950d145382146e9120308432a9faf9a9b8)
    • Added VizSeq (a visual analysis toolkit for evaluating fairseq models)
    • Migrated to Python logging (fb76dac1c4e314db75f9d7a03cb4871c532000cb)
    • Added “SlowMo” distributed training backend (0dac0ff3b1d18db4b6bb01eb0ea2822118c9dd13)
    • Added Optimizer State Sharding (ZeRO) (5d7ed6ab4f92d20ad10f8f792b8703e260a938ac)
    • Added several features to improve speech recognition support in fairseq: CTC criterion, external ASR decoder support (currently only wav2letter decoder) with KenLM and fairseq language model fusion

    Minor features:

    • Added --patience for early stopping
    • Added --shorten-method=[none|truncate|random_crop] to language modeling (and other) tasks
    • Added --eval-bleu for computing BLEU scores during training (60fbf64f302a825eee77637a0b7de54fde38fb2c)
    • Added support for training huggingface models (e.g. hf_gpt2) (2728f9b06d9a3808cc7ebc2afa1401eddef35e35)
    • Added FusedLAMB optimizer (--optimizer=lamb) (f75411af2690a54a5155871f3cf7ca1f6fa15391)
    • Added LSTM-based language model (lstm_lm) (9f4256edf60554afbcaadfa114525978c141f2bd)
    • Added dummy tasks and models for benchmarking (91f05347906e80e6705c141d4c9eb7398969a709; a541b19d853cf4a5209d3b8f77d5d1261554a1d9)
    • Added tutorial and pretrained models for paraphrasing (630701eaa750efda4f7aeb1a6d693eb5e690cab1)
    • Support quantization for Transformer (6379573c9e56620b6b4ddeb114b030a0568ce7fe)
    • Support multi-GPU validation in fairseq-validate (2f7e3f33235b787de2e34123d25f659e34a21558)
    • Support batched inference in hub interface (3b53962cd7a42d08bcc7c07f4f858b55bf9bbdad)
    • Support for language model fusion in standard beam search (5379461e613263911050a860b79accdf4d75fd37)

    Breaking changes:

    • Updated requirements to Python 3.6+ and PyTorch 1.5+
    • --max-sentences renamed to --batch-size
    • Main entry point scripts (eval_lm.py, generate.py, etc.) moved from the root directory into fairseq_cli
    • Changed format for generation output; H- now corresponds to tokenized system outputs and newly added D- lines correspond to detokenized outputs (f353913420b6ef8a31ecc55d2ec0c988178698e0)
    • We now log the stats from the log-interval (displayed as train_inner) instead of a rolling average over each epoch.
    • SequenceGenerator/Scorer does not print alignment by default, re-enable with --print-alignment
    • Print base 2 scores in generation scripts (660d69fd2bdc4c3468df7eb26b3bbd293c793f94)
    • Incremental decoding interface changed to use FairseqIncrementalState (4e48c4ae5da48a5f70c969c16793e55e12db3c81; 88185fcc3f32bd24f65875bd841166daa66ed301)
    • Refactor namespaces in Criterions to support library usage (introduce LegacyFairseqCriterion for BC) (46b773a393c423f653887c382e4d55e69627454d)
    • Deprecate FairseqCriterion::aggregate_logging_outputs interface, use FairseqCriterion::reduce_metrics instead (86793391e38bf88c119699bfb1993cb0a7a33968)
    • Moved fairseq.meters to fairseq.logging.meters and added new metrics aggregation module (fairseq.logging.metrics) (1e324a5bbe4b1f68f9dadf3592dab58a54a800a8; f8b795f427a39c19a6b7245be240680617156948)
    • Reset mid-epoch stats every log-interval steps (244835d811c2c66b1de2c5e86532bac41b154c1a)
    • Ignore duplicate entries in dictionary files (dict.txt) and support manual overwrite with #fairseq:overwrite option (dd1298e15fdbfc0c3639906eee9934968d63fc29; 937535dba036dc3759a5334ab5b8110febbe8e6e)
    • Use 1-based indexing for epochs everywhere (aa79bb9c37b27e3f84e7a4e182175d3b50a79041)

    Minor interface changes:

    • Added FairseqTask::begin_epoch hook (122fc1db49534a5ca295fcae1b362bbd6308c32f)
    • FairseqTask::build_generator interface changed (cd2555a429b5f17bc47260ac1aa61068d9a43db8)
    • Change RobertaModel base class to FairseqEncoder (307df5604131dc2b93cc0a08f7c98adbfae9d268)
    • Expose FairseqOptimizer.param_groups property (8340b2d78f2b40bc365862b24477a0190ad2e2c2)
    • Deprecate --fast-stat-sync and replace with FairseqCriterion::logging_outputs_can_be_summed interface (fe6c2edad0c1f9130847b9a19fbbef169529b500)
    • --raw-text and --lazy-load are fully deprecated; use --dataset-impl instead
    • Mixture of expert tasks moved to examples/ (8845dcf5ff43ca4d3e733ade62ceca52f1f1d634)

    Performance improvements:

    • Use cross entropy from apex for improved memory efficiency (5065077dfc1ec4da5246a6103858641bfe3c39eb)
    • Added buffered dataloading (--data-buffer-size) (411531734df8c7294e82c68e9d42177382f362ef)
  • v0.9.0 (Dec 4, 2019)

    Possibly breaking changes:

    • Set global numpy seed (4a7cd58)
    • Split in_proj_weight into separate k, v, q projections in MultiheadAttention (fdf4c3e)
    • TransformerEncoder returns namedtuples instead of dict (27568a7)

    New features:

    • Add --fast-stat-sync option (e1ba32a)
    • Add --empty-cache-freq option (315c463)
    • Support criterions with parameters (ba5f829)

    New papers:

    • Simple and Effective Noisy Channel Modeling for Neural Machine Translation (49177c9)
    • Levenshtein Transformer (86857a5, ...)
    • Cross+Self-Attention for Transformer Models (4ac2c5f)
    • Jointly Learning to Align and Translate with Transformer Models (1c66792)
    • Reducing Transformer Depth on Demand with Structured Dropout (dabbef4)
    • Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa) (e23e5ea)
    • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (a92bcda)
    • CamemBERT: a French BERT (b31849a)

    Speed improvements:

    • Add CUDA kernels for LightConv and DynamicConv (f840564)
    • Cythonization of various dataloading components (4fc3953, ...)
    • Don't project mask tokens for MLM training (718677e)
  • v0.8.0 (Aug 14, 2019)

    Changelog:

    • Relicensed under MIT license
    • Add RoBERTa
    • Add wav2vec
    • Add WMT'19 models
    • Add initial ASR code
    • Changed torch.hub interface (generate renamed to translate)
    • Add --tokenizer and --bpe
    • f812e52: Renamed data.transforms -> data.encoders
    • 654affc: New Dataset API (optional)
    • 47fd985: Deprecate old Masked LM components
    • 5f78106: Set mmap as default dataset format and infer format automatically
    • Misc fixes for sampling
    • Misc fixes to support PyTorch 1.2
  • v0.7.2 (Jul 19, 2019)

    No major API changes since the last release. Cutting a new release since we'll be merging significant (possibly breaking) changes to logging, data loading and the masked LM implementation soon.

  • v0.7.1 (Jun 20, 2019)

  • v0.7.0 (Jun 19, 2019)

    Notable (possibly breaking) changes:

    • d45db80: Move checkpoint utility functions from utils.py into checkpoint_utils.py
    • f2563c2: Move LM definitions into separate files
    • dffb167: Updates to model API:
      • FairseqModel -> FairseqEncoderDecoderModel
      • add FairseqDecoder.extract_features and FairseqDecoder.output_layer
      • encoder_out_dict -> encoder_out
      • rm unused remove_head functions
    • 34726d5: Move distributed_init into DistributedFairseqModel
    • cf17068: Simplify distributed launch by automatically launching multiprocessing on each node for all visible GPUs (allows launching just one job per node instead of one per GPU)
    • d45db80: Change default LR scheduler from reduce_lr_on_plateau to fixed
    • 96ac28d: Rename --sampling-temperature -> --temperature
    • fc1a19a: Deprecate dummy batches
    • a1c997b: Add memory mapped datasets
    • 0add50c: Allow cycling over multiple datasets, where each one becomes an "epoch"

    Plus many additional features and bugfixes

  • v0.6.2 (Mar 15, 2019)

    Changelog:

    • 998ba4f: Add language models from Baevski & Auli (2018)
    • 4294c4f: Add mixture of experts code from Shen et al. (2019)
    • 0049349: Add example for multilingual training
    • 48d9afb: Speed improvements, including fused operators from apex
    • 44d27e6: Add Tensorboard support
    • d17fa85: Add Adadelta optimizer
    • 9e1c880: Add FairseqEncoderModel
    • b65c579: Add FairseqTask.inference_step to modularize generate.py
    • 2ad1178: Add back --curriculum
    • Misc bug fixes and other features
  • v0.6.1 (Feb 9, 2019)

  • v0.6.0 (Sep 26, 2018)

    Changelog:

    • 4908863: Switch to DistributedDataParallelC10d and bump version 0.5.0 -> 0.6.0
      • no more FP16Trainer, we just have an FP16Optimizer wrapper
      • most of the distributed code is moved to a new wrapper class called DistributedFairseqModel, which behaves like DistributedDataParallel and a FairseqModel at the same time
      • Trainer now requires an extra dummy_batch argument at initialization, which we run forward/backward on when there's an uneven number of batches per worker. We hide the gradients from these dummy batches by multiplying the loss by 0
      • Trainer.train_step now takes a list of samples, which will allow cleaner --update-freq
    • 1c56b58: parallelize preprocessing
    • Misc bug fixes and features