This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers.

Overview

Improving Transformer Models by Reordering their Sublayers

This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers (video presentation here, summary here).

Our character-level model (and this repo) is based on the Adaptive Attention Span for Transformers model. In our paper we showed that by simply reordering that model's self-attention and feedforward sublayers, we could improve performance on the enwik8 benchmark (where we achieve 0.968 BPC on the test set).

The code here simply adds a way to reorder the sublayers of the Adaptive Span model, using the --architecture parameter.

If you use this code or results from our paper, please cite:

@inproceedings{press-etal-2020-improving,
    title = "Improving Transformer Models by Reordering their Sublayers",
    author = "Press, Ofir and Smith, Noah A. and Levy, Omer",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.270",
    doi = "10.18653/v1/2020.acl-main.270",
    pages = "2996--3005",
}

Requirements

You need CUDA 10 and PyTorch 1.2.0 to run this code. See this page for installation instructions. To replicate our experimental conditions eight V100 GPUs are needed.

Running experiments in the paper

The scripts for training the character-level models from the paper are located in the ./experiments/ directory. For example, to train the enwik8 model, run:

bash experiments/enwik8_large.sh

We used eight V100 GPUs, but if you'd like to run this model on GPUs with less memory you can increase the --batch-split (it splits batches into smaller pieces without changing the final result).

We obtained the following results in our experiments:

Experiment #params valid (bpc) test (bpc)
enwik8 Sandwich Transformer 209M 0.992 0.968
text8 Sandwich Transformer 209M 1.012 1.076

The --architecture parameter

A standard transformer with 3 layers (so 6 self-attention and feedforward sublayers) would use be trained using --architecture sfsfsf. That 6 sublayer model with a sandwiching coefficient of 1 would be --architecture s.sfsf.f and with a sandwiching coefficient of 2 would be --architecture s.s.sf.f.f. Make sure to also set the --nlayers parameter to be the length of the architecture string divided by 2.

License

The code is licensed under CC-BY-NC license. See the LICENSE file for more details.

Acknowledgements + More Information

This code is based on the code of the Adaptive Span model. We recommend reading the Adaptive Span README for further information on this codebase.

Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
Generating Korean Slogans with phonetic and structural repetition

LexPOS_ko Generating Korean Slogans with phonetic and structural repetition Generating Slogans with Linguistic Features LexPOS is a sequence-to-sequen

Yeoun Yi 3 May 23, 2022
The RWKV Language Model

RWKV-LM We propose the RWKV language model, with alternating time-mix and channel-mix layers: The R, K, V are generated by linear transforms of input,

PENG Bo 877 Jan 05, 2023
EasyTransfer is designed to make the development of transfer learning in NLP applications easier.

EasyTransfer is designed to make the development of transfer learning in NLP applications easier. The literature has witnessed the success of applying

Alibaba 819 Jan 03, 2023
Open solution to the Toxic Comment Classification Challenge

Starter code: Kaggle Toxic Comment Classification Challenge More competitions šŸŽ‡ Check collection of public projects šŸŽ , where you can find multiple

minerva.ml 153 Jun 22, 2022
A python package for deep multilingual punctuation prediction.

This python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

Oliver Guhr 27 Dec 22, 2022
To be a next-generation DL-based phenotype prediction from genome mutations.

Sequence -----------+-- 3D_structure -- 3D_module --+ +-- ? | |

Eric Alcaide 18 Jan 11, 2022
Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

keytotext Idea is to build a model which will take keywords as inputs and generate sentences as outputs. Potential use case can include: Marketing Sea

Gagan Bhatia 364 Jan 03, 2023
Open Source Neural Machine Translation in PyTorch

OpenNMT-py: Open-Source Neural Machine Translation OpenNMT-py is the PyTorch version of the OpenNMT project, an open-source (MIT) neural machine trans

OpenNMT 5.8k Jan 04, 2023
Predict the spans of toxic posts that were responsible for the toxic label of the posts

toxic-spans-detection An attempt at the SemEval 2021 Task 5: Toxic Spans Detection. The Toxic Spans Detection task of SemEval2021 required participant

Ilias Antonopoulos 3 Jul 24, 2022
Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning English | äø­ę–‡ ā— Now we provide inferencing code and pre-training models

164 Jan 02, 2023
Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Rethinking the Truly Unsupervised Image-to-Image Translation (ICCV 2021) Each image is generated with the source image in the left and the average sty

Clova AI Research 436 Dec 27, 2022
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022
Chatbot with Pytorch, Python & Nextjs

Installation Instructions Make sure that you have Python 3, gcc, venv, and pip installed. Clone the repository $ git clone https://github.com/sahr

Rohit Sah 0 Dec 11, 2022
SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering.

SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering. Contents Inst

0 Oct 21, 2021
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023
Based on 125GB of data leaked from Twitch, you can see their monthly revenues from 2019-2021

Twitch Revenues Bu script'i kullanarak istediğiniz yayıncıların, Twitch'den sızdırılan 125 GB'lik veriye dayanarak, 2019-2021 arası aylık gelirlerini

4 Nov 11, 2021
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022
Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

Stella Douka 14 Nov 02, 2022