An implementation of the Pay Attention when Required transformer

Overview

Pay Attention when Required (PAR) Transformer-XL

An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pdf/2009.04534.pdf

alt text [source: Jonathan Kernes]

Quick overview

The Pay Attention when Required Transformer (Mandava, et. al. 2020) is just a regular transformer-XL (Dai et. al. 2019)[https://arxiv.org/pdf/1901.02860.pdf] , but the ratio of attention and dense layers has been optimized. This optimization is performed by allowing the network to choose which types of layer it prefers in each block of the network. The present implementation is not an exact replica of the author's efforts. Instead, we perform a simultaneous optimization procedure on both the model architecture and model parameters. The search is performed using a SuperNet, which is a sequential neural network composed of stochastic blocks, as shown in the figure below (taken from the paper. Please don't sue me!)

alt text [Source: Mandava et. al. 2020]

The key component is a Gumbel-Softmax layer [(Jang et al., 2016) and (Maddison et al., 2016). jang link: https://arxiv.org/pdf/1611.01144.pdf]. This layer is a continuous representation of a discrete sampling from a Categorical distribution, thereby allowing us to use gradients to learn parameters of a discrete distribution. (Recall a categorical is a distrbution over K states with kth state having probability pi_k, and we must have the normalization condition \sum_{i=1}^K pi_i = 1)

As the model learns, it is free to adjust both the usual model parameters, as well as its architecture search parameters pi, indicating the probability of choosing either

  1. Attention

  2. Dense

  3. Identity

for any given stochastic block. We perform simulated annealing: since the categorical distribution is approximated by a continuous representation, we get some scores like (0.02, 0.98, 0.02) for the probability of say sampling that state 2 is picked. The sharpness of this is set by a parameter \tau (the temperature), with a categorical distribution the limit tau-->0. Simulated annealing means we begin with tau=1 to let the model figure out what it wants, then slowly decrease tau so the distribution approaches a categorical.

All of this is implemented on the freely available wiki-text2 dataset.

Explanation of the main GIF: The main gif is the result of our experiments. It shows the pi distribution for each stochastic block of a 6 block SuperNet, as a function of training iterations. The number indicates the probability of the most likely layer type (darker means more probable). As you can see, the model learns to put attention in the beginning, and dense layers at the end.

Requirements

Usual ML stuff, if you have a conda environment, python 3+, TensorFlow 2+ you should be ok. You will need TensorFlow Text as well to handle the SentencePiece Tokenization

If you choose to run your own tokenizer (a flag option in data_utils for handling new text data), you will also need to download the SentencePiece package: https://github.com/google/sentencepiece

Data

The dataset used is Wiki-text2. We have provided a copy of this in the data folder, along with some preprocessed data for training. In order to reproduce this from scratch, run the shell script

./create_tfrecords.sh

This will download the wiki-text2 dataset from its source, then proceed to clean, batch, and write the data to a tfrecords file. The shell script calls build_data.py which offers more control over what type of data to generate. The general parameters you will want to tune are:

*batch_size *seq_len.

You can also supply your own dataset instead of the one provided. The underlying tokenizer uses sentencepiece (Kudo): https://github.com/google/sentencepiece, which works at the byte level and can handle any kind of input. Simply change the --input_text flag to your file, and set the desired --vocab_size.

Why do we need to specify the batch size? Transformer XL uses memory states to form a recurrent, long range network. After analyzing a particular sequence say [A,B] of the sequence [A,B,C,D], the results of [A,B] are fed into the [C,D] calculation with a stop gradient. Therefore, we must be sure that each datapoint follows chronologically from the previous one.

This is achieved by context batching (see data_utils.py function) where we break the entire dataset into batch_size segments, then pull in order one sequence from each batch at a time to form the dataset. Because of this, note that adding more shards to the data could result in a large loss (order of batch_size*seq_len*shards), as each shard will drop the remaining datapoint of size (batch_size*seq_len) to keep the tensor shapes.

Addtional technical details

Per the original Transformer-XL, we also implement an adaptive softmax layer (Grave et. al. 2017, https://arxiv.org/abs/1609.04309) to deal with a potentially large number of outputs in the final dense layer. This implemenation is inspired by the TF 1.0 example at https://github.com/yangsaiyong/tf-adaptive-softmax-lstm-lm. To use the adaptive softmax, set the --cutoffs= flag in train.py. The cutoffs are the max values of each bin, and should NOT include the vocab size (i.e. the max cutoff of the final bin). If no cutoffs are specified, the model defaults to normal softmax.

For completeness, we have also provided a script optimal_cuts.py that determines the optimal cutoffs given a return space separated file of unigram probabilities (based on the assumptions of Grave et. al. regarding GPU computation complexity -- see the paper for details). The algorithm uses dynamic programming, but is quite slow at O(KN^2), for K cutoffs and N vocab words. In principle it's a one time cost to determine the cutoffs, but we are impatient and recommend to just play around with the cutoffs instead. See the script for flag details

Training and Benchmarks

The default model we use has memory length 16, feed-forward dimension 1024, attention dimension 128, and 6 stochastic blocks, with an adaptive softmax layer and 2 clusters. We trained on a colab GPU for 20 epochs, taking a total of 37 minutes. We use an Adam optimzer with cosine rate decay: an initial warmup of 4000 steps and a maximum learning rate of 1e-4, decaying to zero at the end of training. Our training benchmarks are:

Iteration (thousands) Train_perplexity Validation_perplexity Time
2.7k 163.9 114.4 1m 58s
8.5k 78.56 62.33 5m 37s
14.1k 65.71 51.88 9m 28s
28.3k 48.52 42.61 18m 40s
48.1k 41.85 39.57 31m 51s
56.5k 42.12 39.41 37m 14s

To train, simply run the shell script

./base_model.sh

adjusting the parameters as you see fit. The above model is the default configuration. To train in colab, simply open up the notebook "colab.ipynb" and follow the instructions. This is most easily done by going to [google.colab.com] and searching this repository in github. The benefit of colab, is it's easier to play around with the model after training.

While training, we have provided two ways to monitor the output

  1. A tensorboard log. The colab notebook takes care of running this for you. In the terminal, first create a 'logs' directory, then run the command tensorboard --logdir logs in a separate tab. This will open a port where you can view live plots of the learning rate, tau annealing, train/valid loss and perplexity.

  2. An output log saved to training_log.log. This will log the model summary, parameters, etc. as well as print out loss updates every 100 steps and save it to the log file.

Thanks for reading this far!

Enjoy! And thank you to the wonderful researchers that inspired this project.

If you would like to contribute, or have any comments questions concerns please open a pull request or email me directly.

Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)

Linear Multihead Attention (Linformer) PyTorch Implementation of reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer:

Kui Xu 58 Dec 23, 2022
Facilitating the design, comparison and sharing of deep text matching models.

MatchZoo Facilitating the design, comparison and sharing of deep text matching models. MatchZoo 是一个通用的文本匹配工具包,它旨在方便大家快速的实现、比较、以及分享最新的深度文本匹配模型。 🔥 News

Neural Text Matching Community 3.7k Jan 02, 2023
Implementation of some unbalanced loss like focal_loss, dice_loss, DSC Loss, GHM Loss et.al

Implementation of some unbalanced loss for NLP task like focal_loss, dice_loss, DSC Loss, GHM Loss et.al Summary Here is a loss implementation reposit

121 Jan 01, 2023
Chinese Pre-Trained Language Models (CPM-LM) Version-I

CPM-Generate 为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告] 若您想使用CPM-1进行推理,我们建议使用高效推理工具BMI

Tsinghua AI 1.4k Jan 03, 2023
Write Alphabet, Words and Sentences with your eyes.

The-Next-Gen-AI-Eye-Writer The Eye tracking Technique has become one of the most popular techniques within the human and computer interaction era, thi

Rohan Kasabe 2 Apr 05, 2022
Code for the paper "Language Models are Unsupervised Multitask Learners"

Status: Archive (code is provided as-is, no updates expected) gpt-2 Code and models from the paper "Language Models are Unsupervised Multitask Learner

OpenAI 16.1k Jan 08, 2023
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
⛵️The official PyTorch implementation for "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing" (EMNLP 2020).

BERT-of-Theseus Code for paper "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing". BERT-of-Theseus is a new compressed BERT by progre

Kevin Canwen Xu 284 Nov 25, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022
Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets

Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets What is LASSL • How to Use What is LASSL LASSL은 LAnguage Semi-Super

LASSL: LAnguage Self-Supervised Learning 116 Dec 27, 2022
Pipelines de datos, 2021.

Este repo ilustra un proceso sencillo de automatización de transformación y modelado de datos, a través de un pipeline utilizando Luigi. Stack princip

Rodolfo Ferro 8 May 19, 2022
🤖 Basic Financial Chatbot with handoff ability built with Rasa

Financial Services Example Bot This is an example chatbot demonstrating how to build AI assistants for financial services and banking with Rasa. It in

Mohammad Javad Hossieni 4 Aug 10, 2022
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System Authors: Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai

Amazon Web Services - Labs 124 Jan 03, 2023
🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted by EMNLP'2021 🌴

PAUSE: Positive and Annealed Unlabeled Sentence Embedding Sentence embedding refers to a set of effective and versatile techniques for converting raw

EQT 21 Dec 15, 2022
Text to speech converter with GUI made in Python.

Text-to-speech-with-GUI Text to speech converter with GUI made in Python. To run this download the zip file and run the main file or clone this repo.

SidTheMiner 1 Nov 15, 2021
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

This Repository contains a sample code for Tacotron 2, WaveGlow with multi-speaker, emotion embeddings together with a script for data preprocessing.

Ivan Didur 106 Jan 01, 2023
Precision Medicine Knowledge Graph (PrimeKG)

PrimeKG Website | bioRxiv Paper | Harvard Dataverse Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integra

Machine Learning for Medicine and Science @ Harvard 103 Dec 10, 2022