Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

Overview

T-TA (Transformer-based Text Auto-encoder)

This repository contains code for the Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning) using TensorFlow 2.

How to train T-TA on a custom dataset

  1. Prepare datasets. You need plain-text files with one sentence per line.

    Example:

    Sentence 1.
    Sentence 2.
    Sentence 3.
    
  2. Train the sentencepiece tokenizer. You can use train_sentencepiece.py or train a sentencepiece model yourself, as sketched below.
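
    If you train the model yourself, here is a minimal sketch using the sentencepiece Python package. corpus.txt and the spm prefix are placeholder names; the vocab size follows the 15000 reported in the training details below:

    import sentencepiece as spm

    # Train a sentencepiece model on a one-sentence-per-line corpus.
    # "corpus.txt" is a placeholder path; the trained model is written
    # to spm.model and the vocabulary to spm.vocab.
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="spm",
        vocab_size=15000,
    )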

  3. Train the T-TA model. Run train.py with customizable arguments. Here's the usage.

    $ python train.py --help
    usage: train.py [-h] [--train-data TRAIN_DATA] [--dev-data DEV_DATA] [--model-config MODEL_CONFIG] [--batch-size BATCH_SIZE] [--spm-model SPM_MODEL]
                    [--learning-rate LEARNING_RATE] [--target-epoch TARGET_EPOCH] [--steps-per-epoch STEPS_PER_EPOCH] [--warmup-ratio WARMUP_RATIO]
    
    optional arguments:
        -h, --help            show this help message and exit
        --train-data TRAIN_DATA
        --dev-data DEV_DATA
        --model-config MODEL_CONFIG
        --batch-size BATCH_SIZE
        --spm-model SPM_MODEL
        --learning-rate LEARNING_RATE
        --target-epoch TARGET_EPOCH
        --steps-per-epoch STEPS_PER_EPOCH
        --warmup-ratio WARMUP_RATIO

    I wanted to train the model for a designated number of steps, so I added the steps_per_epoch and target_epoch arguments. The total number of training steps is steps_per_epoch * target_epoch.
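
    For reference, here is a hypothetical invocation that reproduces the 1M-step setup from the training details below. The paths and the batch size are placeholders; the learning rate, warmup ratio, and step counts match the reported values:

    $ python train.py \
        --train-data ./data/train.txt \
        --dev-data ./data/dev.txt \
        --model-config ./configs/model.json \
        --batch-size 512 \
        --spm-model ./spm.model \
        --learning-rate 1e-4 \
        --target-epoch 100 \
        --steps-per-epoch 10000 \
        --warmup-ratio 0.05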

  4. (Optional) Test your model on the KorSTS data. I trained my model on a Korean corpus, so I tested it with the KorSTS data. You can compute the KorSTS score (Spearman correlation) using evaluate_unsupervised_korsts.py. Here's the usage.

    $ python evaluate_unsupervised_korsts.py --help
    usage: evaluate_unsupervised_korsts.py [-h] --model-weight MODEL_WEIGHT --dataset DATASET
    
    optional arguments:
        -h, --help            show this help message and exit
        --model-weight MODEL_WEIGHT
        --dataset DATASET
    $ # To evaluate on dev set
    $ # python evaluate_unsupervised_korsts.py --model-weight ./path/to/checkpoint --dataset ./path/to/dataset/sts-dev.tsv

Training details

  • Training data: lovit/namuwikitext
  • Peak learning rate: 1e-4
  • Learning rate scheduler: linear warmup and linear decay (a sketch follows this list)
  • Warmup ratio: 0.05 (warmup steps: 1M * 0.05 = 50k)
  • Vocab size: 15000
  • Number of layers: 3
  • Intermediate size: 2048
  • Hidden size: 512
  • Attention heads: 8
  • Activation function: gelu
  • Max sequence length: 128
  • Tokenizer: sentencepiece
  • Total steps: 1M
  • Final validation accuracy on the auto-encoding task (ignoring padding): 0.5513
  • Final validation loss: 2.1691
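
A minimal sketch of the linear warmup and linear decay schedule, assuming the values above (the class below is an illustration, not the repository's own implementation):

import tensorflow as tf

class LinearWarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Warm up linearly to the peak rate, then decay linearly to zero."""

    def __init__(self, peak_lr, total_steps, warmup_ratio):
        super().__init__()
        self.peak_lr = peak_lr
        self.total_steps = float(total_steps)
        self.warmup_steps = float(int(total_steps * warmup_ratio))

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # The warmup term is smaller before warmup_steps; the decay term after.
        warmup = self.peak_lr * step / self.warmup_steps
        decay = self.peak_lr * (self.total_steps - step) / (self.total_steps - self.warmup_steps)
        return tf.minimum(warmup, decay)

# Peak 1e-4, 1M total steps, 0.05 warmup ratio -> 50k warmup steps.
schedule = LinearWarmupLinearDecay(peak_lr=1e-4, total_steps=1000000, warmup_ratio=0.05)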

Unsupervised KorSTS

Model                     Params   Development   Test
My Implementation         17M      65.98         56.75
Korean SRoBERTa (base)    111M     63.34         48.96
Korean SRoBERTa (large)   338M     60.15         51.35
SXLM-R (base)             270M     64.27         45.05
SXLM-R (large)            550M     55.00         39.92
Korean fastText           -        -             47.96

KorSTS development and test set scores (100 * Spearman correlation). Details of the other models can be found in the paper KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding.

How to use the pre-trained weights via tensorflow-hub

>>> import tensorflow as tf
>>> import tensorflow_text as text
>>> import tensorflow_hub as hub
>>> # load model
>>> model = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/model.tar.gz")
>>> preprocess = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/preprocess.tar.gz")
>>> # inference
>>> input_tensor = preprocess(["이 모델은 나무위키로 학습되었습니다.", "근데 이 모델 어디다가 쓸 수 있을까요?", "나는 고양이를 좋아해!", "나는 강아지를 좋아해!"])
>>> representation = model(input_tensor)
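>>> # pool: zero out padding positions using input_mask, sum over the
>>> # sequence axis, then L2-normalize so the dot products below are
>>> # cosine similarities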
>>> representation = tf.reduce_sum(representation * tf.cast(input_tensor["input_mask"], representation.dtype)[:, :, tf.newaxis], axis=1)
>>> representation = tf.nn.l2_normalize(representation, axis=-1)
>>> similarities = tf.tensordot(representation, representation, axes=[[1], [1]])
>>> # results
>>> similarities
<tf.Tensor: shape=(4, 4), dtype=float32, numpy=
array([[0.9999999 , 0.76468784, 0.7384633 , 0.7181306 ],
       [0.76468784, 1.        , 0.81387675, 0.79722893],
       [0.7384633 , 0.81387675, 0.9999999 , 0.96217746],
       [0.7181306 , 0.79722893, 0.96217746, 1.        ]], dtype=float32)>

References

  • Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning
  • KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding

Putting the brief English above aside, a note for the Korean readers who will make up most of the audience: this is a model I trained personally, simply to test whether a model architecture being considered at work would be any good. I wrote the code out of curiosity about how well it would perform, so there was no hyperparameter tuning and no careful dataset selection. The results simply came out better than I expected, so I decided to release them along with the code. As you can guess from the commit log, the code was written hastily in about a day, and the model was trained on a small GPU for roughly 50 hours.

I tried to follow the values reported in the original paper as closely as possible, but since the code was written at night, some parts may be unclear or may differ from the original implementation. If you open an issue about such parts, I will take another look.

Thanks to Baek Yeongmin (@baekyeongmin) for helping with troubleshooting.


Releases

  • 0 (Feb 6, 2021). The release notes repeat the training details listed above.

    model.tar.gz (60.93 MB)
    preprocess.tar.gz (507.45 KB)
Owner

Jeong Ukjae, Machine Learning Engineer