Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

Overview

T-TA (Transformer-based Text Auto-encoder)

This repository contains code for the Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning) using TensorFlow 2.

How to train T-TA on a custom dataset

  1. Prepare datasets. You need plain-text files with one sentence per line.

    Example:

    Sentence 1.
    Sentence 2.
    Sentence 3.
    
  2. Train the sentencepiece tokenizer. You can use train_sentencepiece.py or train a sentencepiece model yourself, as sketched below.
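
    If you train the model yourself, here is a minimal sketch using the sentencepiece Python package. corpus.txt and the spm prefix are placeholder names; the vocab size follows the 15000 reported in the training details below:

    import sentencepiece as spm

    # Train a sentencepiece model on a one-sentence-per-line corpus.
    # "corpus.txt" is a placeholder path; the trained model is written
    # to spm.model and the vocabulary to spm.vocab.
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="spm",
        vocab_size=15000,
    )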

  3. Train the T-TA model. Run train.py with customizable arguments. Here's the usage.

    $ python train.py --help
    usage: train.py [-h] [--train-data TRAIN_DATA] [--dev-data DEV_DATA] [--model-config MODEL_CONFIG] [--batch-size BATCH_SIZE] [--spm-model SPM_MODEL]
                    [--learning-rate LEARNING_RATE] [--target-epoch TARGET_EPOCH] [--steps-per-epoch STEPS_PER_EPOCH] [--warmup-ratio WARMUP_RATIO]
    
    optional arguments:
        -h, --help            show this help message and exit
        --train-data TRAIN_DATA
        --dev-data DEV_DATA
        --model-config MODEL_CONFIG
        --batch-size BATCH_SIZE
        --spm-model SPM_MODEL
        --learning-rate LEARNING_RATE
        --target-epoch TARGET_EPOCH
        --steps-per-epoch STEPS_PER_EPOCH
        --warmup-ratio WARMUP_RATIO

    I wanted to train the model for a designated number of steps, so I added the steps_per_epoch and target_epoch arguments. The total number of training steps is steps_per_epoch * target_epoch.
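
    For reference, here is a hypothetical invocation that reproduces the 1M-step setup from the training details below. The paths and the batch size are placeholders; the learning rate, warmup ratio, and step counts match the reported values:

    $ python train.py \
        --train-data ./data/train.txt \
        --dev-data ./data/dev.txt \
        --model-config ./configs/model.json \
        --batch-size 512 \
        --spm-model ./spm.model \
        --learning-rate 1e-4 \
        --target-epoch 100 \
        --steps-per-epoch 10000 \
        --warmup-ratio 0.05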

  4. (Optional) Test your model on the KorSTS data. I trained my model on a Korean corpus, so I tested it with the KorSTS data. You can compute the KorSTS score (Spearman correlation) using evaluate_unsupervised_korsts.py. Here's the usage.

    $ python evaluate_unsupervised_korsts.py --help
    usage: evaluate_unsupervised_korsts.py [-h] --model-weight MODEL_WEIGHT --dataset DATASET
    
    optional arguments:
        -h, --help            show this help message and exit
        --model-weight MODEL_WEIGHT
        --dataset DATASET
    $ # To evaluate on dev set
    $ # python evaluate_unsupervised_korsts.py --model-weight ./path/to/checkpoint --dataset ./path/to/dataset/sts-dev.tsv

Training details

  • Training data: lovit/namuwikitext
  • Peak learning rate: 1e-4
  • Learning rate scheduler: linear warmup and linear decay (a sketch follows this list)
  • Warmup ratio: 0.05 (warmup steps: 1M * 0.05 = 50k)
  • Vocab size: 15000
  • Number of layers: 3
  • Intermediate size: 2048
  • Hidden size: 512
  • Attention heads: 8
  • Activation function: gelu
  • Max sequence length: 128
  • Tokenizer: sentencepiece
  • Total steps: 1M
  • Final validation accuracy on the auto-encoding task (ignoring padding): 0.5513
  • Final validation loss: 2.1691
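
A minimal sketch of the linear warmup and linear decay schedule, assuming the values above (the class below is an illustration, not the repository's own implementation):

import tensorflow as tf

class LinearWarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Warm up linearly to the peak rate, then decay linearly to zero."""

    def __init__(self, peak_lr, total_steps, warmup_ratio):
        super().__init__()
        self.peak_lr = peak_lr
        self.total_steps = float(total_steps)
        self.warmup_steps = float(int(total_steps * warmup_ratio))

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # The warmup term is smaller before warmup_steps; the decay term after.
        warmup = self.peak_lr * step / self.warmup_steps
        decay = self.peak_lr * (self.total_steps - step) / (self.total_steps - self.warmup_steps)
        return tf.minimum(warmup, decay)

# Peak 1e-4, 1M total steps, 0.05 warmup ratio -> 50k warmup steps.
schedule = LinearWarmupLinearDecay(peak_lr=1e-4, total_steps=1000000, warmup_ratio=0.05)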

Unsupervised KorSTS

Model                     Params   Development   Test
My Implementation         17M      65.98         56.75
Korean SRoBERTa (base)    111M     63.34         48.96
Korean SRoBERTa (large)   338M     60.15         51.35
SXLM-R (base)             270M     64.27         45.05
SXLM-R (large)            550M     55.00         39.92
Korean fastText           -        -             47.96

KorSTS development and test set scores (100 * Spearman correlation). Details of the other models can be found in the paper KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding.

How to use the pre-trained weights via tensorflow-hub

>>> import tensorflow as tf
>>> import tensorflow_text as text
>>> import tensorflow_hub as hub
>>> # load model
>>> model = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/model.tar.gz")
>>> preprocess = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/preprocess.tar.gz")
>>> # inference
>>> input_tensor = preprocess(["이 모델은 나무위키로 학습되었습니다.", "근데 이 모델 어디다가 쓸 수 있을까요?", "나는 고양이를 좋아해!", "나는 강아지를 좋아해!"])
>>> representation = model(input_tensor)
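>>> # pool: zero out padding positions using input_mask, sum over the
>>> # sequence axis, then L2-normalize so the dot products below are
>>> # cosine similarities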
>>> representation = tf.reduce_sum(representation * tf.cast(input_tensor["input_mask"], representation.dtype)[:, :, tf.newaxis], axis=1)
>>> representation = tf.nn.l2_normalize(representation, axis=-1)
>>> similarities = tf.tensordot(representation, representation, axes=[[1], [1]])
>>> # results
>>> similarities
<tf.Tensor: shape=(4, 4), dtype=float32, numpy=
array([[0.9999999 , 0.76468784, 0.7384633 , 0.7181306 ],
       [0.76468784, 1.        , 0.81387675, 0.79722893],
       [0.7384633 , 0.81387675, 0.9999999 , 0.96217746],
       [0.7181306 , 0.79722893, 0.96217746, 1.        ]], dtype=float32)>

References

  • Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning
  • KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding

Putting the brief English above aside, a note for the Korean readers who will make up most of the audience: this is a model I trained personally, simply to test whether a model architecture being considered at work would be any good. I wrote the code out of curiosity about how well it would perform, so there was no hyperparameter tuning and no careful dataset selection. The results simply came out better than I expected, so I decided to release them along with the code. As you can guess from the commit log, the code was written hastily in about a day, and the model was trained on a small GPU for roughly 50 hours.

I tried to follow the values reported in the original paper as closely as possible, but since the code was written at night, some parts may be unclear or may differ from the original implementation. If you open an issue about such parts, I will take another look.

Thanks to Baek Yeongmin (@baekyeongmin) for helping with troubleshooting.


Releases

  • 0 (Feb 6, 2021). The release notes repeat the training details listed above.

    model.tar.gz (60.93 MB)
    preprocess.tar.gz (507.45 KB)
Owner

Jeong Ukjae, Machine Learning Engineer