Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

Related tags

Text Data & NLPtta
Overview

T-TA (Transformer-based Text Auto-encoder)

This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning) using TensorFlow 2.

How to train T-TA using custom dataset

  1. Prepare datasets. You need text line files.

    Example:

    Sentence 1.
    Sentence 2.
    Sentence 3.
    
  2. Train the sentencepiece tokenizer. You can use the train_sentencepiece.py or train sentencepiece model by yourself.

  3. Train T-TA model. Run train.py with customizable arguments. Here's the usage.

    $ python train.py --help
    usage: train.py [-h] [--train-data TRAIN_DATA] [--dev-data DEV_DATA] [--model-config MODEL_CONFIG] [--batch-size BATCH_SIZE] [--spm-model SPM_MODEL]
                    [--learning-rate LEARNING_RATE] [--target-epoch TARGET_EPOCH] [--steps-per-epoch STEPS_PER_EPOCH] [--warmup-ratio WARMUP_RATIO]
    
    optional arguments:
        -h, --help            show this help message and exit
        --train-data TRAIN_DATA
        --dev-data DEV_DATA
        --model-config MODEL_CONFIG
        --batch-size BATCH_SIZE
        --spm-model SPM_MODEL
        --learning-rate LEARNING_RATE
        --target-epoch TARGET_EPOCH
        --steps-per-epoch STEPS_PER_EPOCH
        --warmup-ratio WARMUP_RATIO

    I want to train models until the designated steps, so I added the steps_per_epoch and target_epoch arguments. The total steps will be the steps_per_epoch * target_epoch.

  4. (Optional) Test your model using KorSTS data. I trained my model with the Korean corpus, so I tested it using KorSTS data. You can evaluate KorSTS score (Spearman correlation) using evaluate_unsupervised_korsts.py. Here's the usage.

    $ python evaluate_unsupervised_korsts.py --help
    usage: evaluate_unsupervised_korsts.py [-h] --model-weight MODEL_WEIGHT --dataset DATASET
    
    optional arguments:
        -h, --help            show this help message and exit
        --model-weight MODEL_WEIGHT
        --dataset DATASET
    $ # To evaluate on dev set
    $ # python evaluate_unsupervised_korsts.py --model-weight ./path/to/checkpoint --dataset ./path/to/dataset/sts-dev.tsv

Training details

  • Training data: lovit/namuwikitext
  • Peak learning rate: 1e-4
  • learning rate scheduler: Linear Warmup and Linear Decay.
  • Warmup ratio: 0.05 (warmup steps: 1M * 0.05 = 50k)
  • Vocab size: 15000
  • num layers: 3
  • intermediate size: 2048
  • hidden size: 512
  • attention heads: 8
  • activation function: gelu
  • max sequence length: 128
  • tokenizer: sentencepiece
  • Total steps: 1M
  • Final validation accuracy of auto encoding task (ignores padding): 0.5513
  • Final validation loss: 2.1691

Unsupervised KorSTS

Model Params development test
My Implementation 17M 65.98 56.75
- - - -
Korean SRoBERTa (base) 111M 63.34 48.96
Korean SRoBERTa (large) 338M 60.15 51.35
SXLM-R (base) 270M 64.27 45.05
SXLM-R (large) 550M 55.00 39.92
Korean fastText - - 47.96

KorSTS development and test set scores (100 * Spearman Correlation). You can check the details of other models on this paper (KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding).

How to use pre-trained weight using tensorflow-hub

>>> import tensorflow as tf
>>> import tensorflow_text as text
>>> import tensorflow_hub as hub
>>> # load model
>>> model = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/model.tar.gz")
>>> preprocess = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/preprocess.tar.gz")
>>> # inference
>>> input_tensor = preprocess(["이 모델은 나무위키로 학습되었습니다.", "근데 이 모델 어디다가 쓸 수 있을까요?", "나는 고양이를 좋아해!", "나는 강아지를 좋아해!"])
>>> representation = model(input_tensor)
>>> representation = tf.reduce_sum(representation * tf.cast(input_tensor["input_mask"], representation.dtype)[:, :, tf.newaxis], axis=1)
>>> representation = tf.nn.l2_normalize(representation, axis=-1)
>>> similarities = tf.tensordot(representation, representation, axes=[[1], [1]])
>>> # results
>>> similarities
<tf.Tensor: shape=(4, 4), dtype=float32, numpy=
array([[0.9999999 , 0.76468784, 0.7384633 , 0.7181306 ],
       [0.76468784, 1.        , 0.81387675, 0.79722893],
       [0.7384633 , 0.81387675, 0.9999999 , 0.96217746],
       [0.7181306 , 0.79722893, 0.96217746, 1.        ]], dtype=float32)>

References


짧은 영어를 뒤로 하고, 대부분의 독자분이실 한국분들을 위해 적어보자면, 단순히 "회사에서 구상중인 모델 구조가 좋을까?"를 테스트해보기 위해 개인적으로 학습해본 모델입니다. 어느정도로 잘 나오는지 궁금해서 작성한 코드이기 때문에 하이퍼 파라미터 튜닝이라던가, 데이터셋을 신중히 골랐다던가 하는 것은 없었습니다. 단지 학습해보다보니 생각보다 값이 잘 나와서 결과와 함께 공개하게 되었습니다. 커밋 로그를 보시면 짐작하실 수 있겠지만, 하루 정도에 후다닥 짜서 작은 GPU로 약 50시간 가량 돌린 모델입니다.

원 논문에 나온 값들을 최대한 따라가려 했으며, 밤에 작성했던 코드라 조금 명확하지 않은 부분이 있을 수도 있고, 원 구현과 다를 수도 있습니다. 해당 부분은 이슈로 달아주신다면 다시 확인해보겠습니다.

트러블 슈팅에 도움을 주신 백영민님(@baekyeongmin)께 감사드립니다.

You might also like...
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

AliceMind AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab This repository provides pre-trained encode

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Making text a first-class citizen in TensorFlow.
Making text a first-class citizen in TensorFlow.

TensorFlow Text - Text processing in Tensorflow IMPORTANT: When installing TF Text with pip install, please note the version of TensorFlow you are run

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow.  This is part of the CASL project: http://casl-project.ai/
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

Releases(0)
  • 0(Feb 6, 2021)

    • Training data: lovit/namuwikitext
    • Peak learning rate: 1e-4
    • learning rate scheduler: Linear Warmup and Linear Decay.
    • Warmup ratio: 0.05 (warmup steps: 1M * 0.05 = 50k)
    • Vocab size: 15000
    • num layers: 3
    • intermediate size: 2048
    • hidden size: 512
    • attention heads: 8
    • activation function: gelu
    • max sequence length: 128
    • tokenizer: sentencepiece
    • Total steps: 1M
    • Final validation accuracy of auto encoding task (ignores padding): 0.5513
    • Final validation loss: 2.1691
    Source code(tar.gz)
    Source code(zip)
    model.tar.gz(60.93 MB)
    preprocess.tar.gz(507.45 KB)
Owner
Jeong Ukjae
Machine Learning Engineer
Jeong Ukjae
This code extends the neural style transfer image processing technique to video by generating smooth transitions between several reference style images

Neural Style Transfer Transition Video Processing By Brycen Westgarth and Tristan Jogminas Description This code extends the neural style transfer ima

Brycen Westgarth 110 Jan 07, 2023
基于“Seq2Seq+前缀树”的知识图谱问答

KgCLUE-bert4keras 基于“Seq2Seq+前缀树”的知识图谱问答 简介 博客:https://kexue.fm/archives/8802 环境 软件:bert4keras=0.10.8 硬件:目前的结果是用一张Titan RTX(24G)跑出来的。 运行 第一次运行的时候,会给知

苏剑林(Jianlin Su) 65 Dec 12, 2022
Code Generation using a large neural network called GPT-J

CodeGenX is a Code Generation system powered by Artificial Intelligence! It is delivered to you in the form of a Visual Studio Code Extension and is Free and Open-source!

DeepGenX 389 Dec 31, 2022
An open collection of annotated voices in Japanese language

声庭 (Koniwa): オープンな日本語音声とアノテーションのコレクション Koniwa (声庭): An open collection of annotated voices in Japanese language 概要 Koniwa(声庭)は利用・修正・再配布が自由でオープンな音声とアノテ

Koniwa project 32 Dec 14, 2022
pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks

297 Dec 29, 2022
My Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks using Tensorflow

Easy Data Augmentation Implementation This repository contains my Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Per

Aflah 9 Oct 31, 2022
A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end Quick Start C

Eu-Bin KIM 94 Dec 08, 2022
An evaluation toolkit for voice conversion models.

Voice-conversion-evaluation An evaluation toolkit for voice conversion models. Sample test pair Generate the metadata for evaluating models. The direc

30 Aug 29, 2022
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022
Tokenizer - Module python d'analyse syntaxique et de grammaire, tokenization

Tokenizer Le Tokenizer est un analyseur lexicale, il permet, comme Flex and Yacc par exemple, de tokenizer du code, c'est à dire transformer du code e

Manolo 1 Aug 15, 2022
Korean extractive summarization. 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

korean extractive summarization 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드 Leaderboard Notice Text Summarization with Pretrained Encoders에 나오는 bertsumext모델(ext

3 Aug 10, 2022
Final Project Bootcamp Zero

The Quest (Pygame) Descripción Este es el repositorio de código The-Quest para el proyecto final Bootcamp Zero de KeepCoding. El juego consiste en la

Seven-z01 1 Mar 02, 2022
apple's universal binaries BUT MUCH WORSE (PRACTICAL SHITPOST) (NOT PRODUCTION READY)

hyperuniversality investment opportunity: what if we could run multiple architectures in a single file, again apple universal binaries, but worse how

luna 2 Oct 19, 2021
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

49 Dec 17, 2022
Edge-Augmented Graph Transformer

Edge-augmented Graph Transformer Introduction This is the official implementation of the Edge-augmented Graph Transformer (EGT) as described in https:

Md Shamim Hussain 21 Dec 14, 2022
Blackstone is a spaCy model and library for processing long-form, unstructured legal text

Blackstone Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project f

ICLR&D 579 Jan 08, 2023
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).

Splitter ⠀⠀ A PyTorch implementation of Splitter: Learning Node Representations that Capture Multiple Social Contexts (WWW 2019). Abstract Recent inte

Benedek Rozemberczki 201 Nov 09, 2022
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge This is an implementation of the paper,

Mutian He 19 Oct 14, 2022
Japanese synonym library

chikkarpy chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar. chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

Works Applications 48 Dec 14, 2022
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

Phil Wang 1.8k Dec 30, 2022