BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents

  1. Introduction
  2. Using BARTpho with fairseq
  3. Using BARTpho with transformers
  4. Notes
  5. License

Introduction

The two BARTpho versions, BARTpho-syllable and BARTpho-word, are the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART, making it especially suitable for generative NLP tasks. Experiments on the downstream task of Vietnamese text summarization show that, in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state of the art.

The general architecture and experimental results of BARTpho can be found in our paper:

@article{bartpho,
title     = {{BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese}},
author    = {Nguyen Luong Tran and Duong Minh Le and Dat Quoc Nguyen},
journal   = {arXiv preprint},
volume    = {arXiv:2109.09701},
year      = {2021}
}

Please CITE our paper when BARTpho is used to help produce published results or incorporated into other software.

Using BARTpho with fairseq

Installation

There is a known issue with the encode function in the BART hub_interface, as discussed in this pull request: https://github.com/pytorch/fairseq/pull/3905. Until that pull request is merged, please install fairseq from the following fork:

git clone https://github.com/datquocnguyen/fairseq.git
cd fairseq
pip install --editable ./

Pre-trained models

Model             #params   Download                       Input text
BARTpho-syllable  396M      fairseq-bartpho-syllable.zip   Syllable level
BARTpho-word      420M      fairseq-bartpho-word.zip       Word level
  • unzip fairseq-bartpho-syllable.zip
  • unzip fairseq-bartpho-word.zip

Example usage

from fairseq.models.bart import BARTModel  

#Load BARTpho-syllable model:  
model_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/'  
spm_model_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/sentence.bpe.model'  
bartpho_syllable = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='sentencepiece', sentencepiece_model=spm_model_path).eval()
#Input syllable-level/raw text:  
sentence = 'Chúng tôi là những nghiên cứu viên.'  
#Apply SentencePiece to the input text
tokenIDs = bartpho_syllable.encode(sentence, add_if_not_exist=False)
#Extract features from BARTpho-syllable
last_layer_features = bartpho_syllable.extract_features(tokenIDs)

#Load BARTpho-word model:
model_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/'  
bpe_codes_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/bpe.codes'  
bartpho_word = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='fastbpe', bpe_codes=bpe_codes_path).eval()
#Input word-level text:  
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'  
#Apply BPE to the input text
tokenIDs = bartpho_word.encode(sentence, add_if_not_exist=False)
#Extract features from BARTpho-word
last_layer_features = bartpho_word.extract_features(tokenIDs)
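
As a quick follow-up (a sketch, not part of the official example): both models use the "large" architecture, so the extracted features should be 1024-dimensional; the exact shape below is an assumption to verify against your installed fairseq version.

#Sanity-check the extracted features (assumed shape: [batch_size, num_tokens, 1024])
print(last_layer_features.shape)
#Round-trip the token IDs back to raw text via the hub interface
print(bartpho_word.decode(tokenIDs))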

Using BARTpho with transformers

Installation

  • Install with pip (transformers v4.12+ required): pip install transformers
  • Or install from source:
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .

Pre-trained models

Model                   #params   Input text
vinai/bartpho-syllable  396M      Syllable level
vinai/bartpho-word      420M      Word level

Example usage

import torch
from transformers import AutoModel, AutoTokenizer

#BARTpho-syllable
syllable_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=False)
bartpho_syllable = AutoModel.from_pretrained("vinai/bartpho-syllable")
TXT = 'Chúng tôi là những nghiên cứu viên.'  
input_ids = syllable_tokenizer(TXT, return_tensors='pt')['input_ids']
features = bartpho_syllable(input_ids)

from transformers import MBartForConditionalGeneration
bartpho_syllable = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
input_ids = syllable_tokenizer(TXT, return_tensors='pt')['input_ids']
logits = bartpho_syllable(input_ids).logits
masked_index = (input_ids[0] == syllable_tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
print(syllable_tokenizer.decode(predictions).split())

#BARTpho-word
word_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-word", use_fast=False)
bartpho_word = AutoModel.from_pretrained("vinai/bartpho-word")
TXT = 'Chúng_tôi là những nghiên_cứu_viên .'  
input_ids = word_tokenizer(TXT, return_tensors='pt')['input_ids']
features = bartpho_word(input_ids)

bartpho_word = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-word")
TXT = 'Chúng_tôi là những <mask> .'
input_ids = word_tokenizer(TXT, return_tensors='pt')['input_ids']
logits = bartpho_word(input_ids).logits
masked_index = (input_ids[0] == word_tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
print(word_tokenizer.decode(predictions).split())
  • Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of both the encoder and decoder. Thus, when converted for use with transformers, BARTpho is called via mBART-based classes (see the sketch below).
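
For example, the following minimal sketch runs the same masked input end to end with generate(); the decoding hyper-parameters (num_beams, max_length) are illustrative choices, not values from the paper:

import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=False)
model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
input_ids = tokenizer(TXT, return_tensors='pt')['input_ids']
#Beam-search decoding; the pre-trained (not yet fine-tuned) model denoises the masked input
with torch.no_grad():
    output_ids = model.generate(input_ids, num_beams=5, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))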

Notes

  • Before fine-tuning BARTpho on a downstream task, users should perform Vietnamese tone normalization on the downstream task's data, as this pre-processing step was also applied to the pre-training corpus. A Python script for Vietnamese tone normalization is available HERE; an illustrative sketch follows this list.
  • For BARTpho-word, users should use VnCoreNLP to word-segment raw input texts, since it was used to perform both Vietnamese tone normalization and word segmentation on the pre-training corpus; a segmentation sketch also follows.
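
The sketch below illustrates what tone normalization involves; it is NOT the official script. It assumes normalization means choosing one consistent placement of tone marks on vowel pairs such as oa/oe/uy, and the mapping direction shown (e.g. 'hòa' -> 'hoà') is a guess; follow the script linked above for the exact convention applied to the pre-training corpus.

#Illustrative tone normalization; the mapping direction is an assumption, not the official one
#Assumes NFC-composed (precomposed) Unicode input
TONE_MAP = {
    'òa': 'oà', 'óa': 'oá', 'ỏa': 'oả', 'õa': 'oã', 'ọa': 'oạ',
    'òe': 'oè', 'óe': 'oé', 'ỏe': 'oẻ', 'õe': 'oẽ', 'ọe': 'oẹ',
    'ùy': 'uỳ', 'úy': 'uý', 'ủy': 'uỷ', 'ũy': 'uỹ', 'ụy': 'uỵ',
}

def normalize_tones(text):
    for old, new in TONE_MAP.items():
        text = text.replace(old, new)
    return text

print(normalize_tones('hòa bình'))  #-> 'hoà bình' under this illustrative mapping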
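
For word segmentation, a minimal sketch with the VnCoreNLP Python wrapper follows; the jar path is a placeholder to adjust to your installation:

from vncorenlp import VnCoreNLP

#The jar path is a placeholder; download VnCoreNLP and its Python wrapper separately
rdrsegmenter = VnCoreNLP('/PATH-TO/VnCoreNLP-1.1.1.jar', annotators='wseg', max_heap_size='-Xmx500m')
text = 'Chúng tôi là những nghiên cứu viên.'
#tokenize() returns one token list per sentence; multi-syllable words are joined by '_'
sentences = rdrsegmenter.tokenize(text)
print(' '.join(sentences[0]))  #e.g. 'Chúng_tôi là những nghiên_cứu_viên .'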

License

MIT License

Copyright (c) 2021 VinAI Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.