Styled Augmented Translation

Related tags

Deep LearningSAT
Overview

SAT

Style Augmented Translation

PWC

Introduction

By collecting high-quality data, we were able to train a model that outperforms Google Translate on 6 different domains of English-Vietnamese Translation.

English to Vietnamese Translation (BLEU score)

drawing

Vietnamese to English Translation (BLEU score)

drawing

Get data and model at Google Cloud Storage

Check out our demo web app

Visit our blog post for more details.


Using the code

This code is build on top of vietai/dab:

To prepare for training, generate tfrecords from raw text files:

python t2t_datagen.py \
--data_dir=$path_to_folder_contains_vocab_file \
--tmp_dir=$path_to_folder_that_contains_training_data \
--problem=$problem

To train a Transformer model on the generated tfrecords

python t2t_trainer.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_to_save_checkpoints

To run inference on the trained model:

python t2t_decoder.py \
--data_dir=$path_to_folde_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_contains_checkpoints

In this colab, we demonstrated how to run these three phases in the context of hosting data/model on Google Cloud Storage.


Dataset

Our data contains roughly 3.3 million pairs of texts. After augmentation, the data is of size 26.7 million pairs of texts. A more detail breakdown of our data is shown in the table below.

Pure Augmented
Fictional Books 333,189 2,516,787
Legal Document 1,150,266 3,450,801
Medical Publication 5,861 27,588
Movies Subtitles 250,000 3,698,046
Software 79,912 239,745
TED Talk 352,652 4,983,294
Wikipedia 645,326 1,935,981
News 18,449 139,341
Religious texts 124,389 1,182,726
Educational content 397,008 8,475,342
No tag 5,517 66,299
Total 3,362,569 26,715,950

Data sources is described in more details here.

Comments
  • Data leakage issue in evaluation?

    Data leakage issue in evaluation?

    Hi team @lmthang @thtrieu @heraclex12 @hqphat @KienHuynh

    The obtained results of a Transformer-based model on the PhoMT test set surprised me. My first thought was that as VietAI and PhoMT datasets have several overlapping domains (e.g. Wikihow, TED talks, Opensubtitles, news..): whether there might be a potential data leakage issue in your evaluation (e.g. PhoMT English-Vietnamese test pairs appear in the VietAI training set)?

    In particular, we find that 6294/19151 PhoMT English-Vietnamese test pairs appear in the VietAI training set (v2). When evaluating your model on the PhoMT test set, did you guys retrain the model on a VietAI training set variant that does not contain PhoMT English-Vietnamese test pairs?

    Cheers, Dat.

    opened by datquocnguyen 3
  • Demo website is not working

    Demo website is not working

    Hi, seems like the easiest to reach out here but https://demo.vietai.org/ is down, looks like the page tried to serve a 404 error page.

    Connection failed with status 404, and response "<!DOCTYPE html> <html lang=en> <meta charset=utf-8> <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"> <title>Error 404 (Not Found)!!1</title> <style> *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px} </style> <a href=//www.google.com/><span id=logo aria-label=Google></span></a> <p><b>404.</b> <ins>That’s an error.</ins> <p>The requested URL <code>/healthz</code> was not found on this server. <ins>That’s all we know.</ins> ".
    
    opened by VietThan 1
  • Got RuntimeError when run on Google Colab

    Got RuntimeError when run on Google Colab

    I ran the Readme.md samples on Google Colab with GPU and got this Error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)".

    Error code: outputs = model.generate(tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to('cuda'), max_length=512)

    opened by kietbg0079 0
  • Got error 'AssertionError: Torch not compiled with CUDA enabled' on Macbook M1 pro

    Got error 'AssertionError: Torch not compiled with CUDA enabled' on Macbook M1 pro

    I have tried the example on my Macbook M1 pro but got this error: =>outputs = model.generate(tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to('cuda'), max_length=512) raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled

    Please help!

    opened by htnha 4
  • Question about loading model

    Question about loading model

    I have a question about loading model. I have trained a Russian-to-Vietnamese model base on your code and tensor2tensor. Every time I want to predict a new sentence, it always load the model again, even before that I have already predicted another sentence. I want to ask that if there is a way not to have reload the model when predict a new sentence. Thank you very much

    opened by hieunguyenquoc 1
  • I have a issue about running decoder

    I have a issue about running decoder

    Data loss: Unable to open table file /content/drive/MyDrive/SAT/checkpoint: Failed precondition: /content/drive/MyDrive/SAT/checkpoint; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?

    I used a pretrain model : model.augmented.envi.ckpt-1415000.data-00000-of-00001, model.augmented.envi.ckpt-1415000.index, model.augmented.envi.ckpt-1415000.meta. All 3 file are put in checkpoint

    Could somebody help me with this issue ?

    opened by hieunguyenquoc 6
Releases(v1.0)
  • v1.0(Oct 2, 2021)

    First version.

    Trained on 3.3M training data points. Transformer with 9-layer encoder and 9-layer decoder. Tested on a multi-domain dataset, outperforming Google Translate. Experiments with style-tagging and data appending.

    Source code(tar.gz)
    Source code(zip)
HandFoldingNet ✌️ : A 3D Hand Pose Estimation Network Using Multiscale-Feature Guided Folding of a 2D Hand Skeleton

HandFoldingNet ✌️ : A 3D Hand Pose Estimation Network Using Multiscale-Feature Guided Folding of a 2D Hand Skeleton Wencan Cheng, Jae Hyun Park, Jong

cwc1260 23 Oct 21, 2022
Using Random Effects to Account for High-Cardinality Categorical Features and Repeated Measures in Deep Neural Networks

LMMNN Using Random Effects to Account for High-Cardinality Categorical Features and Repeated Measures in Deep Neural Networks This is the working dire

Giora Simchoni 10 Nov 02, 2022
An Open-Source Toolkit for Prompt-Learning.

An Open-Source Framework for Prompt-learning. Overview • Installation • How To Use • Docs • Paper • Citation • What's New? Nov 2021: Now we have relea

THUNLP 2.3k Jan 07, 2023
A Weakly Supervised Amodal Segmenter with Boundary Uncertainty Estimation

Paper Khoi Nguyen, Sinisa Todorovic "A Weakly Supervised Amodal Segmenter with Boundary Uncertainty Estimation", accepted to ICCV 2021 Our code is mai

Khoi Nguyen 5 Aug 14, 2022
An energy estimator for eyeriss-like DNN hardware accelerator

Energy-Estimator-for-Eyeriss-like-Architecture- An energy estimator for eyeriss-like DNN hardware accelerator This is an energy estimator for eyeriss-

HEXIN BAO 2 Mar 26, 2022
This is the official implementation for "Do Transformers Really Perform Bad for Graph Representation?".

Graphormer By Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng*, Guolin Ke, Di He*, Yanming Shen and Tie-Yan Liu. This repo is the official impl

Microsoft 1.3k Dec 29, 2022
Deep Learning as a Cloud API Service.

Deep API Deep Learning as Cloud APIs. This project provides pre-trained deep learning models as a cloud API service. A web interface is available as w

Wu Han 4 Jan 06, 2023
This repository provides an unified frameworks to train and test the state-of-the-art few-shot font generation (FFG) models.

FFG-benchmarks This repository provides an unified frameworks to train and test the state-of-the-art few-shot font generation (FFG) models. What is Fe

Clova AI Research 101 Dec 27, 2022
Advbox is a toolbox to generate adversarial examples that fool neural networks in PaddlePaddle、PyTorch、Caffe2、MxNet、Keras、TensorFlow and Advbox can benchmark the robustness of machine learning models.

Advbox is a toolbox to generate adversarial examples that fool neural networks in PaddlePaddle、PyTorch、Caffe2、MxNet、Keras、TensorFlow and Advbox can benchmark the robustness of machine learning models

AdvBox 1.3k Dec 25, 2022
Distributed Evolutionary Algorithms in Python

DEAP DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data stru

Distributed Evolutionary Algorithms in Python 4.9k Jan 05, 2023
Inkscape extensions for figure resizing and editing

Academic-Inkscape: Extensions for figure resizing and editing This repository contains several Inkscape extensions designed for editing plots. Scale P

192 Dec 26, 2022
A platform to display the carbon neutralization information for researchers, decision-makers, and other participants in the community.

Welcome to Carbon Insight Carbon Insight is a platform aiming to display the carbon neutralization roadmap for researchers, decision-makers, and other

Microsoft 14 Oct 24, 2022
This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

TransFG: A Transformer Architecture for Fine-grained Recognition Official PyTorch code for the paper: TransFG: A Transformer Architecture for Fine-gra

Ju He 307 Jan 03, 2023
A curated list of automated deep learning (including neural architecture search and hyper-parameter optimization) resources.

Awesome AutoDL A curated list of automated deep learning related resources. Inspired by awesome-deep-vision, awesome-adversarial-machine-learning, awe

D-X-Y 2k Dec 30, 2022
MLOps will help you to understand how to build a Continuous Integration and Continuous Delivery pipeline for an ML/AI project.

page_type languages products description sample python azure azure-machine-learning-service azure-devops Code which demonstrates how to set up and ope

1 Nov 01, 2021
ICON: Implicit Clothed humans Obtained from Normals (CVPR 2022)

ICON: Implicit Clothed humans Obtained from Normals Yuliang Xiu · Jinlong Yang · Dimitrios Tzionas · Michael J. Black CVPR 2022 News 🚩 [2022/04/26] H

Yuliang Xiu 1.1k Jan 04, 2023
VisionKG: Vision Knowledge Graph

VisionKG: Vision Knowledge Graph Official Repository of VisionKG by Anh Le-Tuan, Trung-Kien Tran, Manh Nguyen-Duc, Jicheng Yuan, Manfred Hauswirth and

Continuous Query Evaluation over Linked Stream (CQELS) 9 Jun 23, 2022
Learning Efficient Online 3D Bin Packing on Packing Configuration Trees

Learning Efficient Online 3D Bin Packing on Packing Configuration Trees This repository is being continuously updated, please stay tuned! Any code con

86 Dec 28, 2022
IEEE Winter Conference on Applications of Computer Vision 2022 Accepted

SSKT(Accepted WACV2022) Concept map Dataset Image dataset CIFAR10 (torchvision) CIFAR100 (torchvision) STL10 (torchvision) Pascal VOC (torchvision) Im

1 Nov 17, 2022
A Pytorch loader for MVTecAD dataset.

MVTecAD A Pytorch loader for MVTecAD dataset. It strictly follows the code style of common Pytorch datasets, such as torchvision.datasets.CIFAR10. The

Jiyuan 1 Dec 27, 2021