A Chinese to English Neural Model Translation Project

Overview

ZH-EN NMT Chinese to English Neural Machine Translation

This project is inspired by Stanford's CS224N NMT Project

Dataset used in this project: News Commentary v14

Intro

This project is more of a learning project to make myself familiar with Pytorch, machine translation, and NLP model training.

To investigate how would various setups of the recurrent layer affect the final performance, I compared Training Efficiency and Effectiveness of different types of RNN layer for encoder by changing one feature each time while controlling all other parameters:

  • RNN types

    • GRU
    • LSTM
  • Activation Functions on Output Layer

    • Tanh
    • ReLU
    • LeakyReLU
  • Number of layers

    • single layer
    • double layer

Code Files

_/
├─ utils.py # utilities
├─ vocab.py # generate vocab
├─ model_embeddings.py # embedding layer
├─ nmt_model.py # nmt model definition
├─ run.py # training and testing

Good Translation Examples

  • source: 相反,这意味着合作的基础应当是共同的长期战略利益,而不是共同的价值观。

    • target: Instead, it means that cooperation must be anchored not in shared values, but in shared long-term strategic interests.
    • translation: On the contrary, that means cooperation should be a common long-term strategic interests, rather than shared values.
  • source: 但这个问题其实很简单: 谁来承受这些用以降低预算赤字的紧缩措施的冲击。

    • target: But the issue is actually simple: Who will bear the brunt of measures to reduce the budget deficit?
    • translation: But the question is simple: Who is to bear the impact of austerity measures to reduce budget deficits?
  • source: 上述合作对打击恐怖主义、贩卖人口和移民可能发挥至关重要的作用。

    • target: Such cooperation is essential to combat terrorism, human trafficking, and migration.
    • translation: Such cooperation is essential to fighting terrorism, trafficking, and migration.
  • source: 与此同时, 政治危机妨碍着政府追求艰难的改革。

    • target: At the same time, political crisis is impeding the government’s pursuit of difficult reforms.
    • translation: Meanwhile, political crises hamper the government’s pursuit of difficult reforms.

Preprocessing

Preprocessing Colab notebook

  • using jieba to separate Chinese words by spaces

Generate Vocab From Training Data

  • Input: training data of Chinese and English

  • Output: a vocab file containing mapping from (sub)words to ids of Chinese and English -- a limited size of vocab is selected using SentencePiece (essentially Byte Pair Encoding of character n-grams) to cover around 99.95% of training data

Model Definition

  • a Seq2Seq model with attention

    This image is from the book DIVE INTO DEEP LEARNING

    • Encoder
      • A Recurrent Layer
    • Decoder
      • LSTMCell (hidden_size=512)
    • Attention
      • Multiplicative Attention

Training And Testing Results

Training Colab notebook

  • Hyperparameters:
    • Embedding Size & Hidden Size: 512
    • Dropout Rate: 0.25
    • Starting Learning Rate: 5e-4
    • Batch Size: 32
    • Beam Size for Beam Search: 10
  • NOTE: The BLEU score calculated here is based on the Test Set, so it could only be used to compare the relative effectiveness of the models using this data

For Experiment

  • Dataset: the dataset is split into training set(~260000), validation set(~20000), and testing set(~20000) randomly (they are the same for each experiment group)
  • Max Number of Iterations: 50000
  • NOTE: I've tried Vanilla-RNN(nn.RNN) in various ways, but the BLEU score turns out to be extremely low for it (absence of residual connections might be the issue)
    • I decided to not include it for comparison until the issue is resolved
Training Time(sec) BLEU Score on Test Set Training Perplexities Validation Perplexities
A. Bidirectional 1-Layer GRU with Tanh 5158.99 14.26
B. Bidirectional 1-Layer LSTM with Tanh 5150.31 16.20
C. Bidirectional 2-Layer LSTM with Tanh 6197.58 16.38
D. Bidirectional 1-Layer LSTM with ReLU 5275.12 14.01
E. Bidirectional 1-Layer LSTM with LeakyReLU(slope=0.1) 5292.58 14.87

Current Best Version

Bidirectional 2-Layer LSTM with Tanh, 1024 embed_size & hidden_size, trained 11517.19 sec (44000 iterations), BLEU score 17.95

Traning Time BLEU Score on Test Set Training Perplexities Validation Perplexities
Best Model 11517.19 17.95

Analysis

  • LSTM tends to have better performance than GRU (it has an extra set of parameters)
  • Tanh tends to be better since less information is lost
  • Making the LSTM deeper (more layers) could improve the performance, but it cost more time to train
  • Surprisingly, the training time for A, B, and D are roughly the same
    • the issue may be the dataset is not large enough, or the cloud service I used to train models does not perform consistently

Bad Examples & Case Analysis

  • source: 全球目击组织(Global Witness)的报告记录, 光是2015年就有16个国家的185人被杀。
    • target: A Global Witness report documented 185 killings across 16 countries in 2015 alone.
    • translation: According to the Global eye, the World Health Organization reported that 185 people were killed in 2015.
    • problems:
      • Information Loss: 16 countries
      • Unknown Proper Noun: Global Witness
  • source: 大自然给了足以满足每个人需要的东西, 但无法满足每个人的贪婪
    • target: Nature provides enough for everyone’s needs, but not for everyone’s greed.
    • translation: Nature provides enough to satisfy everyone.
    • problems:
      • Huge Information Loss
  • source: 我衷心希望全球经济危机和巴拉克·奥巴马当选总统能对新冷战的荒唐理念进行正确的评估。
    • target: It is my hope that the global economic crisis and Barack Obama’s presidency will put the farcical idea of a new Cold War into proper perspective.
    • translation: I do hope that the global economic crisis and President Barack Obama will be corrected for a new Cold War.
    • problems:
      • Action Sender And Receiver Exchanged
      • Failed To Translate Complex Sentence
  • source: 人们纷纷猜测欧元区将崩溃。
    • target: Speculation about a possible breakup was widespread.
    • translation: The eurozone would collapse.
    • problems:
      • Significant Information Loss

Means to Improve the NMT model

  • Dataset
    • The dataset is fairly small, and our model is not being trained thorough all data
    • Being a native Chinese speaker, I could not understand what some of the source sentences are saying
    • The target sentences are not informational comprehensive; they themselves need context to be understood (e.g. the target sentence in the last "Bad Examples")
    • Even for human, some of the source sentence was too hard to translate
  • Model Architecture
    • CNN & Transformer
    • character based model
    • Make the model even larger & deeper (... I need GPUs)
  • Tricks that might help
    • Add a proper noun dictionary to translate unknown proper nouns word-by-word (phrase-by-phrase)
    • Initialize (sub)word embedding with pretrained embedding

How To Run

  • Download the dataset you desire, and change all "./zh_en_data" in run.sh to the path where your data is stored
  • To run locally on a CPU (mostly for sanity check, CPU is not able to train the model)
    • set up the environment using conda/miniconda conda env create --file local env.yml
  • To run on a GPU
    • set up the environment and running process following the Colab notebook

Contact

If you have any questions or you have trouble running the code, feel free to contact me via email

Owner
Zhenbang Feng
Be an engineer, not a coder. [email protected]
Zhenbang Feng
NLP Overview

NLP-Overview Introduction The field of NPL encompasses a variety of topics which involve the computational processing and understanding of human langu

PeterPham 1 Jan 13, 2022
Nested Named Entity Recognition

Nested Named Entity Recognition Training Dataset: CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark url: https://tianchi.aliyun.

8 Dec 25, 2022
Multilingual Emotion classification using BERT (fine-tuning). Published at the WASSA workshop (ACL2022).

XLM-EMO: Multilingual Emotion Prediction in Social Media Text Abstract Detecting emotion in text allows social and computational scientists to study h

MilaNLP 35 Sep 17, 2022
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022
A music comments dataset, containing 39,051 comments for 27,384 songs.

Music Comments Dataset A music comments dataset, containing 39,051 comments for 27,384 songs. For academic research use only. Introduction This datase

Zhang Yixiao 2 Jan 10, 2022
Understand Text Summarization and create your own summarizer in python

Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent

Sreekanth M 1 Oct 18, 2022
This repo stores the codes for topic modeling on palliative care journals.

This repo stores the codes for topic modeling on palliative care journals. Data Preparation You first need to download the journal papers. bash 1_down

3 Dec 20, 2022
gaiic2021-track3-小布助手对话短文本语义匹配复赛rank3、决赛rank4

决赛答辩已经过去一段时间了,我们队伍ac milan最终获得了复赛第3,决赛第4的成绩。在此首先感谢一些队友的carry~ 经过2个多月的比赛,学习收获了很多,也认识了很多大佬,在这里记录一下自己的参赛体验和学习收获。

102 Dec 19, 2022
基于GRU网络的句子判断程序/A program based on GRU network for judging sentences

SentencesJudger SentencesJudger 是一个基于GRU神经网络的句子判断程序,基本的功能是判断文章中的某一句话是否为一个优美的句子。 English 如何使用SentencesJudger 确认Python运行环境 安装pyTorch与LTP python3 -m pip

8 Mar 24, 2022
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
Wind Speed Prediction using LSTMs in PyTorch

Implementation of Deep-Forecast using PyTorch Deep Forecast: Deep Learning-based Spatio-Temporal Forecasting Adapted from original implementation Setu

Onur Kaplan 151 Dec 14, 2022
In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

Transformers are all you need In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a

Aymen Berriche 8 Apr 13, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Natural Language Processing Specialization

Natural Language Processing Specialization In this folder, Natural Language Processing Specialization projects and notes can be found. WHAT I LEARNED

Kaan BOKE 3 Oct 06, 2022
A fast Text-to-Speech (TTS) model. Work well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far). 快速语音合成模型,适用于英语、普通话/中文、日语、韩语、俄语和藏语(当前已测试)。

简体中文 | English 并行语音合成 [TOC] 新进展 2021/04/20 合并 wavegan 分支到 main 主分支,删除 wavegan 分支! 2021/04/13 创建 encoder 分支用于开发语音风格迁移模块! 2021/04/13 softdtw 分支 支持使用 Sof

Atomicoo 161 Dec 19, 2022
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Max Woolf 4.8k Dec 30, 2022
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
Kinky furry assitant based on GPT2

KinkyFurs-V0 Kinky furry assistant based on GPT2 How to run python3 V0.py then, open web browser and go to localhost:8080 Requirements: Flask trans

Sparki 1 Jun 11, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022