A Chinese to English Neural Machine Translation Project

Overview

ZH-EN NMT: Chinese to English Neural Machine Translation

This project is inspired by Stanford's CS224N NMT Project

Dataset used in this project: News Commentary v14

Intro

This is mainly a learning project to familiarize myself with PyTorch, machine translation, and NLP model training.

To investigate how different setups of the recurrent layer affect final performance, I compared the training efficiency and effectiveness of different RNN layers for the encoder, changing one feature at a time while holding all other parameters fixed:

  • RNN types

    • GRU
    • LSTM
  • Activation Functions on Output Layer

    • Tanh
    • ReLU
    • LeakyReLU
  • Number of layers

    • single layer
    • double layer

Code Files

_/
├─ utils.py # utilities
├─ vocab.py # generate vocab
├─ model_embeddings.py # embedding layer
├─ nmt_model.py # nmt model definition
├─ run.py # training and testing

Good Translation Examples

  • source: 相反,这意味着合作的基础应当是共同的长期战略利益,而不是共同的价值观。

    • target: Instead, it means that cooperation must be anchored not in shared values, but in shared long-term strategic interests.
    • translation: On the contrary, that means cooperation should be a common long-term strategic interests, rather than shared values.
  • source: 但这个问题其实很简单: 谁来承受这些用以降低预算赤字的紧缩措施的冲击。

    • target: But the issue is actually simple: Who will bear the brunt of measures to reduce the budget deficit?
    • translation: But the question is simple: Who is to bear the impact of austerity measures to reduce budget deficits?
  • source: 上述合作对打击恐怖主义、贩卖人口和移民可能发挥至关重要的作用。

    • target: Such cooperation is essential to combat terrorism, human trafficking, and migration.
    • translation: Such cooperation is essential to fighting terrorism, trafficking, and migration.
  • source: 与此同时, 政治危机妨碍着政府追求艰难的改革。

    • target: At the same time, political crisis is impeding the government’s pursuit of difficult reforms.
    • translation: Meanwhile, political crises hamper the government’s pursuit of difficult reforms.

Preprocessing

Preprocessing Colab notebook

  • using jieba to separate Chinese words by spaces

Generate Vocab From Training Data

  • Input: training data of Chinese and English

  • Output: a vocab file mapping (sub)words to ids for Chinese and English. A limited-size vocab is selected using SentencePiece (essentially Byte Pair Encoding of character n-grams) to cover around 99.95% of the training data
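To make the coverage target concrete, here is a minimal sketch of picking the smallest set of most frequent symbols that covers a given fraction of all occurrences. It is not SentencePiece's algorithm, only an illustration of the idea; the function name `smallest_covering_vocab` is hypothetical, and since the README does not say whether the 99.95% applies to characters or to (sub)words, the function accepts any sequence of symbols.

```python
from collections import Counter


def smallest_covering_vocab(symbols, target=0.9995):
    """Return the most frequent symbols whose counts cover `target` of all occurrences."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return []
    vocab, covered = [], 0
    # Walk symbols from most to least frequent until the target is reached.
    for symbol, count in counts.most_common():
        if covered / total >= target:
            break
        vocab.append(symbol)
        covered += count
    return vocab


# Toy corpus: "a" and "b" together account for 99% of occurrences.
toy_corpus = ["a"] * 90 + ["b"] * 9 + ["c"]
print(smallest_covering_vocab(toy_corpus, target=0.95))
```

Symbols left out of the returned list would map to an unknown token, or be broken into smaller pieces by a subword tokenizer.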

Model Definition

  • a Seq2Seq model with attention

    (Figure: Seq2Seq model with attention; the image is from the book Dive into Deep Learning)

    • Encoder
      • A Recurrent Layer
    • Decoder
      • LSTMCell (hidden_size=512)
    • Attention
      • Multiplicative Attention
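As an illustration of multiplicative attention, the sketch below scores each encoder state against the current decoder state through a weight matrix, normalizes the scores with a softmax, and returns the weighted sum of encoder states as the context vector. It is written in dependency-free Python for readability and is not the code in nmt_model.py, which would use batched PyTorch tensor operations; the function name and the shape convention for `W` (encoder dimension by decoder dimension) are assumptions.

```python
import math


def multiplicative_attention(dec_hidden, enc_hiddens, W):
    """Return (weights, context) for scores e_i = enc_i . (W @ dec_hidden)."""
    # Project the decoder state into the encoder space; W is (enc_dim x dec_dim).
    projected = [sum(w * d for w, d in zip(row, dec_hidden)) for row in W]
    # One score per source position.
    scores = [sum(h * p for h, p in zip(enc, projected)) for enc in enc_hiddens]
    # Numerically stable softmax over source positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Context vector: attention-weighted sum of encoder states.
    dim = len(enc_hiddens[0])
    context = [sum(a * enc[k] for a, enc in zip(weights, enc_hiddens)) for k in range(dim)]
    return weights, context


# Two source positions, identity W: the first encoder state matches the decoder state.
weights, context = multiplicative_attention(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
print(weights, context)
```

In a decoder, the context vector is typically combined with the decoder hidden state before predicting the next target token.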

Training And Testing Results

Training Colab notebook

  • Hyperparameters:
    • Embedding Size & Hidden Size: 512
    • Dropout Rate: 0.25
    • Starting Learning Rate: 5e-4
    • Batch Size: 32
    • Beam Size for Beam Search: 10
  • NOTE: The BLEU scores reported here are computed on this project's test set, so they should only be used to compare the relative effectiveness of models trained on this data
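For readers unfamiliar with the beam size hyperparameter, the following is a minimal sketch of beam search over a toy next-token model. It is not the decoding code in this repository: `step_fn`, `toy_step`, and the token names are hypothetical, and details such as length normalization are left out.

```python
import math


def beam_search(step_fn, bos, eos, beam_size=10, max_len=20):
    """step_fn(prefix) returns a dict mapping each next token to its probability."""
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in step_fn(tokens).items():
                if p > 0:
                    candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the `beam_size` best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    if not finished:
        return [bos]
    return max(finished, key=lambda c: c[1])[0]


def toy_step(prefix):
    """Toy model in which the greedy first choice ("a") leads to a worse sentence."""
    table = {"<s>": {"a": 0.6, "b": 0.4}, "a": {"x": 0.5, "y": 0.5}, "b": {"z": 1.0}}
    return table.get(prefix[-1], {"</s>": 1.0})


print(beam_search(toy_step, "<s>", "</s>", beam_size=1))
print(beam_search(toy_step, "<s>", "</s>", beam_size=2))
```

With a beam of 1 the search commits to the locally best token, while a wider beam can recover a hypothesis whose first token looked worse but whose overall probability is higher.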

For Experiment

  • Dataset: the dataset is randomly split into a training set (~260000), a validation set (~20000), and a testing set (~20000); the same split is used for every experiment group
  • Max Number of Iterations: 50000
  • NOTE: I tried a vanilla RNN (nn.RNN) in several configurations, but its BLEU score turned out to be extremely low (the absence of residual connections might be the issue)
    • I decided not to include it in the comparison until the issue is resolved
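One way to keep a random split identical across experiment groups is to shuffle with a fixed seed before slicing. The sketch below assumes this approach; the README does not say how the split was actually produced, and `split_dataset` and its parameters are hypothetical names.

```python
import random


def split_dataset(pairs, n_valid, n_test, seed=0):
    """Shuffle with a fixed seed, then slice into train/valid/test lists."""
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


# Toy usage: 300 sentence-pair ids split into 260/20/20.
train, valid, test = split_dataset(range(300), n_valid=20, n_test=20, seed=1)
print(len(train), len(valid), len(test))
```

Calling the function again with the same seed returns the same three lists, so every model sees the same training, validation, and test sentences.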

Model                                                      Training Time (sec)   BLEU Score on Test Set
A. Bidirectional 1-Layer GRU with Tanh                     5158.99               14.26
B. Bidirectional 1-Layer LSTM with Tanh                    5150.31               16.20
C. Bidirectional 2-Layer LSTM with Tanh                    6197.58               16.38
D. Bidirectional 1-Layer LSTM with ReLU                    5275.12               14.01
E. Bidirectional 1-Layer LSTM with LeakyReLU(slope=0.1)    5292.58               14.87

Training Perplexities and Validation Perplexities: plots (not shown)
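
Perplexity, which is tracked on both the training and validation sets in these experiments, is conventionally the exponential of the average negative log-likelihood per target token. A minimal sketch under that convention follows; the function name and its inputs (one summed negative log-likelihood and one token count per sentence) are assumptions, not taken from run.py.

```python
import math


def perplexity(sentence_nlls, sentence_lengths):
    """exp(total negative log-likelihood / total number of target tokens)."""
    total_tokens = sum(sentence_lengths)
    if total_tokens == 0:
        raise ValueError("need at least one target token")
    return math.exp(sum(sentence_nlls) / total_tokens)


# A model that assigns probability 1/4 to every token has perplexity 4.
nll_per_token = math.log(4)
print(perplexity([3 * nll_per_token, 5 * nll_per_token], [3, 5]))
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each step.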

Current Best Version

Bidirectional 2-Layer LSTM with Tanh, 1024 embed_size & hidden_size, trained 11517.19 sec (44000 iterations), BLEU score 17.95

Model         Training Time (sec)   BLEU Score on Test Set
Best Model    11517.19              17.95

Training Perplexities and Validation Perplexities: plots (not shown)

Analysis

  • LSTM tends to perform better than GRU (B vs. A: 16.20 vs. 14.26 BLEU); it has an extra set of parameters
  • Tanh tends to work better than ReLU and LeakyReLU (B vs. D and E), possibly because less information is lost
  • Making the LSTM deeper (more layers) can improve performance slightly (C vs. B: 16.38 vs. 16.20 BLEU), but it costs more time to train
  • Surprisingly, the training times for A, B, and D are roughly the same
    • the cause may be that the dataset is not large enough, or that the cloud service I used to train the models does not perform consistently

Bad Examples & Case Analysis

  • source: 全球目击组织(Global Witness)的报告记录, 光是2015年就有16个国家的185人被杀。
    • target: A Global Witness report documented 185 killings across 16 countries in 2015 alone.
    • translation: According to the Global eye, the World Health Organization reported that 185 people were killed in 2015.
    • problems:
      • Information Loss: 16 countries
      • Unknown Proper Noun: Global Witness
  • source: 大自然给了足以满足每个人需要的东西, 但无法满足每个人的贪婪
    • target: Nature provides enough for everyone’s needs, but not for everyone’s greed.
    • translation: Nature provides enough to satisfy everyone.
    • problems:
      • Huge Information Loss
  • source: 我衷心希望全球经济危机和巴拉克·奥巴马当选总统能对新冷战的荒唐理念进行正确的评估。
    • target: It is my hope that the global economic crisis and Barack Obama’s presidency will put the farcical idea of a new Cold War into proper perspective.
    • translation: I do hope that the global economic crisis and President Barack Obama will be corrected for a new Cold War.
    • problems:
      • Action Sender And Receiver Exchanged
      • Failed To Translate Complex Sentence
  • source: 人们纷纷猜测欧元区将崩溃。
    • target: Speculation about a possible breakup was widespread.
    • translation: The eurozone would collapse.
    • problems:
      • Significant Information Loss

Means to Improve the NMT model

  • Dataset
    • The dataset is fairly small, and the model is not trained through all of the data
    • Even as a native Chinese speaker, I could not understand what some of the source sentences are saying
    • The target sentences are not self-contained; they themselves need context to be understood (e.g. the target sentence in the last of the "Bad Examples")
    • Even for a human, some of the source sentences are too hard to translate
  • Model Architecture
    • CNN & Transformer
    • character based model
    • Make the model even larger & deeper (... I need GPUs)
  • Tricks that might help
    • Add a proper noun dictionary to translate unknown proper nouns word-by-word (phrase-by-phrase)
    • Initialize (sub)word embedding with pretrained embedding
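
The proper noun dictionary idea could be sketched as a wrapper around the translation call: known names are masked with placeholder tags before translation and replaced with their dictionary entries afterwards. This is only one possible design and is not implemented in this repository; `translate_with_dictionary`, the `<PN0>`-style tags, and the `translate` callable are hypothetical, and the sketch assumes the model copies the tags through unchanged, which a real model would have to be trained to do.

```python
def translate_with_dictionary(source, translate, proper_nouns):
    """Mask known proper nouns, translate, then substitute dictionary entries."""
    masked = source
    slots = {}
    # Longest entries first so that longer names win over their substrings.
    entries = sorted(proper_nouns.items(), key=lambda kv: -len(kv[0]))
    for i, (zh, en) in enumerate(entries):
        if zh in masked:
            tag = f"<PN{i}>"
            masked = masked.replace(zh, tag)
            slots[tag] = en
    output = translate(masked)
    # Restore the dictionary translation wherever the tag survived.
    for tag, en in slots.items():
        output = output.replace(tag, en)
    return output


def fake_translate(text):
    """Stand-in for the NMT model, used only to exercise the wrapper."""
    return text.replace("的报告", " report")


print(translate_with_dictionary(
    "全球目击组织的报告", fake_translate, {"全球目击组织": "Global Witness"}
))
```

With a real model, the placeholder tags would need to be added to the vocabulary and appear in the training data so that the decoder learns to copy them.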

How To Run

  • Download the dataset you want to use, and change every "./zh_en_data" in run.sh to the path where your data is stored
  • To run locally on a CPU (mostly for sanity checks; training the model on a CPU is not practical)
    • set up the environment using conda/miniconda: conda env create --file local_env.yml
  • To run on a GPU
    • set up the environment and running process following the Colab notebook

Contact

If you have any questions or have trouble running the code, feel free to contact me via email.

Owner
Zhenbang Feng