Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Overview

Mask-Align: Self-Supervised Neural Word Alignment

This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment.

@inproceedings{chen2021maskalign,
   title={Mask-Align: Self-Supervised Neural Word Alignment},
   author={Chi Chen and Maosong Sun and Yang Liu},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2021}
}

The implementation is built on top of THUMT.

Introduction

Mask-Align is a self-supervised neural word aligner. It masks out each target token in parallel and predicts it conditioned on both the source and the remaining target tokens. The source token that contributes most to recovering a masked target token is aligned to that target token.
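
The alignment rule in the last sentence can be pictured as a row-wise argmax over a target-by-source weight matrix. The sketch below only illustrates that idea: the weight values and the function name align_by_argmax are made up, and it is not how the model computes or post-processes its attention weights.

```python
def align_by_argmax(weights):
    """For each target position, pick the source position with the largest weight.

    weights[j][i] is an illustrative score for how much source token i
    contributes to recovering the masked target token j.
    Returns a list of (source_index, target_index) pairs.
    """
    links = []
    for tgt_idx, row in enumerate(weights):
        src_idx = max(range(len(row)), key=row.__getitem__)
        links.append((src_idx, tgt_idx))
    return links


source = ["wiederaufnahme", "der", "sitzungsperiode"]
target = ["resumption", "of", "the", "session"]

# Hypothetical weights, one row per target token, one column per source token.
weights = [
    [0.8, 0.1, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]

for s, t in align_by_argmax(weights):
    print(f"{source[s]} -> {target[t]}")
```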

Prerequisites

  • PyTorch
  • NLTK
  • remi *
  • pyecharts *
  • pandas *
  • matplotlib *
  • seaborn *

*: optional, only used for Visualization.

Usage

Data Preparation

To get the data used in our paper, you can follow the instructions in https://github.com/lilt/alignment-scripts.

To train an aligner with your own data, you should pre-process it yourself. Usually this includes tokenization, BPE, etc. You can find a simple guide here.

At this point you should have pre-processed parallel training data (train.src, train.tgt), optional validation data (valid.src, valid.tgt), and test data (test.src, test.tgt). An example 3-sentence German–English parallel training corpus is:

# train.src
wiederaufnahme der sitzungsperiode
frau präsidentin , zur geschäfts @@ordnung .
ich bitte sie , sich zu einer schweigeminute zu erheben .

# train.tgt
resumption of the session
madam president , on a point of order .
please rise , then , for this minute ' s silence .

The next step is to shuffle the training set, which helps improve the results.

python thualign/scripts/shuffle_corpus.py --corpus train.src train.tgt

The resulting files, train.src.shuf and train.tgt.shuf, contain the same sentence pairs in random order.
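
The important property of this step is that both sides are permuted with the same order, so line k of the source still pairs with line k of the target. A minimal sketch of that idea follows; shuffle_parallel is a hypothetical helper written for illustration and is not the code in shuffle_corpus.py.

```python
import random


def shuffle_parallel(src_lines, tgt_lines, seed=0):
    """Shuffle two aligned lists of sentences with one shared permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)  # seeded so the result is reproducible
    return [src_lines[i] for i in order], [tgt_lines[i] for i in order]


src = [
    "wiederaufnahme der sitzungsperiode",
    "frau präsidentin , zur geschäfts @@ordnung .",
    "ich bitte sie , sich zu einer schweigeminute zu erheben .",
]
tgt = [
    "resumption of the session",
    "madam president , on a point of order .",
    "please rise , then , for this minute ' s silence .",
]

shuf_src, shuf_tgt = shuffle_parallel(src, tgt)
for s, t in zip(shuf_src, shuf_tgt):
    print(s, "|||", t)
```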

Then we need to generate the vocabularies from the training set.

python thualign/scripts/build_vocab.py train.src.shuf vocab.train.src
python thualign/scripts/build_vocab.py train.tgt.shuf vocab.train.tgt

The resulting files vocab.train.src.txt and vocab.train.tgt.txt are the final source and target vocabularies used for model training.
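
Conceptually, a vocabulary is the set of distinct tokens in the training data, usually ordered by frequency. The sketch below shows that idea on the example target corpus. It is an assumption-laden illustration: the function build_vocab here is hypothetical, and the ordering, special tokens and file format produced by build_vocab.py may differ.

```python
from collections import Counter


def build_vocab(lines):
    """Return distinct whitespace-separated tokens, most frequent first.

    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked]


train_tgt = [
    "resumption of the session",
    "madam president , on a point of order .",
    "please rise , then , for this minute ' s silence .",
]

vocab = build_vocab(train_tgt)
print(vocab[:3])
```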

Training

All experiments are configured via config files in thualign/configs; see Configs for more details. We provide an example config file, thualign/configs/user/example.config. To use it, make three changes:

  1. change device_list, update_cycle and batch_size to match your machine configuration;

  2. change exp_dir and output to your own experiment directory;

  3. change train/valid/test_input and vocab to your data paths.

Once the config is set up, use the following command to train the alignment model described in the config file:

bash thualign/bin/train.sh -s thualign/configs/user/example.config

or, more simply:

bash thualign/bin/train.sh -s example

The configuration file is an INI file and is parsed with configparser. By adding a new section, you can override some settings while keeping the others unchanged.

[DEFAULT]
...

[small_budget]
batch_size = 4500
update_cycle = 8
device_list = [0]
half = False

Use the -e option to run this small_budget section:

bash thualign/bin/train.sh -s example -e small_budget
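
The override behaviour comes from configparser itself: any key missing from a section falls back to the value in [DEFAULT]. The snippet below demonstrates this with the standard library only. The [DEFAULT] values are placeholders chosen for the demonstration, and how thualign loads and converts these values internally is not shown.

```python
import configparser

# Placeholder [DEFAULT] values plus the small_budget section from above.
CONFIG_TEXT = """
[DEFAULT]
model = mask_align
batch_size = 9000
update_cycle = 1
device_list = [0,1,2,3]
half = True

[small_budget]
batch_size = 4500
update_cycle = 8
device_list = [0]
half = False
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
section = parser["small_budget"]

print(section["batch_size"])  # overridden in [small_budget]: 4500
print(section["model"])       # inherited from [DEFAULT]: mask_align
```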

You can also monitor the training process through TensorBoard:

tensorboard --logdir=[output]

Test

After training, the following command can be used to generate attention weights (-g), generate data for attention visualization (-v), and evaluate the model's AER (-t) if test_ref is provided.

bash thualign/bin/test.sh -s [CONFIG] -e [EXP] -gvt

For example, to test the model trained with the configs in example.config:

bash thualign/bin/test.sh -s example -gvt

You might get the following output:

alignment-soft.txt: 14.4% (87.7%/83.5%/9467)

The alignment results (alignment.txt) along with other test results are stored in [output]/test by default.
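
For reference, AER (alignment error rate) is conventionally defined over a predicted link set A, a set of sure gold links S, and a set of possible gold links P (with S contained in P) as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below computes that standard definition on toy link sets. The function name and the toy data are illustrative, and it does not reproduce the repository's evaluation script or its file formats.

```python
def alignment_error_rate(predicted, sure, possible):
    """Standard AER: 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P."""
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible) | sure  # every sure link also counts as possible
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        raise ValueError("need at least one predicted or sure link")
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Toy example: links are (source_index, target_index) pairs.
predicted = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (2, 3)}
possible = {(1, 2)}

aer = alignment_error_rate(predicted, sure, possible)
print(f"AER = {aer:.1%}")
```

Lower is better: a prediction that contains every sure link and nothing outside the possible links scores 0.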

Configs

Most of the configuration of Mask-Align is done through configuration files in thualign/configs. The model reads the basic configs first, followed by the user-defined configs.

Basic Config

Predefined configs for experiments to use.

  • base.config: basic configs for training, validation and test

  • model.config: define different models with their hyperparameters

User Config

Customized configs. They must define the following settings and may add other experiment-specific parameters:

  • train/valid/test_input: paths of the input parallel corpora
  • vocab: paths of the vocabulary files generated by thualign/scripts/build_vocab.py
  • output: path to save the model outputs
  • model: which model to use
  • batch_size: the batch size (number of tokens) used in the training stage.
  • update_cycle: the number of batches whose gradients are accumulated before each parameter update. The default value is 1. If you have only 1 GPU and want to match the results obtained with 4 GPUs, simply set this parameter to 4. Note that training will take correspondingly longer.
  • device_list: the list of GPUs to be used in training. Use the nvidia-smi command to find unused GPUs. If the unused GPUs are gpu0 and gpu1, set this parameter as device_list=[0,1].
  • half: set this to True if you wish to use half-precision training. This speeds up the training procedure. Make sure that your GPUs support half precision.
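
The interaction between batch_size, device_list and update_cycle can be checked with simple arithmetic. The sketch below assumes that batch_size is counted per device (an assumption, consistent with the 1-GPU versus 4-GPU advice above); tokens_per_update is a hypothetical helper, not part of thualign.

```python
def tokens_per_update(batch_size, device_list, update_cycle):
    """Approximate number of tokens contributing to one parameter update.

    Assumes batch_size is a per-device token count and that gradients from
    update_cycle consecutive batches are accumulated before each update.
    """
    return batch_size * len(device_list) * update_cycle


# Settings from example.config: 4 GPUs, no accumulation.
full = tokens_per_update(batch_size=9000, device_list=[0, 1, 2, 3], update_cycle=1)

# Settings from the small_budget section: 1 GPU, 8 accumulation steps.
small = tokens_per_update(batch_size=4500, device_list=[0], update_cycle=8)

print(full, small)
```

Under that assumption, both settings process the same number of tokens per update, which is why the smaller setup can stand in for the larger one at the cost of longer training.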

Here is a minimal experiment config:

### thualign/configs/user/example.config
[DEFAULT]

train_input = ['train.src', 'train.tgt']
valid_input = ['valid.src', 'valid.tgt']
vocab = ['vocab.src.txt', 'vocab.tgt.txt']
test_input = ['test.src', 'test.tgt']
test_ref = test.talp

exp_dir = exp
label = agree_deen
output = ${exp_dir}/${label}

model = mask_align

batch_size = 9000
update_cycle = 1
device_list = [0,1,2,3]
half = True
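
To see how such a file resolves, the sketch below reads a shortened copy of it with the standard library. The ${...} references match configparser's ExtendedInterpolation syntax, so that is what the sketch uses; converting list-valued entries with ast.literal_eval is an assumption made for illustration, and thualign's own loading code may do this differently.

```python
import ast
import configparser

# Shortened copy of the example config above.
CONFIG_TEXT = """
[DEFAULT]
train_input = ['train.src', 'train.tgt']
exp_dir = exp
label = agree_deen
output = ${exp_dir}/${label}
batch_size = 9000
device_list = [0,1,2,3]
half = True
"""

parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
parser.read_string(CONFIG_TEXT)
cfg = parser["DEFAULT"]

output = cfg["output"]                              # ${...} references resolved
train_input = ast.literal_eval(cfg["train_input"])  # string -> Python list
device_list = ast.literal_eval(cfg["device_list"])
batch_size = cfg.getint("batch_size")
half = cfg.getboolean("half")

print(output, train_input, device_list, batch_size, half)
```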

Visualization

To better understand and analyze the model, Mask-Align supports the following two types of visualization.

Training Visualization

Add eval_plot = True to your config file to turn on visualization during training. This plots 5 attention maps from evaluation in TensorBoard.

These packages are required for training visualization:

  • pandas
  • matplotlib
  • seaborn

Attention Visualization

Use -v in the test command to generate alignment_vizdata.pt first. It is stored in [output]/test by default. To visualize it, use this script:

python thualign/scripts/visualize.py [output]/test/alignment_vizdata.pt [--port PORT]

This will start a local service that plots the attention weights for all the test sentence pairs. You can access it through a web browser.

These packages are required for attention visualization:

  • remi
  • pyecharts

Contact

If you have questions, suggestions, or bug reports, please email [email protected].

Owner
THUNLP-MT
Machine Translation Group, Natural Language Processing Lab at Tsinghua University (THUNLP). Please refer to https://github.com/thunlp for more NLP resources.