Punctuation Restoration using Transformer Models
This repository contains the official implementation of the paper Punctuation Restoration using Transformer Models for High- and Low-Resource Languages, accepted at the EMNLP workshop W-NUT 2020.
Data
English
English datasets are provided in the data/en directory. These are collected from here.
Bangla
Bangla datasets are provided in the data/bn directory.
Model Architecture
We fine-tune a Transformer-based language model (e.g., BERT) for the punctuation restoration task. The Transformer encoder is followed by a bidirectional LSTM and a linear layer that predicts the target punctuation token at each sequence position.
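The following is a minimal sketch of this architecture in PyTorch, assuming the Hugging Face transformers library. The module name, default dimensions, and class count are illustrative (four classes matching the Comma/Period/Question/no-punctuation scheme described below); the actual implementation lives in the src directory.

import torch.nn as nn
from transformers import AutoModel

class PunctuationRestorer(nn.Module):
    def __init__(self, pretrained_model="roberta-large", lstm_dim=512, num_classes=4):
        super().__init__()
        # Transformer encoder producing contextual token embeddings
        self.encoder = AutoModel.from_pretrained(pretrained_model)
        hidden = self.encoder.config.hidden_size
        # Bidirectional LSTM over the encoder outputs
        self.lstm = nn.LSTM(hidden, lstm_dim, batch_first=True, bidirectional=True)
        # Linear layer scoring a punctuation class at every sequence position
        self.classifier = nn.Linear(2 * lstm_dim, num_classes)

    def forward(self, input_ids, attention_mask):
        x = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.lstm(x)
        return self.classifier(x)  # shape: (batch, seq_len, num_classes)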
Dependencies
Install PyTorch following the instructions on the PyTorch website. The remaining dependencies can be installed with the following command:
pip install -r requirements.txt
Training
To train a punctuation restoration model with the optimal parameter settings for English, run the following command:
python src/train.py --cuda=True --pretrained-model=roberta-large --freeze-bert=False --lstm-dim=-1 \
--language=english --seed=1 --lr=5e-6 --epoch=10 --use-crf=False --augment-type=all --augment-rate=0.15 \
--alpha-sub=0.4 --alpha-del=0.4 --data-path=data --save-path=out
To train for Bangla, the corresponding command is:
python src/train.py --cuda=True --pretrained-model=xlm-roberta-large --freeze-bert=False --lstm-dim=-1 \
--language=bangla --seed=1 --lr=5e-6 --epoch=10 --use-crf=False --augment-type=all --augment-rate=0.15 \
--alpha-sub=0.4 --alpha-del=0.4 --data-path=data --save-path=out
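The --augment-type, --augment-rate, --alpha-sub, and --alpha-del flags control token-level data augmentation. As a hedged sketch of how these parameters plausibly interact (the authoritative behaviour is in the repository's training code): with probability --augment-rate a token is perturbed, and --alpha-sub and --alpha-del split that case between substituting the token with an unknown token, deleting it, or (with the remaining probability) inserting an unknown token.

import random

def augment(tokens, unk_token="<unk>", rate=0.15, alpha_sub=0.4, alpha_del=0.4):
    # Illustrative augmentation pass: each token is kept unchanged with
    # probability 1 - rate; otherwise one of three perturbations is applied.
    out = []
    for tok in tokens:
        if random.random() >= rate:
            out.append(tok)
            continue
        r = random.random()
        if r < alpha_sub:
            out.append(unk_token)         # substitution with the unknown token
        elif r < alpha_sub + alpha_del:
            continue                      # deletion: drop the token
        else:
            out.extend([unk_token, tok])  # insertion of an unknown token
    return out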
Supported models for English
bert-base-uncased
bert-large-uncased
bert-base-multilingual-cased
bert-base-multilingual-uncased
xlm-mlm-en-2048
xlm-mlm-100-1280
roberta-base
roberta-large
distilbert-base-uncased
distilbert-base-multilingual-cased
xlm-roberta-base
xlm-roberta-large
albert-base-v1
albert-base-v2
albert-large-v2
Supported models for Bangla
bert-base-multilingual-cased
bert-base-multilingual-uncased
xlm-mlm-100-1280
distilbert-base-multilingual-cased
xlm-roberta-base
xlm-roberta-large
Pretrained Models
You can find the pretrained RoBERTa-large model with augmentation for English here.
The XLM-RoBERTa-large model with augmentation for Bangla can be found here.
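Once downloaded, the weights can presumably be loaded with standard PyTorch calls. A sketch under that assumption, reusing the PunctuationRestorer class sketched above (the exact checkpoint layout depends on the repository's model class):

import torch

model = PunctuationRestorer(pretrained_model="roberta-large")  # hypothetical class from the sketch above
state = torch.load("roberta-large-en.pt", map_location="cpu")  # downloaded weight file
model.load_state_dict(state)
model.eval()  # disable dropout etc. for inference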
Inference
You can run inference on an unprocessed text file to produce punctuated text using the inference module. Note that if the text already contains punctuation, it is removed before inference.
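As a rough illustration of that cleanup step (a hypothetical helper, not the inference module's actual code), the removal amounts to stripping the punctuation marks the model predicts:

import re

def strip_punctuation(text: str) -> str:
    # Drop the marks covered by the model's punctuation classes
    # (see the note on class composition below) before running inference.
    return re.sub(r"[,:\-.!;?]", "", text)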
Example script for English:
python inference.py --pretrained-model=roberta-large --weight-path=roberta-large-en.pt --language=en \
--in-file=data/test_en.txt --out-file=data/test_en_out.txt
This should create a text file with the following output:
Tolkien drew on a wide array of influences including language, Christianity, mythology, including the Norse Völsunga saga, archaeology, especially at the Temple of Nodens, ancient and modern literature and personal experience. He was inspired primarily by his profession, philology. his work centred on the study of Old English literature, especially Beowulf, and he acknowledged its importance to his writings.
Similarly, for Bangla:
python inference.py --pretrained-model=xlm-roberta-large --weight-path=xlm-roberta-large-bn.pt --language=bn \
--in-file=data/test_bn.txt --out-file=data/test_bn_out.txt
The expected output is:
বিংশ শতাব্দীর বাংলা মননে কাজী নজরুল ইসলামের মর্যাদা ও গুরুত্ব অপরিসীম। একাধারে কবি, সাহিত্যিক, সংগীতজ্ঞ, সাংবাদিক, সম্পাদক, রাজনীতিবিদ এবং সৈনিক হিসেবে অন্যায় ও অবিচারের বিরুদ্ধে নজরুল সর্বদাই ছিলেন সোচ্চার। তার কবিতা ও গানে এই মনোভাবই প্রতিফলিত হয়েছে। অগ্নিবীণা হাতে তার প্রবেশ, ধূমকেতুর মতো তার প্রকাশ। যেমন লেখাতে বিদ্রোহী, তেমনই জীবনে কাজেই "বিদ্রোহী কবি"। তার জন্ম ও মৃত্যুবার্ষিকী বিশেষ মর্যাদার সঙ্গে উভয় বাংলাতে প্রতি বৎসর উদযাপিত হয়ে থাকে।
Please note that the Comma class includes commas, colons, and dashes; the Period class includes full stops, exclamation marks, and semicolons; and the Question class includes only question marks.
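That grouping can be stated compactly as a mapping (illustrative Python; the label names are assumptions, not necessarily the identifiers used in the code):

PUNCT_CLASS = {
    ",": "COMMA", ":": "COMMA", "-": "COMMA",      # Comma class
    ".": "PERIOD", "!": "PERIOD", ";": "PERIOD",   # Period class
    "?": "QUESTION",                               # Question class
}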
Test
Trained models can be tested on processed data using the test module to produce results.
For example, to test the best performing English model, run the following command:
python src/test.py --pretrained-model=roberta-large --lstm-dim=-1 --use-crf=False --data-path=data/test \
--weight-path=weights/roberta-large-en.pt --sequence-length=256 --save-path=out
Please provide the same pretrained-model, lstm-dim, and use-crf arguments that were used during training of the model. This will run the test for all data available in the data-path directory.
Cite this work
@inproceedings{alam-etal-2020-punctuation,
title = "Punctuation Restoration using Transformer Models for High-and Low-Resource Languages",
author = "Alam, Tanvirul and
Khan, Akib and
Alam, Firoj",
booktitle = "Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.wnut-1.18",
pages = "132--142",
}