BanglaBERT
This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".
Models
We are releasing a slightly better checkpoint than the one reported in the paper: it was pretrained on 27.5 GB of data (including more code-switched and code-mixed texts) and trained for a further 2.5M steps. The pretrained model checkpoint is available here. To use this model for the supported downstream tasks in this repository, see Training & Evaluation.
Note: This model was pretrained using a specific normalization pipeline available here. All finetuning scripts in this repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized using this pipeline before tokenizing to get the best results. A basic example is available at the model page.
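For orientation, here is a minimal usage sketch. It assumes the checkpoint is hosted on the Hugging Face Hub under the name csebuetnlp/banglabert and that the normalization pipeline linked above is installed as the normalizer package; verify both against the model page.

```python
# Minimal sketch (assumptions: the checkpoint is published as "csebuetnlp/banglabert"
# on the Hugging Face Hub, and the normalization pipeline linked above is installed
# as the `normalizer` package).
from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize  # the normalization pipeline referenced above

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

text = "..."  # your Bangla input text
normalized = normalize(text)                      # normalize BEFORE tokenizing
inputs = tokenizer(normalized, return_tensors="pt")
outputs = model(**inputs)
```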
Datasets
We are also releasing the Bangla Natural Language Inference (NLI) dataset introduced in the paper. The dataset can be found here.
Setup
To install the necessary requirements, use the following snippet:
$ git clone https://github.com/csebuetnlp/banglabert
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh
- Use the newly created environment for running the scripts in this repository.
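To confirm the environment is usable, a quick check like the following can be run (assuming setup.sh installs PyTorch and the transformers library, as the finetuning scripts require):

```python
# Quick environment sanity check (assumes PyTorch and transformers were
# installed by setup.sh; adjust if your setup differs).
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```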
Training & Evaluation
To use the pretrained model for finetuning / inference on the different downstream tasks, see the following sections (a minimal finetuning sketch follows the list):
- Sequence Classification
  - For single-sequence classification tasks such as
    - Document classification
    - Sentiment classification
    - Emotion classification, etc.
  - For double-sequence classification tasks such as
    - Natural Language Inference (NLI)
    - Paraphrase detection, etc.
- Token Classification
  - For token tagging / classification tasks such as
    - Named Entity Recognition (NER)
    - Parts-of-Speech Tagging (PoS), etc.
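The repository's scripts handle the full training loop for the supported tasks; for orientation only, the following hedged sketch shows how a sequence-classification head could be attached to the pretrained checkpoint and fed a normalized sentence pair. The checkpoint name, label count, and label meanings here are assumptions, not the repository's exact configuration.

```python
# Minimal finetuning/inference sketch (assumptions: the checkpoint name
# "csebuetnlp/banglabert", a 3-class NLI label set, and inputs normalized
# with the pipeline referenced above).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from normalizer import normalize

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=3  # e.g. entailment / neutral / contradiction
)

# Double-sequence classification (e.g. NLI): pass premise and hypothesis as a
# sentence pair so the tokenizer inserts the separator token between them.
premise, hypothesis = "...", "..."  # your Bangla sentence pair
inputs = tokenizer(normalize(premise), normalize(hypothesis), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted label id
```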
Benchmarks
| Model | SC (Accuracy) | EC (F1*) | DC (Accuracy) | NER (Entity F1*) | NLI (Accuracy) |
|---|---|---|---|---|---|
| mBERT | 83.39 | 56.02 | 98.64 | 67.40 | 75.40 |
| XLM-R | 89.49 | 66.70 | 98.71 | 70.63 | 76.87 |
| sagorsarker/bangla-bert-base | 87.30 | 61.51 | 98.79 | 70.97 | 70.48 |
| monsoon-nlp/bangla-electra | 73.54 | 34.55 | 97.64 | 52.57 | 63.48 |
| BanglaBERT | 92.18 | 74.27 | 99.07 | 72.18 | 82.94 |

\* Weighted Average
The benchmarking datasets are as follows:
- SC: Sentiment Classification
- EC: Emotion Classification
- DC: Document Classification
- NER: Named Entity Recognition
- NLI: Natural Language Inference
Acknowledgements
We would like to thank Intelligent Machines and the Google TFRC Program for providing cloud support for pretraining the models.
License
Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
Citation
If you use any of the datasets, models, or code modules, please cite the following paper:
@article{bhattacharjee2021banglabert,
author = {Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
title = {BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
journal = {CoRR},
volume = {abs/2101.00204},
year = {2021},
url = {https://arxiv.org/abs/2101.00204},
eprinttype = {arXiv},
eprint = {2101.00204}
}