DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Overview

DziriBERT

dziribert drawing

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect. It handles Algerian text contents written using both Arabic and Latin characters. It sets new state of the art results on Algerian text classification datasets, even if it has been pre-trained on much less data (~1 million tweets).

The model is publicly available at: https://huggingface.co/alger-ia/dziribert.

For more information, please visit our paper: https://arxiv.org/pdf/2109.12346.pdf

Evaluation

The Twifil dataset was used to compare DziriBERT with current multilingual, standard Arabic and dialectal Arabic models:

Model Sentiment acc. Emotion acc.
bert-base-multilingual-cased 73.6 % 59.4 %
aubmindlab/bert-base-arabert 72.1 % 61.2 %
CAMeL-Lab/bert-base-arabic-camelbert-mix 77.1 % 65.7 %
qarib/bert-base-qarib 77.7 % 67.6 %
UBC-NLP/MARBERT 80.1 % 68.4 %
alger-ia/dziribert 80.3 % 69.3 %

In order to reproduce these results, please install the following requirements:

pip install -r requirements.txt

Then, run the following evaluation script:

python3 evaluate_model.py

These results have been obtained on a Tesla K80 GPU.

Pretrained DziriBERT

DziriBERT has been uploaded to the HuggingFace hub in order to facilitate its use: https://huggingface.co/alger-ia/dziribert.

It can be easily downloaded and loaded using the transformers library:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("alger-ia/dziribert")
model = BertForMaskedLM.from_pretrained("alger-ia/dziribert")

How to cite

@article{dziribert,
  title={DziriBERT: a Pre-trained Language Model for the Algerian Dialect},
  author={Abdaoui, Amine and Berrimi, Mohamed and Oussalah, Mourad and Moussaoui, Abdelouahab},
  journal={arXiv preprint arXiv:2109.12346},
  year={2021}
}

Contact

Please contact [email protected] for any question, feedback or request.

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics.

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics. Jury offers a smooth and easy-to-use interface. It uses datasets for underlying metric computa

Open Business Software Solutions 129 Jan 06, 2023
🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Pretrained BigBird Model for Korean What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation 한국어 | English What is BigBird? Bi

Jangwon Park 183 Dec 14, 2022
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time.

Wordle_Bot Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time. It will log onto the wordle website and en

Lucas Polidori 15 Dec 11, 2022
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

Abel 211 Dec 28, 2022
RuCLIP tiny (Russian Contrastive Language–Image Pretraining) is a neural network trained to work with different pairs (images, texts).

RuCLIPtiny Zero-shot image classification model for Russian language RuCLIP tiny (Russian Contrastive Language–Image Pretraining) is a neural network

Shahmatov Arseniy 26 Sep 20, 2022
Code for Text Prior Guided Scene Text Image Super-Resolution

Code for Text Prior Guided Scene Text Image Super-Resolution

82 Dec 26, 2022
Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies

PyStanfordDependencies Python interface for converting Penn Treebank trees to Universal Dependencies and Stanford Dependencies. Example usage Start by

David McClosky 64 May 08, 2022
L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources.

L3Cube-MahaCorpus L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual

21 Dec 17, 2022
A PyTorch Implementation of End-to-End Models for Speech-to-Text

speech Speech is an open-source package to build end-to-end models for automatic speech recognition. Sequence-to-sequence models with attention, Conne

Awni Hannun 647 Dec 25, 2022
Train BPE with fastBPE, and load to Huggingface Tokenizer.

BPEer Train BPE with fastBPE, and load to Huggingface Tokenizer. Description The BPETrainer of Huggingface consumes a lot of memory when I am training

Lizhuo 1 Dec 23, 2021
This code is the implementation of Text Emotion Recognition (TER) with linguistic features

APSIPA-TER This code is the implementation of Text Emotion Recognition (TER) with linguistic features. The network model is BERT with a pretrained mod

kenro515 1 Feb 08, 2022
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 07, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
Text-Based zombie apocalyptic decision-making game in Python

Inspiration We shared university first year game coursework.[to gauge previous experience and start brainstorming] Adapted a particular nuclear fallou

Amin Sabbagh 2 Feb 17, 2022
auto_code_complete is a auto word-completetion program which allows you to customize it on your need

auto_code_complete v1.3 purpose and usage auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the m

RUO 2 Feb 22, 2022
PortaSpeech - PyTorch Implementation

PortaSpeech - PyTorch Implementation PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech. Model Size Module Nor

Keon Lee 276 Dec 26, 2022
neural network based speaker embedder

Content What is deepaudio-speaker? Installation Get Started Model Architecture How to contribute to deepaudio-speaker? Acknowledge What is deepaudio-s

20 Dec 29, 2022