RoBERTa Marathi Language model trained from scratch during huggingface 🤗 x flax community week

Overview

RoBERTa base model for Marathi Language (मराठी भाषा)

Pretrained model on Marathi language using a masked language modeling (MLM) objective. RoBERTa was introduced in this paper and first released in this repository. We trained RoBERTa model for Marathi Language during community week hosted by Huggingface 🤗 using JAX/Flax for NLP & CV jax.

RoBERTa base model for Marathi language (मराठी भाषा)

huggingface-marathi-roberta

Model description

Marathi RoBERTa is a transformers model pretrained on a large corpus of Marathi data in a self-supervised fashion.

Intended uses & limitations ❗️

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. We used this model to fine tune on text classification task for iNLTK and indicNLP news text classification problem statement. Since marathi mc4 dataset is made by scraping marathi newspapers text, it will involve some biases which will also affect all fine-tuned versions of this model.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-base-mr')
>>> unmasker("मोठी बातमी! उद्या दुपारी <mask> वाजता जाहीर होणार दहावीचा निकाल")
[{'score': 0.057209037244319916,'sequence': 'मोठी बातमी! उद्या दुपारी आठ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 2226,
  'token_str': 'आठ'},
 {'score': 0.02796074189245701,
  'sequence': 'मोठी बातमी! उद्या दुपारी २० वाजता जाहीर होणार दहावीचा निकाल',
  'token': 987,
  'token_str': '२०'},
 {'score': 0.017235398292541504,
  'sequence': 'मोठी बातमी! उद्या दुपारी नऊ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 4080,
  'token_str': 'नऊ'},
 {'score': 0.01691395975649357,
  'sequence': 'मोठी बातमी! उद्या दुपारी २१ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 1944,
  'token_str': '२१'},
 {'score': 0.016252165660262108,
  'sequence': 'मोठी बातमी! उद्या दुपारी  ३ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 549,
  'token_str': ' ३'}]

Training data 🏋🏻‍♂️

The RoBERTa Marathi model was pretrained on mr dataset of C4 multilingual dataset:

C4 (Colossal Clean Crawled Corpus), Introduced by Raffel et al. in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

The dataset can be downloaded in a pre-processed form from allennlp or huggingface's datsets - mc4 dataset. Marathi (mr) dataset consists of 14 billion tokens, 7.8 million docs and with weight ~70 GB of text.

Data Cleaning 🧹

Though initial mc4 marathi corpus size ~70 GB, Through data exploration, it was observed it contains docs from different languages especially thai, chinese etc. So we had to clean the dataset before traning tokenizer and model. Surprisingly, results after cleaning Marathi mc4 corpus data:

Train set:

Clean docs count 1581396 out of 7774331.
~20.34% of whole marathi train split is actually Marathi.

Validation set

Clean docs count 1700 out of 7928.
~19.90% of whole marathi validation split is actually Marathi.

Training procedure 👨🏻‍💻

Preprocessing

The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of the model take pieces of 512 contiguous token that may span over documents. The beginning of a new document is marked with <s> and the end of one by </s> The details of the masking procedure for each sentence are the following:

  • 15% of the tokens are masked.
  • In 80% of the cases, the masked tokens are replaced by <mask>.
  • In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
  • In the 10% remaining cases, the masked tokens are left as is. Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).

Pretraining

The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) 8 v3 TPU cores for 42K steps with a batch size of 128 and a sequence length of 128. The optimizer used is Adam with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-8, a weight decay of 0.01, learning rate warmup for 1,000 steps and linear decay of the learning rate after.

We tracked experiments and hyperparameter tunning on weights and biases platform. Here is link to main dashboard:
Link to Weights and Biases Dashboard for Marathi RoBERTa model

Pretraining Results 📊

RoBERTa Model reached eval accuracy of 85.28% around ~35K step with train loss at 0.6507 and eval loss at 0.6219.

Fine Tuning on downstream tasks

We performed fine-tuning on downstream tasks. We used following datasets for classification:

  1. IndicNLP Marathi news classification
  2. iNLTK Marathi news headline classification

Fine tuning on downstream task results (Segregated)

1. IndicNLP Marathi news classification

IndicNLP Marathi news dataset consists 3 classes - ['lifestyle', 'entertainment', 'sports'] - with following docs distribution as per classes:

train eval test
9672 477 478

💯 Our Marathi RoBERTa **roberta-base-mr model outperformed both classifier ** mentioned in Arora, G. (2020). iNLTK and Kunchukuttan, Anoop et al. AI4Bharat-IndicNLP.

Dataset FT-W FT-WC INLP iNLTK roberta-base-mr 🏆
iNLTK Headlines 83.06 81.65 89.92 92.4 97.48

🤗 Huggingface Model hub repo:
roberta-base-mr fine tuned on iNLTK Headlines classification dataset model:

flax-community/mr-indicnlp-classifier

🧪 Fine tuning experiment's weight and biases dashboard link

2. iNLTK Marathi news headline classification

This dataset consists 3 classes - ['state', 'entertainment', 'sports'] - with following docs distribution as per classes:

train eval test
9658 1210 1210

💯 Here as well roberta-base-mr outperformed iNLTK marathi news text classifier.

Dataset iNLTK ULMFiT roberta-base-mr 🏆
iNLTK news dataset (kaggle) 92.4 94.21

🤗 Huggingface Model hub repo:
roberta-base-mr fine tuned on iNLTK news classification dataset model:

flax-community/mr-inltk-classifier

Fine tuning experiment's weight and biases dashboard link

Want to check how above models generalise on real world Marathi data?

Head to 🤗 Huggingface's spaces 🪐 to play with all three models:

  1. Mask Language Modelling with Pretrained Marathi RoBERTa model:
    flax-community/roberta-base-mr
  2. Marathi Headline classifier:
    flax-community/mr-indicnlp-classifier
  3. Marathi news classifier:
    flax-community/mr-inltk-classifier

alt text Streamlit app of Pretrained Roberta Marathi model on Huggingface Spaces

image

Team Members

Credits

Huge thanks to Huggingface 🤗 & Google Jax/Flax team for such a wonderful community week. Especially for providing such massive computing resource. Big thanks to @patil-suraj & @patrickvonplaten for mentoring during whole week.

Owner
Nipun Sadvilkar
I like to explore Jungle of Data with Python as my swiss knife with pandas, numpy, matplotlib and scikit-learn as its multi-tools😅
Nipun Sadvilkar
Code for the TIP 2021 Paper "Salient Object Detection with Purificatory Mechanism and Structural Similarity Loss"

PurNet Project for the TIP 2021 Paper "Salient Object Detection with Purificatory Mechanism and Structural Similarity Loss" Abstract Image-based salie

Jinming Su 4 Aug 25, 2022
Official Implementation for the paper DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover’s Distance Improves Out-Of-Distribution Face Identification

DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover’s Distance Improves Out-Of-Distribution Face Identification Official Implementation for the pape

Anh M. Nguyen 36 Dec 28, 2022
Speed-Test - You can check your intenet speed using this tool

Speed-Test Tool By Hez_X AVAILABLE ON : Termux & Kali linux & Ubuntu (Linux E

Hez-X 3 Feb 17, 2022
RoFormer_pytorch

PyTorch RoFormer 原版Tensorflow权重(https://github.com/ZhuiyiTechnology/roformer) chinese_roformer_L-12_H-768_A-12.zip (提取码:xy9x) 已经转化为PyTorch权重 chinese_r

yujun 283 Dec 12, 2022
“英特尔创新大师杯”深度学习挑战赛 赛道3:CCKS2021中文NLP地址相关性任务

ccks2021-track3 CCKS2021中文NLP地址相关性任务-赛道三-冠军方案 团队:我的加菲鱼- wodejiafeiyu 初赛第二/复赛第一/决赛第一 前言 19年开始,陆陆续续参加了一些比赛,拿到过一些top,比较懒一直都没分享过,这次比较幸运又拿了top1,打算分享下 分类的任务

shaochenjie 131 Dec 31, 2022
A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population

DeepKE is a knowledge extraction toolkit supporting low-resource and document-level scenarios for entity, relation and attribute extraction. We provide comprehensive documents, Google Colab tutorials

ZJUNLP 1.6k Jan 05, 2023
Learning Correspondence from the Cycle-consistency of Time (CVPR 2019)

TimeCycle Code for Learning Correspondence from the Cycle-consistency of Time (CVPR 2019, Oral). The code is developed based on the PyTorch framework,

Xiaolong Wang 706 Nov 29, 2022
Project Aquarium is a SUSE-sponsored open source project aiming at becoming an easy to use, rock solid storage appliance based on Ceph.

Project Aquarium Project Aquarium is a SUSE-sponsored open source project aiming at becoming an easy to use, rock solid storage appliance based on Cep

Aquarist Labs 73 Jul 21, 2022
ROS support for Velodyne 3D LIDARs

Overview Velodyne1 is a collection of ROS2 packages supporting Velodyne high definition 3D LIDARs3. Warning: The master branch normally contains code

ROS device drivers 543 Dec 30, 2022
Safe Control for Black-box Dynamical Systems via Neural Barrier Certificates

Safe Control for Black-box Dynamical Systems via Neural Barrier Certificates Installation Clone the repository: git clone https://github.com/Zengyi-Qi

Zengyi Qin 3 Oct 18, 2022
Official PyTorch implementation of "IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos", CVPRW 2021

IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos Introduction This repo is official PyTorch implementatio

Gyeongsik Moon 29 Sep 24, 2022
Books, Presentations, Workshops, Notebook Labs, and Model Zoo for Software Engineers and Data Scientists wanting to learn the TF.Keras Machine Learning framework

Books, Presentations, Workshops, Notebook Labs, and Model Zoo for Software Engineers and Data Scientists wanting to learn the TF.Keras Machine Learning framework

Google Cloud Platform 792 Dec 28, 2022
Code for "Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks", CVPR 2021

Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks This repository contains the code that accompanies our CVPR 20

Despoina Paschalidou 161 Dec 20, 2022
Unofficial implementation of MLP-Mixer: An all-MLP Architecture for Vision

MLP-Mixer: An all-MLP Architecture for Vision This repo contains PyTorch implementation of MLP-Mixer: An all-MLP Architecture for Vision. Usage : impo

Rishikesh (ऋषिकेश) 175 Dec 23, 2022
A curated list of programmatic weak supervision papers and resources

A curated list of programmatic weak supervision papers and resources

Jieyu Zhang 118 Jan 02, 2023
Hepsiburada - Hepsiburada Urun Bilgisi Cekme

Hepsiburada Urun Bilgisi Cekme from hepsiburada import Marka nike = Marka("nike"

Ilker Manap 8 Oct 26, 2022
Christmas face app for Decathlon xmas coding party!

Christmas Face Application Use this library to create the perfect picture for your christmas cards! Done by Hasib Zunair, Guillaume Brassard and Samue

Hasib Zunair 4 Dec 20, 2021
MPI Interest Group on Algorithms on 1st semester 2021

MPI Algorithms Interest Group Introduction Lecturer: Steve Yan Location: TBA Time Schedule: TBA Semester: 1 Useful URLs Typora: https://typora.io Goog

Ex10si0n 13 Sep 08, 2022
Hierarchical Aggregation for 3D Instance Segmentation (ICCV 2021)

HAIS Hierarchical Aggregation for 3D Instance Segmentation (ICCV 2021) by Shaoyu Chen, Jiemin Fang, Qian Zhang, Wenyu Liu, Xinggang Wang*. (*) Corresp

Hust Visual Learning Team 145 Jan 05, 2023
Neural network for stock price prediction

neural_network_for_stock_price_prediction Neural networks for stock price predic

2 Feb 04, 2022