Indobenchmark is a collection of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia

Last update: Aug 26, 2022

Overview

Indobenchmark Toolkit

Indobenchmark is a collection of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia. It is a joint effort of contributors from institutions such as Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, DeepMind, Gojek, and Prosa.AI.

Research Paper

IndoNLU was accepted at AACL-IJCNLP 2020; the details are in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

IndoNLG was accepted at EMNLP 2021; the details are in our paper: https://arxiv.org/abs/2104.08200. If you use any component of IndoNLG, including Indo4B-Plus, IndoBART, or IndoGPT, in your work, please cite the following paper:

@misc{cahyawijaya2021indonlg,
      title={IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation}, 
      author={Samuel Cahyawijaya and Genta Indra Winata and Bryan Wilie and Karissa Vincentio and Xiaohong Li and Adhiguna Kuncoro and Sebastian Ruder and Zhi Yuan Lim and Syafri Bahar and Masayu Leylia Khodra and Ayu Purwarianti and Pascale Fung},
      year={2021},
      eprint={2104.08200},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

IndoNLU and IndoNLG Models

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite pretrained language models [Link]:

IndoBERT-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-large
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-large
- Phase 1 [Link]
- Phase 2 [Link]
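
The eight checkpoints above vary along three axes: family (IndoBERT or IndoBERT-lite), size (base or large), and pretraining phase (1 or 2). The sketch below enumerates them as Hugging Face Hub identifiers. The organization name `indobenchmark` and the `indobert[-lite]-{size}-p{phase}` naming pattern are assumptions that are not stated in this README, so check them against the links above. The helper names are hypothetical, and `load_indobert` assumes that the `transformers` Auto classes resolve each checkpoint, which is not confirmed here.

```python
from itertools import product

# Assumed Hugging Face Hub organization name; verify against the links above.
HUB_ORG = "indobenchmark"


def indobert_model_id(size: str, phase: int, lite: bool = False) -> str:
    """Build an assumed Hub identifier, e.g. 'indobenchmark/indobert-lite-base-p1'."""
    if size not in ("base", "large"):
        raise ValueError(f"size must be 'base' or 'large', got {size!r}")
    if phase not in (1, 2):
        raise ValueError(f"phase must be 1 or 2, got {phase!r}")
    family = "indobert-lite" if lite else "indobert"
    return f"{HUB_ORG}/{family}-{size}-p{phase}"


def all_indobert_model_ids() -> list[str]:
    """Enumerate the 4 IndoBERT and 4 IndoBERT-lite checkpoints listed above."""
    return [
        indobert_model_id(size, phase, lite)
        for lite, size, phase in product((False, True), ("base", "large"), (1, 2))
    ]


def load_indobert(model_id: str):
    """Load tokenizer and encoder; needs `transformers` installed and network access."""
    # Imported lazily so the helpers above work without `transformers`.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

A typical call would be `tokenizer, model = load_indobert(indobert_model_id("base", 1))`. Keeping the identifier logic separate from the loading step makes it easy to loop over all variants when comparing them on a downstream task.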

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

FastText model (11.9 GB) [Link]
Vector file (3.9 GB) [Link]
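
Because the vector file is several gigabytes, it can be useful to stream it and keep only the words a task needs instead of loading everything into memory. The sketch below assumes the file uses the standard fastText text format (a `<word count> <dimension>` header line followed by one `word v1 v2 ...` line per word); this README does not confirm the format. The function name `read_selected_vectors` is hypothetical, and the sample values are made up for illustration.

```python
import io
from typing import Dict, Iterable, List, TextIO


def read_selected_vectors(handle: TextIO, wanted: Iterable[str]) -> Dict[str, List[float]]:
    """Stream a fastText-style text vector file and keep only the wanted words."""
    # The released model is uncased, so queries are lower-cased before matching.
    wanted_set = {word.lower() for word in wanted}

    header = handle.readline().split()
    if len(header) != 2:
        raise ValueError("expected a '<word count> <dimension>' header line")
    dim = int(header[1])

    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word = parts[0]
        if word not in wanted_set:
            continue
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"{word!r}: expected {dim} values, got {len(values)}")
        vectors[word] = values
        if len(vectors) == len(wanted_set):
            break  # stop early once every requested word has been found
    return vectors


# Tiny in-memory stand-in for the real file (values are made up for illustration).
sample = io.StringIO("3 2\nsaya 0.1 0.2\nmakan 0.3 0.4\nnasi 0.5 0.6\n")
subset = read_selected_vectors(sample, ["Makan", "nasi"])
```

With the real download, the same function would be called on a file opened with `open(path, encoding="utf-8")`, where `path` points to the vector file.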

We also provide smaller FastText models with a reduced vocabulary for each of the 12 downstream tasks:

FastText-Indo4B [Link]
FastText-CC-ID [Link]

IndoBART and IndoGPT Models

We provide IndoBART and IndoGPT pretrained language models [Link]:

IndoBART [Link]
IndoBART-v2 [Link]
IndoGPT2 [Link]

Releases (v0.1.4)

v0.1.4 (Jun 22, 2022)
- Fix spacing between subwords when decoding with IndoNLGTokenizer
- Remove the unused additional special tokens '[java]', '[sunda]', and '[indonesia]' from IndoNLGTokenizer (language tokens are included in special_tokens_to_ids instead)

Owner

Samuel Cahyawijaya
