IndoNLI: A Natural Language Inference Dataset for Indonesian

This is a repository for data and code accompanying our EMNLP 2021 paper "IndoNLI: A Natural Language Inference Dataset for Indonesian". The datasets used for our experiments can be found under the data directory:

indonli: human-annotated NLI data, split into train, val, and test (test_lay and test_expert)

diagnostic: subset of examples from test_expert that are annotated with linguistic and logical phenomena
translate_train.tar.gz: MNLI dataset translated to Indonesian (train and dev)
translate_train_small.tar.gz: sampled of translate_train used for the translate_train_small experiment.

The experiment code can be found under experiment directory, please check the related README file.

License

We use premises taken from the Indonesian Wikipedia, news, and Web articles.

Wikipedia is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

For the news genre, we use premise text from Indonesian PUD and GSD treebanks provided by the Universal Dependencies 2.5 (Zeman et al., 2019) and IndoSum (Kurniawan and Louvan, 2018). Indonesian PUD and GSD treebanks are licensed under Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA) and Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA). IndoSum is licensed under Apache License, Version 2.0.

Citation

If you use our corpus in your work, please consider citing our paper:

@inproceedings{indonli,
    title = "IndoNLI: A Natural Language Inference Dataset for Indonesian",
    author = "Mahendra, Rahmad and Aji, Alham Fikri and Louvan, Samuel and Rahman, Fahrurrozi and Vania, Clara",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

IndoNLI: A Natural Language Inference Dataset for Indonesian

Related tags

Overview

IndoNLI: A Natural Language Inference Dataset for Indonesian

License

Citation

Owner

Pytorch implementation for "Open Compound Domain Adaptation" (CVPR 2020 ORAL)

Implementation of ToeplitzLDA for spatiotemporal stationary time series data.

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Fluency ENhanced Sentence-bert Evaluation (FENSE), metric for audio caption evaluation. And Benchmark dataset AudioCaps-Eval, Clotho-Eval.

Deep Residual Learning for Image Recognition

Accelerated NLP pipelines for fast inference on CPU and GPU. Built with Transformers, Optimum and ONNX Runtime.

Face recognize and crop them

Basics of 2D and 3D Human Pose Estimation.

Code for "Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations"

GPU Programming with Julia - course at the Swiss National Supercomputing Centre (CSCS), ETH Zurich

PyMatting: A Python Library for Alpha Matting

Qimera: Data-free Quantization with Synthetic Boundary Supporting Samples

Explaining Deep Neural Networks - A comparison of different CAM methods based on an insect data set

Web service for facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation based on OpenFace 2.0

Traductor de lengua de señas al español basado en Python con Opencv y MedaiPipe

Deep functional residue identification

Simple tutorials using Google's TensorFlow Framework

Bayesian Deep Learning and Deep Reinforcement Learning for Object Shape Error Response and Correction of Manufacturing Systems

A python library to build Model Trees with Linear Models at the leaves.

This repo is for segmentation of T2 hyp regions in gliomas.