A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

Overview

IITB-English-Hindi Parallel Corpus

License: CC BY-NC 4.0

About

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenization; the segmented corpus can then be used to train an English-Hindi MT system.

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus has been used at the Workshop on Asian Language Translation Shared Task since 2016 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

Complete details of this corpus are available at this URL. The parallel corpus and a monolingual Hindi corpus can also be downloaded in the browser from the same URL.

Recent Updates

  • Version 3.1 - December 2021 - Added 49,400 sentence pairs to the parallel corpus.
  • Version 3.0 - August 2020 - Added ~47,000 sentence pairs to the parallel corpus.

Usage

You need the 'datasets' package installed to use the HuggingFace datasets repository. Install it via pip:

   pip install datasets
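
Once the package is installed, loading the corpus might look like the sketch below. It is not taken from the notebook: the Hub identifier `cfilt/iitb-english-hindi` and the `{"translation": {"en": ..., "hi": ...}}` record layout are assumptions that should be checked against the dataset page, and `load_iitb_corpus` and `to_sentence_pair` are hypothetical helper names introduced here for illustration.

```python
def load_iitb_corpus(dataset_id="cfilt/iitb-english-hindi"):
    """Download the corpus from the HuggingFace Hub (needs network access).

    The default dataset_id is an assumption, not something stated in this README.
    """
    # Imported lazily so the pure helper below works without the package.
    from datasets import load_dataset

    return load_dataset(dataset_id)


def to_sentence_pair(example, src="en", tgt="hi"):
    """Return (source, target) text from one record.

    Assumes the record has the shape {"translation": {"en": ..., "hi": ...}}.
    """
    translation = example["translation"]
    return translation[src], translation[tgt]
```

With network access, `corpus = load_iitb_corpus()` followed by `to_sentence_pair(corpus["train"][0])` would return the first training pair, assuming the dataset exposes a `train` split.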

The notebook also provides code to create a byte-pair encoding (BPE) segmented version of this corpus. You can tokenize the corpus as shown in the notebook, or use any other tokenizer that supports Hindi.
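
To illustrate what BPE segmentation does, here is a toy, pure-Python sketch of the merge-learning procedure. It is not the notebook's code and is far too slow for the full corpus; the function names `merge_pair`, `learn_bpe` and `segment_word` and the `</w>` end-of-word marker are choices made for this example. It works on Unicode code points, so it handles Devanagari text as well as Latin text.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-tokenized sentences."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in sentences:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere in the vocabulary.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment_word(word, merges):
    """Segment a single word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For real training data, an established subword tokenization library is the more practical choice; the sketch only shows the idea of repeatedly merging the most frequent adjacent symbol pair.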

Other

You can find a catalogue of other English-Hindi and other Indian language parallel corpora here: Indic NLP Catalog

Citation

If you use this corpus or its derivative resources for your research, kindly cite it as follows: Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

BibTeX Citation

@inproceedings{kunchukuttan-etal-2018-iit,
    title = "The {IIT} {B}ombay {E}nglish-{H}indi Parallel Corpus",
    author = "Kunchukuttan, Anoop  and
      Mehta, Pratik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1548",
}

Owner

Computation for Indian Language Technology (CFILT)
NLP Resources and Codebases released by the Computation for Indian Language Technology Lab @ IIT Bombay