A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

Overview

IITB-English-Hindi Parallel Corpus

License: CC BY-NC 4.0

About

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenization, so that the segmented corpus can be used to train an English-Hindi MT system.

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus has been used at the Workshop on Asian Language Translation Shared Task since 2016 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

The complete details of this corpus are available at this URL. The parallel corpus and a monolingual Hindi corpus are also available for browser download from the same URL.

Recent Updates

  • Version 3.1 - December 2021 - Added 49,400 sentence pairs to the parallel corpus.
  • Version 3.0 - August 2020 - Added ~47,000 sentence pairs to the parallel corpus.

Usage

You need the 'datasets' package installed to use the 🚀 HuggingFace datasets repository. Install it via pip with the following command:

   pip install datasets
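
Once the package is installed, loading the corpus might look like the sketch below. It is not taken from the notebook: the dataset ID `cfilt/iitb-english-hindi` and the record layout `{"translation": {"en": ..., "hi": ...}}` are assumptions that should be checked against the dataset card, and `load_iitb` and `to_sentence_pairs` are hypothetical helper names. The demonstration at the bottom runs offline on a hand-written record.

```python
from typing import Dict, Iterable, List, Tuple


def load_iitb(split: str = "train"):
    """Download one split of the corpus from the Hugging Face Hub.

    Assumption: the dataset ID and the split names are not stated in this
    README; verify them on the dataset card before relying on them.
    """
    # Imported lazily so the rest of this sketch runs without the package
    # or network access.
    from datasets import load_dataset

    return load_dataset("cfilt/iitb-english-hindi", split=split)


def to_sentence_pairs(records: Iterable[Dict]) -> List[Tuple[str, str]]:
    """Turn translation records into (english, hindi) tuples.

    Assumes each record looks like {"translation": {"en": ..., "hi": ...}}.
    """
    return [(r["translation"]["en"], r["translation"]["hi"]) for r in records]


# Offline demonstration on a hand-written record in the assumed layout.
sample = [{"translation": {"en": "Hello", "hi": "नमस्ते"}}]
pairs = to_sentence_pairs(sample)
print(pairs)
```

With network access, `to_sentence_pairs(load_iitb())` would be expected to yield the same kind of list for the full training split, provided the assumptions above hold.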

In the notebook, we also provide the code to create a byte-pair encoding (BPE) segmented version of this corpus. You can tokenize it the way shown in the notebook, or use any other tokenization that also supports the Hindi language.
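
To illustrate what BPE segmentation does, here is a minimal pure-Python sketch of the general technique. It is not the notebook's code, which may well use a dedicated library; the function names `learn_bpe`, `merge_pair`, and `segment`, the `</w>` end-of-word marker, and the tie-breaking rule are all choices made for this example. Words are split into Unicode code points, so the same code handles Devanagari text.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from whitespace-tokenized text."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(
        tuple(word) + ("</w>",) for s in sentences for word in s.split()
    )
    merges: List[Pair] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low low low lower"], num_merges=2)
print(merges)
print(segment("lowest", merges))
```

In practice the merges would be learned on the training side of the corpus (separately or jointly for English and Hindi) with a far larger number of merge operations, and then applied to all splits.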

Other

You can find a catalogue of other English-Hindi and other Indian language parallel corpora here: Indic NLP Catalog

Citation

If you use this corpus or its derivative resources for your research, kindly cite it as follows: Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

BibTeX Citation

@inproceedings{kunchukuttan-etal-2018-iit,
    title = "The {IIT} {B}ombay {E}nglish-{H}indi Parallel Corpus",
    author = "Kunchukuttan, Anoop  and
      Mehta, Pratik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1548",
}

Owner

Computation for Indian Language Technology (CFILT)
NLP Resources and Codebases released by the Computation for Indian Language Technology Lab @ IIT Bombay