chaii - hindi & tamil question answering

Overview

chaii - hindi & tamil question answering

This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The competition can be found here: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering

Datasets required

Download squadv2 data from https://rajpurkar.github.io/SQuAD-explorer/

$ mkdir input && cd input
$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

Download tydiqa data in the input folder:

$ wget https://storage.googleapis.com/tydiqa/v1.1/tydiqa-goldp-v1.1-train.json
$ wget https://storage.googleapis.com/tydiqa/v1.1/tydiqa-goldp-v1.1-dev.json

Download data from https://www.kaggle.com/tkm2261/google-translated-squad20-to-hindi-and-tamil to input folder

Download original competition dataset to input folder: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/data

Download outputs of this kernel: https://www.kaggle.com/rhtsingh/external-data-mlqa-xquad-preprocessing/ to input folder

Now, you have all the data needed to train the model. We will first create folds and munge the data a bit.

To create folds, please use the following command:

$ cd src
$ python create_folds.py

To munge the datasets and prepare for training, please run the following command:

$ cd src
$ python munge_data.py

Training

There are two GPU models and one model needs TPUs.

GPU models: XLM-Roberta & Rembert TPU model: Muril-Large

XLM-Roberta:

$ cd src
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 0
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 1
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 2
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 3
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 4

Rembert:

$ cd src
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 0
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 1
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 2
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 3
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 4

Muril-Large

** please note that training this model needs TPUs **

$ cd src
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 0
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 1
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 2
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 3
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 4

Inference

After training all the models, the outputs were pushed to Kaggle Datasets.

The final model datasets can be found here:

- https://www.kaggle.com/abhishek/xlmrobertalargewithsquadv2tydiqasqdtrans384f
- https://www.kaggle.com/ubamba98/modelsrembertwithsquadv2tydiqa384
- https://www.kaggle.com/ubamba98/murillargecasedchaii

And the final inference kernel can be found here: https://www.kaggle.com/abhishek/chaii-xlm-roberta-x-muril-x-rembert-score-based

Solution writeup: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/288049

Owner
abhishek thakur
Kaggle: www.kaggle.com/abhishek
abhishek thakur
Spooky Skelly For Python

_____ _ _____ _ _ _ | __| ___ ___ ___ | |_ _ _ | __|| |_ ___ | || | _ _ |__ || . || . || . || '

Kur0R1uka 1 Dec 23, 2021
Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

epub2audiobook Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech Input examples qual a pasta do seu

7 Aug 25, 2022
InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective This is the official code base for our ICLR 2021 paper

AI Secure 71 Nov 25, 2022
TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

Alexa 98 Dec 09, 2022
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

Memorizing Transformers - Pytorch Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memori

Phil Wang 364 Jan 06, 2023
xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Description xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building bl

Facebook Research 2.3k Jan 08, 2023
DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time

DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time. While it efficiently searches the answers out of 60 billion phrases in Wikipedia, it is also v

Jinhyuk Lee 543 Jan 08, 2023
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
Treemap visualisation of Maya scene files

Ever wondered which nodes are responsible for that 600 mb+ Maya scene file? Features Fast, resizable UI Parsing at 50 mb/sec Dependency-free, single-f

Marcus Ottosson 76 Nov 12, 2022
Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers mor

Princeton Natural Language Processing 92 Dec 27, 2022
A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
Simple python code to fix your combo list by removing any text after a separator or removing duplicate combos

Combo List Fixer A simple python code to fix your combo list by removing any text after a separator or removing duplicate combos Removing any text aft

Hamidreza Dehghan 3 Dec 05, 2022
A versatile token stream for handwritten parsers.

Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. Th

Valentin Berlier 8 Nov 30, 2022
中文医疗信息处理基准CBLUE: A Chinese Biomedical LanguageUnderstanding Evaluation Benchmark

English | 中文说明 CBLUE AI (Artificial Intelligence) is playing an indispensabe role in the biomedical field, helping improve medical technology. For fur

452 Dec 30, 2022
This github repo is for Neurips 2021 paper, NORESQA A Framework for Speech Quality Assessment using Non-Matching References.

NORESQA: Speech Quality Assessment using Non-Matching References This is a Pytorch implementation for using NORESQA. It contains minimal code to predi

Meta Research 36 Dec 08, 2022
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

Daniel 0 Dec 27, 2021
Generate vector graphics from a textual caption

VectorAscent: Generate vector graphics from a textual description Example "a painting of an evergreen tree" python text_to_painting.py --prompt "a pai

Ajay Jain 97 Dec 15, 2022
GNES enables large-scale index and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content form

GNES is Generic Neural Elastic Search, a cloud-native semantic search system based on deep neural network.

GNES.ai 1.2k Jan 06, 2023