Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

File organization

  • Preprocessing: contains all files used to preprocess the data (Python 3.6)
  • Data: contains the data required to run this code
  • Statistics: contains all files with statistics of the dataset

Dataset

| File name | Description |
| --- | --- |
| train/test/dev.csv | The dataset for code-mixed speech translation. |
| chopped_audios | Contains all the audio files, transcriptions, and translations. |
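
The split files can be inspected with Python's standard `csv` module. The sketch below is a minimal example, not part of the repository: the path `Data/train.csv` and the presence of a header row are assumptions, and the column names are not documented here, so the code only reports whatever header it finds.

```python
import csv
from pathlib import Path


def load_split(csv_path):
    """Read one split file and return (column_names, rows as dicts).

    Assumes the first row of the file is a header row.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        # fieldnames is None for an empty file; normalise to an empty list.
        columns = list(reader.fieldnames or [])
    return columns, rows


if __name__ == "__main__":
    # Assumed location of the training split; adjust to the local layout.
    path = Path("Data/train.csv")
    if path.exists():
        columns, rows = load_split(path)
        print(columns)
        print(len(rows), "rows")
```

The same function can be pointed at the test and dev files to confirm that all three splits share one set of columns.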

Statistics of Corpora contained

| Language | #Types | #Tokens | Types per line | Tokens per line | Avg. token length |
| --- | --- | --- | --- | --- | --- |
| English[100%] | 40324 | 601889 | 10.58 | 11.27 | 4.92 |
| French (France) | 50510 | 645651 | 11.38 | 12.09 | 5.08 |
| German[100%] | 50748 | 584575 | 10.44 | 10.95 | 5.57 |
| Gujarati[100%] | 41959 | 584989 | 10.37 | 10.95 | 4.46 |
| Hindi[100%] | 29744 | 716800 | 12.36 | 13.42 | 3.74 |
| Hungarian[100%] | 84872 | 506608 | 9.13 | 9.49 | 5.89 |
| Indonesian[100%] | 39365 | 653374 | 11.54 | 12.23 | 6.14 |
| Italian[100%] | 52372 | 512061 | 9.23 | 9.59 | 5.37 |
| Latvian[100%] | 70040 | 477106 | 8.69 | 8.93 | 5.72 |
| Lithuanian[100%] | 75222 | 491558 | 8.92 | 9.2 | 6.04 |
| Nepali[100%] | 52630 | 570268 | 10.03 | 10.68 | 4.88 |
| Persian (Farsi)[100%] | 51722 | 598096 | 10.61 | 11.2 | 4.1 |
| Polish[100%] | 71662 | 494263 | 8.99 | 9.25 | 5.86 |
| Portuguese (Brazil)[100%] | 50087 | 608432 | 10.8 | 11.39 | 5.12 |
| Russian[100%] | 72162 | 490908 | 8.96 | 9.19 | 5.79 |
| Slovak[100%] | 73789 | 520465 | 9.39 | 9.75 | 5.37 |
| Slovenian[100%] | 68619 | 516649 | 9.35 | 9.67 | 5.3 |
| Spanish[100%] | 49806 | 608868 | 10.75 | 11.4 | 5.07 |
| Swedish[100%] | 48233 | 581751 | 10.31 | 10.89 | 5 |
| Tamil[100%] | 84183 | 460678 | 8.37 | 8.63 | 7.65 |
| Telugu[100%] | 72006 | 464665 | 8.34 | 8.7 | 6.56 |
| Turkish[100%] | 78957 | 453521 | 8.27 | 8.49 | 6.35 |
| Bulgarian[100%] | 60712 | 564150 | 10.1 | 10.56 | 5.24 |
| Croatian[100%] | 73075 | 531326 | 9.58 | 9.95 | 5.28 |
| Danish[100%] | 50170 | 587253 | 10.4 | 11 | 4.98 |
| Dutch[100%] | 42716 | 595464 | 10.52 | 11.15 | 5.05 |
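
Statistics of this kind can be computed from a list of text lines along the following lines. This is a sketch under stated assumptions, not the script in the Statistics folder: it assumes whitespace tokenisation, takes "types per line" to mean the average number of distinct tokens within each line, and uses the hypothetical function name `corpus_stats`.

```python
def corpus_stats(lines):
    """Compute table-style statistics for an iterable of text lines."""
    vocabulary = set()     # distinct tokens across the whole corpus
    total_tokens = 0       # running token count
    total_line_types = 0   # sum of per-line distinct-token counts
    total_chars = 0        # sum of token lengths in characters
    n_lines = 0            # number of non-empty lines

    for line in lines:
        tokens = line.split()  # assumption: whitespace tokenisation
        if not tokens:
            continue  # skip empty lines so they do not skew the averages
        n_lines += 1
        vocabulary.update(tokens)
        total_tokens += len(tokens)
        total_line_types += len(set(tokens))
        total_chars += sum(len(token) for token in tokens)

    if n_lines == 0:
        return {
            "types": 0,
            "tokens": 0,
            "types_per_line": 0.0,
            "tokens_per_line": 0.0,
            "avg_token_length": 0.0,
        }

    return {
        "types": len(vocabulary),
        "tokens": total_tokens,
        "types_per_line": total_line_types / n_lines,
        "tokens_per_line": total_tokens / n_lines,
        "avg_token_length": total_chars / total_tokens,
    }
```

Calling `corpus_stats` on the translation column of one language would give one row of the table, provided the tokenisation matches the one used for the reported figures.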

Code-mixing

All languages in Code-mixing

| Language | Total Words | Unique Words | Percentage |
| --- | --- | --- | --- |
| English | 500136 | 6312 | 83.6 |
| Bengali | 46933 | 3907 | 7.84 |
| Sanskrit | 51246 | 7202 | 8.56 |
| Total | 598315 | 17421 | 100 |
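
The percentage column is each language's share of the total word count. The short check below recomputes it from the word counts in the table; the function name `language_shares` is only illustrative.

```python
# Total word counts per language, copied from the table above.
word_counts = {"English": 500136, "Bengali": 46933, "Sanskrit": 51246}


def language_shares(counts):
    """Return each language's share of all words, in percent."""
    total = sum(counts.values())
    return {language: 100.0 * n / total for language, n in counts.items()}


total_words = sum(word_counts.values())
shares = language_shares(word_counts)

for language, share in shares.items():
    print(f"{language}: {share:.2f}%")
```

The recomputed shares agree with the table to within rounding, and the three counts sum to the reported total of 598315 words.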

Types of Code-mixing

| Type | English-Sanskrit | Sanskrit-English | English-Bengali | Bengali-English |
| --- | --- | --- | --- | --- |
| Inter-Sentential | 2356 | 2366 | 339 | 339 |
| Intra-Sentential | 2338 | 851 | 124 | 0 |
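
To make the two categories concrete, the sketch below counts language switches in a toy input. It is one possible counting convention and not necessarily the procedure behind the numbers above: it assumes every token already carries a language tag, treats a switch between adjacent tokens in a sentence as intra-sentential, and treats a switch across a sentence boundary as inter-sentential. The function name `count_switches` and the tags `en`, `sa`, and `bn` are hypothetical.

```python
from collections import Counter


def count_switches(tagged_sentences):
    """Count directed language switches.

    tagged_sentences: list of sentences, each a list of per-token
    language tags (assumed to be available from an earlier step).
    Returns (inter, intra) Counters keyed by (from_lang, to_lang).
    """
    inter = Counter()
    intra = Counter()
    previous_last = None  # language of the last token of the previous sentence

    for tags in tagged_sentences:
        if not tags:
            continue
        # Intra-sentential: the language changes between adjacent tokens.
        for left, right in zip(tags, tags[1:]):
            if left != right:
                intra[(left, right)] += 1
        # Inter-sentential: the language changes across a sentence boundary.
        if previous_last is not None and previous_last != tags[0]:
            inter[(previous_last, tags[0])] += 1
        previous_last = tags[-1]

    return inter, intra


# Toy example: an English sentence ending in Sanskrit, a Sanskrit
# sentence, then an English sentence.
example = [["en", "en", "sa"], ["sa", "sa"], ["en", "en"]]
inter_counts, intra_counts = count_switches(example)
```

Keeping the key as an ordered pair preserves the direction of the switch, which is what separates the English-Sanskrit column from the Sanskrit-English one.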
Owner
Ayush Daksh
IIT Kharagpur | Mathematics & Computing | 3rd Year | NLP | UG Researcher