An open collection of annotated voices in Japanese language

Last update: Dec 14, 2022

Related tags

Text Data & NLP koniwa

Overview

Koniwa (声庭): An open collection of annotated voices in Japanese language

Summary

Koniwa (声庭) is an open collection of speech audio and annotations that anyone is free to use, modify, and redistribute.
(Commercial use is also permitted.)

Annotation work has only just begun. Contributions are welcome.

File links

sound: audio data (Google Drive)
source: reference data (Google Drive): original texts and other materials useful as references during annotation
data: bibliographic information and annotation data

Series

This collection currently uses the following open audio data. We are deeply grateful to everyone involved in making it public.

amagasaki: CC BY 4.0
- April 2011 to November 2015
- Radio programs from Amagasaki City, Hyogo Prefecture (FMあまがさき)
  - Mayor Inamura's 「ひと咲きまち咲きあまがさき」
  - Mayor Inamura's 「い～なこの街あまがさき」 (retitled from November 2014)
free_culture_2012: CC BY 3.0
- August 2012
- J-WAVE radio program "J-WAVE 360° Forum 〜Seek and Find〜"
higashiyodogawa: CC BY 4.0
- November 2017 to July 2021
- Audio edition of 「広報ひがしよどがわ」 (Higashiyodogawa Ward, Osaka City)
librivox: public domain
- Works recorded on LibriVox.org
- Some items, such as songs, are excluded
minato: CC BY 4.0
- May 2019 to December 2020
- Audio edition of 「広報みなと」 (Minato Ward, Osaka City)
nishiyodogawa: CC BY 4.0
- August 2018 to July 2021
- 『広報紙「きらり☆にしよど」音声版』, the audio edition of the public relations paper 「きらり☆にしよど」 (Nishiyodogawa Ward, Osaka City)
roudoku_toshokan: CC BY 2.1 JP (original texts are in the public domain)
- Reading audio distributed through 朗読図書館 by 池田英生
tnc: CC BY 3.0 (original texts are in the public domain)
- Reading audio by announcers of テレビ西日本

Licence

Licenses for original texts and audio

This collection includes only audio released under one of the following licenses.

Public domain
- PDM
- CC0
Creative Commons
- CC BY

License for annotations and documents

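All of the following are licensed under CC0 1.0:

- For annotations that qualify as derivative works, the derivative portion
- Original works in this repository, such as annotation comments and the annotation manual (excluding programs)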

License for programs

Programs are licensed under the Apache License 2.0.

Maintainer

shirayu

Owner

Koniwa project
