I am using voice2json as a voice command recognition backend in my voice interaction mod for a video game. As a native Chinese speaker, I find voice2json's Chinese support rather limited:
- voice2json does not perform Chinese word segmentation, so users must segment the sentences in sentences.ini themselves. To work around this, my program performs Chinese word segmentation when it generates sentences.ini.
- Pronunciation prediction doesn't seem to work at all: any word that is not in the dictionary is completely unrecognizable. To avoid losing words from a sentence, my program splits any Chinese word that is not in base_dictionary.txt into individual Chinese characters, which are in the dictionary, so voice2json can handle them.
- There is no ability to deal with foreign languages. All English words appearing in a sentence seem to be discarded, and my program can't do anything about it: any foreign words in the sentence are simply lost.
- The only available PocketSphinx and CMU models have poor recognition performance: accuracy is far lower than the Microsoft Speech Recognition API that ships with Windows, and much worse than the English Kaldi model. For my program this makes them unusable, and I would recommend that Chinese users use the old Microsoft speech recognition engine. However, one English-speaking user gave excellent feedback:
  "The new speech recognition is much better then default windows one, it gets conversations almost every time, and takes a fraction of the time."
  This matches my own testing. I was impressed that the default en-us_kaldi-zamia model gave extremely accurate results in a very short time, even when I spoke with a crappy foreign accent.
So, here are some possibilities for improving Chinese speech recognition.
Intelligent Tokenizer (Word Segmenter)
Here is a simple project for it: fxsjy/Jieba. I use it in my application and it works well (I use the .NET port of it).
A demo:
pip3 install jieba
test.py
# encoding=utf-8
import jieba
strs=[
    "我来到北京清华大学",
    "乒乓球拍卖完了",
    "中国科学技术大学",
    "他来到了网易杭研大厦",
    "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"
]
for sentence in strs:  # avoid shadowing the built-in str
    seg_list = jieba.cut(sentence)
    print(' '.join(seg_list))
Result:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.458 seconds.
Prefix dict has been built successfully.
我 来到 北京 清华大学
乒乓球 拍卖 完 了
中国 科学技术 大学
他 来到 了 网易 杭研 大厦
小明 硕士 毕业 于 中国科学院 计算所 , 后 在 日本京都大学 深造
jieba uses an HMM model to predict new words that are not in its dictionary.
Pronunciation Prediction
Chinese pronunciation is character-based. The pronunciation of Chinese words is the concatenation of the pronunciation of each character.
So, to get the pronunciation of an unknown word, split it into individual characters, look up the pronunciation of each character, and join them. This doesn't even require training a neural network.
I use this method in my program and it works well. If the word returned by jieba.cut() is not in base_dictionary.txt, I split it into a sequence of single Chinese characters.
日本京都大学 -> 日 本 京 都 大 学 -> r iz4 b en3 j ing1 d u1 d a4 x ve2
Completely correct.
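A minimal sketch of this fallback, for illustration only. The function name `split_unknown_words` and the small `known_words` set are hypothetical; in practice the set would be filled from the words listed in base_dictionary.txt and the tokens would come from jieba.cut().

```python
def split_unknown_words(tokens, known_words):
    """Keep tokens found in the dictionary; split the rest into single characters."""
    result = []
    for token in tokens:
        if token in known_words:
            result.append(token)
        else:
            # Extending a list with a str adds one character at a time.
            result.extend(token)
    return result


# Hypothetical subset of the dictionary, for illustration only.
known_words = {"小明", "在", "深造", "日", "本", "京", "都", "大", "学"}
tokens = ["小明", "在", "日本京都大学", "深造"]
print(" ".join(split_unknown_words(tokens, known_words)))
```

Note that this sketch splits every unknown token, so English words would need to be filtered out before it is applied.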
The only caveat is that some characters have multiple pronunciations, so every possible combination has to be considered when joining them. This is where a trained neural network has the advantage. However, even without a neural network, it is possible to generate all candidate pronunciations and assume that each one is equally probable.
虎绿林 -> 虎 绿 林 -> (h u3 l v4 l in2 | h u3 l u4 l in2)
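Generating every candidate is a Cartesian product over the per-character pronunciations. The sketch below assumes a character dictionary that maps each character to a list of pronunciations; `CHAR_PRONUNCIATIONS` and `predict_pronunciations` are hypothetical names, and the toy dictionary contains only the pronunciations that appear in the two examples above.

```python
from itertools import product

# Toy character dictionary built from the two examples above. A real one would
# be loaded from the pronunciation dictionary and would list all readings.
CHAR_PRONUNCIATIONS = {
    "日": ["r iz4"], "本": ["b en3"], "京": ["j ing1"],
    "都": ["d u1"], "大": ["d a4"], "学": ["x ve2"],
    "虎": ["h u3"], "绿": ["l v4", "l u4"], "林": ["l in2"],
}


def predict_pronunciations(word, char_dict=CHAR_PRONUNCIATIONS):
    """Return every concatenation of per-character pronunciations for `word`."""
    per_char = [char_dict[ch] for ch in word]  # raises KeyError for unknown characters
    return [" ".join(combo) for combo in product(*per_char)]


print(predict_pronunciations("日本京都大学"))
print(predict_pronunciations("虎绿林"))
```

Each returned string can be written as a separate dictionary entry for the same word, which treats all candidates as equally likely.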
IPA pronunciation dictionary
I have one: https://github.com/SwimmingTiger/BigCiDian
Chao tone letters (IPA) are used to mark pitch.
This dictionary contains pronunciations of Chinese words and common English words.
Foreign language support
English words sometimes appear in spoken and written Chinese, and these words retain their English written form.
e.g. 我买了一台Mac笔记本,用的是macOS,我用起来还是不习惯,等哪天给它装个Windows系统。
Therefore, Chinese speech recognition engines usually need to have the ability to process two languages at the same time. If an English word is encountered, it is processed according to English rules (including pronunciation prediction).
If it is a Chinese word or a compound word (such as "U盘", meaning USB flash drive), it is processed according to Chinese rules.
For example, in word segmentation, English words cannot be split into individual characters.
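One way to route text to the right rules is to split a sentence into runs by script first. This is only a sketch under simple assumptions: `split_by_script` is a hypothetical helper, "latin" means ASCII letters, and "han" covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re

# One named alternative per kind of run: ASCII letters, Han characters, anything else.
RUN_PATTERN = re.compile(
    r"(?P<latin>[A-Za-z]+)"
    r"|(?P<han>[\u4e00-\u9fff]+)"
    r"|(?P<other>[^A-Za-z\u4e00-\u9fff]+)"
)


def split_by_script(text):
    """Return (kind, run) pairs, where kind is 'latin', 'han', or 'other'."""
    return [(m.lastgroup, m.group()) for m in RUN_PATTERN.finditer(text)]


for kind, run in split_by_script("我买了一台Mac笔记本,用的是macOS"):
    print(kind, run)
```

The "latin" runs can then be kept whole and handled by English rules, while the "han" runs go to the Chinese tokenizer. A compound such as "U盘" comes out as two runs here, so a real implementation would need to rejoin it or check the dictionary before splitting.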
It seems possible to train a model that includes both Chinese and English. Of course, it might be convenient if voice2json supported model mixing (combining a pure Chinese model and a pure English model into a single model), but I don't know if that is technically possible.
Number to Words
Here is a complete C# implementation.
Finding or writing a well-rounded Python implementation doesn't seem that hard.
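As a rough idea of what the Python side could look like, here is a small sketch. It is not a port of the C# implementation: it only handles integers from 0 up to 99,999,999, always uses 二 (never 两), and ignores decimals, negative numbers, dates, and phone-number style readings. The names `number_to_words` and `_below_ten_thousand` are hypothetical.

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def _below_ten_thousand(n):
    """Spell out 1..9999, inserting a single 零 for skipped positions."""
    parts = []
    pending_zero = False
    for power in (3, 2, 1, 0):
        digit = n // 10 ** power % 10
        if digit == 0:
            pending_zero = bool(parts)  # zeros before the first digit are ignored
        else:
            if pending_zero:
                parts.append("零")
            parts.append(DIGITS[digit] + UNITS[power])
            pending_zero = False
    return "".join(parts)


def number_to_words(n):
    """Spell out an integer 0 <= n < 100_000_000 in Chinese characters."""
    if not 0 <= n < 100_000_000:
        raise ValueError("this sketch only supports 0 <= n < 100,000,000")
    if n == 0:
        return "零"
    high, low = divmod(n, 10_000)
    words = ""
    if high:
        words += _below_ten_thousand(high) + "万"
    if low:
        if high and low < 1000:
            words += "零"  # e.g. 10005 is read 一万零五
        words += _below_ten_thousand(low)
    if words.startswith("一十"):
        words = words[1:]  # a leading 一十 is normally read as 十
    return words


for n in (0, 15, 105, 2024, 10005, 120000):
    print(n, number_to_words(n))
```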
Audio Corpora
Mozilla Common Voice already has sufficiently large Chinese audio corpora:
- https://commonvoice.mozilla.org/zh-CN/datasets
- https://commonvoice.mozilla.org/zh-TW/datasets
- https://commonvoice.mozilla.org/zh-HK/datasets
Convert between Simplified Chinese and Traditional Chinese
Traditional Chinese and Simplified Chinese are just different written forms of Chinese characters; the spoken language is the same.
https://github.com/SwimmingTiger/BigCiDian is a Simplified Chinese pronunciation dictionary (without Traditional Chinese characters), so it may be easiest to convert all text to Simplified Chinese.
https://github.com/yichen0831/opencc-python can do this very well.
pip3 install opencc-python-reimplemented
test.py
from opencc import OpenCC
cc = OpenCC('t2s')  # convert from Traditional Chinese to Simplified Chinese
to_convert = '開放中文轉換'
converted = cc.convert(to_convert)
print(converted)
Result: 开放中文转换
Convert it before tokenization (word segmentation).
Calling t2s conversion on Simplified Chinese has no side effects, so there is no need to detect which form the text is in before converting.
Complete preprocessing pipeline for text
Convert Traditional to Simplified -> Number to Words -> Tokenizer (Word Segmentation) -> Convert to Pronunciation -> Unknown Word Pronunciation Prediction (Chinese and English may use different methods: handwritten code or a neural network)
Why does number-to-words come before the tokenizer?
Because the output of number-to-words is itself Chinese text with no spaces between words, so it has to be segmented like the rest of the sentence.
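The ordering can be sketched as a function that takes each stage as a callable. Everything here is a hypothetical illustration: `preprocess` and the toy stand-in stages are made up for the example. In a real setup the stages might be `OpenCC('t2s').convert`, a number-to-words function, `jieba.cut`, and a dictionary lookup, with unknown words left for the prediction step.

```python
def preprocess(text, to_simplified, numbers_to_words, tokenize, pronounce):
    """Run the stages in the order listed above; return (word, pronunciations) pairs."""
    text = to_simplified(text)
    text = numbers_to_words(text)  # before tokenizing: its output has no spaces
    return [(word, pronounce(word)) for word in tokenize(text)]


# Toy stand-ins so the sketch runs without any third-party packages.
T2S_TABLE = str.maketrans({"個": "个", "學": "学"})
TOY_DICTIONARY = {"大": ["d a4"], "学": ["x ve2"]}

result = preprocess(
    "3個大學",
    to_simplified=lambda t: t.translate(T2S_TABLE),
    numbers_to_words=lambda t: t.replace("3", "三"),
    tokenize=list,  # one token per character
    pronounce=lambda w: TOY_DICTIONARY.get(w, []),  # [] marks an unknown word
)
print(result)
```

Words that come back with an empty list are the ones that would be passed on to unknown word pronunciation prediction.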
Model Training
I want to train a Chinese kaldi model for voice2json. Maybe I can use the steps and tools of Rhasspy.
To train a Chinese model using https://github.com/rhasspy/ipa2kaldi, it looks like I need to add Chinese support to https://github.com/rhasspy/gruut.
If there is any progress, I will update here. Any suggestions are also welcome.