当前位置:网站首页>[introduction to speech recognition] basic concepts and framework
[introduction to speech recognition] basic concepts and framework
2022-07-19 05:51:00 【Algorismus】
Learning goals :
- Introduction to speech recognition
Learning content :
1. Understanding the concept of speech recognition
2. speech synthesis
3. Voice perception
4. Modern speech recognition
Learning time :
2020.6.24
Learning output :
1. Speech recognition concept understanding
The category of speech recognition :
1. Voice to text
2. Let the machine hear the content clearly
3. Solve homonym error
4. solve “ Generality ” problem : Many people talk , Everyone can recognize
Common misunderstandings that do not belong to the category of speech recognition :
1. Voiceprint recognition : Identify the speaker
2. Language emotion recognition and information analysis
3. Language understanding
The evaluation indicators are divided into two categories :accuracy,efficiency
Accuracy:
· Phoneme error rate (Phone Error Rate)
• Word error rate (Word Error Rate, WER)
• Word error rate (Character Error Rate, CER)
• Sentence error rate (Sentence Error Rate, SER)
Efficiency:
· Delete error (deletion): Important words and sentences have been deleted by mistake
· Replacement error (substitution): Replace the original words and sentences with other words and sentences
· Insert error (insertion): Put the position of words into dislocation , Or insert extra words
Error rate=100(S+D+I)/words*
namely , The error rate is the sum of the three errors divided by the total number of words in the text
notes : The error rate may exceed 100%
2. Voice generation
• Speech Production: The brain -> Neuromuscular commands -> Vocal organ movement ( Air flow from the lungs to the vocal organs , The mouth and nose send out )
• Source-Filter Model: Pronunciation is by the signal source ( glottis ), Through the filter ( oral cavity 、 nasal cavity 、 Mouth shape, etc ) produce
• Dullness (Voiced sound): Vocal cord vibration causes , Sound waveform has obvious period sex , The frequency of vocal cord vibration is called pitch frequency or fundamental frequency (fundamental frequency, F0), People can feel To a stable pitch .
• voiceless consonant (Unvoiced sound): The vocal cords do not vibrate , The waveform is similar to white noise , People have no Dharma senses the existence of a stable pitch
In short , The difference between voiceless and voiceless tones is whether the vocal cords vibrate .
phoneme (Phonemes)
Phonetic in a language “ Minimum ” unit (primitive sounds)
word / morpheme (morpheme): The smallest semantic structural unit in a language
Expand :
·Phone: The realization of phoneme in acoustics
• Allophone( Homonym ): The acoustic implementation of phonemes is influenced by context , Different implementations of a phoneme
Such as :p in Spin and Pin; t in Bat and Batter;A in bat and bad
• IPA: international phonetic alphabet , A unified system marks phonemes in different languages ( Help distinguish homophone mappings )
· Speech synthesis has G to P, That is, morpheme to phoneme
Formant (formant)
In the spectrum of sound, the energy is relatively concentrated in some areas ( Spectrum peak )
characteristic :
1. The determinants of sound quality , It reflects the vocal tract ( Resonant cavity ) Physical characteristics of .
2. The cavity acts as a filter , Will strengthen the sound / The effect of attenuation
3. The uneven distribution of energy causes the strengthened part to look like a mountain , Formant
4. The first and second formants are important conditions for distinguishing different vowels
Voice transcript (Phonetic Transcription)
List of phonemes corresponding to a speech ( With or without time boundaries , The time information is obtained by manual marking or automatic alignment ), Acoustic modeling for speech recognition .
It can be divided manually or by machine .
3. Voice perception
Mapping the relationship between physical quantities and hearing

New concepts : Sound pressure level , loudness , Isoloudness curve
Sound pressure level (Sound pressure level, SPL), Company :dB = 20 log10 P/P0(P0 by TOH)
Because the decibel conversion is used , So it's a unit related to loudness .
loudness (loudness):
• The physical quantity that people subjectively feel the sound intensity of different frequency components , The unit is square (phone)
• The response of human ear to different frequencies of sound is not flat
• Olfactory threshold : Loudness when the human ear can just hear the sound ( Lower limit )
• Threshold of pain : The loudness of sound when it makes people's ears ache ( ceiling )
Derived concepts : Isoloudness curve (Equal loudness curves)
That is, the blue line in the figure below , On the same blue line , People subjectively think these voices are the same size , And the highest and lowest are the smell threshold and pain threshold
The horizontal axis is the frequency , The longitudinal axis is sound pressure level

timbre
· By sound waveform Harmonic spectrum and envelope decision .
• The most audible sound produced by the fundamental frequency of the sound waveform is called The pitch , The sound produced by the tiny vibration of each harmonic is called Overtones .
• A single frequency sound is called pure tone , A sound with harmonics is called Polyphony .
• Every pitch has its own frequency and overtones of different loudness , This can distinguish other sounds with the same loudness and tone .
• The proportion of each harmonic of the sound waveform and the attenuation with time determine the timbre characteristics of various sound sources , Its envelope is the line between the peaks of each period , The steepness and slowness of the envelope affect the transient characteristics of sound intensity .
tone
The human ear's perception of frequency is nonlinear , Approximate logarithmic function
• Subjectively, the unit of feeling tone is beauty (Mel) scale
• One above the hearing threshold 40dB、 The frequency is 1kHz The tonal localization produced by the pure tone of 1000Mel, If a pure tone sounds better than 1000Mel The tone of my voice is twice as high , Then its tone is 2000Mel.
• The approximate relationship between tone and frequency : Tmel = 2595log10(1+f/7000)
In this formula mel It is the measurement unit of pitch
Masking effect
• Definition : A psychoacoustic phenomenon , It is determined by the human ear's mechanism of frequency resolution of sound . It refers to the vicinity of a strong sound , Relatively weak sound is not easy to be detected by human ears , That is, it is masked by the strong sound .Auditory masking occurs when the perception of one sound is affected by the presence of another sound (Gelfand 2004).
• At the same time masking ( Frequency masking ): A strong pure tone will mask the weak pure tone that sounds at the same time at the nearby frequency
• Isochronous masking ( Time domain masking ): In time There is also masking between adjacent sounds
• The masking threshold is time 、 Frequency and sound pressure level function
In short, it is equivalent to improving the smell threshold , For example, the red line

边栏推荐
- Wechat applet password display hidden (small eyes)
- 微信小程序的常用組件
- dlib库和.dat文件地址
- 对比学习用于图像语义分割(两篇文章)
- CV learning notes [2]: convolution and conv2d
- Run yolov5 process record based on mindspire
- Comparative learning loss function (rince/relic/relicv2)
- Record: yolov5 model pruning lightweight
- 7种视觉MLP整理(下)
- Parsing bad JSON data using gson
猜你喜欢

Transform the inriapearson data set into Yolo training format and visualize it

Overview of self supervised learning

字符串距离问题

CV学习笔记【2】:卷积与Conv2d

对比学习用于图像语义分割(两篇文章)

深入理解卡尔曼滤波器(2): 一维卡尔曼滤波器

Some problems in face recognition testing with facenet source code

Page navigation of wechat applet

CV-Model【3】:VGG16

CV-Model【1】:Mnist
随机推荐
微信小程序的常用組件
Pointnet++ code explanation (V): Sample_ and_ Group function and samle_ and_ group_ All function
The widerperson data set is transformed into yolov5 training format and added to crowdhuman
seq2seq (中英对照翻译)Attention
软件过程与管理复习(八)
Could not locate zlibwapi.dll. Please make sure it is in your library path
Pytorch learning notes [5]: generalization using convolution
Seq2seq (Chinese English translation) attention
widerperson数据集转化为YOLOv5训练格式,并加入到crowdhuman中
自監督學習概述
Common components of wechat applet
微信小程序代码的构成
Dlib library and Dat file address
运行基于MindSpore的yolov5流程记录
5种3D Attention/Transformer整理(A-SCN、Point Attention、CAA、Offset Attention、Point Transformer)
Deep clustering correlation (three articles)
2021-04-18
CV learning notes [1]: transforms
In VS, error c4996: 'scanf': this function or variable may be unsafe Solutions.
重写YOLOX的TensorRT版本部署代码