Introduction to Speech Signal Processing Dr. Zhang Sen zhangsen@gscas.ac.cn Chinese Academy of Sciences Beijing, China 2014/9/10
• Introduction
• Sampling and quantization
• Speech coding
• Features and Analysis
  • Main features
  • Some transformations
• Text-to-Speech
  • State of the art
  • Main approaches
• Speech-to-Text
  • State of the art
  • Main approaches
• Applications
  • Human-machine dialogue systems
The speech signal from a mathematical view
• Can be described by a continuous function, but:
  • An explicit analytical form is hard to find
  • Non-linear
  • Non-stationary, time-varying
  • Some parts resemble noise
  • Some parts resemble a pseudo-periodic signal
The speech signal from a physical view
• A wave generated by vibration
• Transmitted through air or another medium
Analysis approaches
• Divide-and-conquer
• Approximation and simplification
• Transformation between time domain and frequency domain (TD-FD)
Analysis purpose
• To find speech features
• Which features are important and which are trivial?
• How are the features correlated?
• How do the features change?
• How can the original signal be changed?
Features can be classified as
• Time-domain features and frequency-domain features, or
• Short-term features and long-term features
Feature representation
• Numerical: vector or distribution
• Diagram: curve or image
Windowing (framing)
• Within a short frame (10 ms to 25 ms), a non-stationary signal can be treated as approximately stationary, and a non-linear one as approximately linear
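A minimal sketch of the framing step, using only the Python standard library. The 8 kHz sampling rate, the 25 ms frame, the 10 ms hop, and the Hamming window are assumptions chosen for illustration (the slide only gives the 10 ms to 25 ms range), and `frame_signal` is a hypothetical helper name.

```python
import math

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames and apply a Hamming window to each."""
    # Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * window[n] for n in range(frame_len)])
    return frames

# One second of a 100 Hz sine at an assumed 8 kHz sampling rate.
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]

# 25 ms frames (200 samples) with a 10 ms hop (80 samples).
frames = frame_signal(x, frame_len=200, hop=80)
```

Each element of `frames` can then be analysed as if it were a stationary segment, which is the simplification the slide describes.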
Commonly used speech features
• Zero-crossing rate (ZCR)
• Peaks
• Power and energy
• Correlation, auto-correlation, AMDF
• Formant
• Pitch
• Frequency spectrum
• Cepstrum and MFCC
• Linear Predictive Coefficients (LPC), LPCC
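Two of the simplest features in this list, the zero-crossing rate and the short-time energy, can be sketched as follows. The function names and the two toy frames are illustrative assumptions; the slide does not fix a particular normalisation, so the ZCR here is simply the fraction of adjacent sample pairs with different signs.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)

# A low-frequency, high-amplitude tone (voiced-like) versus a weak,
# rapidly alternating sequence (a crude stand-in for unvoiced noise).
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(200)]
hiss = [0.1 * (-1) ** n for n in range(200)]

zcr_tone, zcr_hiss = zero_crossing_rate(tone), zero_crossing_rate(hiss)
energy_tone, energy_hiss = short_time_energy(tone), short_time_energy(hiss)
```

In this toy case the tone has a low ZCR and high energy while the alternating sequence has the opposite pattern, which is why the two features are often used together for voiced/unvoiced decisions.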
Correlation, auto-correlation, AMDF
• Measure the similarity of two signals, or detect the periodicity of a signal
• Auto-correlation: R(m) = \sum_{i=0}^{N-1} x(k+i) x(k+m+i), where k is the reference point, m is the lag and N is the length of the summation range
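The sum above, together with the average magnitude difference function (AMDF), can be sketched as below. The slide does not spell out the AMDF; the usual definition, the mean of |x(k+i) - x(k+m+i)| over the range, is assumed here. The test signal is a synthetic sine with a period of 50 samples.

```python
import math

def autocorr(x, k, m, n):
    """Sum of x[k+i] * x[k+m+i] for i in 0..n-1 (k: reference point, m: lag)."""
    return sum(x[k + i] * x[k + m + i] for i in range(n))

def amdf(x, k, m, n):
    """Average magnitude difference between the signal and its lagged copy."""
    return sum(abs(x[k + i] - x[k + m + i]) for i in range(n)) / n

period = 50
x = [math.sin(2 * math.pi * t / period) for t in range(400)]

r_at_period = autocorr(x, 0, period, 200)       # large: the copies line up
r_at_half = autocorr(x, 0, period // 2, 200)    # negative: copies are opposed
d_at_period = amdf(x, 0, period, 200)           # close to zero
d_at_half = amdf(x, 0, period // 2, 200)        # large
```

A lag equal to the period gives an auto-correlation peak and an AMDF valley, which is what makes both functions usable for periodicity detection.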
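Formant
• Estimation: LPC -> FFT (the FFT of the LPC coefficients gives a smooth spectral envelope whose peaks mark the formants)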
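A hedged sketch of the LPC -> FFT idea. It assumes the autocorrelation method with the Levinson-Durbin recursion for the LPC step, and it evaluates the envelope 1/|A(e^{jw})| directly on a frequency grid instead of calling an FFT (the values are the same as a zero-padded FFT of the coefficient sequence would give). The input is a synthetic single resonance at 1000 Hz with an assumed 8 kHz sampling rate, not real speech, and all names are illustrative.

```python
import cmath
import math

def lpc(x, order):
    """Predictor coefficients alpha_1..alpha_order via Levinson-Durbin.

    Convention: x[n] is approximated by sum(alpha_k * x[n-k], k = 1..order).
    """
    n = len(x)
    r = [sum(x[t] * x[t + lag] for t in range(n - lag)) for lag in range(order + 1)]
    alpha = [0.0] * (order + 1)          # index 0 is unused padding
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))) / err
        new = alpha[:]
        new[i] = k
        for j in range(1, i):
            new[j] = alpha[j] - k * alpha[i - j]
        alpha = new
        err *= 1.0 - k * k               # prediction error shrinks each step
    return alpha[1:]

def lpc_envelope(alpha, n_points):
    """Evaluate 1/|A(e^{jw})| at n_points frequencies in [0, pi)."""
    env = []
    for m in range(n_points):
        w = math.pi * m / n_points
        a = 1 - sum(c * cmath.exp(-1j * w * (k + 1)) for k, c in enumerate(alpha))
        env.append(1.0 / abs(a))
    return env

# Synthetic input: impulse response of a two-pole resonator at 1000 Hz.
fs, f_res, radius = 8000, 1000.0, 0.95
theta = 2 * math.pi * f_res / fs
x = [0.0] * 400
for n in range(400):
    x[n] = 1.0 if n == 0 else 0.0
    if n >= 1:
        x[n] += 2 * radius * math.cos(theta) * x[n - 1]
    if n >= 2:
        x[n] -= radius ** 2 * x[n - 2]

alpha = lpc(x, order=2)
env = lpc_envelope(alpha, 512)
peak_bin = max(range(len(env)), key=env.__getitem__)
formant_hz = peak_bin * (fs / 2) / len(env)   # envelope peak in Hz
```

For real speech a higher order (often around 10 to 16 at this sampling rate) and a peak picker that returns several maxima would be needed; the order-2 case is only meant to show the mechanism.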
Pitch, fundamental frequency
• Referred to as F0; determines tone and prosody
Pitch estimation methods
• Auto-correlation and AMDF
• Cepstrum
• LPC
• Peak detection
Pitch smoothing methods
• Dynamic programming
• N-point smoothing filter
• HMM
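As one possible reading of the "N-point smoothing filter", the sketch below applies an N-point median filter to a pitch track. The choice of a median (rather than, say, a moving average) and the edge handling by repeating the first and last values are assumptions, and the F0 values are made up for illustration.

```python
def median_smooth(track, n=3):
    """N-point median filter (n odd); edges are padded by repeating end values."""
    half = n // 2
    padded = [track[0]] * half + list(track) + [track[-1]] * half
    return [sorted(padded[i:i + n])[half] for i in range(len(track))]

# Hypothetical frame-level F0 estimates in Hz with one doubling error (200).
raw_track = [100, 102, 200, 101, 99]
smooth_track = median_smooth(raw_track, n=3)
```

A median filter removes isolated outliers such as pitch doubling or halving errors without blurring the neighbouring values, which a moving average would do.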
Pitch example
[Figure: the pitch of a3, estimated by the auto-correlation method]
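A minimal sketch of auto-correlation pitch estimation for a single frame. The search range of 80 Hz to 400 Hz, the 8 kHz sampling rate, and the synthetic 200 Hz sine are assumptions for illustration; the figure itself was produced from a real recording that is not reproduced here.

```python
import math

def estimate_f0(frame, fs, f_min=80.0, f_max=400.0):
    """Return F0 in Hz from the lag that maximises the short-time auto-correlation."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)            # the frame must be longer than this

    def r(m):
        # Samples outside the frame are treated as zero.
        return sum(frame[i] * frame[i + m] for i in range(len(frame) - m))

    best_lag = max(range(lag_min, lag_max + 1), key=r)
    return fs / best_lag

fs = 8000
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(400)]  # 50 ms frame
f0 = estimate_f0(frame, fs)
```

Running this over successive frames gives a pitch contour of the kind shown in the figure. Real speech would also need a voiced/unvoiced decision, which is omitted here.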
Spectrogram
• A representation of a signal that highlights several of its properties, based on short-time Fourier analysis
• Two-dimensional: time on the horizontal axis, frequency on the vertical axis
• Third ‘dimension’: gray or color level indicating energy
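The three "dimensions" can be made concrete with a small short-time Fourier analysis. This is a sketch under assumptions: a Hamming window, a direct (slow) DFT per frame, and a toy sine whose frequency falls exactly on bin 8 of a 64-point frame. `spec[t][k]` holds the energy at frame `t` (time axis) and bin `k` (frequency axis); mapping these values to gray or color levels is left to a plotting tool.

```python
import cmath
import math

def spectrogram(x, frame_len, hop):
    """Return spec[t][k]: energy of frame t at frequency bin k (0..frame_len//2)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    spec = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [x[start + n] * window[n] for n in range(frame_len)]
        column = []
        for k in range(frame_len // 2 + 1):
            s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            column.append(abs(s) ** 2)   # energy at this time/frequency cell
        spec.append(column)
    return spec

# Toy signal: a sine with exactly 8 cycles per 64-sample frame.
x = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(x, frame_len=64, hop=32)

# Strongest frequency bin in every frame.
peak_bins = [max(range(len(col)), key=col.__getitem__) for col in spec]
```

For a stationary tone the strongest bin is the same in every frame, so the spectrogram would show one horizontal line; speech produces bands that move as the formants and pitch change.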
Cepstrum and MFCC computation
• Cepstrum: s(n) -> DFT -> log|DFT| -> IDFT -> cepstrum
• MFCC: s(n) -> DFT -> mel filter-bank -> log -> DCT -> MFCC
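The cepstrum path (DFT, log magnitude, inverse DFT) can be sketched in a few lines. The direct O(N^2) transforms, the small floor inside the logarithm, and the test signal (an impulse plus a half-amplitude echo 8 samples later) are assumptions for illustration; the MFCC path with its filter-bank and DCT is not covered here.

```python
import cmath
import math

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude of the DFT."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # Floor the magnitude so that log() never sees zero.
    log_mag = [math.log(max(abs(v), 1e-12)) for v in spectrum]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]

# An impulse followed by an echo of half the amplitude, 8 samples later.
x = [0.0] * 64
x[0] = 1.0
x[8] = 0.5
c = real_cepstrum(x)

# Largest cepstral value away from the origin (first half only, by symmetry).
peak_quefrency = max(range(1, 32), key=lambda q: c[q])
```

The peak lands at the echo delay. The same effect is why the cepstrum appears in the list of pitch estimation methods: a periodic excitation shows up as a peak at the pitch period.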
LP coefficients to cepstral coefficients
• The computation of LPCC
• LPCC is often used in ASR as a feature vector
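The slide's formula did not survive extraction, so the sketch below uses the standard recursion for an all-pole model H(z) = G / (1 - sum_k alpha_k z^-k), which is assumed to be what the slide showed: c_1 = alpha_1, and c_n = alpha_n + sum_{k=1}^{n-1} (k/n) c_k alpha_{n-k}, with alpha_n taken as zero for n greater than the model order p. The gain term c_0 is omitted and the function name is illustrative.

```python
def lpc_to_cepstrum(alpha, n_ceps):
    """Convert predictor coefficients alpha_1..alpha_p to cepstral c_1..c_n_ceps."""
    p = len(alpha)
    c = [0.0] * (n_ceps + 1)             # c[0] (gain term) is not computed here
    for n in range(1, n_ceps + 1):
        acc = alpha[n - 1] if n <= p else 0.0
        # Only terms with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * alpha[n - k - 1]
        c[n] = acc
    return c[1:]

# First-order check: for H(z) = 1 / (1 - a z^-1) the cepstrum is a^n / n.
a = 0.5
lpcc = lpc_to_cepstrum([a], 4)
```

Because the recursion needs no transform, more cepstral coefficients than LPC coefficients can be generated cheaply, which is one reason LPCC is convenient as a feature vector.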
Some transformations in SSP
• DFT, FFT, DCT and their inverses
  • Frequency analysis
  • TD-FD conversion
• Z transformation
  • LPC analysis
  • Filter design
• Wavelet transformation
  • Frequency analysis
  • Compression
Discrete Fourier Transform
• Direct computation of the DFT costs O(N^2) operations; the Fast Fourier Transform (FFT) reduces this to O(N log N) using the divide-and-conquer principle
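The two costs can be seen side by side in a small sketch: a direct DFT with its double loop, and a recursive radix-2 FFT that splits the input into even and odd samples. The radix-2 form is only one member of the FFT family and assumes the length is a power of two; the input values are arbitrary.

```python
import cmath

def dft(x):
    """Direct DFT: N outputs, each a sum over N inputs, so O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 FFT (len(x) must be a power of two), O(N log N)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])                  # divide: two half-size problems
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):              # conquer: combine with twiddle factors
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

x = [1, 2, 3, 4, 0, -1, -2, -3]
max_diff = max(abs(a - b) for a, b in zip(dft(x), fft(x)))
```

Both functions return the same values up to rounding error; only the amount of work differs.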
Basic phonetic knowledge
• Consonant/unvoiced
• Vowel/voiced
• Co-articulation
• Phone and phoneme
• Uni-, bi-, tri-phone
• Canonical form, surface form, reduced form
• Tone and prosody
Co-articulation
• Very common in English, where it causes many difficulties in ASR
• Less serious in Mandarin
• Bi-phones and tri-phones are used to cope with this issue
• Some examples:
  • Mandarin: A yi, yi yi, wu yun, …
  • English: this issue, in a box, …
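To illustrate how tri-phones carry context, the sketch below expands a phone sequence into context-dependent units. The `left-centre+right` notation is borrowed from common ASR toolkits, and the phone symbols for "this issue" are a rough, hypothetical transcription; neither comes from the slide.

```python
def to_triphones(phones):
    """Attach the left and right neighbour to each phone as 'l-c+r'."""
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        units.append(left + p + right)
    return units

# Hypothetical phone sequence for "this issue", with no pause between words.
phones = ["dh", "ih", "s", "ih", "sh", "uw"]
triphones = to_triphones(phones)
```

The "s" at the end of "this" becomes the unit `ih-s+ih`, so a model for it can differ from an "s" in another context, which is how tri-phones absorb co-articulation across the word boundary.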
Some research topics
• Speech signal detection, endpoint detection
• Consonant/vowel separation
• Pitch estimation
• Echo cancellation
• Denoising and filter design
• Multi-signal separation
• Robust features
• Perceptual features
• Re-sampling and reconstruction
• etc.
References
• Jurafsky & Martin, Speech & Language Processing, Prentice Hall, 2000
• X. D. Huang et al., Spoken Language Processing, Prentice Hall, Inc., 2000
• Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1999
• Manning & Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999
• L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993
• Dr. J. Picone, Speech Website, www.isip.msstate.edu
Test
Mode
• A final 4-page report, or
• A 30-min presentation
Content
• Review of speech processing
• Speech features and processing approaches
• Review of TTS or ASR
• Audio in computer engineering