Introduction to Speech Signal Processing

Dr. Zhang Sen
zhangsen@gscas.ac.cn
Chinese Academy of Sciences
Beijing, China
2014/9/10

Introduction
• Sampling and quantization
• Speech coding
Features and Analysis
• Main features
• Some transformations
Text-to-Speech
• State of the art
• Main approaches
Speech-to-Text
• State of the art
• Main approaches
Applications
• Human-machine dialogue systems

Viewing the speech signal mathematically
• Can be described by a continuous function, but it is hard to find an explicit analytical form
• Non-linear
• Non-stationary, time-varying
• Some parts resemble noise
• Some parts resemble a pseudo-periodic signal
Viewing the speech signal physically
• A wave generated by vibration
• Transmitted through air or other media

Analysis approaches
• Divide-and-conquer
• Approximation and simplicity
• Transformation between the time domain and the frequency domain (TD-FD)
Analysis purpose
• To find speech features
• Which features are important, which are trivial
• Correlation between features
• How features change
• How to change the original signal

Features can be classified as
• Time-domain features
• Frequency-domain features
Or
• Short-term features
• Long-term features
Feature representation
• Numerical: vector or distribution
• Diagram: curve or image

Windowing (framing)
Within a short frame (10 ms to 25 ms), the non-stationary, non-linear signal can be treated as approximately stationary and linear.
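A minimal sketch of framing, assuming a Hamming window, a 25 ms frame and a 10 ms hop. The slide only gives the 10 ms to 25 ms range, so the window type, the hop and the function names are illustrative choices.

```python
import math


def hamming(n_samples):
    """Hamming window of length n_samples (0.54 - 0.46*cos form)."""
    if n_samples <= 1:
        return [1.0] * n_samples
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def split_into_frames(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut a signal into overlapping windowed frames; an incomplete tail is dropped."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# One second of a 100 Hz tone sampled at 8 kHz (synthetic test signal).
sample_rate = 8000
signal = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate)
          for n in range(sample_rate)]
frames = split_into_frames(signal, sample_rate)
```

Each element of `frames` is one short-term segment on which the features in the following slides can be computed.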

Window types

Window shapes

A few words on Window function

Commonly used speech features
• Zero-crossing-rate (ZCR)
• Peaks
• Power and energy
• Correlation, auto-correlation, AMDF
• Formant
• Pitch
• Frequency spectrum
• Cepstrum and MFCC
• Linear Predictive Coefficients (LPC), LPCC

ZCR
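The slide gives only the title, so the following is a sketch of the usual definition: the fraction of adjacent sample pairs in a frame whose signs differ. Treating a sample equal to zero as positive is an assumption of this sketch.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as positive)."""
    if len(frame) < 2:
        return 0.0
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0.0) != (cur >= 0.0):
            crossings += 1
    return crossings / (len(frame) - 1)
```

Noise-like (unvoiced) frames tend to give a high rate and voiced frames a low one, which is why the measure is often used together with energy.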

Level-crossing-rate

Peaks

Power and energy
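A short sketch of the usual short-time definitions, assuming energy is the sum of squared samples in a frame and power is that energy divided by the frame length. The decibel helper and its floor value are illustrative additions.

```python
import math


def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)


def short_time_power(frame):
    """Energy divided by the number of samples (0.0 for an empty frame)."""
    return short_time_energy(frame) / len(frame) if frame else 0.0


def energy_db(frame, floor=1e-12):
    """Energy in decibels; the floor avoids log(0) on silent frames."""
    return 10.0 * math.log10(max(short_time_energy(frame), floor))
```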

Correlation, auto-correlation, AMDF (average magnitude difference function)
• Used to measure the similarity of two signals or to detect the periodicity of a signal
• Auto-correlation: sum x(k+i)*x(k+m+i) over a range of i, where k is the reference point and m is the lag
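The sum above can be sketched directly. The function names and the range length `n` are illustrative; the AMDF is written in its usual form, the mean absolute difference between the frame and its shifted copy.

```python
def autocorrelation(x, k, m, n):
    """Sum of x[k+i] * x[k+m+i] for i = 0 .. n-1 (k: reference point, m: lag)."""
    return sum(x[k + i] * x[k + m + i] for i in range(n))


def amdf(x, k, m, n):
    """Mean of |x[k+i] - x[k+m+i]| for i = 0 .. n-1."""
    return sum(abs(x[k + i] - x[k + m + i]) for i in range(n)) / n


# A synthetic signal with a period of 4 samples.
x = [0.0, 1.0, 0.0, -1.0] * 8
```

For a periodic signal the auto-correlation peaks and the AMDF dips when the lag equals the period, here at m = 4.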

Center-clipping technique
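The slide shows only the title. A sketch of the usual center-clipping rule follows: samples within a threshold of zero are set to zero and the rest are shifted toward zero by the threshold. Setting the threshold to a fixed fraction of the frame peak (`ratio`, default 0.3) is an assumption of this sketch.

```python
def center_clip(frame, ratio=0.3):
    """Zero samples with |s| <= c and shift the others toward zero by c = ratio * peak."""
    peak = max((abs(s) for s in frame), default=0.0)
    c = ratio * peak
    clipped = []
    for s in frame:
        if s > c:
            clipped.append(s - c)
        elif s < -c:
            clipped.append(s + c)
        else:
            clipped.append(0.0)
    return clipped
```

Applying this before auto-correlation suppresses low-amplitude detail so that the peaks at the pitch period stand out more clearly.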

Auto-correlation peaks

Auto-correlation show

Formant LPC->FFT

Formant displays

Some typical formant values

Pitch, fundamental frequency
• Referred to as F0; determines tone and prosody
Pitch estimation methods
• Auto-correlation and AMDF
• Cepstrum
• LPC
• Peak detection
Pitch smoothing methods
• Dynamic programming
• N-point smoothing filter
• HMM
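As an illustration of the auto-correlation method, the sketch below picks the lag with the largest auto-correlation inside a plausible pitch range and converts it to a frequency. The 60 Hz to 400 Hz search range and the function name are assumptions, and no voiced/unvoiced decision or smoothing is attempted.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return F0 in Hz from the strongest auto-correlation lag, or 0.0 if none is found."""
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else 0.0


# A synthetic 200 Hz tone sampled at 8 kHz (period of 40 samples).
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
```

On real speech the raw per-frame estimates usually contain octave errors, which is what the smoothing methods listed above are meant to correct.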

Pitch show
The pitch of a3, estimated by the auto-correlation method

Spectrogram
• Representation of a signal highlighting several of its properties, based on short-time Fourier analysis
• Two-dimensional: time on the horizontal axis, frequency on the vertical axis
• Third ‘dimension’: gray or color level indicating energy

Spectrum of a frame (vowel)

Spectrum of a frame (consonant)

Cepstrum analysis

Cepstrum and MFCC computation
• Cepstrum: s(n) -> DFT -> log|DFT| -> IDFT -> cepstrum
• MFCC: s(n) -> DFT -> Filter-bank -> log -> DCT -> MFCC
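A minimal sketch of the cepstrum chain (DFT, log magnitude, inverse DFT) using a direct O(N^2) transform from the standard library. The floor inside the logarithm is an assumption added to avoid log(0), and the MFCC branch is not implemented here.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Direct inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(frame, floor=1e-12):
    """s(n) -> DFT -> log|DFT| -> IDFT, keeping the real part."""
    log_magnitude = [math.log(max(abs(v), floor)) for v in dft(frame)]
    return [c.real for c in idft(log_magnitude)]
```

A unit impulse has a flat magnitude spectrum of 1, so its cepstrum is zero everywhere; scaling the impulse only changes the first coefficient.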

Filter-bank
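The slide shows only the title. Assuming it refers to the mel-spaced filter bank used for MFCC, the sketch below converts between Hz and mel with one commonly used formula, 2595 * log10(1 + f / 700), and places filter center frequencies evenly on the mel scale. The formula choice and the function names are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """One common mel-scale approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(10, 0.0, 4000.0)
```

The centers are closely spaced at low frequencies and spread out at high frequencies, mirroring the resolution of human hearing.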

Perceptual measures

Linear predictive analysis

Prediction errors
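The two slides above show only titles. As a sketch, linear prediction models each sample as a weighted sum of the previous p samples, and the prediction error is the difference between the sample and that estimate. The code below uses the Levinson-Durbin recursion on auto-correlation values, one standard way to solve for the coefficients; the sign convention (prediction = sum of a_k * s(n-k)) and all names are assumptions.

```python
def levinson_durbin(r, order):
    """Solve for LP coefficients a_1..a_p from auto-correlation values r[0..p].

    Returns (coefficients, residual_energy), with the prediction defined as
    s_hat(n) = sum_k a_k * s(n - k).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def prediction_error(signal, coeffs):
    """e(n) = s(n) - sum_k a_k * s(n - k), for n >= p."""
    p = len(coeffs)
    return [signal[n] - sum(coeffs[k] * signal[n - 1 - k] for k in range(p))
            for n in range(p, len(signal))]


# Auto-correlation of a first-order process with coefficient 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], 2)
```

For this input the second coefficient comes out as zero, since one past sample already explains the correlation structure.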

LP coefficients to cepstral coefficients
• The computation of LPCC
• LPCC is often used in ASR as a feature vector
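The conversion is usually done with a recursion rather than through a spectrum. The sketch below assumes the all-pole model 1 / (1 - sum of a_k z^-k) and returns c_1 onward; the gain-dependent term c_0 is omitted, and the function name is illustrative.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LP coefficients a_1..a_p (a[0] is a_1) to cepstral coefficients c_1..c_n.

    c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}, with a_n = 0 for n > p.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single pole with coefficient a, the result is c_n = a^n / n, which provides a quick check of the recursion.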

Some transformations in SSP
• DFT, FFT, DCT and their inverses
  • Frequency analysis
  • TD-FD conversion
• Z transformation
  • LPC analysis
  • Filter design
• Wavelet transformation
  • Frequency analysis
  • Compression

Fourier Transform

Discrete Fourier Transform
The computational load of the DFT is O(N^2); the Fast Fourier Transform (FFT) reduces it to O(N log N) using the divide-and-conquer principle.
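To make the difference concrete, here is a sketch of a direct DFT next to a recursive radix-2 FFT. The FFT splits the input into even and odd samples, transforms each half and combines them. This variant assumes the length is a power of two and is meant for illustration, not speed.

```python
import cmath


def dft(x):
    """Direct DFT: N output bins, each a sum over N samples, so O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 0:
        return []
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # transform of even-indexed samples
    odd = fft(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

Both functions return the same spectrum; only the amount of work differs.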

Basic phonetic knowledge
• Consonant/unvoiced
• Vowel/voiced
• Co-articulation
• Phone and phoneme
• Uni-, bi-, tri-phone
• Canonical form, surface form, reduced form
• Tone and prosody

Co-articulation
• Very common in English, where it causes many difficulties in ASR
• Less serious in Mandarin
• Bi-phones and tri-phones are used to cope with this issue
• Some examples:
  Mandarin: A yi, yi yi, wu yun, …
  English: this issue, in a box, …

Some research topics
• Speech signal detection, endpoint detection
• Consonant/vowel separation
• Pitch estimation
• Echo cancellation
• De-noising and filter design
• Multi-signal separation
• Robust features
• Perceptual features
• Re-sampling and re-construction
• etc.


Test
Mode
• A final 4-page report, or
• A 30-min presentation
Content
• Review of speech processing
• Speech features and processing approaches
• Review of TTS or ASR
• Audio in computer engineering

THANKS
