Lombard Speech Recognition Hynek Bořil hxb076000@utdallas.edu
Overview
• Model of Speech Production
• Automatic Speech Recognition (ASR)
  – Outline
  – Feature Extraction
  – Acoustic Models
• Lombard Effect (LE)
  – Definition & Motivation
  – Acquisition of a Corpus Capturing Lombard Effect
  – Analysis of Speech under LE
  – Methods Increasing ASR Robustness to LE
Speech Production • Model of speech production → understanding of speech signal structure → design of speech processing algorithms
Speech Production – Linear Model
[Figure: source–filter model. Voiced excitation: impulse train with period 1/F0 and glottal shaping, spectrum |I(F)G(F)| with harmonics at F0, 2F0, … and a −12 dB/oct roll-off. Unvoiced excitation: noise spectrum |N(F)|. Vocal tract transfer function |V(F)| and lip radiation |R(F)| with a +6 dB/oct slope.]
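The diagram combines these blocks by multiplication in the frequency domain; written out (a standard form of the source–filter model, using the symbols from the figure):

```latex
% Linear (source-filter) model of speech production in the frequency domain
% I(F) ... impulse train (harmonics at F0, 2F0, ...), G(F) ... glottal pulse (~ -12 dB/oct),
% N(F) ... noise excitation, V(F) ... vocal tract, R(F) ... lip radiation (~ +6 dB/oct)
S(F) =
\begin{cases}
  I(F)\,G(F)\,V(F)\,R(F) & \text{voiced speech} \\
  N(F)\,V(F)\,R(F)       & \text{unvoiced speech}
\end{cases}
```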
Speech Production – Linguistic/Speaker Information in Speech Signal • How is Linguistic Info Coded in the Speech Signal? • Phonetic Contents • Energy: voiced phones (v) – higher energy than unvoiced phones (uv) • Low formants: locations and bandwidths (← changes in the configuration of the vocal tract during speech production) • Spectral tilt: differs across phones, generally flatter for uv (due to changes in excitation and formant locations) • Other Cues • Pitch contour: important for distinguishing words in tonal languages (e.g., Chinese dialects) • How is Speaker Identity Coded in the Speech Signal? • Glottal Waveform • Vocal Tract Parameters • Prosody (intonation, rhythm, stress, …)
Speech Production – Phonetic Contents in Features • Example 1 – First 2 Formants in US Vowels (Bond et al., 1989) • Example 2 – Spectral Slopes in Czech Vowels
Automatic Speech Recognition (ASR) – Architecture of HMM Recognizer • Feature extraction – transformation of the time-domain acoustic signal into a representation more effective for the ASR engine: data dimensionality reduction, suppression of irrelevant (disturbing) signal components (speaker/environment/recording chain-dependent characteristics), preservation of phonetic content • Sub-word models – Gaussian Mixture Models (GMMs) – mixtures of Gaussians used to model the distribution of feature vector parameters; Multi-Layer Perceptrons (MLPs) – neural networks (much less common than GMMs)
Automatic Speech Recognition (ASR) – HMM-Based Recognition – Stages
[Figure (HTK Book, 2006): speech signal → feature extraction (windowing, …, cepstrum) → observation vectors o1 o2 o3 … → acoustic models (HMMs) + language model → word sequence / speech transcription]
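The acoustic and language models in the diagram are combined by the standard MAP decoding rule (notation assumed here, consistent with the HTK formulation rather than copied from the slides):

```latex
% O = o_1 o_2 o_3 ... observation (feature) vectors, W ... candidate word sequence
% P(O | W) ... acoustic model (HMM) likelihood, P(W) ... language model probability
\hat{W} = \arg\max_{W} \; P(O \mid W)\, P(W)
```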
Automatic Speech Recognition (ASR) – Feature Extraction – MFCC • Mel Frequency Cepstral Coefficients (MFCC) • Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980 • MFCC is the first choice in current commercial ASR • Preemphasis: compensates for spectral tilt (speech production/microphone channel) • Windowing: suppression of transient effects in short-term segments of the signal • |FFT|²: energy spectrum (phase is discarded) • Mel filter bank: mel scale – models the logarithmic perception of frequency in humans; triangular filters – dimensionality reduction • Log + IDCT: extraction of the cepstrum – deconvolution of glottal waveform, vocal tract function, and channel characteristics
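A minimal NumPy sketch of this pipeline for illustration; the parameter values (25 ms window, 10 ms shift, 26 filters, 13 coefficients, 8 kHz sampling, 512-point FFT) are common defaults assumed here, not taken from the slides, and the final transform is written out as an explicit DCT:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, win=0.025, shift=0.010, n_filt=26, n_ceps=13, nfft=512):
    """Sketch of the MFCC pipeline; assumes the signal is at least one frame long."""
    signal = np.asarray(signal, dtype=float)

    # Preemphasis: compensates for the overall spectral tilt
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing + Hamming window: suppresses transient effects at segment edges
    frame_len, frame_step = int(win * fs), int(shift * fs)
    n_frames = 1 + (len(sig) - frame_len) // frame_step
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)

    # |FFT|^2: energy spectrum, phase discarded
    pow_spec = np.abs(np.fft.rfft(frames, nfft)) ** 2

    # Mel filter bank: triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(1, n_filt + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0.0, 1.0, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1.0, 0.0, bins[i + 1] - bins[i], endpoint=False)
    fb_energies = np.maximum(pow_spec @ fbank.T, np.finfo(float).eps)

    # Log + DCT: cepstral coefficients (deconvolution of excitation, vocal tract, channel)
    log_fb = np.log(fb_energies)
    k = np.arange(n_filt)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2.0 * n_filt))
    return log_fb @ dct_basis.T          # shape: (n_frames, n_ceps)
```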
Automatic Speech Recognition (ASR) – Feature Extraction – MFCC & PLP • Perceptual Linear Predictive Coefficients (PLP) • Hermansky, Journal of the Acoustical Society of America, 1990 • An alternative to MFCC, used less frequently • Many stages similar to MFCC • Linear prediction – smoothing of the spectral envelope (may improve robustness)
[Figure: PLP vs. MFCC processing stages side by side]
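As an illustration of the linear-prediction stage, a small sketch of the Levinson–Durbin recursion that turns autocorrelation coefficients (in PLP, computed from the perceptually warped power spectrum) into LP coefficients; the function name, interface, and the 12th-order example are assumptions, not part of the original slides:

```python
import numpy as np

def lpc_from_autocorr(r, order):
    """Levinson-Durbin: prediction coefficients a[0..order] (a[0] = 1) and residual energy."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # Levinson update of the coefficients and of the error power
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# Example: autocorrelation of a short noisy sinusoid, smoothed by a 12th-order LP model
x = np.sin(2 * np.pi * 0.1 * np.arange(200)) + 0.01 * np.random.randn(200)
r = np.correlate(x, x, mode='full')[len(x) - 1:]
a, e = lpc_from_autocorr(r, 12)
```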
Automatic Speech Recognition (ASR) – Acoustic Models – GMM-HMM • Gaussian Mixture Models (GMMs) • Motivation: distributions of cepstral coefficients can be well modeled by a mixture (sum) of Gaussian functions • Example – distribution of c0 in a certain phone and the corresponding Gaussian (defined uniquely by mean, variance, and weight) [Figure: histogram of c0 samples and the fitted probability density function Pr(c0), with mean m and m±s marked] • Multidimensional observations (c0, …, c12) → multidimensional Gaussians – defined uniquely by means, covariance matrices, and weights • GMMs – typically used to model parts of phones • Hidden Markov Models (HMMs) • States (GMMs) + transition probabilities between states • Models of whole phones; lexicon → word models built of phone models
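In formula form (standard GMM-HMM notation, assumed rather than copied from the slides), the output density of HMM state j is:

```latex
% M ... number of mixture components, w_jm ... mixture weights (sum to 1),
% N(o; mu, Sigma) ... multivariate Gaussian with mean mu and covariance Sigma
b_j(\mathbf{o}_t) = \sum_{m=1}^{M} w_{jm}\,
  \mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} w_{jm} = 1
```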
Lombard Effect – Definition & Motivation • What is Lombard Effect? • When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard Effect – LE) • Why is Lombard Effect Interesting? • Better understanding of the mechanisms of human speech communication (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through a public loop? How do the type of noise and the communication scenario affect LE?) • Mathematical modeling of LE → classification of LE level, speech synthesis in noisy environments, increasing robustness of automatic speech recognition and speaker identification systems
Lombard Effect – Motivation & Goals • Ambiguity in Past LE Investigations • LE has been studied since 1911; however, many investigations disagree in the observed impacts of LE on speech production • Analyses conducted typically on very limited data – a couple of utterances from a few subjects (1–10) • Lack of communication factor – a majority of studies ignore the importance of communication for evoking LE (an effort to convey a message over noise) → occurrence and level of LE in the speech recordings is 'random' → contradicting analysis results • LE has been studied only for several world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); no comprehensive study for any of the Slavic languages • 1st Goal • Design of the Czech Lombard Speech Database addressing the need for a communication factor and well-defined simulated noisy conditions • Systematic analysis of LE in spoken Czech
Lombard Effect – Motivation & Goals • ASR under LE • Mismatch between LE speech contaminated by noise and acoustic models trained on clean neutral speech • The strong impact of noise on ASR is well known, and a vast number of noise suppression/speech emphasis algorithms have been proposed in the last decades (yet no ultimate solution has been reached) • The negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR systems mostly ignore this issue • LE-Equalization Methods • LE-equalization algorithms typically operate in the following domains: robust features, LE-to-neutral transformation, model adjustments, improved training of acoustic models • The algorithms display various degrees of efficiency and are often bound by strong assumptions preventing them from real-world application (applying fixed transformations to phonetic groups, known level of LE, etc.) • 2nd Goal • Proposal of novel LE-equalization techniques with a focus on both the level of LE suppression and the extent of the bounding assumptions
LE Corpora • Available Czech Corpora • Czech SPEECON – speech recordings from various environments including office and car • CZKCC – car recordings – include parked car with engine off and moving car scenarios • Both databases contain speech produced in quiet and in noise → candidates for a study of LE, however not good ones, as shown later → design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05) • Goals – communication in a simulated noisy background → high SNR recordings • Phonetically rich data/extensive small-vocabulary material • Parallel utterances in neutral and LE conditions
Data Acquisition – Recording Setup • Simulated Noisy Conditions • Noise samples mixed with speech feedback and played to the speaker and operator through headphones • Operator assesses intelligibility of speech in noise – if the utterance is not intelligible, the operator asks the subject to repeat it → speakers are required to convey the message over noise → real LE • Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL • Speaker Sessions • 14 male/12 female speakers • Each subject recorded both in neutral and simulated noisy conditions
Data Acquisition – Impact of Headphones • Environmental Sound Attenuation by Headphones • Attenuation characteristics measured on a dummy head • Source of wide-band noise; measurement of sound transfer to the dummy head's auditory canals when not wearing/wearing headphones • Attenuation characteristic – subtraction of the two transfers
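A rough sketch of that subtraction, assuming the two wide-band noise recordings at the dummy head's ear are available as arrays; the function name, FFT length, and single-frame spectral estimate are illustrative simplifications, not the measurement procedure actually used:

```python
import numpy as np

def attenuation_db(x_open, x_headphones, fs, nfft=4096):
    """Headphone attenuation vs. frequency as the difference of two transfer magnitudes (dB)."""
    def mag_db(x):
        # Windowed magnitude spectrum in dB (single frame for simplicity)
        x = np.asarray(x, dtype=float)[:nfft]
        spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
        return 20.0 * np.log10(np.maximum(spec, 1e-12))
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    # Positive values = how much the headphones attenuate the environment at that frequency
    return freqs, mag_db(x_open) - mag_db(x_headphones)
```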
Data Acquisition – Impact of Headphones • Environmental Sound Attenuation by Headphones • Directional attenuation – measured in a reflectionless (anechoic) sound booth • Real attenuation in the recording room
Speech Production under Lombard Effect • Speech Features Affected by LE • Vocal tract excitation: glottal pulse shape changes, fundamental frequency rises • Vocal tract transfer function: center frequencies of low formants increase, formant bandwidths reduce • Vocal effort (intensity) increases • Other: voiced phonemes are prolonged, the voiced/unvoiced energy ratio increases, …
Analysis of Speech Features under LE – Formant Bandwidths • SPEECON, CZKCC: no consistent bandwidth changes • CLSD'05: significant bandwidth reduction in many voiced phonemes
Analysis of Speech Features under LE – Phoneme Durations • Significant increase in duration of some phonemes, especially voiced ones • Some unvoiced consonants – duration reduction • Duration changes in CLSD'05 considerably exceed those in SPEECON and CZKCC
Lombard Effect – Initial ASR Experiments • Digit Recognizer • Monophone HMM models • 13 MFCC + ∆ + ∆∆ • 32 Gaussian mixtures per model state • ASR Evaluation – WER (Word Error Rate) • S – word substitutions • D – word deletions • I – word insertions
[Results: noisy car recordings (SPEECON/CZKCC – 10.7/12.6 dB SNR) vs. clean LE recordings (CLSD'05 – 40.9 dB SNR)]
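For reference, WER as used here is the Levenshtein edit distance between the reference and recognized word sequences (substitutions + deletions + insertions) divided by the number of reference words; a minimal sketch, with an interface assumed for illustration:

```python
# WER = (S + D + I) / N, via dynamic-programming alignment of word lists.
def wer(reference, hypothesis):
    R, H = len(reference), len(hypothesis)
    # d[i][j] = minimum edit cost of aligning the first i reference words with the first j hypothesis words
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i          # deletions only
    for j in range(H + 1):
        d[0][j] = j          # insertions only
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[R][H] / R   # percent

# Example: one substitution out of four reference words -> 25 % WER
print(wer("call home two five".split(), "call home too five".split()))
```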
LE Suppression in ASR – Model Adaptation • Model Adaptation • Often effective when only limited data from the given conditions are available • Maximum Likelihood Linear Regression (MLLR) – if there is a limited amount of data per class, acoustically close classes are grouped and transformed together • Maximum a posteriori approach (MAP) – initial models are used as informative priors for the adaptation • Adaptation Procedure • First, neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (binary regression tree) • Second, MAP adaptation – only for nodes with a sufficient amount of adaptation data
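The two updates in their standard textbook form (notation assumed here, not taken from the slides): MLLR shares one affine transform of the Gaussian means across a regression class, while MAP interpolates each mean between its neutral prior and the adaptation data.

```latex
% MLLR: Gaussians in one regression class share an affine mean transform (A, b)
\hat{\boldsymbol{\mu}}_m = \mathbf{A}\,\boldsymbol{\mu}_m + \mathbf{b}

% MAP: gamma_m(t) ... occupation probability of Gaussian m at time t,
% tau ... prior weight controlling how fast the mean moves away from the prior
\hat{\boldsymbol{\mu}}_m^{\mathrm{MAP}} =
  \frac{\tau\,\boldsymbol{\mu}_m + \sum_t \gamma_m(t)\,\mathbf{o}_t}
       {\tau + \sum_t \gamma_m(t)}
```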
LE Suppression in ASR – Model Adaptation • Adaptation Schemes • Speaker-independent adaptation (SI) – group dependent/independent • Speaker-dependent adaptation (SD) – to neutral/LE
LE Suppression in ASR – Data-Driven Design of Robust Features • Filter Bank Approach • Analysis of the importance of frequency components for ASR • Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and suppress disturbing components • Initial FB uniformly distributed on a linear scale – equal attention to all components • Consecutively, a single FB band is omitted → impact on WER? • Omitting bands carrying more information will result in a considerable WER increase • Implementation • MFCC front-end, mel scale replaced by a linear scale, triangular filters replaced by rectangular filters without overlap
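A sketch of this band-omission setup, assuming 20 rectangular, non-overlapping filters spaced uniformly on a linear scale (the band count follows the 1–20 range shown in the next figure; function name and sampling rate are illustrative assumptions):

```python
import numpy as np

def linear_rect_fbank(n_filt=20, nfft=512, fs=8000, omit=None):
    """Rectangular filter bank on a linear scale; 'omit' drops one band (0-based index)."""
    edges = np.linspace(0.0, fs / 2.0, n_filt + 1)              # band edges in Hz
    bin_freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    fbank = np.zeros((n_filt, len(bin_freqs)))
    for i in range(n_filt):
        fbank[i, (bin_freqs >= edges[i]) & (bin_freqs < edges[i + 1])] = 1.0
    if omit is not None:
        fbank = np.delete(fbank, omit, axis=0)                  # leave the selected band out
    return fbank

# Usage: in the MFCC-style front end, replace the mel filter bank with
# linear_rect_fbank(omit=k), retrain, and re-test to get WER as a function of k.
print(linear_rect_fbank(omit=0).shape)    # (19, 257): one band removed
```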
Data-Driven Design of Robust Features – Importance of Frequency Components
[Figure: information distribution across filter-bank bands 1–20 – WER when each band is omitted in turn, for neutral and LE speech]
Data-Driven Design of Robust Features – Importance of Frequency Components • Area of 1st and 2nd formant occurrence – highest portion of phonetic information; F1 more important for neutral speech, F1–F2 for LE speech recognition • Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff • Next step – how much of the low-frequency content should be omitted for LE ASR?
Data-Driven Design of Robust Features – Omitting Low Frequencies
[Figure: WER as a function of the filter-bank low cut-off, remaining bands 1–19, for neutral and LE speech]
Data-Driven Design of Robust Features – Omitting Low Frequencies • Effect of Omitting Low Spectral Components • Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech • Optimal low cut-off found at 625 Hz
Data-Driven Design of Robust Features – Increasing Filter Bank Resolution • Increasing Frequency Resolution • Idea – emphasize the high-information portion of the spectrum by increasing FB resolution there • Experiment – FB decimation from 19 to 12 bands (decreasing computational costs) • Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %) • Slight F1–F2 shifts due to LE affect the cepstral features • No simple recipe on how to derive an efficient FB from the information distribution curves
[Figure: decimated filter bank with increased resolution around the information peak; low cut-off at 625 Hz]