How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik, Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, the Netherlands
Overview • Contents : • Variation, invariance problem • ASR : Automatic Speech Recognition • HSR : Human Speech Recognition • ESR : Episodic Speech Recognition
Invariance problem (1) • One of the main issues in speech recognition is the large amount of variability present in speech. • SRIV2006: ITRW on Speech Recognition and Intrinsic Variation • Invariance problem: • Variation in stimuli, invariant percept • Also visual, tactile, etc. • Studied in many fields, no consensus • 2 paradigms • Invariant • Episodic
Invariance problem (2) • Example 1: Speech • Dutch word: “natuurlijk” (naturally, ‘of course’) • [natyrlək] • [natylək] • … • [tyk] • Multiword expressions (MWEs): • a lot of reduction • many variants
Invariance problem (3) • Example 2: Writing (vision) • [Figure: the word ‘natuurlijk’ rendered in various fonts and handwriting styles] • Familiar ‘styles’ (fonts, handwriting) • are recognized better
ASR - Paradigm • Invariant, symbolic approach: • utterance → sequence of words → sequence of phonemes → sequence of states → parametric description: pdfs / ANNs
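As a concrete illustration of the ‘parametric description’ at the bottom of this hierarchy, here is a minimal sketch of the per-frame emission score an HMM state evaluates, assuming diagonal-covariance Gaussians (an illustrative toy, not a full HMM):

```python
import numpy as np

def log_gaussian(frame, mean, var):
    """Log-density of a diagonal-covariance Gaussian state pdf,
    summed over the feature dimensions of one frame."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frame - mean) ** 2 / var)
```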
ASR - Paradigm • Same paradigm (HMMs) since the 1970s • Assumptions: incorrect or questionable • Insufficient performance • ASR vs. HSR: ASR error rates are 8-80x higher • Slow progress (ceiling effect?) • Simply using more and more data is not sufficient (Moore, 2001) A new paradigm is needed! However, only a few attempts have been made
HSR - Indexical information • Speech - 2 types of information : • Verbal info. : what, contents • Indexical info. : how, form e.g. environmental and speaker-specific aspects (pitch, loudness, speech rate, voice quality)
HSR - Indexical information • Traditional ASR model: • Verbal information is used • Indexical information • Noise, disturbances • Preprocessing: • Strip off • Normalization (VTLN, MLLR, etc.) • And in HSR?
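For illustration, a minimal sketch of one such ‘strip off’ step: mean-variance normalization (MVN) of a feature trajectory. VTLN and MLLR are more involved and are not shown; the function name and the 1e-8 guard are assumptions:

```python
import numpy as np

def mvn(traj):
    """Per-dimension mean-variance normalization of a trajectory
    (frames x features); small constant guards against zero variance."""
    return (traj - traj.mean(axis=0)) / (traj.std(axis=0) + 1e-8)
```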
HSR - Indexical information • HSR : Strip off indexical information? • No! • Familiar voices and accents : • recognize and mimic • Indexical information • is perceived and encoded
HSR - Indexical information • Verbal & indexical information : • processed independently? • No! • Familiar ‘voices’ are recognized better • Facilitation, also with ‘similar’ speech
HSR - Indexical and detailed information • Experimental results: • indexical information and • fine phonetic detail (Hawkins et al.) • influence perception • Difficult to explain / integrate in the traditional, invariant model • New models: episodic models, • for auditory and visual perception
ESR - Basic idea • A new paradigm for ASR is needed: • An episodic model !!?? • Training : • Store trajectories - (representatives of) episodes • Recognition : • Calculate distance between X and sequences of stored trajectories (DTW) • Take the one with minimum distance : the recognized word
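A minimal sketch of this store-and-compare loop in Python/NumPy. Representing memory as a list of (word, trajectory) pairs is an illustrative assumption, not a prescribed data structure:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two trajectories (n_frames x n_features)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(x, memory):
    """memory: list of (word, trajectory) episodes;
    return the word whose stored trajectory is closest to X under DTW."""
    return min(memory, key=lambda episode: dtw_distance(x, episode[1]))[0]
```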
ESR – Invariant vs. episodic • Phone-based HMM vs. ESR: • Unit: phone vs. syllable, word, … • Representation: states (pdfs or ANN) vs. trajectories • Compare: trajectory (X) & states vs. trajectory (X) & trajectories • Parsimonious representation vs. extensive representation • Complex mapping vs. simple mapping • ‘Variation is noise’ vs. variation contains information • Normalization vs. use the variation
Representation • pdfs (Gaussians): much detail and dynamic information is lost • Trajectories: details are preserved • [Figure: phone ‘aj’ from ‘nine’, split into 3 parts: aj(, aj|, aj); X = begin]
Unit: phone(me) • Switchboard (Greenberg et al.): • deletion: 25% of the phones • substitution: 30% of the phones • together 55%!! • Difficult for a model based on ‘sequences of phones’. • Syllables: less than 1% deleted • Phonetic transcriptions and their evaluation : • Large differences between humans • What is the ‘golden reference’? • Speech – a sequence of symbols?
Unit: Multiword expressions (MWEs) • MWEs (see poster): • A lot of reduction; many phonemes deleted or substituted • Many variants (= sequences of phonemes): more than 90 for the 2 MWEs studied • Difficult to handle in ASR systems with current methods for pronunciation variation modeling. • Reduction, e.g. for a MWE: 4 words with 7 syllables reduced to ‘1 entity’ with 2 syllables What should be stored? Units of various lengths?
An episodic approach for ASR • Advantages: • More information during search: dynamic, indexical, fine phonetic detail • Continuity constraints can be used (reduces the trajectory folding problem) • Model is simpler • Disadvantage: • More information during search: complexity • Brain: a lot of storage and ‘CPU’ • Computers: more and more powerful
An episodic approach for ASR • Strik (2003) ITC-irst, Trento, Italy; ICPhS, Barcelona • De Wachter et al. (2003) Interspeech-2003 • Axelrod & Maison (2004) ICASSP-2004 • Maier & Moore (2005) Interspeech-2005 • Aradilla, Vepa, Bourlard (2005) Interspeech-2005 • Matton, De Wachter, et al. (2005) SPECOM-2005 • Promising results • The computing power and memory that are needed to investigate the episodic approach to speech recognition are (becoming) available
The HSR-ASR gap • HSR & ASR – 2 different communities • Different people, departments, journals, terminology, goals, methodologies • Goals, evaluation • HSR: simulate experimental findings • ASR: reduce WER
The HSR-ASR gap • Marr (1982) – 3 levels of modeling: • Computational • Algorithmic • Implementational • HSR: (larger) differences at the higher levels • ASR: implementations, end-to-end models using real speech signals as input • Thousands of experiments: WER has been gradually reduced • However, essentially the same model • A new model has to compete on performance (WER), funding, etc.
The HSR-ASR gap - bridge • Use same evaluation metric for HSR & ASR systems: reaction times (Cutler & Robinson, 1992) • Use knowledge or components from the other field (Scharenborg et al., 2003). • Use models that are suitable for HSR & ASR research • Evaluation from HSR & ASR point of view • S2S – Sound to Sense (Sarah Hawkins) • Marie Curie Research Training Network (MC-RTN) • Recently approved by the EU
Episodic speech recognition THE END
ESRASA model • ESRASA: Episodic Speech Recognition And Structure Acquisition • The ESRASA model is inspired by several previous models, especially • the model described in Johnson (1997), • WRAPSA (Jusczyk, 1993), and • the GCM (Nosofsky, 1986) • The ESRASA model is a feedforward neural network with two sets of weights: atTention weights T_n and assoCiation weights C_ew. Besides these two sets of weights, words, episodes (for speech units), and their base activation levels (B_w and B_e, respectively) are stored in memory.
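For intuition, a sketch of attention-weighted exemplar similarity in the spirit of the GCM, one of the models ESRASA draws on. The city-block form and the names attention_T and sensitivity_c are illustrative assumptions, not the ESRASA definitions:

```python
import numpy as np

def gcm_similarity(x, episode, attention_T, sensitivity_c=1.0):
    """GCM-style similarity between stimulus x and a stored episode:
    s = exp(-c * sum_n T_n * |x_n - e_n|)  (city-block variant)."""
    d = np.sum(attention_T * np.abs(x - episode))  # attention-weighted distance
    return np.exp(-sensitivity_c * d)
```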
ESR – Recognition • X → L items in lexicon → Preselection → S items in subset → Competition → 1 item, the winner
ESR – Preselection • Why preselection? • Reduce CPU & memory • Increase performance • Also used in DTW-based pattern recognition applications • Used in many HSR models
ESR – Competition • Recognize unknown word X: • Calculate distance between X and sequences of stored episodes (DTW) • Take the one with minimum distance: the recognized word • Use continuity constraints (as in TTS)
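Putting the two stages together, an illustrative two-stage recognizer (reusing dtw_distance from the earlier sketch; preselect stands in for whichever preselection method is chosen):

```python
def recognize_two_stage(x, lexicon, preselect, S=50):
    # Stage 1: cheap preselection prunes L lexicon items down to S candidates.
    subset = preselect(x, lexicon, S)
    # Stage 2: full DTW competition among the S candidates; minimum distance wins.
    return min(subset, key=lambda item: dtw_distance(x, item[1]))[0]
```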
ESR – Research: Preselection? • Best method? • Compare: • kNN – k nearest neighbors • Lower-bound distance: D_lb ≤ D_dtw • Build an index for the lexicon • Is preselection needed? • Compare: with & without preselection
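One concrete possibility, sketched under assumptions (the slides leave the exact bound open): an LB_Kim-style lower bound combined with exact kNN search, reusing dtw_distance from the earlier sketch. Because any DTW path must align the first frames and the last frames, neither of those frame distances can exceed the full DTW distance:

```python
import numpy as np

def lb_kim(a, b):
    """Cheap lower bound on DTW: max of the first-frame and
    last-frame distances, so D_lb <= D_dtw always holds."""
    first = np.linalg.norm(a[0] - b[0])
    last = np.linalg.norm(a[-1] - b[-1])
    return max(first, last)

def knn_with_pruning(x, memory, k=1):
    """Exact k-NN under DTW; skip episodes whose lower bound already
    exceeds the current k-th best distance."""
    best = []  # sorted list of (distance, word)
    for word, traj in memory:
        if len(best) == k and lb_kim(x, traj) >= best[-1][0]:
            continue  # bound proves this episode cannot enter the top k
        d = dtw_distance(x, traj)
        best = sorted(best + [(d, word)])[:k]
    return best
```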
ESR – Research: Units for preselection? • Compare: • Syllable • Word • Beginning (window of fixed length)
ESR - Research: Units for competition? • Compare: • Syllables • Words • In combination with multisyllables? • Multisyllables (reduction, resyllabification): • Ik weet het niet (‘I don’t know’) -> kweeni • Op een gegeven moment (‘at some point’) -> pgeefment • Zeven-en (‘seven and’) -> ze-fnen
ESR - Research: Exemplars? • How to select exemplars: • DTW distances + hierarchical clustering • VQ: LVQ & K-means • Trade-off between normalization & lexicon size • Compare normalization techniques: • TDNR, MVN, HN • VTLN
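A rough sketch of one of the listed options, K-means exemplar selection. Resampling trajectories to a fixed length so they live in one vector space is an assumption made for illustration:

```python
import numpy as np

def resample(traj, n_frames=20):
    """Resample a (frames x features) trajectory to a fixed-length vector."""
    idx = np.linspace(0, len(traj) - 1, n_frames).round().astype(int)
    return traj[idx].ravel()

def select_exemplars(trajs, k=5, iters=20, seed=0):
    """Cluster the resampled episodes with K-means and keep, per cluster,
    the real episode closest to the centroid as the stored exemplar."""
    X = np.stack([resample(t) for t in trajs])
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(0)
    return [trajs[np.argmin(((X - c) ** 2).sum(-1))] for c in centroids]
```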
ESR - Research: Features? • Compare: • Spectral features: MFCC, PLP, LPC • Articulatory features (ANN) • Combine spectral & articulatory features • Different features for preselection & competition?
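As an example of the spectral option, MFCC extraction via librosa (an assumed tool choice; PLP, LPC, or articulatory features would plug into the same interface):

```python
import librosa

def mfcc_trajectory(wav_path, n_mfcc=13):
    """Load a waveform and return its MFCC trajectory
    as a (frames x coefficients) array, ready for DTW."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T
```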
ESR - Research: Distance metrics? • Compare (frame-based metrics): • Euclidean • Mahalanobis • Itakura (for LPC) • Perceptually-based? • Distance metric for trajectories?
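Sketches of the first two frame-based metrics (the Mahalanobis version assumes an inverse covariance matrix estimated on training frames; Itakura and perceptual metrics are not shown):

```python
import numpy as np

def euclidean(f1, f2):
    """Plain Euclidean distance between two feature frames."""
    return float(np.linalg.norm(f1 - f2))

def mahalanobis(f1, f2, cov_inv):
    """Mahalanobis distance; cov_inv is the inverse covariance
    of the features, estimated on training frames."""
    d = f1 - f2
    return float(np.sqrt(d @ cov_inv @ d))
```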
HMM-based ASR – Information sources • HMM-based ASR, roughly 3 ways: • Class-specific HMMs • Multistream • 2-pass decoding Disadvantages (respectively): • Many classes • Synchronization & recombination • Pass 1: no / less knowledge
ESR - Research: Information sources • ESR: compare 2 trajectories • All details are available during search, e.g. context & dynamic information • Compare shape + timing of feature contours • F0 rise: early or final, half or complete • Tags can be added to the lexicon • + continuity constraints
HSR - Foreign English Examples • Conversation about Italy [FEE 1: audio example] • “By parachute? I was robbed in Milan.” (dropped / robbed)
HSR - Indexical information • HSR : Strip off indexical information? • No! • Familiar voices and accents : • recognize and mimic [ FEE 2 ] • Indexical information • is perceived and encoded
HSR - Indexical information • Verbal & indexical information : • processed independently? No! • Familiar ‘voices’ are recognized better • [ FEE 3 ] • Facilitation, also with ‘similar’ speech • [ FEE 4 ]
ASR - Pronunciation variation • SRIV2006: • ITRW on Speech Recognition and Intrinsic Variation • Pronunciation variation modeling for ASR : • Improvements, but generally small • Current ASR paradigm : suitable? • Phonetic transcriptions and their evaluation : • Large differences between humans • What is the ‘golden reference’? • Speech – a sequence of symbols?