Presentation Transcript


  1. How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik, Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, the Netherlands

  2. Overview • Contents: • Variation, the invariance problem • ASR: Automatic Speech Recognition • HSR: Human Speech Recognition • ESR: Episodic Speech Recognition

  3. Invariance problem (1) • One of the main issues in speech recognition is the large amount of variability present in speech. • SRIV2006: ITRW on Speech Recognition and Intrinsic Variation • Invariance problem: • Variation in stimuli, invariant percept • Also visual, tactile, etc. • Studied in many fields, no consensus • 2 paradigms: • Invariant • Episodic

  4. Invariance problem (2) • Example 1: Speech • Dutch word: “natuurlijk” (naturally, ‘of course’) • [natyrlək] • [natylək] • … • [tyk] • Multiword expressions (MWEs): • a lot of reduction • many variants

  5. Invariance problem (3) • Example 2: Writing (vision) • [Figure: the word ‘natuurlijk’ rendered in six different fonts and handwriting styles] • Familiar ‘styles’ (fonts, handwriting) are recognized better

  6. ASR - Paradigm • Invariant, symbolic approach: • utterance • sequence of words • sequence of phonemes • sequence of states • parametric description: pdf's / ANN

  7. ASR - Paradigm • Same paradigm (HMMs) since the 1970s • Assumptions: incorrect or questionable • Insufficient performance • ASR vs. HSR: error rates 8-80x higher • Slow progress (ceiling effect?) • Simply using more and more data is not sufficient (Moore, 2001) → A new paradigm is needed! However, there have been only a few attempts

  8. HSR - Indexical information • Speech carries 2 types of information: • Verbal information: what is said, the content • Indexical information: how it is said, the form, e.g. environmental and speaker-specific aspects (pitch, loudness, speech rate, voice quality)

  9. HSR - Indexical information • Traditional ASR model: • Verbal information is used • Indexical information is treated as noise / disturbance • Preprocessing: • Strip it off • Normalization (VTLN, MLLR, etc.) • And in HSR?

  10. HSR - Indexical information • HSR: Strip off indexical information? • No! • Familiar voices and accents: • recognize and mimic • Indexical information is perceived and encoded

  11. HSR - Indexical information • Verbal & indexical information: • processed independently? • No! • Familiar ‘voices’ are recognized better • Facilitation, also with ‘similar’ speech

  12. HSR - Indexical and detailed information • Experimental results: • indexical information and • fine phonetic detail (Hawkins et al.) • influence perception • Difficult to explain / integrate in the traditional, invariant model • New models: episodic models, • for auditory and visual perception

  13. ESR - Basic idea • A new paradigm for ASR is needed: an episodic model!? • Training: • Store trajectories – (representatives of) episodes • Recognition: • Calculate the distance between X and sequences of stored trajectories (DTW) • Take the one with minimum distance: the recognized word
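
  A minimal sketch of this train/recognize cycle, with whole-word episodes and Euclidean frame distances (all function names are illustrative, not from the talk):

      import numpy as np

      def dtw(x, y):
          """Classic dynamic time warping distance between two trajectories
          (arrays of shape: frames x feature dimensions)."""
          n, m = len(x), len(y)
          D = np.full((n + 1, m + 1), np.inf)
          D[0, 0] = 0.0
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  cost = np.linalg.norm(x[i - 1] - y[j - 1])  # Euclidean frame distance
                  D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
          return D[n, m]

      memory = []  # stored episodes: (word label, feature trajectory)

      def store_episode(word, trajectory):
          # "Training" in the episodic view: just keep the episode in memory.
          memory.append((word, np.asarray(trajectory, dtype=float)))

      def recognize(x):
          # "Recognition": the label of the stored episode nearest to X under DTW.
          x = np.asarray(x, dtype=float)
          return min(memory, key=lambda ep: dtw(x, ep[1]))[0]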

  14. ESR – Invariant vs. episodic

                        phone-based HMM              | ESR
      -----------------------------------------------------------------------------
      Unit:             phone                        | syllable, word, …
      Representation:   states (pdf's or ANN)        | trajectories
      Compare:          trajectory (X) & states      | trajectory (X) & trajectories
                        parsimonious representation  | extensive representation
                        complex mapping              | simple mapping
                        ‘variation is noise’         | variation contains information
                        normalization                | use variation

  15. Representation • pdf's (Gaussians) → much detail and dynamic information is lost • Trajectories: details are kept • [Figure: the phone ‘aj’ from ‘nine’; X = beginning; 3 parts: aj(, aj|, aj)]

  16. Unit: phone(me) • Switchboard (Greenberg et al.): • deletion: 25% of the phones • substitution: 30% of the phones • together 55%!! • Difficult for a model based on ‘sequences of phones’. • Syllables: less than 1% deleted • Phonetic transcriptions and their evaluation: • Large differences between humans • What is the ‘golden reference’? • Speech – a sequence of symbols?

  17. Unit: Multiword expressions (MWEs) • MWEs (see poster): • A lot of reduction; many phonemes deleted or substituted • Many variants (= sequences of phonemes): more than 90 for the 2 MWEs studied • Difficult to handle in ASR systems with current methods for pronunciation variation modeling. • Reduction, e.g. for a MWE: 4 words with 7 syllables reduced to ‘1 entity’ with 2 syllables → What should be stored? Units of various lengths?

  18. An episodic approach for ASR • Advantages: • More information is available during the search: dynamic, indexical, fine phonetic detail • Continuity constraints can be used (reduces the trajectory-folding problem) • The model is simpler • Disadvantage: • More information during the search: complexity • The brain has a lot of storage and ‘CPU’ • Computers: more and more powerful

  19. An episodic approach for ASR • Strik (2003) ITC-irst, Trento, Italy; ICPhS, Barcelona • De Wachter et al. (2003) Interspeech-2003 • Axelrod & Maison (2004) ICASSP-2004 • Maier & Moore (2005) Interspeech-2005 • Aradilla, Vepa, Bourlard (2005) Interspeech-2005 • Matton, De Wachter, et al. (2005) SPECOM-2005 • Promising results • The computing power and memory that are needed to investigate the episodic approach to speech recognition are (becoming) available

  20. The HSR-ASR gap • HSR & ASR – 2 different communities • Different people, departments, journals, terminology, goals, methodologies • Goals, evaluation • HSR: simulate experimental findings • ASR: reduce WER

  21. The HSR-ASR gap • Marr (1982) – 3 levels of modeling: • Computational • Algorithmic • Implementational • HSR: (larger) differences at the higher levels • ASR: implementations, end-to-end models using real speech signals as input • Thousands of experiments: WER has been gradually reduced • However, essentially the same model • New model: performance (WER), funding, etc.

  22. The HSR-ASR gap - bridge • Use same evaluation metric for HSR & ASR systems: reaction times (Cutler & Robinson, 1992) • Use knowledge or components from the other field (Scharenborg et al., 2003). • Use models that are suitable for HSR & ASR research • Evaluation from HSR & ASR point of view • S2S – Sound to Sense (Sarah Hawkins) • Marie Curie Research Training Network (MC-RTN) • Recently approved by the EU

  23. Episodic speech recognition • THE END

  24. ESRASA model

  25. ESRASA model • ESRASA = Episodic Speech Recognition And Structure Acquisition • The ESRASA model is inspired by several previous models, especially • the model described in Johnson (1997), • WRAPSA (Jusczyk, 1993), and • the GCM (Nosofsky, 1986) • The ESRASA model is a feedforward neural network with two sets of weights: atTention weights Tn and assoCiation weights Cew. Besides these two sets of weights, words, episodes (for speech units), and their base activation levels (Bw and Be, respectively) will be stored in memory.
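
  The slide gives no equations, but the GCM component suggests an exemplar activation rule along these lines. A hedged sketch, one plausible reading in which episodes are simplified to fixed-length vectors, and T, C, and B_e stand for the attention weights, association weights, and base activation levels named above:

      import numpy as np

      def similarity(x, episode, T, c=1.0):
          # GCM-style similarity: exp(-c * attention-weighted city-block distance)
          return np.exp(-c * np.sum(T * np.abs(x - episode)))

      def word_activations(x, episodes, B_e, C, T):
          # Episode similarities, scaled by base activations B_e, then spread to
          # words through the association weights C (shape: episodes x words).
          s = np.array([similarity(x, e, T) for e in episodes]) * B_e
          return s @ C  # one activation value per word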

  26. ESR Recognition • X → L items in lexicon → Preselection → S items in subset → Competition → 1 item, the winner

  27. ESR Preselection • Why preselection? • Reduce CPU & memory load • Increase performance • Also used in DTW-based pattern recognition applications • Used in many HSR models

  28. ESR Competition • Recognize unknown word X: • Calculate the distance between X and sequences of stored episodes (DTW) • Take the one with minimum distance: the recognized word • Use continuity constraints (as in TTS)

  29. ESR DTW: Dynamic Time Warping
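
  The original slide showed a DTW alignment figure. As a stand-in, here is a DTW variant with a Sakoe-Chiba band, one standard way to impose the timing and continuity constraints mentioned on slides 18 and 28 (the band width is an illustrative choice, not from the talk):

      import numpy as np

      def dtw_band(x, y, band=10):
          # DTW restricted to cells within |i - j| <= band around the diagonal;
          # assumes the two trajectories have roughly comparable lengths.
          n, m = len(x), len(y)
          D = np.full((n + 1, m + 1), np.inf)
          D[0, 0] = 0.0
          for i in range(1, n + 1):
              for j in range(max(1, i - band), min(m, i + band) + 1):
                  cost = np.linalg.norm(x[i - 1] - y[j - 1])
                  D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
          return D[n, m]  # stays inf if the band cannot reach (n, m)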

  30. ESR – Research: Preselection? • Best method? • Compare: • kNN – k nearest neighbours • Lower-bound distance: D_lb ≤ D_DTW • Build an index for the lexicon • Is preselection needed? • Compare: with & without preselection
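
  One way to read “D_lb ≤ D_DTW”: prune lexicon entries with a cheap lower bound before running full DTW. The slide does not name a specific bound; the endpoint bound below is an illustrative one (every warping path must align the first frames and the last frames, so for trajectories of at least two frames those two frame distances already lower-bound the full DTW cost):

      import numpy as np

      def lower_bound(x, y):
          # Cheap lower bound on dtw(x, y): the first and last frames of the
          # two trajectories are aligned by every warping path.
          return np.linalg.norm(x[0] - y[0]) + np.linalg.norm(x[-1] - y[-1])

      def preselect(x, lexicon, best_so_far):
          # Keep only entries whose lower bound could still beat the best
          # full DTW distance found so far.
          return [(w, y) for (w, y) in lexicon if lower_bound(x, y) < best_so_far]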

  31. ESR – Research: Units for preselection? • Compare: • Syllable • Word • Beginning (window of fixed length)

  32. ESR - Research: Units for competition? • Compare: • Syllables • Words • In combination with multisyllables? • Multisyllables (reduction, resyllabification): • Ik weet het niet (‘I don’t know’) -> kweeni • Op een gegeven moment (‘at some point’) -> pgeefment • Zeven-en (‘seven-and’, as in compound numerals) -> ze-fnen

  33. ESR - Research: Exemplars? • How to select exemplars: • DTW distances + hierarchical clustering • VQ: LVQ & K-means • Trade-off between normalization & lexicon size • Compare normalization techniques: • TDNR, MVN, HN • VTLN
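
  A sketch of the first option, “DTW distances + hierarchical clustering”: cluster the training tokens of a word on their pairwise DTW distances and keep one medoid per cluster as the stored exemplar (dtw is any DTW function, e.g. the one sketched under slide 13; k is an illustrative choice):

      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster
      from scipy.spatial.distance import squareform

      def select_exemplars(tokens, dtw, k=3):
          n = len(tokens)
          D = np.zeros((n, n))
          for i in range(n):
              for j in range(i + 1, n):
                  D[i, j] = D[j, i] = dtw(tokens[i], tokens[j])
          # Average-linkage clustering on the condensed distance matrix,
          # cut into k clusters.
          labels = fcluster(linkage(squareform(D), method='average'),
                            t=k, criterion='maxclust')
          exemplars = []
          for c in np.unique(labels):
              idx = np.where(labels == c)[0]
              # The medoid: the token closest on average to its cluster mates.
              medoid = idx[np.argmin(D[np.ix_(idx, idx)].sum(axis=1))]
              exemplars.append(tokens[medoid])
          return exemplars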

  34. ESR - Research: Features? • Compare: • Spectral features: MFCC, PLP, LPC • Articulatory features (ANN) • Combine spectral & articulatory features • Different features for preselection & competition?
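
  For the spectral options, a front end might look like this; librosa is an assumed toolkit here (the talk prescribes none), and PLP or articulatory features would need other front ends:

      import librosa

      def mfcc_trajectory(path, n_mfcc=13):
          # Load audio and compute MFCCs; librosa returns (coefficients x frames),
          # so transpose to get a (frames x dimensions) trajectory for DTW.
          y, sr = librosa.load(path, sr=16000)
          return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T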

  35. ESR - Research: Distance metrics? • Compare (frame-based metrics): • Euclidean • Mahalanobis • Itakura (for LPC) • Perceptually based? • A distance metric for trajectories?
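
  The first two frame-level metrics as interchangeable functions (Sigma_inv is the inverse covariance of the features, e.g. estimated on training frames; the Itakura measure is omitted here because it operates on per-frame LPC models rather than generic feature vectors):

      import numpy as np

      def euclidean(a, b):
          return np.linalg.norm(a - b)

      def mahalanobis(a, b, Sigma_inv):
          # Euclidean distance after whitening by the feature covariance.
          d = a - b
          return float(np.sqrt(d @ Sigma_inv @ d))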

  36. HMM-based ASR – Information sources • HMM-based ASR, roughly 3 ways: • Class-specific HMMs • Multistream • 2-pass decoding • Disadvantages (respectively): • Many classes • Synchronization & recombination • Pass 1: no / less knowledge

  37. ESR - Research: Information sources • ESR: compare 2 trajectories • All details are available during the search, e.g. context & dynamic information • Compare shape + timing of feature contours • F0 rise: early or final, half or complete • Tags can be added to the lexicon • + continuity constraints

  38. HSR - Foreign English Examples • Conversation about Italy [ FEE 1 ]: “By parachute? I was robbed in Milan.” (dropped / robbed)

  39. HSR - Indexical information • HSR: Strip off indexical information? • No! • Familiar voices and accents: • recognize and mimic [ FEE 2 ] • Indexical information is perceived and encoded

  40. HSR - Indexical information • Verbal & indexical information: • processed independently? No! • Familiar ‘voices’ are recognized better • [ FEE 3 ] • Facilitation, also with ‘similar’ speech • [ FEE 4 ]

  41. ASR - Pronunciation variation • SRIV2006: • ITRW on Speech Recognition and Intrinsic Variation • Pronunciation variation modeling for ASR: • Improvements, but generally small • Is the current ASR paradigm suitable? • Phonetic transcriptions and their evaluation: • Large differences between humans • What is the ‘golden reference’? • Speech – a sequence of symbols?
