S2S Nijmegen Workshop February 10-14, 2008

Combining Abstract and Exemplar Models in ASR
Dirk Van Compernolle, Kris Demuynck, Mathias De Wachter

Overview
• PART I: Example based models in ASR
  • Motivation
  • Proof-of-Concept
  • Baseline Results
  • Required Extensions
• PART II: Bottom-up vs. Top-down processing in ASR
  • Do we care?
  • A top-down search engine with bottom-up phonetic scoring
  • A combined template matching and HMM recognizer
Dirk Van Compernolle, Combining Abstract & Exemplar Models in ASR

PART I: Example Based ASR

Example Based ASR
• Example based ASR was successful in Speaker-Dependent Isolated Word Recognition. It was abandoned when technology moved to continuous speaker-independent recognition.
Why re-activate an approach that quietly died 25 years ago?
• Psycho-linguistics and intuition give evidence of the existence of individual memory traces (spanning many phonemes) in:
  • human speech recognition in general
  • music/song memory & recognition
  • second language learning
• Success of concatenative Text-to-Speech
• Acknowledgement of the limitations of model based (HMM based) ASR
• Computing demands for continuous large vocabulary recognition were essential bottlenecks, which may no longer be relevant today.

Today’s Prototypical HMM based ASR: Beads-on-a-String Model

Phone Modeling with HMMs
[Figure: MANY EXAMPLES of phone ‘j’ (ph(j)_1, ph(j)_2, …, ph(j)_Nj), given as short-time spectral representations → TRAINING → MULTI-STATE MODEL for phone ‘j’ (states Sj1, Sj2, Sj3) with means + variances + duration model]

Iterative HMM Training
[Figure: flow diagram with blocks: speech database, Feature Extraction, word level transcription, dictionary, phone set, words-to-phones, phone level transcription, reference HMM, Viterbi Alignment, State (sub-phone) Segmentation, Re-estimation, Phone HMMs]

HMM Model Building
• Based on a 2-D LDA projection of mel-cepstra optimized for digit recognition
• ‘S1, S2, S3’ represent the 3 CD HMM-states of the central vowel in “f I ve”
• Ellipses indicate the ‘1-sigma’ boundaries

HMMs - Strengths
• Strong mathematical framework
  • statistical pattern matching / Bayesian Classification
  • optimal strategy under the assumption of a perfect model with sufficient training data !!
• Fully automatic training (inner loop)
• Ability to (optimally) combine the information from thousands of hours of example speech
• Highly scalable: more data leads to better results
  • allows for training a more refined model with more parameters that gets closer to reality (model assumptions)
  • a better trained model that is more robust to intrinsic variability

HMMs - Weaknesses
• Model is intrinsically flawed, because of:
  • within state (i.e. short-term) stationarity assumption
  • 1st order Markov assumption: state independence
  • presumed frame by frame independence
• This implies
  • no guaranteed optimality for Bayesian Classification / Maximum Likelihood paradigm
  • continuous effort to improve (patch) the model
  • best performance with discriminative training procedures
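
For reference, the three assumptions listed above are visible in the standard factorization of the HMM joint likelihood. The notation is textbook notation and not taken from the slides: x_t is the frame at time t, q_t the state at time t (q_0 a dummy start state), a_ij a transition probability and b_j the observation density of state j.

```latex
p(x_1,\dots,x_T,\; q_1,\dots,q_T) \;=\; \prod_{t=1}^{T} a_{q_{t-1} q_t}\; b_{q_t}(x_t)
```

Each transition factor depends only on the previous state (1st order Markov), each observation factor depends only on the current state (frame by frame independence given the states), and b_j does not change while the model stays in state j (within state stationarity).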

HMMs: 30 yrs of improvements on the basic model
“If the model were correct, then nothing would be better than HMMs and Maximum Likelihood Training. So, let’s stick to the concept and fix the model”
• Multi-state Context-Dependent models
• Multi-gaussian modeling of the observation densities
• Derivative Features
“For this we only need bigger computers and more data that allow us”
• To make these complex models with more degrees of freedom
• To do a proper training of these hundreds of millions of parameters
• To perform recognition with them in real-time

HMMs: 30 yrs of improvements on the basic model
… Then we will reach nirvana, unless …
… after 30 yrs the model is still basically flawed
• because of poor segmental modeling
… more training data does not seem to result in better models any longer
• because the gains seem to grow only logarithmically with the amount of training data
• because for smaller languages more data is just not feasible
… so, today computers have more power than we know what to do with

Example Trajectories and HMM states
• Trajectories contain more information than the HMM state sequence !!
• Trajectories show a very different picture than the ‘cloud’ of points underlying HMM state training

Aligning of individual trajectories to HMMs

HMM vs. Segmental
[Figure: a red and a black trajectory through HMM states S1 and S2]
• HMM viewpoint: the red and black sequences of observations yield identical scores
• Segmental viewpoint: the black trajectory is significantly more plausible than the red one
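
The HMM viewpoint can be checked numerically with a small sketch. The setup is an assumption made for illustration only: a single state with a 1-D Gaussian observation density, and two made-up frame sequences that contain the same values in a different order. The `path_roughness` function is a crude, hypothetical stand-in for a segmental measure and is not the authors' method.

```python
import math

def gaussian_log_likelihood(x, mean, variance):
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)

def state_score(frames, mean, variance):
    """HMM-style score: sum of per-frame log likelihoods within one state."""
    return sum(gaussian_log_likelihood(x, mean, variance) for x in frames)

def path_roughness(frames):
    """Crude trajectory measure: total frame-to-frame movement."""
    return sum(abs(b - a) for a, b in zip(frames, frames[1:]))

# Same frame values, different order (illustrative data only).
smooth = [0.0, 0.5, 1.0, 1.5, 2.0]
jagged = [1.5, 0.0, 2.0, 0.5, 1.0]

score_smooth = state_score(smooth, mean=1.0, variance=1.0)
score_jagged = state_score(jagged, mean=1.0, variance=1.0)
```

Because the state score is a sum over frames, it is blind to their order, so both sequences score the same. Any measure that looks at consecutive frames, such as the roughness above, separates them.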

Segmental Modeling in HMMs
• Segmental properties are obviously important within phonemes and across multiple phonemes
• HMMs lose this longer time-scale view despite the modifications made to the model over the years
• Attempts to make segmental statistical models have not been very successful so far
• Detailed trajectory properties were well preserved in the old template matching DTW systems
• …

Is example based large vocabulary continuous recognition a viable alternative to model state based (HMM) recognition?

Example Based ASR: Research Agenda
• Proof-of-concept phase
  • To build a baseline mid/large vocabulary system with medium sized databases
  • Show similar recognition performance to HMM systems
• Competitive phase
  • To build systems that can handle huge databases
  • Build systems that go beyond the naïve extrapolation of today’s HMMs
  • Improve on performance at acceptable cost

HMM vs. Exemplar

Example Based LVCSR: How? Baseline System
• Speech Database [ = “Memory” ]
  • same databases as used for training statistical systems
  • collection of long stretches of acoustic vectors
  • annotated at multiple levels: phone, syllable, word, …
  • any of the annotations (incl. segmentation) can serve as a “template”
• Recognition Paradigm
  • Find the sequence of templates that best matches a given input by using Dynamic Time Warping (DTW)
  • Use the ‘Template Transition Cost’ concept to control template transitions
• Borrow other components from existing HMM technology
  • Token passing Time Synchronous Beam search
  • N-gram language modeling
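
A minimal sketch of DTW between a single template and an input, to make the matching step concrete. Assumptions not taken from the slides: frames are plain numbers, the step pattern allows diagonal, horizontal and vertical moves with equal weight, and the local distance is passed in as a function. The baseline system described here searches over sequences of templates, which this sketch does not attempt.

```python
def dtw_distance(template, observed, local_distance):
    """Accumulated cost of the best warping path between two frame sequences."""
    n, m = len(template), len(observed)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning template[:i] with observed[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(template[i - 1], observed[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template advances alone
                                 cost[i][j - 1],      # input advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def absolute_difference(a, b):
    return abs(a - b)

# A time-stretched copy of the template aligns at zero cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3], absolute_difference)
shifted = dtw_distance([0, 0], [1, 1], absolute_difference)
```

The warping absorbs the repeated frame in the stretched input, while a constant offset in the frame values is still paid for on every step of the path.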

Aligning of trajectories
OBSERVE: the “closest matching template” is by no means the sequence of nearest neighbors for each frame

Issues in Example Based ASR: Local Distance Metric
• The utterance based distortion is the sum of local (frame based) distances
• One of the great advances of HMM systems was the use of more complex metrics than previously used in DTW
  • class (phone state) dependent
  • multi-gaussian distributions with many parameters
• It is possible to transfer some of the HMM improvements to the DTW framework, but not all and not in a trivial manner:
  • Local Mahalanobis distance
  • further improvements by applying other ideas from non-parametric statistics: outlier correction, data sharpening, adaptive kernel Mahalanobis, … [see papers ICASSP07, INTERSPEECH07]
• Weakness of (our) current system
  • score is based on a sequence of reference templates
  • From a KNN perspective the score should be based on group voting
  • … ongoing research
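
A small sketch of the first two ideas: a local Mahalanobis distance and an utterance distortion computed as the sum of local distances along an alignment path. Assumptions for illustration only: the covariance is diagonal and attached to each reference frame, the alignment path is given, and all numbers are made up. The refinements cited on the slide (outlier correction, data sharpening, adaptive kernels) are not covered.

```python
def local_mahalanobis(frame, reference, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance."""
    return sum((f - r) ** 2 / v for f, r, v in zip(frame, reference, variances))

def utterance_distortion(frames, references, variances, path):
    """Sum of local distances along a path of (input index, reference index) pairs."""
    return sum(local_mahalanobis(frames[i], references[j], variances[j])
               for i, j in path)

# Illustrative 2-D frames; each reference frame carries its own variances.
frames = [[1.0, 2.0], [2.0, 2.0]]
references = [[0.0, 0.0], [2.0, 0.0]]
variances = [[1.0, 4.0], [1.0, 1.0]]
path = [(0, 0), (1, 1)]

first_local = local_mahalanobis(frames[0], references[0], variances[0])
total = utterance_distortion(frames, references, variances, path)
```

The same deviation of 2.0 in the second dimension costs 1.0 against the first reference and 4.0 against the second, because the variance attached to each reference differs.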

[Figure: reference database of 15.3 hours of speech from 84 speakers; only 14 segments of 2 seconds relevant to the search are shown. Labels: Input, Solution. Phone string: # # I! t s tI! l @ n klI! r # # #]

[Figure: phone string # # I! t s tI! l @ n klI! r # # #. Panels: Concatenation of chosen templates; Concatenation + dynamic time warping (DTW); Input]

Controlling Template Concatenation
Using a concatenation cost model, based on:
• natural successor templates in the reference database
• phonetic context
• gender, accent, recording condition, …
has great impact on
→ selected segment length
→ naturalness of resynthesized reference
→ lowering the error
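
One way such a concatenation cost could look, as a sketch only. Everything specific here is hypothetical: the `Template` fields, the additive form of the cost and the penalty values are chosen for illustration and are not the cost model used in the experiments.

```python
from dataclasses import dataclass

# Hypothetical penalty values, chosen only for illustration.
JOIN_PENALTY = 1.0
CONTEXT_PENALTY = 0.5
GENDER_PENALTY = 0.5

@dataclass(frozen=True)
class Template:
    utterance: str   # reference recording the template was cut from
    position: int    # index of the template within that recording
    phone: str       # phone label of the template
    left_phone: str  # phone that preceded it in the recording
    gender: str      # speaker gender

def concatenation_cost(previous, following):
    """Cost of placing `following` directly after `previous`."""
    natural_successor = (previous.utterance == following.utterance
                         and following.position == previous.position + 1)
    if natural_successor:
        return 0.0  # continuing in the same recording is free
    cost = JOIN_PENALTY
    if following.left_phone != previous.phone:
        cost += CONTEXT_PENALTY  # phonetic context mismatch
    if following.gender != previous.gender:
        cost += GENDER_PENALTY   # speaker attribute mismatch
    return cost

a = Template("utt1", 3, "s", "t", "f")
b = Template("utt1", 4, "t", "s", "f")  # natural successor of a
c = Template("utt2", 7, "t", "s", "f")  # other recording, matching context
d = Template("utt3", 2, "t", "I", "m")  # other recording, mismatched context and gender
```

A zero cost for natural successors favors long stretches taken from one recording, which is consistent with the impact on selected segment length noted above.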

Experiments - Task descriptions
• TIMIT
  • Test and train sets hand-labeled
  • 1.6 hours of training material, 462 speakers
• Resource Management (RM)
  • 991 word lexicon: CMU v0.4 [train and test highly matched !]
  • 3.8 hours of noise-free training speech
  • 150,000 phone templates
• Wall Street Journal (WSJ0)
  • Automatically segmented and labeled by HMM system based on sentence transcription
  • 15.3 hours of training material, 84 speakers
  • 4986 words
  • 450,000 phone templates (?)

Phone string recognition: TIMIT

Phone string recognition: WSJ0

Sentence recognition: Resource Management

Sentence recognition: WSJ0

Example based ASR: Discussion
• For some tasks (medium sized problems) we were able to build a system that matches or exceeds the performance of state-of-the-art HMM systems
PROOF OF CONCEPT: OK
• Success is critically dependent on the ability to use multi-phone segments
  • frame based distance metric is not as powerful (yet) as with HMMs ! [ single nearest paradigm instead of KNN ?]
  • potentially better modeling of phone transitions than CD-HMMs [ i.e. NO modeling ! ]
• Challenges to move to large vocabulary tasks and very large databases:
  • Richness of the database: very many contexts by very many speakers
  • Move away from the naive HMM-like top-down search engine
  • Make better use of the available data: normalize for speaker (VTLN), acoustics

Issues in Example Based ASR: Search Space Explosion
• any allophone can be represented by any of its examples
• search space keeps on growing with larger example databases: factor 100, 1,000, 10,000, …
• large amount of redundant information
• hence a large inefficiency
• traditional pruning approaches will not be efficient
• early data driven (bottom-up) pruning is essential [ this was applied in all experiments, but not discussed ]

PART II: ASR Search Techniques
Top-Down and Bottom-Up Combined

Top-Down Search Strategy: Concept
• hypothesize:
  • all possible sentences allowed by the language model
• find:
  • the one that best matches the observed acoustics (spectral like frame based parameters)

Top-Down Search Strategy: Time-Synchronous Implementation
• initialize: start with the dummy ‘start sentence’ word
• loop:
  • extend all hypotheses
    • that are at or near word-end positions with all next possible words
  • find phone/template string equivalents for these extensions
  • fetch a new segment (frame) of data
  • incrementally compute the matching score between all hypotheses active in the search and the observed acoustics
  • order the hypotheses according to score and prune away
    • hypotheses that are ‘significantly’ worse than the best one
    • hypotheses that fall below the Top-N
• end: accept Top-1 as your final result (best guess)
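
The pruning step of the loop can be sketched in a few lines. Assumptions for illustration: a hypothesis is a (score, label) pair, scores are log scores where higher is better, ‘significantly worse’ is expressed as a fixed beam width, and the names `prune`, `beam_width` and `top_n` are hypothetical.

```python
def prune(hypotheses, beam_width, top_n):
    """Keep hypotheses within beam_width of the best score, at most top_n of them."""
    if not hypotheses:
        return []
    # Order the hypotheses by score, best first.
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)
    best_score = ranked[0][0]
    # Beam pruning: drop everything significantly worse than the best one.
    within_beam = [h for h in ranked if h[0] >= best_score - beam_width]
    # Rank pruning: drop everything below the Top-N.
    return within_beam[:top_n]

active = [(-10.0, "a"), (-12.0, "b"), (-30.0, "c"), (-11.0, "d")]
survivors = prune(active, beam_width=5.0, top_n=2)
```

Here the beam removes "c", and the Top-N limit then removes "b", leaving the two best hypotheses for the next frame.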

Top-Down Search Strategy: Why has it been so successful?
• The Language Model constraints are very restrictive
  • therefore it makes sense to apply them first
  • strong (overweighted) language models have been an essential ingredient for many commercial successes in ASR
• The top-down search is very tolerant of errors in one of the weakest links in the beads-on-a-string model:
  • errors in the phonetic dictionary are abundant
    • pronunciation dictionaries don’t contain all possible pronunciations
    • people don’t talk the way they are supposed to talk
  • but substituted/missing/inserted phone segments are absorbed by
    • forcing ‘a few’ frames to align with the presumed phone
  • this mismatch cost may not be so big, because
    • HMM scores smoothly decay as points are further from the class centroid
    • HMMs will stretch or compress segments to their own benefit

At the opposite end: The Intuitive Bottom-Up Recognition Paradigm
[Figure: layered diagram with labels: words; word/sentence recognition; phones; phone(tic) recognition; phonetic features; spectral analysis, feature extraction, speech signal processing noise suppression, …]

Bottom-Up: Why did it fail?
• Intuitive bottom-up was the paradigm of choice in the early days of speech recognition (1970’s)
• The prototypical implementation has been:
  • recognize the next layer on the basis of the layer immediately below
  • acknowledge that recognition will be imperfect; for this, develop statistical pattern matching techniques that allow for insertions/deletions/substitutions
• The biggest failure in this paradigm
  • is to use a single best recognition as information carrier between two layers
• This fails because
  • errors propagate at great speed and in great numbers throughout the search network
  • the acoustic-phonetic recognition is not good enough to allow any error correction paradigm to function well

Bottom-Up and Top-Down: Essential Weaknesses
• Bottom-Up
  • difficult to recover from recognition errors in lower layers
  • the correct hypothesis might never get activated
• Top-down
  • the linguistic universe is limited to a restrictive predefined language model
  • difficult (impossible) to discover new things
  • practically impossible for an LVCSR example based system

What’s in the middle of all this? The phoneme concept

Are Phone(me)s Real?
• words (morphemes): conceptual level; quite unambiguously recognized on the basis of the acoustics
• phones (allophones, phonemes): a convenient intermediate level both for humans and machines; ill-defined and highly ambiguous
• speech signal: is ‘given’, thus unambiguous, but contains massive amounts of non-phonetic information/noise

Recognition Models with Early Abstraction
• words (morphemes): top-down search for best possible word sequence on the basis of uncertain phonetic information
• phone graph: bottom-up recognition of low-level abstract units

Recognition Model with Early Abstraction
[Figure: layered diagram with labels: words; Top-down search engine (driven by LM + phonetic dict.) with phone graph as input; phonemes; probabilistic phone recognition; phone graph; spectral analysis, feature extraction, speech signal processing noise suppression, …]

It could make sense
• Bottom-up / Early abstraction is required for many skills
  • “fast match”
  • new word recognition
  • nonsense word recognition
• Fully top-down was/is an engineering/economic necessity
• Phone recognition is influenced by top-down linguistic processes with respect to
  • recognition speed
  • linguistic overrules

If we can get it to work: Phone Graph Quality
• phone graph error rate should be low (a few %)
• phone graph density should be moderate
  • search on the phone graph should not be slower than on the frame data
• very bad matches should NOT be included
  • as their acoustic scores make little or no sense
  • a more abstract ‘substitution/insertion/deletion’ score will make more sense
• Error model
  • should serve to overcome genuine phone errors
    • dictionary mistakes
    • gross pronunciation mistakes
  • should be gentle on the search effort

If we can get it to work: Error Model
• should serve to overcome genuine phone errors
  • dictionary mistakes
  • gross pronunciation mistakes
• should be gentle on the search effort
  • generic insertion/deletion/substitution will again make the search explode
  • “single error” model: each error should be embedded between 2 phones found in the graph
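
One possible reading of the “single error” constraint, sketched for substitutions only. This is an interpretation and not the authors' implementation: it compares a dictionary pronunciation with one equally long phone path from the graph and accepts a mismatch only when both neighboring phones match. Insertions, deletions and word boundaries are ignored, and the phone symbols are made up.

```python
def satisfies_single_error_model(expected, found):
    """True if every mismatching phone has a matching phone on both sides."""
    if len(expected) != len(found):
        return False  # this sketch handles substitutions only
    matches = [e == f for e, f in zip(expected, found)]
    for i, ok in enumerate(matches):
        if ok:
            continue
        left_ok = i > 0 and matches[i - 1]
        right_ok = i + 1 < len(matches) and matches[i + 1]
        if not (left_ok and right_ok):
            return False  # error at an edge or next to another error
    return True

isolated = satisfies_single_error_model(["k", "l", "I", "r"], ["k", "l", "E", "r"])
adjacent = satisfies_single_error_model(["k", "l", "I", "r"], ["k", "n", "E", "r"])
```

Restricting errors to isolated positions keeps the number of extra hypotheses small, whereas unrestricted edit operations would let almost any phone path match almost any word.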

New Opportunities
• Assuming a quite dense phone graph with few errors
  • total search effort significantly smaller than in fully top-down [ FLAVOR !! ]
  • possibility to model more complex linguistic knowledge sources
  • a way out for controlling the search problem of example based systems !!

Experiments with combined system (constrained lexicon on RM)
