
A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition



  1. A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition Dr. Eric Fosler-Lussier Presentation for CiS 788

  2. Overview • Our task: moving from “read speech recognition” to recognizing spontaneous conversational speech • Two basic approaches for modeling pronunciation variation • Encoding linguistic knowledge to pre-specify possible alternative pronunciations of words • Deriving alternatives directly from a pronunciation corpus • Purposes of this tutorial • Explain basic linguistic concepts in phonetics and phonology • Outline several pronunciation modeling strategies • Summarize promising recent research directions

  3. Pronunciations & Pronunciation Modeling

  4. Pronunciations & Pronunciation Modeling • Why sub-word units? • Data sparseness at word level • Intermediate level allows extensible vocabulary • Why phone(me)s? • Available dictionaries/orthographies assume this unit • Research suggests humans use this unit • Phone inventory more manageable than syllables, etc. (in e.g., English)

  5. Statistical Underpinnings for Pronunciation Modeling • In the whole-word approach, we could find the most likely utterance (word string) M* given the perceived signal X: M* = argmax_M P(M|X)

  6. Statistical Underpinnings for Pronunciation Modeling • With independence assumptions, we can use the following approximation: M* = argmax_M P(M|X) ≈ argmax_M max_Q PA(X|Q) · PQ(Q|M) · PL(M)

  7. Statistical Underpinnings for Pronunciation Modeling • PA(X|Q): the acoustic model • Maps continuous sound vectors to discrete phone states • Analogous to “categorical perception” in human hearing • PQ(Q|M): the pronunciation model • Probability of phone states given words • Also includes context-dependence & duration models • PL(M): the language model • The prior probability of word sequences
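
A minimal sketch of how the three scores combine in the log domain during decoding. The probability values and the two-word example are hypothetical, and a real decoder searches over all word strings M and phone sequences Q rather than scoring one candidate pair:

```python
import math

# Hypothetical toy probabilities standing in for the three trained models;
# the real tables are learned from data, and X is a fixed observed signal.
P_L = {("the", "cat"): 0.01}                                  # language model P(M)
P_Q = {(("dh", "ax", "k", "ae", "t"), ("the", "cat")): 0.7}   # pronunciation model P(Q|M)
P_A = {("dh", "ax", "k", "ae", "t"): 1e-9}                    # acoustic model P(X|Q)

def score(words, phones):
    """Combined log score: log PA(X|Q) + log PQ(Q|M) + log PL(M).
    The Viterbi approximation keeps only the single best phone
    sequence Q instead of summing over all of them."""
    return (math.log(P_A[phones])
            + math.log(P_Q[(phones, words)])
            + math.log(P_L[words]))

# A decoder would take the argmax of this score over all candidates;
# here we just score one (M, Q) pair.
print(score(("the", "cat"), ("dh", "ax", "k", "ae", "t")))
```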

  8. Statistical Underpinnings for Pronunciation Modeling The three models working in sequence:

  9. Linguistic Formalisms & Pronunciation Variation • Phones & Phonemes • (Articulatory) Features • Phonological Rules • Finite State Transducers

  10. Linguistic Formalisms & Pronunciation Variation • Phones & Phonemes • Phones: Types of (uttered) segments • E.g., [p] unaspirated voiceless labial stop [spik] • Vs. [ph] aspirated voiceless labial stop [phik] • Phonemes: Mental abstractions of phones • /p/ in speak = /p/ in peak to naïve speakers • ARPABET: between phones & phonemes • SAMPAbet: closer to phones, but not perfect…

  11. SAMPA for American English

Selected Consonants (ARPABET in parentheses):
tS   chin     tSIn      (ch)
dZ   gin      dZIn      (jh)
T    thin     TIn       (th)
D    this     DIs       (dh)
Z    measure  "mEZ@`    (zh)
N    thing    TIN       (ng)
j    yacht    jAt       (y)
4    butter   bV4@`     (dx)

Selected Vowels (ARPABET in parentheses):
{    pat      p{t       (ae)
A    pot      pAt       (aa)
V    cut      kVt       (uh) !
U    put      pUt       (uh) !
aI   rise     raIz      (ay)
3`   furs     f3`z      (er)
@    allow    @laU      (ax)
@`   corner   kOrn@`    (axr)
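
The slide's mapping as a lookup table, handy when converting transcriptions between alphabets (only the symbols shown above are included):

```python
# SAMPA -> ARPABET lookup for the symbols in the table above; the V/U
# rows keep the "(uh)" collision flagged with "!" on the slide.
SAMPA_TO_ARPA = {
    "tS": "ch", "dZ": "jh", "T": "th", "D": "dh",
    "Z": "zh", "N": "ng", "j": "y", "4": "dx",
    "{": "ae", "A": "aa", "V": "uh", "U": "uh",
    "aI": "ay", "3`": "er", "@": "ax", "@`": "axr",
}

def sampa_to_arpa(symbols):
    """Map a list of SAMPA symbols to ARPABET; unknown symbols pass through."""
    return [SAMPA_TO_ARPA.get(s, s) for s in symbols]

print(sampa_to_arpa(["tS", "I", "n"]))  # ['ch', 'I', 'n'] for "chin"
```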

  12. Linguistic Formalisms & Pronunciation Variation • (Articulatory) Features • Describe where (place) and how (manner) a sound is made, and whether it is voiced. • Typical features (dimensions) for vowels include height, backness, & roundness • (Acoustic) Features • Vowel features actually correlate better with formants than with actual tongue position

  13. From Hume-O’Haire & Winters (2001)

  14. Linguistic Formalisms & Pronunciation Variation • Phonological Rules • Used to classify, explain, and predict phonetic alternations in related words: write (t) vs. writer (dx) • May also be useful for capturing differences in speech mode (e.g., dialect, register, rate) • Example: flapping in American English
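
A rough sketch of the flapping rule over space-separated ARPABET strings. The stress marking (a "1" suffix on vowels) and the simplified vowel inventory are assumptions about the transcription format, not details from the slides:

```python
import re

# Simplified ARPABET vowel inventory; "1" after a vowel marks stress.
VOWEL = r"(?:aa|ae|ah|ao|aw|ax|ay|eh|er|ey|ih|iy|ow|oy|uh|uw)"

def flap(pron):
    """Rewrite /t/ or /d/ as the flap [dx] between a stressed vowel
    and an unstressed vowel, e.g. 'r ay1 t er' (writer) -> 'r ay1 dx er'."""
    return re.sub(rf"({VOWEL}1) (t|d) ({VOWEL})(?!1)", r"\1 dx \3", pron)

print(flap("r ay1 t er"))  # r ay1 dx er  (flapped)
print(flap("r ay1 t"))     # r ay1 t     (no following vowel, unchanged)
```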

  15. Linguistic Formalisms & Pronunciation Variation • Finite State Transducers • (Same example transducer as on Tuesday)

  16. Linguistic Formalisms & Pronunciation Variation • Useful properties of FSTs • Invertible (thus usable in both production & recognition) • Learnable (Oncina, Garcia, & Vidal 1993, Gildea & Jurafsky 1996) • Composable • Compatible with HMMs
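
A toy illustration of why these properties matter, representing transducers extensionally as sets of (input, output) pairs. Real systems compile rules into automata with toolkits such as OpenFst, but the relational algebra works the same way:

```python
# Transducers as relations: sets of (input, output) pairs.
T_FLAP = {("t", "t"), ("t", "dx")}              # canonical -> surface (optional flapping)
T_HMM = {("t", "t_model"), ("dx", "dx_model")}  # surface phone -> acoustic model id

def compose(a, b):
    """Chain two relations: x -> z whenever x -> y in a and y -> z in b."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

def invert(a):
    """Swap input and output: a production model becomes a recognition model."""
    return {(y, x) for (x, y) in a}

print(compose(T_FLAP, T_HMM))  # canonical /t/ maps to both acoustic models
print(invert(T_FLAP))          # surface [dx] maps back to canonical /t/
```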

  17. ASR Models: Predicting Variation in Pronunciations • Knowledge-Based Approaches • Hand-Crafted Dictionaries • Letter to Sound Rules • Phonological Rules • Data-Driven Approaches • Baseform Learning • Learning Pronunciation Rules

  18. ASR Models: Predicting Variation in Pronunciations • Hand-Crafted Dictionaries • E.g., CMUdict, Pronlex for American English • The most readily available starting point • Limitations: • Generally only one or two pronunciations per word • Does not reflect fast speech, multi-word context • May not contain e.g., proper names, acronyms • Time-consuming to build for new languages

  19. ASR Models: Predicting Variation in Pronunciations • Letter to Sound Rules • In English, used to supplement dictionaries • In e.g., Spanish, may be enough by themselves • Can be learned (e.g. by DTs, ANNs) • Hard-to-catch Exceptions: • Compound-words, acronyms, etc. • Loan words, foreign words • Proper names (Brands, people, places)
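
A hedged sketch of learning letter-to-sound rules with a decision tree, as the slide suggests. It assumes scikit-learn is available, uses a two-word toy lexicon, and crudely encodes each letter by a window of its neighbors:

```python
from sklearn.tree import DecisionTreeClassifier

def windows(word, size=1):
    """Each letter with `size` neighbors of context on either side."""
    padded = "#" * size + word + "#" * size
    return [padded[i:i + 2 * size + 1] for i in range(len(word))]

# Letter-aligned (word, phones) pairs -- toy data, not a real lexicon.
data = [("cat", ["k", "ae", "t"]), ("cut", ["k", "ah", "t"])]
X, y = [], []
for word, phones in data:  # one training example per (letter window, phone)
    for win, ph in zip(windows(word), phones):
        X.append([ord(c) for c in win])  # character codes as crude features
        y.append(ph)

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[ord(c) for c in w] for w in windows("cat")]))  # ['k' 'ae' 't']
```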

  20. ASR Models: Predicting Variation in Pronunciations • Phonological Rules • Useful for modeling e.g., fast speech, likely non-canonical pronunciations • Can provide basis for speaker-adaptation • Limitations: • Requires labeled corpus to learn rule probabilities • May over-generalize, creating spurious homophones • (Pruning minimizes this)

  21. Examples of Fast-Speech Rules

  22. ASR Models: Predicting Variation in Pronunciations • Automatic Baseform Learning 1) Use ASR with “dummy” dictionary to find “surface” phone sequences of an utterance 2) Find canonical pronunciation of utterance (e.g., by forced-Viterbi) 3) Align these two (w/ dynamic programming) 4) Record “surface pronunciations” of words
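
Step 3 above, sketched as a plain Levenshtein alignment with backtrace; the phone strings are toy ARPABET examples:

```python
def align(canon, surface):
    """Align canonical and surface phone strings by edit-distance DP,
    pairing each canonical phone with a surface phone or "-"."""
    n, m = len(canon), len(surface)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0:
                cost[i][j] = j
            elif j == 0:
                cost[i][j] = i
            else:
                sub = cost[i - 1][j - 1] + (canon[i - 1] != surface[j - 1])
                cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    pairs, i, j = [], n, m          # backtrace from the corner
    while i > 0 or j > 0:
        if i and j and cost[i][j] == cost[i - 1][j - 1] + (canon[i - 1] != surface[j - 1]):
            pairs.append((canon[i - 1], surface[j - 1])); i, j = i - 1, j - 1
        elif i and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canon[i - 1], "-")); i -= 1    # deletion on the surface
        else:
            pairs.append(("-", surface[j - 1])); j -= 1  # insertion on the surface
    return pairs[::-1]

# Canonical "probably" vs. a reduced surface form "probly":
print(align("p r aa b ax b l iy".split(), "p r aa b l iy".split()))
```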

  23. ASR Models: Predicting Variation in Pronunciations • Limitations of Baseform Learning • Limited to single-word learning • Ignores multi-word phrases, cross word-boundary effects (e.g., Did you → “didja”) • Misses generalizations across words (e.g., learns flapping separately for each word)

  24. ASR Models: Predicting Variation in Pronunciations • Learning Pronunciation Rules • Each word has a canonical pronunciation c1 c2 … cj … cn • Each phone cj in a word can be pronounced as some surface phone sj • Set of surface pronunciations S: {Si = si1, …, sin} • Taking the canonical triphone and the last surface phone into account, the probability of a given Si can be estimated: P(Si) ≈ ∏j P(sij | cj-1, cj, cj+1, si,j-1)
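
A count-based sketch of this estimator. It assumes an aligned corpus of equal-length canonical/surface pairs (substitutions only; deletions and insertions would need the alignment's epsilon symbols) and applies no smoothing, so it can only score contexts seen in training:

```python
from collections import Counter, defaultdict
import math

counts = defaultdict(Counter)  # (c_{j-1}, c_j, c_{j+1}, s_{j-1}) -> surface phone counts

def train(canon, surface):
    """Accumulate counts from one aligned canonical/surface pair."""
    padded, prev_s = ["#"] + canon + ["#"], "#"
    for j, s in enumerate(surface):
        counts[(padded[j], padded[j + 1], padded[j + 2], prev_s)][s] += 1
        prev_s = s

def log_prob(canon, surface):
    """log P(S|C) under the factorization above; an unseen context
    would raise an error here (no smoothing)."""
    padded, prev_s, logp = ["#"] + canon + ["#"], "#", 0.0
    for j, s in enumerate(surface):
        ctx = (padded[j], padded[j + 1], padded[j + 2], prev_s)
        logp += math.log(counts[ctx][s] / sum(counts[ctx].values()))
        prev_s = s
    return logp

train("b ah t er".split(), "b ah dx er".split())            # "butter" flapped
print(log_prob("b ah t er".split(), "b ah dx er".split()))  # 0.0: seen exactly once
```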

  25. ASR Models: Predicting Variation in Pronunciations • (Machine) Learning Pronunciation Rules • Typical ML techniques apply: CART, ANNs, etc. • Using features (pre-specified or learned) helps • Brill-type rules (e.g., Yang & Martens 2000): • A → B / C __ D with P(B|A,C,D): positive rule • A → not B / C __ D with 1 − P(B|A,C,D): negative rule (Note: equivalent to two-level rule types 1 & 4)

  26. ASR Models: Predicting Variation in Pronunciations • Pruning Learned Rules & Pronunciations • Vary the number of allowed pronunciations by word frequency, e.g., f(count(w)) = k · log(count(w)) • Use a probability threshold for candidate pronunciations • Absolute cutoff • “Relmax” (relative to maximum) cutoff • Use acoustic confidence C(pj, wi) as the measure
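
A sketch combining the pruning heuristics above; the constant k and the relmax threshold are illustrative values, not numbers from the tutorial:

```python
import math

def prune(variants, word_count, k=2.0, relmax=0.1):
    """Keep at most k*log(count) variants, each within a relmax factor
    of the most probable one. variants: (pronunciation, probability)."""
    max_n = max(1, int(k * math.log(word_count + 1)))  # frequency-based allowance
    ranked = sorted(variants, key=lambda v: -v[1])
    best_p = ranked[0][1]
    kept = [(p, pr) for p, pr in ranked if pr >= relmax * best_p]  # relmax cutoff
    return kept[:max_n]

variants = [("dh ah", 0.55), ("dh iy", 0.35), ("dh ax", 0.06), ("d ah", 0.04)]
print(prune(variants, word_count=50))  # frequent word: keeps three variants
print(prune(variants, word_count=2))   # rare word: allowance shrinks to two
```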

  27. Online Transformation-Based Pronunciation Modeling • In theory, a dynamic dictionary could halve error rates • Using an “oracle dictionary” for each utterance in Switchboard reduces error by 43% • Using, e.g., multi-word context or hidden speaking-mode states may capture some of this • Actual results are less dramatic, of course!

  28. Online Transformation-Based Pronunciation Modeling

  29. Five Problems Yet to Be Solved • Confusability and Discriminability • Hard Decisions • Consistency • Information Structure • Moving Beyond Phones as Basic Units

  30. Five Problems Yet to Be Solved • Confusability and Discriminability • New pronunciations can create homophones not only with other words, but with parts of words. • Few exact metrics exist to measure confusion
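
In the absence of an exact metric, one crude check is to compare a new variant against every dictionary pronunciation and its contiguous sub-sequences; this sketch uses plain sequence similarity over phones, though real confusability also depends on acoustic similarity:

```python
from difflib import SequenceMatcher

def confusable(variant, lexicon, threshold=0.8):
    """Flag dictionary words whose pronunciation, or any contiguous
    part of it, resembles the new variant."""
    hits = []
    for word, pron in lexicon.items():
        for i in range(len(pron)):
            for j in range(i + 1, len(pron) + 1):
                sim = SequenceMatcher(None, variant, pron[i:j]).ratio()
                if sim >= threshold:
                    hits.append((word, " ".join(pron[i:j]), round(sim, 2)))
    return hits

lexicon = {"support": "s ax p ao r t".split(), "sport": "s p ao r t".split()}
# A reduced variant of "support" collides exactly with "sport", and
# nearly with the tail of "support" itself:
print(confusable("s p ao r t".split(), lexicon))
```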

  31. Five Problems Yet to Be Solved • Hard Decisions • Forced-Viterbi throws away good but “second-best” representations • N-best would avoid this (Mokbel and Jouvet), but is problematic for large vocabularies • DTs also introduce hard decisions and data splitting

  32. Five Problems Yet to Be Solved • Consistency • Current ASR works word-by-word w/o picking up on long-term patterns (e.g., stretches of fast speech, consistent patterns like dialect, speaker) • Hidden speech-mode variable helps, but data is perhaps too sparse for dialect-dependent states.

  33. Five Problems Yet to Be Solved • Information Structure • Language is about the message! • Hence, not all words are pronounced equal • Confounding variables: • Prosody & intonation (emphasis, de-accenting) • Position of word in utterance (beginning or end) • Given vs. new information; Topic/focus, etc. • First-time use vs. repetitions of a word

  34. Five Problems Yet to Be Solved • Moving Beyond Phones as Basic Units • Other types of units • “Fenones” • Hybrid phones [x+y] for /x/ → /y/ rules • Detecting (changes in) distinctive features • E.g., [ax] → {[+voicing,+nasality], [+voicing,+nasality,+back], [+voicing,+back], …} • (cf. Autosegmental & Non-linear phonology?)

  35. Conclusions • An ideal model would: • Be dynamic and adaptive in dictionary use • Integrate knowledge of previously heard pronunciation patterns from that speaker • Incorporate higher-level factors (e.g., speaking rate, semantics of the message) to predict changes from the canonical pronunciation • (Perhaps) operate on a sub-phonetic level, too.
