
Presentation Transcript


  1. Jennifer Cole, Dept. of Linguistics; Mark Hasegawa-Johnson, Dept. of Electrical and Computer Engineering. Bringing Prosody into Automatic Speech Recognition: Improving word recognition and advancing linguistic science

  2. Prosody: the rhythmic and intonational patterns of spoken language • A benefit for comprehension: prosody cues phrasing and information status of words. • A cost for speech recognition: prosody conditions acoustic variation.

  3. Sources of variation in speech • The acoustic features of a speech sound vary as a function of: • Phonological context (assimilations, deletions, insertions) • Phonetic context (coarticulation, masking) • Speaker voice • Speaking style and tempo • Prosodic factors: accent, phrasal position

  4. Modeling acoustic variation in ASR • Acoustic variation that results from local phonological and phonetic context can be accommodated in ASR through the use of “diphone” and “triphone” models. • Variation due to speaker, speech style, and prosody is not determined by local phone context, and is not explicitly modeled in most ASR systems.
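
For context, a minimal sketch of what "triphone" context expansion looks like, assuming an HTK-style left-center+right naming convention; the helper function and the example pronunciation are illustrative, not taken from the presentation:

```python
# Minimal sketch of triphone context expansion (illustrative helper, not the
# presentation's toolchain). Each phone is modeled jointly with its left and
# right neighbors, so local contextual variation is captured implicitly.

def expand_triphones(phones):
    """Map a phone sequence to left-phone/center/right-phone triphone units."""
    padded = ["sil"] + list(phones) + ["sil"]     # pad with silence at the edges
    units = []
    for i in range(1, len(padded) - 1):
        left, center, right = padded[i - 1], padded[i], padded[i + 1]
        units.append(f"{left}-{center}+{right}")  # HTK-style naming convention
    return units

if __name__ == "__main__":
    # "prosody" ~ /p r aa s ah d iy/ (hypothetical pronunciation)
    print(expand_triphones(["p", "r", "aa", "s", "ah", "d", "iy"]))
    # ['sil-p+r', 'p-r+aa', 'r-aa+s', 'aa-s+ah', 's-ah+d', 'ah-d+iy', 'd-iy+sil']
```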

  5. Variation and confusability • Prosodically-conditioned variation causes greater overlap between contrastive phones in acoustic space. • Greater overlap between phones can result in greater confusability, and is a likely source of error in word recognition.

  6. Accent leads to greater overlap: acoustic cues to consonant voicing. [Figure: VOT distributions of /p/ (voiceless) and /b/ (voiced) tokens, plotted for unaccented and accented conditions.] Combining accent conditions results in greater overlap between p/b.

  7. Separating accented from unaccented phones yields better distinctions within each accent category. [Figure: VOT distributions of /p/ and /b/, plotted separately for unaccented and accented tokens.] No overlap between voiced/voiceless phones within an accent category. Separate models for accented and unaccented consonants should result in better recognition.
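
A small synthetic illustration of why the split should help, using invented VOT distributions (the means and standard deviations are assumptions, not corpus measurements): pooling accent conditions widens each voicing category and increases the measured p/b overlap, while keeping the accent categories separate leaves the two voicing classes better distinguished.

```python
# Synthetic illustration with made-up VOT values (ms): pooling accent
# conditions increases p/b overlap; conditioning on accent reduces it.
import numpy as np

rng = np.random.default_rng(0)

def vot_samples(mean_ms, sd_ms, n=500):
    return rng.normal(mean_ms, sd_ms, n)

# Hypothetical means: accent shifts VOT upward for both /b/ and /p/.
b_unacc, b_acc = vot_samples(10, 8), vot_samples(25, 8)
p_unacc, p_acc = vot_samples(50, 12), vot_samples(75, 12)

def overlap(a, b, grid=np.linspace(-40, 140, 361)):
    """Approximate distributional overlap via histogram intersection."""
    ha, _ = np.histogram(a, bins=grid, density=True)
    hb, _ = np.histogram(b, bins=grid, density=True)
    return np.minimum(ha, hb).sum() * (grid[1] - grid[0])

pooled = overlap(np.concatenate([b_unacc, b_acc]),
                 np.concatenate([p_unacc, p_acc]))
within = (overlap(b_unacc, p_unacc) + overlap(b_acc, p_acc)) / 2

print(f"p/b overlap, accent pooled: {pooled:.2f}")
print(f"p/b overlap, within accent: {within:.2f}")   # expected to be smaller
```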

  8. Immediate goal: • Improving ASR: by modeling prosodic distinctions in speech recognition we expect to achieve more accurate phone and word recognition. • Advancing linguistic science: if we do, then results from our speech recognition experiments will tell us about the prosodic effects that occur in “non-lab” (natural) spoken English.

  9. Future goals: • Recognition of pitch and durational cues to pragmatic meaning and discourse structure. • An approach to modeling other sources of systematic variation: “foreign” or dialectal accent, speech disorders, child speech…

  10. Our approach to prosody-dependent phone modeling • Determine which prosodic features condition confusion-inducing variation for which kinds of phones. • Train a speech recognizer on prosodically specified phones. • Requires a training corpus of prosodically-labeled and phone-labeled speech. • Develop an approach to recognize prosodic features from acoustic cues.
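
A sketch of what "prosodically specified phones" could look like as training labels, assuming a simple scheme that tags each phone for accent and phrase-final position; the tag names and the helper function are hypothetical, not the authors' actual toolchain:

```python
# Sketch of prosody-dependent phone labeling (hypothetical tag scheme).
# Each phone in the training transcript becomes a prosodically specified
# unit, so the recognizer trains separate acoustic models for accented vs.
# unaccented (and phrase-final vs. non-final) variants.

def prosodify(phones, accents, finals):
    """Attach accent and phrase-position tags to each phone label."""
    units = []
    for phone, accented, final in zip(phones, accents, finals):
        tag = phone + ("_acc" if accented else "_unacc")
        if final:
            tag += "_fin"   # phrase-final lengthening gets its own models
        units.append(tag)
    return units

# Toy example: /b ah t/ with an accent on the vowel and a phrase-final /t/.
print(prosodify(["b", "ah", "t"],
                accents=[False, True, False],
                finals=[False, False, True]))
# ['b_unacc', 'ah_acc', 't_unacc_fin']
```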

  11. Linguistic models of acoustic variation • Research in phonetics shows that prosodic factors are a significant source of variation. • In English: • Lexical stress • Nuclear (phrasal) accent • Phrase position (initial/medial/final)

  12. Prosodic effects on articulation • Sounds in stressed or accented syllables are more strongly articulated. • Stressed/accented speech gestures are • Faster • Longer • Bigger (greater displacement) (Cho 2001; de Jong 1991, 1995; Edwards & Beckman 1988; Edwards et al. 1991; Beckman et al. 1992; Harrington et al. 1995; Cooper 1991)

  13. Prosodic effects on acoustics (English) • Vowels in unstressed syllables are reduced (centralized) and shortened compared to stressed vowels (Lindblom 1963). • Segment durations are longer • In accented syllables (Beckman & Edwards 1994) • In phrase-initial syllables (Fougeron & Keating 1997) • In phrase-final syllables (Edwards & Beckman 1988; Crystal & House 1988; Wightman et al 1994)

  14. Limitations of prior studies • Findings are from laboratory studies of controlled speech, produced in the absence of a real discourse context. • Focus is on supralaryngeal articulations (C and V place and manner). • The bulk of the evidence comes from articulatory studies.

  15. Research questions • What is the full range of effects of prosodic factors on acoustic features? • How are laryngeal features affected? • To what extent does prosody influence acoustic variation in non-laboratory speech?

  16. Dual methods for investigating prosodic effects • Acoustic analysis provides a direct measure of prosodically-conditioned variation. • Speech recognition experiments provide indirect evidence for prosodic effects: • If recognition improves when prosodic context is explicitly modeled, then prosodic effects must have decreased the distinctiveness of contrastive phones in the speech corpus studied.

  17. Phase I: Boston University Radio News speech corpus • What are the effects of accent on the acoustic cues that distinguish voiced from voiceless stops in American English? • /p,t,k/ vs. /b,d,g/ • Does accent condition a significant degree of variation in this speech corpus? • We begin by looking at the acoustic cues for the voicing contrast for stops in V#ˈCV contexts: • C is the onset of a word-initial, stressed syllable • Comparing Cs in accented and unaccented syllables

  18. Why Radio News Speech? • Speech not controlled for purposes of phonetics research. • Speech produced in a real communicative context. • Speech produced by professional radio news announcers. (good? bad?) • Multiple speakers reading the same news story. • Speech is prosodically labeled based on ToBI labeling standards (Beckman-Pierrehumbert model) … saves us lots of time/work!

  19. TIMIT database confusion matrix. [Table: rows = actual phoneme (A), columns = recognized phoneme (R).]
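
The underlying computation is just a tally over aligned (actual, recognized) phone pairs. A minimal sketch is below, assuming the alignment between reference and recognized phone strings has already been produced; the example pairs are invented:

```python
# Sketch of tallying a phone confusion matrix from aligned (actual,
# recognized) phone pairs; the alignment step itself (e.g. by dynamic
# programming) is assumed to have been done already.
from collections import Counter, defaultdict

def confusion_matrix(aligned_pairs):
    """Return counts[actual][recognized] from aligned phone pairs."""
    counts = defaultdict(Counter)
    for actual, recognized in aligned_pairs:
        counts[actual][recognized] += 1
    return counts

pairs = [("b", "b"), ("b", "p"), ("p", "p"), ("p", "p"), ("p", "b"),
         ("d", "d"), ("d", "t"), ("t", "t")]
cm = confusion_matrix(pairs)
for actual, row in sorted(cm.items()):
    print(actual, dict(row))
# b {'b': 1, 'p': 1}
# d {'d': 1, 't': 1}
# p {'p': 2, 'b': 1}
# t {'t': 1}
```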

  20. Acoustic cues for the phonological voicing contrast • VOT • voiceless > voiced • closure (lead) voicing for voiced stops • F0 (measured at onset of following vowel) • voiceless > voiced • Closure duration • voiceless > voiced
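
These three cues can be read off labeled acoustic landmarks. The sketch below shows one way to compute them, assuming hand-labeled closure, burst, and voicing-onset times; the field names and numbers are hypothetical, not drawn from the corpus:

```python
# Sketch of computing the three voicing cues from labeled landmark times
# (field names and example times are hypothetical; times in seconds).
from dataclasses import dataclass

@dataclass
class StopToken:
    closure_start: float        # start of stop closure
    burst: float                # release burst
    voicing_onset: float        # onset of voicing in the following vowel
    f0_at_vowel_onset: float    # Hz, measured at the start of the vowel

    @property
    def closure_duration(self):
        return self.burst - self.closure_start

    @property
    def vot(self):
        # positive VOT: voicing begins after the burst (voiceless-like)
        # negative VOT: closure (lead) voicing (voiced-like)
        return self.voicing_onset - self.burst

tok = StopToken(closure_start=0.120, burst=0.195,
                voicing_onset=0.260, f0_at_vowel_onset=210.0)
print(f"closure = {tok.closure_duration*1000:.0f} ms, "
      f"VOT = {tok.vot*1000:.0f} ms, F0 = {tok.f0_at_vowel_onset:.0f} Hz")
# closure = 75 ms, VOT = 65 ms, F0 = 210 Hz
```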

  21. Hypothesized accentual effects: paradigmatic strengthening vs. syntagmatic strengthening. [Figure: schematic plots of acoustic values for contrastive pairs in unaccented vs. accented conditions, one panel per strengthening type.]

  22. Predicted effects of accent • Paradigmatic Strengthening: greater acoustic distinctions between voiced and voiceless stops for all measures • not a problem for ASR • Syntagmatic Strengthening: similar effects for both voiced and voiceless stops: • increase in VOT and Closure Duration • increase in acoustic energy, resulting in higher F0

  23. ANOVA Results for effects of Voicing and Accent on means of acoustic measures • Voicing was a significant factor for all three measures → these cues signal voicing. • Significant effects of Accent were found for VOT, F0 and Closure Duration. • Accent effects: • Increased VOT for all stops except /g/; • Raised F0 for all stops except /b/; • Increased Closure Duration for all stops except /g/.
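
For readers unfamiliar with the analysis, here is a sketch of a two-way ANOVA (Voicing x Accent) on VOT of the kind reported above, run with statsmodels on synthetic data; the numbers are invented and do not reproduce the study's measurements:

```python
# Sketch of a two-way ANOVA (Voicing x Accent) on VOT using synthetic data;
# the real study's measurements are not reproduced here.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
rows = []
for voicing, base in [("voiced", 15.0), ("voiceless", 55.0)]:
    for accent, boost in [("unaccented", 0.0), ("accented", 15.0)]:
        for _ in range(40):
            rows.append({"voicing": voicing, "accent": accent,
                         "vot_ms": rng.normal(base + boost, 8.0)})
df = pd.DataFrame(rows)

model = ols("vot_ms ~ C(voicing) * C(accent)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F and p for Voicing, Accent, interaction
```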

  24. [Figure: data plot for the stops /k, t, p, g, b, d/.]

  25. Region of overlap of voiced and voiceless groups within an accent category. [Figure: /p, t, k/ vs. /b, d, g/ tokens by accent condition.]

  26. For all 3 places of articulation: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: /p, t, k/ vs. /b, d, g/ tokens, accent conditions pooled.]

  27. [Figure: data plot for the stops /k, d, t, p, b, g/.]

  28. Region of overlap of voiced and voiceless groups within an accent category. [Figure: /p, t, k/ vs. /b, d, g/ tokens by accent condition.]

  29. For bilabials and velars: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: /p, t, k/ vs. /b, d, g/ tokens, accent conditions pooled.]

  30. [Figure: data plot for the stops /b, d, p, k, t, g/.]

  31. Region of overlap of voiced and voiceless groups within an accent category. [Figure: /p, t, k/ vs. /b, d, g/ tokens by accent condition.]

  32. For bilabials and alveolars: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: /p, t, k/ vs. /b, d, g/ tokens, accent conditions pooled.]

  33. Summary of results from acoustic study • VOT, F0 and Closure duration: accent induces increased values for both voiced and voiceless stops → syntagmatic strengthening • VOT and F0: effects are bigger and more consistent for voiceless stops than for voiced stops → paradigmatic strengthening
