
Phonetic perspectives on modelling information in the speech signal

Speech Analysis and Processing for Knowledge Discovery. Phonetic perspectives on modelling information in the speech signal. Sarah Hawkins, University of Cambridge, sh110@cam.ac.uk. ISCA ITRW Workshop, Aalborg, 4-6 June 2008.


Presentation Transcript


  1. Speech Analysis and Processing for Knowledge Discovery • Phonetic perspectives on modelling information in the speech signal • Sarah Hawkins, University of Cambridge, sh110@cam.ac.uk • ISCA ITRW Workshop, Aalborg, 4-6 June 2008

  2. Contents* • This ppt presentation contains all the slides shown during my talk at the ITRW, in the order in which they were shown. • But there are additional slides interspersed with those that were shown, which amplify the points made during the talk. The additional slides are marked with an asterisk (*) at the end of the title line, as on this slide • Some connecting comments are added in the Notes sections

  3. Speech perception: common assumptions • first focus is on phonological or lexical form • identify the words • ‘do grammar’ and all that stuff afterwards • ‘essence’ + variation • preserve the essence (phonemes or features) • discard the variation • use “higher-order knowledge” to deal with resultant ambiguities in phoneme strings. Fine for where ASR models tend to start (small vocabulary, one speech style, a few speakers) ... but does not generalise

  4. Overview • history: how we got to those assumptions • “phonetic detail” (the sensory signal) tells us much more than phonological form • how brains seem to use the sensory signal • implications for modelling

  5. Main message: speech research • legacy of simplification and search for essence • but essence became defined as phonological, so the GOAL became phonological form • but the real goals are: • understanding meaning (broadly defined) • successful interaction • so we need a different framework

  6. Goals: meaning and successful interaction • A different framework, different prime units: • functional (“polysystemic”) • adaptive to changing circumstances • much of the information is in the signal: “phonetic detail” (PD) • the brain structures that information (perception = creation of illusions?)

  7. Salience of PD is task specific: situation and function of response I don’t know I dunno dunno

  8. I do not know I don’t know I dunno dunno

  9. I…do…not…know I do not know I don’t know I dunno dunno

  10. I…do…not…know I do not know I don’t know I dunno dunno [an] []

  11. I…do…not…know I do not know I don’t know I dunno dunno [an] [] but not *[] or *[mmm]

  12. Speech perception research: history • 1950-1965 Broad-based exploration • 1965-1990s Narrowed to focus on the search for invariance in the relationship between speech signal and its percept: THEORY • 1995… broader focus again • to include ‘discrepant’ data & new understanding • which requires changes in conceptualization of • task goals • processes involved • THEORY. Hawkins (2004) Puzzles and patterns in 50 years of research on speech perception. http://www.rle.mit.edu/soundtosense/conference/pages/invited.htm

  13. Early work: Glorious Discovery • often looked at effects on the whole signal • but as puzzles arose, and we looked more closely, then attention became focused on small domains in an effort both to simplify and to clarify

  14. Examples of early work • Source separation: Cherry (1953) JASA 25, 975-979. Cocktail party effect / multi-talker perception • observations relevant to memory, attention, transitional probability, speaker vs message • Source integration: Sumby & Pollack (1954) JASA 26, 212-215 • audiovisual presentation increases intelligibility (visual contribution is relative to available auditory contribution) • in auditory-only presentations, polysyllables are more intelligible than monosyllables (overall shape... neighbourhoods… cohorts…) • Intelligibility: importance of context and meaning • possible responses; immediate phonetic context; preceding (and following) context (many studies) • Memory: relationships; recoding into larger units. Miller (1956)

  15. Early work: source separation* Cocktail party effect / multi-talker perception Cherry (1953) • continuous natural speech, with different types of content, presented in different ways • a huge wealth of observations relevant to • memory • attention • transitional probabilities • speaker vs message Cherry (1953) JASA 25, 975-979

  16. Early work: source separation* Cocktail party effect / multi-talker perception Broadbent & Ladefoged (1957) • separate synthetic formants fuse to sound like a single vowel when presented to the same or different ears, only if they have the same f0 • compared ‘natural’ and ‘sustained’ formants • extensions to theories of hearing (e.g. Licklider) Broadbent & Ladefoged (1957) JASA 29, 708-710 Darwin (1981) QJEP 33, 185-207 Bregman (1990) Auditory Scene Analysis ASA special session, 2004 Cooke & Ellis (2001) Sp. Comm. 35, 141–177

  17. Early work: source integration* Sumby & Pollack (1954) Especially in high levels of noise: • audiovisual presentation increases intelligibility (visual contribution is relative to the available auditory contribution) Sumby & Pollack (1954) JASA 26, 212-215; Massaro (1998) Perceiving Talking Faces; widespread AV groups and applications

  18. Early work: source integration* Sumby & Pollack (1954) Especially in high levels of noise: • audiovisual presentation increases intelligibility (visual contribution is relative to the available auditory contribution) • in auditory-only presentations, polysyllables are more intelligible than monosyllables (overall shape... neighbourhoods… cohorts…) Sumby & Pollack (1954) JASA 26, 212-215; Massaro (1998) Perceiving Talking Faces; widespread AV groups and applications; Richard Warren, Paul Luce, Marslen-Wilson

  19. Early work: memory* Miller (1956) • short term memory span for unrelated items: The Magical Number Seven ± Two • can increase this span by: • making relative rather than absolute judgments • increasing the number of dimensions • chunking into larger items • recoding is a crucial process. Miller (1956) Psychological Review 63, 81-97; Serial learning and recall (e.g. Underwood); Lashley (1951) Serial order in behavior; Pisoni (1973) and later
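
The chunking and recoding point on slide 19 can be illustrated with a minimal sketch. It assumes the familiar binary-to-octal style of recoding, in which a long string of binary digits is grouped into chunks of three and each chunk is renamed as one item. The function name `recode_binary` and the example digit string are illustrative choices, not material from the talk.

```python
def recode_binary(bits, chunk_size=3):
    """Recode a string of binary digits into larger chunks.

    Each group of `chunk_size` digits becomes one integer, so the listener
    holds fewer (but richer) items. Illustrative sketch only.
    """
    if len(bits) % chunk_size != 0:
        raise ValueError("length of bits must be a multiple of chunk_size")
    # Interpret each chunk as a base-2 number, e.g. "101" -> 5.
    return [int(bits[i:i + chunk_size], 2) for i in range(0, len(bits), chunk_size)]


sequence = "101000100111001110"  # 18 unrelated binary items (hypothetical example)
chunks = recode_binary(sequence)
print(len(sequence), "binary items recoded as", len(chunks), "chunks:", chunks)
```

With three digits per chunk, 18 items become 6, which brings the sequence back within a span of about seven items.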

  20. Early work: intelligibility* Context of Possible Responses • Miller, Heise & Lichten (1951) • monosyllables • size of test vocabulary affects identification • 2…256…all monosyllables • though presumably there are limits: • two vs six • five vs nine! Miller, Heise & Lichten (1951) J. Exp. Psych. 41, 329-335
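
One way to see why the size of the test vocabulary matters is to compute how much uncertainty each response resolves. The sketch below assumes equally likely alternatives and uses illustrative vocabulary sizes within the 2 to 256 range mentioned on the slide. The specific sizes and the function names are assumptions for illustration, not values taken from the talk.

```python
import math


def bits_per_response(vocabulary_size):
    """Information (in bits) needed to pick one item from equally likely alternatives."""
    return math.log2(vocabulary_size)


def chance_level(vocabulary_size):
    """Probability of a correct response by guessing alone."""
    return 1.0 / vocabulary_size


# Illustrative vocabulary sizes (assumed), spanning the range on the slide.
for n in (2, 4, 8, 16, 32, 256):
    print(f"{n:>3} words: {bits_per_response(n):.0f} bits, chance = {chance_level(n):.4f}")
```

A listener choosing between 2 words needs 1 bit from the signal, whereas choosing among 256 needs 8, so the same noisy signal supports fewer correct identifications as the vocabulary grows.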

  21. Early work: intelligibility* Phonetic Context • Pickett & Pollack (1963) • excerpts from connected speech must be ≥ 800 ms long to be fully intelligible • regardless of rate: • faster rates need more syllables to be understood (slowing the speech down does not help) • crucial role of coarticulation & style (‘connected speech processes’). Pickett & Pollack (1963) Language & Speech 6, 165-171

  22. Early work: preceding context* affects the interpretation of the current sound • Ladefoged and Broadbent (1957) • carrier phrase: "Please say what this word is:" followed by a test word (bit, bet, bat, but) • F1 of CARRIER: 200-380 Hz | 380-660 Hz; response shown: bet | bit. Ladefoged and Broadbent (1957) JASA 29, 98-104

  23. Early work: immediate context* determines the interpretation of the current stimulus Synthesizing bursts and transitionless vowels Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606

  24. Early work: immediate context* determines the interpretation of the current stimulus Identification of bursts and transitionless vowels: the CV is identified as the minimal acoustic unit Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606

  25. Early work: immediate context* determines the interpretation of the current stimulus. Identification of burstless stops with different vowels: transitions are all you need! Delattre, Liberman, & Cooper (1955) JASA 27, 769-773

  26. Categorical Perception* of obstruent consonants • Equal acoustic changes → unequal auditory percepts • place of articulation of stops: /b/ vs /d/ vs /g/ [Figure labels: b, d, g] Liberman, Harris, Hoffman, and Griffith (1957) Journal of Experimental Psychology 54, 358-368

  27. Middle period: search for essence • Impose order on the chaos! • Narrow focus: non-linearity between variation in acoustic signal and perceptual response (categorical perception) • together with a theoretical bias in favour of binary oppositions (categorial) • encouraged a focused search for simple transformations from the encoded signal to an unambiguous, formal linguistic mental representation

  28. This narrower focus • required clear conceptualisation of • identity of the important unit(s) of perception • process of abstraction • On the whole, the units and levels of linguistic description were rather uncritically adopted

  29. …units of linguistic description were rather uncritically adopted “we….had undertaken to find the ‘invariants’ of speech, a term which implies, at least in its simplest interpretation, a one-to-one correspondence between something half-hidden in the spectrogram and the successive phonemes of the message.” Cooper, Delattre, Liberman, Borst & Gerstman, Perception of synthetic speech sounds, JASA (1952) 24, 604-5

  30. …though not without some misgivings “…one should not expect always to be able to find acoustic invariants for the individual phonemes…we are trying to [compile] the code book, one in which there is one column for acoustic entries and another column for message units, whether these be phonemes, syllables, words, or whatever.” Cooper, Delattre, Liberman, Borst & Gerstman, Perception of synthetic speech sounds, JASA (1952) 24, 604-5

  31. To discover the crucial (invariant) properties requires a view of what is fundamental • The basic syllable! ba • CV • in isolation • stressed • possibly with only one V if we’re looking at Cs, and only one C if we’re looking at Vs. Context became seen as variability, so was controlled for ever more stringently

  32. Imposing order on chaos • The basic syllable: ba (context: silence) • What was lost? • polysyllables • unstressed syllables • prosody • accounting for rate changes • connected speech • informativeness of variation, especially in connected speech • meaning • communication • (most things really)

  33. Quantal Theory / acoustic invariance theory • Stevens & Blumstein (1978) … Stevens (2002) • For each DF (distinctive feature) there is a binary response to an invariant acoustic or auditory property • e.g. particular changes in spectral shape over short time periods at crucial parts of the signal • segment boundaries • vowel steady states [Figure labels: change / little change; +consonantal / -consonantal] Stevens (2002) JASA 111, 1872-1891; Stevens & Blumstein (1978) JASA 64, 1358-1368

  34. Acoustic/Auditory invariance theory • Stevens (2002) • landmarks: • islands of reliability • dynamic (relational) • context-sensitive (local context only) • connected speech… [Figure labels: +strident / -strident] Stevens (2002) JASA 111, 1872-1891

  35. Moving on... Stevens (2002) • Good: • landmarks: • islands of reliability • dynamic (relational) • context-sensitive (local context only) • connected speech… • But: • dist. features map to phonology: we need to map to meaning • and exploit other systematic regularities of sound-meaning in connected speech • need other prime units!

  36. Phonetic detail can be very informative (cf. “dunno”). So phonology should not have privileged status in models of speech processing that concern meaning. A musical analogy

  37. Conventional notation “Not the music” “Phonology” Original score Courtesy Sarah Knight Played by human Meaningful signal “The music” “Speech” Transcribed by ‘Sibelius’

  38. Phonetic detail (PD) indicates: • position in syllable; word boundaries • long-domain cues to phonemes/features • grammatical status • discourse function of ‘same’ words • gross and subtle indexical information • places you can join in a conversation • other things crucial to talk in interaction. All these can help lexical access: they provide cues about meaning in mutually-influencing ways. Some groups: York (Local, Ogden); Cambridge (Hawkins); MPI Nijmegen (Ernestus, Schroeder); Christchurch, NZ (Hay); Newcastle (Docherty)

  39. Phonetic detail (PD) indicates: • position in syllable; word boundaries • long-domain cues to phonemes/features • grammatical status • discourse function of ‘same’ words • gross and subtle indexical information • places where you can join in a conversation • other things crucial to talk in interaction. They are all types of invariant, indicating some form or function, each operating within a distinct subsystem. Some groups: York (Local, Ogden); Cambridge (Hawkins); MPI Nijmegen (Ernestus, Schroeder); Christchurch, NZ (Hay); Newcastle (Docherty)

  40. What are these subsystems? • anything that conveys meaning • affect unit selection • examples • discourse • function vs content words • grammar (morphemic status)

  41. Stand-alone “so” in American English • holding-‘so’: louder, higher f0, final glottal closure...; the same speaker continues with more on-topic talk • trailoff-‘so’: quieter, lower f0, no final glottal closure...; there may be a change in who talks. Details in Local (2007) ICPhS http://www.icphs2007.de/ (click on Plenary Speakers)

  42. Connected speech processes for /ð/ in function words differ from those involving similar sequences in content words ban thatch [banθatʃ] ban that [banat] [baθatʃ] ban zips [banzɪps] ban this [banɪs] [bazɪps] [bazɪps] • content function *[banatʃ] *[banɪps]

  43. [Figure labels: “ban these”, “baniz”, “all that”, “latH”] Courtesy John Local; also Manuel (1995) JPhon

  44. [Syllable tree: Syllable → Onset + Rhyme; Rhyme → Nucleus + Coda] phonemes: pat = p a t; tap = t a p

  45. [Same syllable tree] phonemes vs allophones: pat = pʰa ʔt; tap = tʰa ʔp

  46. [Syllable trees as before, plus trees for syllables within longer words] phonemes vs allophones: pat = pʰa ʔt; tap = tʰa ʔp; exPATriate = ɛ ks pʰ a ʔtʰ r iə t; GesTAPo = ɡə s t a p əʊ

  47. [Same trees as slide 46, with the label “morphemes” added] phonemes, allophones, morphemes: pat = pʰa ʔt; tap = tʰa ʔp; exPATriate = ɛ ks pʰ a ʔtʰ r iə t; GesTAPo = ɡə s t a p əʊ

  48. True and pseudo prefixes (Morphemic structure of words in natural sentences) • Productive: mistimes • Unproductive: mistakes

  49. True and pseudo prefixes (Morphemic structure of words in natural sentences) [Figure labels: Pr, Ps] • Same phonemes: /mɪst/ • but PD systematically reflects morphological status: • properties of periodic part • duration • abruptness of [mɪ] boundary • F2 frequency (etc) • relative durations e.g.: • periodicity : aperiodicity (fric) • fricative : silence (closure) • VOT. Baker, Smith, Hawkins ICPhS 2007; Baker 2008 PhD dissertation
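
The relative-duration measures listed on slide 49 can be sketched as simple ratios between adjacent acoustic intervals. The example below is only an illustration: the function name, the dictionary keys and the millisecond values are hypothetical, and none of the numbers are measurements from Baker, Smith and Hawkins.

```python
def relative_durations(durations_ms):
    """Compute the two duration ratios named on the slide.

    `durations_ms` maps interval names to durations in milliseconds:
    'periodic' (voiced part), 'fricative' (aperiodic part), 'closure' (silence).
    """
    periodic = durations_ms["periodic"]
    fricative = durations_ms["fricative"]
    closure = durations_ms["closure"]
    if fricative <= 0 or closure <= 0:
        raise ValueError("fricative and closure durations must be positive")
    return {
        "periodic_to_aperiodic": periodic / fricative,  # periodicity : aperiodicity
        "fricative_to_closure": fricative / closure,    # fricative : silence
    }


# Made-up durations for one token of a "mis-" syllable (not real data).
token = {"periodic": 60.0, "fricative": 90.0, "closure": 45.0}
ratios = relative_durations(token)
print(ratios)
```

Ratios of this kind are one way to compare tokens spoken at different rates, since they express each interval relative to its neighbour rather than in absolute time.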

  50. True and pseudo prefixes (Morphemic structure of words in natural sentences) • intelligibility in noise is worse when phonetic detail and morphemic status are mismatched • details of sound patterns can signal morphemic status • implications for ASR (in adverse conditions?) and synthesis (attentional demands if rhythm disrupted) • analysis: logistic regression. Rachel Baker, PhD 2008; Baker, Hawkins and Smith, 2007, ASA; Wurm, 1997
