Speech Analysis and Processing for Knowledge Discovery. Phonetic perspectives on modelling information in the speech signal. Sarah Hawkins, University of Cambridge, sh110@cam.ac.uk. ISCA ITRW Workshop, Aalborg, 4-6 June 2008.
Contents* • This ppt presentation contains all the slides shown during my talk at the ITRW, in the order in which they were shown. • But there are additional slides interspersed with those that were shown, which amplify the points made during the talk. The additional slides are marked with an asterisk (*) at the end of the title line, as on this slide • Some connecting comments are added in the Notes sections
Speech perception: common assumptions • first focus is on phonological or lexical form • identify the words • ‘do grammar’ and all that stuff afterwards • ‘essence’ + variation • preserve the essence (phonemes or features) • discard the variation • use “higher-order knowledge” to deal with resultant ambiguities in phoneme strings Fine for where ASR models tend to start: small vocabulary, one speech style, a few speakers... but does not generalise
Overview • history: how we got to those assumptions • “phonetic detail” (the sensory signal) tells us much more than phonological form • how brains seem to use the sensory signal • implications for modelling
Main message: speech research • legacy of simplification and search for essence • but essence became defined as phonological: GOAL became phonological form • but the real goals are: • understanding meaning (broadly defined) • successful interaction • so we need a different framework
Goals: meaning and successful interaction A different framework, different prime units: • functional (“polysystemic”) • adaptive to changing circumstances Much of the information is in the signal: “phonetic detail” (PD). The brain structures that information (perception = creation of illusions?)
Salience of PD is task specific: situation and function of response • I…do…not…know • I do not know • I don’t know • I dunno • dunno • [an] • [] but not *[] or *[mmm]
Speech perception research: history • 1950-1965 Broad-based exploration • 1965-1990s Narrowed to focus on the search for invariance in the relationship between speech signal and its percept: THEORY • 1995…. broader focus again • to include ‘discrepant’ data & new understanding • which requires changes in conceptualization of • task goals • processes involved • THEORY Hawkins (2004) Puzzles and patterns in 50 years of research on speech perception. http://www.rle.mit.edu/soundtosense/conference/pages/invited.htm
Early work: Glorious Discovery • often looked at effects on the whole signal • but as puzzles arose, and we looked more closely, then attention became focused on small domains in an effort both to simplify and to clarify
Examples of early work • Source separation: Cherry (1953) JASA 25, 975-979. Cocktail party effect / multi-talker perception: observations relevant to memory, attention, transitional probability, speaker vs message • Source integration: Sumby & Pollack (1954) JASA 26: 212-215. Audiovisual presentation increases intelligibility (visual contribution is relative to available auditory contribution); in auditory-only presentations, polysyllables are more intelligible than monosyllables (overall shape... neighbourhoods… cohorts…) • Intelligibility: importance of context and meaning: possible responses; immediate phonetic context; preceding (and following) context... (many studies) • Memory: relationships; recoding into larger units. Miller (1956)
Early work: source separation* Cocktail party effect / multi-talker perception Cherry (1953) • continuous natural speech, with different types of content, presented in different ways • a huge wealth of observations relevant to • memory • attention • transitional probabilities • speaker vs message Cherry (1953) JASA 25, 975-979
Early work: source separation* Cocktail party effect / multi-talker perception Broadbent & Ladefoged (1957) • separate synthetic formants fuse to sound like a single vowel when presented to the same or different ears, only if they have the same f0 • compared ‘natural’ and ‘sustained’ formants • extensions to theories of hearing (e.g. Licklider) Broadbent & Ladefoged (1957) JASA 29, 708-710 Darwin (1981) QJEP 33, 185-207 Bregman (1990) Auditory Scene Analysis ASA special session, 2004 Cooke & Ellis (2001) Sp. Comm. 35, 141–177
Early work: source integration* Sumby & Pollack (1954) Especially in high levels of noise: • audiovisual presentation increases intelligibility (visual contribution is relative to the available auditory contribution) • in auditory-only presentations, polysyllables are more intelligible than monosyllables (overall shape... neighborhoods… cohorts…) Sumby & Pollack (1954) JASA 26: 212-215 Massaro (1998) Perceiving Talking Faces Widespread AV groups and applications Richard Warren, Paul Luce, Marslen-Wilson
Early work: memory* Miller (1956) • short term memory span for unrelated items: The Magical Number Seven ± Two • can increase this span by: • making relative rather than absolute judgments • increasing the number of dimensions • chunking into larger items • recoding is a crucial process Miller (1956) Psychological Review 63, 81-97 Serial learning and recall (e.g. Underwood) Lashley (1951) Serial order in behavior Pisoni (1973) and later
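The recoding idea can be illustrated with a minimal sketch: a sequence of binary digits is regrouped into three-digit chunks, each renamed as a single octal digit, so the same information occupies fewer items in memory. The function name and the example digit string are illustrative choices, not taken from the slides.

```python
def recode_binary_to_octal(bits: str) -> list[str]:
    """Recode binary digits into octal chunks (three bits per chunk)."""
    if len(bits) % 3 != 0:
        raise ValueError("length must be a multiple of 3 for 3-bit chunks")
    # Each group of three binary digits is replaced by one octal digit.
    return [str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3)]


bits = "101000100111001110"  # illustrative input: 18 unrelated binary items
chunks = recode_binary_to_octal(bits)
print(len(bits), "binary items ->", len(chunks), "octal items:", "".join(chunks))
```

Eighteen items exceed a span of about seven; the six recoded chunks fall within it, which is the point of the "chunking into larger items" bullet.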
Early work: intelligibility* Context of Possible Responses Miller, Heise & Lichten (1951) • monosyllables • size of test vocabulary affects identification • 2…256…all monosyllables • though presumably there are limits: • two vs six • five vs nine ! Miller, Heise & Lichten (1951) J. Exp. Psych. 41, 329-335
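One way to see why the size of the response set matters is to compute how much uncertainty each response has to resolve. The sketch below assumes equally likely alternatives, which is a simplification; the function names and the list of vocabulary sizes are illustrative, not figures reported on the slide.

```python
import math


def response_uncertainty_bits(vocab_size: int) -> float:
    """Uncertainty (in bits) of choosing among equally likely alternatives."""
    return math.log2(vocab_size)


def chance_accuracy(vocab_size: int) -> float:
    """Proportion correct expected from pure guessing."""
    return 1.0 / vocab_size


# Illustrative vocabulary sizes between the slide's endpoints of 2 and 256.
for n in (2, 4, 16, 256):
    print(f"{n:>3} words: {response_uncertainty_bits(n):.0f} bits, "
          f"chance = {chance_accuracy(n):.4f}")
```

With two alternatives a listener needs only one bit from the signal; with 256 the signal must supply eight, so the same degraded signal supports far fewer correct identifications.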
Early work: intelligibility* Phonetic Context Pickett & Pollack (1963) • excerpts from connected speech must be ≥ 800 ms long to be fully intelligible • regardless of rate: • faster rates need more syllables to be understood (slowing the speech down does not help) crucial role of coarticulation & style (‘connected speech processes’) Pickett & Pollack (1963) Language & Speech 6, 165-171
Early work: preceding context* affects the interpretation of the current sound Ladefoged and Broadbent (1957) • carrier phrase “Please say what this word is:” followed by a test word (bit, bet, bat, but) • F1 of CARRIER 200-380 Hz: test word heard as “bet” • F1 of CARRIER 380-660 Hz: test word heard as “bit” Ladefoged and Broadbent (1957) JASA 29, 98-104
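The carrier effect can be sketched as a toy normalisation: the same test-word F1 is judged relative to the F1 range of the preceding carrier, so it counts as "high" after a low-F1 carrier and "low" after a high-F1 carrier. This is only an illustration of the principle; the test F1 of 375 Hz, the 0.5 threshold, and the function names are assumptions, not values or procedures from the original study.

```python
def relative_f1(test_f1_hz: float, carrier_f1_range_hz: tuple[float, float]) -> float:
    """Position of the test F1 within the carrier's F1 range (0 = bottom, 1 = top)."""
    low, high = carrier_f1_range_hz
    return (test_f1_hz - low) / (high - low)


def perceived_vowel(test_f1_hz: float,
                    carrier_f1_range_hz: tuple[float, float],
                    threshold: float = 0.5) -> str:
    """Toy decision rule: relatively high F1 -> 'bet', relatively low F1 -> 'bit'."""
    # The threshold is a hypothetical value chosen for illustration.
    if relative_f1(test_f1_hz, carrier_f1_range_hz) > threshold:
        return "bet"
    return "bit"


TEST_F1_HZ = 375.0  # hypothetical F1 of the physically identical test word
print(perceived_vowel(TEST_F1_HZ, (200.0, 380.0)))  # low-F1 carrier
print(perceived_vowel(TEST_F1_HZ, (380.0, 660.0)))  # high-F1 carrier
```

The acoustic token never changes; only its interpretation relative to the carrier does, which is the sense in which preceding context affects the current sound.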
Early work: immediate context* determines the interpretation of the current stimulus Synthesizing bursts and transitionless vowels Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606
Early work: immediate context* determines the interpretation of the current stimulus Identification of bursts and transitionless vowels: the CV is identified as the minimal acoustic unit Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606
Early work: immediate context* determines the interpretation of the current stimulus Identification of burstless stops with different vowels: transitions are all you need! Delattre, Liberman, & Cooper (1955) JASA 27, 769-773
Categorical Perception* of obstruent consonants Equal acoustic changes, unequal auditory percepts • place of articulation of stops: /b/ vs /d/ vs /g/ Liberman, Harris, Hoffman, and Griffith (1957) Journal of Experimental Psychology 54, 358-368
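The contrast between equal acoustic steps and unequal percepts can be sketched with a logistic identification function along a synthetic continuum. This is a simplified two-category illustration (one boundary, e.g. between /b/ and /d/); the boundary location, slope, and number of steps are hypothetical parameters, not fitted to any data.

```python
import math


def identification(step: float, boundary: float = 4.5, slope: float = 2.5) -> float:
    """Probability of the second category at a given continuum step (logistic)."""
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))


steps = list(range(1, 9))  # eight equally spaced acoustic steps
probs = [identification(s) for s in steps]

# Change in identification between each pair of adjacent, acoustically equal steps.
diffs = [later - earlier for earlier, later in zip(probs, probs[1:])]

for s, d in zip(steps, diffs):
    print(f"steps {s}-{s + 1}: change in identification = {d:.3f}")
```

Every adjacent pair differs by the same acoustic amount, yet the change in labelling is concentrated in the pair that straddles the boundary and is close to zero elsewhere.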
Middle period: search for essence • Impose order on the chaos! • Narrow focus: non-linearity between variation in acoustic signal and perceptual response (categorical perception) • together with a theoretical bias in favour of binary oppositions (categorial) • encouraged a focused search for simple transformations from the encoded signal to an unambiguous, formal linguistic mental representation
This narrower focus • required clear conceptualisation of • identity of the important unit(s) of perception • process of abstraction • On the whole, the units and levels of linguistic description were rather uncritically adopted
…units of linguistic description were rather uncritically adopted “we….had undertaken to find the ‘invariants’ of speech, a term which implies, at least in its simplest interpretation, a one-to-one correspondence between something half-hidden in the spectrogram and the successive phonemes of the message.” Cooper, Delattre, Liberman, Borst & Gerstman, Perception of synthetic speech sounds, JASA (1952) 24, 604-5
…though not without some misgivings “…one should not expect always to be able to find acoustic invariants for the individual phonemes…we are trying to [compile] the code book, one in which there is one column for acoustic entries and another column for message units, whether these be phonemes, syllables, words, or whatever.” Cooper, Delattre, Liberman, Borst & Gerstman, Perception of synthetic speech sounds, JASA (1952) 24, 604-5
to discover the crucial (invariant) properties requires a view of what is fundamental • The basic syllable! ba • CV • in isolation • stressed • possibly with only one V if we’re looking at Cs, and only one C if we’re looking at Vs Context became seen as variability, so was controlled for ever more stringently
Imposing order on chaos • The basic syllable: ba (context: silence) • What was lost? • polysyllables • unstressed syllables • prosody • accounting for rate changes • connected speech • informativeness of variation, especially in connected speech • meaning • communication • (most things really)
Quantal Theory / acoustic invariance theory Stevens & Blumstein (1978) … Stevens (2002) [Figure labels: change / little change; +consonantal / -consonantal] • For each distinctive feature (DF) there is a binary response to an invariant acoustic or auditory property • e.g. particular changes in spectral shape over short time periods at crucial parts of the signal • segment boundaries • vowel steady states Stevens (2002) JASA 111, 1872-1891 Stevens & Blumstein (1978) JASA 64, 1358-1368
Acoustic/Auditory invariance theory Stevens (2002) [Figure labels: +strident / -strident] • landmarks: • islands of reliability • dynamic (relational) • context-sensitive (local context only) • connected speech… Stevens (2002) JASA 111, 1872-1891
Moving on... Stevens (2002) Good: • landmarks: • islands of reliability • dynamic (relational) • context-sensitive (local context only) • connected speech… But: • dist. features map to phonology: we need to map to meaning • and exploit other systematic regularities of sound-meaning in connected speech • need other prime units!
Phonetic detail can be very informative (cf. “dunno”), so phonology should not have privileged status in models of speech processing that concern meaning. A musical analogy:
[Figure: musical analogy] Conventional notation (original score; transcribed by ‘Sibelius’): “Not the music”, “Phonology” • Played by human (meaningful signal): “The music”, “Speech” Courtesy Sarah Knight
Phonetic detail (PD) indicates: • position in syllable; word boundaries • long-domain cues to phonemes/features • grammatical status • discourse function of ‘same’ words • gross and subtle indexical information • places you can join in a conversation • other things crucial to talk in interaction All these can help lexical access: they provide cues about meaning in mutually-influencing ways. Some groups: York (Local, Ogden); Cambridge (Hawkins); MPI Nijmegen (Ernestus, Schroeder); Christchurch, NZ (Hay); Newcastle (Docherty)
Phonetic detail (PD) indicates: • position in syllable; word boundaries • long-domain cues to phonemes/features • grammatical status • discourse function of ‘same’ words • gross and subtle indexical information • places where you can join in a conversation • other things crucial to talk in interaction They are all types of invariant, indicating some form or function, each operating within a distinct subsystem. Some groups: York (Local, Ogden); Cambridge (Hawkins); MPI Nijmegen (Ernestus, Schroeder); Christchurch, NZ (Hay); Newcastle (Docherty)
What are these subsystems? • anything that conveys meaning • affect unit selection • examples • discourse • function vs content words • grammar (morphemic status)
Stand-alone “so” in American English • holding-‘so’: louder, higher f0, final glottal closure... the same speaker continues with more on-topic talk • trail-off-‘so’: quieter, lower f0, no final glottal closure... there may be a change in who talks Details in Local (2007) ICPhS http://www.icphs2007.de/ (click on Plenary Speakers)
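The mapping from phonetic detail to discourse function can be sketched as a toy rule that counts how many of the three listed cues point towards a holding-‘so’. This is purely illustrative: the input representation (loudness and f0 relative to the speaker's own baseline), the zero cut-offs, and the two-out-of-three rule are assumptions, not Local's analysis.

```python
def classify_so(relative_loudness_db: float,
                relative_f0_semitones: float,
                final_glottal_closure: bool) -> str:
    """Toy classifier for stand-alone 'so' from three phonetic cues.

    Loudness and f0 are assumed to be measured relative to the speaker's
    own baseline (positive = louder / higher than usual).
    """
    holding_cues = 0
    if relative_loudness_db > 0:      # louder than baseline
        holding_cues += 1
    if relative_f0_semitones > 0:     # higher f0 than baseline
        holding_cues += 1
    if final_glottal_closure:         # ends in a glottal closure
        holding_cues += 1
    # Hypothetical decision rule: a majority of cues decides the category.
    return "holding" if holding_cues >= 2 else "trail-off"


print(classify_so(3.0, 2.0, True))     # louder, higher, glottal closure
print(classify_so(-3.0, -2.0, False))  # quieter, lower, no closure
```

A real system would learn cue weights from labelled conversational data rather than count cues, but the sketch shows how the ‘same’ word can be routed to different discourse functions from detail that a phoneme string discards.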
Connected speech processes for /ð/ in function words differ from those involving similar sequences in content words ban thatch [banθatʃ] ban that [banat] [baθatʃ] ban zips [banzɪps] ban this [banɪs] [bazɪps] [bazɪps] • content function *[banatS] *[banɪps]
[Figure: spectrogram examples labelled “banthese”, “baniz”, “allthatlatH”] courtesy John Local; also Manuel (1995) JPhon
[Figure, built up over four slides: syllable trees (Syllable = Onset + Rhyme; Rhyme = Nucleus + Coda) for ‘pat’ and ‘tap’, with the levels labelled phonemes, allophones, morphemes] • phonemes: pat = p a t; tap = t a p • allophones: pat = pʰ a ʔt; tap = tʰ a ʔp • the same sequences inside longer words: exPATriate = ɛ ks pʰ a ʔtʰ r iə t; GesTAPo = ɡə s t a p əʊ
True and pseudo prefixes (Morphemic structure of words in natural sentences) • Productive: mistimes • Unproductive: mistakes
True and pseudo prefixes (Morphemic structure of words in natural sentences) [Figure labels: Pr / Ps] • Same phonemes: /mɪst/ but • PD systematically reflects morphological status: • properties of periodic part • duration • abruptness of [mɪ] boundary • F2 frequency (etc) • relative durations e.g.: • periodicity : aperiodicity (fric) • fricative : silence (closure) • VOT Baker, Smith, Hawkins ICPhS 2007; Baker 2008 PhD dissertation
True and pseudo prefixes (Morphemic structure of words in natural sentences) • intelligibility in noise worse when mismatched (logistic regression) • details of sound patterns can signal morphemic status • implications for ASR (in adverse conditions?) and synthesis (attentional demands if rhythm disrupted) Rachel Baker, PhD 2008; Baker, Hawkins and Smith, 2007, ASA; Wurm, 1997