There are multiple current theories of speech perception and we will not address all of them, nor can we do justice to the complexity of the experimental findings that have been offered in support of, or against, these theories.
We begin with a basic distinction in theoretical approaches: what is the “object” of speech perception and how do listeners map from the input signal to the object?
MOTOR THEORY
Original proposal: Liberman et al., 1967 (Psychological Review 74)
Revised version: Liberman & Mattingly, 1985 (Cognition 21)
The invariants of perception are the speaker’s intended gestures.
In Motor Theory, a gesture is a class of articulatory movements that corresponds to a linguistically significant change in vocal tract configuration. Listeners reconstruct the intended gestures (the actual set of movements may not occur simultaneously due to coarticulation).
Speech is perceived in a module specialized to detect the speaker’s intended gestures. Speech perception as modular is seen as consistent with theories of the modularity of language.
From Liberman & Mattingly (1985:6): According to the Motor Theory, “speech perception is not to be explained by principles that apply to perception of sounds in general, but must rather be seen as a specialization for phonetic gestures. Incorporating a biologically based link between perception and production, this specialization prevents listeners from hearing the signal as an ordinary sound, but enables them to use the systematic, yet special, relation between signal and gesture to perceive the gesture. The relation is systematic because it results from lawful dependencies among gestures, articulator movements, vocal-tract shapes, and signal. It is special because it occurs only in speech.”
An initial motivation for the Motor Theory came from the coincidence of two findings: acoustic cues did not map easily onto linguistic percepts, and perception appeared to parallel articulation more closely than it paralleled the acoustic signal.
Recall, for example, the general pattern for categorical vs. continuous perception of speech continua, where the phonetic distinctions that are more categorically perceived are those that correspond to more discrete articulations.
Does categorical perception of nonspeech continua substantially weaken this interpretation? Not necessarily. More worrisome is the proposal (consistent with the TOT findings, for example, but not with all nonspeech data) that peaks in discrimination functions are simply properties of the auditory system, rather than specific to the phonetic system.
Some evidence cited in support of Motor Theory:
Perceptual constancy: Differing acoustic signals (e.g., F2 transitions for /di de da do du/) elicit the same linguistic percept.
Categorical perception: Percepts are discrete in ways that correspond to gestures.
McGurk effect: Information from the auditory and visual modes is integrated into a unified percept. Listeners are unaware of the source of the information.
Trading relations
Duplex perception
Perception of sinewave speech
Trading relations:
Categorical perception tests address whether a given acoustic property is sufficient to signal a distinction between sounds X and Y. But how do listeners use the multiple cues available to them in natural speech?
The evidence is that, for a given phonetic distinction, a change in one acoustic property can be perceptually offset by an opposing change in another property: one cue “trades” against the other, yielding an integrated, unitary percept.
Fitch et al. (1980, Perception & Psychophysics 27): slit - split continua
Continuum 1: /s/ + 8-160 ms silence + formants appropriate to /lɪt/
Continuum 2: /s/ + 8-160 ms silence + formants appropriate to /plɪt/
Result: listeners required a longer silence interval to hear split for Continuum 1 (i.e., where formant transitions did not signal /p/).
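The trading pattern in the slit - split result can be illustrated with a toy model. This is a minimal sketch, not the analysis used by Fitch et al.: it assumes that the probability of a “split” response follows a logistic function of two cues, silence duration and the presence of /p/-appropriate formant transitions. The function names and all weight values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical cue weights, chosen for illustration only (not fitted to any data).
W_SILENCE = 0.08       # contribution per ms of silence toward a "split" response
W_TRANSITIONS = 2.0    # contribution of /p/-appropriate formant transitions
BIAS = -5.0            # baseline preference for "slit"


def p_split(silence_ms, has_p_transitions):
    """Probability of a 'split' response under the assumed logistic model."""
    cue = 1.0 if has_p_transitions else 0.0
    z = BIAS + W_SILENCE * silence_ms + W_TRANSITIONS * cue
    return 1.0 / (1.0 + math.exp(-z))


def boundary_ms(has_p_transitions):
    """Silence duration at which 'slit' and 'split' are equally likely."""
    cue = 1.0 if has_p_transitions else 0.0
    return -(BIAS + W_TRANSITIONS * cue) / W_SILENCE


if __name__ == "__main__":
    for label, cue in (("Continuum 1 (/lIt/ formants)", False),
                       ("Continuum 2 (/plIt/ formants)", True)):
        print(f"{label}: boundary at {boundary_ms(cue):.1f} ms of silence")
```

In this sketch the category boundary falls at a longer silence for the continuum without /p/ transitions, which is the direction of the reported result: when one cue is missing, more of the other cue is needed to reach the same percept.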
Beddor and Onsuwan: Let’s try this ourselves, listening for the difference between /b/ and /mb/. There will be 36 items. Write your responses in 4 columns, each with 9 items (1-9, 10-18, 19-27, 28-36).
Beddor and Onsuwan (results for Ikalanga listeners):
[Figure: % /mb/ responses plotted against increasing duration of nasal murmur; the figure is also labeled “Vowel nasalization”.]
Duplex perception:
In this paradigm, a synthetic formant transition (e.g., F3 transition) (A) is presented to one ear while the other ear receives the remainder or “base” (B). F3 provides a critical cue for the /da-ga/ distinction. When both (A) and (B) are presented, listeners report hearing two simultaneous sounds: /d/ or /g/, depending on which F3 transition is presented, in the ear that gets (B), and a non-speech chirp in the ear that gets (A).
Of interest to Motor Theorists is that perception is duplex and not triplex. The base syllable either fuses with the transition for a /d-g/ percept or the transition (chirp) alone is heard, but the base is not heard separately. Thus duplex percepts do not simply indicate that the auditory system can fuse dichotic stimuli. Motor Theorists propose that there are two modules, one for speech and the other for other auditory information, that use the same input to produce two simultaneous representations.
Sinewave speech:
Of initial interest is that sinewave replicas can be heard as speech at all.
Check it out at http://macserver.haskins.yale.edu/haskins/MISC/SWS/SWScore.html
Of special interest from the perspective of Motor Theory is that sinewave replicas are sometimes heard as speech by some listeners, but as nonspeech by other listeners. When sinewave stimuli are heard as speech, they tend to be perceived categorically and enter into trading relations. For listeners who hear sinewave replicas as nonspeech, perception is more continuous and the relevant acoustic cues do not show a trading relation (e.g., Best, Morrongiello, and Robson, 1981, Perception & Psychophysics 29; but see also the discussion by Hawkins, p. 209).
DIRECT REALISM
C. A. Fowler (1986) Journal of Phonetics 14.
C. A. Fowler (1996) Journal of the Acoustical Society of America 99.
Speech perception is direct, but it is not special.
Direct Realism is embedded in Gibson’s more general theory of perception according to which perceivers gain direct information from the world around them; all perception involves direct recovery of the distal source of the event being perceived.
Direct Realism shares with Motor Theory the claim that the objects of speech perception are linguistic gestures. However, the objects are not the intended gestures of Motor Theory. Rather, Direct Realism assumes that the actual gestures of the vocal tract have invariant characteristics.
Fowler (1996:1737): In Direct Realism, “as in any viable theory, the acoustic signal plays a pivotal role in speech perception. The acoustic signal is, after all, what the ear transduces; ears do not transduce articulations. [Auditory theories and Direct Realism] do not disagree on this point; they disagree on what the acoustic signal counts as for the perceiver. For acoustic theorists, it counts as a perceptual object; for me it counts as the specifier of speech events.”
Evidence cited in support of Direct Realism:
Perceptual constancy
McGurk effect
Trading relations
“Parsing” of the acoustic signal along gestural lines: perceptual compensation
Perceptual compensation for coarticulation
Listeners are sensitive to coarticulatory variability. We have already seen this in our perception of [ə]s excised from [əbV] contexts: we were reasonably accurate in identifying at least certain articulatory properties of deleted vowels.
There is also considerable evidence that listeners accommodate or “compensate” for the coarticulatory influences of one sound on another, attributing the acoustic effects to their coarticulatory source. Consider, for example, the effects of back rounded /u/ on a preceding fricative: anticipation of /u/ lowers the frequency of the noise. Mann & Repp (1980, Perception & Psychophysics 28) found that, when listeners identify members of a /s-ʃ/ continuum, they offer more /s/ responses in the context of /u/ than in the context of /i/.
WHY MIGHT THIS BE THE CASE?
More on perceptual compensation for coarticulation
A similar outcome suggesting that listeners attribute coarticulatory variation to its source holds for vowel-to-vowel coarticulation.
Let’s try this out on ourselves:
For each of these “words”, do you hear [popi] or [pepi]?
For this next set of “words”, do you hear [popa] or [pepa]?
In which set did you report more [e] responses? WHY?
Here’s how a larger group of English listeners responded (Beddor, Harnsberger, and Lindemann, 2002, Journal of Phonetics 30):
[Figure: percent /e/ responses for the /popa/-/pepa/ and /popi/-/pepi/ continua.]