Theories of Speech Perception Part III.
H(yperspeech) & H(ypospeech) Theory
B. Lindblom, 1990, "Explaining phonetic variation: a sketch of the H&H theory," in W. J. Hardcastle & A. Marchal (eds.), Speech Production and Speech Modelling.
Speech varies along a hyper-/hypo-speech continuum. Hyper-speech is output oriented: the speaker controls production to yield sufficient contrast for the listener. Hypo-speech is production oriented: "unconstrained, a motor system tends to default to a low-cost form of behavior." Speech production is adaptive (speakers have a choice), and the lack of acoustic invariance in speech is due to these adaptations. Speech perception involves understanding meaning, and listeners, like speakers, cooperate in communication, using their knowledge to disambiguate ambiguous information and to fill in missing stimulus information.
Some evidence cited in support of H&H Theory (which is as much a theory of production as a theory of perception):
Production: Clear speech (Moon & Lindblom, 1989). Vowels produced by speakers who were asked to speak as clearly as possible (to overarticulate) show less formant undershoot than vowels in non-clear speech (a toy sketch of undershoot follows below).
Perception: Factors independent of the acoustic signal (e.g., word frequency and lexical structure) influence perceptual processing. Acoustic separability of stop consonants: large coarticulatory effects across vowels mean that there are no absolute invariants, but the acoustic differences meet the condition of "sufficient contrast."
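To make "formant undershoot" concrete, here is a minimal numeric sketch, loosely inspired by Moon & Lindblom's duration-dependent modeling of undershoot. The function name, parameter values, and the "effort" term are invented for illustration; this is not their fitted model.

```python
import math

# Toy undershoot model: the observed formant falls short of its target by an
# amount that grows as vowel duration shrinks and shrinks as articulatory
# effort (hyper-speech) increases. All values below are illustrative.

def observed_f2(target_f2, locus_f2, duration_ms, effort):
    """Predicted F2 (Hz) at vowel midpoint.

    effort: higher for clear (hyper) speech, lower for casual (hypo) speech.
    """
    undershoot = (locus_f2 - target_f2) * math.exp(-effort * duration_ms / 100)
    return target_f2 + undershoot

# A short /i/-like vowel (F2 target 2200 Hz) after a labial (low F2 locus).
for label, effort in [("hypo (casual)", 0.8), ("hyper (clear)", 2.0)]:
    f2 = observed_f2(target_f2=2200, locus_f2=1200, duration_ms=80, effort=effort)
    print(f"{label:15s} midpoint F2 = {f2:.0f} Hz (target 2200)")
```

In this sketch the clear-speech vowel comes out near 2000 Hz versus roughly 1670 Hz for casual speech, i.e., closer to the 2200 Hz target: less undershoot, which is the pattern Moon & Lindblom report for overarticulated vowels.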
H&H vs. other theories that we have considered:
Lindblom (1990:431): Motor Theory, Quantal Theory, and Direct Realism "have one theme in common: They all share the assumption that the ultimate solution will be found in the signal. In contrast, the H&H theory … says that—whether communicatively successful or not—[the] adaptive behavior [of the speaker] is the reason for the alleged lack of invariance in the speech signal. Hence it predicts that the quest for signal-based definitions of invariance will continue to remain unsuccessful as a matter of principle. In the H&H model the need to solve the invariance issue disappears. But the problem is replaced by another…: That of describing the class of speech signals that satisfy the condition of 'sufficient discriminative power.'"
Thumbnail Sketch
Motor Theory: Speech is special; the object of perception is the intended gesture. Invariants are not in the acoustic signal or actual articulation, but in intended gestures.
Direct Realism: Speech isn't special; the object of perception is the actual gesture. Invariant gestures are hypothesized to cause acoustic specifiers or invariants.
Auditory Enhancement Theory: Speech isn't special; the object of perception is the acoustic-auditory signal. Does not assume acoustic invariants.
Quantal Theory: Speech isn't special; the object of perception is the acoustic-auditory signal. Hypothesizes acoustic-auditory invariants.
H&H Theory: Speech isn't special; no one primary unit or object of perception. No absolute invariants in the (acoustic/auditory or articulatory) signal.
Let's shift our focus now to theoretical approaches to speech perception as they relate to the structure of phonological systems. These theoretical approaches differ considerably in terms of what they aim to account for, so a direct comparison in terms of how they "stack up" against the empirical evidence is not appropriate. Instead, our emphasis will be to understand the motivation for each approach, how it works, and its strengths and weaknesses. We will start with the theories we have already considered, and then will move on to alternative perspectives which are not so much theories of perception as theories of the role of listeners in sound systems.
AUDITORY THEORIES
Some of the strongest evidence in favor of, for example, auditory enhancement theory is phonological: distinct vocal tract gestures that give rise to similar acoustic-auditory properties co-occur in (some aspects of) consonant and vowel systems. We'll revisit this shortly.
H&H THEORY
Strong evidence for the notion of sufficient perceptual contrast comes from the structure of vowel systems. In general, the vowels that occur in systems of a given size (e.g., 5-vowel, 7-vowel, 9-vowel systems) are reasonably well predicted by a theory of vowel "dispersion" (see the sketch below).
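As a concrete, if crude, way to see what dispersion predicts, here is a minimal sketch that selects the n-vowel inventory maximizing the smallest pairwise distance in a Bark-scaled F1/F2 space. The candidate set and formant values are rough textbook-style approximations I have supplied; this is not Lindblom's actual model or measured data.

```python
from itertools import combinations
import math

# Candidate vowel qualities with rough (F1, F2) values in Hz.
# These formant values are illustrative approximations, not measured data.
CANDIDATES = {
    "i": (280, 2250), "e": (400, 2100), "ɛ": (550, 1900),
    "a": (750, 1300), "ɔ": (550, 900),  "o": (400, 800),
    "u": (300, 850),  "y": (280, 1900), "ɯ": (300, 1400),
}

def bark(f):
    """Hz to Bark (Traunmüller's 1990 approximation of the auditory scale)."""
    return 26.81 * f / (1960 + f) - 0.53

def distance(v1, v2):
    """Euclidean distance between two vowels in Bark-scaled F1/F2 space."""
    (f1a, f2a), (f1b, f2b) = CANDIDATES[v1], CANDIDATES[v2]
    return math.hypot(bark(f1a) - bark(f1b), bark(f2a) - bark(f2b))

def most_dispersed(n):
    """Return the n-vowel inventory whose closest pair is farthest apart."""
    return max(combinations(CANDIDATES, n),
               key=lambda inv: min(distance(a, b) for a, b in combinations(inv, 2)))

for n in (3, 5, 7):
    print(n, most_dispersed(n))
```

Even this crude criterion tends to pick peripheral, well-separated inventories (e.g., /i a u/ for n = 3), which is the flavor of result that dispersion theory uses to explain recurring vowel system shapes.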
GESTURAL THEORIES
Motor Theory and Direct Realism are theories of speech perception, not theories of, for example, the role of the listener in phonological structure or the organization of phonological systems. However, a criticism that has been leveled against gestural theories is that, by taking the (intended or actual) gesture as the object of speech perception, gesturalists cannot account for the many clear examples of phonological patterns triggered by acoustic/auditory properties. Such patterns would seem to be especially problematic for Direct Realism, where perception involves direct recovery of the distal source of the event being perceived, i.e., the vocal tract gestures.
A few examples of auditory/acoustic constraints in sound systems:
The co-occurrence of front/unround and back/round vowels, with the back/round combination yielding especially low F2 frequencies. A similar point holds for the co-occurrence of labial and velar constrictions in various consonantal articulations: again, this articulatory combination yields a particularly low F2.
Of /p t k/, /p/ is the most often absent from stop inventories (a weak release burst plus F1 cutback provide few stop cues). Compare the prevalence of the high-intensity fricatives /s ʃ/. (From Ohala, 1996, JASA 99.)
Maximal dispersion in vowel inventories (Lindblom, Ohala, others).
Confusions due to acoustic/auditory similarity, rather than articulatory similarity, that clearly result in sound changes. From Ohala, 1996: palatalized labials (mʲ, pʲ) > apicals (n, t).
There are many more examples that could be offered along these lines. Does evidence of the role of acoustic/auditory properties in phonological systems pose a significant problem for gestural accounts? Fowler: No.
Fowler (1996, JASA): "In the theory of direct perception, auditory perception in general, and speech perception in particular, can be only as successful as the specifying information provided by the acoustic signal. If two similar gestures structure the air in very distinctive ways, then listeners will have no difficulty knowing which was produced. If a gesture does not structure the air in noticeable ways, it is likely to go unnoticed. … The reason why /p/ is more often omitted from language inventories than other voiceless consonants may be exactly because the information for it in the signal is weak."
Let's explore a clear example of acoustic/auditory similarity in phonological systems, nasal vowel height, from a gesturalist perspective.
We've noted that nasal vowels often differ in height from their oral vowel counterparts in a language. French provides extreme examples of this (fine [fin] / fin [fɛ̃]; brune [bʁyn] / brun [bʁœ̃]), but many languages show similar height effects. We've also noted why: FN/F1 proximity lowers the perceived height of non-low vowels and raises the perceived height of low vowels. Put another way, changes in tongue body height and velum position can have similar effects on the F1 region of the vowel spectrum (or on the auditory processing of this region); a toy numeric illustration follows below. How would a gesturalist handle this?
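Before turning to the gesturalist answer, here is the toy sketch just mentioned. It is my own illustration, with invented frequencies and amplitudes: perceived vowel height is proxied by the amplitude-weighted center of gravity (COG) of the low-frequency spectrum, and an added nasal pole (FN) shifts that COG much as a genuine change in F1 would.

```python
# Toy illustration (not a model from the readings) of why nasalization can
# mimic a change in tongue height: a nasal pole near F1 pulls the
# low-frequency spectral center of gravity just as a shift in F1 would.

def low_freq_cog(peaks):
    """Amplitude-weighted center of gravity of spectral peaks below ~1 kHz.

    peaks: list of (frequency_hz, linear_amplitude) tuples.
    """
    low = [(f, a) for f, a in peaks if f < 1000]
    total = sum(a for _, a in low)
    return sum(f * a for f, a in low) / total

# Oral non-low vowel: a single low-frequency prominence at F1 = 400 Hz.
oral = [(400, 1.0)]

# Nasalized version: an extra nasal pole (FN) at 650 Hz broadens the F1
# region and pulls the COG upward.
nasal = [(400, 1.0), (650, 0.6)]

# Oral vowel with a genuinely higher F1 (lower tongue body), no nasalization.
lower_tongue = [(494, 1.0)]

print(f"oral, F1=400:          COG = {low_freq_cog(oral):.0f} Hz")
print(f"nasalized, FN=650:     COG = {low_freq_cog(nasal):.0f} Hz")
print(f"oral, F1=494 (lower):  COG = {low_freq_cog(lower_tongue):.0f} Hz")
# The nasalized vowel and the more open oral vowel end up with very similar
# low-frequency COGs: two different gestures, one auditory effect.
```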
Key to a gesturalist account of nasal vowel height patterns is that speech perception can only be as successful as the specifying information in the acoustic/auditory signal. Now let's consider what this means for a particular population of listeners.
Native speakers of English have lots of experience with nasal vowels, but primarily in the context of nasal consonants. If English-speaking listeners hear nasal vowels in the context of a clearly audible nasal consonant, they should be able to parse the signal along the correct gestural lines: that is, they would attribute the effects of nasalization on the low-frequency spectrum of the vowel to the consonant. But if listeners hear nasal vowels in an oral context, in the absence of a coarticulatory source for the nasalization, the "specifying information" in the acoustic signal may not be sufficient.
Experimental findings are consistent with this view. In a study I conducted years ago with colleagues at Haskins Labs, English-speaking listeners were asked to identify vowels under three conditions:
an oral [ɛ]-[æ] continuum embedded in a [b_d] context (bed-bad)
a nasal [ɛ̃]-[æ̃] continuum embedded in a [b_d] context (bed-bad)
a nasal [ɛ̃]-[æ̃] continuum embedded in a [b_nd] context (bend-band)
Beddor, Krakow, & Goldstein (Phonology Yearbook, 1986): Listeners heard the same nasal vowels as lower (i.e., more /æ/ than /ɛ/ responses) in an oral consonant context than in a (coarticulatory) nasal consonant context.
Importantly: The point of this illustration is to show how a gestural account would handle a phonological pattern that is clearly motivated by similarity between the acoustic-auditory effects of two very different vocal tract gestures. But other accounts are also compatible with these findings.
A "strong" auditory account would not interpret the phenomenon in terms of coarticulation, but would rather proceed along the lines of the example that we considered yesterday for the /alg-da/ and /arg-da/ stimuli (although anticipatory coarticulation in the case of the nasal vowel poses something of a problem here).
Another auditory approach is to view the phenomenon as coarticulatory in origin, but to nonetheless conclude that speech perception does not involve recovery of vocal tract gestures. This approach is advocated by John Ohala.
J. J. Ohala: The listener as a source of sound change (1981, CLS Parasession)
The acoustic signal is inherently ambiguous with regard to vocal tract configuration. The acoustic signal is also highly variable. Listeners are normally able to "factor out" the acoustic distortions, as long as they detect the source of the distortion:
Speaker: intends /ut/; the vocal tract distorts it into [yt].
Listener: hears [yt]; reconstructs it as /ut/.
Ohala: But if listeners do not detect the source of the distortion, it will not be factored out. This "mini sound change", if copied by other speakers, could become a systematic change in a language (as appears to happen in the scenario below, as well as in nasal vowel height shifts):
Speaker: intends /ut/; the vocal tract distorts it into [y(t)].
Listener: hears [y]; interprets it as /y/; later produces [y].
Listeners may also "correct" for an imagined distortion, potentially introducing a sound change. Such sound changes would emerge as dissimilations:
Speaker: intends /yt/; produces [yt].
Listener: hears [yt]; "corrects" it to /ut/; later produces [ut].
A toy simulation of all three listener scenarios follows below.
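The three scenarios (correction, hypocorrection, hypercorrection) can be summarized in a small simulation. This is my own toy sketch of the logic, not a formalism from Ohala; the fronting rule and function names are invented.

```python
def speak(intended):
    """Vocal tract 'distortion': /u/ surfaces as [y] before a following /t/."""
    out = list(intended)
    for i in range(len(out) - 1):
        if out[i] == "u" and out[i + 1] == "t":
            out[i] = "y"
    return out

def listen(signal, corrects):
    """Listener reconstruction. If the correction rule is applied, any [y]
    before /t/ is attributed to coarticulation and undone to /u/."""
    out = list(signal)
    for i in range(len(out) - 1):
        if out[i] == "y" and out[i + 1] == "t" and corrects:
            out[i] = "u"
    return "".join(out)

# 1. Correction: the distortion is detected and factored out; no change.
print(listen(speak("ut"), corrects=True))    # ut  (intended /ut/ recovered)

# 2. Hypocorrection: the conditioning /t/ goes unnoticed, so [y] is taken at
#    face value and later reproduced: a mini sound change /ut/ > /y(t)/.
print(listen(speak("ut"), corrects=False))   # yt  (reinterpreted as /y(t)/)

# 3. Hypercorrection: the speaker intends /yt/, but the listener imagines a
#    distortion and "corrects" anyway: dissimilatory /yt/ > /ut/.
print(listen(speak("yt"), corrects=True))    # ut
```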
Ohala (1981:196-197): "… the listener plays an important role in sound change. First, the listener recognizes and thus factors out of the speech signal inherent phonetic variability that would, except for his vigilance, have led to sound change. Second, the listener unknowingly participates in sound change by faithfully copying inherent phonetic variation. Third, in a few cases the listener triggers sound change by misapplying the reconstructive rules that serve to correct phonetic variability. In all of these cases teleology has been reduced to a bare minimum: I assume only that speaker and hearer are interested in communicating and will pronounce words as they have heard them (or think they have heard them)."