Chapter 14: Models and Theories of Speech Production and Perception Perry C. Hanavan, Au.D.
Theory A proposed description, explanation, or model of the manner of interaction of a set of natural phenomena, capable of predicting future occurrences or observations of the same kind, and capable of being tested through experiment or otherwise falsified through empirical observation.
Model A model is a conceptual representation of some phenomenon
Speech Perception • Puzzles and Speech Perception by Hawkins
Speech Production • Serial-Order Issue • The order of phonemes determines how a word is perceived or recognized, e.g., /k a t/ vs. /t a k/ contain the same phonemes in different orders • Degrees of Freedom • Muscle movements for each sound can vary, yet the sound is still understood • Context-Sensitivity Problem • The context in which a sound is made can have significant implications for meaning
Theories of Speech Production • Target • Feedback and Feedforward • Dynamic Systems • Connectionist
Target Models • “process in which a speaker attempts to attain a sequence of targets corresponding to the speech sounds he is attempting to produce.” • (Borden et al., 1994)
Target Models • Spatial? • Internal map of the vocal tract in the CNS • Coarticulation • Variant: movements of the articulators for a specific phoneme must change depending on the starting point, e.g., /ki/ vs. /ku/ • Feedback information to the brain to regulate fine movements and correct errors
Target Models • Acoustic-auditory? • Goal is acoustic output (targets) • Articulatory movements used to achieve goal • Variations in articulatory movements used when: • Adjacent phonemes vary • Speaker rate varies • Stress variations in utterance
Target Models • Perkell, Matthies, & Svirsky (1995) framework • Ultimate goal: • Articulatory movements result in understandable acoustic events (speech)
Feedback • Feedback was introduced in "The Basics of Cybernetics" and is especially important to speech production theory. • Gracco and Abbs (1987) are among many to point out that continuous speech involves continuous feedback; that is, the continuous execution of a motor program requires an equally continuous stream of sensory information from muscle and cutaneous senses throughout the respiratory, laryngeal, and orofacial regions.
Feedback • Levelt (1989) identified the types of feedback a speaker monitors: • Am I saying what I meant to say? • Is this the way I meant to say it? • Is what I am saying socially appropriate? • Am I selecting the right words? • Am I using the right syntax and morphology? • Am I making any phonological errors? • Is my articulation at the right speed and pitch?
Feedback • Successful speech production is a constant battle against error, and those errors can pop up anywhere. • The phrases we then use to interrupt and correct ourselves (phrases such as "sorry", "I mean", "let me put that another way", etc.) are known generically as "editing expressions" (Hockett, 1967). Levelt (1989) summarised the issue thus .....
Feedback • "The major feature of editor theories [of monitoring] is that production results are fed back through a device that is external to the production system. • Such a device is called an editor or a monitor. • This device can be distributed in the sense that it can check in-between results at different levels of processing. • The editor may, for instance, monitor the construction of the preverbal message, the appropriateness of lexical access, the well-formedness of syntax, or the flawlessness of phonological-form access. • There is, so to speak, a watchful little homonculus connected to each processor." (Levelt, 1989, pp467-468; italics original; bold emphasis added)
Feedback • Lee interpreted these findings as evidence of a multiple-loop control hierarchy, with four levels of feedback, as follows: • The "Thought Loop": The top control level releases individual thoughts for action, and then monitors that action for successful progress and completion. The highest-level feedback loop then monitors the output for what would nowadays be termed its pragmatic appropriacy • The "Word Loop": The second highest loop monitors speech production for word selection accuracy. • The "Voice Loop": The third highest loop monitors speech production at whole-syllable level for morphological accuracy. • The "Articulating Loop": Finally, the lowest loop monitors speech production, checking that the right phonemes have been used within each syllable.
Feedforward • Feedback is used to detect and correct errors in speech output • Feedforward signals are used to make articulatory adjustments online (a simplified sketch of the two control modes follows below)
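The following is a minimal sketch (not a model from the chapter) contrasting the two control modes for a single, hypothetical articulatory parameter: a feedforward command is issued from an assumed internal model before any sensory return, and feedback then corrects the residual error online.

```python
# Minimal sketch, assuming one scalar articulatory parameter and a deliberately
# imperfect internal model; values are illustrative, not from the chapter.

def feedforward_command(target):
    # Learned target-to-command mapping (assumed linear and slightly miscalibrated)
    return 0.9 * target

def speak(target, steps=4, gain=0.5):
    position = 0.0
    command = feedforward_command(target)    # feedforward: issued before any sensory return
    for _ in range(steps):
        position += command                  # execute the movement
        error = target - position            # feedback: sensory report of the mismatch
        command = gain * error               # correction applied to the next movement
        print(f"position={position:.3f}  error={error:.3f}")
    return position

speak(target=1.0)
```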
Dynamic Systems Models • Speech as a dynamic pattern of trajectories through articulatory space • Groups of muscles link up to perform a particular task • The lip and jaw muscles function as a coordinative unit in bilabial closure
Connectionist Models • Parallel-distributed processing models • Spreading activation models • Non-hierarchical models
Speech Perception Issues • Linearity • Segmentation • Speaker Normalization • Basic Unit of Perception • Specialization of speech perception
Categories of Speech Perception Theories • Active vs. Passive • Bottom-up vs. Top-down • Autonomous vs. Interactive
Theories of Speech Perception • Motor • Acoustic Invariance • Direct Realism • TRACE • Cohort • Fuzzy Logical • Logogen • Native Language Magnet
Speech Perception Issues • Linearity • Segmentation • Speaker normalization • Basic unit of perception • Specialization of speech perception
Linearity & Segmentation • Linearity Principle: • A specific sound in a word corresponds to specific phoneme • Segmentation • the ability to break the spoken language signal into the parts that make up words • Thus, these two principles suggest speech perception is based on a linear correspondence between the acoustic signal and the phoneme units • Although we perceive speech as a series of separate and distinct phonemes and words, the acoustic boundaries between phonemes is blurred • eg. /ki/ vs. /ku/ (speech is not invariant)
Speaker Normalization • How are listeners able to recognize speech sounds and words despite wide variations in speaker production? • Speaker variations • Gender differences • Age differences • Normalization is the ability to perceive words spoken by different speakers, at different rates, and in different phonetic contexts as the same.
Basic Unit of Perception • What is the basic unit of speech perception? • Acoustic-phonetic features • Allophones • Phonemes • Syllables • Words • Listening in noise (focus on smaller units) • Young children focus on syllables and formant transitions
Specialization of Speech Perception • Is speech perception a specialized function/process in humans? • However, animals have been able to demonstrate categorical perception
Specialization of Speech Perception • Perceptual magnet effect not demonstrated in animals (e.g., whereby 'good' variants in F1/F2 coordinate space are poorly discriminated from typical vowel prototypes) • [Figure, not reproduced here: (A) Formant frequencies of vowels surrounding an American /i/ prototype (red) and a Swedish /y/ prototype (blue). (B) Results of tests on American and Swedish infants indicating an effect of linguistic experience; infants showed greater generalization when tested with the native-language prototype. PME, perceptual magnet effect. Image: American Association for the Advancement of Science.]
Categories of Speech Perception Theories • Active vs. Passive • Bottom-up vs. Top-down • Autonomous vs. Interactive
Active vs. Passive • Active theories suggest that speech perception and production are closely related • Listener knowledge of how sounds are produced facilitates recognition of sounds • Passive theories emphasize the sensory aspects of speech perception • Listeners utilize internal filtering mechanisms • Knowledge of vocal tract characteristics plays a minor role, for example when listening in noise
Bottom-up vs. Top-down • Top-down processing works with knowledge a listener has about a language, context, experience, etc. • Listeners use stored information about language and the world to make sense of the speech • Bottom-up processing works in the absence of a knowledge base providing top-down information • Listeners receive auditory information, convert it into a neural signal, and process the phonetic feature information
Autonomous vs. Interactive • Autonomous theories posit feed-forward processing, with lexical influence restricted to post-perceptual decision processes (uni-directional) • Interactive theories posit that information and knowledge from many sources available to the listener are involved at any or all stages of processing of the signal (bi-directional)
Speech Perception Theories • Motor Theory • Acoustic Invariance Theory • Direct Realism • Trace Model • Logogen Theory • Cohort Theory • Fuzzy Logic Model of Perception • Native Language Magnet Theory
Question This theory postulates speech is perceived by reference to how it is produced: • Motor Theory • Acoustic Invariance Theory • Direct Realism • Trace Model • Logogen Theory • Cohort Theory • Fuzzy Logic Model of Perception • Native Language Magnet Theory
Motor Theory • Postulates speech is perceived by reference to how it is produced • when perceiving speech, listeners access their own knowledge of how phonemes are articulated • Articulatory gestures (such as rounding or pressing the lips together) are units of perception that directly provide the listener with phonetic information Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967
Motor Theory Three main claims: • speech processing is special, • perceiving speech is perceiving gestures, and • the motor system is recruited for perceiving speech. Today, the theory is more closely connected with research and theorizing in the broad context of cognitive science than with research and theorizing in the field of speech.
Question The acoustic properties of the landmarks constitute the basis for establishing the distinctive features: • Motor Theory • Acoustic Invariance Theory • Direct Realism • Trace Model • Logogen Theory • Cohort Theory • Fuzzy Logic Model of Perception • Native Language Magnet Theory
Acoustic Invariance Theory • Listeners inspect the incoming signal for so-called acoustic landmarks, which are particular events in the spectrum carrying information about the gestures that produced them. • Because gestures are limited by the capacities of the human articulators, and listeners are sensitive to their auditory correlates, the lack-of-invariance problem simply does not exist in this model. • The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. • Bundles of the distinctive features uniquely specify phonetic segments (phonemes, syllables, words). Stevens, K. N. (2002). "Toward a model of lexical access based on acoustic landmarks and distinctive features." Journal of the Acoustical Society of America, 111(4), 1872–1891.
Acoustic Invariance Theory Two principal claims: • There are invariant acoustic patterns in the speech signal which correspond to phonetic features and which remain invariant across speakers, phonetic contexts, and languages. • Human perceivers use these properties (invariant acoustic patterns) to provide the phonetic framework for natural language and to process the sounds of speech in ongoing perception.
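A toy illustration of the idea that a bundle of distinctive features uniquely specifies a segment. The feature bundles below are made up for the example and are not Stevens' (2002) actual inventory.

```python
# Toy sketch: hypothetical binary feature bundles, not Stevens' feature set.
SEGMENTS = {
    frozenset({"+voice", "+labial", "-continuant"}): "b",
    frozenset({"-voice", "+labial", "-continuant"}): "p",
    frozenset({"+voice", "+alveolar", "-continuant"}): "d",
}

def segment_for(features):
    # A complete feature bundle acts as a unique key for a phonetic segment.
    return SEGMENTS.get(frozenset(features), "?")

print(segment_for({"+voice", "+labial", "-continuant"}))  # -> b
```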
Question Hypothesizes that perception allows listeners to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived. • Motor Theory • Acoustic Invariance Theory • Direct Realism • Trace Model • Logogen Theory • Cohort Theory • Fuzzy Logic Model of Perception • Native Language Magnet Theory
Direct Realism • Hypothesizes that perception allows listeners to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived. • Asserts that the objects of perception are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the Motor Theory) events that are causally antecedent to these movements, i.e., intended gestures. • Listeners perceive gestures not by means of a specialized decoder (as in the motor theory) but because information in the acoustic signal specifies the gestures that form it. • Suggests that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception. Fowler, C. A. (1986). "An event approach to the study of speech perception from a direct-realist perspective." Journal of Phonetics, 14, 3–28.
Direct Realism • In essence, the object of our perceptions (i.e., a sound, a word, etc.), known as a 'percept', is the distal (causal) stimulus, not the proximal (sensory) stimulus • If listening to a question, you don't necessarily explicitly perceive the rise in the F0 of the auditory signal; instead, you perceive that a question has been asked • Claims there is a strong, top-down effect on perception • Supports a connectionist view of cognition
TRACE Model • Assumes there is a cognitive unit for each feature (for example, nasality) at the feature level, for each phoneme at the phoneme level, and for each word at the word level. • At any given time, all of these units are activated to a greater or lesser extent, as opposed to all or none. • When units are activated above a certain threshold, they may influence other units at the same or different levels. • These effects may be either excitatory or inhibitory; that is, they may increase or decrease the activation of other units. • The entire network of units is referred to as the trace, because “the pattern of activation left by a spoken input is a trace of the analysis of the input at each of the three processing levels” • The network is active and changes with subsequent input. McClelland, J.L., & Elman, J.L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86
TRACE Model • For example, a listener hears the beginning of bald, and the words bald, ball, bad, and bill become active in memory. Then, soon after, only bald and ball remain in competition (bad and bill have been eliminated because the vowel sound doesn't match the input). • Soon after, bald is recognized. • TRACE simulates this process by representing the temporal dimension of speech, allowing words in the lexicon to vary in activation strength, and by having words compete during processing. • [Figure 1, not reproduced here: a line graph of word activation in a simple TRACE simulation; a toy sketch of this competition follows the TRACE slides below.]
TRACE Model • Neural net model • Aims to identify single words • Accounts for categorical perception, the Ganong effect, and other traditional phonetic findings that were considered important in the 1970s–1980s • Connectionist model of speech perception (McClelland and Elman, 1986)
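A toy interactive-activation sketch of the bald/ball/bad/bill competition described above. It is not the McClelland and Elman (1986) implementation; the transcriptions and weights are invented. Word units gain activation from each incoming segment they match and lose activation otherwise, so bald eventually dominates.

```python
# Toy sketch of lexical competition; transcriptions and numbers are invented.
WORDS = {"bald": "bOld", "ball": "bOl", "bad": "bad", "bill": "bIl"}
INPUT = "bOld"   # the listener hears "bald" one segment at a time

activation = {w: 0.0 for w in WORDS}
for t, segment in enumerate(INPUT):
    for word, segs in WORDS.items():
        if t < len(segs) and segs[t] == segment:
            activation[word] += 1.0      # bottom-up excitation from a matching segment
        else:
            activation[word] -= 0.5      # simplified stand-in for mismatch / inhibition
        activation[word] = max(activation[word], 0.0)
    print(t, {w: round(a, 1) for w, a in activation.items()})
# After the vowel, "bad" and "bill" fall behind; after the final /d/, "bald" wins.
```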
Logogen Theory • Model designed to explain word recognition using a new type of unit known as a "logogen" • A critical element is the lexicon, a specialized aspect of memory that includes semantic and phonemic information about each item contained in memory. • A given lexicon consists of many smaller, abstract items known as logogens. • Logogens contain a variety of properties about a given word, such as its appearance, sound, and meaning. • Logogens do not store words within themselves; they store information that is specifically necessary for retrieval of whatever word is being searched for. Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76, 165-178.
Logogen Theory • A given logogen becomes activated by stimuli or contextual information (words) consistent with the properties of that specific logogen, and when the logogen's activation level rises to or above its threshold level, the pronunciation of the given word is sent to the output system. • Certain stimuli can affect the activation levels of more than one word at a time, usually words similar to one another. • When this occurs, whichever word's activation level reaches the threshold level first is sent to the output system, with the listener remaining unaware of any partially excited logogens.
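A minimal sketch of the threshold idea described above, with made-up evidence values and thresholds (nothing here comes from Morton's quantitative formulation): a logogen sums evidence from the stimulus and from context, and fires once its threshold is crossed.

```python
# Toy sketch of threshold-based activation; numbers are illustrative only.
class Logogen:
    def __init__(self, word, threshold=1.0):
        self.word = word
        self.threshold = threshold
        self.activation = 0.0

    def add_evidence(self, amount):
        self.activation += amount
        # True means the word's pronunciation is sent to the output system
        return self.activation >= self.threshold

doctor = Logogen("doctor")
print(doctor.add_evidence(0.4))   # partial acoustic evidence -> False
print(doctor.add_evidence(0.7))   # contextual boost (e.g., after "nurse") -> True
```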
Cohort Theory • Designed specifically to account for auditory word recognition. • Breaks the word down into successive sounds. • The model posits that when a word is heard, all words beginning with the first sound of the target word are activated. • This set of words is considered the cohort. • Once the first cohort has been activated, other information, or sounds in the word, narrows down the choices. • The listener recognizes the word when left with a single choice; this is considered the "recognition point." Marslen-Wilson, W. (1987). "Functional parallelism in spoken word recognition." Cognition, 25, 71-102.
Cohort Model • Designed specifically to account for auditory word recognition • Works by breaking a word down: when a word is heard, all words that begin with the first sound of the target word are activated • This set of words is considered the "first" cohort • Once the first cohort is activated, other sounds in the word narrow down the choices • The listener recognizes the word when left with a single choice • This is considered the "recognition point"
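A toy sketch of cohort narrowing with a tiny invented lexicon (spellings stand in for successive speech sounds): the first sound defines the cohort, each later sound prunes it, and recognition occurs when one candidate remains.

```python
# Toy sketch: letters stand in for successive sounds; the lexicon is invented.
LEXICON = ["trespass", "tread", "treasure", "treat", "trek"]

def recognize(heard):
    cohort = [w for w in LEXICON if w.startswith(heard[0])]      # the first cohort
    for i in range(2, len(heard) + 1):
        cohort = [w for w in cohort if w.startswith(heard[:i])]  # later sounds prune it
        print(f"after '{heard[:i]}': {cohort}")
        if len(cohort) == 1:
            return cohort[0]          # the recognition point
    return cohort

print(recognize("treasure"))
```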
Fuzzy Logic Model of Perception • Proposes that people remember speech sounds in a probabilistic, or graded, way. • Suggests people remember descriptions of the perceptual units of language, called prototypes. • Within each prototype, various features may combine. • Features are not binary (true or false); there is a fuzzy value corresponding to how likely it is that a sound belongs to a particular speech category. Massaro, D. (1989). The logic of the fuzzy logical model of perception. Behavioral and Brain Sciences, 12, 778–794. Cambridge University Press.
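The fuzzy logical model is usually described as combining the fuzzy feature values for each prototype multiplicatively and then normalizing across the alternatives. The sketch below follows that description, with made-up cues and values rather than fitted data.

```python
# Toy sketch; the feature values below are invented, not experimental data.
def flmp(feature_support):
    """feature_support maps prototype -> list of fuzzy truth values in [0, 1]."""
    support = {}
    for proto, values in feature_support.items():
        s = 1.0
        for v in values:
            s *= v                      # multiplicative integration across features
        support[proto] = s
    total = sum(support.values())
    return {p: s / total for p, s in support.items()}   # relative goodness of match

# How well an ambiguous sound matches /ba/ vs /da/ on two cues (made-up values):
print(flmp({"/ba/": [0.7, 0.6], "/da/": [0.3, 0.4]}))   # -> /ba/ ~0.78, /da/ ~0.22
```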