Can Automatic Speech Recognition Learn from Human Speech Perception?

Perceptual and Neural Modeling
Automatic Speech Attribute Transcription (ASAT) Project
Sorin Dusan
Center for Advanced Information Processing
Rutgers University
Piscataway, NJ
Project Kickoff Meeting – Rutgers University 9-13-04

NSF ASAT Project Sorin Dusan
Can Automatic Speech Recognition Learn from Human Speech Perception?
• The human auditory system as a model (Geisler ’98, Warren ’99, Plomp ’02, Ledoux ’02)
• The neuro-cognitive process of speech perception is still not fully understood
• Auditory processing and speech perception are better understood today than 30-50 years ago because of advances in technology: functional magnetic resonance imaging (fMRI), positron emission tomography (PET), magneto-encephalography (MEG)
• Better models of speech perception that explain the data (e.g., FLMP, Oden & Massaro ’78; TRACE, McClelland & Elman ’86)
• View of speech perception as a process related to other perceptual processes (e.g., reading, Massaro ’87)
• Take an engineering look at recent findings from neuroscience and psychology about the auditory system and speech perception
Sept. 13, 2004

Automatic Speech Recognition: from Sound to Words
• What are the possible levels of perceptual representation in speech: words, phonemes, features?
• The use of subword units for ASR is extremely appealing due to the increased efficiency of modeling, but …
• Any kind of subword “units” of speech recognition could reduce the accuracy of the sound-to-words mapping
• Is it possible to replace the phoneme? Is it the right time to dethrone the phoneme in speech processing?
[Figure: Neural Speech Processing, with nodes labeled Sound, Features, Phonemes, and Words]

Automatic Speech Recognition: from Sound to Words
Hypothesis 1:
• ASR can be seen simply as a mapping from acoustics to words with no hard-coded intermediate units
• Can one build a system to directly map sound or features to lexical representations? (Marslen-Wilson & Warren ’94)
• What are the architectural implications of such a mapping for the system? (levels, complexity, processing time, etc.)
[Figure: Speech Sound, Measurements, and Phonological Features mapped to Word 1, Word 2, Word 3, ..., Word N, with elements numbered 1 to 4. Complexity: 1 -> 2 -> 3 -> 4]
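One way to picture the direct mapping in Hypothesis 1 is whole-word template matching, where an utterance is compared against stored word templates with no phoneme or feature layer in between. The sketch below uses dynamic time warping (DTW) for the comparison. This is an illustrative choice and not a method named in the slides. The function names `dtw_distance` and `recognize_word`, the template dictionary, and the toy one-dimensional "measurements" are all assumptions.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of frames.

    Each frame is a list of floats (a hypothetical acoustic measurement
    vector); both sequences must use the same frame dimension.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]

def recognize_word(utterance, templates):
    """Return the word whose stored template is closest to the utterance.

    The mapping goes straight from measurements to a word label; no
    subword unit is hard-coded anywhere in the path.
    """
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Toy data (assumed, not from the slides): one-dimensional frames.
templates = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [4.0], [3.0]],
}
utterance = [[0.1], [0.9], [1.1], [2.1]]
best_word = recognize_word(utterance, templates)
```

The cost table grows with utterance length times template length for every word in the vocabulary, which is one concrete form of the complexity and processing-time question raised on the slide.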

Automatic Speech Recognition: from Sound to Words
Hypothesis 2:
• Speech recognition could be a heterogeneous process that simultaneously uses multiple types of phonological representations (features, phonemes, diphones, syllables, words)
• Test this hypothesis by building a hybrid system that uses, for example, both features and phonemes, and compare its performance with that of the individual systems
• Add to the system a top-down structure for context and knowledge integration that uses the same processing principle as the bottom-up structure (Plomp ’02, Massaro ’75)
[Figure: Speech Sound feeds a Feature-Based Recognizer, a Phoneme-Based Recognizer, and a Word-Based Recognizer; a Fusion stage outputs Word 1, Word 2, ..., Word N]
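The fusion stage of Hypothesis 2 could be sketched as follows, assuming each recognizer outputs a posterior probability for every candidate word. The slides do not specify a fusion rule. The weighted product (log-linear) combination used here is one common choice and is an assumption, as are the function name `fuse_word_scores`, the recognizer names, and the toy scores.

```python
import math

def fuse_word_scores(recognizer_posteriors, weights=None):
    """Combine per-word posteriors from several recognizers.

    recognizer_posteriors: dict of recognizer name -> {word: probability}
    weights: optional dict of recognizer name -> exponent (default 1.0)
    Returns a normalized {word: fused probability} dictionary.
    """
    names = list(recognizer_posteriors)
    if weights is None:
        weights = {name: 1.0 for name in names}
    # Only words scored by every recognizer can be fused.
    vocab = set.intersection(*(set(recognizer_posteriors[n]) for n in names))
    floor = 1e-12  # avoids log(0) when a recognizer assigns zero probability
    log_scores = {
        word: sum(
            weights[n] * math.log(max(recognizer_posteriors[n][word], floor))
            for n in names
        )
        for word in vocab
    }
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores.values())
    unnormalized = {w: math.exp(s - peak) for w, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {w: v / total for w, v in unnormalized.items()}

# Toy scores (assumed, not from the slides) for three candidate words.
scores = {
    "feature_based": {"bat": 0.5, "pat": 0.4, "cat": 0.1},
    "phoneme_based": {"bat": 0.3, "pat": 0.6, "cat": 0.1},
    "word_based": {"bat": 0.4, "pat": 0.5, "cat": 0.1},
}
fused = fuse_word_scores(scores)
best_word = max(fused, key=fused.get)
```

Running the same function on a single recognizer's scores gives that recognizer's own ranking, so the comparison proposed on the slide (hybrid versus individual systems) can be made on the same test data.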

References
• Geisler, C. D., From Sound to Synapse, Oxford University Press, 1998
• Ledoux, J., Synaptic Self: How Our Brains Become Who We Are, New York, 2002
• Massaro, D. W., Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, LEA Publishers, Hillsdale, London, 1987
• Marslen-Wilson, W. and Warren, P., “Levels of Perceptual Representation and Process in Lexical Access: Words, Phonemes, and Features”, Psychological Review, Vol. 101, Issue 4, pp. 653-675, 1994
• Massaro, D. W., Understanding Language – An Information Processing Analysis of Speech Perception, Reading, and Psycholinguistics, Academic Press, New York, 1975
• McClelland, J. L. and Elman, J. L., “The TRACE Model of Speech Perception”, Cognitive Psychology, Vol. 18, pp. 1-86, 1986
• Oden, G. C. and Massaro, D. W., “Integration of Featural Information in Speech Perception”, Psychological Review, Vol. 85, pp. 172-191, 1978
• Plomp, R., The Intelligent Ear, LEA Publishers, Mahwah, London, 2002
• Warren, R. M., Auditory Perception – A New Analysis and Synthesis, Cambridge University Press, 1999
