
What are the Essential Cues for Understanding Spoken Language? Steven Greenberg



Presentation Transcript


  1. What are the Essential Cues for Understanding Spoken Language? Steven Greenberg International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 http://www.icsi.berkeley.edu/~steveng steveng@icsi.berkeley.edu

  2. No Scientist is an Island … IMPORTANT COLLEAGUES
  ACOUSTIC BASIS OF SPEECH INTELLIGIBILITY - Takayuki Arai, Joy Hollenback, Rosaria Silipo
  AUDITORY-VISUAL INTEGRATION FOR SPEECH PROCESSING - Ken Grant
  AUTOMATIC SPEECH RECOGNITION AND FEATURE CLASSIFICATION - Shawn Chang, Lokendra Shastri, Mirjam Wester
  STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION - Eric Fosler, Leah Hitchcock, Joy Hollenback

  3. Germane Publications
  STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
  Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
  Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
  Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
  Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the CREST Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
  Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, pp. S24-27.
  AUTOMATIC PHONETIC TRANSCRIPTION AND ACOUSTIC FEATURE CLASSIFICATION
  Chang, S., Greenberg, S. and Wester, M. (2001) An elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
  Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proceedings of the International Conference on Spoken Language Processing, Beijing.
  Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable segmentation using temporal flow model neural networks. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
  Wester, M., Greenberg, S. and Chang, S. (2001) A Dutch treatment of an elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
  http://www.icsi.berkeley.edu/~steveng

  4. Germane Publications
  PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
  Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
  Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
  Greenberg, S. and Arai, T. (2001) The relation between speech intelligibility and the complex modulation spectrum. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
  Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
  Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.
  AUDITORY-VISUAL SPEECH PROCESSING
  Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Submitted to the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001).
  PROSODIC STRESS ACCENT - AUTOMATIC CLASSIFICATION AND CHARACTERIZATION
  Hitchcock, L. and Greenberg, S. (2001) Vowel height is intimately associated with stress-accent in spontaneous American English discourse. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
  Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
  Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park, MD.
  Silipo, R. and Greenberg, S. (2000) Automatic detection of prosodic stress in American English discourse. Technical Report 2000-1, International Computer Science Institute, Berkeley, CA.
  http://www.icsi.berkeley.edu/~steveng

  5. PROLOGUE The Central Challenge for Models of Speech Recognition

  6. Language - The Traditional Perspective The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization

  7. The Serial Frame Perspective on Speech • Traditional models of speech recognition assume that the identity of a phonetic segment depends on the detailed spectral profile of the acoustic signal for a given (usually 25-ms) frame of speech

  8. Language - A Syllable-Centric Perspective A more empirical perspective on spoken language focuses on the syllable as the interface between “sound” and “meaning” Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic

  9. Lines of Evidence

  10. Take Home Messages
  Segmentation is crucial for understanding spoken language - at the level of the phrase, the word, the syllable and the phonetic segment
  But … this linguistic segmentation is inherently “fuzzy”, as is the spectral information associated with each linguistic tier
  The low-frequency (3-25 Hz) modulation spectrum is a crucial acoustic (and possibly visual) parameter associated with intelligibility - it provides segmentation information that unites the phonetic segment with the syllable (and possibly the word and beyond)
  Many properties of spontaneous spoken language differ from those of laboratory and citation speech - there are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization

  11. The Central Importance of the Modulation Spectrum and the Syllable for Understanding Spoken Language

  12. Effects of Reverberation on the Speech Signal Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions

  13. Effects of Reverberation on the Speech Signal Reflections from walls and other surfaces routinely modify the temporal and modulation spectral properties of the speech signal The modulation spectrum’s peak is attenuated and shifted down to ca. 2 Hz [based on an illustration by Hynek Hermansky]

  14. Modulation Spectrum Computation
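The slide presents the computation graphically. As a rough sketch, the modulation spectrum of one frequency channel can be obtained by band-pass filtering the speech signal, extracting its amplitude envelope, and taking an FFT of the downsampled envelope. The band limits and envelope sampling rate below are illustrative assumptions, not the values used in the talk (and a production implementation would low-pass filter before downsampling rather than slice):

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def modulation_spectrum(x, fs, band=(1000.0, 2000.0), env_fs=100):
    # 1. Band-pass filter the speech signal into one frequency channel
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)
    # 2. Extract the amplitude envelope via the Hilbert transform
    env = np.abs(hilbert(y))
    # 3. Downsample the envelope (modulation content of interest is < ~25 Hz);
    #    naive slicing is adequate here because the envelope is smooth
    step = int(fs // env_fs)
    env = env[::step]
    # 4. FFT of the mean-removed envelope gives the modulation spectrum
    env = env - env.mean()
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_fs)
    return freqs, spec
```

Applied to a tone amplitude-modulated at 4 Hz, this sketch produces a peak at 4 Hz in the modulation-frequency axis, which is the kind of representation the following slides refer to.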

  15. The Modulation Spectrum Reflects Syllables The peak in the distribution of syllable duration is close to the mean (ca. 200 ms) The distribution of syllable durations closely matches the shape of the modulation spectrum - suggesting that the modulation spectrum reflects syllables
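The duration-to-frequency correspondence behind this slide is simple arithmetic: a syllable lasting T seconds contributes envelope energy near 1/T Hz, so a 200-ms modal duration lines up with the ~5 Hz region where the modulation spectrum of speech peaks. A toy check (the durations below are made up for illustration, not Switchboard data):

```python
import numpy as np

# Hypothetical syllable durations in seconds (illustrative, not corpus data)
durations = np.array([0.12, 0.18, 0.20, 0.22, 0.31])

# Each duration T maps onto a modulation frequency of roughly 1/T Hz
mod_freqs = 1.0 / durations

# The 200-ms modal duration corresponds to 5 Hz, inside the low-frequency
# modulation region that the later slides tie to intelligibility
print(1.0 / 0.20)   # -> 5.0
print(mod_freqs.round(1))
```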

  16. The Ability to Understand Speech Under Reverberant Conditions (Spectral Asynchrony)

  17. Spectral Asynchrony - Method Output of quarter-octave frequency bands quasi-randomly time-shifted relative to a common reference. Maximum shift interval ranged between 40 and 240 ms (in 20-ms steps). Mean shift interval is half of the maximum interval. Adjacent channels separated by a minimum of one-quarter of the maximum shift range. Stimuli – 40 TIMIT Sentences “She washed his dark suit in greasy dish water all year”
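One plausible way to generate per-channel shift patterns satisfying these constraints is sequential rejection sampling; this is an assumed reconstruction of the procedure, not the authors' actual code:

```python
import numpy as np

def draw_channel_shifts(n_channels, max_shift_ms, rng=None):
    """Per-channel time shifts (ms): each shift is drawn from
    [0, max_shift_ms] (so the mean is roughly half the maximum), with
    adjacent channels forced at least max_shift_ms / 4 apart, as stated
    on the slide."""
    rng = np.random.default_rng() if rng is None else rng
    min_gap = max_shift_ms / 4.0
    shifts = [rng.uniform(0.0, max_shift_ms)]
    for _ in range(n_channels - 1):
        while True:  # redraw until the adjacent-channel gap constraint holds
            s = rng.uniform(0.0, max_shift_ms)
            if abs(s - shifts[-1]) >= min_gap:
                shifts.append(s)
                break
    return np.asarray(shifts)
```

The excluded region around the previous channel's shift is at most half the range, so each redraw loop accepts at least half of all draws and terminates quickly; the sequential scheme does make the per-channel distribution only approximately uniform.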

  18. Spectral Asynchrony - Paradigm The magnitude of energy in the 3-6 Hz region of the modulation spectrum is computed for each sub-band (of 4 or 7 channels) as a function of spectral asynchrony The modulation spectrum magnitude is relatively unaffected by asynchronies of 80 ms or less (open symbols), but is appreciably diminished by asynchronies of 160 ms or more Is intelligibility correlated with the reduction in the 3-6 Hz modulation spectrum?
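The 3-6 Hz energy measure can be sketched as a simple integration over that region of a modulation spectrum. This is a hypothetical helper using a rectangle rule on a uniform frequency grid, not the study's code:

```python
import numpy as np

def band_energy(freqs, spec, lo=3.0, hi=6.0):
    # Sum the modulation-spectrum magnitude over the lo-hi Hz region,
    # scaled by the (assumed uniform) frequency spacing
    mask = (freqs >= lo) & (freqs <= hi)
    df = freqs[1] - freqs[0]
    return spec[mask].sum() * df
```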

  19. Intelligibility and Spectral Asynchrony Speech intelligibility does appear to be roughly correlated with the energy in the modulation spectrum between 3 and 6 Hz The correlation varies depending on the sub-band and the degree of spectral asynchrony

  20. Spectral Asynchrony - Summary Speech is capable of withstanding a high degree of temporal asynchrony across frequency channels This form of cross-spectral asynchrony is similar to the effects of many common forms of acoustic reverberation Speech intelligibility remains high (>75%) until this (maximum) asynchrony exceeds 140 ms The magnitude of the low-frequency (3-6 Hz) modulation spectrum is highly correlated with speech intelligibility

  21. Understanding Spoken Language Under Very Sparse Spectral Conditions

  22. A Flaw in the Spectral Asynchrony Study Of the 448 possible combinations of four slits across the spectrum (where one slit is present in each of the 4 sub-bands), ca. 10% (i.e., 45) exhibit a coefficient of variation of channel asynchrony below 10%. Thus, the seeming temporal tolerance of the auditory system may be illusory if listeners can decode the speech signal using information from only a small number of channels distributed across the spectrum. [Figures: intelligibility of spectrally desynchronized speech; distribution of channel asynchrony]
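The combination count and the coefficient-of-variation argument can be checked numerically. The split of quarter-octave channels into sub-bands of 4, 4, 4, and 7 channels below is an assumption chosen because 4 x 4 x 4 x 7 = 448, and the shift values are random stand-ins for the experimental ones:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-channel asynchronies (ms), one array per sub-band;
# drawing one slit per sub-band yields 4 * 4 * 4 * 7 = 448 combinations
subbands = [rng.uniform(10.0, 240.0, size=n) for n in (4, 4, 4, 7)]

# Coefficient of variation (std / mean) of each four-slit combination:
# a small CV means the four slits are nearly synchronized
cvs = np.array([np.std(c) / np.mean(c)
                for c in itertools.product(*subbands)])

print(len(cvs))             # -> 448
print(np.mean(cvs < 0.10))  # fraction of nearly-synchronous combinations
```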

  23. Spectral Slit Paradigm Can listeners decode spoken sentences using just four narrow (1/3-octave) channels (“slits”) distributed across the spectrum? The edge of each slit was separated from its nearest neighbor by an octave The modulation pattern of each slit differs from that of the others The four-slit compound waveform looks very similar to the full-band signal
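The slit geometry can be sketched as follows. The lowest band edge f0 is an illustrative assumption, not a value taken from the study (the actual slit frequencies are given in the cited papers):

```python
def slit_edges(f0=300.0, n_slits=4):
    """Band edges (Hz) for n 1/3-octave slits whose neighboring
    edges are separated by one octave, per the slide."""
    edges = []
    lo = f0
    for _ in range(n_slits):
        hi = lo * 2.0 ** (1.0 / 3.0)   # each slit is 1/3 octave wide
        edges.append((lo, hi))
        lo = hi * 2.0                  # one-octave gap to the next slit's edge
    return edges
```

With the assumed f0 = 300 Hz, this places the four slits at roughly 300-378, 756-952, 1905-2400, and 4801-6049 Hz, spanning the speech band.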

  24. Word Intelligibility - Single Slits The intelligibility associated with any single slit is only 2 to 9% The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

  25. Word Intelligibility - Road Map 1. Intelligibility as a function of the number of slits (from one to four)

  26. Word Intelligibility - 1 Slit

  27. Word Intelligibility - 2 Slits

  28. Word Intelligibility - 3 Slits

  29. Word Intelligibility - 4 Slits

  30. Word Intelligibility - Road Map 2. Intelligibility for different combinations of two-slit compounds The two center slits yield the highest intelligibility

  31. Word Intelligibility - 2 Slits

  32. Word Intelligibility - 2 Slits

  33. Intelligibility - 2 Slits

  34. Intelligibility - 2 Slits

  35. Intelligibility - 2 Slits

  36. Intelligibility - 2 Slits

  37. Word Intelligibility - Road Map 3. Intelligibility for different combinations of three-slit compounds Combinations with one or two center slits yield the highest intelligibility

  38. Intelligibility - 3 Slits

  39. Intelligibility - 3 Slits

  40. Intelligibility - 3 Slits

  41. Intelligibility - 3 Slits

  42. Word Intelligibility - Road Map 4. Four slits yield nearly (but not quite) perfect intelligibility of ca. 90% This maximum level of intelligibility makes it possible to deduce the specific contribution of each slit by itself and in combination with others

  43. Intelligibility - 3 Slits

  44. Spectral Slits - Summary A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility

  45. Modulation Spectrum Across Frequency The modulation spectrum varies in magnitude across frequency The shape of the modulation spectrum is similar for the three lowest slits, but the highest frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies

  46. Word Intelligibility - Single Slits The intelligibility associated with any single slit ranges between 2 and 9%, suggesting that the shape and magnitude of the modulation spectrum per se is NOT the controlling variable for intelligibility

  47. Spectral Slits - Summary A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility

  48. The Effect of Desynchronizing Sparse Spectral Information on Speech Intelligibility

  49. Modulation Spectrum Across Frequency Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility
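Desynchronizing a slit amounts to delaying its waveform before summing the compound signal. A minimal sketch using a zero-padding delay (hypothetical helper, not the study's code):

```python
import numpy as np

def delay(x, ms, fs):
    """Delay waveform x by `ms` milliseconds at sampling rate fs,
    zero-padding the front and trimming the tail to keep the length."""
    n = int(round(ms * fs / 1000.0))
    return np.concatenate([np.zeros(n), x])[: len(x)]

# A desynchronized compound stimulus would then be built as, e.g.:
#   compound = slit1 + delay(slit2, 50.0, fs) + slit3 + delay(slit4, 50.0, fs)
```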

  50. Spectral Slits - Summary Even small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility Asynchrony greater than 50 ms has a profound impact on intelligibility
