
Speech Segregation Based on Oscillatory Correlation

DeLiang Wang, The Ohio State University





Presentation Transcript


  1. Speech Segregation Based on Oscillatory Correlation • DeLiang Wang, The Ohio State University

  2. Outline of Presentation • Introduction • Auditory Scene Analysis (ASA) Problem • Binding Problem • Oscillatory Correlation Theory • LEGION network • Multistage Model for Computational ASA (CASA) • Recent Results • Discussion and Summary

  3. ASA Problem (Bregman’90) • Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source • ASA takes place in two conceptual stages: • Segmentation. Decompose the acoustic signal into ‘sensory elements’ (segments) • Grouping. Combine segments into groups, such that segments in the same group are likely to have arisen from the same environmental source

  4. ASA Problem - continued • The grouping process involves two mechanisms: • Primitive grouping. Innate data-driven mechanisms, consistent with those described by the Gestalt psychologists for visual perception (proximity, similarity, common fate, good continuation etc.) • Schema-driven grouping. Application of learned knowledge about speech, music and other environmental sounds

  5. Binding Problem • Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system • How are these features combined to form a whole? • Hierarchies of feature-detecting cells exist, but do not constitute a solution to the binding problem; no evidence for ‘grandmother cells’

  6. Oscillatory Correlation (von der Malsburg & Schneider’86; Wang’96) • Neural oscillators used to represent auditory features • Oscillators representing features of the same source are synchronized (phase-locked with zero phase lag), and are desynchronized from oscillators representing different sources • Supported by experimental findings, e.g. oscillations in auditory cortex measured by EEG, MEG and local field potentials

  7. Oscillatory Correlation Theory • Figure legend: FD = feature detector

  8. LEGION Architecture for Stream Segregation • LEGION: Locally Excitatory Globally Inhibitory Oscillator Network (Terman & Wang’95)

  9. Single Relaxation Oscillator • Figure panels: with stimulus; without stimulus; typical x trace (membrane potential)

  10. LEGION on a Chip • The chip area is 6.7 mm² (core 3 mm²) and implements a 16×16 LEGION network (by Jordi Cosp, Polytechnic University of Catalonia, Spain)

  11. Computational Auditory Scene Analysis • The ASA problem and the binding problem are closely related; the oscillatory correlation framework can address both issues • Previous work also suggests that: • Representation of the auditory scene is a key issue • Temporal continuity is important (although it is ignored in most frame-based sound processing algorithms) • Fundamental frequency (F0) is a strong cue for grouping

  12. A Multi-stage Model for CASA

  13. Auditory Periphery Model • A bank of gammatone filters • n: filter order (fourth-order is used) • b: bandwidth • H: Heaviside function • Meddis hair cell model converts gammatone output to neural firing
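The symbols listed on this slide match the standard gammatone impulse response, g(t) = t^(n-1) exp(-2πbt) cos(2πft + φ) H(t), where f is the filter's center frequency and φ its phase. The following is a minimal Python sketch assuming that form. The sampling rate, the 1 kHz center frequency, and the bandwidth rule b = 1.019 × ERB(f) are illustrative assumptions and are not taken from the slides.

```python
import math

def erb_bandwidth(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_impulse_response(fc, b, fs=16000, num_samples=400, n=4, phase=0.0):
    """Sample g(t) = t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase) * H(t)."""
    samples = []
    for k in range(num_samples):
        t = k / fs  # t >= 0, so the Heaviside factor H(t) is 1 for every sample
        envelope = t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)
        samples.append(envelope * math.cos(2.0 * math.pi * fc * t + phase))
    return samples

fc = 1000.0                      # hypothetical center frequency in Hz
b = 1.019 * erb_bandwidth(fc)    # assumed bandwidth rule for a fourth-order filter
g = gammatone_impulse_response(fc, b)
```

The response starts at zero, rises as t^(n-1), and then decays exponentially at a rate set by b, so wider filters ring for a shorter time.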

  14. Fourth-order Gammatone Filters - Example • Figure: impulse responses of gammatone filters

  15. Auditory Periphery - Example • Hair cell response to the utterance “Why were you all weary?” mixed with a ringing telephone • 128 filter channels spaced on the ERB (equivalent rectangular bandwidth) scale
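One way to place 128 channels on the ERB scale is to space the center frequencies evenly in ERB-rate units. The sketch below assumes the Glasberg and Moore ERB-rate formula and a hypothetical 80 Hz to 5 kHz range; the slides do not state the frequency range or the exact spacing rule.

```python
import math

def hz_to_erb_rate(f):
    """Convert frequency in Hz to ERB-rate (Glasberg and Moore formula)."""
    return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)

def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def erb_spaced_center_frequencies(f_low, f_high, num_channels):
    """Center frequencies equally spaced on the ERB-rate scale."""
    e_low, e_high = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (e_high - e_low) / (num_channels - 1)
    return [erb_rate_to_hz(e_low + i * step) for i in range(num_channels)]

# Assumed range: 80 Hz to 5 kHz (not given on the slide).
centers = erb_spaced_center_frequencies(80.0, 5000.0, 128)
```

This spacing packs channels densely at low frequencies, where auditory filters are narrow, and more sparsely at high frequencies.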

  16. Mid-level Auditory Representations • Mid-level representations form the basis for segment formation and subsequent grouping processes • Correlogram extracts periodicity information from simulated auditory nerve firing patterns • Summary correlogram can be used to identify F0 • Cross-correlation between adjacent correlogram channels identifies regions that are excited by the same frequency component
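The three representations on this slide can be sketched on toy data: a per-channel autocorrelation (one correlogram frame), a summary correlogram whose largest peak gives an F0 estimate, and a normalized correlation between the autocorrelations of adjacent channels. The half-wave rectified sinusoids standing in for auditory nerve firing, the 8 kHz sampling rate, and the lag limits are all assumptions made for illustration.

```python
import math

def autocorrelation(signal, max_lag):
    """Unnormalized autocorrelation for lags 0..max_lag over a fixed window."""
    n = len(signal) - max_lag
    return [sum(signal[t] * signal[t + lag] for t in range(n))
            for lag in range(max_lag + 1)]

def summary_correlogram(correlogram):
    """Sum the per-channel autocorrelations across channels."""
    num_lags = len(correlogram[0])
    return [sum(acf[lag] for acf in correlogram) for lag in range(num_lags)]

def estimate_f0(summary, fs, min_lag):
    """Take the lag of the largest summary peak at or above min_lag as the period."""
    best_lag = max(range(min_lag, len(summary)), key=lambda lag: summary[lag])
    return fs / best_lag

def cross_channel_correlation(acf_a, acf_b):
    """Normalized (zero-mean) correlation between two channels' autocorrelations."""
    mean_a, mean_b = sum(acf_a) / len(acf_a), sum(acf_b) / len(acf_b)
    da = [v - mean_a for v in acf_a]
    db = [v - mean_b for v in acf_b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    return sum(p * q for p, q in zip(da, db)) / denom if denom > 0 else 0.0

fs = 8000  # assumed sampling rate in Hz

def toy_channel(freq, n=400):
    """Half-wave rectified sinusoid as a crude stand-in for a firing pattern."""
    return [max(0.0, math.sin(2.0 * math.pi * freq * t / fs)) for t in range(n)]

# Two channels responding to the 2nd and 3rd harmonics of a 100 Hz voice.
channels = [toy_channel(200.0), toy_channel(300.0)]
correlogram = [autocorrelation(ch, 120) for ch in channels]
summary = summary_correlogram(correlogram)
f0 = estimate_f0(summary, fs, min_lag=20)
similarity = cross_channel_correlation(correlogram[0], correlogram[1])
```

Both toy channels repeat every 80 samples, so the summary peaks at that lag and the estimate comes out at 100 Hz, even though neither channel contains energy at 100 Hz itself. The two channels respond to different harmonics, so their cross-channel correlation is low.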

  17. Mid-level Representations - Example • Figure: correlogram and cross-correlation for the speech/telephone mixture

  18. Oscillator Network: Segmentation Layer • An oscillator consists of a reciprocally connected excitatory variable x_ij and inhibitory variable y_ij (Terman & Wang’95) • A stable limit cycle occurs for I_ij > 0 • Each oscillator is connected to its four nearest neighbors
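The oscillator equations did not survive in this transcript. The sketch below assumes the commonly published Terman-Wang form for a single uncoupled oscillator, x' = 3x - x^3 + 2 - y + I and y' = ε(γ(1 + tanh(x/β)) - y), with the coupling and noise terms omitted. The parameter values, the initial state, and the forward Euler integration are illustrative assumptions.

```python
import math

def simulate_oscillator(stimulus, steps=50000, dt=0.01,
                        epsilon=0.02, gamma=6.0, beta=0.1):
    """Integrate one relaxation oscillator with forward Euler; return the x trace."""
    x, y = -2.0, 0.0  # assumed initial state
    trace = []
    for _ in range(steps):
        dx = 3.0 * x - x ** 3 + 2.0 - y + stimulus          # fast excitatory variable
        dy = epsilon * (gamma * (1.0 + math.tanh(x / beta)) - y)  # slow inhibitory variable
        x += dt * dx
        y += dt * dy
        trace.append(x)
    return trace

def count_upward_crossings(trace, level=0.0):
    """Count jumps from the silent phase (x < level) to the active phase."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a < level <= b)

stimulated = simulate_oscillator(stimulus=0.2)     # I > 0
unstimulated = simulate_oscillator(stimulus=-0.2)  # I < 0
active_jumps = count_upward_crossings(stimulated)
silent_jumps = count_upward_crossings(unstimulated)
```

With I > 0 the trace alternates between a silent phase near x = -2 and an active phase near x = 2, which is the limit cycle the slide refers to. With I < 0 the oscillator settles at a fixed point on the left branch and never becomes active.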

  19. Segmentation Layer - continued • Horizontal weights are unity, vertical weights are unity if correlation exceeds threshold, otherwise 0 • Oscillators receive input if energy in corresponding channel exceeds a threshold • All oscillators are connected to a global inhibitor, which ensures that different segments are desynchronized from one another • A LEGION network
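The connection rules above determine which oscillators can synchronize. As a rough stand-in for the oscillator dynamics (it does not simulate LEGION), the sketch below labels connected components on the channel-by-frame grid: time neighbors in the same channel are always linked, adjacent channels are linked only where the cross-channel correlation exceeds a threshold, and only units with enough energy take part. The threshold values and toy arrays are hypothetical.

```python
from collections import deque

def label_segments(energy, cross_corr, energy_threshold, corr_threshold):
    """Label connected groups of stimulated units; 0 means unstimulated.

    energy[c][t] is the energy of channel c at frame t.
    cross_corr[c][t] is the correlation between channels c and c + 1 at frame t.
    """
    num_ch, num_fr = len(energy), len(energy[0])
    stimulated = [[energy[c][t] > energy_threshold for t in range(num_fr)]
                  for c in range(num_ch)]
    labels = [[0] * num_fr for _ in range(num_ch)]
    next_label = 0

    def neighbors(c, t):
        if t > 0:
            yield c, t - 1                      # horizontal weight is unity
        if t + 1 < num_fr:
            yield c, t + 1
        if c > 0 and cross_corr[c - 1][t] > corr_threshold:
            yield c - 1, t                      # vertical weight is unity
        if c + 1 < num_ch and cross_corr[c][t] > corr_threshold:
            yield c + 1, t

    for c in range(num_ch):
        for t in range(num_fr):
            if not stimulated[c][t] or labels[c][t]:
                continue
            next_label += 1
            labels[c][t] = next_label
            queue = deque([(c, t)])
            while queue:
                cc, tt = queue.popleft()
                for nc, nt in neighbors(cc, tt):
                    if stimulated[nc][nt] and labels[nc][nt] == 0:
                        labels[nc][nt] = next_label
                        queue.append((nc, nt))
    return labels, next_label

# Toy grid: 4 channels by 3 frames (values are made up for illustration).
energy = [[1, 1, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
cross_corr = [[0.99, 0.99, 0.99], [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]
labels, num_segments = label_segments(energy, cross_corr, 0.5, 0.985)
```

In the network itself, units within one such component would synchronize through local excitation, while the global inhibitor keeps different components out of phase with one another.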

  20. Segmentation Layer - Example • Output of the segmentation layer for the speech/telephone mixture

  21. Oscillator Network: Grouping Layer • The second layer is a two-dimensional oscillator network without global inhibition, which embodies the grouping stage of ASA • Oscillators in the second layer only receive input if the corresponding oscillator in the first layer is stimulated • At each time frame, an F0 estimate from the summary correlogram is used to classify channels into two categories: those that are consistent with the F0, and those that are not

  22. Grouping Layer - continued • Enforce a rule that all channels of the same time frame within each segment must have the same F0 category as the majority of channels • Figure: result for the speech/telephone example
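The majority rule can be sketched as a per-frame vote inside each segment. The sketch assumes each channel has already been marked as consistent or inconsistent with the frame's F0 estimate; how that test is done is not specified on the slides. The function name, the tie-breaking choice, and the toy data are hypothetical.

```python
def apply_majority_rule(segment_labels, agrees_with_f0):
    """Give every channel of a segment the majority F0 category in each frame.

    segment_labels[c][t] is a positive segment id, or 0 for unstimulated units.
    agrees_with_f0[c][t] is True when channel c is consistent with the F0 at frame t.
    """
    num_ch, num_fr = len(segment_labels), len(segment_labels[0])
    result = [row[:] for row in agrees_with_f0]
    for t in range(num_fr):
        channels_by_segment = {}
        for c in range(num_ch):
            seg = segment_labels[c][t]
            if seg:  # unstimulated units are left unchanged
                channels_by_segment.setdefault(seg, []).append(c)
        for chans in channels_by_segment.values():
            votes = sum(1 for c in chans if agrees_with_f0[c][t])
            majority = votes * 2 > len(chans)  # a tie counts as "not consistent" (assumption)
            for c in chans:
                result[c][t] = majority
    return result

# Toy data: 4 channels by 2 frames.
segment_labels = [[1, 1], [1, 1], [1, 1], [0, 2]]
agrees_with_f0 = [[True, False], [True, False], [False, True], [False, True]]
categories = apply_majority_rule(segment_labels, agrees_with_f0)
```

In frame 0 two of the three channels in segment 1 agree with the F0, so the third is overruled; in frame 1 the vote goes the other way.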

  23. Grouping Layer - continued • Grouping is limited to the time window of the longest segment • There are horizontal connections between oscillators in the same segment • Vertical connections are formed between pairs of channels within each time frame; mutual excitation if the channels belong to the same F0 category, otherwise mutual inhibition

  24. Grouping Layer - Example • Two streams emerge from the grouping layer • Foreground: left (original mixture) • Background: right

  25. Evaluation • Evaluated on a corpus of 100 mixtures (Cooke’93): 10 voiced utterances × 10 noise intrusions • The noise intrusions vary widely • A resynthesis pathway allows SNR to be estimated after segregation; the improvement in SNR after processing is reported for each noise condition
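An SNR improvement of this kind can be computed by comparing the clean utterance with the mixture and with the resynthesized output. The sketch below uses a simplified definition in which everything that differs from the clean speech counts as noise; this is an assumption and may differ from the exact measure used in the evaluation. The sample values are made up.

```python
import math

def snr_db(clean, estimate):
    """SNR in dB, treating (clean - estimate) as the residual noise."""
    signal_energy = sum(s * s for s in clean)
    noise_energy = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    if noise_energy == 0:
        return math.inf
    return 10.0 * math.log10(signal_energy / noise_energy)

# Toy signals (hypothetical values, not from the corpus).
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture = [s + n for s, n in zip(clean, noise)]
segregated = [s + 0.2 * n for s, n in zip(clean, noise)]  # most noise removed

snr_before = snr_db(clean, mixture)
snr_after = snr_db(clean, segregated)
improvement = snr_after - snr_before
```

Here the segregated signal keeps one fifth of the noise amplitude, so the noise energy falls by a factor of 25 and the SNR rises by about 14 dB.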

  26. Results of Evaluation • Figure panels: changes in SNR; speech energy retained

  27. Summary • An oscillatory correlation framework has been proposed for ASA • Neurobiologically plausible • Engineering applications: robust automatic speech recognition in noisy environments, hearing prostheses, and speech communication • The key issue is the integration of various grouping cues
