An overview of using temporal patterns of critical band energies, rather than the short-term spectrum alone, for phoneme detection with data-guided processing. The choice of frequency-time processing is motivated from physiological and psychophysical perspectives on hearing.
SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,
Radio Rex (1917)
[Figure labels: Newton, l/2, beer; Helmholtz, /u/ /o/ /a/ /e/ /iy/]
• “limited commercial success”
• - John Pierce, 1969
SHORT TERM SPECTRUM
[Figure: time-frequency plane (axes: time, frequency). A short-term spectrum slice of about 20 ms is passed to "classify".]
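ASR from TempoRAl Patterns (TRAP)
[Figure: time-frequency plane (axes: time, frequency) with phone “boundaries” marked. Short-term spectrum: a slice of about 20 ms is passed to "classify". Temporal pattern of critical band energies: a window of 1 sec is passed to "classify".]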
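The difference between the two classifier inputs can be sketched in code. This is a minimal sketch, not the authors' implementation: it assumes critical band energies have already been computed at a 10 ms frame step (the slides do not give a frame step), so that 101 frames span about 1 sec. The names `spectral_slice` and `temporal_pattern`, and the edge handling by repeating the first or last frame, are illustrative assumptions.

```python
def spectral_slice(energies, center):
    """Short-term spectrum view: all bands at one frame."""
    return list(energies[center])


def temporal_pattern(energies, band, center, context=50):
    """TRAP-style view: one band over 2*context+1 frames around `center`.

    energies: list of frames, each a list of per-band energies.
    With an assumed 10 ms frame step, context=50 covers about 1 sec.
    Frames outside the utterance are filled by repeating the edge frame.
    """
    last = len(energies) - 1
    pattern = []
    for t in range(center - context, center + context + 1):
        t_clamped = min(max(t, 0), last)  # clamp index to the valid range
        pattern.append(energies[t_clamped][band])
    return pattern


# Toy data: 200 frames, 3 bands, value = frame index + 10 * band index.
energies = [[t + 10 * b for b in range(3)] for t in range(200)]
across_frequency = spectral_slice(energies, 100)    # 3 values, one per band
along_time = temporal_pattern(energies, 1, 100)     # 101 values from band 1
```

The first function slices the time-frequency plane across frequency at one instant; the second slices it along time within a single band, which is the input a TRAP classifier would see under these assumptions.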
WHY 200-1000 ms?
[Figure: time-frequency plane (axes: time, frequency) with a 200-1000 ms span marked]
• because that’s where the information is (coarticulation)
  • mutual info studies (Bilmes, Yang et al.)
• psychophysics of hearing
  • 200 ms “critical time window” (forward masking, perception of loudness, perception of gaps, …)
• physiology of hearing
  • time component of cortical receptive fields (Klein)
• because “it works”
  • ETSI Aurora work
WHY narrow frequency bands?
[Figure: time-frequency plane (axes: time, frequency) with a 1-3 Bark span marked]
• psychophysics of hearing
  • independence of processing within critical bands
• physiology of hearing
  • mechanical selectivity of cochlea
  • cortical receptive fields (e.g. Shamma)
• because “it works”
  • multi-band ASR (Bourlard and Dupont, Hermansky et al., …)
  • decrease in ASR accuracy for wider frequency spans (Jain and Hermansky - Eurospeech 2003)
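To give a feel for how wide a 1-3 Bark band is in Hz, here is a small sketch. The slides do not say which Bark formula was used; this assumes the warping from PLP analysis, Bark = 6 asinh(f / 600), and its inverse. The function names are illustrative.

```python
import math


def hz_to_bark(f_hz):
    """Bark warping as used in PLP analysis (assumed here): 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)


def bark_to_hz(z_bark):
    """Inverse of hz_to_bark."""
    return 600.0 * math.sinh(z_bark / 6.0)


def bark_band_edges_hz(center_bark, width_bark):
    """Lower and upper edge in Hz of a band `width_bark` wide around `center_bark`."""
    half = width_bark / 2.0
    return bark_to_hz(center_bark - half), bark_to_hz(center_bark + half)


# Edges of a 1 Bark and a 3 Bark band, both centered at 8 Bark.
narrow_lo, narrow_hi = bark_band_edges_hz(8.0, 1.0)
wide_lo, wide_hi = bark_band_edges_hz(8.0, 3.0)
```

Because the warping is nonlinear, a band of fixed width in Bark covers more Hz at higher center frequencies.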
Which features?
[Figure: time-frequency plane (axes: time, frequency) → data-guided processing → features]
• no knowledge is better than wrong knowledge
• data cannot lie
• speech evolved to be heard
• data-derived processing is consistent with human-like processing (minus the irrelevant components of human cognitive processing)
WHY data-guided processing?
[Figure: time-frequency plane (axes: time, frequency) → data-guided (trained on data) processing → features]
• some function of class posteriors
• class posteriors form the most efficient feature set [e.g. Fukunaga]
• posteriors of which classes?
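As an illustration of "some function of class posteriors", the sketch below turns classifier scores into posteriors and then into features. Both steps are assumptions made for the example: the slides do not say how the trained system normalizes its outputs (a softmax is assumed here) or which function is applied to the posteriors (a floored logarithm is assumed here).

```python
import math


def softmax(scores):
    """Turn unnormalized class scores into posteriors that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_posterior_features(posteriors, floor=1e-10):
    """One possible 'function of class posteriors': a floored logarithm."""
    return [math.log(max(p, floor)) for p in posteriors]


# Hypothetical scores for three classes from some trained classifier.
posteriors = softmax([2.0, 0.5, -1.0])
features = log_posterior_features(posteriors)
```

The floor only guards against taking the logarithm of zero; its value is arbitrary.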
Speech Events
[Figure: block diagram with labels: signal, frequency selective hearing, event detection (two blocks), p(event,frequency), class (phoneme?) detection]
TRAP TANDEM
[Figure: block diagrams with labels: data (axes: time, frequency), processing (trained system) (three blocks), class posteriors, some function of phoneme posteriors]
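The diagram can be read as a two-stage arrangement: trained systems that each process data from one frequency band, followed by another trained system that combines their outputs into class posteriors. The sketch below shows only that data flow, under this reading. The classifiers are untrained toy stand-ins (`toy_band_classifier`, `toy_merger`), invented for the example so that the code runs; a real system would use trained models in their place.

```python
import math


def two_stage_posteriors(band_patterns, band_classifiers, merger):
    """Run one classifier per band, then merge the stacked outputs."""
    # Stage 1: each band classifier sees only its own band's temporal pattern.
    band_outputs = [clf(p) for clf, p in zip(band_classifiers, band_patterns)]
    # Stage 2: concatenate the band-level outputs and pass them to the merger.
    stacked = [v for out in band_outputs for v in out]
    return merger(stacked)


def toy_band_classifier(pattern):
    """Toy stand-in (not trained): 2-class posterior from the pattern mean."""
    mean = sum(pattern) / len(pattern)
    p = 1.0 / (1.0 + math.exp(-mean))
    return [p, 1.0 - p]


def toy_merger(stacked, n_classes=2):
    """Toy stand-in (not trained): average each class over all bands."""
    n_bands = len(stacked) // n_classes
    return [sum(stacked[c::n_classes]) / n_bands for c in range(n_classes)]


# Two bands, each with a short hypothetical temporal pattern.
patterns = [[0.0] * 5, [2.0] * 5]
merged = two_stage_posteriors(patterns, [toy_band_classifier] * 2, toy_merger)
```

Passing the classifiers in as callables keeps the flow independent of how each stage is trained.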