Statistical and Signal Processing Approaches for Voicing Detection
Alex Park, July 25, 2003
Overview
• Motivation and background for voicing detection
• Overview of recent methods
  • Signal processing approaches
  • Statistical approaches
• Performance comparison of voicing detection methods
  • Detection error rates on a small task
  • Example outputs
• Conclusions and future work
Motivation
• Voicing is not strictly necessary for speech understanding
  • E.g. whispered speech: excitation is provided by aspiration
  • E.g. sinewave speech: no periodic excitation; resonances are produced directly
• What is the value of adding voicing to the speech signal?
  • Separability? Pitch is useful for distinguishing a speaker from concurrent speakers and background
  • Redundancy? Harmonics provide a regular structure that lets us detect speech in multiple frequency bands
  • Robustness? Unvoiced speech has a lower SNR than voiced speech
    • Whispering is intended to prevent unwanted listeners from hearing
    • Shouting and singing are not possible without voicing
    • Low frequencies are less attenuated over distance
• Current speech recognition systems typically discard voicing information in the front end because
  • Energy is environment dependent and pitch is speaker dependent
  • The vocal tract configuration carries most of the phonetic information
Background
• Voicing is produced by periodic vibration of the vocal folds
• In time, voiced speech consists of repeated segments
• In frequency, the spectrum has a harmonic structure shaped by the formant resonances
• Pitch estimation and the voicing decision can be made
  • In time, using the repetition rate and similarity of pitch periods
  • In frequency, using the spacing and relative height of harmonic peaks
[Figure: time-domain and frequency-domain views of voiced speech, annotated with irregular pitch periods and a missing fundamental]
Signal Processing Approaches
• Signal processing approaches are marked by the lack of a training phase
• Voicing detection is typically paired with pitch extraction
• Well-known approach: peak-picking (spectral or temporal)
  • Usually followed by smoothing of gross errors via dynamic programming
• Many proposed solutions:
  • Spectral
    • Cepstral pitch tracking
    • Harmonic product spectrum (sketched below)
    • Logarithmic DFT pitch tracker (Wang)
  • Temporal
    • Autocorrelation
    • Sinusoid matching (Saul)
    • Synchrony (Seneff)
  • Exotic methods
    • Image-based pitch tracking (Quatieri)
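As a concrete illustration of one spectral method from the list above, here is a minimal harmonic product spectrum sketch for a single frame. The function name, zero-padding factor, harmonic count, and 60-400 Hz search range are illustrative choices, not values from the slides.

    import numpy as np

    def hps_pitch(x, fs, n_harmonics=4, fmin=60.0, fmax=400.0):
        # Zero-pad for a finer frequency grid, then take the magnitude spectrum.
        n = 4 * len(x)
        spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), n))
        hps = spec.copy()
        # Multiply in spectra downsampled by 2..n_harmonics: harmonics of the
        # true F0 line up and reinforce the peak at the fundamental.
        for h in range(2, n_harmonics + 1):
            m = len(spec) // h
            hps[:m] *= spec[::h][:m]
        freqs = np.fft.rfftfreq(n, 1.0 / fs)
        band = (freqs >= fmin) & (freqs <= fmax)  # plausible pitch range
        return freqs[band][np.argmax(hps[band])]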
I. Autocorrelation
• Temporal domain approach, used in the ESPS tool 'get_f0'
• Compute the inner product of the signal with a shifted version of itself (see the sketch below)
• If x[n] is a speech frame, then its short-time autocorrelation is R[τ] = Σ_n x[n] x[n+τ]
• Peaks in R[τ] occur at multiples of the fundamental period
[Figure: a speech frame and its short-time autocorrelation, with peaks at multiples of the fundamental period]
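A minimal sketch of this test, assuming a single mean-removed frame at sample rate fs; the normalized peak threshold of 0.3 and the 60-400 Hz pitch range are illustrative, not values from the slides.

    import numpy as np

    def autocorr_voicing(frame, fs, fmin=60.0, fmax=400.0, peak_thresh=0.3):
        # R[tau] = sum_n x[n] x[n+tau]; keep non-negative lags only.
        x = frame - np.mean(frame)
        r = np.correlate(x, x, mode="full")[len(x) - 1:]
        r = r / (r[0] + 1e-12)                     # normalize so that R[0] = 1
        lo, hi = int(fs // fmax), int(fs // fmin)  # lag range for the pitch search
        lag = lo + np.argmax(r[lo:hi])             # assumes len(frame) > fs / fmin
        is_voiced = r[lag] > peak_thresh           # strong peak => periodic frame
        return is_voiced, fs / lag

A full tracker would smooth these frame-level decisions (e.g. via dynamic programming, as noted above) to remove gross errors.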
II. Band-Limited Sinusoid Fitting (Saul 2002)
[Diagram: signal preconditioning (low-pass filter, half-wave rectification max(x,0)) feeds an 8-band octave filterbank spanning roughly 25-75 Hz up to 264-800 Hz; a sliding temporal window over each band k yields a fitted frequency ω_k* and error u_k*; the band with the minimum u* determines the output, with F0 derived from ω* and p(V) = f(u*)]
• Filter bandwidths allow at least one filter to resolve individual harmonics
• Frames of the filtered signals are fit with a sinusoid of frequency ω* and fit error u* (sketched below)
• At each step, the lowest u* gives the voicing probability and the corresponding ω* gives the pitch estimate
• The algorithm is fast and gives accurate pitch tracks
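The core per-band operation can be sketched as a least-squares sinusoid fit. Saul's actual algorithm uses a more efficient fitting procedure; the grid search below only illustrates the idea that a small normalized residual u* signals a voiced, nearly sinusoidal band.

    import numpy as np

    def best_sinusoid(x, fs, fmin=25.0, fmax=800.0, n_cand=200):
        # Try candidate frequencies; project the frame onto cos/sin at each
        # and keep the frequency with the smallest normalized residual.
        n = np.arange(len(x))
        best_f, best_u = None, np.inf
        energy = np.dot(x, x) + 1e-12
        for f in np.linspace(fmin, fmax, n_cand):
            basis = np.column_stack([np.cos(2 * np.pi * f * n / fs),
                                     np.sin(2 * np.pi * f * n / fs)])
            coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
            resid = x - basis @ coef
            u = np.dot(resid, resid) / energy  # small u => nearly sinusoidal
            if u < best_u:
                best_f, best_u = f, u
        return best_f, best_u  # pitch candidate and fit error for this band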
Statistical Approaches
• Statistical voicing detectors are not strictly tied to spectral features (though these are the features most widely used)
• Training data is useful for capturing acoustic cues of voicing that are not explicitly specified in signal processing approaches
• Possible classifiers suitable for voicing detection include
  • GMM classifier (with MFCC features)
  • Structured Bayesian network (alternative features)
  • Neural network classifier
  • Support vector machines
I. GMM Classifier
• Train two GMMs, p(x|V) and p(x|UV), on frame-level feature vectors (MFCCs plus surrounding frames, for deltas and delta-deltas)
• 50 mixture components each; dimensionality reduced to 50 via PCA
• Using Bayes' rule, the voicing score is the likelihood ratio: decide "voiced" for an unknown frame x when p(x|V)/p(x|UV) exceeds a threshold λ, and "unvoiced" otherwise (see the sketch below)
• The discriminative framework is useful because it brings knowledge of unvoiced speech characteristics into the decision
[Diagram: training on transcribed speech yields the V and UV GMMs; at test time the likelihood ratio p(x|V)/p(x|UV) for an unknown frame x is compared against λ]
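A hedged sketch of the two-GMM detector using scikit-learn: train p(x|V) and p(x|UV) on labeled frames, then score test frames by log-likelihood ratio. Feature extraction (MFCCs plus deltas, PCA to 50 dimensions) is assumed done elsewhere; the mixture count matches the slide, while the diagonal covariances and function names are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_voicing_gmms(feats_voiced, feats_unvoiced, n_mix=50):
        # One GMM per class, fit on (n_frames, n_dims) feature matrices.
        gmm_v = GaussianMixture(n_components=n_mix,
                                covariance_type="diag").fit(feats_voiced)
        gmm_uv = GaussianMixture(n_components=n_mix,
                                 covariance_type="diag").fit(feats_unvoiced)
        return gmm_v, gmm_uv

    def voicing_scores(gmm_v, gmm_uv, frames):
        # log p(x|V) - log p(x|UV); decide "voiced" above log(lambda).
        return gmm_v.score_samples(frames) - gmm_uv.score_samples(frames)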
II. Bayesian Network (Saul, Rahim & Allen 2001)
[Diagram: signal preconditioning (half-wave rectification max(x,0)) feeds a 24-channel auditory (gammatone) filterbank; per-channel feature extraction and sigmoid channel tests s(u) form an AND layer, and the channel decisions are combined in an OR layer to produce the voicing decision]
• A feature vector is constructed for each frame of narrowband speech
  • (Autocorrelation peaks and valleys) and (an SNR estimate) = 5 dimensions per band per frame
• Individual voicing decisions are made on each channel
• The channel sigmoid decision weights (the θ's) are trained via the EM algorithm
• The overall voicing decision is triggered by a positive decision in any individual channel (see the sketch below)
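A sketch of the channel-test and OR-layer idea: each channel maps its 5-dimensional feature vector to a voicing probability through a trained sigmoid, and the channels are combined so that any confident channel can trigger a voiced decision. The probabilistic (noisy-) OR below is one way to realize that combination, assumed here for illustration; the trained weights θ come from EM and are taken as given.

    import numpy as np

    def channel_prob(features, theta):
        # Sigmoid of a linear score: features shape (5,), theta shape (6,)
        # with the bias term last.
        z = np.dot(features, theta[:-1]) + theta[-1]
        return 1.0 / (1.0 + np.exp(-z))

    def or_layer(channel_probs):
        # P(voiced) = 1 - prod_k (1 - p_k): high if any channel says voiced.
        return 1.0 - np.prod(1.0 - np.asarray(channel_probs))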
Comparison: Matched Conditions
• Trained on 410 TIMIT sentences from 40 speakers (126k frames)
• Evaluated on 100 TIMIT sentences from 10 speakers (28k frames)
• Speech was resampled to 8 kHz; phone labels were used as the voicing reference
• Also evaluated on the Keele database (laryngograph reference)
[Figure: performance comparison of the methods, with the operating point reported by Saul marked]
Sample Outputs: Matched Conditions
• Example voicing tracks output by the individual methods
[Figure: voicing tracks from the GMM with MFCCs, Bayesian network, sinusoid fit, and autocorrelation methods, with voiced, unvoiced, and uncertain regions marked]
Comparison: Mismatched Conditions
• Evaluated with different kinds of signal corruption
• The corruption condition is not known a priori, so the same threshold as before is used
  • The threshold can be made adaptive to the environment (equivalent to modifying the output probability)
• Overall error rates are unsatisfactory
• The GMM classifier has the best performance on clean data, but unpredictable results in varied conditions (frame-level scoring is sketched below)
[Table: detection error rates for the GMM, autocorrelation, and sinusoid fit methods under each corruption condition]
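For reference, frame-level detection errors at a fixed threshold can be computed as below. The function and names are illustrative, and the actual scoring procedure used in these experiments may differ (as the conclusions note).

    import numpy as np

    def detection_errors(scores, ref_voiced, threshold=0.0):
        # Decide "voiced" when the score exceeds the threshold.
        hyp = np.asarray(scores) > threshold
        ref = np.asarray(ref_voiced, dtype=bool)
        miss = np.mean(~hyp[ref])         # voiced frames labeled unvoiced
        false_alarm = np.mean(hyp[~ref])  # unvoiced frames labeled voiced
        total = np.mean(hyp != ref)       # overall frame error rate
        return miss, false_alarm, total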
Sample Outputs: Mismatched Conditions
• Voicing tracks on an NTIMIT utterance
[Figure: voicing tracks from the GMM with MFCCs, Bayesian network, sinusoid fit, and autocorrelation methods on the TIMIT and NTIMIT versions of the utterance, with voiced, unvoiced, and uncertain regions marked]
Conclusions and Future Work
• Error rates are still high compared with those reported in the literature
  • Post-processing to remove stray frames may help
  • Possible problem with the scoring procedure?
• Statistical framework with knowledge-based features
  • Weight the contributions of multiple detectors using an SNR-based variable
  • Using the same approach, apply to phonetic detectors for voiced speech
    • Nasality: broad F1 bandwidth, low spectral slope in the F1-F2 region, stable low-frequency energy
    • Rounding: low F1 and F2
    • Retroflexion: low F3, rising formants
  • Combine feature streams with SNR-based weights as input to an HMM
References
• L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). "Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch," in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press: Cambridge, MA.
• L. K. Saul, M. G. Rahim, and J. B. Allen (2001). "A statistical model for robust integration of narrowband cues in speech," Computer Speech and Language 15(2): 175-194.
• C. Wang and S. Seneff (2000). "Robust Pitch Tracking for Prosodic Modeling in Telephone Speech," in Proc. ICASSP '00, Istanbul, Turkey.
• S. Seneff (1985). "Pitch and spectral analysis of speech based on an auditory synchrony model," Ph.D. thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, MA.
• T. F. Quatieri (2002). "2-D Processing of Speech with Application to Pitch Estimation," in Proc. ICSLP '02, Denver, Colorado.