Pitch Prediction for Glottal Spectrum Estimation with Applications in Speaker Recognition
Nengheng Zheng
Supervised by Professor P.C. Ching
Nov. 26, 2004
Outline
• Speech production and glottal pulse excitation in detail
• Linear prediction: short-term and long-term
• Glottal spectrum estimated with long-term prediction, and the resulting acoustic features
• Implementation for speaker recognition
Speech Production
[Figure: glottal pulses → vocal tract → speech signal]
• Discrete-time model for speech production
• A combined transfer function
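The equation for the combined transfer function is not preserved in this transcript. As a reminder only, the textbook discrete-time source-filter model is usually written as below; the symbols E, G, V, R and H are assumptions taken from that standard formulation, not necessarily the notation used on the slide.

```latex
\[
S(z) = E(z)\,G(z)\,V(z)\,R(z), \qquad H(z) = G(z)\,V(z)\,R(z)
\]
% E(z): impulse-train (voiced) or noise (unvoiced) excitation
% G(z): glottal pulse shaping filter
% V(z): vocal tract filter (formant resonances)
% R(z): lip radiation
% H(z): combined transfer function seen by the excitation
```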
Acoustic Features of Glottal Pulse
• Time domain
  • pitch period
  • pitch period perturbation (jitter)
  • pulse amplitude perturbation (shimmer)
  • glottal pulse width
  • abruptness of closure of the glottal flow
  • aspiration noise
• Frequency domain
  • fundamental frequency (F0)
  • spectral tilt (slope)
  • harmonic richness
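Jitter and shimmer from the list above can be made concrete with a small sketch. It assumes the common "local" definition (mean absolute difference between consecutive cycles, divided by the mean value); the slides do not say which definition was used, and the function names are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value.

    Assumed 'local' perturbation measure; not taken from the slides.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values[:-1], values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(pitch_periods):
    """Jitter: perturbation of consecutive pitch periods (any time unit)."""
    return relative_perturbation(pitch_periods)


def local_shimmer(pulse_amplitudes):
    """Shimmer: perturbation of consecutive pulse peak amplitudes."""
    return relative_perturbation(pulse_amplitudes)
```

Both functions take one value per glottal cycle, so they presuppose that the pitch marks have already been located.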
Glottal Pulse and Voice Quality
• Glottal pulse shape plays an important role in the quality of natural or synthesized vowels [Rosenberg 1971]
• The shape and periodicity of vocal cord excitation are subject to large variation
• Such variations are significant for preserving the naturalness of speech
• A typical glottal pulse is asymmetric, with a shorter falling phase; its spectrum decays at -12 dB/octave
• There is more variation among different speakers than among different utterances of the same speaker [Mathews 1963]
• Such variations have little significance for speech intelligibility but affect the perceived vocal quality [Childers 1991]
Various Glottal Pulses
• Some other vocal types: breathy, falsetto, vocal fry
• Temporal and spectral characteristics
Some Comments
• Generally, studying glottal pulse characteristics requires reconstructing the glottal pulse waveform by inverse filtering
• Automatically and exactly reconstructing the glottal waveform from real speech is almost impossible, especially during the transient phases of articulation or for high-pitched speakers
• Fortunately, it is possible to estimate the glottal spectrum from the residual signal with pitch prediction
Linear Prediction
• Speech waveform: the current sample is correlated with past samples and is thus predictable
• Short-term correlation
  • occurs within one pitch period
  • formant modulation
  • classical linear prediction analysis (short-term prediction)
• Long-term correlation
  • occurs across consecutive pitch periods
  • vocal cord vibration
  • long-term (pitch) prediction
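Short-term prediction as described above can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a minimal illustration, not the presenter's implementation; the predictor convention x_hat[n] = sum_k a[k] * x[n-k] and all function names are assumptions.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations for a[1..order].

    Returns (coeffs, error) where coeffs[k-1] multiplies x[n-k]
    and error is the final prediction error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. all zeros): stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def short_term_residual(x, coeffs):
    """Inverse-filter the frame: e[n] = x[n] - sum_k a[k] * x[n-k]."""
    residual = []
    for n in range(len(x)):
        pred = sum(c * x[n - 1 - k]
                   for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(x[n] - pred)
    return residual
```

The residual returned by `short_term_residual` still carries the pitch periodicity, which is what the long-term predictor then models.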
Linear Prediction
• Short-term predictor (classical linear prediction)
  • removes the short-term correlation, leaving a glottal excitation signal
• Long-term predictor (pitch prediction)
  • removes the correlation across consecutive pitch periods
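The long-term predictor can be sketched as a three-tap filter around the pitch lag, fitted by least squares on the short-term residual. The predictor formulas are not preserved in this transcript, so the form e_hat[n] = sum over k in {-1, 0, 1} of b_k * e[n - T - k], the open-loop lag search, and every function name below are assumptions.

```python
import math


def estimate_pitch_lag(e, min_lag, max_lag):
    """Pick the lag that maximizes the normalized cross-correlation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        num = sum(e[n] * e[n - lag] for n in range(lag, len(e)))
        den = sum(e[n - lag] ** 2 for n in range(lag, len(e)))
        score = num / math.sqrt(den) if den > 0.0 else float("-inf")
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def three_tap_pitch_predictor(e, lag):
    """Least-squares taps [b_-1, b_0, b_1] and prediction gain in dB."""
    taps = (-1, 0, 1)
    idx = range(lag + 1, len(e))  # keeps every index n - lag - k >= 0
    phi = [[sum(e[n - lag - i] * e[n - lag - j] for n in idx)
            for j in taps] for i in taps]
    rhs = [sum(e[n] * e[n - lag - i] for n in idx) for i in taps]
    b = solve_linear(phi, rhs)
    resid = [e[n] - sum(bk * e[n - lag - k] for bk, k in zip(b, taps))
             for n in idx]
    in_energy = sum(e[n] ** 2 for n in idx)
    out_energy = sum(v * v for v in resid)
    if out_energy == 0.0:
        return b, float("inf")
    return b, 10.0 * math.log10(in_energy / out_energy)
```

A high prediction gain indicates a strongly periodic (voiced) frame; a gain near 0 dB indicates that the lagged residual explains little of the current one.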
Harmonic Structure of Glottal Spectrum
• Two parameters describe the harmonic structure: the harmonic richness factor and the noise-to-harmonic ratio
• Harmonic richness factor (HRF)
• Noise-to-harmonic ratio (NHR)
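The HRF and NHR formulas are not preserved in this transcript. The sketch below uses commonly quoted definitions as an assumption: HRF as the power of the harmonics above the fundamental relative to the fundamental, and NHR as the non-harmonic energy relative to the harmonic energy, both in dB. It also assumes the frame holds a whole number of pitch periods so that harmonics fall on DFT bins; the function names are hypothetical.

```python
import cmath
import math


def bin_power(x, freq_hz, fs):
    """Squared magnitude of a single-frequency DFT projection of x."""
    w = 2.0 * math.pi * freq_hz / fs
    acc = sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(x))
    return abs(acc) ** 2


def harmonic_powers(x, f0, fs, n_harmonics):
    """Powers at f0, 2*f0, ... up to n_harmonics, staying below fs/2."""
    powers = []
    for k in range(1, n_harmonics + 1):
        if k * f0 >= fs / 2.0:
            break
        powers.append(bin_power(x, k * f0, fs))
    return powers


def hrf_db(x, f0, fs, n_harmonics=10):
    """Assumed HRF: harmonics 2..n relative to the fundamental, in dB."""
    p = harmonic_powers(x, f0, fs, n_harmonics)
    if len(p) < 2 or p[0] <= 0.0:
        raise ValueError("need a nonzero fundamental and one more harmonic")
    return 10.0 * math.log10(sum(p[1:]) / p[0])


def nhr_db(x, f0, fs, n_harmonics=10):
    """Assumed NHR: non-harmonic over harmonic energy, in dB."""
    total = len(x) * sum(v * v for v in x)  # Parseval: sum of |X[k]|^2
    # Factor 2 counts the mirrored negative-frequency bins of a real signal.
    harmonic = 2.0 * sum(harmonic_powers(x, f0, fs, n_harmonics))
    if harmonic <= 0.0:
        raise ValueError("no harmonic energy found")
    noise = max(total - harmonic, 1e-12 * total)  # floor avoids log(0)
    return 10.0 * math.log10(noise / harmonic)
```

The per-band variants HRFn and NHRn listed under Feature Generation would restrict these sums to the harmonics inside each band; how the bands were applied is not recoverable from the transcript.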
Feature Generation
• Acoustic features include the following:
  • Fundamental frequency F0
  • Pitch prediction gain g
  • Pitch prediction coefficients b_{-1}, b_0, b_1
  • HRF_n and NHR_n (n = 1, ..., 10)
  • 10 mel-scale frequency bands
• Feature generation process
Experimental Conditions
• Speech quality: telephone speech
• Subjects: 49 male speakers
• Training condition:
  • 3 training sessions, about 90 s of speech in total, collected over 3 to 6 weeks
  • GMM with 128 mixtures
• Testing condition:
  • 12 testing sessions over 4 to 6 months
Speaker Recognition Experiments
• Identification results with long-term prediction related features
• Comparison of glottal source features with classical features
Summary
• Glottal source excitation is important for the perceived naturalness of voice quality and helps distinguish one speaker from another.
• Linear prediction is a powerful tool for speech analysis. The spectral properties of the supraglottal vocal tract system can be estimated by short-term prediction, while long-term prediction estimates the spectrum of the glottal excitation system.
• Recognition results show that the glottal source related acoustic features (F0, prediction gain, HRF, NHR, etc.) provide a certain degree of speaker-discriminative power.
Other Applications
• Speech coding
• Speech recognition?
• Speech emotion recognition!