THE FUNDAMENTAL FREQUENCY VARIATION SPECTRUM. FONETIK 2008. Kornel Laskowski, Mattias Heldner and Jens Edlund. interACT, Carnegie Mellon University, Pittsburgh PA, USA; Centre for Speech Technology, KTH, Stockholm, Sweden. Speaker: Hsiao-Tsung.
Introduction • While speech recognition systems long ago transitioned from formant localization to spectral (vector-valued) formant representations, prosodic processing continues to rely squarely on a pitch tracker's ability to identify a peak corresponding to the fundamental frequency (F0) of the speaker. • Even if a robust, local, analytic, statistical estimate of absolute pitch were available, applications require a representation of pitch variation, and they go to considerable additional effort to identify a speaker-dependent quantity for normalization.
The Fundamental Frequency Variation Spectrum • Instantaneous variation in pitch is normally computed by determining a single scalar, the F0, at two temporally adjacent instants and forming their difference.
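As a point of reference, the conventional scalar computation might look like the sketch below. It assumes a hypothetical F0 track already produced by some pitch tracker, with `None` marking frames where no pitch was found; the function name `delta_f0` is illustrative only.

```python
def delta_f0(f0_track):
    """Difference of F0 (in Hz) between temporally adjacent frames.

    f0_track is a hypothetical list of per-frame F0 estimates, with None
    for frames where the pitch tracker found no peak. A difference is
    only defined when both neighbouring frames have an estimate.
    """
    deltas = []
    for previous, current in zip(f0_track, f0_track[1:]):
        if previous is None or current is None:
            deltas.append(None)  # no scalar F0 on one side, so no variation
        else:
            deltas.append(current - previous)
    return deltas


track = [100.0, 110.0, None, 120.0, 115.0]
print(delta_f0(track))
```

The sketch makes the dependence explicit: every missing or wrong F0 estimate removes or corrupts the variation values on both sides of it.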
The Fundamental Frequency Variation Spectrum • We propose a vector-valued representation of pitch variation, inspired by vanishing-point perspective. • The standard inner product between two vectors can be viewed as the summation of pair-wise products, with pairs selected by orthonormal projection onto a point at infinity. • F: the signal's spectral content (512-point FFT)
The Fundamental Frequency Variation Spectrum • The proposed vanishing-point product induces a 1-point perspective projection onto a point at
The Fundamental Frequency Variation Spectrum • The FFV spectrum is then given by • is undefined over the interval [-T0, +T0]
The Fundamental Frequency Variation Spectrum • A support for which is continuous over • In practice, we compute using magnitude rather than complex spectra
The Fundamental Frequency Variation Spectrum • FL and FR are 512-point Fourier transforms, computed every 8 ms. • However, the discrete transforms FL and FR are in general not defined at the corresponding dilated frequencies. • We resort to linear interpolation using the coefficients
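The interpolation step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `factor` parameter standing in for the dilation, the zero fill for out-of-range positions, and the energy normalization are all assumptions made for the example. The inputs are magnitude spectra, given here as plain lists.

```python
import math


def interpolate_bin(spectrum, position):
    """Linearly interpolate a magnitude spectrum at a fractional bin position."""
    last = len(spectrum) - 1
    if position < 0 or position > last:
        return 0.0  # assumption: positions outside the spectrum count as zero
    lower = int(math.floor(position))
    upper = min(lower + 1, last)
    weight = position - lower
    return (1.0 - weight) * spectrum[lower] + weight * spectrum[upper]


def dilated_similarity(mag_left, mag_right, factor):
    """Normalized product of mag_right[k] with mag_left sampled at factor * k.

    factor is a hypothetical dilation of the frequency axis; factor == 1.0
    reduces to an ordinary normalized inner product of the two spectra.
    """
    dilated = [interpolate_bin(mag_left, factor * k) for k in range(len(mag_right))]
    numerator = sum(a * b for a, b in zip(dilated, mag_right))
    energy = sum(a * a for a in dilated) * sum(b * b for b in mag_right)
    return numerator / math.sqrt(energy) if energy > 0 else 0.0


# Toy spectra: a single peak at bin 4 on the left and at bin 2 on the right.
left = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for factor in (1.0, 1.5, 2.0):
    print(factor, dilated_similarity(left, right, factor))
```

In the toy example the two peaks only line up when the left spectrum is read at twice the bin index, so the similarity is largest at `factor = 2.0`. Evaluating such a similarity over a range of dilations yields a vector rather than a single scalar, which is the sense in which the representation is vector-valued.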
The Fundamental Frequency Variation Spectrum • Energy independent
Filterbank • Figure: filterbank, with filters labeled "slowly changing" and "rapidly changing"
Discussion • Initial experiments along these lines show that such HMMs, when trained on dialogue data, corroborate research on human turn-taking behavior in conversations. • The approach does not require peak identification, dynamic time warping, median filtering, landmark detection, linearization, or mean pitch estimation and subtraction. • Immediate next steps include fine-tuning the filter banks and the HMM topologies, and testing the results on other tasks where pitch movements are expected to play a role, such as the attitudinal coloring of short feedback utterances, speaker verification, and automatic speech recognition for tonal languages.