PREDICTION AND SYNTHESIS OF PROSODIC EFFECTS ON SPECTRAL BALANCE OF VOWELS
Jan P.H. van Santen and Xiaochuan Niu
Center for Spoken Language Understanding
OGI School of Science & Technology at OHSU
CENTER FOR SPOKEN LANGUAGE UNDERSTANDING

OVERVIEW
• IMPORTANCE OF SPECTRAL BALANCE
• MEASUREMENT OF SPECTRAL BALANCE
• ANALYSIS METHODS
• RESULTS
• SYNTHESIS
• CONCLUSIONS

1. IMPORTANCE OF SPECTRAL BALANCE
• Linguistic Control Factors
  • Stress-like factors
  • Positional factors
  • Phonemic factors
• Acoustic Correlates
  • Traditionally TTS-controlled:
    • Pitch, timing, amplitude
  • Demonstrated in natural speech, but usually not TTS-controlled:
    • Spectral tilt, balance
    • Formant dynamics
    • …

2. MEASUREMENT OF SPECTRAL BALANCE
• Data:
  • 472 greedily selected sentences
    • Genre: newspaper
    • Greedy features: linguistic control factors
  • One female speaker
  • Manual segmentation
  • Accent: independent rating by 3 judges
    • 0-3 score

2. MEASUREMENT OF SPECTRAL BALANCE
• Energy in 5 formant-range frequency bands
  • B0: 100-300 Hz [~F0]
  • B1: 300-800 Hz [~F1]
  • B2: 800-2500 Hz [~F2]
  • B3: 2500-3500 Hz [~F3]
  • B4: 3500-max Hz [~fricative noise]
• In other words, a multidimensional measure
• Filter bank → Square → Average [1 ms rect.] → $20 \log_{10}(B_i)$
• Subtract estimated per-utterance means

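The measurement chain above can be illustrated with a small sketch. It is not the authors' implementation: a naive DFT stands in for the filter bank, the upper edge of B4 ("max") is assumed to be the Nyquist frequency, B_i is assumed to be an RMS-like band amplitude, and the function names `band_levels_db` and `subtract_utterance_means` are hypothetical.

```python
import cmath
import math

# Band edges in Hz as listed on the slide; None stands for "max",
# assumed here to mean the Nyquist frequency.
BAND_EDGES_HZ = [(100, 300), (300, 800), (800, 2500), (2500, 3500), (3500, None)]

def band_levels_db(frame, sample_rate):
    """Return 20*log10(B_i) for one frame, one value per band.

    A naive DFT replaces the slide's filter bank (an assumption made to
    keep the sketch dependency-free).
    """
    n = len(frame)
    power = [0.0] * len(BAND_EDGES_HZ)
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        for i, (lo, hi) in enumerate(BAND_EDGES_HZ):
            if freq >= lo and (hi is None or freq < hi):
                power[i] += abs(coeff) ** 2  # "square" step
    # One-sided spectrum, so power is doubled (the Nyquist bin is slightly
    # over-counted). The floor avoids log(0) for empty bands.
    return [20 * math.log10(max(math.sqrt(2 * p) / n, 1e-10)) for p in power]

def subtract_utterance_means(levels):
    """Subtract the per-band mean over all frames of one utterance."""
    n_bands = len(levels[0])
    means = [sum(row[i] for row in levels) / len(levels) for i in range(n_bands)]
    return [[v - m for v, m in zip(row, means)] for row in levels]

# Usage: a 1000 Hz sine sampled at 8 kHz should put its energy in B2.
frame = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(80)]
levels = band_levels_db(frame, 8000)
centered = subtract_utterance_means([[1.0, 2.0, 3.0, 4.0, 5.0],
                                     [3.0, 4.0, 5.0, 6.0, 7.0]])
```

A time-domain band-pass filter bank followed by a 1 ms rectangular average, as on the slide, would give a frame-by-frame trajectory rather than one value per analysis frame; the DFT version only shows how energy is assigned to bands and normalized.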
2. MEASUREMENT OF SPECTRAL BALANCE
• Details:
  • Confounding with F0
    • Measure pitch-corrected and raw
    • For certain wave shapes, pitch is directly related to fixed-frame energy
    • Why do both: wave shapes may change in unknown ways
  • F0 not confined to B0 [female speech]
  • Vowel formants not quite confined to bands [e.g., F1 for /EE/ and F3 for /ER/]

2. MEASUREMENT OF SPECTRAL BALANCE
• Why not more or different bands?
  • Multiple interacting Linguistic Control Factors
  • Need measurements that minimize interactions
  • 5 bands → different vowels “behave similarly”
  • Can model vowels as a class
• Why not simply spectral tilt?
  • 5 bands: more information than a single measure
  • Supply more information for synthesis

3. ANALYSIS METHODS
• Measures likely to behave like segmental duration:
  • Multiple interacting, confounded factors:
    • Interaction: the magnitude of the effect of one factor may depend on other factors
    • Confounding: unequal frequencies of control factor combinations
  • “Directional Invariance”
    • The direction of the effect of one factor is independent of other factors

3. ANALYSIS METHODS
• Need method that
  • can handle multiple interacting, confounded factors and
  • takes advantage of Directional Invariance
• Used: Sums of Products Model:

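The model equation itself did not survive in this transcript. A general form that is consistent with the special cases listed on the next slide, and with the sums-of-products form used for segmental duration, is sketched below. The symbols $y$, $f_j$ and $S_{i,j}$ are assumed notation, not necessarily the slide's own: $f_j$ is the level of factor $j$, $K$ indexes the product terms, and $I_i$ is the set of factors entering term $i$.

```latex
y(f_0, \ldots, f_n) \;=\; \sum_{i \in K} \; \prod_{j \in I_i} S_{i,j}(f_j)
```
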
3. ANALYSIS METHODS
• Special cases:
  • Multiplicative model: $K = \{1\}$, $I_1 = \{0, \ldots, n\}$
  • Additive model: $K = \{0, \ldots, n\}$, $I_i = \{i\}$

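To make the two special cases concrete, here is a minimal sketch of a sums-of-products evaluator, assuming the model has the form "sum over i in K of the product over j in I_i of a per-factor parameter". The function name, the dictionary layout and all parameter values are hypothetical and chosen only for illustration.

```python
from math import prod

def sums_of_products(factor_levels, K, I, S):
    """Evaluate sum_{i in K} prod_{j in I[i]} S[(i, j)][factor_levels[j]].

    factor_levels[j] is the level of factor j; S[(i, j)] maps each level
    of factor j to its parameter in product term i.
    """
    return sum(prod(S[(i, j)][factor_levels[j]] for j in I[i]) for i in K)

# Two factors (n = 1): factor 0 = stress, factor 1 = position.
levels = ["stressed", "right"]

# Additive model: K = {0, ..., n}, I_i = {i}. Values are made up.
additive = sums_of_products(
    levels,
    K=[0, 1],
    I={0: [0], 1: [1]},
    S={(0, 0): {"stressed": 2.0, "unstressed": -2.0},
       (1, 1): {"left": 1.0, "right": -1.0}},
)

# Multiplicative model: K = {1}, I_1 = {0, ..., n}. Values are made up.
multiplicative = sums_of_products(
    levels,
    K=[1],
    I={1: [0, 1]},
    S={(1, 0): {"stressed": 1.5, "unstressed": 1.0},
       (1, 1): {"left": 1.0, "right": 0.8}},
)
```

With these made-up parameters the additive case sums one term per factor, while the multiplicative case is a single product over all factors.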
3. ANALYSIS METHODS
• Used additive model
• Note: Parameter estimates are:
  • Estimates of marginal means …
  • … in balanced design:

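The link between additive parameters and marginal means can be checked on a toy balanced design. The sketch below assumes two factors with made-up effect sizes and a hypothetical helper `marginal_means`; it only shows that, when every factor combination occurs equally often and the data are additive, each marginal mean equals the grand mean plus that level's effect.

```python
from itertools import product
from statistics import mean

def marginal_means(cells, n_factors):
    """Per-factor marginal means of a table mapping level tuples to values."""
    result = []
    for j in range(n_factors):
        levels = sorted({key[j] for key in cells})
        result.append({lv: mean(v for key, v in cells.items() if key[j] == lv)
                       for lv in levels})
    return result

# Hypothetical additive effects (in dB) around a grand mean of 50.
grand_mean = 50.0
stress_effect = {"stressed": 2.0, "unstressed": -2.0}
position_effect = {"left": 1.0, "right": -1.0}

# Balanced design: every stress/position combination occurs exactly once.
cells = {(s, p): grand_mean + stress_effect[s] + position_effect[p]
         for s, p in product(stress_effect, position_effect)}

marginals = marginal_means(cells, n_factors=2)
```

In an unbalanced corpus the raw marginal means would be distorted by confounding, which is why the fitted additive parameters, rather than raw means, are the quantities of interest.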
3. ANALYSIS METHODS
• Pitch correction:
  • Confounding with F0: show both <B0, B1, B2, B3, B4> and <B0 + B1, B2, B3, B4>

4. RESULTS: (A) POSITIONAL EFFECTS
[Figure] 5 bands, not pitch-corrected. Solid: right position; dashed: left position. Y-axis: corrected mean.

4. RESULTS: (A) POSITIONAL EFFECTS
[Figure] 5 bands, pitch-corrected.

4. RESULTS: (A) POSITIONAL EFFECTS
[Figure] 4 bands, not pitch-corrected.

4. RESULTS: (A) POSITIONAL EFFECTS
[Figure] 4 bands, pitch-corrected.

4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure] 5 bands, not pitch-corrected. Solid: stressed syllable; dashed: unstressed. Y-axis: corrected mean.

4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure] 5 bands, pitch-corrected.

4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure] 4 bands, not pitch-corrected.

4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure] 4 bands, pitch-corrected.

4. RESULTS: (C) TILT EFFECTS

5. SYNTHESIS
• Use ABS/OLA sinusoidal model:
  • $s[n]$ = sum of overlapped short-time signal frames $s_k[n]$
  • $s_k[n]$ = sum of quasi-harmonic sinusoidal components: $s_k[n] = \sum_l A_{k,l} \cos(\omega_{k,l} n + \phi_{k,l})$
• Each frame of a unit is represented by a set of quasi-harmonic sinusoidal parameters
• Given the desired F0 contour, pitch shift is applied to the sinusoidal parameter component of the unit to obtain the target parameter $A_{k,l}$

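The frame model above can be sketched in a few lines. This is an illustration of the two formulas only, not the ABS/OLA system itself: the triangular window, the hop size and the function names `synth_frame` and `overlap_add` are assumptions, and no pitch shifting is performed.

```python
import math

def synth_frame(amps, freqs, phases, length):
    """s_k[n] = sum_l A_{k,l} * cos(w_{k,l} * n + phi_{k,l}), n = 0..length-1.

    freqs are in radians per sample.
    """
    return [sum(a * math.cos(w * n + p) for a, w, p in zip(amps, freqs, phases))
            for n in range(length)]

def overlap_add(frames, hop):
    """s[n]: sum of windowed frames placed hop samples apart.

    A triangular window is assumed; the slide does not name one.
    """
    length = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + length)
    for k, frame in enumerate(frames):
        for n, x in enumerate(frame):
            window = 1.0 - abs(2.0 * n / (length - 1) - 1.0)
            out[k * hop + n] += window * x
    return out

# Usage: two constant frames (one component at 0 rad/sample, amplitude 1).
# With hop = (length - 1) / 2 the triangular windows sum to one where
# the frames overlap.
frame = synth_frame([1.0], [0.0], [0.0], 5)
signal = overlap_add([frame, frame], hop=2)
```

In a real system each frame would carry its own amplitudes, frequencies and phases, and the phases would have to be chosen so that neighbouring frames stay coherent in the overlap region.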
5. SYNTHESIS
• Considering the differences in prosodic factors between the original and target unit, obtain band differences:
• Transform the band differences into weights applied to the sinusoidal parameters:
  • …, when the j'th harmonic is located in the i'th band
• Spectral smoothing across unit boundaries

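The weight equation is not recoverable from the slide text. One plausible reading, assumed here purely for illustration, is that a band difference d_i in dB becomes a linear gain 10^(d_i / 20) that scales the amplitude of every harmonic falling in band i. The band edges come from the measurement slide; the function names and the treatment of harmonics below 100 Hz are hypothetical.

```python
# Band edges in Hz from the measurement slide; the open upper edge of B4
# is represented by infinity.
BAND_EDGES_HZ = [(100, 300), (300, 800), (800, 2500), (2500, 3500),
                 (3500, float("inf"))]

def band_index(freq_hz):
    """Index i of the band containing freq_hz, or None if below 100 Hz."""
    for i, (lo, hi) in enumerate(BAND_EDGES_HZ):
        if lo <= freq_hz < hi:
            return i
    return None

def apply_band_weights(amps, freqs_hz, band_diff_db):
    """Scale each harmonic amplitude by the gain of the band it lies in.

    Assumed mapping (not from the source): gain_i = 10 ** (d_i / 20).
    Harmonics outside all bands are left unchanged.
    """
    scaled = []
    for amp, freq in zip(amps, freqs_hz):
        i = band_index(freq)
        gain = 1.0 if i is None else 10 ** (band_diff_db[i] / 20)
        scaled.append(amp * gain)
    return scaled

# Usage: raise B0 by 20 dB and lower B2 by 20 dB.
new_amps = apply_band_weights([1.0, 1.0, 1.0], [200.0, 1000.0, 50.0],
                              [20.0, 0.0, -20.0, 0.0, 0.0])
```

Piecewise-constant gains produce steps at the band edges, which is one reason some smoothing of the weights across frequency, in addition to the smoothing across unit boundaries mentioned above, may be desirable.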
5. SYNTHESIS
[Figure] 5-band modification example for [i:]

CONCLUSIONS
• Described simple methods for predicting and synthesizing spectral balance
• But: spectral balance is only one “non-standard acoustic correlate”
• Others that remain to be addressed:
  • Spectral dynamics
  • Phase