On the Correlation between Energy and Pitch Accent in Read English Speech • Andrew Rosenberg, Julia Hirschberg • Columbia University • Interspeech 2006 • 9/14/06
Talk Outline • Introduction to Pitch Accent • Previous Work • Contribution and Approach • Corpus • Results and Discussion • Conclusion • Future Work
Introduction • Pitch accent is the way a word is made to “stand out” from its surrounding utterance. • This contrasts with lexical stress, which refers to the most prominent syllable within a word. • Accurate detection of pitch accent is particularly important to many natural language understanding (NLU) tasks: • Identification of salient or “important” words. • Indication of information status. • Disambiguation of syntax/semantics. • Pitch (f0), duration, and energy are all known correlates of pitch accent. • [Figure: deaccented vs. accented]
Previous Work • (Sluijter and van Heuven 96,97): Accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband above 500 Hz. • (Heldner, et al. 99,01) and (Fant, et al. 00) found that high-frequency emphasis or spectral tilt strongly correlates with accent in Swedish. • Considerable research attention has been given to the automatic identification of prominent or accented words: • (Tamburini 03,05) used the energy component of the 500 Hz-2000 Hz band. • (Tepperman 05) used the RMS energy from the 60 Hz-400 Hz band. • And many more...
Contribution and Approach • There is no agreement on the best (most discriminative) frequency subband from which to extract energy information. • We set up a battery of analysis-by-classification experiments varying: • The frequency band: • Lower bound frequency ranged from 0 to 19 bark. • Bandwidth ranged from 1 to 20 bark. • Upper bound was capped at 20 bark by the 8 kHz Nyquist frequency. • We also analyzed the first and/or second formants. • The region of analysis: • Full word, only vowels, longest syllable, longest vowel. • Speaker: • Each of 4 speakers separately, and all together. • We performed the experiments using J48, a Java implementation of C4.5.
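As a rough illustration of the band grid described on this slide, the sketch below enumerates every (lower bound, upper bound) pair with a lower bound of 0 to 19 bark, a bandwidth of 1 to 20 bark, and an upper bound of at most 20 bark. The function names are hypothetical, and the Hz/bark conversion uses Traunmüller's approximation as an assumption; the slides do not say which bark formula was used.

```python
# Sketch only: enumerate candidate subbands on a bark grid.
# Assumption: Traunmueller's approximation z = 26.81 * f / (1960 + f) - 0.53;
# the slides do not state which Hz <-> bark conversion the authors used.


def hz_to_bark(f_hz: float) -> float:
    """Convert a frequency in Hz to bark (assumed formula)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def enumerate_bands(max_bark: int = 20):
    """Return (lower, upper) bark pairs with upper = lower + bandwidth <= max_bark."""
    bands = []
    for lower in range(0, max_bark):              # lower bound: 0..19 bark
        for bandwidth in range(1, max_bark + 1):  # bandwidth: 1..20 bark
            upper = lower + bandwidth
            if upper <= max_bark:                 # cap imposed by the Nyquist limit
                bands.append((lower, upper))
    return bands


if __name__ == "__main__":
    bands = enumerate_bands()
    print(len(bands), "candidate bands")
    lo, hi = 2, 20
    print(f"{lo}-{hi} bark is roughly {bark_to_hz(lo):.0f}-{bark_to_hz(hi):.0f} Hz")
```

Under this reading of the constraints, the grid is triangular: a band starting at 19 bark can only be 1 bark wide, while a band starting at 0 bark can take any of the 20 widths.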
Contribution and Approach (continued) • Local Features: • Minimum, maximum, mean, standard deviation, and RMS of energy. • z-score ((x − mean) / std. dev.) of the maximum energy within the word. • Context-based Features: • Using 6 windows, the max and mean energy were normalized by: • z-score ((x − mean) / std. dev.), and • the energy range within the window (x / (max − min)). • [Figure: context windows over word i-2, word i-1, word i, word i+1, word i+2]
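The local features and the two normalizations listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the input is assumed to be a list of per-frame energy values for one word, and the population standard deviation is an assumed choice.

```python
import math
import statistics


def z_score(x, values):
    """(x - mean) / std. dev. over `values`; returns 0.0 if the values are constant."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return (x - mean) / std if std > 0 else 0.0


def range_normalize(x, values):
    """x / (max - min) over `values`; returns 0.0 if the range is zero."""
    span = max(values) - min(values)
    return x / span if span > 0 else 0.0


def local_features(energy):
    """Local energy features for one word, given its per-frame energy values."""
    return {
        "min": min(energy),
        "max": max(energy),
        "mean": statistics.fmean(energy),
        "std": statistics.pstdev(energy),
        "rms": math.sqrt(sum(e * e for e in energy) / len(energy)),
        # z-score of the maximum energy relative to the frames of the same word
        "max_z": z_score(max(energy), energy),
    }


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0]  # made-up energy values for illustration
    print(local_features(frames))
    print(range_normalize(max(frames), frames))
```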
Corpus • Boston Directions Corpus (BDC) (Hirschberg and Nakatani 96) • Speech elicited from a direction-giving task. • Only the read portion was used. • 50 minutes • Fully ToBI-labeled • 10825 words • Manually segmented • 4 speakers: 3 male, 1 female
Variation across subbands • Energy from different frequency regions predicts pitch accent differently. • Across experiment configurations, the mean relative improvement of the best region over the worst was 14.8%.
The most predictive subband • The single most predictive subband for all speakers was 3-18 bark over full words. • Classification accuracy: 76% (P=71.6, R=73.4) • Majority-class baseline (no accent): 57.6% • However, this subband performs significantly worse than the best subband when analyzing the speech of one speaker in particular: • Speaker h2, who is not the female speaker.
The most robust subband • The subband from 2-20 bark performs as well as the most discriminative subband in all but one configuration [h1, longest vowel]. • Accuracy: 75.5% (P=70.5, R=72.5) • Due to its robustness, we consider this band the “best”. • The formant-based energy features perform worse than fixed bands. • 6.4% mean accuracy reduction from 2-20 bark. • Attributable to: • Errors in the formant tracking algorithm. • The presence of discriminative information in higher formants.
Contextual windows • The most predictive features were z-score normalized maximum energy relative to three contextual windows: • 1 previous and 1 following word • 2 previous and 1 following word • 2 previous and 2 following words • [Figure: contextual windows over word i-2, word i-1, word i, word i+1, word i+2]
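The three windows named on this slide can be sketched as below: the maximum energy of word i is z-score normalized against the maximum energies of the words in each window. The function name is hypothetical, and two details are assumptions the slides do not settle: the window includes the current word, and it is truncated at the edges of the word sequence.

```python
import statistics

# (previous words, following words) for the three most predictive windows
WINDOWS = ((1, 1), (2, 1), (2, 2))


def window_z_scores(word_max_energy, i, windows=WINDOWS):
    """z-score of word i's maximum energy relative to each contextual window.

    Assumptions: the window includes word i itself and is clipped at the
    start and end of the sequence.
    """
    feats = {}
    x = word_max_energy[i]
    for prev, nxt in windows:
        lo = max(0, i - prev)
        hi = min(len(word_max_energy), i + nxt + 1)
        context = word_max_energy[lo:hi]
        mean = statistics.fmean(context)
        std = statistics.pstdev(context)
        feats[(prev, nxt)] = (x - mean) / std if std > 0 else 0.0
    return feats


if __name__ == "__main__":
    # Made-up per-word maximum energies; the middle word is the loudest.
    energies = [1.0, 2.0, 6.0, 2.0, 1.0]
    print(window_z_scores(energies, 2))
```

A word that is louder than its neighbors gets a positive score in every window, which is the kind of relative prominence these features are meant to capture.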
Combining predictions • There is a relatively small intersection of correct predictions, even among similar subbands. • 10823 of 10825 words were correctly classified by at least one classifier. • Using a majority voting scheme: • Accuracy: 81.9% (P=76.7, R=82.5)
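A majority voting scheme over per-subband classifiers, together with the "correct by at least one classifier" count, might look like the sketch below. The function names and the toy labels are hypothetical, and tie-breaking (first label seen wins) is an assumption; the slides do not describe how the authors handled ties.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by majority vote.

    Assumption: ties go to the label seen first among the classifiers.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def oracle_coverage(predictions, gold):
    """Number of items that at least one classifier labels correctly."""
    return sum(
        any(pred[i] == label for pred in predictions)
        for i, label in enumerate(gold)
    )


if __name__ == "__main__":
    # Toy example: 3 classifiers, 4 words, 1 = accented, 0 = not accented.
    preds = [
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 0],
    ]
    gold = [1, 1, 1, 1]
    print(majority_vote(preds))
    print(oracle_coverage(preds, gold))
```

The gap between the oracle count and the voted accuracy shows how much complementary information the individual classifiers carry that simple voting does not recover.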
Region of analysis • How do the regioning strategies perform? Full Word > Only Vowels > Longest Syllable ~ Longest Vowel • Why does analysis of the full word outperform other regioning strategies? • Syllable/Vowel segmentation algorithms are imperfect • Pitch accents are not neatly placed • Duration is a crude measure of lexical stress
Conclusion • Using an analysis-by-classification approach, we showed that: • Energy from different frequency bands correlates with pitch accent differently. • The “best” (high accuracy, most robust) frequency region is 2-20 bark (>2 bark?). • A voting classifier based exclusively on energy can predict accent reliably.
Future Work • Can we automatically identify which bands will predict accent best for a given word? • We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features. • We plan on repeating these experiments on spontaneous speech data.
Thank you {amaxwell, julia}@cs.columbia.edu