On the Correlation between Energy and Pitch Accent in Read English Speech

On the Correlation between Energy and Pitch Accent in Read English Speech. Andrew Rosenberg, Julia Hirschberg Columbia University Interspeech 2006 9/14/06. Talk Outline. Introduction to Pitch Accent Previous Work Contribution and Approach Corpus Results and Discussion Conclusion

Presentation Transcript


  1. On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg, Julia Hirschberg Columbia University Interspeech 2006 9/14/06

  2. Talk Outline • Introduction to Pitch Accent • Previous Work • Contribution and Approach • Corpus • Results and Discussion • Conclusion • Future Work

  3. Introduction • Pitch Accent is the way a word is made to “stand out” from its surrounding utterance. • As opposed to lexical stress, which refers to the most prominent syllable within a word. • Accurate detection of pitch accent is particularly important to many NLU tasks. • Identification of salient or “important” words. • Indication of Information Status. • Disambiguation of Syntax/Semantics. • Pitch (f0), Duration, and Energy are all known correlates of Pitch Accent. [Figure: deaccented vs. accented example]

  4. Previous Work • (Sluijter and van Heuven 96,97): Accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500Hz. • (Heldner, et al. 99,01) and (Fant, et al. 00) found that high-frequency emphasis or spectral tilt strongly correlates with accent in Swedish. • A lot of research attention has been given to the automatic identification of prominent or accented words. • (Tamburini 03,05) used the energy component of the 500Hz-2000Hz band. • (Tepperman 05) used the RMS energy from the 60Hz-400Hz band. • And many more...

  5. Contribution and Approach • There is no agreement as to the best (most discriminative) frequency subband from which to extract energy information. • We set up a battery of analysis-by-classification experiments varying: • The frequency band: • lower bound frequency ranged from 0 to 19 bark • bandwidth ranged from 1 to 20 bark • upper bound was capped at 20 bark by the 8 kHz Nyquist frequency • We also analyzed the first and/or second formants. • The region of analysis: • Full word, only vowels, longest syllable, longest vowel • Speaker: • Each of 4 speakers separately, and all together. • We performed the experiments using J48, a Java implementation of C4.5.
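
The subband grid described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not say how many bands were actually run, nor which bark-to-Hz mapping was used, so the full enumeration and the Traunmüller-style conversion in `bark_to_hz` are both assumptions.

```python
def bark_to_hz(z):
    """Approximate bark-to-Hz conversion.

    Assumption: inverse of Traunmueller's formula z = 26.81*f/(1960+f) - 0.53.
    The slides do not state which bark scale was used.
    """
    return 1960.0 * (z + 0.53) / (26.28 - z)


def subband_grid(max_bark=20):
    """Enumerate (lower, upper) band edges in bark.

    Lower bound runs 0..19, bandwidth runs 1..20, and any band whose
    upper edge would exceed max_bark is skipped.
    """
    bands = []
    for lower in range(0, max_bark):
        for width in range(1, max_bark + 1):
            upper = lower + width
            if upper <= max_bark:
                bands.append((lower, upper))
    return bands


bands = subband_grid()
print(len(bands), "candidate subbands")
print("2-20 bark is roughly %.0f-%.0f Hz" % (bark_to_hz(2), bark_to_hz(20)))
```

Under this conversion the 20 bark upper edge falls below 8 kHz, which is consistent with the cap stated on the slide.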

  6. Contribution and Approach [Figure: context windows over word i-2, word i-1, word i, word i+1, word i+2] • Local Features: • minimum, maximum, mean, standard deviation and RMS of energy • z-score ((x - mean) / std.dev) of max energy within the word • Context-based Features: • Using 6 windows: • The max and mean energy were normalized by • z-score ((x - mean) / std.dev) and • the energy range within the window (x / (max - min))
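
The local features and the two normalizations listed on this slide could be computed roughly as below. This is a sketch under assumptions: the input is a hypothetical list of frame-level energy values for one word, and the population standard deviation is used because the slides do not say which estimator was applied.

```python
import math
import statistics


def local_energy_features(energy):
    """Summary statistics over one word's frame-level energy values."""
    mean = statistics.fmean(energy)
    std = statistics.pstdev(energy)  # assumption: population std. dev.
    peak = max(energy)
    return {
        "min": min(energy),
        "max": peak,
        "mean": mean,
        "std": std,
        "rms": math.sqrt(sum(e * e for e in energy) / len(energy)),
        # z-score of the maximum energy within the word
        "z_max": (peak - mean) / std if std > 0 else 0.0,
    }


def range_normalize(x, window):
    """Normalize x by the energy range of a window: x / (max - min)."""
    span = max(window) - min(window)
    return x / span if span > 0 else 0.0


features = local_energy_features([1.0, 2.0, 3.0, 4.0])
print(features)
print(range_normalize(4.0, [1.0, 2.0, 3.0, 4.0, 5.0]))
```

Returning 0.0 when the standard deviation or range is zero is a convenience choice for flat input, not something the slides specify.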

  7. Corpus • Boston Directions Corpus (BDC) [Hirschberg&Nakatani96] • Speech elicited from a direction-giving task. • Used only the read portion. • 50 minutes • Fully ToBI labeled • 10825 words • Manually segmented • 4 Speakers: 3 male, 1 female

  8. Variation across subbands • Energy from different frequency regions predicts pitch accent differently. • Across experiment configurations, the mean relative improvement of the best region over the worst was 14.8%.
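
One plausible reading of "mean relative improvement of best region over worst" is sketched here. The definition (best - worst) / worst, averaged over configurations, is an assumption, and the accuracies in `example` are made up for illustration; they are not results from the talk.

```python
import statistics


def relative_improvement(best, worst):
    """Relative gain of the best accuracy over the worst (assumed definition)."""
    return (best - worst) / worst


def mean_relative_improvement(accuracies_by_config):
    """Average the best-over-worst gain across experiment configurations.

    accuracies_by_config maps a configuration name to the list of
    accuracies obtained with each subband in that configuration.
    """
    gains = [
        relative_improvement(max(accs), min(accs))
        for accs in accuracies_by_config.values()
    ]
    return statistics.fmean(gains)


# Hypothetical accuracies, for illustration only.
example = {
    "speaker_a/full_word": [0.60, 0.66, 0.75],
    "speaker_b/longest_vowel": [0.50, 0.55, 0.60],
}
print(mean_relative_improvement(example))
```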

  9. The most predictive subband • The single most predictive subband for all speakers was 3-18 bark over full words • Classification Accuracy: 76% (P=71.6, R=73.4) • 57.6% majority class baseline (no accent) • However, it performs significantly worse than the best-performing subband when analyzing the speech of one speaker in particular. • Speaker h2, who is not the female speaker
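
For reference, the accuracy, precision, recall and majority-class baseline quoted on this slide can be computed as in the sketch below. It assumes that precision and recall refer to the accented class, which the slide does not state; the label strings and the toy data are hypothetical.

```python
from collections import Counter


def evaluate(gold, pred, positive="accent"):
    """Accuracy plus precision/recall for the positive (accented) class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    return {
        "accuracy": sum(g == p for g, p in pairs) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def majority_baseline(gold):
    """Accuracy of always predicting the most frequent label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


# Toy labels, for illustration only.
gold = ["accent", "none", "none", "accent", "none"]
pred = ["accent", "accent", "none", "none", "none"]
scores = evaluate(gold, pred)
print(scores, majority_baseline(gold))
```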

  10. The most robust subband • The subband from 2-20 bark performs as well as the most discriminative subband in all but one configuration [h1-longest vowel] • Accuracy: 75.5% (P=70.5, R=72.5) • Due to its robustness we consider this band the “best” • The formant-based energy features perform worse than fixed bands • 6.4% mean accuracy reduction from 2-20 bark • Attributable to: • Errors in the formant tracking algorithm • The presence of discriminative information in higher formants

  11. Contextual windows [Figure: word i-2, word i-1, word i, word i+1, word i+2] • Most predictive features were z-score normalized maximum energy relative to three contextual windows • 1 previous and 1 following word • 2 previous and 1 following word • 2 previous and 2 following words
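
The three contextual windows named on this slide might be applied as follows. This is a sketch under assumptions: `words` is a hypothetical list of per-word frame-energy lists, the window statistics are taken over all frames of the words in the window (the slides do not say whether frames or per-word values were used), and windows are simply truncated at utterance edges.

```python
import statistics

# (words before, words after) for the three most predictive windows
WINDOWS = [(1, 1), (2, 1), (2, 2)]


def context_z_scores(words, i, windows=WINDOWS):
    """z-score of word i's maximum energy relative to each context window."""
    peak = max(words[i])
    scores = {}
    for before, after in windows:
        lo = max(0, i - before)               # truncate at utterance start
        hi = min(len(words), i + after + 1)   # truncate at utterance end
        frames = [e for word in words[lo:hi] for e in word]
        mean = statistics.fmean(frames)
        std = statistics.pstdev(frames)       # assumption: population std. dev.
        scores[(before, after)] = (peak - mean) / std if std > 0 else 0.0
    return scores


# Toy energy values, for illustration only.
words = [[1.0, 1.0], [2.0, 2.0], [5.0, 9.0], [2.0, 2.0], [1.0, 1.0]]
scores = context_z_scores(words, 2)
print(scores)
```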

  12. Combining predictions • There is a relatively small intersection of correct predictions even among similar subbands. • 10823 of 10825 words were correctly classified by at least one classifier. • Using a majority voting scheme: • Accuracy: 81.9% (P=76.7, R=82.5)
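
A simple form of the majority voting described here, together with the "correct by at least one classifier" count, can be sketched as below. The slides do not give the details of the voting scheme (for example tie handling or weighting), so this unweighted vote is an assumption, and the predictions are toy data.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per word.

    predictions is a list of equal-length label lists, one per classifier.
    With an even number of classifiers ties are possible; here they resolve
    to whichever tied label was seen first.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def covered_by_any(gold, predictions):
    """Count words that at least one classifier labels correctly."""
    return sum(
        any(p == g for p in labels)
        for g, labels in zip(gold, zip(*predictions))
    )


# Toy predictions from three hypothetical subband classifiers (1 = accented).
preds = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
gold = [1, 1, 0]
voted = majority_vote(preds)
print(voted, covered_by_any(gold, preds))
```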

  13. Region of analysis • How do the regioning strategies perform? Full Word > Only Vowels > Longest Syllable ~ Longest Vowel • Why does analysis of the full word outperform other regioning strategies? • Syllable/Vowel segmentation algorithms are imperfect • Pitch accents are not neatly placed • Duration is a crude measure of lexical stress

  14. Conclusion • Using an analysis-by-classification approach we showed that: • Energy from different frequency bands correlates with pitch accent differently. • The “best” (high accuracy, most robust) frequency region is 2-20 bark (>2 bark?). • A voting classifier based exclusively on energy can predict accent reliably.

  15. Future Work • Can we automatically identify which bands will predict accent best for a given word? • We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features. • We plan on repeating these experiments on spontaneous speech data.

  16. Thank you {amaxwell, julia}@cs.columbia.edu
