On the Correlation between Energy and Pitch Accent in Read English Speech

Andrew Rosenberg. Weekly Speech Lab Talk, 6/27/06.


  1. On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg Weekly Speech Lab Talk 6/27/06

  2. Talk Outline • Introduction to Pitch Accent • Previous Work • Contribution and Approach • Corpus • Results and Discussion • Conclusion • Future Work

  3. Introduction • Pitch Accent is the way a word is made to “stand out” from its surrounding utterance. • As opposed to lexical stress, which refers to the most prominent syllable within a word. • Accurate detection of pitch accent is particularly important to many NLU tasks. • Identification of “important” words. • Indication of Discourse Status and Structure. • Disambiguation of Syntax/Semantics. • Pitch (f0), Duration, and Energy are all known correlates of Pitch Accent

  4. Previous Work • Sluijter and van Heuven 96, 97 showed that accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500Hz. • Heldner 99, 01 and Fant, et al. 00 found that energy in a particular spectral region indicated accent in Swedish. • A lot of research attention has been given to the automatic identification of prominent or accented words. • Tamburini 03, 05 used the energy components of the 500Hz-2000Hz band. • Tepperman 05 used the RMS energy from the 60Hz-400Hz band. • Far too many others to mention here.

  5. Contribution and Approach • There is no agreement as to the best -- most discriminative -- frequency subband from which to extract energy information. • We set up a battery of analysis-by-classification experiments varying: • The frequency band: • lower bound frequency ranged from 0 to 19 bark • bandwidth ranged from 1 to 20 bark • upper bound was capped at 20 bark by the 8 kHz Nyquist frequency • Also analyzed the first and/or second formants. • The region of analysis: • Full word, only syllable nuclei, longest syllable, longest syllable nuclei • Speaker: • Each of 4 speakers separately, and all together. • We performed the classification using J48 -- a Java implementation of C4.5.
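The subband grid on this slide can be sketched as follows. This is a minimal illustration, not the talk's code: it assumes integer bark steps, assumes every lower-bound/bandwidth pair whose upper edge stays at or below 20 bark was tried, and uses Traunmüller's (1990) bark approximation, since the slides do not say which bark formula was used. The function names are hypothetical.

```python
def hz_to_bark(f_hz):
    """Traunmueller (1990) approximation of the bark scale (an assumption here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark, valid for z < 26.28."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def enumerate_subbands(max_bark=20):
    """List (lower, upper) band edges in bark.

    Lower bound runs from 0 to max_bark - 1, bandwidth from 1 upward,
    and any band whose upper edge would exceed max_bark is skipped.
    """
    bands = []
    for lower in range(0, max_bark):
        for width in range(1, max_bark - lower + 1):
            bands.append((lower, lower + width))
    return bands


bands = enumerate_subbands()
print(len(bands), "candidate subbands")

# Convert one band's edges to Hz, e.g. the 2-20 bark band discussed later.
lower_hz, upper_hz = bark_to_hz(2), bark_to_hz(20)
print(f"2-20 bark is roughly {lower_hz:.0f}-{upper_hz:.0f} Hz under this formula")
```

Under these assumptions the grid is triangular: a lower bound of 0 bark allows twenty bandwidths, while a lower bound of 19 bark allows only one.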

  6. Contribution and Approach • Local Features: • minimum, maximum, mean, standard deviation and RMS of energy • z-score of max energy within the word • mean slope • energy contour classification {rising, falling, peak, valley} • Context-based Features: • Use 6 contexts: (# previous words, # following words) • (2,2) (1,1) (1,0) (2,0) (0,1) (2,1) • (max_word - mean_region) / stddev_region • (mean_word - mean_region) / stddev_region • (max_word - max_region) / stddev_region • max_word / (max_region - min_region) • mean_word / (max_region - min_region)
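The local statistics and the five context-normalized features listed above might be computed along these lines. This is a sketch under stated assumptions: each word is a list of per-frame energy values, the context region includes the target word itself, the standard deviation is the population form, and a zero denominator yields 0.0. None of these choices is specified on the slide, and the function and key names are hypothetical.

```python
import math
import statistics

# (number of previous words, number of following words), as on the slide
CONTEXTS = [(2, 2), (1, 1), (1, 0), (2, 0), (0, 1), (2, 1)]


def local_features(energy):
    """Basic statistics over one word's per-frame energy values."""
    return {
        "min": min(energy),
        "max": max(energy),
        "mean": statistics.fmean(energy),
        "std": statistics.pstdev(energy),
        "rms": math.sqrt(statistics.fmean([e * e for e in energy])),
    }


def _safe_div(numerator, denominator):
    # Assumption: a flat region (zero spread) contributes a neutral 0.0.
    return numerator / denominator if denominator else 0.0


def context_features(words, i, n_prev, n_next):
    """Normalize word i's energy against a window of neighbouring words."""
    start = max(0, i - n_prev)
    end = min(len(words), i + n_next + 1)
    region = [e for w in words[start:end] for e in w]
    word = words[i]

    r_mean = statistics.fmean(region)
    r_std = statistics.pstdev(region)
    r_range = max(region) - min(region)
    w_max, w_mean = max(word), statistics.fmean(word)

    return {
        "max_minus_mean_over_std": _safe_div(w_max - r_mean, r_std),
        "mean_minus_mean_over_std": _safe_div(w_mean - r_mean, r_std),
        "max_minus_max_over_std": _safe_div(w_max - max(region), r_std),
        "max_over_range": _safe_div(w_max, r_range),
        "mean_over_range": _safe_div(w_mean, r_range),
    }


# Toy energies for three words (made-up numbers, not from the corpus).
toy_words = [[1.0, 2.0], [3.0, 5.0], [2.0, 2.0]]
feats = context_features(toy_words, 1, 1, 1)
all_feats = {ctx: context_features(toy_words, 1, *ctx) for ctx in CONTEXTS}
```

Windows are clipped at utterance boundaries here, so a word at the start of an utterance simply has fewer previous words in its region.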

  7. Corpus • Boston Directions Corpus (BDC) [Hirschberg&Nakatani96] • Speech elicited from a direction-giving task. • Used only the read portion. • 50 minutes • Fully ToBI labeled • 10825 words • Manually segmented • 4 Speakers: 3 male, 1 female

  8. Results and Discussion • Energy from different frequency regions predicts pitch accent differently • mean relative improvement of best region over worst: 14.8%
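The slide does not define "mean relative improvement". One plausible reading, stated here as an assumption, is the accuracy gain of the best subband over the worst, relative to the worst, averaged over the experiments:

```latex
\text{mean relative improvement}
  = \frac{1}{N} \sum_{i=1}^{N}
    \frac{a_i^{\text{best}} - a_i^{\text{worst}}}{a_i^{\text{worst}}}
```

Here $a_i^{\text{best}}$ and $a_i^{\text{worst}}$ are hypothetical symbols for the accuracies of the best and worst subbands in experiment $i$, and $N$ is the number of experiments (region and speaker combinations).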

  9. Results and Discussion • Our experiments did not confirm previously reported results. • The single most predictive subband for all speakers was 3-18 bark over full words • Classification Accuracy: 76% (42.4% baseline) • p=71.6, r=73.4 • However, it performs significantly worse than the best subband when analyzing a single speaker • not the female speaker

  10. Results and Discussion • The subband from 2-20 bark performs significantly worse than the most predictive in only a single experiment (h1nucl) • Accuracy: 75.5% (p=70.5, r=72.5) • Due to its robustness we consider this band the “best” • The formant-based energy features tend to perform worse • 6.4% mean accuracy reduction from 2-20 bark • Attributable to: • Errors in the formant tracking algorithm • The presence of discriminative information in higher formants

  11. Results and Discussion • Most predictive features were normalized maximum energy relative to the mean and standard deviation of three contextual regions • 1 previous and 1 following word • 2 previous and 1 following word • 2 previous and 2 following words

  12. Results and Discussion • There is a relatively small intersection of correct predictions even among similar subbands. • 10823 of 10825 words were correctly classified by at least one classifier. • Using a majority voting scheme: • Accuracy: 81.9% (p=76.7, r=82.5)
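The voting scheme and the "correct by at least one classifier" count could be sketched like this. It is an illustration under assumptions, not the talk's implementation: each subband classifier contributes one boolean prediction per word (True = accented), a word is labeled accented only on a strict majority, and ties fall to unaccented. The toy predictions are made up and unrelated to the reported figures.

```python
def majority_vote(predictions):
    """Combine per-classifier boolean predictions word by word.

    predictions: list of equal-length lists, one per classifier.
    Assumption: ties are resolved as unaccented (False).
    """
    n_classifiers = len(predictions)
    n_words = len(predictions[0])
    voted = []
    for i in range(n_words):
        votes = sum(1 for p in predictions if p[i])
        voted.append(votes * 2 > n_classifiers)
    return voted


def oracle_coverage(predictions, gold):
    """Count words that at least one classifier labels correctly."""
    return sum(
        1 for i, g in enumerate(gold) if any(p[i] == g for p in predictions)
    )


def accuracy_precision_recall(predicted, gold):
    """Accuracy, plus precision and recall for the accented class."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return correct / len(gold), precision, recall


# Toy example: three classifiers, four words.
toy_predictions = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
toy_gold = [True, False, True, False]

voted = majority_vote(toy_predictions)
acc, prec, rec = accuracy_precision_recall(voted, toy_gold)
covered = oracle_coverage(toy_predictions, toy_gold)
```

The gap between oracle coverage and voted accuracy is what motivates the later question of predicting which band suits a given word.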

  13. Results and Discussion • How do the regioning strategies perform? Full Word > All Nuclei > Longest Syllable ~ Longest Nuclei • Why does analysis of the full word outperform other regioning strategies? • Duration is a crude measure of lexical stress • Syllable/nuclei segmentation algorithms are imperfect • Pitch accents are not neatly placed • More data has the ability to highlight distinctions more easily

  14. Conclusion • Using an analysis-by-classification approach we showed: • Energy from different frequency bands correlates with pitch accent differently. • The “best” (highest accuracy, most robust) frequency region is 2-20 bark (>2 bark?) • A voting classifier based exclusively on energy can predict accent reliably.

  15. Future Work • Can we predict which bands will predict accent best for a given word? • We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features. • We plan on repeating these experiments on spontaneous speech data.