On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg Weekly Speech Lab Talk 6/27/06
Talk Outline • Introduction to Pitch Accent • Previous Work • Contribution and Approach • Corpus • Results and Discussion • Conclusion • Future Work
Introduction • Pitch Accent is the way a word is made to “stand out” from its surrounding utterance. • As opposed to lexical stress, which refers to the most prominent syllable within a word. • Accurate detection of pitch accent is particularly important to many NLU tasks. • Identification of “important” words. • Indication of Discourse Status and Structure. • Disambiguation of Syntax/Semantics. • Pitch (f0), Duration, and Energy are all known correlates of Pitch Accent.
Previous Work • Sluijter and van Heuven 96, 97 showed that accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500 Hz. • Heldner 99, 01 and Fant et al. 00 found that energy in a particular spectral region indicated accent in Swedish. • A lot of research attention has been given to the automatic identification of prominent or accented words. • Tamburini 03, 05 used the energy components of the 500 Hz-2000 Hz band. • Tepperman 05 used the RMS energy from the 60 Hz-400 Hz band. • Far too many others to mention here.
Contribution and Approach • There is no agreement as to the best (most discriminative) frequency subband from which to extract energy information. • We set up a battery of analysis-by-classification experiments varying: • The frequency band: • lower bound frequency ranged from 0 to 19 bark • bandwidth ranged from 1 to 20 bark • upper bound was limited to 20 bark by the 8 kHz Nyquist rate • We also analyzed the first and/or second formants. • The region of analysis: • Full word, only syllable nuclei, longest syllable, longest syllable nuclei • Speaker: • Each of 4 speakers separately, and all together. • We performed the classification using J48, a Java implementation of C4.5.
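A minimal sketch of the band grid and the subband energy measurement described above, in plain Python. The Bark-to-Hz conversion uses Traunmüller's approximation, the 16 kHz sampling rate is inferred from the 8 kHz Nyquist limit, and all function names are hypothetical. The slides do not say which Bark approximation or spectral estimator was actually used.

```python
import cmath
import math

SAMPLE_RATE = 16000  # assumption: implied by the 8 kHz Nyquist limit


def bark_to_hz(z):
    """Invert Traunmueller's approximation z = 26.81 f / (1960 + f) - 0.53."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def enumerate_bands(max_bark=20):
    """All (lower, upper) pairs with lower in 0..19, width >= 1, upper <= 20."""
    return [(lo, lo + width)
            for lo in range(max_bark)
            for width in range(1, max_bark - lo + 1)]


def band_energy(frame, sample_rate, lo_hz, hi_hz):
    """Sum of squared DFT magnitudes over bins with lo_hz <= f < hi_hz.

    Uses a naive DFT, which is adequate for short illustrative frames.
    """
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if lo_hz <= freq < hi_hz:
            coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t, x in enumerate(frame))
            total += abs(coeff) ** 2
    return total


bands = enumerate_bands()

# One 64-sample frame of a 1 kHz test tone, measured in the 2-20 bark band.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(64)]
in_band = band_energy(tone, SAMPLE_RATE, bark_to_hz(2), bark_to_hz(20))
```

Under this reading of the grid there are 210 candidate bands, including the 3-18 bark and 2-20 bark bands that appear in the results.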
Contribution and Approach • Local Features: • minimum, maximum, mean, standard deviation and RMS of energy • z-score of max energy within the word • mean slope • energy contour classification {rising, falling, peak, valley} • Context-based Features: • Use 6 contexts: (# previous words, # following words) • (2,2) (1,1) (1,0) (2,0) (0,1) (2,1) • (max_word - mean_region) / stddev_region • (mean_word - mean_region) / stddev_region • (max_word - max_region) / stddev_region • max_word / (max_region - min_region) • mean_word / (max_region - min_region)
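The local statistics and the five context-normalized features listed above can be sketched as follows. This is an illustration under stated assumptions, not the original feature extractor: the region is taken to include the target word, windows are truncated at utterance edges, the population standard deviation is used, zero denominators map to 0.0, and the dictionary keys are hypothetical names.

```python
import math
from statistics import mean, pstdev

# (# previous words, # following words), as listed on the slide
CONTEXTS = [(2, 2), (1, 1), (1, 0), (2, 0), (0, 1), (2, 1)]


def local_features(frames):
    """Summary statistics over the per-frame energy values of one word."""
    return {
        "min": min(frames),
        "max": max(frames),
        "mean": mean(frames),
        "std": pstdev(frames),
        "rms": math.sqrt(mean(x * x for x in frames)),
    }


def _ratio(num, den):
    """Guard against flat regions, where the std. dev. or range is zero."""
    return num / den if den else 0.0


def context_features(words, i, n_prev, n_next):
    """Normalize word i against n_prev previous and n_next following words.

    Assumption: the region includes the target word itself.
    """
    lo = max(0, i - n_prev)
    hi = min(len(words), i + n_next + 1)
    region = [x for word in words[lo:hi] for x in word]
    w_max, w_mean = max(words[i]), mean(words[i])
    r_mean, r_std = mean(region), pstdev(region)
    r_max, r_range = max(region), max(region) - min(region)
    return {
        "max_z": _ratio(w_max - r_mean, r_std),
        "mean_z": _ratio(w_mean - r_mean, r_std),
        "max_vs_region_max": _ratio(w_max - r_max, r_std),
        "max_over_range": _ratio(w_max, r_range),
        "mean_over_range": _ratio(w_mean, r_range),
    }


# Toy utterance: three words, each a short list of frame energies.
words = [[1.0, 2.0], [2.0, 6.0], [1.0, 3.0]]
features = {ctx: context_features(words, 1, *ctx) for ctx in CONTEXTS}
```

Each word yields one feature dictionary per context, so the six contexts give 30 context-based values in addition to the local statistics.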
Corpus • Boston Directions Corpus (BDC) [Hirschberg&Nakatani96] • Speech elicited from a direction-giving task. • Used only the read portion. • 50 minutes • Fully ToBI labeled • 10825 words • Manually segmented • 4 Speakers: 3 male, 1 female
Results and Discussion • Energy from different frequency regions predicts pitch accent differently • mean relative improvement of best region over worst: 14.8%
Results and Discussion • Our experiments did not confirm previously reported results. • The single most predictive subband for all speakers was 3-18 bark over full words • Classification Accuracy: 76% (42.4% baseline) • p=71.6, r=73.4 • However, it performs significantly worse than the best subband in the analysis of one individual speaker • not the female speaker
Results and Discussion • The subband from 2-20 bark performs significantly worse than the most predictive in only a single experiment (h1nucl) • Accuracy: 75.5% (p=70.5, r=72.5) • Due to its robustness we consider this band the “best” • The formant-based energy features tend to perform worse • 6.4% mean accuracy reduction from 2-20 bark • Attributable to: • Errors in the formant tracking algorithm • The presence of discriminative information in higher formants
Results and Discussion • Most predictive features were normalized maximum energy relative to the mean and standard deviation of three contextual regions • 1 previous and 1 following word • 2 previous and 1 following word • 2 previous and 2 following words
Results and Discussion • There is a relatively small intersection of correct predictions even among similar subbands. • 10823 of 10825 words were correctly classified by at least one classifier. • Using a majority voting scheme: • Accuracy: 81.9% (p=76.7, r=82.5)
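A minimal sketch of majority voting over per-subband classifier outputs, together with the accuracy, precision, recall, and "correct by at least one classifier" counts reported above. The tie-breaking rule (ties count as unaccented), the function names, and the toy data are assumptions; the slides do not describe how votes were combined beyond "majority".

```python
def majority_vote(predictions):
    """Combine per-classifier label lists (True = accented) word by word.

    Assumption: a word is accented only if strictly more than half
    of the classifiers say so.
    """
    n = len(predictions)
    return [sum(votes) * 2 > n for votes in zip(*predictions)]


def score(predicted, gold):
    """Return (accuracy, precision, recall) for the accented class."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    correct = sum(1 for p, g in pairs if p == g)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def oracle_coverage(predictions, gold):
    """Count words that at least one classifier labels correctly."""
    return sum(1 for votes, g in zip(zip(*predictions), gold)
               if any(p == g for p in votes))


# Toy example: three classifiers, four words.
gold = [True, False, True, False]
predictions = [
    [True, False, False, False],
    [True, True, True, False],
    [False, False, True, True],
]
voted = majority_vote(predictions)
```

In this toy case each classifier makes at least one error, but the errors fall on different words, so the vote recovers every label. That is the situation a small intersection of correct predictions makes possible.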
Results and Discussion • How do the regioning strategies perform? Full Word > All Nuclei > Longest Syllable ~ Longest Nuclei • Why does analysis of the full word outperform other regioning strategies? • Duration is a crude measure of lexical stress • Syllable/nuclei segmentation algorithms are imperfect • Pitch accents are not neatly placed • More data has the ability to highlight distinctions more easily
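The four regioning strategies can be sketched as a function that returns the time spans over which energy is analyzed. The `Syllable` record, the strategy names, and the reading of "longest nuclei" as the nucleus of the longest syllable are assumptions made for illustration; the slides do not define the segmentation format.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    # Hypothetical segmentation record; all times in seconds.
    start: float
    end: float
    nucleus_start: float
    nucleus_end: float


def analysis_spans(word_start, word_end, syllables, strategy):
    """Return the (start, end) spans to analyze for one word."""
    if strategy == "full_word":
        return [(word_start, word_end)]
    if strategy == "all_nuclei":
        return [(s.nucleus_start, s.nucleus_end) for s in syllables]
    # Duration is used as a crude proxy for lexical stress.
    longest = max(syllables, key=lambda s: s.end - s.start)
    if strategy == "longest_syllable":
        return [(longest.start, longest.end)]
    if strategy == "longest_nucleus":
        # Assumption: the nucleus of the longest syllable.
        return [(longest.nucleus_start, longest.nucleus_end)]
    raise ValueError(f"unknown strategy: {strategy}")


syllables = [
    Syllable(0.0, 0.1, 0.02, 0.08),
    Syllable(0.1, 0.4, 0.15, 0.3),
]
spans = {name: analysis_spans(0.0, 0.4, syllables, name)
         for name in ("full_word", "all_nuclei",
                      "longest_syllable", "longest_nucleus")}
```

Only the full-word strategy is independent of syllable and nucleus boundaries, which is consistent with the observation that imperfect segmentation hurts the other three.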
Conclusion • Using an analysis-by-classification approach we showed that: • Energy from different frequency bands correlates with pitch accent differently. • The “best” (highest accuracy, most robust) frequency region is 2-20 bark (>2 bark?) • A voting classifier based exclusively on energy can predict accent reliably.
Future Work • Can we predict which bands will predict accent best for a given word? • We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features. • We plan on repeating these experiments on spontaneous speech data.