On the Correlation between Energy and Pitch Accent in Read English Speech

Andrew Rosenberg, Weekly Speech Lab Talk, 6/27/06

Talk Outline • Introduction to Pitch Accent • Previous Work • Contribution and Approach • Corpus • Results and Discussion • Conclusion • Future Work

Introduction • Pitch accent is the way a word is made to “stand out” from its surrounding utterance. • This is distinct from lexical stress, which refers to the most prominent syllable within a word. • Accurate detection of pitch accent is particularly important to many NLU tasks: • Identification of “important” words. • Indication of discourse status and structure. • Disambiguation of syntax/semantics. • Pitch (f0), duration, and energy are all known correlates of pitch accent.

Previous Work • Sluijter and van Heuven 96, 97 showed that accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500Hz. • Heldner 99, 01 and Fant et al. 00 found that energy in a particular spectral region indicated accent in Swedish. • A lot of research attention has been given to the automatic identification of prominent or accented words. • Tamburini 03, 05 used the energy components of the 500Hz-2000Hz band. • Tepperman 05 used the RMS energy from the 60Hz-400Hz band. • Far too many others to mention here.

Contribution and Approach • There is no agreement as to the best -- most discriminative -- frequency subband from which to extract energy information. • We set up a battery of analysis-by-classification experiments varying: • The frequency band: • lower bound frequency ranged from 0 to 19 bark • bandwidth ranged from 1 to 20 bark • upper bound was capped at 20 bark by the 8kHz Nyquist frequency • We also analyzed energy from the first and/or second formants. • The region of analysis: • Full word, only syllable nuclei, longest syllable, longest syllable nucleus • Speaker: • Each of 4 speakers separately, and all together. • We performed the classification using J48 -- a Java implementation of C4.5.
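The band constraints on this slide can be enumerated directly. The sketch below is one reading of them, not the code used in the talk: it assumes integer bark boundaries and that a band is valid only if lower bound + bandwidth does not exceed 20 bark. The Bark/Hz conversion uses Traunmüller's (1990) approximation, which is an assumption; the slides do not say which Bark formula was used. The function names are hypothetical.

```python
def hz_to_bark(f_hz):
    """Hz -> Bark using Traunmueller's (1990) approximation.
    ASSUMPTION: the slides do not state which Bark formula was used."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z):
    """Bark -> Hz, the algebraic inverse of hz_to_bark above."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


MAX_BARK = 20  # upper limit stated on the slide


def enumerate_subbands(max_bark=MAX_BARK):
    """Return every (lower, upper) band in whole barks with
    lower in 0..max_bark-1, width in 1..max_bark, and upper <= max_bark.
    ASSUMPTION: integer bark steps; the slide gives only the ranges."""
    bands = []
    for lower in range(0, max_bark):
        for width in range(1, max_bark + 1):
            upper = lower + width
            if upper <= max_bark:
                bands.append((lower, upper))
    return bands


bands = enumerate_subbands()
print(len(bands), "candidate subbands under this reading of the constraints")

# Approximate Hz edges of the 2-20 bark band discussed in the results.
print(round(bark_to_hz(2)), "Hz to", round(bark_to_hz(20)), "Hz")
```

Under these assumptions the grid contains every band later named in the results, including 3-18 bark and 2-20 bark.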

Contribution and Approach • Local Features: • minimum, maximum, mean, standard deviation and RMS of energy • z-score of max energy within the word • mean slope • energy contour classification {rising, falling, peak, valley} • Context-based Features: • Use 6 contexts: (# previous words, # following words) • (2,2) (1,1) (1,0) (2,0) (0,1) (2,1) • (max_word - mean_region) / stddev_region • (mean_word - mean_region) / stddev_region • (max_word - max_region) / stddev_region • max_word / (max_region - min_region) • mean_word / (max_region - min_region)
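The local statistics and the five context-normalized features listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the talk's implementation: each word is represented as a list of per-frame energy values, the contextual region includes the word itself, regions are truncated at the edges of the word sequence, and the population standard deviation is used. All function and key names are hypothetical.

```python
import math
import statistics

# The six (previous words, following words) contexts from the slide.
CONTEXTS = [(2, 2), (1, 1), (1, 0), (2, 0), (0, 1), (2, 1)]


def local_features(energy):
    """Min, max, mean, standard deviation and RMS of one word's energy frames."""
    return {
        "min": min(energy),
        "max": max(energy),
        "mean": statistics.fmean(energy),
        "std": statistics.pstdev(energy),  # ASSUMPTION: population std. dev.
        "rms": math.sqrt(sum(e * e for e in energy) / len(energy)),
    }


def _safe_div(num, den):
    """Return 0.0 when the denominator is zero (e.g. a flat region)."""
    return num / den if den != 0 else 0.0


def context_features(words, i, n_prev, n_next):
    """Context-normalized features for word i.
    ASSUMPTIONS: the region spans words i-n_prev..i+n_next inclusive,
    contains the target word, and is clipped at the sequence boundaries."""
    start = max(0, i - n_prev)
    end = min(len(words), i + n_next + 1)
    region = [e for w in words[start:end] for e in w]
    word = words[i]

    max_w, mean_w = max(word), statistics.fmean(word)
    mean_r, std_r = statistics.fmean(region), statistics.pstdev(region)
    max_r, min_r = max(region), min(region)

    return {
        "max_z": _safe_div(max_w - mean_r, std_r),
        "mean_z": _safe_div(mean_w - mean_r, std_r),
        "max_vs_region_max": _safe_div(max_w - max_r, std_r),
        "max_over_range": _safe_div(max_w, max_r - min_r),
        "mean_over_range": _safe_div(mean_w, max_r - min_r),
    }


# Toy energy frames for three consecutive words (made-up numbers).
words = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
for n_prev, n_next in CONTEXTS:
    print((n_prev, n_next), context_features(words, 1, n_prev, n_next))
```

Each word then yields one local feature set plus five context features for each of the six contexts.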

Corpus • Boston Directions Corpus (BDC) [Hirschberg&Nakatani96] • Speech elicited from a direction-giving task. • Used only the read portion. • 50 minutes • Fully ToBI labeled • 10825 words • Manually segmented • 4 Speakers: 3 male, 1 female

Results and Discussion • Energy from different frequency regions predicts pitch accent differently • mean relative improvement of best region over worst: 14.8%

Results and Discussion • Our experiments did not confirm previously reported results. • The single most predictive subband for all speakers was 3-18 bark over full words • Classification accuracy: 76% (42.4% baseline) • p=71.6, r=73.4 • However, it performs significantly worse than the best subband when analyzing a single speaker • not the female speaker
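The slide reports accuracy together with p and r. Assuming these denote precision and recall for the accented class (the slides do not define them), the three figures relate to per-word predictions as in the sketch below. The labels are toy values, not results from the corpus, and the function name is hypothetical.

```python
def accent_metrics(gold, predicted):
    """Accuracy, precision and recall with 1 = accented, 0 = unaccented.
    ASSUMPTION: 'p' and 'r' on the slide mean precision and recall
    of the accented class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels for five words (made-up, for illustration only).
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
acc, prec, rec = accent_metrics(gold, pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```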

Results and Discussion • The subband from 2-20 bark performs significantly worse than the most predictive subband in only a single experiment (h1nucl) • Accuracy: 75.5% (p=70.5, r=72.5) • Due to its robustness we consider this band the “best” • The formant-based energy features tend to perform worse • 6.4% mean accuracy reduction from 2-20 bark • Attributable to: • Errors in the formant tracking algorithm • The presence of discriminative information in higher formants

Results and Discussion • The most predictive features were maximum energy normalized by the mean and standard deviation of three contextual regions: • 1 previous and 1 following word • 2 previous and 1 following word • 2 previous and 2 following words

Results and Discussion • There is a relatively small intersection of correct predictions even among similar subbands. • 10823 of 10825 words were correctly classified by at least one classifier. • Using a majority voting scheme: • Accuracy: 81.9% (p=76.7, r=82.5)
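The two observations on this slide, oracle coverage across classifiers and majority voting, can be illustrated with a small sketch. It assumes binary labels (1 = accented), one prediction list per subband classifier, and that ties are resolved as unaccented; the slides do not specify the tie rule or the number of voters. Function names and data are hypothetical.

```python
def majority_vote(predictions):
    """Combine per-classifier binary predictions word by word.
    predictions: list of equal-length label lists, one per classifier.
    ASSUMPTION: a word is accented only if strictly more than half vote 1."""
    n_classifiers = len(predictions)
    return [1 if 2 * sum(votes) > n_classifiers else 0
            for votes in zip(*predictions)]


def oracle_coverage(gold, predictions):
    """Count words that at least one classifier labels correctly."""
    return sum(1 for i, g in enumerate(gold)
               if any(p[i] == g for p in predictions))


# Toy example: three classifiers, four words (made-up labels).
gold = [1, 0, 1, 0]
preds = [
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
voted = majority_vote(preds)
print("voted labels:", voted)
print("words covered by at least one classifier:", oracle_coverage(gold, preds))
```

In this toy case each classifier makes a different single error, so the vote recovers every gold label. That is the situation a small intersection of correct predictions makes possible.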

Results and Discussion • How do the regioning strategies perform? • Full Word > All Nuclei > Longest Syllable ~ Longest Nuclei • Why does analysis of the full word outperform other regioning strategies? • Duration is a crude measure of lexical stress • Syllable/nuclei segmentation algorithms are imperfect • Pitch accents are not neatly placed • Analyzing more data makes distinctions easier to detect

Conclusion • Using an analysis-by-classification approach we showed: • Energy from different frequency bands correlates with pitch accent differently. • The “best” (highest accuracy, most robust) frequency region is 2-20 bark (>2 bark?) • A voting classifier based exclusively on energy can predict accent reliably.

Future Work • Can we predict which bands will predict accent best for a given word? • We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features. • We plan on repeating these experiments on spontaneous speech data.
