
Automatic Prosody Labeling Final Presentation

Andrew Rosenberg. ELEN 6820 - Speech and Audio Processing and Recognition. 4/27/05.


Presentation Transcript


  1. Automatic Prosody Labeling Final Presentation • Andrew Rosenberg • ELEN 6820 - Speech and Audio Processing and Recognition • 4/27/05

  2. Overview • Project Goal • ToBI standard for prosodic labeling • Previous Work • Method • Results • Conclusion

  3. Project Goal: • Automatic assignment of tones tier elements • Given the waveform, orthographic and break index tiers, predict a subset/simplification of elements in the tones tier. • Distinct experiments for determining each of pitch accents, phrase tones, and phrase boundary tones

  4. ToBI Annotation • Tones and Break Index (ToBI) labeling scheme consists of a speech waveform and 4 tiers: • Tones • Annotation of pitch accents and phrasal tones • Orthographic • Transcription of text • Break Index • Pauses between words, rated on a scale from 0-4. • Miscellaneous • Notes about the annotation (e.g., ambiguities, non-speech noise)

  5. ToBI Transcription Example

  6. ToBI Examples • Pitch Accents (made3.wav): • H*, L*, L+H* • Boundary Tones (money.wav): • L-H%, H-H%, L-L%, H-L%, (H-, L-)
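The phrase-final labels listed on this slide combine a phrase tone (H- or L-) with an optional boundary tone (H% or L%), which is why the project treats phrase tones and boundary tones as separate prediction tasks. A minimal sketch of splitting such a label into its two parts follows. The helper name `split_phrase_final` is hypothetical, and the pattern only covers the six labels shown on the slide, not the full ToBI inventory (for example, downstepped tones are ignored).

```python
import re

# Hypothetical helper, not from the slides: covers only the labels
# L-H%, H-H%, L-L%, H-L%, H- and L- listed on the ToBI Examples slide.
LABEL_RE = re.compile(r"^(?P<phrase>[HL]-)(?P<boundary>[HL]%)?$")


def split_phrase_final(label):
    """Return (phrase_tone, boundary_tone); boundary_tone is None for H- / L-."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a phrase-final tone label: {label!r}")
    return match.group("phrase"), match.group("boundary")


if __name__ == "__main__":
    for label in ["L-H%", "H-H%", "L-L%", "H-L%", "H-", "L-"]:
        print(label, split_phrase_final(label))
```

Under this split, an intermediate phrase boundary (H-, L-) yields a phrase tone with no boundary tone, while a full intonational phrase boundary yields both.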

  7. Previous Work • Ross: “Prediction of abstract prosodic labels for speech synthesis” 1996 • BU Radio News Corpus (~48 minutes) • Public news broadcasts spoken by 7 speakers • Uses decision tree output as input to an HMM for pitch accent identification; Decision trees for phrase/boundary tone identification • Employs no acoustic features. • Narayanan: “An Automatic Prosody Recognizer using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model” 2005 • BU Radio News Corpus • Detects stressed syllables (collapsed ToBI labels) and all boundaries. • Uses CHMM on pitch, intensity and duration to track these “asynchronous” acoustic features, and a trigram POS/stress-boundary language model • Wightman: “Automatic Labeling of Prosodic Patterns” 1994 • Single speaker subset of BNC and ambiguous sentence corpus (read speech). • Like Ross, uses decision tree output as input to HMM • Uses many acoustic features

  8. Method • JRip • Classification rule learner • Better at working with nominal attributes • Easier to read output • Corpus • Boston Direction Corpus • 4 speakers • ~65 minutes of semi-spontaneous speech • Original Plan: • HMMs and SVMs • SVMs took a prohibitive amount of time to learn and performed worse. • HMM implementation problems, and not enough time to implement my own

  9. Method - Features • Min, max, mean, std.dev. F0 and Intensity • # Syllables, Duration, approx. vowel length, POS • F0 slope (weighted) • zscore of max F0 and intensity • Phrase-length F0, intensity and vowel length features • Phrase position
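The per-word features above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual feature extractor: the function names are hypothetical, the z-score is taken relative to whatever reference contour is passed in (for example the enclosing phrase), and the slope weighting is assumed to be a weighted least-squares fit in which frame i gets weight (i + 1) ** power, so that power=1 is a linear and power=2 a quadratic weighting toward the end of the word. The slides do not define the weighting scheme.

```python
from statistics import mean, pstdev


def summary_stats(values):
    """Min, max, mean and population std. dev. of an F0 or intensity contour."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": pstdev(values),
    }


def zscore(x, reference):
    """Z-score of x against a reference contour; 0.0 if the reference is flat."""
    sd = pstdev(reference)
    return 0.0 if sd == 0 else (x - mean(reference)) / sd


def weighted_slope(times, values, power=0):
    """Weighted least-squares slope of values over times.

    ASSUMPTION: frame i is weighted by (i + 1) ** power, so later frames
    count more when power > 0. power=0 gives the ordinary unweighted slope.
    """
    weights = [(i + 1) ** power for i in range(len(values))]
    total = sum(weights)
    t_bar = sum(w * t for w, t in zip(weights, times)) / total
    v_bar = sum(w * v for w, v in zip(weights, values)) / total
    num = sum(w * (t - t_bar) * (v - v_bar)
              for w, t, v in zip(weights, times, values))
    den = sum(w * (t - t_bar) ** 2 for w, t in zip(weights, times))
    return num / den if den else 0.0


if __name__ == "__main__":
    # Toy values for illustration only, not data from the corpus.
    word_times = [0.0, 0.1, 0.2, 0.3]          # seconds
    word_f0 = [100.0, 110.0, 120.0, 130.0]     # Hz
    phrase_f0 = [90.0, 100.0, 110.0, 120.0, 130.0, 95.0]

    print(summary_stats(word_f0))
    print(zscore(max(word_f0), phrase_f0))
    print(weighted_slope(word_times, word_f0, power=2))
```

The same three functions apply unchanged to an intensity contour; only the input values differ.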

  10. Results - Tasks • Pitch Accent • Identification • Detection • Phrase Tone identification • Boundary Tone identification • Phrase/Boundary Tone • Identification • Detection

  11. Results - Pitch Accent Identification • Accuracy • Relevant Features • # syllables, duration (previous 2), vowel length (prev, next 2), POS, max & stdev F0, slope F0, max & stdev intensity, zscore of F0, phrase level zscore of F0 and intensity • *Ross identifies a different subset of ToBI pitch accents

  12. Results - Pitch Accent Detection • Baseline: 58.9% • Human agreement: 91% on BNC, 86-88% in general • Relevant features identical to those of the identification task

  13. Results - Phrase Tone • Accuracy • Relevant Features • Duration of next word, max, min, mean F0. • Linear slope F0, zscore of intensity, phrase zscores of F0 and intensity

  14. Results - Boundary Tone Identification • Accuracy • Relevant Features • Quadratically weighted F0 slope

  15. Results - Phrase/Boundary Tone Identification • Accuracy • Relevant Features • Duration of next two words, POS (current and 2 next), max, mean and slope (all weightings) of F0, mean intensity, phrase zscores of F0 and intensity • zscore of difference in max intensity in the current word and the phrase.

  16. Results – Phrase/Boundary Tone Detection • Accuracy • Human agreement (in general): 95% • Best agreement: 93.0% over 77% baseline • Relevant Features • Vowel length (current and next word) • POS of the next word
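One way to read "93.0% over 77% baseline" is as a reduction in error relative to the baseline. The sketch below works through that arithmetic. It assumes the baseline is a majority-class guess, which the slides do not state, and both function names are hypothetical. The label list is made up purely to show the call.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label.

    ASSUMPTION: the slides do not say how the baseline was computed;
    a majority-class baseline is one common choice.
    """
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def relative_error_reduction(accuracy, baseline):
    """Fraction of the baseline's errors that the classifier removes."""
    return (accuracy - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    # Made-up labels for illustration only, not corpus data.
    toy_labels = ["none", "none", "none", "boundary", "boundary"]
    print(majority_baseline(toy_labels))

    # Figures from the slide: 93.0% accuracy over a 77% baseline.
    print(f"{relative_error_reduction(0.93, 0.77):.1%}")
```

With the slide's figures, the classifier removes 16 of the baseline's 23 percentage points of error, about 69.6%.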

  17. Conclusion • Relatively low-tech acoustic features and ML algorithms can perform competitively with more complicated NLP approaches • Break index information was not as helpful as initially suspected. • Potential Improvements: • Sequential Modeling (HMM) • Different features • More sophisticated pitch contour feature • Content-based features (similar to Ross)
