Automatic Prosody Labeling: Final Presentation Andrew Rosenberg ELEN 6820 - Speech and Audio Processing and Recognition 4/27/05
Overview • Project Goal • ToBI standard for prosodic labeling • Previous Work • Method • Results • Conclusion
Project Goal: • Automatic assignment of tones tier elements • Given the waveform, orthographic and break index tiers, predict a subset/simplification of elements in the tones tier. • Distinct experiments for determining each of pitch accents, phrase tones, and phrase boundary tones
ToBI Annotation • Tones and Break Indices (ToBI) labeling scheme consists of a speech waveform and 4 tiers: • Tones • Annotation of pitch accents and phrasal tones • Orthographic • Transcription of the text • Break Index • Strength of the juncture between words, rated on a scale from 0 (no break) to 4 (full intonational phrase boundary) • Miscellaneous • Notes about the annotation (e.g., ambiguities, non-speech noise)
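As a concrete illustration of the four tiers above, here is a minimal sketch (not from the presentation) of how one ToBI-annotated utterance might be represented in code: parallel tiers aligned to the waveform by time stamps. The labels and times below are invented for illustration.

```python
# Hypothetical representation of one ToBI-annotated utterance.
# Each tier is time-aligned to the waveform; values are invented.
utterance = {
    # (word, start_sec, end_sec)
    "orthographic": [("marianna", 0.10, 0.62), ("made", 0.62, 0.85),
                     ("the", 0.85, 0.93), ("marmalade", 0.93, 1.60)],
    # (tone label, time_sec); '*' marks pitch accents, '%' boundary tones
    "tones":        [("H*", 0.30), ("L*", 1.10), ("L-L%", 1.58)],
    # (word, break index 0-4)
    "break_index":  [("marianna", 1), ("made", 0), ("the", 0), ("marmalade", 4)],
    # free-form annotator notes
    "misc":         [("disfluency?", 1.20)],
}

def pitch_accents(utt):
    """Return the tone-tier labels that are pitch accents (marked with '*')."""
    return [label for label, _ in utt["tones"] if "*" in label]
```

With this toy utterance, `pitch_accents(utterance)` picks out `H*` and `L*` while leaving the phrase/boundary tone `L-L%` behind.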
ToBI Examples • Pitch Accents (made3.wav): • H*, L*, L+H* • Boundary Tones (money.wav): • L-H%, H-H%, L-L%, H-L%, (H-, L-)
Previous Work • Ross: "Prediction of abstract prosodic labels for speech synthesis," 1996 • BU Radio News Corpus (~48 minutes) • Public news broadcasts spoken by 7 speakers • Uses decision tree output as input to an HMM for pitch accent identification; decision trees for phrase/boundary tone identification • Employs no acoustic features • Narayanan: "An Automatic Prosody Recognizer using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model," 2005 • BU Radio News Corpus • Detects stressed syllables (collapsed ToBI labels) and all boundaries • Uses a CHMM on pitch, intensity and duration to track these "asynchronous" acoustic features, and a trigram POS/stress-boundary language model • Wightman: "Automatic Labeling of Prosodic Patterns," 1994 • Single-speaker subset of BNC and an ambiguous-sentence corpus (read speech) • Like Ross, uses decision tree output as input to an HMM • Uses many acoustic features
Method • JRip (Weka's implementation of the RIPPER rule learner) • Classification rule learner • Better at working with nominal attributes • Easier-to-read output • Corpus • Boston Directions Corpus • 4 speakers • ~65 minutes of semi-spontaneous speech • Original Plan: • HMMs and SVMs • SVMs took a prohibitive amount of time to train and performed worse • HMM implementation problems, and not enough time to implement my own
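JRip outputs an ordered list of IF-THEN rules plus a default class, which is what makes its output easy to read. The sketch below shows how such a rule list is applied at classification time; the rules themselves are invented examples, not the ones actually learned in this project.

```python
# Hedged sketch of applying a RIPPER/JRip-style ordered rule list.
# The conditions and thresholds below are invented for illustration.
rules = [
    # (condition over a word's feature dict, predicted class)
    (lambda w: w["max_f0_zscore"] > 1.0 and w["pos"] in {"NN", "JJ"}, "accented"),
    (lambda w: w["duration"] > 0.35, "accented"),
]
DEFAULT = "unaccented"  # JRip's default class covers anything no rule matches

def classify(word_features):
    """Apply rules in order; the first matching rule wins, else the default."""
    for condition, label in rules:
        if condition(word_features):
            return label
    return DEFAULT
```

Because the rules fire in order with a default fallback, each learned rule reads as a self-contained statement about the data, unlike the distributed weights of an SVM.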
Method - Features • Min, max, mean, std.dev. F0 and Intensity • # Syllables, Duration, approx. vowel length, POS • F0 slope (weighted) • zscore of max F0 and intensity • Phrase-length F0, intensity and vowel length features • Phrase position
Results - Tasks • Pitch Accent • Identification • Detection • Phrase Tone identification • Boundary Tone identification • Phrase/Boundary Tone • Identification • Detection
Results - Pitch Accent Identification • Accuracy • Relevant Features • # syllables, duration (previous 2), vowel length (prev, next 2), POS, max & stdev F0, slope F0, max & stdev intensity, zscore of F0, phrase level zscore of F0 and intensity *Ross identifies a different subset of ToBI pitch accents
Results - Pitch Accent Detection • Baseline: 58.9% • On BNC, human agreement of 91%; 86-88% in general • Identical relevant features to the identification task
Results - Phrase Tone • Accuracy • Relevant Features • Duration of next word, max, min, mean F0. • Linear slope F0, zscore of intensity, phrase zscores of F0 and intensity
Results - Boundary Tone Identification • Accuracy • Relevant Features • Quadratically weighted F0 slope
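One reading of "quadratically weighted F0 slope" is a weighted least-squares fit in which later frames of the word count more, emphasizing the word-final pitch movement that distinguishes boundary tones. The sketch below implements that interpretation; the weighting scheme is an assumption on my part, not taken from the presentation.

```python
def weighted_f0_slope(times, f0, power=2):
    """Weighted least-squares slope of F0 over time (Hz per second).
    With power=2 the weights grow quadratically toward the end of the
    word -- an assumed reading of 'quadratically weighted' slope."""
    n = len(times)
    weights = [((i + 1) / n) ** power for i in range(n)]
    sw = sum(weights)
    mt = sum(w * t for w, t in zip(weights, times)) / sw   # weighted mean time
    mf = sum(w * f for w, f in zip(weights, f0)) / sw      # weighted mean F0
    num = sum(w * (t - mt) * (f - mf) for w, t, f in zip(weights, times, f0))
    den = sum(w * (t - mt) ** 2 for w, t in zip(weights, times))
    return num / den
```

On a perfectly linear F0 track the weighting is irrelevant and the true slope is recovered exactly; the weights only matter when the contour bends, which is exactly when the final movement carries the boundary-tone information.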
Results - Phrase/Boundary Tone Identification • Accuracy • Relevant Features • Duration of next two words, POS (current and 2 next), max, mean and slope (all weightings) of F0, mean intensity, phrase zscores of F0 and intensity • zscore of the difference between the max intensity in the current word and in the phrase
Results – Phrase/Boundary Tone Detection • Accuracy • Human agreement (in general): 95% • Best agreement: 93.0% over 77% baseline • Relevant Features • Vowel length (current and next word) • POS of the next word
Conclusion • Relatively low-tech acoustic features and ML algorithms can perform competitively with more complicated NLP approaches • Break index information was not as helpful as initially suspected • Potential Improvements: • Sequential modeling (HMM) • Different features • More sophisticated pitch contour features • Content-based features (similar to Ross)