Context in Multilingual Tone and Pitch Accent Recognition

Gina-Anne Levow
University of Chicago
September 7, 2005

Roadmap
• Motivating Context
• Data Collections & Processing
• Modeling Context for Tone and Pitch Accent
• Context in Recognition
• Conclusion

Challenges
• Tone and Pitch Accent Recognition
  • Key component of language understanding
    • Lexical tone carries word meaning
    • Pitch accent carries semantic, pragmatic, discourse meaning
  • Non-canonical form (Shen 90, Shih 00, Xu 01)
    • Tonal coarticulation modifies surface realization
    • In extreme cases, fall becomes rise
  • Tone is relative
    • To speaker range
      • High for male may be low for female
    • To phrase range, other tones
      • E.g. downstep
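
The point that tone is relative to speaker range can be illustrated with a small sketch. It assumes per-speaker z-score normalization, which is one common choice; the slides say only that features are speaker normalized, not how. The function name and the pitch values are hypothetical.

```python
from statistics import mean, pstdev

def normalize_by_speaker(pitch_hz_by_speaker):
    """Z-score each speaker's pitch values against that speaker's own range."""
    normalized = {}
    for speaker, values in pitch_hz_by_speaker.items():
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # guard against a constant pitch track
        normalized[speaker] = [(v - mu) / sigma for v in values]
    return normalized

# Hypothetical pitch samples in Hz, for illustration only.
samples = {
    "male": [100.0, 120.0, 140.0],
    "female": [180.0, 210.0, 240.0],
}
z = normalize_by_speaker(samples)
```

In raw Hz the male speaker's highest value (140) lies below the female speaker's lowest (180); after normalization each speaker's high and low values land on the same scale.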

Strategy
• Common model across languages, SVM classifier
  • Acoustic-prosodic model: no word label, POS, lexical stress info
  • No explicit tone label sequence model
  • English, Mandarin Chinese (also Cantonese)
• Exploit contextual information
  • Features from adjacent syllables
    • Height, shape: direct, relative
  • Compensate for phrase contour
• Analyze impact of
  • Context position, context encoding, context type
  • > 20% relative improvement over no context
  • Preceding context greater enhancement than following

Data Collection & Processing
• English: (Ostendorf et al, 95)
  • Boston University Radio News Corpus, f2b
  • Manually ToBI annotated, aligned, syllabified
  • Pitch accent aligned to syllables
    • Unaccented, High, Downstepped High, Low
    • (Sun 02, Ross & Ostendorf 95)
• Mandarin:
  • TDT2 Voice of America Mandarin Broadcast News
  • Automatically force aligned to anchor scripts (CUSonic)
  • High, Mid-rising, Low, High falling, Neutral

Local Feature Extraction
• Uniform representation for tone, pitch accent
  • Motivated by Pitch Target Approximation Model
    • Tone/pitch accent target exponentially approached
    • Linear target: height, slope (Xu et al, 99)
• Scalar features:
  • Pitch, Intensity max, mean (Praat, speaker normalized)
  • Pitch at 5 points across voiced region
  • Duration
  • Initial, final in phrase
• Slope:
  • Linear fit to last half of pitch contour
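
The pitch features listed above could be computed along the following lines. This is a minimal sketch, not the original pipeline: it assumes the contour is already a list of speaker-normalized pitch values for one syllable's voiced region, that the 5 points are evenly spaced, and that the slope is a least-squares fit over frame indices. All function names are hypothetical.

```python
from statistics import mean

def sample_points(contour, n=5):
    """Pick n evenly spaced values from the contour (first and last included)."""
    last = len(contour) - 1
    return [contour[round(i * last / (n - 1))] for i in range(n)]

def linear_slope(values):
    """Least-squares slope of values against their frame index."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den if den else 0.0

def local_features(contour):
    """Scalar pitch features for one syllable's voiced region."""
    last_half = contour[len(contour) // 2:]
    return {
        "pitch_max": max(contour),
        "pitch_mean": mean(contour),
        "pitch_points": sample_points(contour),
        "slope": linear_slope(last_half),
    }

# Hypothetical normalized contour: a rise followed by a steady fall.
contour = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
feats = local_features(contour)
```

Fitting the slope only to the last half of the contour follows the slide; intensity, duration, and phrase-position features are left out of the sketch.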

Context Features
• Local context:
  • Extended features
    • Pitch max, mean, adjacent points of preceding, following syllables
  • Difference features
    • Difference between
      • Pitch max, mean, mid, slope
      • Intensity max, mean
    • Of preceding, following and current syllable
• Phrasal context:
  • Compute collection average phrase slope
  • Compute scalar pitch values, adjusted for slope
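
The two local context encodings can be sketched as follows. The feature names, the dictionary layout, and the sign convention for differences (current syllable minus neighbor) are assumptions made for illustration; the slides do not specify them.

```python
def context_features(syllables, i, position="both", encoding="difference"):
    """Context features for syllable i from its neighbors.

    syllables: list of dicts of scalar features, one per syllable.
    position:  "preceding", "following", "both", or "none".
    encoding:  "extended" copies neighbor values; "difference" subtracts
               the neighbor's value from the current syllable's value.
    """
    offsets = {"preceding": [-1], "following": [1], "both": [-1, 1], "none": []}
    current = syllables[i]
    feats = {}
    for off in offsets[position]:
        j = i + off
        if not 0 <= j < len(syllables):
            continue  # no neighbor at an utterance edge
        tag = "prev" if off < 0 else "next"
        for name, value in syllables[j].items():
            if encoding == "extended":
                feats[f"{tag}_{name}"] = value
            else:
                feats[f"diff_{tag}_{name}"] = current[name] - value
    return feats

# Hypothetical per-syllable features in speaker-normalized units.
sylls = [
    {"pitch_max": 1.0, "pitch_mean": 0.5},
    {"pitch_max": 2.0, "pitch_mean": 1.5},
    {"pitch_max": 0.5, "pitch_mean": 0.0},
]
diffs = context_features(sylls, 1, position="preceding", encoding="difference")
ext = context_features(sylls, 1, position="both", encoding="extended")
```

The position argument mirrors the preceding / following / both / none conditions compared in the experiments.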

Classification Experiments
• Classifier: Support Vector Machine
  • Linear kernel
  • Multiclass formulation
  • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  • 4:1 training / test splits
• Experiments: Effects of
  • Context position: preceding, following, none, both
  • Context encoding: Extended/Difference
  • Context type: local, phrasal
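
The experimental design can be laid out as a small harness. This sketch covers only the 4:1 split and the grid of local context conditions; it assumes a seeded random shuffle, which the slides do not state, and it leaves the SVM training itself to a toolkit such as those named above. Function names are hypothetical.

```python
import random
from itertools import product

def split_4_to_1(items, seed=0):
    """Shuffle and split items into 80% training and 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]

def experiment_conditions():
    """Enumerate the context position x encoding conditions."""
    conditions = [("none", None)]  # baseline has no context to encode
    positions = ["preceding", "following", "both"]
    encodings = ["extended", "difference"]
    conditions.extend(product(positions, encodings))
    return conditions

train, held_out = split_4_to_1(range(100))
conds = experiment_conditions()
```

Each condition would be paired with the same split so that differences in accuracy reflect the context features rather than the data partition.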

Results: Local Context

Discussion: Local Context
• Any context information improves over none
  • Preceding context information consistently improves over none or following context information
  • English: Generally more context features are better
  • Mandarin: Following context can degrade
  • Little difference in encoding (Extended vs Difference)
• Consistent with phonological analysis (Xu) that coarticulation is carryover, not anticipatory

Results & Discussion: Phrasal Context
• Phrase contour compensation enhances recognition
  • Simple strategy
  • Non-linear slope compensation may improve results further
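
A minimal sketch of linear phrase contour compensation, assuming a single collection-wide average slope is subtracted from each pitch value according to its time offset from the phrase start. The function name, units, and numbers are hypothetical.

```python
def compensate_for_phrase_slope(pitch_values, times, avg_phrase_slope):
    """Remove a linear phrase-level trend from pitch values.

    times are measured from the start of the phrase, so the first
    value is left unchanged and later values are raised to offset a
    falling (negative-slope) contour.
    """
    return [p - avg_phrase_slope * t for p, t in zip(pitch_values, times)]

# Hypothetical numbers: pitch falls 10 Hz per second across the phrase.
times = [0.0, 0.5, 1.0, 1.5]
pitch = [200.0, 195.0, 190.0, 185.0]
flat = compensate_for_phrase_slope(pitch, times, avg_phrase_slope=-10.0)
```

With the trend removed, syllables late in a phrase are no longer biased toward lower pitch values than syllables early in the phrase.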

Conclusion
• Employ common acoustic representation
  • Tone (Mandarin), pitch accent (English)
  • Cantonese, recent experiments
• SVM classifiers - linear kernel: 76%, 81%
• Local context effects:
  • Up to > 20% relative reduction in error
  • Preceding context greatest contribution
    • Carryover vs anticipatory
• Phrasal context effects:
  • Compensation for phrasal contour improves recognition
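
For readers unfamiliar with the metric, relative reduction in error compares error rates rather than accuracies. The sketch below shows the arithmetic with hypothetical accuracies; they are not results from these experiments.

```python
def relative_error_reduction(baseline_accuracy, context_accuracy):
    """Fraction of the baseline error removed by the improved system."""
    baseline_error = 1.0 - baseline_accuracy
    context_error = 1.0 - context_accuracy
    return (baseline_error - context_error) / baseline_error

# Hypothetical accuracies, chosen only to make the arithmetic easy to follow.
reduction = relative_error_reduction(0.50, 0.60)
```

Here the error falls from 50% to 40%, so one fifth of the baseline error is removed even though accuracy rises by only 10 points.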

Current & Future Work
• Application of model to different languages
  • Cantonese, Dschang (Bantu family)
  • Cantonese: ~65% acoustic only, 85% w/segmental
• Integration of additional contextual influence
  • Topic, turn, discourse structure
  • HMSVM, GHMM models
• http://people.cs.uchicago.edu/~levow/projects/tai
• Supported by NSF Grant #: 0414919

Confusion Matrix (English)

Confusion Matrix (Mandarin)

Related Work
• Tonal coarticulation:
  • Xu & Sun 02; Xu 97; Shih & Kochanski 00
• English pitch accent
  • X. Sun 02; Hasegawa-Johnson et al 04; Ross & Ostendorf 95
• Lexical tone recognition
  • SVM recognition of Thai tone: Thubthong 01
• Context-dependent tone models
  • Wang & Seneff 00; Zhou et al 04

Pitch Target Approximation Model
• Pitch target:
  • Linear model:
  • Exponentially approximated:
• In practice, assume target well-approximated by mid-point (Sun, 02)
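
The equations from this slide did not survive in the transcript. As a hedged reconstruction, the form below is how the pitch target approximation model is commonly written in the cited literature (Xu et al, 99; Sun, 02); the symbols a, b, β, and λ are taken from that usual presentation, not from the slide itself.

```latex
% Linear pitch target with slope a and height b
T(t) = a t + b

% Surface pitch approaches the target exponentially;
% \beta is the initial distance from the target, \lambda the rate of approach
y(t) = \beta e^{-\lambda t} + a t + b, \qquad t \ge 0,\ \lambda \ge 0
```

As t grows, the exponential term decays and y(t) converges to the target T(t), which is consistent with the local feature design of estimating height and slope from the later part of the syllable.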
