Context and Learning in Multilingual Tone and Pitch Accent Recognition Gina-Anne Levow University of Chicago May 18, 2007
Roadmap • Challenges for Tone and Pitch Accent • Contextual effects • Training demands • Modeling Context for Tone and Pitch Accent • Data collections & processing • Integrating context • Context in Recognition • Asides: More tones and features • Reducing Training Demands • Data collections & structure • Semi-supervised learning • Unsupervised clustering • Conclusion
Challenges: Context • Tone and Pitch Accent Recognition • Key component of language understanding • Lexical tone carries word meaning • Pitch accent carries semantic, pragmatic, discourse meaning • Non-canonical form (Shen 90, Shih 00, Xu 01) • Tonal coarticulation modifies surface realization • In extreme cases, fall becomes rise • Tone is relative • To speaker range • High for male may be low for female • To phrase range, other tones • E.g. downstep
Challenges: Training Demands • Tone and pitch accent recognition • Exploit data-intensive machine learning • SVMs (Thubthong 01, Levow 05, SLX05) • Boosted and bagged decision trees (X. Sun, 02) • HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04, …) • Can achieve good results with huge sample sets • SLX05: ~10K lab syllable samples -> >90% accuracy • Training data expensive to acquire • Time – pitch accent labeling runs at tens of times real-time • Money – requires skilled labelers • Limits investigation across domains, styles, etc. • Human language acquisition doesn't use labels
Strategy: Overall • Common model across languages • Common machine learning classifiers • Acoustic-prosodic model • No word label, POS, lexical stress info • No explicit tone label sequence model • English, Mandarin Chinese, isiZulu • (also Cantonese)
Strategy: Context • Exploit contextual information • Features from adjacent syllables • Height, shape: direct, relative • Compensate for phrase contour • Analyze impact of • Context position, context encoding, context type • > 12.5% reduction in error over no context
Data Collections: I • English: (Ostendorf et al, 95) • Boston University Radio News Corpus, f2b • Manually ToBI annotated, aligned, syllabified • Pitch accent aligned to syllables • Unaccented, High, Downstepped High, Low • (Sun 02, Ross & Ostendorf 95)
Data Collections: II • Mandarin: • TDT2 Voice of America Mandarin Broadcast News • Automatically force aligned to anchor scripts • Automatically segmented, pinyin pronunciation lexicon • Manually constructed pinyin-ARPABET mapping • CU Sonic – language porting • High, Mid-rising, Low, High falling, Neutral
Data Collections: III • isiZulu: (Govender et al., 2005) • Sentence text collected from Web • Selected based on grapheme bigram variation • Read by male native speaker • Manually aligned, syllabified • Tone labels assigned by 2nd native speaker • Based only on utterance text • Tone labels: High, low
Local Feature Extraction • Uniform representation for tone, pitch accent • Motivated by Pitch Target Approximation Model • Tone/pitch accent target exponentially approached • Linear target: height, slope (Xu et al, 99) • Base features: • Pitch, Intensity max, mean, min, range • (Praat, speaker normalized) • Pitch at 5 points across voiced region • Duration • Initial, final in phrase • Slope: • Linear fit to last half of pitch contour
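A minimal sketch of this per-syllable representation follows, assuming the f0 and intensity contours over a syllable's voiced region have already been extracted (e.g. with Praat) and speaker-normalized; the function name and feature keys are illustrative, not the original system's, and the initial/final-in-phrase indicator features are omitted for brevity.

```python
import numpy as np

def local_features(f0, intensity, duration):
    """Per-syllable features: pitch/intensity statistics, pitch sampled at
    5 points across the voiced region, duration, and the slope of a linear
    fit to the last half of the pitch contour (per the target-approximation
    motivation above). f0 and intensity are 1-D numpy arrays, assumed
    speaker-normalized; duration is in seconds."""
    feats = {
        "pitch_max": f0.max(), "pitch_mean": f0.mean(),
        "pitch_min": f0.min(), "pitch_range": f0.max() - f0.min(),
        "int_max": intensity.max(), "int_mean": intensity.mean(),
        "int_min": intensity.min(), "int_range": intensity.max() - intensity.min(),
        "duration": duration,
    }
    # Pitch at 5 evenly spaced points across the voiced region
    idx = np.linspace(0, len(f0) - 1, 5).round().astype(int)
    for i, j in enumerate(idx):
        feats["pitch_pt%d" % i] = f0[j]
    # Slope: linear fit to the last half of the pitch contour
    half = f0[len(f0) // 2:]
    t = np.linspace(0.0, duration / 2.0, num=len(half))
    feats["slope"] = np.polyfit(t, half, 1)[0] if len(half) > 1 else 0.0
    return feats
```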
Context Features • Local context: • Extended features • Pitch max, mean, adjacent points of preceding, following syllables • Difference features • Difference between • Pitch max, mean, mid, slope • Intensity max, mean • Of preceding, following and current syllable • Phrasal context: • Compute collection average phrase slope • Compute scalar pitch values, adjusted for slope
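Continuing from local_features() above, the sketch below illustrates the extended and difference encodings of local context and the phrase-slope compensation; all names are hypothetical and the exact feature inventory differs in the original work.

```python
import numpy as np

def add_context_features(feats, prev, nxt):
    """Add local-context features. `feats`, `prev`, `nxt` are per-syllable
    dicts from local_features(); prev/nxt may be None at phrase edges."""
    out = dict(feats)
    for name, ctx in (("prev", prev), ("next", nxt)):
        if ctx is None:
            continue
        # Extended features: copy neighbouring pitch statistics directly
        out[name + "_pitch_max"] = ctx["pitch_max"]
        out[name + "_pitch_mean"] = ctx["pitch_mean"]
        # Difference features: current syllable minus neighbour
        for k in ("pitch_max", "pitch_mean", "slope", "int_max", "int_mean"):
            out["d_%s_%s" % (name, k)] = feats[k] - ctx[k]
    return out

def compensate_phrase_slope(pitch_vals, times, avg_phrase_slope):
    """Subtract the collection-average phrase slope so scalar pitch values
    are measured relative to the overall declining phrase contour."""
    return np.asarray(pitch_vals) - avg_phrase_slope * np.asarray(times)
```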
Classification Experiments • Classifier: Support Vector Machine • Linear kernel • Multiclass formulation • SVMlight (Joachims), LibSVM (Chang & Lin 01) • 4:1 training / test splits • Experiments: effects of • Context position: preceding, following, none, both • Context encoding: Extended / Difference • Context type: local, phrasal
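A hedged sketch of this setup using scikit-learn's LibSVM-backed SVC (the original experiments used SVMlight and LibSVM directly); X and y stand for the per-syllable feature matrix and tone/accent labels and are assumed to exist.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# X: per-syllable feature matrix, y: tone or pitch-accent labels (assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)   # 4:1 train/test split

clf = SVC(kernel="linear")        # LibSVM backend; one-vs-one multiclass
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```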
Discussion: Local Context • Any context information improves over none • Preceding context information consistently improves over none or following context alone • English/isiZulu: generally, more context features are better • Mandarin: following context can degrade performance • Little difference in encoding (Extended vs. Difference) • Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory
Results & Discussion: Phrasal Context • Phrase contour compensation enhances recognition • Simple strategy • Use of non-linear slope compensation may improve results further
Context: Summary • Employ common acoustic representation • Tone (Mandarin, isiZulu), pitch accent (English) • SVM classifiers, linear kernel: 76%, 76%, 81% • Local context effects: • Up to >20% relative reduction in error • Preceding context makes the greatest contribution • Carryover vs. anticipatory coarticulation • Phrasal context effects: • Compensation for phrasal contour improves recognition
Aside: More Tones • Cantonese: • CUSENT corpus of read broadcast news text • Same feature extraction & representation • 6 tones: • High level, high rise, mid level, low fall, low rise, low level • SVM classification: • Linear kernel: 64%; Gaussian kernel: 68% • Tones 3 & 6: 50% – mutually indistinguishable (50% pairwise) • Human levels: 50% without context; 68% with context • Augment with syllable phone sequence • 86% accuracy: for 90% of syllables with tone 3 or 6, one of the two dominates
Aside: Voice Quality & Energy • w/ Dinoj Surendran • Assess local voice quality and energy features for tone • Not typically associated with tone in Mandarin/isiZulu • Considered: VQ (NAQ, AQ, etc.); spectral balance; spectral tilt; band energy • Useful: band energy significantly improves recognition • Mandarin: neutral tone • Supports identification of unstressed syllables • Spectral balance predicts stress in Dutch • isiZulu: band energy alone outperforms pitch • In conjunction with pitch -> ~78%
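As a rough illustration of a band-energy feature, a minimal sketch follows; the band edges, windowing, and framing are placeholders, since the talk does not specify them.

```python
import numpy as np

def band_energy(frame, sr, lo_hz=60.0, hi_hz=400.0):
    """Energy in one frequency band of a windowed frame; the band edges
    here are hypothetical, not the values used in the experiments."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    mask = (freqs >= lo_hz) & (freqs < hi_hz)
    return float(spec[mask].sum())
```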
Roadmap • Challenges for Tone and Pitch Accent • Contextual effects • Training demands • Modeling Context for Tone and Pitch Accent • Data collections & processing • Integrating context • Context in Recognition • Reducing Training Demands • Data collections & structure • Semi-supervised learning • Unsupervised clustering • Conclusion
Strategy: Training • Challenge: • Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data? • Exploit semi-supervised and unsupervised learning • Semi-supervised Laplacian SVM • K-means and asymmetric k-lines clustering • Substantially outperform baselines • Can approach supervised levels
Data Collections & Processing • English: (as before) • Boston University Radio News Corpus, f2b • Binary: Unaccented vs accented • 4-way: Unaccented, High, Downstepped High, Low • Mandarin: • Lab speech data: (Xu, 1999) • 5 syllable utterances: vary tone, focus position • In-focus, pre-focus, post-focus • TDT2 Voice of America Mandarin Broadcast News • 4-way: High, Mid-rising, Low, High falling • isiZulu: (as before) • Read web sentences • 2-way: High vs low
Semi-supervised Learning • Approach: • Employ small amount of labeled data • Exploit information from additional – presumably more readily available – unlabeled data • Little prior work; a few weakly supervised approaches (Wong et al '05) • Classifier: • Laplacian SVM (Sindhwani, Belkin & Niyogi '05) • Semi-supervised variant of the SVM • Exploits unlabeled examples • RBF kernel, typically 6 nearest neighbors, transductive
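Scikit-learn has no Laplacian SVM, so the sketch below substitutes LabelSpreading over a 6-nearest-neighbor graph as a stand-in for the same transductive idea of propagating a few labels through the unlabeled acoustic space; it is not the classifier used in the talk, and X, y, and unlabeled_idx are assumed to exist.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# X: feature matrix for all syllables; y: integer-coded gold labels,
# known only for a small subset (assumed); unlabeled_idx: the rest.
y_partial = np.asarray(y).copy()
y_partial[unlabeled_idx] = -1             # -1 marks unlabeled examples

model = LabelSpreading(kernel="knn", n_neighbors=6)
model.fit(X, y_partial)                   # uses labeled + unlabeled data
pred = model.transduction_[unlabeled_idx]  # transductive predictions
```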
Experiments • Pitch accent recognition: • Binary classification: Unaccented / Accented • 1000 instances, proportionally sampled • Labeled training: 200 unaccented, 100 accented • 80% accuracy (cf. 84% for an SVM with 15x as much labeled data) • Mandarin tone recognition: • 4-way classification: n(n-1)/2 binary classifiers • 400 instances, balanced; 160 labeled • Clean lab speech, in-focus: 94% • cf. 99% for an SVM trained on 1000s of samples; 85% for an SVM with 160 training samples • Broadcast news: 70% • cf. <50% for an SVM with 160 training samples
Unsupervised Learning • Question: • Can we identify the tone structure of a language from the acoustic space without training? • Analogous to language acquisition • Significant recent research in unsupervised clustering • Established approaches: k-means • Spectral clustering (Shi & Malik '97; Fischer & Poland 2004): asymmetric k-lines • Little research for tone • Self-organizing maps (Gauthier et al, 2005): tones identified in lab speech using f0 velocities • Cluster-based bootstrapping (Narayanan et al, 2006) • Prominence clustering (Tamburini '05)
Clustering • Pitch accent clustering: • 4-way distinction: 1000 samples, proportional • 2-16 clusters constructed • Assign most frequent class label to each cluster • Classifier: • Asymmetric k-lines: context-dependent kernel radii, non-spherical clusters • >78% accuracy • 2 clusters: asymmetric k-lines best • Context effects: • Vectors with preceding context vs. with no context: comparable
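A minimal sketch of this cluster-then-label evaluation: cluster the feature vectors, assign each cluster its most frequent gold label, and score. K-means stands in here because asymmetric k-lines spectral clustering (Fischer & Poland 2004) is not available in scikit-learn; the function name is hypothetical.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_label_accuracy(X, y, n_clusters):
    """Cluster X, give each cluster its majority label from y, and return
    the accuracy of the resulting labeling."""
    y = np.asarray(y)
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(X)
    correct = 0
    for c in range(n_clusters):
        members = y[clusters == c]
        if len(members):
            # count of the most frequent gold label in this cluster
            correct += Counter(members).most_common(1)[0][1]
    return correct / len(y)
```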
Contrasting Clustering • Contrasts: • Clustering: • 3 spectral approaches: • Perform spectral decomposition of affinity matrix • Asymmetric k-lines (Fischer & Poland 2004) • Symmetric k-lines (Fischer & Poland 2004) • Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004) • Binary weights, k-lines clustering • K-means: standard Euclidean distance • # of clusters: 2-16 • Best results: >78% • 2 clusters: asymmetric k-lines; >2 clusters: k-means • Larger # of clusters: all approaches similar
Tone Clustering: I • Mandarin four tones: • 400 samples: balanced • 2-phase clustering: 2-5 clusters each • Asymmetric k-lines, k-means clustering • Clean read speech: • In-focus syllables: 87% (cf. 99% supervised) • In-focus and pre-focus: 77% (cf. 93% supervised) • Broadcast news: 57% (cf. 74% supervised) • K-means requires more clusters to reach k-lines level
Tone Structure • First phase of clustering splits high/rising from low/falling tones by slope • Second phase splits by pitch height
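A hypothetical sketch of this two-phase split, clustering first on pitch slope and then on pitch height within each first-phase group; the original work used asymmetric k-lines as well as k-means, and the number of clusters per phase varied (2-5).

```python
import numpy as np
from sklearn.cluster import KMeans

def two_phase_clusters(slope, pitch_mean, k1=2, k2=2):
    """Phase 1: split syllables by slope (rising vs. falling).
    Phase 2: split each phase-1 group by pitch height."""
    slope = np.asarray(slope).reshape(-1, 1)
    pitch_mean = np.asarray(pitch_mean).reshape(-1, 1)
    phase1 = KMeans(n_clusters=k1, n_init=10, random_state=0).fit_predict(slope)
    labels = np.zeros(len(slope), dtype=int)
    for c in range(k1):
        idx = np.where(phase1 == c)[0]
        if len(idx) == 0:
            continue
        sub = KMeans(n_clusters=min(k2, len(idx)), n_init=10,
                     random_state=0).fit_predict(pitch_mean[idx])
        labels[idx] = c * k2 + sub   # flat cluster id combining both phases
    return labels
```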
Tone Clustering: II • isiZulu High/Low tones • 3225 samples: no labels • Proportional: ~62% low, 38% high • K-means clustering: 2 clusters • Read speech, web-based sentences • 70% accuracy (vs 76% fully-supervised)
Conclusions • Common prosodic framework for tone and pitch accent recognition • Contextual modeling enhances recognition • Local context and broad phrase contour • Carryover coarticulation has larger effect for Mandarin • Exploiting unlabeled examples for recognition • Semi- and Un-supervised approaches • Best cases approach supervised levels with less training • Exploits acoustic structure of tone and accent space
Current and Future Work • Interactions of tone and intonation • Recognition of topic and turn boundaries • Effects of topic and turn cues on tone realization • Child-directed speech & tone learning • Support for computer-assisted tone learning • Structured sequence models for tone • Sub-syllable segmentation & modeling • Feature assessment • Band energy and intensity in tone recognition
Thanks • Dinoj Surendran, Siwei Wang, Yi Xu • Natasha Govender and Etienne Barnard • V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin • This work supported by NSF Grant #0414919 • http://people.cs.uchicago.edu/~levow/tai