Unsupervised and Semi-Supervised Learning of Tone and Pitch Accent Gina-Anne Levow University of Chicago June 6, 2006
Roadmap • Challenges for Tone and Pitch Accent • Variation and Learning • Data collections & processing • Learning with less • Semi-supervised learning • Unsupervised clustering • Approaches, structure, and context • Conclusion
Challenges: Tone and Variation • Tone and Pitch Accent Recognition • Key component of language understanding • Lexical tone carries word meaning • Pitch accent carries semantic, pragmatic, and discourse meaning • Non-canonical form (Shen 90, Shih 00, Xu 01) • Tonal coarticulation modifies surface realization • In extreme cases, a fall becomes a rise • Tone is relative • To speaker range: high for a male speaker may be low for a female speaker • To phrase range and other tones, e.g. downstep
Challenges: Training Demands • Tone and pitch accent recognition • Exploit data-intensive machine learning • SVMs (Thubthong 01, Levow 05, SLX05) • Boosted and bagged decision trees (X. Sun, 02) • HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04, …) • Can achieve good results with large sample sets • ~10K lab syllable samples -> > 90% accuracy • Training data expensive to acquire • Time: pitch accent labeling takes 10s of times real-time • Money: requires skilled labelers • Limits investigation across domains, styles, etc. • Human language acquisition doesn't use labels
Strategy: Training • Challenge: • Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data? • Exploit semi-supervised and unsupervised learning • Semi-supervised Laplacian SVM • K-means and asymmetric k-lines clustering • Substantially outperform baselines • Can approach supervised levels
Data Collections I: English • English: (Ostendorf et al, 95) • Boston University Radio News Corpus, f2b • Manually ToBI annotated, aligned, syllabified • Pitch accent aligned to syllables • 4-way: Unaccented, High, Downstepped High, Low • (Sun 02, Ross & Ostendorf 95) • Binary: Unaccented vs Accented
Data Collections II: Mandarin • Mandarin: • Lexical tones: • High, Mid-rising, Low, High falling, Neutral
Data Collections III: Mandarin • Mandarin Chinese: • Lab speech data: (Xu, 1999) • 5 syllable utterances: vary tone, focus position • In-focus, pre-focus, post-focus • TDT2 Voice of America Mandarin Broadcast News • Automatically force aligned to anchor scripts • Automatically segmented, pinyin pronunciation lexicon • Manually constructed pinyin-ARPABET mapping • CU Sonic – language porting • 4-way: High, Mid-rising, Low, High falling
Local Feature Extraction • Motivated by the Pitch Target Approximation Model • Tone/pitch accent target exponentially approached • Linear target: height, slope (Xu et al, 99) • Scalar features (see the sketch below): • Pitch, intensity max, mean (Praat, speaker-normalized) • Pitch at 5 points across the voiced region • Duration • Initial, final position in phrase • Slope: • Linear fit to the last half of the pitch contour
Context Features • Local context: • Extended features • Pitch max, mean, adjacent points of the adjacent syllable • Difference features w.r.t. the adjacent syllable • Difference between • Pitch max, mean, mid, slope • Intensity max, mean • Phrasal context: • Compute collection-average phrase slope • Compute scalar pitch values, adjusted for slope (see the sketch below)
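As a concrete illustration of the local features above, here is a minimal NumPy sketch of speaker-normalized pitch and intensity statistics, the 5-point pitch sampling, and the linear slope fit to the last half of the contour. The function name, argument layout, and the assumption that pitch and intensity contours have already been extracted (e.g. with Praat) are mine, not from the talk.

```python
import numpy as np

def local_features(pitch, intensity, duration, speaker_mean, speaker_std):
    """Illustrative local feature vector for one syllable.

    pitch, intensity: arrays sampled over the voiced region (e.g. from Praat).
    speaker_mean, speaker_std: per-speaker pitch statistics for normalization.
    """
    # Speaker-normalize the pitch contour (z-score).
    p = (np.asarray(pitch, dtype=float) - speaker_mean) / speaker_std
    i = np.asarray(intensity, dtype=float)

    # Pitch sampled at 5 evenly spaced points across the voiced region.
    idx = np.linspace(0, len(p) - 1, 5).round().astype(int)
    pitch_points = p[idx]

    # Slope: linear fit to the last half of the pitch contour.
    half = p[len(p) // 2:]
    t = np.linspace(0.0, 1.0, len(half))
    slope = np.polyfit(t, half, 1)[0]

    return np.concatenate([
        [p.max(), p.mean()],      # pitch max, mean
        [i.max(), i.mean()],      # intensity max, mean
        pitch_points,             # 5-point pitch shape
        [duration, slope],        # duration and final-half slope
    ])
```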
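A small sketch of the two kinds of context features described above: difference features with respect to an adjacent syllable and phrase-slope compensation. The field names and the exact compensation arithmetic are assumptions for illustration.

```python
import numpy as np

def difference_features(cur, prev):
    """Difference features w.r.t. the preceding syllable (hypothetical field names)."""
    return np.array([
        cur["pitch_max"] - prev["pitch_max"],
        cur["pitch_mean"] - prev["pitch_mean"],
        cur["pitch_mid"] - prev["pitch_mid"],
        cur["slope"] - prev["slope"],
        cur["int_max"] - prev["int_max"],
        cur["int_mean"] - prev["int_mean"],
    ])

def compensate_phrase_slope(times, pitches, avg_phrase_slope):
    """Subtract the collection-average phrase slope from scalar pitch values.

    times: syllable positions (e.g. midpoint times) within the phrase.
    """
    times = np.asarray(times, dtype=float)
    return np.asarray(pitches, dtype=float) - avg_phrase_slope * times
```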
Experimental Configuration • English Pitch Accent: • Proportionally sampled: 1000 examples • 4-way and binary classification • Contextualization representation, preceding syllables • Mandarin Tone: • Balanced tone sets: 400 examples • Vary data set difficulty: clean lab -> broadcast • 4 tone classification • Simple local pitch only features • Prior lab speech experiments effective with local features
Semi-supervised Learning • Approach: • Employ a small amount of labeled data • Exploit information from additional, presumably more readily available, unlabeled data • Few prior examples: EM, co- and self-training (Ostendorf '05) • Classifier: • Laplacian SVM (Sindhwani, Belkin & Niyogi '05) • Semi-supervised variant of the SVM • Exploits unlabeled examples • RBF kernel, typically 6 nearest neighbors
Experiments • Pitch accent recognition: • Binary classification: Unaccented/Accented • 1000 instances, proportionally sampled • Labeled training: 200 unaccented, 100 accented • >80% accuracy (cf. 84% for a supervised SVM with 15x the labeled data) • Mandarin tone recognition: • 4-way classification via n(n-1)/2 pairwise binary classifiers (see the sketch below) • 400 instances, balanced; 160 labeled • Clean lab speech, in-focus syllables: 94% • cf. 99% for a supervised SVM with 1000s of training samples; 85% for an SVM with 160 training samples • Broadcast news: 70% • cf. <50% for a supervised SVM with 160 training samples; 74% with 4x the training data
Unsupervised Learning • Question: • Can we identify the tone structure of a language from the acoustic space without training? • Analogous to language acquisition • Significant recent research in unsupervised clustering • Established approaches: k-means • Spectral clustering: eigenvector decomposition of an affinity matrix • (Shi & Malik 2000, Fischer & Poland 2004, Belkin, Niyogi & Sindhwani 2004) • Little research for tone • Self-organizing maps (Gauthier et al, 2005) • Tones identified in lab speech using f0 velocities
Unsupervised Pitch Accent • Pitch accent clustering: • 4-way distinction: 1000 samples, proportionally sampled • 2-16 clusters constructed • Assign the most frequent class label to each cluster (see the labeling sketch below) • Learner: • Asymmetric k-lines clustering (Fischer & Poland '05): • Context-dependent kernel radii, non-spherical clusters • > 78% accuracy • Context effects: • Feature vectors with and without context perform comparably
Contrasting Clustering • Approaches: • 3 spectral approaches: • Asymmetric k-lines (Fischer & Poland 2004) • Symmetric k-lines (Fischer & Poland 2004) • Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004) • Binary weights, k-lines clustering • K-means: standard Euclidean distance • # of clusters: 2-16 • Best results: > 78% • 2 clusters: asymmetric k-lines best; > 2 clusters: k-means best • With larger numbers of clusters, the methods perform more similarly
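The 4-way tone task above is decomposed into n(n-1)/2 pairwise binary classifiers. The sketch below shows that decomposition with majority voting, using scikit-learn's SVC as a stand-in binary learner, since the Laplacian SVM used in the talk is not part of standard libraries; the function name and parameters are hypothetical.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC  # stand-in binary learner; the talk uses a Laplacian SVM

def one_vs_one_fit_predict(X_train, y_train, X_test, classes):
    """n(n-1)/2 pairwise binary classifiers with majority voting.

    X_train, y_train, X_test: NumPy arrays; classes: list of class labels.
    """
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for a, b in combinations(range(len(classes)), 2):
        # Train one binary classifier on the examples of classes a and b only.
        mask = np.isin(y_train, [classes[a], classes[b]])
        clf = SVC(kernel="rbf").fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)
        for c in (a, b):
            votes[:, c] += (pred == classes[c])
    # Each test point is assigned the class that wins the most pairwise votes.
    return np.array(classes)[votes.argmax(axis=1)]
```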
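The cluster evaluation above maps each cluster to its most frequent true label and scores the induced labeling. A minimal sketch (hypothetical function name):

```python
import numpy as np
from collections import Counter

def label_clusters(cluster_ids, true_labels):
    """Assign each cluster its most frequent true label; return predictions and accuracy."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    mapping = {}
    for c in np.unique(cluster_ids):
        # Majority label among the points falling in cluster c.
        mapping[c] = Counter(true_labels[cluster_ids == c]).most_common(1)[0][0]
    predicted = np.array([mapping[c] for c in cluster_ids])
    accuracy = (predicted == true_labels).mean()
    return predicted, accuracy
```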
Tone Clustering • Mandarin four tones: • 400 samples: balanced • 2-phase clustering: 2-3 clusters each • Asymmetric k-lines • Clean read speech: • In-focus syllables: 87% (cf. 99% supervised) • In-focus and pre-focus: 77% (cf. 93% supervised) • Broadcast news: 57% (cf. 74% supervised) • Contrast: • K-means: In-focus syllables: 74.75% • Requires more clusters to reach asymm. k-lines level
Tone Structure • First phase of clustering splits high/rising from low/falling tones by slope • Second phase splits by pitch height or slope
Conclusions • Exploiting unlabeled examples for tone and pitch accent • Semi-supervised and unsupervised approaches • Best cases approach supervised levels with less training data • Leveraging both labeled & unlabeled examples works best • Both spectral approaches and k-means are effective • Contextual information is less well-exploited than in the supervised case • Exploit the acoustic structure of the tone and accent space
Future Work • Additional languages, tone inventories • Cantonese: 6 tones • Bantu family languages: truly rare data • Language acquisition • Use of child-directed speech as input • Determination of the number of clusters
Thanks • V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C-C. Cheng & C. Lin • Dinoj Surendran, Siwei Wang, Yi Xu • This work supported by NSF Grant #0414919 • http://people.cs.uchicago.edu/~levow/tai
Spectral Clustering in a Nutshell • Basic spectral clustering • Build affinity matrix • Determine dominant eigenvectors and eigenvalues of the affinity matrix • Compute clustering based on them • Approaches differ in: • Affinity matrix construction • Binary weights, conductivity, heat weights • Clustering: cut, k-means, k-lines
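A compact sketch of the "nutshell" recipe above: heat-kernel affinity, dominant eigenvectors of the normalized affinity matrix, then clustering in the embedded space (k-means here; the talk's variants use k-lines). The kernel width and normalization choices are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, sigma=1.0):
    """Minimal spectral clustering: affinity -> dominant eigenvectors -> k-means."""
    # 1. Affinity matrix with a fixed-width Gaussian (heat) kernel.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # 2. Symmetrically normalized affinity and its dominant eigenvectors.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = D_inv_sqrt @ A @ D_inv_sqrt
    vals, vecs = eigh(L_sym)                 # eigenvalues in ascending order
    top = vecs[:, -n_clusters:]              # eigenvectors with the largest eigenvalues
    top = top / np.linalg.norm(top, axis=1, keepdims=True)

    # 3. Cluster the embedded points.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(top)
```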
K-Lines Clustering Algorithm • Due to Fischer & Poland 2005 • 1. Initialize vectors m1 ... mK (e.g. randomly, or as the first K eigenvectors of the spectral data yi) • 2. For j = 1 ... K: define Pj as the set of indices of all points yi that are closest to the line defined by mj, and create the matrix Mj whose columns are the corresponding vectors yi, i in Pj • 3. Compute the new value of every mj as the first eigenvector of Mj Mj^T • 4. Repeat from 2 until the mj's do not change
Asymmetric Clustering • Replace the Gaussian kernel of fixed width with a context-dependent kernel whose radius varies per point (Fischer & Poland TR-IDSIA-12-04, p. 12) • Neighborhood parameter tau = 2d+1 or 10; results largely insensitive to tau
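A runnable NumPy version of the k-lines loop above. The random initialization, iteration cap, and convergence test are my additions.

```python
import numpy as np

def k_lines(Y, K, n_iter=100, seed=0):
    """K-lines clustering (after Fischer & Poland): assign points to nearest line through the origin.

    Y: (n, d) array of spectral-embedding points y_i.
    Returns cluster assignments and the line direction vectors m_j (rows of M).
    """
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    # 1. Initialize unit vectors m_1 ... m_K randomly.
    M = rng.standard_normal((K, d))
    M /= np.linalg.norm(M, axis=1, keepdims=True)

    assign = -np.ones(n, dtype=int)
    for _ in range(n_iter):
        # 2. Assign each y_i to the closest line: minimal distance to the line
        #    spanned by m_j equals maximal |y_i . m_j| for unit m_j.
        new_assign = np.abs(Y @ M.T).argmax(axis=1)
        if np.array_equal(new_assign, assign):
            break                              # 4. stop when assignments no longer change
        assign = new_assign
        # 3. Re-estimate each m_j as the first eigenvector of M_j M_j^T.
        for j in range(K):
            Pj = Y[assign == j]
            if len(Pj) == 0:
                continue
            w, V = np.linalg.eigh(Pj.T @ Pj)   # Pj.T @ Pj == M_j M_j^T
            M[j] = V[:, -1]                    # eigenvector with the largest eigenvalue
    return assign, M
```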
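A hedged sketch of a context-dependent affinity in the spirit of the slide: each point gets its own kernel radius from its tau-th nearest neighbor, which makes the affinity asymmetric. This is one plausible reading of the construction, not necessarily Fischer & Poland's exact definition.

```python
import numpy as np

def context_dependent_affinity(X, tau=10):
    """Gaussian affinity with per-point radii (assumed local-scaling form).

    sigma_i is taken as the distance from x_i to its tau-th nearest neighbor,
    with tau = 2d+1 or 10 as on the slide.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dists = np.sqrt(sq)
    # Per-point radius: distance to the tau-th nearest neighbor
    # (column 0 of the sorted distances is the point itself).
    sigma = np.sort(dists, axis=1)[:, min(tau, len(X) - 1)]
    # Asymmetric affinity: row i is scaled by sigma_i only.
    A = np.exp(-sq / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(A, 0.0)
    return A
```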
Laplacian SVM • Manifold regularization framework • Hypothesize that the intrinsic (true) data lies on a low-dimensional manifold • Ambient (observed) data lies in a possibly high-dimensional space • Preserves locality: • Points close in the ambient space should be close in the intrinsic space • Use labeled and unlabeled data to warp the function space • Run an SVM on the warped space
Input: l labeled and u unlabeled examples • Output: estimated function f • Algorithm: • Construct the adjacency graph; compute the graph Laplacian L • Choose a kernel K(x,y); compute the Gram matrix K • Compute the expansion coefficients alpha* = (2 gamma_A I + 2 (gamma_I / (l+u)^2) L K)^(-1) J^T Y beta*, where beta* solves the standard SVM dual problem (J selects the labeled points, Y holds their labels) • And classify with f*(x) = sum_i alpha_i* K(x_i, x)
Current and Future Work • Interactions of tone and intonation • Recognition of topic and turn boundaries • Effects of topic and turn cues on tone realization • Child-directed speech & tone learning • Support for computer-assisted tone learning • Structured sequence models for tone • Sub-syllable segmentation & modeling • Feature assessment • Band energy and intensity in tone recognition
Related Work • Tonal coarticulation: • Xu & Sun, 02; Xu, 97; Shih & Kochanski, 00 • English pitch accent • X. Sun, 02; Hasegawa-Johnson et al, 04; Ross & Ostendorf, 95 • Lexical tone recognition • SVM recognition of Thai tone: Thubthong 01 • Context-dependent tone models • Wang & Seneff 00, Zhou et al 04
Pitch Target Approximation Model • Pitch target: linear model T(t) = a + b*t (height and slope) • Surface F0 exponentially approaches the target: f0(t) = T(t) + c*e^(-lambda*t) • In practice, assume the target is well-approximated by its mid-point value (Sun, 02)
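Laplacian SVM itself needs a QP solver for beta*, so as an executable illustration of the same manifold-regularization framework, here is a sketch of its closed-form sibling, Laplacian Regularized Least Squares (LapRLS): RBF kernel, binary-weight 6-nearest-neighbor graph Laplacian, and the regularized solve. Hyperparameter values and function names are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, gamma=1.0):
    return np.exp(-gamma * cdist(A, B, "sqeuclidean"))

def knn_laplacian(X, k=6):
    """Unnormalized graph Laplacian of a symmetrized k-nearest-neighbor graph (binary weights)."""
    D = cdist(X, X)
    W = np.zeros_like(D)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]     # skip column 0 (the point itself)
    for i, idx in enumerate(nn):
        W[i, idx] = 1.0
    W = np.maximum(W, W.T)                     # symmetrize
    return np.diag(W.sum(1)) - W

def lap_rls_fit(X, y_labeled, gamma_A=1e-2, gamma_I=1e-1, k=6, gamma_rbf=1.0):
    """Laplacian Regularized Least Squares (manifold regularization with squared loss).

    X: all (l+u) examples, labeled ones first; y_labeled: +/-1 labels for the first l rows.
    """
    n, l = len(X), len(y_labeled)
    K = rbf_kernel(X, X, gamma_rbf)
    L = knn_laplacian(X, k)
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])   # selects the labeled points
    Y = np.r_[np.asarray(y_labeled, float), np.zeros(n - l)]
    alpha = np.linalg.solve(
        J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K), Y)
    return alpha

def lap_rls_predict(alpha, X_train, X_new, gamma_rbf=1.0):
    return np.sign(rbf_kernel(X_new, X_train, gamma_rbf) @ alpha)
```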
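A tiny sketch of the target-approximation form reconstructed above, plus the mid-point summary used in practice; the parameterization is an assumption consistent with the slide, not a quote of the model's equations.

```python
import numpy as np

def f0_target_approximation(t, a, b, c, lam):
    """Surface F0 exponentially approaching a linear pitch target T(t) = a + b*t
    (assumed form; parameter names are mine)."""
    t = np.asarray(t, dtype=float)
    target = a + b * t
    return target + c * np.exp(-lam * t)

def target_midpoint(a, b, syllable_duration):
    """In practice (Sun, 02): summarize the target by its value at the syllable mid-point."""
    return a + b * (syllable_duration / 2.0)
```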
Classification Experiments • Classifier: Support Vector Machine • Linear kernel • Multiclass formulation • SVMlight (Joachims), LibSVM (Cheng & Lin 01) • 4:1 training / test splits • Experiments: Effects of • Context position: preceding, following, none, both • Context encoding: Extended/Difference • Context type: local, phrasal
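A minimal scikit-learn sketch of the supervised configuration above: linear-kernel SVM (SVC trains the n(n-1)/2 pairwise classifiers internally) with a 4:1 train/test split. The feature matrix and label names are placeholders.

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X: per-syllable feature vectors (local + context), y: tone / pitch-accent labels.
def run_svm_experiment(X, y, seed=0):
    # 4:1 training/test split, stratified by class.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    # Linear kernel; SVC handles multiclass via one-vs-one pairwise classifiers.
    clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```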
Discussion: Local Context • Any context information improves over none • Preceding context information consistently improves over none or over following context information • English: generally, more context features are better • Mandarin: following context can degrade performance • Little difference between encodings (Extended vs Difference) • Consistent with the phonological analysis (Xu) that carryover coarticulation is greater than anticipatory coarticulation
Results & Discussion: Phrasal Context • Phrase contour compensation enhances recognition • Simple strategy • Use of non-linear slope compensation may improve results further
Context: Summary • Employ common acoustic representation • Tone (Mandarin), pitch accent (English) • SVM classifiers - linear kernel: 76%, 81% • Local context effects: • Up to > 20% relative reduction in error • Preceding context greatest contribution • Carryover vs anticipatory • Phrasal context effects: • Compensation for phrasal contour improves recognition
Aside: More Tones • Cantonese: • CUSENT corpus of read broadcast news text • Same feature extraction & representation • 6 tones: • High level, high rise, mid level, low fall, low rise, low level • SVM classification: • Linear kernel: 64%, Gaussian kernel: 68% • Tones 3 and 6: 50%, mutually indistinguishable (50% pairwise) • Human levels: no context: 50%; with context: 68% • Augment with the syllable's phone sequence • 86% accuracy; for 90% of syllables with tone 3 or 6, one of the two tones dominates
Aside: Voice Quality & Energy • By Dinoj Surendran • Assess local voice quality and energy features for tone • Not typically associated with Mandarin • Considered: • VQ: NAQ, AQ, etc.; spectral balance; spectral tilt; band energy • Useful: band energy significantly improves recognition • Especially for the neutral tone • Supports identification of unstressed syllables • Spectral balance predicts stress in Dutch
Roadmap • Challenges for Tone and Pitch Accent • Contextual effects • Training demands • Modeling Context for Tone and Pitch Accent • Data collections & processing • Integrating context • Context in Recognition • Reducing Training demands • Data collections & structure • Semi-supervised learning • Unsupervised clustering • Conclusion
Strategy: Context • Exploit contextual information • Features from adjacent syllables • Height, shape: direct, relative • Compensate for phrase contour • Analyze impact of • Context position, context encoding, context type • > 20% relative improvement over no context