Prosody in Spoken Language Understanding. Gina Anne Levow University of Chicago January 4, 2008 NLP Winter School 2008. U: Give me the price for AT&T. U: Give me the price for AT&T. U: Give me the price for AT&T. U: Give me the price for American Telephone and Telegraph.
Prosody in Spoken Language Understanding Gina Anne Levow University of Chicago January 4, 2008 NLP Winter School 2008
U: Give me the price for AT&T. • U: Give me the price for AT&T. • U: Give me the price for AT&T. • U: Give me the price for American Telephone and Telegraph.
Error Correction Spiral • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday. • U: Give me the price for AT&T. • S: Data General was at 10 ½ up a quarter. • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½. • U: Give me the price for American Telephone and Telegraph. • S: Excuse me?
Roadmap • Corrections: A motivating example • Defining prosody • Why prosody? • Challenges in prosody • Prosody in language understanding • Recognizing tone and pitch accent • Spoken corrections, Topic segmentation • Conclusions
Defining Prosody • Prosody • Phonetic phenomena in speech that span more than a single segment (“suprasegmental”) • Prosody includes: • Stress, focus, tone, intonation, length/pause, rhythm • Prosodic features include: • Pitch: perceptual correlate of fundamental frequency • f0: rate of vocal fold vibration • Loudness/intensity, duration, segment quality
Why Prosody? • Prosody plays a crucial role • At all levels of language • Lexical, syntactic, pragmatic/discourse • Establishes meaning • Disambiguates sense and structure • Across language families • Common physiological, articulatory basis • In synthesis and recognition of fluent speech
Prosody and the Lexicon • Lexical: Determines word identity • Prosodic effect at the syllable level (minimal unit) • Lexical stress: syllable prominence • Combination of length, pitch movement, loudness • REcord (N) vs reCORD (V) • Pitch accent can differentiate words in some languages • Lexical tone: tone languages, e.g. Chinese, Punjabi • Pitch height (register) and/or shape (contour) Ma (high): mother Ma (rising): hemp Ma (low): horse Ma (falling): scold
Prosody and Syntax • Prosody can disambiguate structure • Associated with chunking and attachment • Not identical with syntactic phrase boundaries • “Prosody is predictable from syntax, except when it isn’t” • Prosodic phrasing indicated by: • Some combination of pause, change in pitch
Chunking, or “phrasing” A1: I met Mary and Elena’s mother at the mall yesterday. A2: I met Mary and Elena’s mother at the mall yesterday. Example from Jennifer Venditti
Punctuation & Prosody Humor • A panda goes into a restaurant and has a meal. Just before he leaves he takes out a gun and fires it. The irate restaurant owner says ‘Why did you do that?’ The panda replies, ‘I'm a panda. Look it up.’ The restaurateur goes to his dictionary and under ‘panda’ finds: ‘black and white arboreal, bear-like creatures; eats, shoots and leaves.’
Prosody in Pragmatics & Discourse • Focus: • Prominence, new information: pitch accent • “October eleventh” • Sentence type, dialogue act: • Statement vs. declarative question: “It’s raining (?)” • Discourse Structure (Topic), Emotion [Example from Shih, Prosody Learning and Generation]
Challenges in Prosody I • Highly variable • Actual realization differs from ideal • Speaker variation: • Gender, vocal tract differences, idiosyncrasy • Tonal coarticulation • Neighboring tones influence each other (as with segmental coarticulation) • Underlying fall can become rise • Parallel encoding • Effects at multiple levels realized simultaneously
Challenges in Prosody II • Challenges for learning • Lack of training data • Sparseness: • Many prosodic phenomena are infrequent • E.g., non-declarative utterances, topic boundaries, contrastive accents, etc • Challenging for machine learning methods • Costs of labeling: • Many prosodic events require expert labeling • Need large corpus to attest • Time-consuming, expensive
Context and Learning in Multilingual Tone and Pitch Accent Recognition
Strategy: Context • Common model across languages • Pure acoustic-prosodic model • No word label, POS, lexical stress info • English, Mandarin Chinese (also Cantonese, isiZulu) • Exploit contextual information • Features from adjacent syllables, phrase contour • Analyze impact of • Context position, context encoding, context type • > 12.5% reduction in error over no context
Data Collections • English: (Ostendorf et al, 95) • Boston University Radio News Corpus, f2b • Manually annotated, aligned, syllabified • 4 Pitch accent labels, aligned to syllables • Mandarin: • TDT2 Voice of America Mandarin Broadcast News • Automatically aligned, syllabified • 4 main tones, neutral
Local Feature Extraction • Uniform representation for tone, pitch accent • Motivated by Pitch Target Approximation Model • Tone/pitch accent target exponentially approached • Linear target: height, slope (Xu et al, 99) • Base features: • Pitch, Intensity max, mean, min, range • (Praat, speaker normalized) • Pitch at 5 points across voiced region • Duration • Initial, final in phrase • Slope: • Linear fit to last half of pitch contour
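The base pitch features listed above can be sketched roughly as follows. This is a minimal illustration, not the original extraction pipeline: it assumes the pitch track has already been extracted (e.g. with Praat) and speaker-normalized, and the names `linear_slope` and `syllable_features` are hypothetical.

```python
def linear_slope(values):
    """Least-squares slope of values against their frame index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def syllable_features(pitch, n_points=5):
    """Base pitch features for the voiced region of one syllable.

    `pitch` is a list of speaker-normalized f0 values, one per frame
    (an assumed input format).
    """
    n = len(pitch)
    # Sample the contour at n_points evenly spaced positions.
    points = [pitch[round(k * (n - 1) / (n_points - 1))] for k in range(n_points)]
    return {
        "max": max(pitch),
        "min": min(pitch),
        "mean": sum(pitch) / n,
        "range": max(pitch) - min(pitch),
        "points": points,
        # Slope from a linear fit to the last half of the pitch contour.
        "slope": linear_slope(pitch[n // 2:]),
    }


# A flat-then-rising toy contour.
rising = [1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feats = syllable_features(rising)
```

Intensity features (max, mean, min, range) would follow the same pattern on an intensity track; duration and phrase position are omitted here.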
Context Features • Local context: • Extended features • Pitch max, mean, adjacent points of preceding, following syllables • Difference features • Difference between • Pitch max, mean, mid, slope • Intensity max, mean • Of preceding, following and current syllable • Phrasal context: • Compute collection average phrase slope • Compute scalar pitch values, adjusted for slope
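The two local-context encodings can be contrasted in a short sketch. It assumes each syllable is already summarized as a dictionary of scalar features. The function name, the key names, and the sign convention of the differences are illustrative assumptions, not taken from the slides.

```python
def context_features(prev, cur, nxt, keys=("max", "mean", "slope")):
    """Build extended and difference features for the current syllable.

    prev, cur, nxt: dicts of scalar features for the preceding, current,
    and following syllable (hypothetical representation).
    """
    feats = {}
    for k in keys:
        # Extended encoding: copy the neighbors' raw values.
        feats[f"ext_prev_{k}"] = prev[k]
        feats[f"ext_next_{k}"] = nxt[k]
        # Difference encoding: change from the preceding syllable into the
        # current one, and from the current one into the following one.
        feats[f"diff_prev_{k}"] = cur[k] - prev[k]
        feats[f"diff_next_{k}"] = nxt[k] - cur[k]
    return feats


prev_syl = {"max": 5.0, "mean": 4.0, "slope": 0.5}
cur_syl = {"max": 6.0, "mean": 5.0, "slope": -1.0}
next_syl = {"max": 4.0, "mean": 3.0, "slope": 0.0}
ctx = context_features(prev_syl, cur_syl, next_syl)
```

Restricting the output to the `prev` or `next` keys gives the preceding-only and following-only conditions.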
Classification Experiments • Classifier: Support Vector Machine • Linear kernel • Multiclass formulation • SVMlight (Joachims), LIBSVM (Chang & Lin 01) • 4:1 training / test splits • Experiments: Effects of • Context position: preceding, following, none, both • Context encoding: Extended/Difference • Context type: local, phrasal
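A 4:1 training/test split can be sketched as below. The slides do not say whether the splits were random or stratified, so the seeded shuffle and the function name `split_4_to_1` are assumptions made for illustration only.

```python
import random


def split_4_to_1(instances, seed=0):
    """Shuffle instances reproducibly and split them 4:1 into train and test."""
    items = list(instances)
    random.Random(seed).shuffle(items)
    cut = len(items) * 4 // 5  # four fifths go to training
    return items[:cut], items[cut:]


train_set, test_set = split_4_to_1(range(1000))
```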
Discussion: Local Context • Any context information improves over none • Preceding context information consistently improves over none or following context information • English: Generally more context features are better • Mandarin: Following context can degrade • Little difference in encoding (Extend vs Diffs) • Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory
Results & Discussion: Phrasal Context • Phrase contour compensation enhances recognition • Simple strategy • Use of non-linear slope compensation may improve results further
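One reading of the simple phrase contour compensation is: fit a line to each phrase's pitch values, average the slopes over the collection, and subtract that average trend from each pitch value according to its position in the phrase. The sketch below follows that reading. The function names and the `(times, values)` input format are assumptions.

```python
def fit_slope(times, values):
    """Least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    den = sum((t - mean_t) ** 2 for t in times)
    if den == 0:
        return 0.0
    return sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / den


def average_phrase_slope(phrases):
    """Average per-phrase slope over a collection of (times, values) phrases."""
    slopes = [fit_slope(times, values) for times, values in phrases]
    return sum(slopes) / len(slopes)


def compensate(times, values, avg_slope):
    """Remove the collection-average linear trend, relative to the phrase start."""
    start = times[0]
    return [v - avg_slope * (t - start) for t, v in zip(times, values)]


phrases = [([0, 1, 2, 3], [10, 9, 8, 7]), ([0, 1, 2], [8, 5, 2])]
avg = average_phrase_slope(phrases)             # slopes -1 and -3 average to -2
adjusted = compensate([0, 1, 2, 3], [10, 9, 8, 7], avg)
```

A non-linear compensation would replace the straight line with another declination model. The slides only raise that as a possibility.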
Strategy: Training • Challenge: • Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data? • Exploit semisupervised and unsupervised learning • Semi-supervised Laplacian SVM • K-means and asymmetric k-lines clustering • Substantially outperform baselines • Can approach supervised levels
Semi-supervised Learning • Approach: • Employ small amount of labeled data • Exploit information from additional, presumably more available, unlabeled data • Few prior examples: several weakly supervised: (Wong et al, ’05) • Classifier: • Laplacian SVM (Sindhwani, Belkin & Niyogi ’05) • Semi-supervised variant of SVM • Exploits unlabeled examples • RBF kernel, typically 6 nearest neighbors, transductive
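The "6 nearest neighbors" setting refers to the neighborhood graph over labeled and unlabeled points, whose graph Laplacian regularizes the classifier. The sketch below covers only that graph-construction step, not the Laplacian SVM itself. Binary edge weights and symmetrization are assumptions, and `knn_graph_laplacian` is a hypothetical name.

```python
import math


def knn_graph_laplacian(points, k=6):
    """Return the unnormalized graph Laplacian L = D - W of a k-NN graph.

    points: list of equal-length numeric tuples (labeled and unlabeled together).
    Edges have binary weights and are symmetrized: i and j are connected if
    either one is among the other's k nearest neighbors.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            W[i][j] = W[j][i] = 1.0
    # Diagonal entries hold the degree; off-diagonal entries are -W[i][j].
    return [
        [(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]


# Four points on a line, with k=1 to keep the graph easy to inspect.
lap = knn_graph_laplacian([(0.0,), (1.0,), (2.0,), (10.0,)], k=1)
```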
Experiments • Pitch accent recognition: • Binary classification: Unaccented/Accented • 1000 instances, proportionally sampled • Labeled training: 200 unacc, 100 acc • 80% accuracy (cf. 84% w/15x labeled SVM) • Mandarin tone recognition: • 4-way classification: n(n-1)/2 binary classifiers • 400 instances: balanced; 160 labeled • Clean lab speech, in-focus: 94% • cf. 99% w/SVM, 1000s train; 85% w/SVM 160 training samples • Broadcast news: 70% • cf. < 50% w/SVM 160 training samples
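For the 4-way tone task, n(n-1)/2 gives 4·3/2 = 6 pairwise binary classifiers. A common way to combine them is majority voting, sketched below. The voting rule and the toy distance-based classifiers are assumptions standing in for the trained binary SVMs.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, binary_classifiers, x):
    """Combine pairwise classifiers by majority vote.

    binary_classifiers maps each class pair (a, b) to a function that
    returns either a or b for input x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_classifiers[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]


tones = [1, 2, 3, 4]
pairs = list(combinations(tones, 2))

# Toy stand-ins: each pairwise "classifier" picks the class nearer to x.
toy = {
    (a, b): (lambda x, a=a, b=b: a if abs(x - a) <= abs(x - b) else b)
    for a, b in pairs
}
predicted = one_vs_one_predict(tones, toy, 3.2)
```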
Unsupervised Learning • Question: • Can we identify the tone structure of a language from the acoustic space without training? • Analogous to language acquisition • Significant recent research in unsupervised clustering • Established approaches: k-means • Spectral clustering (Shi & Malik ’97, Fischer & Poland 2004): asymmetric k-lines • Little research for tone • Self-organizing maps (Gauthier et al, 2005) • Tones identified in lab speech using f0 velocities • Cluster-based bootstrapping (Narayanan et al, 2006) • Prominence clustering (Tamburini ’05)
Contrasting Clustering • Contrasts: • Clustering: 2-16 clusters, label w/most freq class • 3 Spectral approaches: • Perform spectral decomposition of affinity matrix • Asymmetric k-lines (Fischer & Poland 2004) • Symmetric k-lines (Fischer & Poland 2004) • Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004) • Binary weights, k-lines clustering • K-means: Standard Euclidean distance • # of clusters: 2-16 • Best results: > 78% • 2 clusters: asymmetric k-lines; > 2 clusters: k-means • Larger # clusters: all similar
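Scoring a clustering by labeling each cluster with its most frequent class amounts to the computation below. This is a minimal sketch; `cluster_accuracy` is a hypothetical name and the toy labels are invented for illustration.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, true_labels):
    """Label each cluster with its most frequent true class, then score.

    Returns the fraction of instances whose true class matches the
    majority class of their cluster.
    """
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(true_labels)


clusters = [0, 0, 0, 1, 1, 1]
gold = ["acc", "acc", "unacc", "unacc", "unacc", "unacc"]
accuracy = cluster_accuracy(clusters, gold)  # 5 of 6 instances match
```

This score can only rise as the number of clusters grows, which is worth keeping in mind when comparing 2 clusters against 16.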
Tone Clustering: I • Mandarin four tones: • 400 samples: balanced • 2-phase clustering: 2-5 clusters each • Asymmetric k-lines, k-means clustering • Clean read speech: • In-focus syllables: 87% (cf. 99% supervised) • In-focus and pre-focus: 77% (cf. 93% supervised) • Broadcast news: 57% (cf. 74% supervised) • K-means requires more clusters to reach k-lines level
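Two-phase clustering can be sketched as: cluster all samples, then cluster again within each first-phase cluster. The sketch uses plain k-means with a deterministic initialization in both phases. It does not implement asymmetric k-lines, and the function names and toy data are assumptions.

```python
import math


def kmeans(points, k, iters=25):
    """Plain k-means with Euclidean distance; returns a cluster index per point."""
    # Deterministic initialization for the sketch: k points spread through the list.
    step = max(1, len(points) // k)
    centers = [points[min(i * step, len(points) - 1)] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def two_phase(points, k1=2, k2=2):
    """Cluster once, then re-cluster inside each first-phase cluster."""
    first = kmeans(points, k1)
    labels = [None] * len(points)
    for c in range(k1):
        idx = [i for i, a in enumerate(first) if a == c]
        if not idx:
            continue
        second = kmeans([points[i] for i in idx], min(k2, len(idx)))
        for i, s in zip(idx, second):
            labels[i] = (c, s)
    return labels


# Toy 2-D samples forming four well-separated groups.
samples = [(0, 0), (0, 1), (0, 10), (0, 11), (100, 0), (100, 1), (100, 10), (100, 11)]
labels = two_phase(samples)
```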
Tone Structure First phase of clustering splits high/rising from low/falling by slope Second phase by pitch height
Conclusions • Common prosodic framework for tone and pitch accent recognition • Contextual modeling enhances recognition • Local context and broad phrase contour • Carryover coarticulation has larger effect for Mandarin • Exploiting unlabeled examples for recognition • Semi- and Un-supervised approaches • Best cases approach supervised levels with less training • Exploits acoustic structure of tone and accent space
Error Correction Spiral • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday. • U: Give me the price for AT&T. • S: Data General was at 10 ½ up a quarter. • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½. • U: Give me the price for American Telephone and Telegraph. • S: Excuse me?
Recognizing Spoken Corrections • Spoken Corrections • Recognize user attempts to correct ASR failures • Compare original input to repeat corrections • Significant differences: • Corrections: increases in duration, pause #/length, final fall • Increases in pitch accent for misrecognitions • Automatic recognition with decision trees, boosting • Distinguish corrective/not (human level) • Key features: raw/normalized duration, pause • Identify specific word being corrected • Key features: highest pitch, widest pitch range
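The comparison between an original input and its repeat correction can be expressed as a few contrast features, sketched below. The utterance representation, the feature names, and the function name are hypothetical. The classifier itself (decision trees, boosting) is not shown.

```python
def correction_features(original, repeat):
    """Contrast a repeat utterance with the original input.

    Each utterance is a dict with "duration" (seconds) and "pauses"
    (a list of pause lengths in seconds); this format is an assumption.
    """
    return {
        # Values above 1.0 mean the repeat is longer than the original.
        "duration_ratio": repeat["duration"] / original["duration"],
        "pause_count_delta": len(repeat["pauses"]) - len(original["pauses"]),
        "pause_length_delta": sum(repeat["pauses"]) - sum(original["pauses"]),
    }


first_try = {"duration": 2.0, "pauses": [0.1]}
second_try = {"duration": 3.0, "pauses": [0.2, 0.3]}
contrast = correction_features(first_try, second_try)
```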
The Problem: Speech Topic Segmentation • Separate audio stream into component topics On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, Bangalore, in India, sees only profit.||
Recognizing Shifts in Topic & Turn • Topic & Turn boundaries in English & Mandarin • Initial syllables: • Significantly higher pitch, loudness than final • Lexical and prosodic cues: • Cue words, tf*idf similarity; pitch, loudness, silence • Automatic recognition with decision trees, boosting • Voting to combine text, prosody, silence: 97% accuracy • Key features: • Pause; pitch, loudness contrast between syllables
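The tf*idf similarity cue can be illustrated by comparing text windows on either side of a candidate boundary: low cosine similarity suggests a topic shift. The weighting used below (raw term frequency times log(N/df)) is one common variant and an assumption here, as are the function names and the toy windows.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized window to a sparse tf*idf vector (dict word -> weight)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return [
        {word: tf * math.log(n / df[word]) for word, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


windows = [
    ["stock", "markets", "fall"],
    ["stock", "markets", "anxiety"],
    ["massacre", "kosovo", "allies"],
]
vecs = tfidf_vectors(windows)
same_topic = cosine(vecs[0], vecs[1])   # windows share vocabulary
across_topic = cosine(vecs[1], vecs[2])  # no shared words
```

In the combined system this score would be one input alongside the pitch, loudness, and silence cues.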
Conclusions & Opportunities • Prosody • Rich source of information for languages • Challenging due to variation, paucity of data • Can be successfully employed, with learning, to improve language understanding • Pitch accent, tone, dialogue act, turn, topic,… • Unrestricted conversational, multi-party, multimodal speech much more challenging • Increased variability, interaction with non-verbal evidence
Thanks • Dinoj Surendran, Siwei Wang, Yi Xu • V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin • This work supported by NSF Grant #0414919 • http://people.cs.uchicago.edu/~levow/tai
Phrasing can disambiguate • I met Mary and Elena’s mother at the mall yesterday • One intonation phrase with relatively flat overall pitch range. [Pitch track figure; labeled regions: “Mary & Elena’s mother”, “mall”]
Phrasing can disambiguate • I met Mary and Elena’s mother at the mall yesterday • Separate phrases, with expanded pitch movements. [Pitch track figure; labeled regions: “Mary”, “Elena’s mother”, “mall”]
Lists of numbers, nouns twenty.eight.five ninety.four.three seventy.three.seven forty.seven.seven seventy.seven.seven coffee cake and cream chocolate ice cream and cake fish fingers and bottles cheese sandwiches and milk cream buns and chocolate [from Prosody on the Web tutorial on chunking]
Clustering • Pitch accent clustering: • 4 way distinction: 1000 samples, proportional • 2-16 clusters constructed • Assign most frequent class label to each cluster • Classifier: • Asymmetric k-lines: • context-dependent kernel radii, non-spherical • > 78% accuracy: • 2 clusters: asymmetric k-lines best • Context effects: • Vector w/preceding context vs vector with no context comparable