Prosody in Spoken Language Understanding

Prosody in Spoken Language Understanding • Gina-Anne Levow • University of Chicago • January 4, 2008 • NLP Winter School 2008

U: Give me the price for AT&T. • U: Give me the price for AT&T. • U: Give me the price for AT&T. • U: Give me the price for American Telephone and Telegraph.

Error Correction Spiral • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday. • U: Give me the price for AT&T. • S: Data General was at 10 ½ up a quarter. • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½. • U: Give me the price for American Telephone and Telegraph. • S: Excuse me?

Roadmap • Corrections: A motivating example • Defining prosody • Why prosody? • Challenges in prosody • Prosody in language understanding • Recognizing tone and pitch accent • Spoken corrections, Topic segmentation • Conclusions

Defining Prosody • Prosody • Phonetic phenomena in speech that span more than a single segment (“suprasegmental”) • Prosody includes: • Stress, focus, tone, intonation, length/pause, rhythm • Prosodic features include: • Pitch: perceptual correlate of fundamental frequency • f0: rate of vocal fold vibration • Loudness/intensity, duration, segment quality

Why Prosody? • Prosody plays a crucial role • At all levels of language • Lexical, syntactic, pragmatic/discourse • Establishes meaning • Disambiguates sense and structure • Across language families • Common physiological, articulatory basis • In synthesis and recognition of fluent speech

Prosody and the Lexicon • Lexical: Determines word identity • Prosodic effect at the syllable level (minimal unit) • Lexical stress: syllable prominence • Combination of length, pitch movement, loudness • REcord (N) vs reCORD (V) • Pitch accent can differentiate words in some languages • Lexical tone: tone languages, e.g. Chinese, Punjabi • Pitch height (register) and/or shape (contour) • Ma (high): mother • Ma (rising): hemp • Ma (low): horse • Ma (falling): scold

Prosody and Syntax • Prosody can disambiguate structure • Associated with chunking and attachment • Not identical with syntactic phrase boundaries • “Prosody is predictable from syntax, except when it isn’t” • Prosodic phrasing indicated by: • Some combination of pause, change in pitch

Chunking, or “phrasing” • A1: I met Mary and Elena’s mother at the mall yesterday. • A2: I met Mary and Elena’s mother at the mall yesterday. • Example from Jennifer Venditti

Punctuation & Prosody Humor • A panda goes into a restaurant and has a meal. Just before he leaves he takes out a gun and fires it. The irate restaurant owner says ‘Why did you do that?’ The panda replies, ‘I'm a panda. Look it up.’ The restaurateur goes to his dictionary and under ‘panda’ finds: ‘black and white arboreal, bear-like creatures; eats, shoots and leaves.’

Prosody in Pragmatics & Discourse • Focus: • Prominence, new information: pitch accent • Example: “October eleventh” • Sentence type, dialogue act: • Statement vs. declarative question: “It’s raining (?)” • Discourse structure (topic), emotion • Figure from Shih, Prosody Learning and Generation

Challenges in Prosody I • Highly variable • Actual realization differs from ideal • Speaker variation: • Gender, vocal tract differences, idiosyncrasy • Tonal coarticulation • Neighboring tones influence each other (as with segmental coarticulation) • Underlying fall can become rise • Parallel encoding • Effects at multiple levels realized simultaneously

Challenges in Prosody II • Challenges for learning • Lack of training data • Sparseness: • Many prosodic phenomena are infrequent • E.g., non-declarative utterances, topic boundaries, contrastive accents, etc • Challenging for machine learning methods • Costs of labeling: • Many prosodic events require expert labeling • Need large corpus to attest • Time-consuming, expensive

Context and Learning in Multilingual Tone and Pitch Accent Recognition

Strategy: Context • Common model across languages • Pure acoustic-prosodic model • No word label, POS, lexical stress info • English, Mandarin Chinese (also Cantonese, isiZulu) • Exploit contextual information • Features from adjacent syllables, phrase contour • Analyze impact of • Context position, context encoding, context type • > 12.5% reduction in error over no context

Data Collections • English: (Ostendorf et al, 95) • Boston University Radio News Corpus, f2b • Manually annotated, aligned, syllabified • 4 Pitch accent labels, aligned to syllables • Mandarin: • TDT2 Voice of America Mandarin Broadcast News • Automatically aligned, syllabified • 4 main tones, neutral

Local Feature Extraction • Uniform representation for tone, pitch accent • Motivated by Pitch Target Approximation Model • Tone/pitch accent target exponentially approached • Linear target: height, slope (Xu et al, 99) • Base features: • Pitch, Intensity max, mean, min, range • (Praat, speaker normalized) • Pitch at 5 points across voiced region • Duration • Initial, final in phrase • Slope: • Linear fit to last half of pitch contour
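The base features on this slide can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes each syllable arrives as non-empty lists of per-frame pitch and intensity values that were already extracted and speaker-normalized elsewhere, and the function and key names (`local_features`, `pitch_points`, and so on) are hypothetical.

```python
from statistics import mean


def linear_slope(values):
    """Least-squares slope of values against frame index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def sample_points(values, k=5):
    """Return k values evenly spaced across the contour, endpoints included."""
    n = len(values)
    return [values[round(i * (n - 1) / (k - 1))] for i in range(k)]


def local_features(pitch, intensity, duration, phrase_initial, phrase_final):
    """Per-syllable features; pitch and intensity are non-empty frame lists."""
    feats = {}
    for name, track in (("pitch", pitch), ("intensity", intensity)):
        feats[name + "_max"] = max(track)
        feats[name + "_mean"] = mean(track)
        feats[name + "_min"] = min(track)
        feats[name + "_range"] = max(track) - min(track)
    feats["pitch_points"] = sample_points(pitch)       # 5 points across voiced region
    feats["duration"] = duration
    feats["phrase_initial"] = phrase_initial
    feats["phrase_final"] = phrase_final
    # Slope: linear fit to the last half of the pitch contour.
    feats["slope"] = linear_slope(pitch[len(pitch) // 2:])
    return feats
```

Fitting the slope only to the second half of the contour follows the slide's bullet; the intuition from the target approximation model is that the pitch target is approached most closely late in the syllable.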

Context Features • Local context: • Extended features • Pitch max, mean, adjacent points of preceding, following syllables • Difference features • Difference between • Pitch max, mean, mid, slope • Intensity max, mean • Of preceding, following and current syllable • Phrasal context: • Compute collection average phrase slope • Compute scalar pitch values, adjusted for slope
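A rough sketch of the two local-context encodings and of phrase-slope compensation. The names and the feature dictionary layout are assumptions made for illustration, as is the sign convention (current syllable minus neighbor); the "adjacent points" part of the extended features is omitted, and subtracting `avg_slope * t` is only one plausible reading of "adjusted for slope".

```python
DIFF_KEYS = ("pitch_max", "pitch_mean", "pitch_mid", "slope",
             "intensity_max", "intensity_mean")
EXT_KEYS = ("pitch_max", "pitch_mean")


def context_features(prev, cur, nxt, use_prev=True, use_next=True):
    """Build extended and difference features from neighboring syllables.

    prev, cur, nxt are feature dicts; prev or nxt may be None at phrase edges.
    """
    out = {}
    for tag, neighbor, use in (("prev", prev, use_prev), ("next", nxt, use_next)):
        if not use or neighbor is None:
            continue
        for key in EXT_KEYS:
            # Extended encoding: copy the neighbor's own values.
            out[f"ext_{tag}_{key}"] = neighbor[key]
        for key in DIFF_KEYS:
            # Difference encoding: current syllable relative to the neighbor.
            out[f"diff_{tag}_{key}"] = cur[key] - neighbor[key]
    return out


def compensate_phrase_slope(pitch_values, times, avg_slope):
    """Remove an average phrase-level slope from pitch values at given times."""
    return [p - avg_slope * t for p, t in zip(pitch_values, times)]
```

Setting `use_prev` or `use_next` to `False` gives the preceding-only, following-only, and no-context conditions compared in the experiments.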

Classification Experiments • Classifier: Support Vector Machine • Linear kernel • Multiclass formulation • SVMlight (Joachims), LibSVM (Chang & Lin 01) • 4:1 training / test splits • Experiments: Effects of • Context position: preceding, following, none, both • Context encoding: Extended/Difference • Context type: local, phrasal
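The experimental setup can be sketched without the classifier itself. The code below only shows a shuffled 4:1 split and an enumeration of the context position and encoding conditions; the SVM training step is left out, and the function names, the seed, and the choice of a simple (non-stratified) shuffle are assumptions.

```python
import random


def split_4_to_1(items, seed=0):
    """Shuffle items and return (train, test) in a 4:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (4 * len(shuffled)) // 5
    return shuffled[:cut], shuffled[cut:]


def experiment_grid():
    """Enumerate (context position, context encoding) conditions."""
    grid = [("none", None)]  # no context, so no encoding to choose
    for position in ("preceding", "following", "both"):
        for encoding in ("extended", "difference"):
            grid.append((position, encoding))
    return grid
```

Each condition in the grid would select a different feature subset before a linear-kernel classifier is trained on the training portion and scored on the held-out fifth.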

Results: Local Context

Discussion: Local Context • Any context information improves over none • Preceding context information consistently improves over none or following context information • English: Generally more context features are better • Mandarin: Following context can degrade • Little difference in encoding (Extend vs Diffs) • Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory

Results & Discussion: Phrasal Context • Phrase contour compensation enhances recognition • Simple strategy • Use of non-linear slope compensation may improve results further

Strategy: Training • Challenge: • Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data? • Exploit semisupervised and unsupervised learning • Semi-supervised Laplacian SVM • K-means and asymmetric k-lines clustering • Substantially outperform baselines • Can approach supervised levels

Semi-supervised Learning • Approach: • Employ small amount of labeled data • Exploit information from additional – presumably more available – unlabeled data • Few prior examples: several weakly supervised (Wong et al, ’05) • Classifier: • Laplacian SVM (Sindhwani, Belkin & Niyogi ’05) • Semi-supervised variant of SVM • Exploits unlabeled examples • RBF kernel, typically 6 nearest neighbors, transductive
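The way unlabeled examples enter a Laplacian-regularized learner can be illustrated with the neighborhood graph alone. The sketch below is not the Laplacian SVM; it only builds a symmetric k-nearest-neighbor graph over all points (labeled or not), forms the graph Laplacian L = D - W, and evaluates the smoothness penalty f^T L f that such methods add to the usual objective. Binary edge weights and all function names are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_adjacency(points, k=6):
    """Symmetric binary adjacency: i~j if either is among the other's k nearest."""
    n = len(points)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((sq_dist(points[i], points[j]), j)
                         for j in range(n) if j != i)
        for _, j in nearest[:k]:
            W[i][j] = W[j][i] = 1
    return W


def graph_laplacian(W):
    """L = D - W, where D is the diagonal matrix of node degrees."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]


def smoothness(L, f):
    """f^T L f: small when f assigns similar values to neighboring points."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))
```

A labeling that is constant within each tight group of neighbors has zero penalty, while one that flips sign between neighbors is penalized; this is how the geometry of unlabeled data constrains the decision function.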

Experiments • Pitch accent recognition: • Binary classification: Unaccented/Accented • 1000 instances, proportionally sampled • Labeled training: 200 unacc, 100 acc • 80% accuracy (cf. 84% w/15x labeled SVM) • Mandarin tone recognition: • 4-way classification: n(n-1)/2 binary classifiers • 400 instances: balanced; 160 labeled • Clean lab speech, in-focus: 94% • cf. 99% w/SVM, 1000s train; 85% w/SVM 160 training samples • Broadcast news: 70% • cf. < 50% w/SVM 160 training samples
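The n(n-1)/2 pairwise scheme for 4-way tone classification can be sketched as one-vs-one voting: with four tones there are six binary classifiers, and the tone that wins the most pairwise decisions is chosen. The binary deciders here are toy nearest-centroid rules on a single made-up feature, standing in for the trained binary classifiers; all names and values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def class_pairs(classes):
    """All unordered class pairs: n(n-1)/2 of them."""
    return list(combinations(classes, 2))


def ovo_predict(x, pairwise):
    """pairwise maps (a, b) to a function that returns a or b for input x."""
    votes = Counter(decide(x) for decide in pairwise.values())
    # Ties resolve to the class that received its first vote earliest.
    return votes.most_common(1)[0][0]


def centroid_decider(a, b, centroids):
    """Toy binary rule: pick whichever class centroid is closer to x."""
    def decide(x):
        return a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b
    return decide


tones = ["T1", "T2", "T3", "T4"]
centroids = {"T1": 0.0, "T2": 10.0, "T3": 20.0, "T4": 30.0}  # made-up values
pairwise = {(a, b): centroid_decider(a, b, centroids)
            for a, b in class_pairs(tones)}
```

With these toy centroids, `ovo_predict(19.0, pairwise)` collects three votes for `T3`, two for `T2`, and one for `T4`.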

Unsupervised Learning • Question: • Can we identify the tone structure of a language from the acoustic space without training? • Analogous to language acquisition • Significant recent research in unsupervised clustering • Established approaches: k-means • Spectral clustering (Shi & Malik ’97, Fischer & Poland 2004): asymmetric k-lines • Little research for tone • Self-organizing maps (Gauthier et al, 2005) • Tones identified in lab speech using f0 velocities • Cluster-based bootstrapping (Narayanan et al, 2006) • Prominence clustering (Tamburini ’05)

Contrasting Clustering • Contrasts: • Clustering: 2-16 clusters, label w/most freq class • 3 Spectral approaches: • Perform spectral decomposition of affinity matrix • Asymmetric k-lines (Fischer & Poland 2004) • Symmetric k-lines (Fischer & Poland 2004) • Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004) • Binary weights, k-lines clustering • K-means: Standard Euclidean distance • # of clusters: 2-16 • Best results: > 78% • 2 clusters: asymmetric k-lines; > 2 clusters: k-means • Larger # clusters: all similar
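The scoring rule "label w/most freq class" can be sketched independently of the clustering algorithm: each cluster is assigned the gold label that occurs most often among its members, and accuracy is the fraction of instances whose cluster label matches their gold label. The function name and input layout are assumptions.

```python
from collections import Counter, defaultdict


def majority_label_accuracy(cluster_ids, gold_labels):
    """Label each cluster with its most frequent gold class and score it.

    Returns (mapping from cluster id to assigned label, accuracy).
    """
    members = defaultdict(list)
    for cluster, gold in zip(cluster_ids, gold_labels):
        members[cluster].append(gold)
    mapping = {cluster: Counter(golds).most_common(1)[0][0]
               for cluster, golds in members.items()}
    correct = sum(mapping[c] == g for c, g in zip(cluster_ids, gold_labels))
    return mapping, correct / len(gold_labels)
```

Because every cluster takes its own majority label, this score can only stay level or rise as the number of clusters grows, which is worth keeping in mind when comparing results across the 2-16 cluster range.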

Contrasting Learners

Tone Clustering: I • Mandarin four tones: • 400 samples: balanced • 2-phase clustering: 2-5 clusters each • Asymmetric k-lines, k-means clustering • Clean read speech: • In-focus syllables: 87% (cf. 99% supervised) • In-focus and pre-focus: 77% (cf. 93% supervised) • Broadcast news: 57% (cf. 74% supervised) • K-means requires more clusters to reach k-lines level

Tone Structure • First phase of clustering splits high/rising from low/falling by slope • Second phase by pitch height

Conclusions • Common prosodic framework for tone and pitch accent recognition • Contextual modeling enhances recognition • Local context and broad phrase contour • Carryover coarticulation has larger effect for Mandarin • Exploiting unlabeled examples for recognition • Semi- and Un-supervised approaches • Best cases approach supervised levels with less training • Exploits acoustic structure of tone and accent space

Error Correction Spiral • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday. • U: Give me the price for AT&T. • S: Data General was at 10 ½ up a quarter. • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½. • U: Give me the price for American Telephone and Telegraph. • S: Excuse me?

Recognizing Spoken Corrections • Spoken Corrections • Recognize user attempts to correct ASR failures • Compare original input to repeat corrections • Significant differences: • Corrections: increases in duration, pause #/length, final fall • Increases in pitch accent for misrecognitions • Automatic recognition with decision trees, boosting • Distinguish corrective/not (human level) • Key features: raw/normalized duration, pause • Identify specific word being corrected • Key features: highest pitch, widest pitch range

The Problem: Speech Topic Segmentation • Separate audio stream into component topics • Example: On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, Bangalore, in India, sees only profit. ||

Is It Possible in Mandarin?

Recognizing Shifts in Topic & Turn • Topic & Turn boundaries in English & Mandarin • Initial syllables: • Significantly higher pitch, loudness than final • Lexical and prosodic cues: • Cue words, tf*idf similarity; pitch, loudness, silence • Automatic recognition with decision trees, boosting • Voting to combine text, prosody, silence: 97% accuracy • Key features: • Pause; pitch, loudness contrast between syllables
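The tf*idf similarity cue can be sketched as a cosine between term-weight vectors of adjacent text windows, where low similarity suggests a possible topic boundary. This is a minimal illustration under assumptions: windows are pre-tokenized word lists, idf is log(N / df) computed over the windows themselves, and the function names are hypothetical. The prosodic and silence cues and the voting step are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(windows):
    """Map each tokenized window to a dict of term -> tf * idf weight."""
    n = len(windows)
    df = Counter(term for window in windows for term in set(window))
    vectors = []
    for window in windows:
        tf = Counter(window)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjacent_similarities(windows):
    """Similarity between each window and the next one."""
    vecs = tfidf_vectors(windows)
    return [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
```

A dip in the sequence returned by `adjacent_similarities` marks a candidate boundary, which could then be combined with pause and pitch or loudness contrast evidence.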

Conclusions & Opportunities • Prosody • Rich source of information for languages • Challenging due to variation, paucity of data • Can be successfully employed, with learning, to improve language understanding • Pitch accent, tone, dialogue act, turn, topic,… • Unrestricted conversational, multi-party, multimodal speech much more challenging • Increased variability, interaction with non-verbal evidence

Thanks • Dinoj Surendran, Siwei Wang, Yi Xu • V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin • This work supported by NSF Grant #0414919 • http://people.cs.uchicago.edu/~levow/tai

Phrasing can disambiguate • I met Mary and Elena’s mother at the mall yesterday • One intonation phrase with relatively flat overall pitch range. • [Figure labels: “Mary & Elena’s mother”, “mall”]

Phrasing can disambiguate • I met Mary and Elena’s mother at the mall yesterday • Separate phrases, with expanded pitch movements. • [Figure labels: “Mary”, “Elena’s mother”, “mall”]

Lists of numbers, nouns • twenty.eight.five • ninety.four.three • seventy.three.seven • forty.seven.seven • seventy.seven.seven • coffee cake and cream • chocolate ice cream and cake • fish fingers and bottles • cheese sandwiches and milk • cream buns and chocolate • [from Prosody on the Web tutorial on chunking]

Clustering • Pitch accent clustering: • 4 way distinction: 1000 samples, proportional • 2-16 clusters constructed • Assign most frequent class label to each cluster • Classifier: • Asymmetric k-lines: • context-dependent kernel radii, non-spherical • > 78% accuracy: • 2 clusters: asymmetric k-lines best • Context effects: • Vector w/preceding context vs vector with no context comparable
