On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings
Jáchym Kolář (1,2), Elizabeth Shriberg (1,3), Yang Liu (1,4)
(1) International Computer Science Institute, Berkeley, USA
(2) University of West Bohemia in Pilsen, Czech Republic
(3) SRI International, USA
(4) University of Texas at Dallas, USA

Why automatic DA segmentation?
• Standard STT systems output a raw stream of words, leaving out structural information such as sentence and Dialog Act (DA) boundaries
• Problems for human readability
• Problems when applying downstream natural language processing techniques requiring formatted input
Kolář et al.: On Speaker-Specific Prosodic Models for ...

Goal and Task Definition
• Goal: Dialog Act (DA) segmentation of meetings
• Task definition:
  • 2-way classification in which each inter-word boundary is labeled as a within-DA boundary or a boundary between DAs
  • e.g. “no jobs are still running ok” → 3 DAs: “No.” + “Jobs are still running.” + “OK.”
• Evaluation metric: “Boundary error rate”
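The task definition can be illustrated with a minimal sketch. It assumes that "boundary error rate" means the fraction of inter-word boundaries labeled incorrectly (misses plus false alarms over all boundaries); the slides name the metric but do not spell out the formula. The function names and the hypothesis labels are hypothetical, and only the example sentence comes from the slide.

```python
def boundary_error_rate(reference, hypothesis):
    """Fraction of inter-word boundaries whose label is wrong (misses + false alarms)."""
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("need at least one boundary")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)


def split_into_das(words, labels):
    """Cut the word stream after every word whose boundary label is 1."""
    segments, current = [], []
    for word, is_boundary in zip(words, labels):
        current.append(word)
        if is_boundary:
            segments.append(" ".join(current))
            current = []
    if current:  # trailing words without a closing boundary
        segments.append(" ".join(current))
    return segments


# One label per word: 1 = DA boundary after this word, 0 = within a DA.
words = "no jobs are still running ok".split()
reference = [1, 0, 0, 0, 1, 1]   # "No." + "Jobs are still running." + "OK."
hypothesis = [1, 0, 0, 0, 0, 1]  # made-up system output that misses one boundary

print(split_into_das(words, reference))
print(split_into_das(words, hypothesis))
print(boundary_error_rate(reference, hypothesis))
```

In this toy case one of six boundaries is wrong, so the assumed metric is 1/6.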

Approach: Explore Speaker-Specific Prosody
• Past work has used both lexical and prosodic features, but has collapsed over speakers
• Speakers appear to differ, however, in both feature types, especially in spontaneous speech
• Meeting applications: the speaker is often known or at least recorded on one channel, and often participates in ongoing meetings → good opportunity for modeling
• Speaker adaptation has been used successfully in the cepstral domain for ASR
• This study takes a first look specifically at prosodic features for the DA boundary task

Three Questions
1) Do individual speakers benefit from modeling more than simply pause information?
2) Do individual speakers differ enough from the overall speaker model to benefit from a prosodic model trained on only their speech?
3) How do speakers differ in terms of prosody usage in marking DA boundaries?

Data and Experimental Setup
• ICSI meeting corpus: multichannel conversational speech annotated for DAs
• Baseline speaker-independent model trained on 567k words
• For speaker-specific experiments: the 20 most frequent speakers in terms of total words (7.5k to 165k words)
  • 17 males, 3 females
  • 12 natives, 8 nonnatives

Data and Experimental Setup II.
• Each speaker’s data: ~70% training, ~30% testing
• Jackknife instead of a separate development set → using the 1st half of the test data to tune weights for the 2nd half and vice versa
• Tested on forced alignments rather than on ASR hypotheses
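The jackknife step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the tuned "weights" reduce to a single interpolation weight `w` between two model posteriors, that the weight is picked by grid search, and that a boundary is predicted when the combined posterior is at least 0.5. All names and the toy posteriors are hypothetical.

```python
def jackknife_two_halves(test_items, candidates, count_errors):
    """Tune on one half of the test set, score the other half, then swap."""
    mid = len(test_items) // 2
    first, second = test_items[:mid], test_items[mid:]
    chosen, total_errors = [], 0
    for tune_half, eval_half in ((first, second), (second, first)):
        # Pick the candidate weight with the fewest errors on the tuning half.
        best = min(candidates, key=lambda w: count_errors(w, tune_half))
        chosen.append(best)
        # Errors are only ever counted on data not used for tuning.
        total_errors += count_errors(best, eval_half)
    return chosen, total_errors / len(test_items)


def count_errors(w, items):
    """Errors of the weighted combination of two posteriors (assumed form)."""
    errors = 0
    for p_a, p_b, label in items:
        predicted = 1 if w * p_a + (1 - w) * p_b >= 0.5 else 0
        errors += predicted != label
    return errors


# Toy items: (posterior from model A, posterior from model B, true label).
items = [
    (0.9, 0.2, 0), (0.8, 0.9, 1), (0.1, 0.2, 0), (0.3, 0.9, 1),
    (0.7, 0.1, 0), (0.2, 0.6, 0), (0.6, 0.8, 1), (0.4, 0.7, 1),
]
weights, error_rate = jackknife_two_halves(items, [0.0, 0.5, 1.0], count_errors)
print(weights, error_rate)
```

Each half can end up with a different tuned weight, and the reported error pools both held-out halves, so every test item is scored exactly once with a weight it did not influence.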

Prosodic Features and Classifiers
• Features: 32 for each inter-word boundary
  • Pause: after the current, previous, and following word
  • Duration: phone-normalized durations of vowels, final rhymes, and words; no raw durations
  • Pitch: F0 min, max, mean, slopes, and differences and ratios across word boundaries; raw values + PWL stylized contour
  • Energy: max, min, mean frame-level RMS values, both raw and normalized
• Classifiers: CART-style decision trees with ensemble bagging
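To make the classifier idea concrete, here is a deliberately simplified sketch of bagging: each "tree" is only a one-split stump chosen by a Gini-style impurity, trained on a bootstrap resample, and the ensemble averages the stumps' boundary posteriors. Real CART trees are deeper and the original system is not reproduced here; the two feature columns and all values are hypothetical.

```python
import random


def fit_stump(X, y):
    """Return (feature index, threshold, P(boundary | left), P(boundary | right))."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            p_left = sum(left) / len(left) if left else 0.5
            p_right = sum(right) / len(right) if right else 0.5
            # Size-weighted Gini-style impurity of the two children.
            impurity = (len(left) * p_left * (1 - p_left)
                        + len(right) * p_right * (1 - p_right))
            if best is None or impurity < best[0]:
                best = (impurity, j, t, p_left, p_right)
    return best[1:]


def bagged_posterior(X, y, x_new, n_trees=25, seed=0):
    """Average the posteriors of stumps trained on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    total = 0.0
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        j, t, p_left, p_right = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        total += p_left if x_new[j] <= t else p_right
    return total / n_trees


# Hypothetical columns: [pause after word (s), normalized final-rhyme duration].
X = [[0.02, 1.1], [0.05, 0.9], [0.03, 1.0], [0.10, 1.2],
     [0.60, 1.0], [0.80, 1.1], [1.20, 0.9], [0.45, 1.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = DA boundary

print(bagged_posterior(X, y, [0.90, 1.0]))  # long pause
print(bagged_posterior(X, y, [0.01, 1.0]))  # very short pause
```

Averaging over resamples yields a smoother posterior than a single tree's leaf estimate, which is convenient when model outputs are later combined.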

Pause-only vs. Richer Set of Prosodic Features
• Compare the speaker-independent (SI) model using pause features only (SI-Pau) against the SI model using all 32 prosodic features (SI-All)
• SI-All is significantly better for 19 of 20 speakers
• The relative error rate reduction from prosody is not correlated with the amount of training data
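For reference, relative error reduction is conventionally computed as the drop in error divided by the baseline error. The sketch below uses that standard definition; the speaker IDs and error rates are invented for illustration and are not results from the study.

```python
def relative_error_reduction(baseline_error, new_error):
    """(baseline - new) / baseline; positive means the new model is better."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical boundary error rates in percent: (SI-Pau, SI-All).
illustrative = {"spk_a": (10.0, 8.0), "spk_b": (12.0, 11.4)}
for speaker, (pause_only, all_features) in illustrative.items():
    reduction = relative_error_reduction(pause_only, all_features)
    print(f"{speaker}: {100 * reduction:.1f}% relative reduction")
```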

Pause-only vs. Rich Prosody: Relative Error Reduction [figure]

Speaker-Independent (SI) vs. Speaker-Dependent (SD) Models
• We compare SI, SD, and interpolated SI+SD models
• SI+SD defined as: [formula not preserved]
• A significantly improved result would suggest that a speaker's prosodic marking of boundaries differs from the baseline SI model
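The definition of SI+SD did not survive in this text. A common way to interpolate two classifiers, shown here purely as an assumption and not as the authors' formula, is a linear combination of their boundary posteriors, where b is the boundary label, x the prosodic feature vector, and λ a weight tuned on held-out data:

```latex
P_{\mathrm{SI+SD}}(b \mid x) = \lambda \, P_{\mathrm{SI}}(b \mid x) + (1 - \lambda) \, P_{\mathrm{SD}}(b \mid x), \qquad 0 \le \lambda \le 1
```

Under this reading, λ = 1 recovers the SI model alone and λ = 0 the SD model alone.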

Effects of Adding SD Information
• SD models are much smaller than the SI model; as expected, SI is better than SD alone for most subjects (though for some, SD is better!)
• For many subjects, adding SD information gives no gain (no SD-specific information, or not enough data?)
• For 7 of 20 speakers, however, SD or SI+SD is better than SI; 5 of these improvements are statistically significant
• Improvement from SD is not correlated with amount of data, error rate, chance error, proficiency in English, or gender
• SD often helps in “unusual” prosody situations: hesitation, lip smack, long pause, emotions
• SD helps more in preventing false alarms than misses

Audio Examples: SD Helps
• Example of preventing a FALSE ALARM: “and another thing that we did also is that |FA| we have all this training data … ”
  • SD does not false alarm after the 2nd “that” because it ‘knows’ this nonnative speaker has a limited F0 range and often falls in pitch before hesitations
• Example of preventing a MISS: “this is one |.| and I think that's just fine |.|”
  • SD finds the DA boundary after “one”, despite the short pause, probably based on the speaker’s prototypical pitch reset

Feature Usage, Natives vs. Nonnatives
• Feature usage: how many times a feature is queried in the tree, weighted by the number of samples it affects
• 5 groups of features:
  • Pause at boundary
  • Near pause
  • Duration
  • Pitch
  • Energy
• Compare the SD feature usage of improved speakers with the SI distribution
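The feature usage statistic described above can be sketched as a tree traversal: every internal node adds the number of samples it splits to the total of the feature group it queries, and the totals are normalized to sum to 1. The node structure, feature names, and group mapping below are hypothetical, and normalizing by the grand total is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    n_samples: int                 # training samples reaching this node
    feature: Optional[str] = None  # queried feature; None marks a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def feature_usage(root, group_of):
    """Sample-weighted share of queries per feature group, normalized to 1."""
    totals = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is None:
            continue  # leaves query nothing
        group = group_of[node.feature]
        totals[group] = totals.get(group, 0) + node.n_samples
        stack.extend(c for c in (node.left, node.right) if c is not None)
    grand_total = sum(totals.values())
    return {g: n / grand_total for g, n in totals.items()} if grand_total else {}


# Hypothetical feature names mapped to three of the five groups.
group_of = {"pause_after": "Pause at boundary", "f0_slope": "Pitch", "rhyme_dur": "Duration"}
tree = Node(100, "pause_after",
            left=Node(60, "f0_slope", Node(30), Node(30)),
            right=Node(40, "rhyme_dur", Node(25), Node(15)))
usage = feature_usage(tree, group_of)
print(usage)
```

For a bagged ensemble, the same statistic could be averaged over all trees before comparing SD and SI distributions.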

Feature Usage: Natives vs. Nonnatives [figure]

Summary
• Prosodic features beyond pause provide improvement for 19 of 20 frequent speakers
• For ~30% of the speakers studied, simply interpolating the large SI prosodic model with a small SD model yielded improvement
• Amount of data, error rate, chance error, proficiency in English, and gender are not correlated with improvement from SD
• Some interesting observations: nonnative speakers differ from native speakers in feature usage patterns; SD information helps in “unusual” prosody situations and in preventing false alarms

Conclusions and Future Work
• Results are interesting and suggestive, but as yet inconclusive
• SD prosody modeling significantly benefits some speakers, but predicting who they will be is still an open question
• Many issues remain to be addressed, especially joint modeling with lexical features and a better integration approach
• The approach is interesting to explore for other domains such as broadcast news, where segmentation is important and some speakers occur repeatedly

