Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News

Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News • Gina-Anne Levow • University of Chicago • SIGHAN • July 25, 2004

Roadmap • The Problem: Mandarin Story Segmentation • The Tools: Prosodic and Text Cues • Mandarin Chinese • Individual Results • Integrating Cues • Conclusion & Future Work

The Problem: Mandarin Speech Topic Segmentation • Separate audio stream into component topics

Why Segment? • Enables language understanding tasks • Information Retrieval • Only regions of interest • Summarization • Cover all main topics • Reference Resolution • Pronouns tend to refer within segments

The Challenge • How do we define/measure topicality? • Are two regions on the same topic? • Fundamentally requires full understanding • How can we approach with partial understanding? • How do we identify boundaries sharply? • Association of sentences may be ambiguous • Especially, “filler”

The Tools: Prosodic and Text Cues • Represent local changes at boundaries with audio • Silence (!), speaker change, pitch, loudness, rate (GHN, AT&T00) • Represent topicality with text • Component words in audio stream • Possibly noisy • Many possible models (Hearst 94, Beeferman 99, ...) • Combining Prosody and Text • Human annotators are more accurate and more confident when they use BOTH transcribed text and original audio (Swerts 97) • English broadcast news (Tur et al., 2001)

Data and Processing • Broadcast News • Topic Detection and Tracking TDT3 corpus • Voice of America broadcast news • ASR transcription • Manually segmented – known boundaries • ~4,000 stories, ~750K words • Acoustic analysis (Praat) • Automatic pitch, intensity tracking • Smoothed, speaker-normalized, per-word
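
The slide says only that pitch and intensity were "speaker-normalized, per-word"; it does not give the method. As a minimal sketch, assuming a per-speaker z-score (an assumption, not necessarily the normalization used in this work), the step could look like this. The function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_normalize(words):
    """Z-score each per-word value against its own speaker's statistics.

    words: list of (speaker_id, raw_value) pairs, one per word.
    ASSUMPTION: z-scoring is one common normalization; the slides do not
    state which method was actually used.
    """
    by_speaker = defaultdict(list)
    for speaker, value in words:
        by_speaker[speaker].append(value)

    # Mean and population standard deviation per speaker.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    normalized = []
    for speaker, value in words:
        mu, sigma = stats[speaker]
        # A speaker with no variation gets 0.0 rather than a division error.
        normalized.append((value - mu) / sigma if sigma > 0 else 0.0)
    return normalized
```

Normalizing within speaker lets a low-pitched word be judged relative to that speaker's own range, so values are comparable across speakers with different baselines.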

Acoustic-Prosodic Cues • Languages differ in use of intonation • E.g. English: declarative fall, question rise • Chinese: pitch contour determines word meaning • At segment boundaries??? • Surprisingly similar, though not identical • Significantly lower pitch at end of segment • Significantly lower amplitude at end of segment • Significantly longer duration at end of segment

Acoustic-Prosodic Contrasts • [Figure panels: Mandarin Normalized Pitch; Mandarin Normalized Intensity]

Learning Boundaries • Decision tree classifier (Quinlan C4.5) • Classification problem • For each word, classify as final/non-final • Features • Acoustic-Prosodic: • Duration, Pitch, Loudness, Silence • Word average, Between-word difference
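
The feature list above (per-word averages plus between-word differences) can be sketched as a small feature builder. This is an illustration under assumptions: the field names, the sign convention of the difference (next word minus current word), and the zero fill at the end of the stream are all hypothetical choices, not details from the slides.

```python
def boundary_features(words):
    """Build one feature row per word for final/non-final classification.

    words: list of dicts with per-word 'pitch', 'intensity', 'duration'
    (already normalized) and 'pause_after' (silence after the word, seconds).
    ASSUMPTION: differences are taken as next-word minus current-word.
    """
    rows = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        rows.append({
            # Word-average cues
            "pitch": word["pitch"],
            "intensity": word["intensity"],
            "duration": word["duration"],
            "silence": word["pause_after"],
            # Between-word difference cues (0.0 when there is no next word)
            "pitch_diff": nxt["pitch"] - word["pitch"] if nxt else 0.0,
            "intensity_diff": nxt["intensity"] - word["intensity"] if nxt else 0.0,
        })
    return rows
```

Each row would then be labeled final or non-final and passed to a decision tree learner such as the C4.5 classifier named on the slide.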

Text Boundary Features • Text • Information retrieval style • Cosine similarity between weighted term vectors • tf*idf in 50-word windows • Cue phrases • N-gram features • Identified by BoosTexter (Schapire & Singer, 2000) • E.g. “Voice of America”, “Audience”, “Reporting”
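
The cosine-similarity feature above can be sketched as follows. This is a minimal illustration, assuming the two 50-word windows sit immediately before and after the candidate position and that inverse document frequencies are supplied as a precomputed dictionary; the slides specify neither detail, and all function names here are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight raw term counts by idf; terms missing from idf get weight 0."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def window_similarity(tokens, i, idf, window=50):
    """Similarity of the window ending at word i and the window after it.

    ASSUMPTION: windows are adjacent and non-overlapping around position i.
    """
    left = tokens[max(0, i - window + 1): i + 1]
    right = tokens[i + 1: i + 1 + window]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

A low similarity at position i suggests the vocabulary changes there, which is the topicality signal; it does not by itself pin the boundary to a specific word.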

Classification Results • Balanced training and test sets • Results on held-out subsets • Acoustic cues only • 95.6% accuracy • Text cues (+ silence) • 95.6% accuracy • Combined text and prosody • 96.4% accuracy • Typically, false alarms twice as common as misses

Joint Decision Tree • [Figure: decision tree diagram]

Feature Assessment • Role of silence • Useful in both text and acoustic classifiers • More necessary for text • Text captures topicality, not locality • Cannot identify boundaries sharply • Prosodic cues: • Localize boundaries • Multiple supporting cues: intensity, pitch (contrastive use)

Issue: False Alarms • Evaluate representative sample • Boundaries far rarer than non-boundaries (boundary <<< non-boundary) • 95.6% accuracy • 2% miss, 4.4% false alarms • Non-boundary frequent • False alarms frequent

Voting Against False Alarms • Error analysis: • Construct per-feature classifiers: • Prosody-only, text-only, silence-only • Compare classifiers: per-feature, joint • Boundary predicted by joint + only 0 or 1 per-feature classifiers: FALSE ALARM • Approach: Voting • Require joint + 2 per-feature classifiers to agree • Result: 1/3 reduction in false alarms • ~97% accuracy: 2.8% miss, 3.15% false alarm
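
The voting rule above can be written as a short decision function. This is a sketch, assuming each classifier emits a boolean boundary decision per word and reading "joint + 2" as "at least 2" of the three per-feature classifiers; the function and parameter names are hypothetical.

```python
def vote_boundary(joint, prosody, text, silence, min_support=2):
    """Accept a boundary only when the joint classifier is backed by voters.

    All arguments except min_support are booleans (True = boundary).
    ASSUMPTION: "joint + 2 per-feature classifiers" is read as the joint
    classifier plus at least min_support of the three per-feature ones.
    """
    if not joint:
        # The joint classifier must propose the boundary in the first place.
        return False
    supporters = sum([prosody, text, silence])
    return supporters >= min_support
```

For example, `vote_boundary(True, True, False, False)` rejects the boundary because only one per-feature classifier agrees, which is the pattern the error analysis associated with false alarms.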

Conclusion • Mandarin broadcast news segmentation • Identify topicality and boundary locality • Integrate text and acoustic cues • Text similarity: vector space model, n-gram cues • Prosodic cues: silence, intensity, pitch, duration • Robust across range of languages • Provide supporting and orthogonal information • Majority agreement of per-feature classifiers: • 1/3 fewer false alarms

Current & Future Work • Improving the model of topicality • Richer text similarity models; broader acoustic models • Alternative classifiers • Preliminary experiments: • Boosting, Boosted Decision trees, MaxEnt • Comparable • Alternative integration strategies • Hierarchical subtopic segmentation • Broadcast news • Dialogue: human-computer, human-human • Integration with multi-modal features: e.g. gesture, gaze

Acoustic-Prosodic Contrasts • [Figure panels: English Normalized Pitch; English Normalized Intensity; Mandarin Normalized Pitch; Mandarin Normalized Intensity]

Text Decision Tree

Prosodic Decision Tree

The Problem: Speech Topic Segmentation • Separate audio stream into component topics • Example (|| marks a topic boundary): On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit. ||
