Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News. Gina-Anne Levow, University of Chicago. SIGHAN, July 25, 2004.
Roadmap • The Problem: Mandarin Story Segmentation • The Tools: Prosodic and Text Cues • Mandarin Chinese • Individual Results • Integrating Cues • Conclusion & Future Work
The Problem: Mandarin Speech Topic Segmentation • Separate audio stream into component topics
Why Segment? • Enables language understanding tasks • Information Retrieval • Only regions of interest • Summarization • Cover all main topics • Reference Resolution • Pronouns tend to refer within segments
The Challenge • How do we define/measure topicality? • Are two regions on the same topic? • Fundamentally requires full understanding • How can we approach it with only partial understanding? • How do we identify boundaries sharply? • Association of sentences with segments may be ambiguous • Especially for “filler” material
The Tools: Prosodic and Text Cues • Represent local changes at boundaries with audio • Silence (!), speaker change, pitch, loudness, rate (GHN, AT&T00) • Represent topicality with text • Component words in audio stream • Possibly noisy • Many possible models (Hearst 94, Beeferman 99, ...) • Combining Prosody and Text • Human annotators are more accurate and more confident when they use BOTH transcribed text and original audio (Swerts 97) • English broadcast news (Tur et al., 2001)
Data and Processing • Broadcast News • Topic Detection and Tracking TDT3 corpus • Voice of America broadcast news • ASR transcription • Manually segmented – known boundaries • ~4,000 stories, ~750K words • Acoustic analysis (Praat) • Automatic pitch, intensity tracking • Smoothed, speaker-normalized, per-word
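The slide says pitch and intensity are speaker-normalized per word but does not say how. The sketch below assumes a per-speaker z-score, which is one common choice and not necessarily the one used in this work. The field names (`speaker`, `pitch`) and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_normalize(words, key="pitch"):
    """Return copies of `words` with a per-speaker z-scored value added.

    ASSUMPTION: z-score normalization; the slides only say
    "speaker-normalized" without naming the method.
    """
    # Collect the raw per-word values for each speaker.
    by_speaker = defaultdict(list)
    for w in words:
        by_speaker[w["speaker"]].append(w[key])

    # Mean and (population) standard deviation per speaker.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    normalized = []
    for w in words:
        mu, sigma = stats[w["speaker"]]
        z = (w[key] - mu) / sigma if sigma > 0 else 0.0
        normalized.append({**w, key + "_norm": z})
    return normalized


# Hypothetical per-word mean pitch values (Hz) for two speakers.
words = [
    {"word": "w1", "speaker": "A", "pitch": 200.0},
    {"word": "w2", "speaker": "A", "pitch": 220.0},
    {"word": "w3", "speaker": "B", "pitch": 100.0},
    {"word": "w4", "speaker": "B", "pitch": 120.0},
]
normalized = speaker_normalize(words)
```

After normalization, a low-pitched word from a high-pitched speaker and a low-pitched word from a low-pitched speaker land on the same scale, which is what makes "lower pitch at end of segment" comparable across speakers.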
Acoustic-Prosodic Cues • Languages differ in use of intonation • E.g. English: declarative fall, question rise • Chinese: pitch contour determines word meaning • What happens at segment boundaries? • Surprisingly similar across the two languages, though not identical • Significantly lower pitch at end of segment • Significantly lower amplitude at end of segment • Significantly longer duration at end of segment
Acoustic-Prosodic Contrasts • [Figure: Mandarin normalized pitch] • [Figure: Mandarin normalized intensity]
Learning Boundaries • Decision tree classifier (Quinlan C4.5) • Classification problem • For each word, classify as final/non-final • Features • Acoustic-Prosodic: • Duration, Pitch, Loudness, Silence • Word average, Between-word difference
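The feature set above (a word-level average plus a between-word difference for each cue) can be sketched as follows. This is a minimal illustration, not the original feature extractor: the field names, the choice to difference against the following word, and the zero default at the end of the stream are all assumptions. Each resulting row would be one training instance for a final/non-final classifier such as C4.5.

```python
def boundary_features(words):
    """Build one feature row per word.

    Each input word is a dict of per-word averages (hypothetical keys):
    'duration', 'pitch', 'intensity', and 'silence_after' in seconds.
    ASSUMPTION: "between-word difference" is taken as next word minus
    current word; the slides do not specify the direction.
    """
    rows = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        rows.append({
            # Word-average features.
            "duration": w["duration"],
            "pitch": w["pitch"],
            "intensity": w["intensity"],
            "silence_after": w["silence_after"],
            # Between-word difference features (0.0 for the last word).
            "pitch_diff": nxt["pitch"] - w["pitch"] if nxt else 0.0,
            "intensity_diff": nxt["intensity"] - w["intensity"] if nxt else 0.0,
        })
    return rows


# Hypothetical normalized values: a low, quiet, long word followed by
# a pause and then a higher, louder word, as at a story boundary.
sample = [
    {"duration": 0.5, "pitch": -0.5, "intensity": -1.0, "silence_after": 1.5},
    {"duration": 0.25, "pitch": 0.75, "intensity": 0.5, "silence_after": 0.0},
]
rows = boundary_features(sample)
```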
Text Boundary Features • Text • Information retrieval style • Cosine similarity between weighted term vectors • tf*idf in 50-word windows • Cue phrases • N-gram features • Identified by BoosTexter (Schapire & Singer, 2000) • E.g. “Voice of America”, “Audience”, “Reporting”
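The text-similarity feature above can be sketched as a cosine between tf*idf vectors of the window before and the window after a candidate boundary. Comparing the two adjacent windows is an assumption (the slide only says "50-word windows"), and the `idf` table is passed in as a plain dictionary rather than estimated from a corpus.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight raw term counts by idf; terms missing from `idf` get weight 0."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def window_similarity(tokens, i, idf, window=50):
    """Similarity of the `window` words ending at index i and the
    `window` words that follow it. Low values suggest a topic shift.
    ASSUMPTION: adjacent before/after windows around each word.
    """
    left = tokens[max(0, i + 1 - window): i + 1]
    right = tokens[i + 1: i + 1 + window]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))


# Toy example with uniform idf and a small window.
tokens = ["stocks", "fall", "markets", "fall",
          "kosovo", "allies", "kosovo", "nato"]
idf = {t: 1.0 for t in set(tokens)}
across_topics = window_similarity(tokens, 3, idf, window=4)
within_topic = window_similarity(tokens, 1, idf, window=2)
```

In the toy example the windows on either side of index 3 share no terms, so the similarity there is zero, while the two windows inside the first topic share "fall" and score higher.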
Classification Results • Balanced training and test sets • Results on held-out subsets • Acoustic cues only • 95.6% accuracy • Text cues (+ silence) • 95.6% accuracy • Combined text and prosody • 96.4% accuracy • Typically, false alarms twice as common as misses
Feature Assessment • Role of silence • Useful in both text and acoustic classifiers • More necessary for text • Text captures topicality, not locality • Cannot identify boundaries sharply • Prosodic cues: • Localize boundaries • Multiple supporting cues (intensity, pitch), used contrastively
Issue: False Alarms • Evaluate on a representative sample • Boundaries far rarer than non-boundaries (boundary <<< non-boundary) • 95.6% accuracy • 2% miss, 4.4% false alarms • Non-boundaries are frequent, so false alarms are frequent
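A short worked calculation shows why false alarms dominate on a representative sample. It assumes the miss rate is measured over true boundaries and the false-alarm rate over non-boundary words, and it approximates the boundary prior from the corpus size given earlier (~4,000 stories in ~750K words). Both are assumptions made for illustration, not figures stated on the slide.

```python
def overall_accuracy(boundary_prior, miss_rate, false_alarm_rate):
    """Accuracy over all words, given per-class error rates.

    ASSUMPTION: miss_rate is P(predict non-boundary | boundary) and
    false_alarm_rate is P(predict boundary | non-boundary).
    """
    error = (boundary_prior * miss_rate
             + (1.0 - boundary_prior) * false_alarm_rate)
    return 1.0 - error


# ASSUMPTION: roughly one boundary per story, ~4,000 stories, ~750K words.
prior = 4000 / 750_000

accuracy = overall_accuracy(prior, miss_rate=0.02, false_alarm_rate=0.044)

# Share of all errors that come from false alarms under these assumptions.
false_alarm_share = ((1.0 - prior) * 0.044) / (1.0 - accuracy)

print(f"boundary prior: {prior:.4f}")
print(f"accuracy: {accuracy:.3f}")
print(f"false-alarm share of errors: {false_alarm_share:.3f}")
```

Under these assumptions the computed accuracy comes out at roughly 0.956, in line with the 95.6% on the slide, and nearly all of the error mass comes from false alarms because non-boundary words outnumber boundaries by more than 100 to 1.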
Voting Against False Alarms • Error analysis: • Construct per-feature classifiers: • Prosody-only, text-only, silence-only • Compare classifiers: per-feature vs. joint • Joint classifier supported by only 0 or 1 per-feature classifiers: usually a FALSE ALARM • Approach: Voting • Require joint + 2 per-feature classifiers to agree • Result: 1/3 reduction in false alarms • ~97% accuracy: 2.8% miss, 3.15% false alarm
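The voting rule can be sketched as a small filter applied on top of the classifier outputs. This is an illustrative sketch: the function name, the dictionary keys, and the reading of "joint + 2" as "at least 2 of the 3 per-feature classifiers" are assumptions.

```python
def vote(joint, per_feature, min_support=2):
    """Accept a boundary only if the joint classifier predicts one AND
    at least `min_support` per-feature classifiers agree.

    joint: bool prediction of the joint (text + prosody) classifier.
    per_feature: dict mapping classifier name -> bool prediction.
    ASSUMPTION: "joint + 2" means at least 2 supporting classifiers.
    """
    if not joint:
        return False
    support = sum(1 for predicted in per_feature.values() if predicted)
    return support >= min_support


# Joint fires with two supporters: kept as a boundary.
kept = vote(True, {"prosody": True, "text": False, "silence": True})

# Joint fires with only one supporter: treated as a likely false alarm.
rejected = vote(True, {"prosody": False, "text": False, "silence": True})
```

The filter can only remove boundary predictions, never add them, which matches the reported trade-off: fewer false alarms at the cost of a slightly higher miss rate.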
Conclusion • Mandarin broadcast news segmentation • Identify topicality and boundary locality • Integrate text and acoustic cues • Text similarity: vector space model, n-gram cues • Prosodic cues: silence, intensity, pitch, duration • Robust across a range of languages • Provide supporting and orthogonal information • Majority agreement of per-feature classifiers: • 1/3 fewer false alarms
Current & Future Work • Improving the model of topicality • Richer text similarity models; broader acoustic models • Alternative classifiers • Preliminary experiments: • Boosting, boosted decision trees, MaxEnt • Comparable results • Alternative integration strategies • Hierarchical subtopic segmentation • Broadcast news • Dialogue: human-computer, human-human • Integration with multi-modal features: e.g. gesture, gaze
Acoustic-Prosodic Contrasts • [Figure: English normalized pitch] • [Figure: English normalized intensity] • [Figure: Mandarin normalized pitch] • [Figure: Mandarin normalized intensity]
The Problem: Speech Topic Segmentation • Separate audio stream into component topics • Example (|| marks a topic boundary): On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit. ||