Speech Summarization Sameer R. Maskey
Summarization • ‘the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)’ [Mani and Maybury, 1999]
Indicative or Informative • Indicative • Suggests contents of the document • Better suited to searchers • Informative • Meant to represent the document • Better suited to users who want an overview
Speech Summarization • Speech summarization entails ‘summarizing’ speech • Identify important information relevant to users and the story • Represent the important information • Present the extracted/inferred information as an addition to or substitute for the story
Are Speech and Text Summarization similar? • NO! • Speech Signal • Prosodic features • NLP tools? • Segments – sentences? • Generation? • ASR transcripts • Data size • Yes • Identifying important information • Some lexical, discourse features • Extraction
Text vs. Speech Summarization (NEWS) Speech Signal Speech Channels - phone, remote satellite, station Transcripts - ASR, Close Captioned Error-free Text Transcript- Manual Many Speakers - speaking styles Lexical Features Some Lexical Features Segmentation -sentences Structure -Anchor, Reporter Interaction Story presentation style Prosodic Features -pitch, energy, duration NLP tools Commercials, Weather Report
Speech Summarization (NEWS) Speech Signal Speech Channels - phone, remote satellite, station Transcripts - ASR, Close Captioned Error-free Text Transcript- Manual Many Speakers - speaking styles Lexical Features Some Lexical Features Segmentation -sentences Structure -Anchor, Reporter Interaction Story presentation style Prosodic Features -pitch, energy, duration many NLP tools Commercials, Weather Report
Why speech summarization? • Multimedia production and size are increasing: need less time-consuming ways to archive, extract, use and browse speech data - speech summarization, a possible solution • Due to the temporal nature of speech, it is difficult to scan like text • User-specific summaries of broadcast news are useful • Summarizing voicemails can help us organize them better
[Salton, et al., 1995] Sentence Extraction Similarity Measures [McKeown, et al., 2001] Extraction Training w/ manual Summaries SOME SUMMARIZATION TECHNIQUES BASED ON TEXT (LEXICAL FEATURES) [Hovy & Lin, 1999] Concept Level Extract concepts units [Witbrock & Mittal, 1999] Generate Words/Phrases [Maybury, 1995] Use of Structured Data
Summarization by sentence extraction with similarity measures [Salton, et al., 1995] • Many present day techniques involve sentence extraction • Extract sentences that are similar to the topic sentence or dissimilar to the summary built so far (Maximal Marginal Relevance) • Find sentences similar to the topic sentence • Various similarity measures [Salton, et al., 1995] • Cosine Measure • Vocabulary Overlap • Topic words overlap • Content Signatures Overlap
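The extraction idea above can be sketched as follows. This is only an illustration, assuming whitespace tokenization, raw term counts for the cosine measure, and a trade-off weight `lam`; the function names are hypothetical and not taken from the cited work.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_extract(sentences, topic, n, lam=0.7):
    """Pick n sentences that resemble the topic sentence but not each other.

    lam weights topic similarity against redundancy with the summary so far
    (an assumed parameter for this sketch).
    """
    tokens = [s.lower().split() for s in sentences]
    topic_tokens = topic.lower().split()
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < n:
        def score(i):
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            return lam * cosine(tokens[i], topic_tokens) - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(selected)]
```

With a low `lam` the second pick is pushed away from sentences that repeat the first one; with a high `lam` it stays close to the topic sentence.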
“Automatic text structuring and summarization” [Salton, et al., 1995] • Uses hypertext link generation to summarize documents • Builds intra-document hypertext links • Coherent topic distinguished by separate chunk of links • Remove the links that are not in close proximity • Traverse along the nodes to select a path that defines a summary • Traversal order can be • Bushy Path: constructed out of the n most bushy nodes • Depth first Path: Traverse the most bushy path after each node • Segmented bushy path: construct bushy paths individually and connect them on text level
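A rough sketch of the bushy path, under stated assumptions: nodes are paragraphs, two nodes are linked when their vocabulary overlap (Jaccard) reaches a `threshold`, and "bushiness" is the number of links a node has. The overlap measure, the threshold value, and the function name are illustrative choices, not the paper's exact settings.

```python
def bushy_path(paragraphs, n, threshold=0.2):
    """Return indices of the n most-linked paragraphs, in text order."""
    vocab = [set(p.lower().split()) for p in paragraphs]
    degree = [0] * len(paragraphs)
    for i in range(len(vocab)):
        for j in range(i + 1, len(vocab)):
            union = vocab[i] | vocab[j]
            overlap = len(vocab[i] & vocab[j]) / len(union) if union else 0.0
            if overlap >= threshold:
                # A link adds one to the bushiness of both endpoints.
                degree[i] += 1
                degree[j] += 1
    # Rank by bushiness (ties broken by position), then restore text order.
    ranked = sorted(range(len(paragraphs)), key=lambda i: (-degree[i], i))
    return sorted(ranked[:n])
```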
Summarization by feature based statistical models [Kupiec, et al., 1995] • Build manual summaries using the available annotators • Extract a set of features from the manual summaries • Train the statistical model on the feature values from the manual summaries • Use the trained model to score each sentence in the test data • Extract the ‘n’ highest scoring sentences • Various statistical models/machine learning • Regression Models • Various classifiers • Bayes rule for computing the probability of inclusion by counting [Kupiec, et al., 1995]: $P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$, where S is the summary, F_1, ..., F_k are the k features, and P(F_j) and P(F_j | s ∈ S) can be computed by counting occurrences
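The train/score/extract steps above can be sketched with a counting-based Bayes scorer. This is a minimal illustration assuming binary features, add-one smoothing (an addition of this sketch, to avoid zero probabilities), and at least one summary sentence in the training data; all names are hypothetical.

```python
import math


def train(feature_rows, in_summary):
    """Estimate P(s in S), P(Fj=1) and P(Fj=1 | s in S) by counting."""
    n = len(feature_rows)
    k = len(feature_rows[0])
    n_pos = sum(in_summary)
    prior = n_pos / n
    # Add-one smoothing is an assumption of this sketch.
    p_f = [(sum(r[j] for r in feature_rows) + 1) / (n + 2) for j in range(k)]
    p_f_given_s = [
        (sum(r[j] for r, y in zip(feature_rows, in_summary) if y) + 1) / (n_pos + 2)
        for j in range(k)
    ]
    return prior, p_f, p_f_given_s


def score(row, model):
    """Log of P(s in S) * prod P(Fj | s in S) / prod P(Fj) for one sentence."""
    prior, p_f, p_f_given_s = model
    log_score = math.log(prior)
    for j, value in enumerate(row):
        num = p_f_given_s[j] if value else 1 - p_f_given_s[j]
        den = p_f[j] if value else 1 - p_f[j]
        log_score += math.log(num) - math.log(den)
    return log_score


def extract(rows, model, n):
    """Indices of the n highest scoring sentences, in document order."""
    ranked = sorted(range(len(rows)), key=lambda i: -score(rows[i], model))
    return sorted(ranked[:n])
```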
Summarization by concept/content level extraction and generation [Hovy & Lin, 1999], [Witbrock & Mittal, 1999] • Quite a few text summarizers are based on extracting concepts/content and presenting them as a summary • Concept Words/Themes • Content Units [Hovy & Lin, 1999] • Topic Identification • [Hovy & Lin, 1999] uses Concept Wavefront to build concept taxonomy • Builds concept signatures by finding relevant words in 30000 WSJ documents each categorized into different topics • Phrase concatenation of relevant concepts/content • Sentence planning for generation
Summarization of Structured text database[Maybury, 1995] • Summarization of text represented in a structured form: database, templates • Report generation of a medical history from a database is such an example • Link analysis (semantic relations within the structure) • Domain dependent importance of events
Speech summarization: present • Speech Summarization seems to be mostly based on extractive summarization • Extraction of words, sentences, content units • Some compression methods have also been proposed • Generation as in some text-summarization techniques is not available/feasible • Mainly due to the nature of the content
[Christensen et al., 2004] Sentence extraction with similarity measures [Hori C. et al., 1999, 2002] , [Hori T. et al., 2003] Word scoring with dependency structure SPEECH SUMMARIZATION TECHNIQUES [Koumpis & Renals, 2004] Classification [He et al., 1999] User access information [Zechner, 2001] Removing disfluencies [Hori T. et al., 2003] Weighted finite state transducers
Content/Context sentence level extraction for speech summary[Christensen et al., 2004] • These are commonly used speech summarization techniques: • finding sentences similar to the lead topic sentences • Using position features to find the relevant nearby sentences after detecting the topic sentence where Sim is a similarity measure between two sentences
Weighted finite state transducers for speech summarization [Hori T. et al., 2003] • Speech summarization includes speech recognition, paraphrasing, and sentence compaction integrated into a single Weighted Finite State Transducer (WFST) • Enables the decoder to employ all the knowledge sources in a one-pass strategy • Speech recognition using WFST: $H \circ C \circ L \circ G$, where H is the state network of triphone HMMs, C is the triphone connection rules, L is the pronunciation lexicon and G is the trigram language model • Paraphrasing can be looked at as a kind of machine translation with translation probability P(W|T), where W is the source language and T is the target language • If S is the WFST representing translation rules and D is the language model of the target language, speech summarization can be looked at as the composition $H \circ C \circ L \circ G \circ S \circ D$, in which $H \circ C \circ L \circ G$ is the speech recognizer and $S \circ D$ is the translator
User access information for finding salient parts [He et al., 1999] • Idea is to summarize lectures or shows by extracting the parts that have been viewed the longest • Needs multiple users of the same show, meeting or lecture for a statistically significant amount of training data • For summarizing lectures, compute the time spent on each slide • Summarizer based on user access logs did as well as summarizers that used linguistic and acoustic features • Average score of 4.5 on a scale of 1 to 8 for the summarizer (subjective evaluation)
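A small sketch of the time-per-slide idea, assuming the access log is a list of hypothetical `(user, slide_id, seconds_viewed)` records; the log format and function name are assumptions for illustration only.

```python
from collections import defaultdict


def most_viewed_slides(access_log, n):
    """Return the n slide ids with the largest total viewing time, in slide order."""
    total = defaultdict(float)
    for _user, slide, seconds in access_log:
        total[slide] += seconds
    # Rank by total time (ties broken by slide id), then restore slide order.
    ranked = sorted(total, key=lambda s: (-total[s], s))
    return sorted(ranked[:n])
```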
Word level extraction by scoring/classifying words [Hori C. et al., 1999, 2002] • Score each word in the sentence and extract a set of words to form a sentence whose total score is the product/sum of the scores of each word • Example: • Word Significance score I (topic words) • Linguistic Score L (bigram probability) • Confidence Score C (from ASR) • Word Concatenation Score Tr (dependency structure grammar) • $S(V) = \sum_{m=1}^{M} \{ L(v_m \mid \ldots v_{m-1}) + \lambda_I I(v_m) + \lambda_C C(v_m) + \lambda_T Tr(v_{m-1}, v_m) \}$, where M is the number of words to be extracted, and $\lambda_I$, $\lambda_C$, $\lambda_T$ are weighting factors for balancing among L, I, C, and Tr
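One way to picture the word extraction step is a dynamic program that keeps word order and picks the M words with the best summed score. This is a simplified sketch: `word_score` stands in for the per-word terms and `pair_score` for the terms that depend on the previously selected word; both are hypothetical callables supplied by the caller, not the authors' actual scoring functions.

```python
def compact(words, word_score, pair_score, m):
    """Choose m words (order preserved) maximizing word plus adjacent-pair scores."""
    n = len(words)
    if not 0 < m <= n:
        raise ValueError("m must be between 1 and len(words)")
    neg_inf = float("-inf")
    # best[k][j]: best score of a selection of k+1 words ending at position j.
    best = [[neg_inf] * n for _ in range(m)]
    back = [[None] * n for _ in range(m)]
    for j in range(n):
        best[0][j] = word_score(words[j])
    for k in range(1, m):
        for j in range(k, n):
            for i in range(k - 1, j):
                cand = (
                    best[k - 1][i]
                    + word_score(words[j])
                    + pair_score(words[i], words[j])
                )
                if cand > best[k][j]:
                    best[k][j] = cand
                    back[k][j] = i
    # Trace back from the best final position.
    j = max(range(m - 1, n), key=lambda idx: best[m - 1][idx])
    chosen = []
    for k in range(m - 1, -1, -1):
        chosen.append(words[j])
        j = back[k][j]
    return chosen[::-1]
```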
Assumptions • There are a few assumptions made in the previously mentioned methods • Segmentation • Information Extraction • Automatic Speech Recognition • Manual Transcripts • Annotation
Speech Segmentation? • Segmentation • Sentences • Stories • Topic • Speaker • Sentences • Topics • Features • Techniques • Evaluation speech segmentation text Extraction • Text Retrieval Methods • on ASR Transcripts
Information Extraction from Speech Data? • Information Extraction • Named Entities • Relevant Sentences and Topics • Weather/Sports Information • Sentences • Topics • Features • Techniques • Evaluation speech segmentation text Extraction • Text Retrieval Methods • on ASR Transcripts
Audio segmentation Audio Segmentation Topics Story Sentences Weather Commercials Gender Speaker Speaker Types
Audio segmentation methods • Can be roughly divided into two categories • Language Models [Dharanipragada, et al., 1999], [Gotoh & Renals, 2000], [Maybury, 1998], [Shriberg, et al., 2000] • Prosody Models [Gotoh & Renals, 2000], [Meinedo & Neto, 2003], [Shriberg, et al., 2000] • Different methods work better for different purposes and different styles of data [Shriberg, et al., 2000] • Discourse cues based method is highly effective in broadcast news segmentation [Maybury, 1998] • Prosodic model outperforms most of the pure language modeling methods [Shriberg, et al., 2000], [Gotoh & Renals, 2000] • A combined model using NLP techniques on ASR transcripts together with prosodic features seems to work the best
Overview of a few algorithms: statistical model [Gotoh & Renals, 2000] • Sentence Boundary Detection: Finite State Model that extracts boundary information from text and audio sources • Uses Language and Pause Duration Model • Language Model: Represents the boundary as two classes, “last word” or “not last word” • Pause Duration Model: • Prosodic features strongly affected by word • Two models can be combined • Prosody Model outperforms language model • Combined model outperforms both
Segmentation using discourse cues[Maybury, 1998] • Discourse Cues Based Story Segmentation • Sentence segmentation is not possible with this method • Discourse Cues in CNN • Start of Broadcast • Anchor to Reporter Handoff, Reporter to Anchor Handoff • Cataphoric Segment (still ahead of this news) • Broadcast End • Time Enhanced Finite State Machine to represent discourse states such as anchor, reporter, advertisement, etc • Other features used are named entities, part of speech, discourse shifts “>>” speaker change, “>>>” subject change
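The closed-caption markers mentioned above can already give a coarse segmentation. The sketch below uses only the “>>>” (subject change) and “>>” (speaker change) cues; it is not the time enhanced finite state machine of the cited work, and the function name is hypothetical.

```python
def segment_closed_captions(text):
    """Split caption text into stories ('>>>') and, within each, speaker turns ('>>')."""
    stories = []
    # Splitting on '>>>' first leaves only speaker-change markers inside each story.
    for story in text.split(">>>"):
        turns = [turn.strip() for turn in story.split(">>") if turn.strip()]
        if turns:
            stories.append(turns)
    return stories
```

Each returned story is a list of speaker turns, so anchor-to-reporter handoffs show up as turn boundaries inside a story.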
Speech Segmentation • Segmentation methods essential for any kind of extractive speech summarization • Sentence Segmentation in speech data is hard • Prosody Model usually works better than Language Model • Different prosody features useful for different kinds of speech data • Pause features essential in broadcast news segmentation • Phone duration essential in telephone speech segmentation • Combined linguistic and prosody model works the best
Information Extraction from Speech • Different types of information need to be extracted depending on the type of speech data • Broadcast News: • Stories [Merlino, et al., 1997] • Named Entities [Miller, et al., 1999] , [Gotoh & Renals, 2000] • Weather information • Meetings • Main points by a particular speaker • Address • Dates • Voicemail • Phone Numbers [Whittaker, et al., 2002] • Caller Names [Whittaker, et al., 2002]
Statistical model for extracting named entities [Miller, et al., 1999], [Gotoh & Renals, 2000] • Statistical Framework: V denotes the vocabulary and C the set of name classes • Modeling class information as word attribute: Denote e=<c, w> and model using • In the above equation, ‘e’ for the same word with two different classes is considered different. This brings a data sparsity problem • Maximum likelihood estimates by frequency counts • Most probable sequence of class names by the Viterbi algorithm • Precision and recall of 89% for manual transcripts with explicit modeling
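The Viterbi step can be illustrated with a toy decoder. This sketch assumes a simplified HMM-style factorization (class transition probabilities and per-class word emission probabilities held in plain dictionaries) and a small probability `floor` for unseen events; these are illustrative assumptions, not the exact model of the cited papers, and the input must contain at least one word.

```python
import math


def viterbi(words, classes, start, trans, emit, floor=1e-6):
    """Most probable class sequence for a non-empty word list.

    start[c], trans[(c_prev, c)] and emit[(c, w)] are probabilities;
    missing entries fall back to `floor`.
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    scores = {c: lp(start, c) + lp(emit, (c, words[0])) for c in classes}
    backpointers = []
    for w in words[1:]:
        new_scores, pointers = {}, {}
        for c in classes:
            prev = max(classes, key=lambda p: scores[p] + lp(trans, (p, c)))
            new_scores[c] = scores[prev] + lp(trans, (prev, c)) + lp(emit, (c, w))
            pointers[c] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final class.
    path = [max(classes, key=lambda c: scores[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```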
Named entity extraction results [Miller, et al., 1999] • Figure: BBN Named Entity Performance as a function of WER [Miller, et al., 1999]
Information Extraction from Speech • Information Extraction from speech data essential tool for speech summarization • Named Entities, phone number, speaker types are some frequently extracted entities • Named Entity tagging in speech is harder than in text because ASR transcript lacks punctuation, sentence boundaries, capitalization, etc • Statistical models perform reasonably well on named entity tagging
Speech Summarization at Columbia • We make a few assumptions in segmentation and extraction • Some new techniques proposed • 2-level summary • Headlines for each story • Summary for each story • Summarization Client and Server model
Speech Summarization (NEWS) Speech Signal ACOUSTIC Speech Channels - phone, remote satellite, station Transcripts - ASR, Close Captioned Error-free Text Transcript- Manual LEXICAL Many Speakers - speaking styles Lexical Features Some Lexical Features DISCOURSE Segmentation -sentences Structure -Anchor, Reporter Interaction Story presentation style Prosodic Features -pitch, energy, duration STRUCTURAL many NLP tools Commercials, Weather Report
Speech Summarization Transcripts INPUT: + /Sentence Segmentation, Speaker Identification, Speaker Clustering, Manual Annotation, ACOUSTIC LEXICAL DISCOURSE STRUCTURAL Story Named Entity Detection, POS tagging 2-Level Summary Headlines Summary
Corpus • Topic Detection and Tracking Corpus (TDT-2) • We are using 20 “CNN Headline shows” for summarization • 216 stories in total • 10 hours of speech data • Using Manual transcripts, Dragon and BBN ASR transcripts
Annotations - Entities • We want to detect – • Headlines • Greetings • Signoff • SoundByte • SoundByte-Speaker • Interviews • We annotated all of the above entities and the named entities (person, place, organization)
Annotations – by Whom and How? • We created a labeling manual following ACE standards • Annotated by 2 annotators over the course of a year • 48 hours of CNN Headline News in total • We built a labeling interface, dLabel v2.5, that went through 3 revisions for this purpose
Annotations – ‘Building Summaries’ • 20 CNN shows annotated for extractive summary • A Brief Labeling Manual • No detailed instruction on what to choose and what not to? • We built a web-interface for this purpose, where the annotator can click on sentences to be included in the summary • Summaries stored in a MySQL database
Acoustic Features • F0 features • max, min, mean, median, slope • Change in pitch may be a topic shift • RMS energy feature • max, min, mean • Higher amplitude probably means a stress on the phrases • Duration • Length of sentence in seconds (endtime – starttime) • Very short or a long sentence might not be important for summary • Speaker Rate • how fast the speaker is speaking • Slower rate may mean more emphasis in a particular sentence
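The per-sentence acoustic features listed above can be sketched as below. The inputs are assumed to be already extracted frame-level F0 and RMS values for one sentence; the slope is taken as a least-squares fit against frame index and the speaker rate as words per second, both of which are assumptions of this sketch, as are the function and key names.

```python
import statistics


def slope(values):
    """Least-squares slope of values against their frame index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def acoustic_features(f0, rms, start_time, end_time, n_words):
    """Summary statistics for one sentence from frame-level F0 and RMS energy."""
    duration = end_time - start_time
    return {
        "f0_max": max(f0),
        "f0_min": min(f0),
        "f0_mean": statistics.mean(f0),
        "f0_median": statistics.median(f0),
        "f0_slope": slope(f0),
        "rms_max": max(rms),
        "rms_min": min(rms),
        "rms_mean": statistics.mean(rms),
        "duration": duration,
        # Words per second; one possible definition of speaker rate.
        "speaking_rate": n_words / duration if duration > 0 else 0.0,
    }
```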
Acoustic Features – Problems in Extraction • What should be the segment to extract these features – sentences, turns, stories? • We do not have sentence boundaries. • A dynamic programming aligner to align manual sentence boundaries with ASR transcripts • Feature values need to be normalized by speaker: used Speaker Cluster ID available from BBN ASR
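Speaker normalization could look like the following sketch, which z-scores each feature value against the other values from the same speaker cluster. The slides do not say which normalization was used, so the z-score is an assumption, as is the function name.

```python
import statistics
from collections import defaultdict


def normalize_by_speaker(values, speaker_ids):
    """Z-score each value using the mean and std of its own speaker cluster."""
    by_speaker = defaultdict(list)
    for value, speaker in zip(values, speaker_ids):
        by_speaker[speaker].append(value)
    stats = {
        speaker: (statistics.mean(vals), statistics.pstdev(vals))
        for speaker, vals in by_speaker.items()
    }
    normalized = []
    for value, speaker in zip(values, speaker_ids):
        mean, std = stats[speaker]
        # A cluster with no variation (e.g. a single sentence) maps to 0.
        normalized.append((value - mean) / std if std > 0 else 0.0)
    return normalized
```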
Lexical Features • Named Entities in a sentence • Person • Place • Organization • Total count of named entities • Num. of words in a sentence • Num. of words in previous and next sentence
Lexical Features - Issues • Using Manual Transcript • Sentence boundary detection using Ratnaparkhi’s MXTERMINATOR • Named Entities annotated • For ASR transcript: • Sentence boundaries aligned • Automatic Named Entities detected using BBN’s IdentiFinder • Many NLP tools fail when used with ASR transcript
Structural Features • Position • Position of the sentence in the story and the turn • Turn position in the show • Speaker Type • Reporter or Not • Previous and Next Speaker Type • Change in Speaker Type
Discourse Feature • Given-New Feature Value • Computed using the following equation: $S(i) = \frac{n_i}{d} - \frac{s_i}{t - d}$, where n_i is the number of ‘new’ noun stems in sentence i, d is the total number of unique nouns, s_i is the number of noun stems that have already been seen, t is the total number of nouns • Intuition: • ‘newness’ ~ more new unique nouns in the sentence (n_i/d) • If many nouns already seen in the sentence ~ higher ‘givenness’ s_i/(t-d)
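A sketch of the given-new value, assuming the score is newness minus givenness, n_i/d - s_i/(t-d), which is one reading of the intuition above. The input is assumed to be the noun stems of each sentence, already extracted and stemmed; a stem counts as ‘new’ the first time it appears in the story. The function name is hypothetical.

```python
def given_new_scores(noun_stems_per_sentence):
    """Return one given-new score per sentence of a story."""
    all_stems = [stem for sent in noun_stems_per_sentence for stem in sent]
    t = len(all_stems)       # total number of nouns
    d = len(set(all_stems))  # number of unique nouns
    seen = set()
    scores = []
    for sent in noun_stems_per_sentence:
        n_i = s_i = 0
        for stem in sent:
            if stem in seen:
                s_i += 1
            else:
                n_i += 1
                seen.add(stem)
        newness = n_i / d if d else 0.0
        # When every noun is unique (t == d) there is nothing 'given'.
        givenness = s_i / (t - d) if t > d else 0.0
        scores.append(newness - givenness)
    return scores
```

Under this reading, early sentences that introduce many nouns score high, while later sentences that only repeat known nouns score negative.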