
Searching and Summarizing Speech

Explore the challenges of searching and summarizing audio data and discover tools that facilitate audio browsing and retrieval. Learn about speech summarization techniques and the application of information extraction from speech.


Presentation Transcript


  1. Searching and Summarizing Speech Julia Hirschberg CS 6998

  2. Today • Speech browsing and search • Speech summarization: 2 views • Hori et al • Barzilay et al • Speech data mining

  3. Searching Audio Data • Today, large amounts of audio data available: on the web, in company archives, in our homes • But what can we do with it? • We have tools supporting random access to text – but for audio we’re limited to serial search • Goal: tools to search audio as easily as text

  4. Why? • Searching online news and archives • Searching a/v archives, movies • Searching trial recordings and legislative sessions • Browsing meetings, customer care exchanges, focus groups • Telephone calls and voicemail

  5. Audio Browsing/Retrieval for Voicemail • Motivated by interviews, surveys and usage logs of heavy users: • Hard to scan new msgs to find those you need to deal with quickly • Hard to find msg you want in archive • Hard to locate information you want in any msg • How could we help?

  6. SCANMail Architecture [Figure: architecture diagram; surviving labels: Caller, SCANMail, Subscriber]

  7. Corpus Collection • Recordings collected from 138 AT&T Labs employees’ mailboxes • 100 hours; 10K msgs; 2500 speakers • Gender balanced; 12% non-native speakers • Mean message duration 36.4 secs, median 30.0 secs • Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)
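
As a quick consistency check on the corpus figures above, the message count and mean duration can be multiplied out and compared with the stated 100 hours. This is only arithmetic on the numbers quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
num_messages = 10_000      # "10K msgs"
mean_duration_s = 36.4     # mean message duration, in seconds
num_speakers = 2500

# Total audio implied by count x mean duration, converted to hours.
total_hours = num_messages * mean_duration_s / 3600
# Average number of messages left by each speaker.
msgs_per_speaker = num_messages / num_speakers

print(f"Implied total audio: {total_hours:.1f} hours")
print(f"Average messages per speaker: {msgs_per_speaker:.1f}")
```

The implied total is close to the 100 hours reported, so the figures are mutually consistent.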

  8. Transcription and Bracketing [ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [ .hn ] well J2 actually offered to take J home with her and then would she

  9. would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
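
The bracketed transcript above mixes labeled spans (`[ Label: text ]`) with unlabeled fillers and noise marks (`[ um ]`, `[ .hn ]`). A minimal sketch of how such a transcript might be parsed follows. It assumes the notation is exactly as it appears on these two slides; the function name `parse_bracketed` and the decision to drop unlabeled spans are illustrative choices, not part of SCANMail.

```python
import re

# Matches "[ Label: text ]" (labeled span) or "[ text ]" (unlabeled filler/noise).
TAG = re.compile(r"\[\s*(?:(\w+):\s*)?([^\[\]]*?)\s*\]")

def parse_bracketed(text):
    """Return (entities, plain_text) for a bracket-annotated transcript.

    entities   -- list of (label, text) pairs for labeled spans
    plain_text -- transcript with brackets removed; labeled text is kept,
                  unlabeled fillers and noise marks are dropped (an assumption)
    """
    entities = []

    def replace(match):
        label, content = match.group(1), match.group(2)
        if label is None:
            return ""                      # filler or noise: drop it
        entities.append((label, content))
        return content                     # keep the words of a labeled span

    plain = TAG.sub(replace, text)
    return entities, " ".join(plain.split())  # collapse leftover whitespace

# Excerpt taken from the transcript on the slide.
sample = ("[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] "
          "right away cos there's [ .hn ] I guess there's some [ .hn ] "
          "change [ Date: tomorrow ] with the nursery school")
entities, plain = parse_bracketed(sample)
print(entities)
print(plain)
```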

  10. SCANMail Demo http://www.fancentral.org/~isenhour/scanmail/demo.html Audix extension: 8380 Audix password: (null)

  11. Information Extraction from Speech • Jansche & Abney ‘02

  12. Speech Summarization: Extraction Techniques • Hori et al ‘02 • Inoue et al ‘04

  13. Domain Specific Summarization (Barzilay et al ‘00) • Motivation: lab experiments show little facilitation of speech summarization by techniques that do improve search • Domain: Broadcast News • Idea: knowing what type of speaker (anchor, reporter, interviewee) is speaking provides structural clues that can “outline” the newscast since programs are predictable

  14. SCAN: Spoken Content-based Audio Navigator • TREC SDR corpus of Broadcast News • Segment speech ‘documents’ into audio ‘paratones’ acoustically • Segmentation module trained on hand-labeled discourse structure annotation in another domain • Classify recording conditions, e.g. • Music, telephone bandwidth, wide-band • Run ASR with appropriate acoustic models (~70% word accuracy) • Index (errorful) transcripts using SMART IR
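
The last step above, indexing ASR transcripts for retrieval, can be sketched with a small vector-space model. This is not the SMART system itself; it is a plain TF-IDF plus cosine-similarity illustration, and the segment texts, IDs, and function names are made up for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(segments):
    """segments: dict of segment id -> transcript text.
    Returns (tf-idf vector per segment, idf per term)."""
    tfs = {sid: Counter(tokenize(text)) for sid, text in segments.items()}
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())               # document frequency per term
    n = len(segments)
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {sid: {t: c * idf[t] for t, c in tf.items()}
               for sid, tf in tfs.items()}
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, vectors, idf):
    """Return ids of segments with non-zero similarity, best first."""
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = [(cosine(q_vec, vec), sid) for sid, vec in vectors.items()]
    return [sid for score, sid in sorted(scored, reverse=True) if score > 0]

# Hypothetical ASR output for three audio segments.
segments = {
    "seg1": "the senate voted on the budget bill today",
    "seg2": "heavy rain and flooding hit the coast",
    "seg3": "the budget debate continues in the senate",
}
vectors, idf = build_index(segments)
print(search("senate budget", vectors, idf))
```

Because matching is term-based, a query word that the recognizer got wrong in a segment simply contributes nothing to that segment's score, which is one reason retrieval can tolerate errorful transcripts.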

  15. Results in WYSIAWYH (“What you see is almost what you hear”) GUI • Transcript prosodically formatted • Overview provides abstract structure

  16. [Figure: SCAN system diagram; surviving labels: Broadcast News corpus, Paratone Detector, Acoustic Condition Classification, Recognition, Information Retrieval, SCAN db, GUI]

  17. [Figure: SCAN GUI; surviving labels: Search, Overview, Transcript]

  18. Patterns in Newscasts • Anchors present headlines and introduce stories • Most frequent speakers • Anchor/reporter turn alternation • Reporter/guest turntaking during stories

  19. Data • 35 broadcasts of “All Things Considered” • Human and ASR transcripts (without commercials but with turn boundaries) • Features to predict speaker role • Lexical: ngrams 1-5, explicit introductions (current and prior segment) • Contextual: labels and features of prior turns • Durational: turn length (absolute and relative to previous)
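
The feature groups listed above can be sketched as a per-turn feature extractor. This is a rough illustration under stated assumptions, not the feature code used in the study: the turn representation (a dict with `text` and `duration_s`), the feature names, and the functions are hypothetical, and detection of explicit introductions is left out.

```python
def ngrams(tokens, max_n=5):
    """All word n-grams of length 1..max_n, as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def turn_features(turn, prev_turn=None, prev_label=None):
    """Build a sparse feature dict for one speaker turn.

    turn / prev_turn -- dicts with 'text' and 'duration_s' (assumed format)
    prev_label       -- role assigned to the previous turn, if any
    """
    tokens = turn["text"].lower().split()
    # Lexical: n-grams of length 1 to 5.
    feats = {f"ngram={g}": 1.0 for g in ngrams(tokens)}
    # Durational: absolute turn length.
    feats["length_s"] = turn["duration_s"]
    if prev_turn is not None:
        # Durational: length relative to the previous turn.
        if prev_turn["duration_s"] > 0:
            feats["length_ratio"] = turn["duration_s"] / prev_turn["duration_s"]
        # Contextual: label of the prior turn.
        if prev_label is not None:
            feats[f"prev_label={prev_label}"] = 1.0
    return feats

# Hypothetical turns for illustration only.
prev = {"text": "our correspondent has the story", "duration_s": 6.0}
curr = {"text": "thank you the vote came late", "duration_s": 12.0}
features = turn_features(curr, prev_turn=prev, prev_label="anchor")
print(len(features), features["length_ratio"])
```

A dict of named features like this can be fed to most off-the-shelf classifiers after vectorization.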

  20. Methods and Results • Boosting and maximum entropy --> simple weighted rules to predict speaker role • Baseline: guess anchor (35.4%) • Result on human transcripts: • BoostTexter 79% • MaxEnt 80.5% • Result on ASR transcripts: • BoostTexter 72.8% • MaxEnt 77%
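
The accuracies above can be compared more directly by computing each classifier's gain over the guess-anchor baseline and its drop when moving from human to ASR transcripts. The sketch below only does arithmetic on the figures quoted on the slide; the dictionary layout and function name are illustrative.

```python
# Accuracies (%) quoted on the slide.
baseline = 35.4
results = {
    "BoostTexter": {"human": 79.0, "asr": 72.8},
    "MaxEnt": {"human": 80.5, "asr": 77.0},
}

def summarize(results, baseline):
    """Per classifier: points gained over baseline (human transcripts)
    and points lost when switching to ASR transcripts."""
    rows = {}
    for name, acc in results.items():
        rows[name] = {
            "gain_over_baseline": round(acc["human"] - baseline, 1),
            "drop_on_asr": round(acc["human"] - acc["asr"], 1),
        }
    return rows

for name, row in summarize(results, baseline).items():
    print(name, row)
```

On these figures MaxEnt loses fewer points than BoostTexter when run on ASR output, though both remain well above the baseline.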

  21. Speech Data Mining • How does it differ from text data mining? • Maskey et al ‘04
