
Data Mining, Information Extraction and Search in Spoken Documents

Data Mining, Information Extraction and Search in Spoken Documents. Julia Hirschberg CS 4706. Today. Data mining from text Searching audio data instead of text Information extraction from spoken documents Speech data mining. Data Mining.


Presentation Transcript


  1. Data Mining, Information Extraction and Search in Spoken Documents Julia Hirschberg CS 4706

  2. Today • Data mining from text • Searching audio data instead of text • Information extraction from spoken documents • Speech data mining

  3. Data Mining • Discovery of trends and patterns across very large datasets, usually for decision-making purposes • Fraud detection in banking, telephony • Stock market • Indications of demographic disasters • New causes of diseases • …finding things you don’t know you’re looking for • Information retrieval vs. ‘mining for nuggets’

  4. Data Mining in Computational Linguistics • Finding lexical co-occurrence information • Finding parallel text corpora on the web for MT • Finding ‘new’ topics in news stories • TDT task • Exploring citation links: • Networks of influence • Information extraction, e.g. find mutual acquaintances

  5. Snowball (Agichtein et al ’01): • Seed set of patterns (e.g. Norman Mailer, 59 → <firstname> <lastname>, <age>; the 59-year-old Mailer → the <age>-year-old <lastname>) • Find more patterns by looking for e.g. Mailer close to 59 • Mailer turned 59 last week. • Though Mailer is 59…
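The bootstrapping idea on the Snowball slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the seed patterns are written as regular expressions, the helper names (`extract_pairs`, `induce_patterns`) and the three-sentence corpus are hypothetical, and the confidence scoring that Snowball applies to patterns and tuples is left out.

```python
import re

# Seed patterns from the slide, expressed as regexes (an assumed encoding).
SEED_PATTERNS = [
    # <firstname> <lastname>, <age>
    re.compile(r"(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+), (?P<age>\d{1,3})\b"),
    # the <age>-year-old <lastname>
    re.compile(r"\bthe (?P<age>\d{1,3})-year-old (?P<last>[A-Z][a-z]+)"),
]

def extract_pairs(sentences, patterns):
    """Apply the patterns and collect (lastname, age) pairs."""
    pairs = set()
    for sentence in sentences:
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                pairs.add((match.group("last"), match.group("age")))
    return pairs

def induce_patterns(sentences, pairs):
    """Turn sentences where a known pair co-occurs into new templates."""
    templates = set()
    for sentence in sentences:
        for last, age in pairs:
            last_re = rf"\b{re.escape(last)}\b"
            age_re = rf"\b{re.escape(age)}\b"
            if re.search(last_re, sentence) and re.search(age_re, sentence):
                template = re.sub(last_re, "<lastname>", sentence)
                template = re.sub(age_re, "<age>", template)
                templates.add(template)
    return templates

# Hypothetical mini-corpus built from the slide's own examples.
corpus = [
    "Norman Mailer, 59, spoke on Tuesday.",
    "Mailer turned 59 last week.",
    "Though Mailer is 59, he still writes daily.",
]
pairs = extract_pairs(corpus, SEED_PATTERNS)
new_templates = induce_patterns(corpus, pairs)
```

Here the seed pattern finds the pair (Mailer, 59), and the sentences in which that pair co-occurs yield candidate templates such as "<lastname> turned <age> last week."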

  6. But Searching Audio Data is Harder • Large amounts of audio data available: on the web, in company archives, in our homes • We have tools supporting random access to text – but for audio we’re limited to serial search • How can we develop methods to search audio as easily as text?

  7. Applications • Searching online TV and radio news and archives • Library of Congress • Searching a/v archives, movies • Searching trial recordings and legislative sessions • Searching meetings, customer care exchanges, focus groups • Telephone calls and voicemail

  8. Current Approach • Train/adapt a speech recognizer for the corpus • Produce an ASR transcript • Segment spoken ‘documents’ into sentences, turns, topics • Index (errorful) transcripts for Information Retrieval and link to audio via timestamps • Enables audio search by content
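The indexing step above (index the transcript, link back to audio via timestamps) can be illustrated with a small inverted index. This is a sketch under assumptions: the ASR output is taken to be a list of (word, start time in seconds) pairs per document, and the names `build_index`, `search`, and the sample document ids are hypothetical.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the (doc_id, start_seconds) positions where it was recognized."""
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index

def search(index, query):
    """Return sorted (doc_id, start_seconds) hits for the first query term,
    restricted to documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    docs = {doc_id for doc_id, _ in index.get(terms[0], [])}
    for term in terms[1:]:
        docs &= {doc_id for doc_id, _ in index.get(term, [])}
    return sorted(hit for hit in index.get(terms[0], []) if hit[0] in docs)

# Hypothetical time-stamped ASR output for two broadcasts.
transcripts = {
    "news_01": [("good", 12.4), ("evening", 12.7), ("budget", 30.1), ("surplus", 30.6)],
    "news_02": [("budget", 5.0), ("talks", 5.4)],
}
index = build_index(transcripts)
hits = search(index, "budget surplus")
```

Each hit carries a timestamp, so a player could jump straight to that offset in the audio. Because the transcript is errorful, a misrecognized query word simply will not be found; this sketch does nothing to compensate for that.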

  9. Some Examples • SpeechBot: searching internet broadcasts • Google Voice Search: search audio by voice (not yet) • SCANMail: searching voicemail

  10. Information Extraction and QA from Speech • DARPA GALE project: improve information gathering from text, speech, translations • Current Domain: newswire and news broadcasts in English, Arabic, and Mandarin • 3 competing teams • ASR/MT bakeoffs • ‘Distillation’ evaluations • QA • User studies • Requires identification and annotation of information and ‘formatting’ in speech

  11. Sample Distillation Questions • List facts about <event> • Find people who are mutual acquaintances of <person1> and <person2> • Identify persons arrested from <organization> and give their name and role in that organization • Produce a biography of <person> • Provide information on <organization> • Find statements made by or attributed to <person> about <topic> • How did <country> react to <event>

  12. Nightingale Architecture • [Architecture diagram; component labels: Automatic Annotation, Distillation, Speaker modeling, Information assimilation, MT, ASR, Audio diarization, Prosodic metadata, Target Language, Punctuation, Capitalization, Source Language, Info repository, Linguistic structure, Prosodic analysis, Names, Relations, Intelligence delivery, Topic modeling]

  13. Information Annotation • Spoken documents … • Lack many cues found in text documents • Format (sentences, turns, paragraphs) • Include spontaneous speech phenomena which are difficult for ASR and NLP technologies to handle • Disfluencies, fragments • Contain errors • Annotation can turn a weakness into a strength

  14. From an ASR Transcript • aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

  15. To Speaker Segmentation (Diarization) • Speaker: 0 - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston • Speaker: 1 - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

  16. Add Speaker Role Labels • Anchor - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston • Reporter - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

  17. Perform Sentence Detection and Punctuation • Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston. • Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

  18. Detect Story Boundaries • Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston. • Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

  19. Detect Disfluencies (and Keep/Remove) • Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston. • Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.
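As a rough illustration of the "remove" option for disfluencies, the sketch below drops filled pauses and immediate word repetitions such as "uh" and "was was" from the transcript. It is a rule-based stand-in, not the detection method used in the lecture's system; the filler list and the function name `remove_disfluencies` are assumptions.

```python
# Assumed list of filled pauses; a real system would learn these cues.
FILLERS = {"uh", "um", "er"}

def remove_disfluencies(tokens):
    """Drop filled pauses and immediately repeated words from a token list."""
    cleaned = []
    for token in tokens:
        low = token.lower()
        if low in FILLERS:
            continue  # filled pause
        if cleaned and cleaned[-1].lower() == low:
            continue  # immediate repetition, e.g. "was was"
        cleaned.append(token)
    return cleaned

example = remove_disfluencies("good evening uh from the university".split())
```

The repetition rule will also delete legitimate repeats (for example "had had"), which is one reason simple rules are not sufficient in practice.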

  20. Detect Named Entities • Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston. • Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush. Claire Shipman is covering the vice president Claire you begin tonight please.

  21. Resolve References • Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston. • Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush [Governor George W. Bush]. Claire Shipman is covering the vice president Claire [Claire Shipman] you begin tonight please.
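The bracketed resolutions in the example above (Governor Bush → Governor George W. Bush, Claire → Claire Shipman) can be approximated with a simple name-matching heuristic: link a shorter mention to the most recent longer name that contains all of its words. This is only a sketch under that assumption; the function name `resolve_references` is hypothetical, and the mention list is assumed to come from an earlier named-entity step.

```python
def resolve_references(mentions):
    """Pair each mention with the most recent earlier full name containing all its words.

    Returns a list of (mention, antecedent) tuples; antecedent is None when
    the mention is itself treated as a new full name.
    """
    full_names = []
    resolved = []
    for mention in mentions:
        parts = mention.split()
        antecedent = None
        for full in reversed(full_names):
            if mention != full and all(p in full.split() for p in parts):
                antecedent = full
                break
        resolved.append((mention, antecedent))
        # Only unresolved multi-word mentions become candidate antecedents.
        if antecedent is None and len(parts) > 1:
            full_names.append(mention)
    return resolved

# Person mentions from the reporter's turn, in order of appearance.
mentions = [
    "Al Gore", "Governor George W. Bush", "Jim Lehrer",
    "David Gregory", "Governor Bush", "Claire Shipman", "Claire",
]
links = dict(resolve_references(mentions))
```

The heuristic handles partial-name repeats only; pronouns and descriptions such as "the vice president" would need additional knowledge.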

  22. Speech Data Mining • How does it differ from text data mining? • Must handle errorful transcription • Lacks (reliable) formatting • Contains spontaneous speech phenomena • We need to bring additional sources to bear on the problem

  23. Maskey et al 2004: Improving Proper Name Transcription in Voicemail • How can we improve transcription of proper names without increasing the size of the ASR lexicon? • Use meta-data available at runtime to hypothesize caller’s and callee’s names • Caller ID string – “cname” • Name of mailbox owner – “mname”

  24. Corpus • Scanmail corpus • 100 hours of voicemail messages from 140 employees of AT&T. • Manually transcribed with “cname” and “mname” tags • Gender balanced • ~12% non-native speakers • 238 random messages for testing, rest (~ 10,000 messages) for training

  25. Approach • Create a class-based language model • Create a name network to give instances for the classes of the model • Replace the class-based language model at runtime with the appropriate name networks, identified from the cname and mname of the call
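One way to read the class-based approach above is through the usual class-based factorization, shown here as an assumption since the slides do not give a formula. For a name $w$ that belongs to the caller-name class, given history $h$:

```latex
P(w \mid h) \;\approx\; P(\texttt{cname} \mid h)\; P(w \mid \texttt{cname})
```

The first factor comes from the language model trained on transcripts in which names are replaced by the class tags; the second comes from the name network selected at runtime for that particular call. The same form would apply to the mailbox-owner class "mname".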

  26. Name Network • To get values for “mname” and “cname”, an internal AT&T employee directory listing (~40,000 people) was used • “cname” created from variations of static titles (Miss, Mr), full first names and nicknames (Alexander, Alex), and last names (Jones)

  27. Name Network • Probability within class – training corpus • Probability within first names – AT&T directory listing
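The variant generation described for the name network can be sketched as follows. The exact set of variants used by Maskey et al is not given in the slides, so the combinations below (first name alone, first plus last, title plus last, last alone) and the function name `name_variants` are assumptions, and no probabilities are attached.

```python
def name_variants(first_names, last_name, titles=("Mr", "Miss")):
    """Enumerate surface forms a caller might use for one directory entry.

    first_names holds the full first name and any nicknames,
    e.g. ["Alexander", "Alex"].
    """
    variants = {last_name}
    for first in first_names:
        variants.add(first)                    # "Alex"
        variants.add(f"{first} {last_name}")   # "Alex Jones"
    for title in titles:
        variants.add(f"{title} {last_name}")   # "Mr Jones"
    return variants

variants = name_variants(["Alexander", "Alex"], "Jones")
```

In the approach on the slides, each variant would additionally carry a probability, estimated from the training corpus within a class and from the directory listing within first names.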

  28. Experimental Results • Word Error Rate (WER) improvement small • Absolute reduction of 0.6% • Name Error Rate (NER) improvement significant • Absolute reduction of 20% • Large reduction in NER important: • Getting a name right is important to business users • Scanmail users expressed a strong desire for the system to recognize their own names correctly
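For reference, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal sketch, with a made-up hypothesis string for illustration:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[-1][-1] / len(ref)

# One substitution ("claire" -> "clear") and one deletion ("the"): 2 errors in 7 words.
wer = word_error_rate(
    "claire shipman is covering the vice president",
    "clear shipman is covering vice president",
)
```

An absolute reduction, as reported on the slide, is the plain difference between two such rates (baseline minus improved), not a percentage of the baseline.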

  29. Next Class • HTK Toolkit and HW5 (Fadi Biadsy)
