
NineOneOne: Recognizing and Classifying Speech for Handling Minority Language Emergency Calls



  1. NineOneOne: Recognizing and Classifying Speech for Handling Minority Language Emergency Calls Udhay Nallasamy, Alan W Black, Tanja Schultz, Robert Frederking, [Jerry Weltman, Julio Schrodel] May 2008

  2. Outline • Overview • System design • ASR design • MT design • Current results • ASR results • Classification-for-MT results • Future plans

  3. Project overview • Problem: Spanish 9-1-1 calls handled in slow, unreliable fashion • Tech base: SR/MT far from perfect, but usable in limited domains • Science goal: Speech MT that really gets used • 9-1-1 as likeliest route: • Naturally limited, important, civilian domain • Interested user partner who will really try it • (vs. Diplomat experience…)

  4. Domain Challenges/Opportunities • Challenges: • Real-time required • Random phones • Background noise • Stressed speech • Multiple dialects • Cascading errors • Opportunities: • Speech data source • Strong task constraints • One-sided speech • Human-in-the-loop • Perfection not required

  5. System flow • Spanish caller's speech → Spanish ASR → DA Classifier → Spanish-to-English MT → English text on the Dispatch Board; the English dispatcher's reply goes back out through Spanish TTS to the caller • Example: the caller says "Necesito una ambulancia"; the dispatcher sees the English text "I need ambulance"; opening prompt played to the caller: "Nueve uno uno, ¿cuál es su emergencia?" ("Nine one one, what is your emergency?")

  6. Overall system design • Spanish to English: [no TTS!] • Spanish speech recognized • Spanish text classified (context-dependent?) into DomainActs, arguments spotted and translated • Resulting text displayed to dispatcher • English to Spanish: [no ASR!] • Dispatcher selects output from tree, typing/editing arguments • Very simple "Phraselator"-style MT • System synthesizes Spanish output • Very simple limited-domain synthesizer • HCI work: keeping human in the loop! • role-playing & shadow use
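
  To make the Spanish-to-English half of this design concrete, here is a minimal Python sketch; the function names and stubbed outputs are illustrative assumptions, not the project's actual code:

      def spanish_asr(audio):                 # stub; the real system uses JRTk (slide 14)
          return "necesito una ambulancia"

      def classify_domain_act(spanish_text):  # stub; the real system uses an SVM (slide 16)
          return "Request-ambulance"

      def dispatcher_view(audio):
          # Recognize, classify into a DomainAct, and display text to the dispatcher
          text = spanish_asr(audio)
          return "[%s] %s" % (classify_domain_act(text), text)

      print(dispatcher_view(b"..."))  # -> [Request-ambulance] necesito una ambulancia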

  7. “Ayudame” system mock-up • We plan to interface with call-takers via web browser • Initial planned user scenario follows • The first version will certainly be wrong • One of the axioms of HCI • But iterating through user tests is the best way to get to the right design

  8. Technical plans: ASR • ASR challenges: • Disfluencies • Noisy emotional speech • Multiple dialects, some English • Planned approaches: • Noisy-channel model [Honal et al, Eurospeech03] • Articulatory features • Multilingual grammars, multilingual front end

  9. Technical plans: MT • MT challenges: • Disfluencies in speech • ASR errors • Accuracy/transparency vs. development costs • Planned approaches: adapt and extend • Domain Act classification from Nespole! • Shallow interlingua, speaker intent (not literal) • Report-fire, Request-ambulance, Don't-know, … • Transfer rule system from Avenue • (Both NSF-funded.)

  10. Nespole! Parsing and Analysis Approach • Goal: A portable and robust analyzer for task-oriented human-to-human speech, parsing utterances into interlingua representations • Our earlier systems used full semantic grammars to parse complete DAs • Useful for parsing spoken language in restricted domains • Difficult to port to new domains • Nespole! focus was on improving portability to new domains (and new languages) • Approach: Continue to use semantic grammars to parse domain-independent phrase-level arguments and train classifiers to identify DAs

  11. Example Nespole! representation • Hello. I would like to take a vacation in Val di Fiemme. • c:greeting (greeting=hello) c:give-information+disposition+trip (disposition=(who=i, desire), visit-spec=(identifiability=no, vacation), location=(place-name=val_di_fiemme))

  12. MT differences from Nespole! • Hypothesis: Simpler domain can allow simpler (less expensive) MT approach • DA classification done without prior parsing • We may add argument-recognizers as features, but still cheaper than parsing • After DA classification, identify, parse, and translate simple arguments (addresses, phone numbers, etc.)
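
  A hedged sketch of what such a post-classification argument recognizer might look like for phone numbers; the patterns and digit lexicon here are toy assumptions, not the project's grammars:

      import re

      PHONE = re.compile(r"\b\d{3}[-\s]?\d{3}[-\s]?\d{4}\b")
      SPANISH_DIGITS = {"cero": "0", "uno": "1", "dos": "2", "tres": "3", "cuatro": "4",
                        "cinco": "5", "seis": "6", "siete": "7", "ocho": "8", "nueve": "9"}

      def spot_phone_number(utterance):
          # Digits spoken one by one ("dos tres nueve ...") or an already-written number
          digits = [SPANISH_DIGITS[w] for w in utterance.lower().split() if w in SPANISH_DIGITS]
          if len(digits) == 10:
              return "".join(digits)
          m = PHONE.search(utterance)
          return m.group(0) if m else None

      print(spot_phone_number("dos tres nueve cinco cinco cinco uno dos tres cuatro"))
      # -> 2395551234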

  13. Currently-funded NineOneOne work • Full proposal was not funded • But SGER was funded • Build targeted ASR from 9-1-1 call data • Build classification part of MT system • Evaluate on unseen data, hopefully demonstrating sufficient ASR and classification quality to get follow-on • 18 months, began May 2006 • No-Cost Extension to 24 months

  14. Spanish ASR Details • Janus Recognition Toolkit (JRTk) • CI models initialized from GlobalPhone data (39 phones) • CD models are 3-state, semi-continuous models with 32 Gaussians per state • LM trained on GlobalPhone text corpus (Spanish news – 1.5 million words) • LM is interpolated with the training data transcriptions
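
  A minimal sketch of the interpolation idea, assuming simple linear mixing with the weight tuned on the dev set; the real system interpolates full n-gram models, this illustrates with bare word probabilities:

      import math

      def interpolate(p_news, p_calls, lam):
          # p(w) = lam * p_news(w) + (1 - lam) * p_calls(w)
          return lambda w: lam * p_news(w) + (1.0 - lam) * p_calls(w)

      def perplexity(p, words):
          return math.exp(-sum(math.log(p(w)) for w in words) / len(words))

      def tune_lambda(p_news, p_calls, dev_words):
          # Pick the mixing weight that minimizes dev-set perplexity
          return min((l / 10.0 for l in range(1, 10)),
                     key=lambda l: perplexity(interpolate(p_news, p_calls, l), dev_words))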

  15. ASR Evaluation • Training data – 50 calls (4 hours of speech) • Dev set – 10 calls (for LM interpolation) • Test set – 15 calls (1 hour of speech) • Vocabulary size – 65K words • Test set perplexity – 96.7 • Accuracy of ASR on test set – 76.5% • Good for spontaneous, multi-speaker telephone speech
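
  For reference, the 76.5% word accuracy figure is presumably 1 - WER, with WER computed from a Levenshtein alignment of reference and hypothesis; a standard sketch:

      def word_error_rate(ref, hyp):
          r, h = ref.split(), hyp.split()
          # d[i][j] = min edits to turn the first i ref words into the first j hyp words
          d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
               for i in range(len(r) + 1)]
          for i in range(1, len(r) + 1):
              for j in range(1, len(h) + 1):
                  sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                  d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
          return d[-1][-1] / len(r)

      wer = word_error_rate("necesito una ambulancia", "necesito la ambulancia")
      print(1 - wer)  # word accuracy; 0.666... for this toy pair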

  16. Utterance Classification/Eval • Can we automatically classify utterances into DAs? • Manually classified turns into DAs • 10 labels, 845 labelled turns • WEKA toolkit SVM with simple bag-of-words binary features • Evaluated using 10-fold cross-validation • Overall accuracy 60.1% • But increases to 68.8% ignoring “Others”
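
  The project used WEKA's SVM; here is an analogous sketch in scikit-learn of the same setup (binary bag-of-words features, linear SVM, 10-fold cross-validation), with toy stand-in data rather than the real 845 turns:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.svm import LinearSVC
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline

      turns = ["necesito una ambulancia", "hay un incendio en mi casa",
               "no se que paso", "mi direccion es calle ocho"] * 25
      labels = ["Request-ambulance", "Report-fire", "Dont-know", "Give-address"] * 25

      clf = make_pipeline(CountVectorizer(binary=True),  # presence/absence features
                          LinearSVC())
      scores = cross_val_score(clf, turns, labels, cv=10)  # 10-fold CV
      print(scores.mean())  # the slide reports 60.1% on the real data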

  17. Initial DA Classification

  18. DA classification caveat • But DA classification was done on human transcriptions (also human utterance segmentation) • Classifier accuracy on current ASR transcriptions is 40% (49% w/o “Others”) • Probably needs to be better than that

  19. Future work • Improving ASR • Improving classification on real output: • More labelled training data • Use discourse context in classification • “Query expansion” via synsets from Spanish EuroWordNet • Engineered phone-number-recognizer etc. • Partial (simpler) return to Nespole! approach • Better ASR/classifier matching • Building and user-testing full pilot system

  20. Questions? http://www.cs.cmu.edu/~911/

  21. Class confusion matrix

  22. Argument Parsing • Parse utterances using phrase-level grammars • Nespole! used SOUP Parser (Gavaldà, 2000): Stochastic, chart-based, top-down robust parser designed for real-time analysis of spoken language • Separate grammars based on the type of phrases that the grammar is intended to cover

  23. Domain Action Classification • Identify the DA for each SDU using trainable classifiers • Nespole! used two TiMBL (k-NN) classifiers: • Speech act • Concept sequence • Binary features indicate presence or absence of arguments and pseudo-arguments
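
  Nespole! used TiMBL; the same memory-based (k-NN) idea sketched with scikit-learn, where each SDU is a binary vector over argument and pseudo-argument features. The feature names and training rows here are illustrative:

      import numpy as np
      from sklearn.neighbors import KNeighborsClassifier

      features = ["arg:location", "arg:time", "pseudo:arrival", "pseudo:greeting"]
      X = np.array([[1, 0, 0, 0],   # "in Val di Fiemme"
                    [0, 1, 1, 0],   # "arriving at noon"
                    [0, 0, 0, 1]])  # "hello"
      speech_acts = ["give-information", "give-information", "greeting"]

      knn = KNeighborsClassifier(n_neighbors=1).fit(X, speech_acts)
      print(knn.predict([[0, 0, 0, 1]]))  # -> ['greeting']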

  24. Current status: March 2008 (1) (At end of extended SGER…) • Local Spanish transcribers transcribing HIPAA-sanitized 9-1-1 recordings • CMU grad student (Udhay) • managing transcribers via internal website, • built and evaluated ASR and utt. classifier, • building labelling webpage, prototype, etc. • Volunteer grad student (Weltman, LSU) analyzing, refining, and using classifier labels

  25. Current status: March 2008 (2) • “SGER worked.” • Paper on ASR and classification accepted to LREC-2008 • Two additional 9-1-1 centers sending us data • Submitted follow-on small NSF proposal in December 2007: really build and user-test pilot • Letters of Support from three 9-1-1 centers • Will submit to COLING workshop on safety-critical MT systems

  26. Additional Police Partners • Julio Schrodel (CCPD) successes: • Mesa PD, Arizona • Charlotte-Mecklenburg PD, NC • Much larger cities than Cape Coral • (Each is now bigger than Pittsburgh!) • Uncompressed recordings! • Much larger, more automated 9-1-1 operations • Call-taker vs. dispatcher • User-defined call types logged

  27. Acquired data, as of 3/08 • Miami-Dade County: 5 audio cassettes! • St. Petersburg: 1 call!!

  28. Transcription Status • VerbMobil transcription conventions • TransEdit software (developed by Susi Burger and Uwe Meier) • Transcribed calls: • 97 calls from Cape Coral PD • 13 calls from Charlotte • Transcribed calls playback time: 9.7 hours

  29. LSU work: Better DA tags • Manually analyzed 30 calls to find DAs with widest coverage • Current proposal adds 25 new DAs • Created guidelines for tagging. E.g.: • If the caller answers an open-ended question with multiple pieces of information, tag each piece of information • Currently underway: Use web-based tagging tool to manually tag the calls • Determine inter-tagger agreement
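
  The slides do not name an agreement metric; Cohen's kappa is the usual choice for inter-tagger agreement, sketched here directly from its definition kappa = (p_o - p_e) / (1 - p_e):

      from collections import Counter

      def cohens_kappa(tags_a, tags_b):
          n = len(tags_a)
          p_o = sum(a == b for a, b in zip(tags_a, tags_b)) / n  # observed agreement
          ca, cb = Counter(tags_a), Counter(tags_b)
          p_e = sum(ca[t] * cb[t] for t in ca) / n**2            # chance agreement
          return (p_o - p_e) / (1 - p_e)

      print(cohens_kappa(["Request-ambulance", "Report-fire", "Dont-know"],
                         ["Request-ambulance", "Report-fire", "Report-fire"]))
      # -> 0.5 for this toy pair of taggers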

  30. Sample of Proposed Additional DAs

  31. Project origin • Contacted by Julio Schrodel of Cape Coral PD (CCPD) in late 2003 • Looking for technological solution to shortage of Spanish translation for 9-1-1 calls • Visited CCPD in December 2003 • CCPD very interested in cooperating • Promised us access to 9-1-1 recordings • Designed system, wrote proposal • CCPD letter in support of proposal • Funded starting May 2006 • (SGER, only for ASR and preparatory work)

  32. Articulatory features • Model phone as a bundle of articulatory features such as voiced or bilabial • Less fragmentation of training data • More robust in handling hyper-articulation • Error-rate reduction of 25% [Metze et al, ICSLP02] • Multilingual/crosslingual articulatory features for multilingual settings • Error-rate reduction of 12.3% [Stuecker et al, ICASSP03]

  33. Grammars plus N-grams • Grammar-based concept recognition • Multilingual grammars plus n-grams for efficient multi-lingual decoding [Fuegen et al, ASRU03] • Multilingual acoustic models

  34. Interchange Format • Interchange Format (IF) is a shallow semantic interlingua for task-oriented domains • Utterances represented as sequences of semantic dialog units (SDUs) • IF representation consists of four parts: speaker, speech act, concepts, and arguments • Schematically: speaker : speech-act +concept* +arguments*; the speech act plus the concept sequence make up the Domain Action
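
  A minimal sketch of pulling an IF string apart into those four parts; the nested argument values are kept as a raw string for simplicity, and the regex is an illustrative assumption about the notation:

      import re

      def parse_if(if_string):
          m = re.match(r"(\w+):([\w-]+)((?:\+[\w-]+)*)\s*(?:\((.*)\))?$", if_string)
          speaker, speech_act, concepts, args = m.groups()
          return {"speaker": speaker,
                  "speech_act": speech_act,
                  "concepts": concepts.lstrip("+").split("+") if concepts else [],
                  "arguments": args or ""}

      da = parse_if("c:give-information+disposition+trip (disposition=(who=i, desire))")
      print(da["speech_act"], da["concepts"])
      # -> give-information ['disposition', 'trip']
      # The speech act plus the concepts together form the Domain Action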

  35. Hybrid Analysis Approach • Example: “Hello. I would like to take a vacation in Val di Fiemme.” → c:greeting (greeting=hello) c:give-information+disposition+trip (disposition=(who=i, desire), visit-spec=(identifiability=no, vacation), location=(place-name=val_di_fiemme))

  36. Hybrid Analysis Approach Use a combination of grammar-based phrase-level parsing and machine learning to produce interlingua (IF) representations

  37. Grammars (1) • Argument grammar • Identifies arguments defined in the IF: s[arg:activity-spec=] → (*[object-ref=any] *[modifier=good] [biking]) • Covers "any good biking", "any biking", "good biking", "biking", plus synonyms for all three words • Pseudo-argument grammar • Groups common phrases with similar meanings into classes: s[=arrival=] → (*is *usually arriving) • Covers "arriving", "is arriving", "usually arriving", "is usually arriving", plus synonyms
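
  A toy illustration of the "*" (optional-element) notation: each starred token may be present or absent, which is why the pseudo-argument pattern above covers exactly the four phrases listed (synonyms aside):

      from itertools import product

      def expansions(pattern):
          slots = []
          for tok in pattern.split():
              # A starred token contributes "present" and "absent" alternatives
              slots.append(["", tok[1:]] if tok.startswith("*") else [tok])
          return [" ".join(w for w in combo if w) for combo in product(*slots)]

      print(expansions("*is *usually arriving"))
      # -> ['arriving', 'usually arriving', 'is arriving', 'is usually arriving']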

  38. Grammars (2) • Cross-domain grammar • Identifies simple domain-independent DAs: s[greeting] → ([greeting=first_meeting] *[greet:to-whom=]) • Covers "nice to meet you", "nice to meet you donna", "nice to meet you sir", plus synonyms • Shared grammar • Contains low-level rules accessible by all other grammars

  39. Using the IF Specification • Use knowledge of the IF specification during DA classification • Ensure that only legal DAs are produced • Guarantee that the DA and arguments combine to form a valid IF representation • Strategy: Find the best DA that licenses the most arguments • Trust parser to reliably label arguments • Retaining detailed argument information is important for translation
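
  A sketch of that selection strategy, with a two-entry toy "specification" standing in for the real IF spec:

      IF_SPEC = {  # DA -> arguments it licenses (illustrative, not the real spec)
          "give-information+disposition+trip": {"disposition", "visit-spec", "location"},
          "greeting": {"greeting"},
      }

      def best_da(candidate_das, parsed_args):
          # Prefer the legal DA whose definition licenses the most parsed arguments
          def licensed(da):
              return len(IF_SPEC[da] & parsed_args)
          return max(candidate_das, key=licensed)

      print(best_da(["greeting", "give-information+disposition+trip"],
                    {"disposition", "location"}))
      # -> give-information+disposition+trip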

  40. Avenue Transfer Rule Formalism (I) • A rule carries: type information, part-of-speech/constituent information, alignments, x-side constraints, y-side constraints, and xy-constraints (e.g. ((Y1 AGR) = (X1 AGR))) • Example rule:
      ;SL: the old man, TL: ha-ish ha-zaqen
      NP::NP [DET ADJ N] -> [DET N DET ADJ]
      (
        (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
        ((X1 AGR) = *3-SING)  ((X1 DEF) = *DEF)
        ((X3 AGR) = *3-SING)  ((X3 COUNT) = +)
        ((Y1 DEF) = *DEF)     ((Y3 DEF) = *DEF)
        ((Y2 AGR) = *3-SING)  ((Y2 GENDER) = (Y4 GENDER))
      )

  41. Avenue Transfer Rule Formalism (II) • The same rule, highlighting its constraint types: value constraints (e.g. ((X1 AGR) = *3-SING)) and agreement constraints (e.g. ((Y2 GENDER) = (Y4 GENDER))) • Rule as on the previous slide (;SL: the old man, TL: ha-ish ha-zaqen)
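
  A toy sketch of the reordering this rule performs ([DET ADJ N] -> [DET N DET ADJ]), following the alignments X1 -> Y1 and Y3, X2 -> Y4, X3 -> Y2; the lexicon is illustrative and the feature constraints are ignored:

      LEXICON = {"the": "ha", "old": "zaqen", "man": "ish"}

      def apply_np_rule(det, adj, n):
          t = {w: LEXICON[w] for w in (det, adj, n)}
          # Alignments from the rule: X1 -> Y1 and Y3, X2 -> Y4, X3 -> Y2
          return [t[det], t[n], t[det], t[adj]]

      y = apply_np_rule("the", "old", "man")
      print("%s-%s %s-%s" % tuple(y))  # -> ha-ish ha-zaqen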
