NineOneOne: Recognizing and Classifying Speech for Handling Minority Language Emergency Calls
Udhay Nallasamy, Alan W Black, Tanja Schultz, Robert Frederking, [Jerry Weltman, Julio Schrodel]
May 2008
Outline
• Overview
• System design
  • ASR design
  • MT design
• Current results
  • ASR results
  • Classification-for-MT results
• Future plans
Project overview
• Problem: Spanish 9-1-1 calls handled in a slow, unreliable fashion
• Tech base: SR/MT far from perfect, but usable in limited domains
• Science goal: speech MT that really gets used
• 9-1-1 as the likeliest route:
  • Naturally limited, important, civilian domain
  • Interested user partner who will really try it
  • (vs. Diplomat experience…)
Domain Challenges/Opportunities
• Challenges:
  • Real-time required
  • Random phones
  • Background noise
  • Stressed speech
  • Multiple dialects
  • Cascading errors
• Opportunities:
  • Speech data source
  • Strong task constraints
  • One-sided speech
  • Human-in-the-loop
  • Perfection not required
System flow
[Diagram] The Spanish caller's speech (“Necesito una ambulancia”) flows through Spanish ASR, the DA Classifier, and Spanish-to-English MT, and the resulting English text (“I need ambulance”) is displayed on the dispatcher's Dispatch Board; the English dispatcher's replies (“Nueve uno uno, ¿cuál es su emergencia?”) reach the caller via Spanish TTS.
Overall system design
• Spanish to English: [no TTS!]
  • Spanish speech recognized
  • Spanish text classified (context-dependent?) into DomainActs; arguments spotted and translated
  • Resulting text displayed to dispatcher
• English to Spanish: [no ASR!]
  • Dispatcher selects output from a tree, typing/editing arguments
  • Very simple “Phraselator”-style MT (sketched below)
  • System synthesizes Spanish output
  • Very simple limited-domain synthesizer
• HCI work: keeping the human in the loop!
  • Role-playing & shadow use
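As a rough illustration of the “Phraselator”-style English-to-Spanish side, a minimal sketch: the dispatcher picks a canned prompt from a tree and fills in at most one argument before it goes to the synthesizer. The phrase tree, keys, and templates below are invented placeholders, not the project's actual inventory.

```python
# Minimal sketch of "Phraselator"-style MT: the dispatcher navigates a small
# tree of canned (English, Spanish) prompt pairs and fills an argument slot.
# All categories, keys, and templates here are hypothetical examples.

PHRASE_TREE = {
    "location": {
        "ask_address": ("What is your address?", "¿Cuál es su dirección?"),
    },
    "medical": {
        "ambulance_on_way": ("An ambulance is on its way to {arg}.",
                             "Una ambulancia va en camino a {arg}."),
    },
}

def render_prompt(category: str, key: str, arg: str = "") -> str:
    """Look up the selected prompt and fill the optional argument slot."""
    english, spanish = PHRASE_TREE[category][key]
    return spanish.format(arg=arg)  # this text would go to the limited-domain synthesizer

print(render_prompt("medical", "ambulance_on_way", "Calle Ocho 123"))
```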
“Ayudame” system mock-up
• We plan to interface with call-takers via a web browser
• Initial planned user scenario follows
• The first version will certainly be wrong
  • One of the axioms of HCI
  • But iterating through user tests is the best way to get to the right design
Technical plans: ASR
• ASR challenges:
  • Disfluencies
  • Noisy, emotional speech
  • Multiple dialects, some English
• Planned approaches:
  • Noisy-channel model [Honal et al, Eurospeech03]
  • Articulatory features
  • Multilingual grammars, multilingual front end
Technical plans: MT
• MT challenges:
  • Disfluencies in speech
  • ASR errors
  • Accuracy/transparency vs. development costs
• Planned approaches: adapt and extend
  • Domain Act classification from Nespole!
    • Shallow interlingua, speaker intent (not literal)
    • Report-fire, Request-ambulance, Don’t-know, …
  • Transfer rule system from Avenue
  • (Both NSF-funded.)
Nespole! Parsing and Analysis Approach
• Goal: a portable and robust analyzer for task-oriented human-to-human speech, parsing utterances into interlingua representations
• Our earlier systems used full semantic grammars to parse complete DAs
  • Useful for parsing spoken language in restricted domains
  • Difficult to port to new domains
• Nespole! focus was on improving portability to new domains (and new languages)
• Approach: continue to use semantic grammars to parse domain-independent phrase-level arguments, and train classifiers to identify DAs
Example Nespole! representation
• “Hello. I would like to take a vacation in Val di Fiemme.”
• c:greeting (greeting=hello)
• c:give-information+disposition+trip
    (disposition=(who=i, desire),
     visit-spec=(identifiability=no, vacation),
     location=(place-name=val_di_fiemme))
MT differences from Nespole!
• Hypothesis: a simpler domain can allow a simpler (less expensive) MT approach
• DA classification done without prior parsing
  • We may add argument-recognizers as features, but still cheaper than parsing
• After DA classification, identify, parse, and translate simple arguments (addresses, phone numbers, etc.), as sketched below
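A hedged sketch of what such an argument recognizer might look like for one argument type; the regular expression and normalization below are assumptions for illustration, not the project's actual recognizer.

```python
import re

# Illustrative sketch: after the DA is classified, spot and normalize simple
# arguments such as phone numbers in the recognized Spanish text.
PHONE = re.compile(r"\b(\d{3})[ .-]?(\d{3})[ .-]?(\d{4})\b")

def spot_phone_numbers(utterance: str) -> list[str]:
    """Return normalized phone numbers found in the utterance text."""
    return ["-".join(m.groups()) for m in PHONE.finditer(utterance)]

print(spot_phone_numbers("mi número es 239 574 0699"))  # ['239-574-0699']
```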
Currently-funded NineOneOne work
• Full proposal was not funded
• But SGER was funded
  • Build targeted ASR from 9-1-1 call data
  • Build classification part of MT system
  • Evaluate on unseen data, hopefully demonstrating sufficient ASR and classification quality to get follow-on funding
• 18 months, began May 2006
• No-cost extension to 24 months
Spanish ASR Details
• Janus Recognition Toolkit (JRTk)
• CI models initialized from GlobalPhone data (39 phones)
• CD models are 3-state, semi-continuous models with 32 Gaussians per state
• LM trained on GlobalPhone text corpus (Spanish news, 1.5 million words)
• LM is interpolated with the training-data transcriptions (see the sketch below)
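A minimal sketch of the interpolation step, assuming simple linear mixing of word probabilities. The weight and toy probabilities are placeholders (the real weight would be tuned on the dev set), and the real LMs are n-gram models rather than unigram tables.

```python
# Linear LM interpolation: mix the Spanish-news LM with an LM built from the
# 9-1-1 training transcriptions. lam=0.7 is an assumed value, not the project's.

def interpolate(p_news: dict[str, float], p_calls: dict[str, float],
                lam: float = 0.7) -> dict[str, float]:
    """P(w) = lam * P_news(w) + (1 - lam) * P_calls(w) over a shared vocabulary."""
    vocab = set(p_news) | set(p_calls)
    return {w: lam * p_news.get(w, 0.0) + (1 - lam) * p_calls.get(w, 0.0)
            for w in vocab}

mixed = interpolate({"ambulancia": 0.001}, {"ambulancia": 0.01})
print(mixed["ambulancia"])  # 0.0037: domain words get more mass than in news text
```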
ASR Evaluation
• Training data: 50 calls (4 hours of speech)
• Dev set: 10 calls (for LM interpolation)
• Test set: 15 calls (1 hour of speech)
• Vocabulary size: 65K words
• Test set perplexity: 96.7
• Accuracy of ASR on test set: 76.5%
• Good for spontaneous, multi-speaker telephone speech
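If the 76.5% figure is word accuracy, it corresponds to roughly 23.5% word error rate. A standard dynamic-programming WER computation, as a generic sketch rather than the project's scoring tool:

```python
def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("necesito una ambulancia".split(),
                      "necesito la ambulancia".split()))  # 0.333... (one substitution)
```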
Utterance Classification/Eval
• Can we automatically classify utterances into DAs?
• Manually classified turns into DAs
  • 10 labels, 845 labelled turns
• WEKA toolkit SVM with simple bag-of-words binary features (an equivalent sketch follows)
• Evaluated using 10-fold cross-validation
• Overall accuracy 60.1%
  • But increases to 68.8% ignoring “Others”
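The setup, sketched with scikit-learn in place of WEKA; the two toy turns and their labels are placeholders standing in for the 845 labelled turns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Binary bag-of-words features, one DA label per turn (toy data).
turns = ["necesito una ambulancia", "hay un incendio en mi casa"]
labels = ["Request-ambulance", "Report-fire"]

clf = make_pipeline(CountVectorizer(binary=True), LinearSVC())
clf.fit(turns, labels)
print(clf.predict(["se incendia la casa"]))
# With the full labelled set, evaluation is 10-fold cross-validation, e.g.:
# cross_val_score(clf, turns, labels, cv=10)
```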
DA classification caveat
• But DA classification was done on human transcriptions (and human utterance segmentation)
• Classifier accuracy on current ASR transcriptions is 40% (49% without “Others”)
• Probably needs to be better than that
Future work
• Improving ASR
• Improving classification on real ASR output:
  • More labelled training data
  • Use discourse context in classification
  • “Query expansion” via synsets from Spanish EuroWordNet
  • Engineered phone-number recognizer, etc.
  • Partial (simpler) return to Nespole! approach
  • Better ASR/classifier matching
• Building and user-testing a full pilot system
Questions? http://www.cs.cmu.edu/~911/
Argument Parsing
• Parse utterances using phrase-level grammars
• Nespole! used the SOUP parser (Gavaldà, 2000): a stochastic, chart-based, top-down robust parser designed for real-time analysis of spoken language
• Separate grammars, based on the type of phrases each grammar is intended to cover
Domain Action Classification
• Identify the DA for each SDU using trainable classifiers
• Nespole! used two TiMBL (k-NN) classifiers:
  • Speech act
  • Concept sequence
• Binary features indicate presence or absence of arguments and pseudo-arguments (see the sketch below)
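A sketch of this setup using scikit-learn's k-NN in place of TiMBL; the feature names and training examples are invented for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Each SDU is encoded as binary features marking which arguments and
# pseudo-arguments the phrase-level grammars found in it (names assumed).
FEATURES = ["arg:activity-spec", "arg:location", "pseudo:arrival", "pseudo:greeting"]

def encode(found_args: set[str]) -> list[int]:
    return [1 if f in found_args else 0 for f in FEATURES]

X = [encode({"pseudo:greeting"}), encode({"arg:activity-spec", "arg:location"})]
y_speech_act = ["greeting", "give-information"]  # one classifier predicts the speech act
speech_act_clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_speech_act)
# A second k-NN classifier, trained the same way, predicts the concept sequence.
print(speech_act_clf.predict([encode({"pseudo:greeting"})]))  # ['greeting']
```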
Current status: March 2008 (1)
(At end of extended SGER…)
• Local Spanish transcribers transcribing HIPAA-sanitized 9-1-1 recordings
• CMU grad student (Udhay):
  • managing transcribers via internal website,
  • built and evaluated ASR and utterance classifier,
  • building labelling webpage, prototype, etc.
• Volunteer grad student (Weltman, LSU) analyzing, refining, and using classifier labels
Current status: March 2008 (2)
• “SGER worked.”
• Paper on ASR and classification accepted to LREC-2008
• Two additional 9-1-1 centers sending us data
• Submitted follow-on small NSF proposal in December 2007: really build and user-test a pilot
  • Letters of support from three 9-1-1 centers
• Will submit to COLING workshop on safety-critical MT systems
Additional Police Partners
• Julio Schrodel (CCPD) successes:
  • Mesa PD, Arizona
  • Charlotte-Mecklenburg PD, NC
• Much larger cities than Cape Coral
  • (Each is now bigger than Pittsburgh!)
• Uncompressed recordings!
• Much larger, more automated 9-1-1 operations
  • Call-taker vs. dispatcher
  • User-defined call types logged
Acquired data, as of 3/08
• Miami-Dade County: 5 audio cassettes!
• St. Petersburg: 1 call!!
Transcription Status
• VerbMobil transcription conventions
• TransEdit software (developed by Susi Burger and Uwe Meier)
• Transcribed calls:
  • 97 calls from Cape Coral PD
  • 13 calls from Charlotte
• Transcribed calls playback time: 9.7 hours
LSU work: Better DA tags
• Manually analyzed 30 calls to find DAs with the widest coverage
• Current proposal adds 25 new DAs
• Created guidelines for tagging, e.g.:
  • If the caller answers an open-ended question with multiple pieces of information, tag each piece of information
• Currently underway: use web-based tagging tool to manually tag the calls
• Determine inter-tagger agreement (one standard measure is sketched below)
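One standard measure of inter-tagger agreement is Cohen's kappa; the slides do not say which statistic the project will use, so this is a generic sketch.

```python
from collections import Counter

def cohens_kappa(tags_a: list[str], tags_b: list[str]) -> float:
    """Chance-corrected agreement between two taggers over the same turns."""
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    ca, cb = Counter(tags_a), Counter(tags_b)
    expected = sum(ca[t] * cb[t] for t in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["fire", "medical", "fire"],
                   ["fire", "medical", "medical"]))  # 0.4
```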
Project origin
• Contacted by Julio Schrodel of Cape Coral PD (CCPD) in late 2003
  • Looking for a technological solution to the shortage of Spanish translation for 9-1-1 calls
• Visited CCPD in December 2003
  • CCPD very interested in cooperating
  • Promised us access to 9-1-1 recordings
• Designed system, wrote proposal
  • CCPD letter in support of proposal
• Funded starting May 2006
  • (SGER, only for ASR and preparatory work)
Articulatory features
• Model a phone as a bundle of articulatory features such as voiced or bilabial (illustrated below)
• Less fragmentation of training data
• More robust in handling hyper-articulation
  • Error-rate reduction of 25% [Metze et al, ICSLP02]
• Multilingual/crosslingual articulatory features for multilingual settings
  • Error-rate reduction of 12.3% [Stuecker et al, ICASSP03]
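A rough illustration of the feature-bundle idea: because several phones share each feature, a feature detector pools training data across phones. The feature inventory below is a simplified assumption, not the project's actual set.

```python
# Each phone is a bundle of articulatory features (simplified inventory).
ARTICULATORY = {
    "b": {"voiced", "bilabial", "stop"},
    "p": {"bilabial", "stop"},
    "m": {"voiced", "bilabial", "nasal"},
    "d": {"voiced", "alveolar", "stop"},
}

def phones_with(feature: str) -> list[str]:
    """All phones sharing a feature: their data jointly trains that detector."""
    return [p for p, feats in ARTICULATORY.items() if feature in feats]

print(phones_with("bilabial"))  # ['b', 'p', 'm'] -- less fragmentation of data
```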
Grammars plus N-grams
• Grammar-based concept recognition
• Multilingual grammars plus n-grams for efficient multilingual decoding [Fuegen et al, ASRU03]
• Multilingual acoustic models
Interchange Format
• Interchange Format (IF) is a shallow semantic interlingua for task-oriented domains
• Utterances are represented as sequences of semantic dialog units (SDUs)
• An IF representation consists of four parts:
  • Speaker
  • Speech act
  • Concepts
  • Arguments
• Schematically: speaker : speech-act +concept* +argument*, where the speech act plus concept sequence forms the Domain Action
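A minimal sketch of this four-part structure as a data type, with values taken from the Val di Fiemme example; the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SDU:
    speaker: str                                       # e.g. "c" for client
    speech_act: str                                    # e.g. "give-information"
    concepts: list[str] = field(default_factory=list)  # e.g. ["disposition", "trip"]
    arguments: dict = field(default_factory=dict)

    def domain_action(self) -> str:
        """The speech act plus concept sequence form the Domain Action."""
        return "+".join([self.speech_act] + self.concepts)

sdu = SDU("c", "give-information", ["disposition", "trip"],
          {"disposition": "(who=i, desire)"})
print(f"{sdu.speaker}:{sdu.domain_action()}")  # c:give-information+disposition+trip
```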
Hybrid Analysis Approach
• Use a combination of grammar-based phrase-level parsing and machine learning to produce interlingua (IF) representations
• (See the Nespole! Val di Fiemme example above)
Grammars (1)
• Argument grammar: identifies arguments defined in the IF
  • s[arg:activity-spec=] (*[object-ref=any] *[modifier=good] [biking])
  • Covers “any good biking”, “any biking”, “good biking”, “biking”, plus synonyms for all three words
• Pseudo-argument grammar: groups common phrases with similar meanings into classes
  • s[=arrival=] (*is *usually arriving)
  • Covers “arriving”, “is arriving”, “usually arriving”, “is usually arriving”, plus synonyms
Grammars (2)
• Cross-domain grammar: identifies simple domain-independent DAs
  • s[greeting] ([greeting=first_meeting] *[greet:to-whom=])
  • Covers “nice to meet you”, “nice to meet you donna”, “nice to meet you sir”, plus synonyms
• Shared grammar: contains low-level rules accessible by all other grammars
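To show how one flat rule covers several surface strings, here is a toy expansion of the optional-token (`*`) notation used above. The real SOUP parser is a stochastic chart parser, so this only demonstrates the optionality semantics, not the parsing algorithm.

```python
def rule_expansions(rule: list[str]) -> set[str]:
    """Enumerate the surface strings a flat rule covers; '*' marks optional tokens."""
    covers = {""}
    for token in rule:
        word = token.lstrip("*")
        grown = {f"{c} {word}".strip() for c in covers}
        # optional token: keep both the with-word and without-word variants
        covers = covers | grown if token.startswith("*") else grown
    return covers

print(sorted(rule_expansions(["*is", "*usually", "arriving"])))
# ['arriving', 'is arriving', 'is usually arriving', 'usually arriving']
```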
Using the IF Specification
• Use knowledge of the IF specification during DA classification
  • Ensure that only legal DAs are produced
  • Guarantee that the DA and arguments combine to form a valid IF representation
• Strategy: find the best DA that licenses the most arguments (sketched below)
  • Trust the parser to reliably label arguments
  • Retaining detailed argument information is important for translation
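A sketch of that selection strategy under the stated assumptions; the specification table, DA names, and argument names are toy stand-ins rather than the actual IF specification.

```python
# DA -> arguments it may legally carry (illustrative stand-in for the IF spec)
IF_SPEC = {
    "request-action+ambulance": {"location", "phone-number"},
    "give-information+incident": {"location", "incident-type"},
}

def best_da(candidate_das: list[str], parsed_args: set[str]) -> str:
    """Among legal DAs, pick the one that licenses the most parsed arguments."""
    legal = [da for da in candidate_das if da in IF_SPEC]
    return max(legal, key=lambda da: len(parsed_args & IF_SPEC[da]))

print(best_da(["request-action+ambulance", "give-information+incident"],
              {"location", "phone-number"}))   # request-action+ambulance
```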
Avenue Transfer Rule Formalism (I)
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
  (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER))
)
• A rule carries type information (NP::NP), part-of-speech/constituent information ([DET ADJ N] -> [DET N DET ADJ]), alignments (X1::Y1, …), x-side constraints, y-side constraints, and xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
Avenue Transfer Rule Formalism (II)
• The same rule illustrates the two kinds of constraints:
  • Value constraints, e.g. ((X1 AGR) = *3-SING)
  • Agreement constraints, e.g. ((Y2 GENDER) = (Y4 GENDER))
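To make the mechanics concrete, a toy application of the NP rule above: it reorders [DET ADJ N] to [DET N DET ADJ] and checks a couple of the constraints. The flat feature dictionaries and the "tl" (target-language form) field are invented for this sketch; the real formalism unifies full feature structures.

```python
def apply_np_rule(det: dict, adj: dict, noun: dict) -> list[str]:
    """[DET ADJ N] -> [DET N DET ADJ], e.g. "the old man" -> "ha-ish ha-zaqen"."""
    # x-side value constraints: ((X1 DEF) = *DEF), ((X3 AGR) = *3-SING)
    assert det["def"] == "DEF" and noun["agr"] == "3-SING"
    # agreement constraint ((Y2 GENDER) = (Y4 GENDER)): target noun and adjective agree
    assert adj["gender"] == noun["gender"]
    # alignments (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2): the DET is emitted twice
    return [det["tl"], noun["tl"], det["tl"], adj["tl"]]

print(apply_np_rule({"tl": "ha", "def": "DEF"},
                    {"tl": "zaqen", "gender": "MASC"},
                    {"tl": "ish", "agr": "3-SING", "gender": "MASC"}))
# ['ha', 'ish', 'ha', 'zaqen']
```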