Advanced Speech Application Tuning Topics Yves Normandin, Nu Echo yves.normandin@nuecho.com SpeechTEK University August 2009
Fundamental principles • Tuning is a data-driven process • It should be done on representative samples of user utterances • You can only tune what you can measure • And you must measure the right things • Tuning can be quite time-consuming, so it’s important to have efficient ways to: • Quickly identify where the significant problems are • Find and implement effective optimizations • Measure the impact of changes
Activities in a tuning process • Produce application performance reports • Call analysis • Tuning corpus creation & transcription • Benchmark setup + produce baseline results • Grammar / dictionary / confidence feature tuning • Confidence threshold determination • Application changes (if required) • Integration of tuning results in application • Tests
Call analysis • Goal: Analyze complete calls in order to identify and quantify problems with the application • Focus is on detecting problems that won’t be obvious from isolated utterances, e.g., usability, confusion, latency • This is the first thing that should be done after a deployment • For this, we need a call viewing tool that allows • Selecting calls that meet certain criteria (failures, etc.) • Stepping through a dialog • Listening to a user utterance • Seeing the recognition result • Annotating calls (to classify and quantify problems observed)
About call analysis • Only using utterances recorded by the engine doesn’t provide a complete picture • We don’t hear everything the caller said • Often difficult to interpret why the caller spoke in a certain way (e.g., why was there a restart?) • Having the ability to do full call recordings makes it possible to get key missing information and better understand user behavior • An interesting trick is to ask callers questions in order to understand their behavior
Tuning corpus creation • Build a tuning corpus for each relevant recognition context • For each utterance, the corpus should contain: • The waveform logged by the recognition engine • The active grammars when the utterance was collected • The recognition result obtained in the field • Useful to provide an initial transcription • Allows comparing field results with lab results • Utterance attributes, e.g., • Interaction ID (“initial”, “reprompt-noinput”, “reprompt-nomatch”, etc.) • Date, language, etc.
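The corpus record described above can be sketched as a simple data structure. This is a minimal illustration, not Nu Echo’s actual corpus format; all field names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    """One tuning-corpus entry (field names are illustrative)."""
    waveform_path: str          # audio logged by the recognition engine
    active_grammars: list       # grammars active when the utterance was collected
    field_result: str           # recognition result obtained in the field
    transcription: str = ""     # initially the field result, later corrected by hand
    attributes: dict = field(default_factory=dict)  # interaction ID, date, language, ...

# Hypothetical example entry
utt = Utterance(
    waveform_path="calls/0001/utt03.wav",
    active_grammars=["date.grxml"],
    field_result="january the sixteenth eighty",
    attributes={"interaction": "initial", "language": "en-US"},
)
utt.transcription = utt.field_result  # use the field result as a pre-transcription
```

Keeping the field result separate from the (eventually corrected) transcription is what later allows comparing field results with lab results.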
Corpus transcription • Our tuning process assumes that accurate orthographic transcriptions are available for all utterances • Transcriptions are used to compute reference semantic interpretations • The “reference semantic interpretation” is the semantic interpretation corresponding to the transcription • It is produced automatically by parsing the transcription with the grammar • The transcription itself needs to be done manually • The recognition result can be used as a pre-transcription
Benchmark setup + Produce baseline performance results • There are several goals to this phase: • Obtain a stable in-grammar/out-of-grammar (ING/OOG) classification for all utterances • Produce a reference semantic interpretation for all ING utterances • Clean up grammars, if required • Produce a first baseline result • This can be a significant effort, but: • Effective tools make this fairly efficient • It doesn’t require highly skilled resources
Remarks • We use the term “confidence feature” to designate any score that can be used to evaluate confidence in a recognition result • We often compute confidence scores that provide much better results than the confidence score provided by the recognition engine • The terms “accept” and “reject” mean that the confidence feature is above or below the threshold being considered • The definition of “correct” should be configurable, e.g., • Semantic scoring vs. word-based scoring • 1-best vs. N-best scoring
All metrics can be calculated from these sufficient statistics • All metrics are clearly defined so that there is no ambiguity
Any metric can be calculated from the sufficient statistics • Key metrics: the False Accept rate includes both incorrect recognitions and OOG utterances
Fundamental performance plot: Correct Accept vs. False Accept (Correct Accept rate plotted against False Accept rate; the curve runs from high threshold to low threshold)
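The Correct Accept / False Accept trade-off above can be computed directly from per-utterance confidence scores. A minimal sketch, assuming each utterance is reduced to a `(confidence, is_correct)` pair (this format and the function name are illustrative, not part of any tool mentioned here):

```python
def ca_fa_curve(utterances, thresholds):
    """Correct Accept / False Accept rates at each confidence threshold.

    utterances: list of (confidence, is_correct) pairs, where is_correct
    is True for correctly recognized in-grammar utterances and False for
    misrecognitions and OOG utterances.
    """
    n_correct = sum(1 for _, ok in utterances if ok)
    n_wrong = len(utterances) - n_correct
    curve = []
    for t in thresholds:
        # "accept" means the confidence feature is at or above the threshold
        ca = sum(1 for c, ok in utterances if ok and c >= t) / max(n_correct, 1)
        fa = sum(1 for c, ok in utterances if not ok and c >= t) / max(n_wrong, 1)
        curve.append((t, ca, fa))
    return curve
```

Sweeping the threshold from low to high moves along the plot: a low threshold accepts nearly everything (high CA, high FA), a high threshold rejects nearly everything (low CA, low FA).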
The graphical view makes improvements immediately visible That’s a very effective way of measuring progress
What to do about such utterances? • We certainly can’t ignore them • They represent the reality of what users actually say • The application has to deal with that • We can’t just assume they should be rejected by the application • Many of these are actually perfectly well recognized, often with a high score • The “False Accept” rate becomes meaningless • Many of them should be recognized • We can’t score them because we have no reference interpretation
Our approach: “Human-perceived ING/OOG” • A transcription is considered ING (valid) if a human can easily interpret it; it is OOG otherwise • Doesn’t depend on what the recognition grammar actually covers • Makes results comparisons meaningful since we always have the same sets of ING and OOG utterances • Provides accurate and realistic performance metrics • CA measured on all valid user utterances • Reliable FA measurement for precise high threshold setting
Challenge: Computing the reference semantic interpretation Two techniques:
Sample regex transformations: Remove repetition of the first letter
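A transformation of this kind can be sketched with a single regular expression. This is a hypothetical reconstruction of the rule the slide describes, not the actual transformation from the deck; it collapses a stuttered single letter (e.g. “m m as in mary”) so the transcription parses:

```python
import re

def remove_letter_repetition(transcription: str) -> str:
    """Collapse a repeated leading letter: "m m as in mary" -> "m as in mary".

    \b([a-z])( \1)+\b matches a single letter followed by one or more
    repetitions of that same letter, and replaces the run with one copy.
    """
    return re.sub(r"\b([a-z])( \1)+\b", r"\1", transcription)
```

Because one such rule applies to every utterance with that stutter pattern, a single regex can repair dozens of transcriptions at once.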
Focus on high confidence OOG utterances We want to avoid utterances incorrectly classified as false accepts Transcription error (should be “one”)
Tool to add paraphrases • A paraphrase replaces a transcription with another one that has the same meaning and parses • Aligns the paraphrase with the transcription • Shows whether the paraphrase is in-grammar
Postal code example The advantage of supporting certain repeats, corrections, and the form “m as in mary” is clearly demonstrated
Postal code example Impact of adding support for “p as in peter”
Comments on the transformation-based approach • Advantages • Not dependent on a specific semantic representation • The transformation framework makes this very efficient • Single rules can deal with dozens of utterances • Problems • For truly natural-language utterances, transformed transcriptions end up bearing little resemblance to the original ones • Better to use semantic tagging in this case
High-level grammar tuning process (revisited) • (1) Benchmark setup • (2) Tuning • Note: The reference grammar is often a good starting point for the recognition grammar
Key advantage: Meaningful performance comparisons • Address grammar that supports apartment numbers vs. address grammar that doesn’t • Scoring done only on address slots • The same set of ING and OOG utterances in both cases, despite significant grammar changes, ensures that comparisons are meaningful
Key advantage: Better-tuned applications • At 0.5% FA, with transformations: threshold = 0.63, CA = 83.0% • Without transformations: threshold = 0.85, CA = 78.4%
Other advantages • Lab results truly represent field performance • Better confidence in the results obtained • Few surprises when applications are deployed
Fundamental techniques • Listen to problem utterances • This includes incorrectly recognized utterances AND correctly recognized utterances with a low score • This cannot be emphasized enough • Identify the largest sources of errors • Frequent substitutions • Words with high error rate • Slot values with high error rate • Look at frequency patterns in the data • Analyze specific semantic slots • Certain slots cause more problems than others • Compare experiments
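Finding the largest sources of errors, such as frequent substitutions, can be automated by aligning each transcription against its recognition result. A minimal sketch using Python’s `difflib` for the word alignment (real tuning tools use a proper edit-distance alignment; the function name and pair format are assumptions for illustration):

```python
from collections import Counter
from difflib import SequenceMatcher

def substitution_counts(pairs):
    """Count word substitutions between reference transcriptions and
    recognition results. pairs: list of (reference, hypothesis) strings."""
    subs = Counter()
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        for op, i1, i2, j1, j2 in SequenceMatcher(None, r, h).get_opcodes():
            # Only same-length "replace" spans give clean 1-to-1 substitutions
            if op == "replace" and (i2 - i1) == (j2 - j1):
                for ref_word, hyp_word in zip(r[i1:i2], h[j1:j2]):
                    subs[(ref_word, hyp_word)] += 1
    return subs
```

Sorting the resulting counter by frequency surfaces patterns like the “a” → “eight” substitution discussed below, which can then be inspected utterance by utterance with a substitution filter.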
Substitutions / word errors Are there words with unusually high error rates?
Then examine all sentences with a specific substitution (using a substitution filter) • In this case: “a” → “eight”
Slot-specific scoring Month slot Day slot Are there semantic slots that perform unusually badly?
Tags and Tag Reports • In Atelier, we can use tags to create partitions based on any utterance attribute • Semantic interpretation patterns in the transcription or the recognition result • ING / OOG • Index of correct result in the N-best list • Scoring category • Confidence score ranges • Tags can be used to filter the utterances in powerful ways • Tag reports are used to compute selected metrics for any partition of the utterances
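The partition-then-measure idea behind tag reports can be sketched in a few lines. This is a simplified illustration, not Atelier’s actual tag machinery; the record layout and function names are assumptions:

```python
from collections import defaultdict

def tag_report(utterances, metrics):
    """Compute each metric over every tag partition.

    utterances: list of dicts, each with a 'tags' set plus whatever fields
    the metric functions need.
    metrics: {metric_name: fn(list_of_utterances) -> value}.
    """
    by_tag = defaultdict(list)
    for utt in utterances:
        for tag in utt["tags"]:
            by_tag[tag].append(utt)          # one utterance may carry many tags
    return {tag: {name: fn(group) for name, fn in metrics.items()}
            for tag, group in by_tag.items()}

# Hypothetical usage: count utterances per ING/OOG tag
report = tag_report(
    [{"tags": {"ING"}}, {"tags": {"ING", "score-high"}}, {"tags": {"OOG"}}],
    {"count": len},
)
```

Sorting such a report by correct-accept rate is what lets you find out where the biggest problems are, as the next slide shows.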
Use tag reports to find out where the biggest problems are Semantic tags Sort based on correct accept rate
Filter utterances in order to focus on specific problem cases • The “saint-leonard” borough has a high error rate. Let’s look at these utterances
Looking at semantic substitutions • What are the most frequent substitutions with “saint-leonard”?
Comparing experiments • Can choose which fields to consider for comparison purposes • This precisely shows the impact of a change on an utterance per utterance basis
Computing grammar weights for diagnostic purposes • There are many ways of saying a birth date. Which ones are worth covering?

public $date = ($intro | $NULL)
  ( $month{month=month.month} (the | $NULL) $dayOfMonth{day=dayOfMonth.day}
  | $monthNumeric{month=monthNumeric.month} (the | $NULL) $dayOfMonth{day=dayOfMonth.day}
  | (the | $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} $monthNumeric{month=monthNumeric.month}
  | (the | $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} of the $monthNumeric{month=monthNumeric.month}
  | (the | $NULL) $dayOfMonth{day=dayOfMonth.day} $month{month=month.month}
  | (the | $NULL) $dayOfMonth{day=dayOfMonth.day} of $month{month=month.month}
  ) $year{year=year.year};
Computing grammar weights for diagnostic purposes

public $date = ($intro | $NULL)
  ( $month{month=month.month} (the | $NULL) $dayOfMonth{day=dayOfMonth.day}
  | $monthNumeric{month=monthNumeric.month} (the | $NULL) $dayOfMonth{day=dayOfMonth.day}
  | (the | $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} $monthNumeric{month=monthNumeric.month}
  | (the | $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} of the $monthNumeric{month=monthNumeric.month}
  | (the | $NULL) $dayOfMonth{day=dayOfMonth.day} $month{month=month.month}
  | (the | $NULL) $dayOfMonth{day=dayOfMonth.day} of $month{month=month.month}
  ) $year{year=year.year};

Sample utterance for each alternative, in order: • January the sixteenth eighty • zero one sixteen eighty • sixteen zero one eighty • sixteen of the zero one eighty • sixteen January eighty • the sixteenth of January eighty
Compute frequency weights based on transcriptions

public $date = (/0.00001/ $intro | /1/ $NULL)
  ( /0.9636/ $month{month=month.month} (/0.06352/ the | /0.9365/ $NULL) $dayOfMonth{day=dayOfMonth.day}
  | /0.001654/ $monthNumeric{month=monthNumeric.month} (/0.00001/ the | /1/ $NULL) $dayOfMonth{day=dayOfMonth.day}
  | /0.004962/ (/0.00001/ the | /1/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} $monthNumeric{month=monthNumeric.month}
  | /0.0008271/ (/1/ the | /0.00001/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} of the $monthNumeric{month=monthNumeric.month}
  | /0.012406/ (/0.00001/ the | /1/ $NULL) $dayOfMonth{day=dayOfMonth.day} $month{month=month.month}
  | /0.01654/ (/0.25/ the | /0.75/ $NULL) $dayOfMonth{day=dayOfMonth.day} of $month{month=month.month}
  ) $year{year=year.year};

• A weight means the probability of using that alternative
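Turning usage counts into the probability weights shown above can be sketched simply, assuming we already know which alternative each transcription parses into. The small floor keeps unseen alternatives reachable, mirroring the /0.00001/ values in the grammar (the function name and floor value are illustrative):

```python
def frequency_weights(counts, floor=1e-5):
    """Convert per-alternative usage counts into probability weights.

    counts: {alternative_name: number of transcriptions that parse into it}.
    Alternatives never seen in the data get the floor weight instead of
    zero, so the grammar can still recognize them.
    """
    total = sum(counts.values())
    return {alt: max(c / total, floor) if total else floor
            for alt, c in counts.items()}

# Hypothetical counts for three date phrasings
weights = frequency_weights({"month_first": 3, "day_first": 1, "numeric": 0})
```

The same counting machinery, run over recognition results instead of transcriptions and compared against the reference, is what motivates the discriminative weights on the next slide.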
Discriminative grammar weights based on recognition results

public $date = (/-110.743109/ $intro | /110.7751/ $NULL)
  ( /291.1/ $month{month=month.month} (/-104.318/ the | /395.418/ $NULL) $dayOfMonth{day=dayOfMonth.day}
  | /-265.0/ $monthNumeric{month=monthNumeric.month} (/-75.4683/ the | /-189.53/ $NULL) $dayOfMonth{day=dayOfMonth.day}
  | /-16.85/ (/-17.085/ the | /0.2347/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} $monthNumeric{month=monthNumeric.month}
  | /0.000035/ (/0.000035/ the | /0/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} of the $monthNumeric{month=monthNumeric.month}
  | /-21.16/ (/-10.058/ the | /-11.01/ $NULL) $dayOfMonth{day=dayOfMonth.day} $month{month=month.month}
  | /11.94/ (/-2.211/ the | /14.15/ $NULL) $dayOfMonth{day=dayOfMonth.day} of $month{month=month.month}
  ) $year{year=year.year};

• Positive: Alternative should be favored • Negative: Alternative should be disfavored
Looking at utterance distribution statistics: Address date grammar • People move more on the first of the month • Note that 20 (“vingt”) has the lowest recognition rate