Confidence Measures in Speech Recognition Stephen Cox School of Computing Sciences University of East Anglia Norwich, UK. sjc@cmp.uea.ac.uk
Talk Outline • Why do we need confidence measures in speech systems? • Motivation for recogniser-independent measures • PART 1: Two methods for estimating confidence measures, based on phone/word models • Phone correlation • Metamodels • PART 2: Using semantic information to estimate confidence measures • Discussion
Why Confidence Measures? • A confidence measure (CM) is a number between 0 and 1 indicating our degree of belief that a unit output by a recogniser (phrase, word, phone etc.) is correct • The most important application of CMs is in speech dialogue systems e.g. ticket booking, call-routing, information provision etc. • Uncorrected errors can be disastrous in a dialogue system, but confirmation of every content word is tedious • The system can use a CM to decide which words are correct and which need to be confirmed or corrected (as sketched below) • Unsupervised speaker adaptation: use the CM in adaptation of the acoustic models (adapt only models of words that the system considers are likely to be correct) • Aids selection among multiple hypotheses
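As an illustration of that accept/confirm decision, here is a minimal Python sketch of how a dialogue system might act on a CM; the threshold values and function name are hypothetical, not from the talk:

```python
def handle_word(word: str, confidence: float,
                accept_threshold: float = 0.9,
                reject_threshold: float = 0.4) -> str:
    """Decide how a dialogue system treats a decoded content word.

    The thresholds are illustrative: a real system would tune them on
    held-out data to trade off confirmation tedium against errors.
    """
    if confidence >= accept_threshold:
        return "accept"      # use the word without confirmation
    elif confidence >= reject_threshold:
        return "confirm"     # ask the user "Did you say ...?"
    else:
        return "reprompt"    # likely an error: ask the user to repeat

# Example: words from a ticket-booking utterance with CMs attached
for word, cm in [("london", 0.95), ("tuesday", 0.62), ("first", 0.21)]:
    print(word, "->", handle_word(word, cm))
```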
Previous Work • Confidence measures (CMs) have mostly been based on ad hoc features derived from the "side-output" of a recogniser, e.g. • number of competing hypotheses when a word is decoded • likelihood ratio of hypotheses • stability of the word in the output lattice (N-best) • number of instances of the word, or of the phonemes in the word, in the training data, etc. • Problem: these features are usually highly recogniser-specific
Example (figure): number of hypothesised word-ends as a confidence measure
PART 1: A General Approach I • Speech recognition relies on Bayes' Theorem: Pr(W | A) = Pr(A | W) Pr(W) / Pr(A), where W = word sequence and A = acoustics of the speech signal • Pr(W | A): probability of a word sequence W given some acoustics A • Pr(A | W): probability of some acoustics A given a word sequence W (ACOUSTIC MODELS) • Pr(W): probability of a word sequence W (LANGUAGE MODEL) • Pr(A): probability of some acoustics A
A General Approach II • Errors occur when either the language model Pr(W) or the acoustic models Pr(A | W) are inaccurate • In decoding words, the recogniser integrates these two probabilities • We can attempt to disentangle their effects by using a parallel phone recogniser • Two approaches: • use the correlation between the phone recogniser string and the word recogniser string as a confidence measure • use the phone recogniser to hypothesise word strings and correlate these with the word recogniser output
Pre-processing for phone correlation (figure): speech is passed in parallel through the word recogniser and a phoneme recogniser; the phoneme transcription of the word recogniser output (p1 p2 p3 …) is DP-aligned with the phoneme recogniser output (q1 q2 q3 …), giving aligned phoneme pairs and tagged frames that record which word k each phoneme pk lies within
Phone correlation: distance measure (figure): the aligned phoneme sequences (or tagged frames) p1 p2 p3 … and q1 q2 q3 … are scored against a phoneme confusion matrix (a sketch follows)
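The slide's exact distance measure did not survive extraction, so here is a minimal Python sketch of one plausible confusion-matrix score over DP-aligned phoneme streams; the toy inventory, counts, and mean-log-probability rule are all assumptions:

```python
import numpy as np

# Toy 3-phoneme inventory; counts[i, j] = times phoneme i was recognised as j
phones = {"ae": 0, "k": 1, "t": 2}
counts = np.array([[80., 15., 5.],
                   [10., 85., 5.],
                   [ 5., 10., 85.]])

def phone_correlation_score(p_seq, q_seq, confusion, phone_index):
    """Agreement score between the phoneme transcription of the word
    recogniser's output (p_seq) and the parallel phoneme recogniser's
    output (q_seq), after DP alignment has made them equal length.
    Scoring by mean log confusion probability is an illustrative choice,
    not necessarily the slide's exact distance measure."""
    probs = confusion / confusion.sum(axis=1, keepdims=True)
    pair_probs = [probs[phone_index[p], phone_index[q]]
                  for p, q in zip(p_seq, q_seq)]
    return float(np.mean(np.log(np.array(pair_probs) + 1e-10)))

# High score when the two streams agree in plausible ways, low otherwise
print(phone_correlation_score(["k", "ae", "t"], ["k", "ae", "t"], counts, phones))
print(phone_correlation_score(["k", "ae", "t"], ["t", "k", "ae"], counts, phones))
```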
Phone correlation: likelihood ratio (figure): distributions of the correlation score for correctly and incorrectly decoded words
Hypothesising words from phone strings: Pr(W | A) ≈ Pr(W | P*) Pr(P* | A), where P* is the most likely phoneme sequence • Pr(P* | A) can be estimated from a parallel phoneme recogniser • Pr(W | P*) is estimated using two techniques: LexList and Metamodels
MetaModels: candidate word lists built using phoneme confusions Motivation: • The LexList method requires some ad hoc decisions about window-length, short words etc. • There is a combinatorial explosion in candidate words when the confusion matrix is used • A MetaModel instead uses knowledge of phoneme confusions within an HMM framework to produce candidate word lists for CM estimation
Data and Models • Recogniser built using the WSJCAM0 database • Acoustic model training: SI-TR data, ~10000 sentences, 92 speakers • Testing: SI-DT dataset, ~1900 sentences, 20 speakers • Models: triphone HMMs with 8-component Gaussian mixtures and state tying (~3500 states) • Bigram language model with backoff, 20000-word vocabulary, perplexity ~160 • Confidence measures: independent training and testing sets drawn from the SI-DT dataset
Performance measurement • Use the CM to tag each decoded word as 'C' (correct) or 'I' (incorrect) • Guessing measure (G) error-rate: the error-rate obtained by tagging words without using the CM • Confidence measure (CM) error-rate: the error-rate obtained when tagging using the CM • Improvement I: the relative reduction in error-rate of the CM over guessing (plausible reconstructions of the slide's formulas follow below)
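The slide's formula images are missing; under the usual definitions in the confidence-measure literature they plausibly read as follows (a reconstruction, not the talk's verbatim formulas):

```latex
% Plausible reconstruction (the slide's formula images did not survive).
% p = proportion of decoded words that are actually correct.
\begin{align*}
  G      &= \min(p,\ 1-p) \quad\text{(tag every word with the majority class)}\\
  E_{CM} &= \frac{\#\{\text{words mis-tagged using the CM}\}}{\#\{\text{decoded words}\}}\\
  I      &= \frac{G - E_{CM}}{G} \times 100\%
\end{align*}
```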
Baseline: "N-best" Confidence (figure): a word's confidence is the fraction of the N-best hypotheses in which it appears, e.g. with N = 9: can = 4/9, an = 5/9, increase = 8/9 etc. (a sketch follows)
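Reading the fractions as "the proportion of the N = 9 best hypotheses containing the word", a minimal Python sketch of this baseline might look like this; the toy hypothesis lists are illustrative, not the slide's:

```python
from collections import Counter

def nbest_confidence(nbest: list[list[str]]) -> dict[str, float]:
    """Baseline CM: the fraction of the N-best hypotheses in which a
    word appears. A full implementation would align hypotheses
    word-by-word rather than just testing membership, which this
    sketch ignores."""
    n = len(nbest)
    counts = Counter(word for hyp in nbest for word in set(hyp))
    return {word: counts[word] / n for word in counts}

# Toy 9-best list reproducing scores like 4/9, 5/9 and 8/9
nbest = ([["can", "increase"]] * 4
         + [["an", "increase"]] * 4
         + [["an", "increases"]])
print(nbest_confidence(nbest))  # can: 4/9, an: 5/9, increase: 8/9, ...
```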
PART II: Use of semantic information in confidence measures • It is possible to identify incorrect words in an utterance on semantic grounds, e.g. Exxon corporations said earlier this week that it replaced one hundred forty percent its violin gas production in nineteen eighty serve on. (violin = "oil and") • Clearly, only a small proportion of incorrect words can be identified on such grounds • However, this information is likely to be independent of measures based on decoder output, and so might be advantageously combined with other CMs • It also requires no recogniser side-information at all
Preliminary Experiment • Examined decodings of about 600 sentences from our recogniser • Marked any word that we considered incorrect on semantic grounds • Checked the results against the transcriptions: • 470 of the 3141 incorrect words were marked (Recall = 470/3141 = 15%) • Of the marked words, 421 were indeed incorrect (Precision = 421/470 ≈ 90%) • So human performance may be useful, but only at low recall
Latent Semantic Analysis • We need a way of identifying words that are "semantically distant" from the other decoded words in a sentence • Clustering words works only up to a point because of data sparsity • Also, many semantically close word-pairs may rarely co-occur and so not cluster, e.g. movie and film (synonyms), striker and batsman (both sporting roles, but different games) • Latent Semantic Analysis (LSA) has been successfully used to associate semantically similar words
Co-occurrence matrix W (figure): an M × N matrix of counts, with one row per word (a, about, access, account, …, you, you've, your) and one column per document (Doc 1, Doc 2, Doc 3, …, Doc N)
Singular Value Decomposition of W (figure): W = U S V^T maps the M × N word/document space into an R-dimensional LSA space, where U is M × R (one row per word), S is the R × R diagonal matrix of singular values, and V^T is R × N (one column per document); W = U S V^T exactly when R = N. In this case M ≈ 20000, N ≈ 20000, R = 100
Data and Representation • Use the Wall Street Journal corpus (the same material as the utterances in the recognition experiments) • The "documents" are the paragraphs: each paragraph is (pretty much) semantically coherent • 19396 documents and 19685 different words • Each word is represented in the LSA space by a 100-d vector • The "semantic similarity" between two words is computed as the dot-product of the vectors representing the words (a sketch follows)
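Putting the last two slides together, here is a minimal Python sketch of building LSA word vectors by SVD and scoring word pairs by dot product; the toy matrix, the rank, and the choice to scale U by the singular values are assumptions:

```python
import numpy as np

# Toy word-document count matrix W (M words x N documents); in the talk
# M ~ 19685 words, N ~ 19396 WSJ paragraphs, and R = 100 dimensions.
rng = np.random.default_rng(0)
W = rng.poisson(0.1, size=(50, 40)).astype(float)

R = 10                                  # rank of the LSA space
U, s, Vt = np.linalg.svd(W, full_matrices=False)
word_vecs = U[:, :R] * s[:R]            # one R-d vector per word

def semantic_score(i: int, j: int) -> float:
    """Dot product of the LSA vectors for words i and j.
    Whether the talk scaled U by the singular values, or normalised
    the score to a cosine, is an assumption here."""
    return float(word_vecs[i] @ word_vecs[j])

print(semantic_score(0, 1))
```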
Semantic Score Distributions for Four Words (figure): OUT (rank 1), CAUTIOUS (rank 6763), DENOMINATION (rank 19666), ABOARD (rank 13892)
Confidence measures from LSA • Several confidence measures for a decoded word were evaluated: 1. Mean semantic score to the other decoded words (MSS) 2. Mean rank of the semantic score to the other decoded words, given the complete distribution of scores to all words (MR) 3. Probability of observing the set of scores to the other decoded words, given the distribution of scores for the word; the score distribution was approximated by a five-component Gaussian mixture (PSS) • A sketch of MSS and PSS follows
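As noted above, here is a minimal Python sketch of MSS and PSS under these definitions; exactly how the talk turned the GMM likelihood into a PSS value is an assumption here, and MR is omitted:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mss(scores: np.ndarray) -> float:
    """MSS: mean semantic score of the word to the other decoded words."""
    return float(scores.mean())

def score_distribution(all_scores: np.ndarray) -> GaussianMixture:
    """Fit the word's full score distribution (its scores to every
    vocabulary word) with a 5-component Gaussian mixture, as on the slide."""
    gmm = GaussianMixture(n_components=5, random_state=0)
    gmm.fit(all_scores.reshape(-1, 1))
    return gmm

def pss(gmm: GaussianMixture, scores: np.ndarray) -> float:
    """PSS: likelihood of the observed scores to the other decoded words
    under the word's score distribution. Using the mean log-likelihood
    is an assumption about the slide's exact definition."""
    return float(gmm.score(scores.reshape(-1, 1)))

# Toy example: scores of one decoded word to all 20000 vocabulary words,
# and to the 8 other words decoded in the same utterance.
rng = np.random.default_rng(1)
all_scores = rng.normal(0.0, 1.0, 20000)
utterance_scores = rng.normal(0.5, 0.8, 8)
gmm = score_distribution(all_scores)
print("MSS:", mss(utterance_scores), "PSS:", pss(gmm, utterance_scores))
```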
Use of a Stop List • Very commonly occurring words (e.g. function words) co-occur with most words, so have high scores to most words and contribute noise • Hence words whose mean semantic score to all training-set words was above a threshold LT were omitted • The recogniser's baseline performance increases when these words are omitted, and this is taken into account in the results
Distribution of PSS scores (figure): a large difference between the distributions for correctly and incorrectly decoded words at high scores, but little difference at low scores
Discussion I • We expected this technique to work by identifying as incorrect those decoded words that were semantically distant from the other words • However, PSS derives its discrimination by identifying the correctly decoded words • Analysis revealed that the words associated with high values of PSS were predominantly words that commonly occurred in the WSJ data (numbers, financial terms etc.); these are highly cognate with each other
Discussion II • Inspection of the decoded words with very low values of PSS showed that some were very common words that had been correctly decoded • It is possible that the corpus used to build the LSA space does not contain enough material to capture the large set of words that these common words co-occur with • Hence the decoded utterances in the test-set contain previously unseen co-occurrences that lead to a low semantic score for these words • Some test-set words are also out-of-vocabulary
Final Comments • We have developed techniques for identifying incorrect words in the output of a speech recogniser that do not depend on "side-information" from the recogniser, which is highly recogniser-specific • The most successful is the "metamodels" technique: a parallel phone recogniser runs alongside the word recogniser, and the word recogniser's output is correlated with candidate words constructed using metamodels • Using semantic information gives a small but significant gain in confidence estimation and requires no other recogniser, although this may well be domain-dependent • The final test of the utility of these measures will come when they are used in a real system