440 likes | 566 Views
Recognizer Issues. Problem : Our recognizer translates the audio to a possible string of text. How do we know the translation is correct. Problem: How do we handle a string of text containing words that are not in the dictionary.
E N D
Recognizer Issues • Problem: Our recognizer translates the audio to a possible string of text. How do we know the translation is correct. • Problem: How do we handle a string of text containing words that are not in the dictionary. • Problem: How do we handle strings with valid words, but which do not form sentences with semantics that makes sense
Correcting Recognizer Ambiguities • Problem: Resolving words not in the dictionary • Question: How different is a recognized word from those that are in the dictionary? • Solution: Count the single step transformations necessary to convert one word into another. • Example: caat cat with removal of one letter • Example: flpc fireplace requires adding the letters ire after f and a before c and e at the end
Spelling Error Types • Levenshtein distance:the smallest number of insertion, deletion, or substitution operations that transforms one string into another • Examples: differences from the word, “cat” • Insertion: catt • Deletion: ct • Substitution: car • Transposition: cta Note: There are similar well-defined rules for pronunciation variations
Spell Check Algorithm FOR unrecognized words or those out of context Generate a list of candidates • Those that differ by a one step transformation • Those that exist in the lexicon Order possibilities using language-based statistics RETURN most likely word Note: We could extend the algorithm to consider multiple transformations when we cannot find a single step solution
Example Mispelled word: acress Context * P(c) Context Candidates – with probabilities of use and use within context
Which correction is most likely? Misspelled word: accress • Word frequency percentage is not enough • We need p(typo|candidate) * p(candidate) • How likely is the particular error? • Deletion of a t after a c and before an r • Insertion of an a at the beginning • Transpose a c and an a • Substitute a c for an r • Substitute an o for an e • Insert an s before the last s, or after the last s • Context of the word within a sentence or paragraph
Additional Issues to Consider • What if there is more than one error per word? • Possible Solution: Compute minimum edit distance and allow more than one transformation • Some words in the lexicon may not appear in the frequency corpus that we use • Possible Solution: Smoothing algorithms • Is P(typo|candidate) accurate? • Possible Solution: Train the algorithm • Choice often depends on context • Possible Solution: Context sensitive algorithms • Names and places likely do not exist in the lexicon • This is a difficult issue
Dynamic Programming Definition: Nesting small decision problems inside larger decisions (Richard Bellman) Two approaches: 1. Top Down: Start out with a method call to compute a final solution. The method proceeds to solve sub-problems of the same kind. Eventually the algorithm reaches and solves the “base case.” The algorithm can avoid repetitive calculations by storing the answers to the sub-problems for later access. 2. Bottom Up: Start with the base cases, and work upwards, using previous results to reach the final answer. The examples on the slides that follow use the bottom up approach
Dynamic Programming Algorithms Implement recursive algorithms using arrays • Recursion is popular for solving ‘divide and conquer’ problems • Divide problem into smaller problems of the same kind • Define the base case • Works from the original problem down to the base case • Dynamic programming is another approach that uses arrays • Initialize a variable or table • Use loops fill in entries the table. The algorithm always fills in entries in the table before they are needed • Works top-down or bottom-up towards the final solution • Avoids repetitive calculations because the technique stores results in a table for later use • Eliminates recursion activation record overhead Dynamic programming is a widely used algorithmic technique
Fibonacci Sequence {0 1 1 2 3 5 8 13 21 …} where F(n)=F(n-1)+F(n-2) Bottom Up Top Down int[] fibs = new int[MAX]; fibs[0] = 0; fibs[1] = 1; int max = 1; intfibo(n) { if (n<=max) return fibs[n]; else fibs[n] = fibo(n-1)+fibo(n-2); max = n; return fibs[n]; } intfibo(n) { if (n <=1) return n; intfMinusTwo = 0; intfMinusOne = 1; for (inti = 2; i <= n; i++) { f = fAMinusTwo + fMinusOne; fMinusTwo= fMinusOne; fAtNMinusOne= f; } return(f) }
Example: Knapsack Problem A thief enters a house with a knapsack of a particular size. Which items of different values does he choose to fill his knapsack. • Values[v] containing values of the N items • Weights[v] contains the weights of the N items • K = knapsack capacity • dynTable[v][w] contains the dynamic algorithm table
KnapSack Algorithm • knapSack(int[] w, int[] v, int K) • { • int[][] dynTable[v.length+1][K]; • for (int w = 0 to K) a[0][w] = 0; • for (int v=1; v<=v.length; v++) • { • for (int w=0; w<W; w++) • {dyn[v][w] = dynTable[v-1][w]; // Copy up • if (w[v] <= w && ) • dynTable[v][w] = // Try for better choiceMath.max(dynTable[v][w],values[v]+dynTable[v-1,w-weights[v]]); • } • } return dynTable[N,W]; • } Note: Row for each item to consider; column for integer weights
Knapsack Example Goal: Maximum profit with a 15kg knapsack? Array cells represent value stuffed into knapsack of that weight
Example: Minimum Edit Distance A useful dynamic programming algorithm • Problem: How can we measure how different one word is from another word (ie spell checker)? • How many operations will transform one word into another? • Examples: caat --> cat, fplc --> fireplace • Definition: • Levenshtein distance: smallest number of insertion, deletion, or substitution operations to transform one string into another • Each insertion, deletion, or substitution is one operation, with a cost of 1. • Requires a two dimension array • Rows: source word positions, Columns: spelled word positions • Cells: distance[r][c] is the distance (cost) up to that point
Pseudo Code (minDistance(target, source)) n = character in source m = characters in target Create array, distance, with dimensions n+1, m+1 FOR r=0 TO n distance[r,0] = r FOR c=0 TOm distance[0,c] = c FOR eachrow r FOReach column c IF source[r]=target[c] cost = 0 ELSE cost = 1 distance[r,c]=minimum of distance[r-1,c] + 1, //insertion distance[r, c-1] + 1, //deletion and distance[r-1,c-1] + cost) //substitution Result is in distance[n,m]
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Initialization
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 1
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 2
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 3
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 4
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 5 • Result: Distance equals 2
Comparing Audio Frames • Patterns: Database of audio samples. Match to the ones that are closest • Templates: Database of features extracted from audio samples • Training: Use a training set to create a vector of patterns or templates • Distortion Measure: algorithm to measure how far a pair of templates or patterns are apart
Will Minimum Edit Distance Work? • Maybe: Distances to higher array indices are functions of the costs of smaller indices; a dynamic programming approach may work • Issues • The algorithm may be too slow • A binary equal or not equal comparison does not work • A distance metric is needed • Speaking rates change, even with a single speaker • Do we compare the raw data or frame-based features? • Do we assign cost to adjacent cells or to those further away? • Other issues: Phase, energy, pitch misalignment, Presence of noise, length of vowels, Phoneme pronunciation variances, etc. Incorrect comparisons occur when the algorithm isn’t carefully designed
Dynamic Time Warping Goal: Find “best” alignment between pairs of audio frames (A) (B) The matrix to the right shows the optimal alignment path (warping) between frames from utterance A with those of utterance B time (frame) of (B) time (frame) of (A)
Dynamic Time Warping (DTW) Overview • Computes the “distance” between 2 frames of speech • Measures frame by frame distances to compute dissimilarities between speech • Allows the comparison to warp the comparison to account for differences in speaking rates • Requires a cost function to account for different paths through frames of speech • Uses dynamic programming algorithm to find best warping. • Computes a total “distortion score” for best warped path. • Assumptions • Constrain begin and end times to be (1,1) and (TA,TB) • Allow only monotonically increasing time • Don’t allow too many frames to be skipped • Can express results in terms of “paths” with “slope weights”
Assumptions • Does not require that both patterns have the same length • One speech pattern is the “input” and the other speech pattern is the “template” to compare against • Divide speech signal into equally-spaced frames (10-30ms) with approximately 50% overlap and compute a frame-based feature set • The local distance measure (d) is the distance between features at a pair of frames (one from A, one from B). • The Global distortion from beginning of utterance until current pair of frames called G.
Algorithm Efficiency • The algorithm complexity is O(m*n) where m and n are the respective number of frames between the two utterances. If m=n, the algorithm is O(n2). Why?: Count the number of cells that need to be filled in. • O(n2) may be too slow. Alternate solutions have been devised. • Don’t fill in all of the cells. • Use a multi-level approach
Don’t Fill in all of the Cells Disadvantage: The algorithm may miss the optimal path
The Multilevel Approach Concept Coarsen the array of features Run the algorithm Refine the array Adjust the solution Repeat steps 3-4 till the original array of features is restored Notes The multilevel approach is a common technique for increasing many algorithms’ complexity from O(n2) to O(n lg n) Example: partitioning a graph to balance work loads among threads or processors
Which Audio Features? • Cepstrals: They are statistically independent and phase differences are removed • ΔCepstrals, or ΔΔCepstrals: Reflects how the signal is changing from one frame to the next • Energy: Distinguish the frames that are voiced verses those that are unvoiced • Normalized LPC Coefficients: Represents the shape of the vocal track normalized by vocal tract length for different speakers. These are some of the popular speech recognition features
Distance Metric Requirements Definition: Measure similarity of two frames of speech. The vector xt,yt contain the features from frames of two signals A distance measure should have the following properties: 0 d(xt,yt) 0 = d(xt,yt) iffxt = yt d(xt,yt) = d(xt,yt) (symmetry) d(xt,yt) d(xt,zt) + d(zt,yt) (triangle inequality) A speech distance metric should correlate with perceived distance. Perceptually-warped spectral features work well in practice (positive definiteness)
Which Distance Metric? • General Formula: array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1],array[i,j-1)} • Assumption : There is no cost assessed for duplicate or eliminated frames. • Distance Formula: • Euclidian: sum the square of one metric minus another squared • Linear: sum the absolute value of the distance between features • Example of a distance metric using linear distance ∑ wi |(fa[i] – fb[i])| where f[i] is a particular audio feature for signals a and b. w[i] is that feature’s weight
Which Local Distance Measure? where xt(f), and yt(f) are frame frequency values at time t; f is a feature index • Euclidean distance: • Mahalanobis distance: Other distance measures Itakura-Saito distortion, COSH, likelihood ratio, etc… Note: we can weight the features by multiplying differences by weighting factors to emphasize/deemphasize certain features
Dynamic Time Warping Termination Step Divide the total computed cost by a normalizing factor The normalizing factor is necessary to compare resultsbetween input speech and various templates to which it is compared One quick and effective normalizing method divides by the number of frames in the template. Another method is divide the result by the length of the path taken, where we adjust the length by the slope weights at each transition. This requires backtracking to sum the slope values, but can sometimes be more accurate.
½ ½ 1 P1=(1,1)(1,0) P2=(1,1) P3=(1,1)(0,1) ½ P1 P2 ½ P3 Frame transition cost heuristics P1 P1=(1,0) P2=(1,1) P3=(1,2) P2 P3 Heuristic 2 Heuristic 1 • Path P and slope weight m determined heuristically • Paths considered backward from target frame • Larger weight values for less preferable paths • Optimal paths always go up, right (monotonically forward in time) • Only evaluate P if all frames have meaningful values (e.g. don’t evaluate a path if one frame is at time 1, because there is no data for time 1).
Dynamic Time Warping (DTW) Algorithm 1. Initialization (time 1 is first time frame)D(1,1) = d(1,1) 2. Recursion where (=zeta) is a function of previous distances and slopes 3. Termination M sometimes defined as Tx, or Tx+Ty, or (Tx2+ Ty2)½ Aconvenient value for M is the length of the template.
Dynamic Time Warping (DTW) Example heuristic paths: 3 2 2 2 2 2 3 1 3 2 1 1 1 1 3 P1=(1,0) P2=(1,1) P3=(1,2) 1 2 2 1 2 2 2 2 1 2 1 2 1 3 3 2 1 1 1 2 3 3 1 begin at (1,1), end at (7,6) 1 1 3 3 3 3 3 6 6 7 8 D(1,1) = D(2,1) = D(3,1) = D(4,1) = D(1,2) = D(2,2) = D(3,2) = D(4,2) = D(2,3) = D(3,3) = D(4,3) = D(5,3) = D(3,4) = D(4,4) = D(5,4) = D(6,4) = D(4,5) = D(5,5) = D(6,5) = D(7,5) = D(4,6) = D(5,6) = D(6,6) = D(7,6) = 5 6 5 5 7 4 4 6 7 9 2 5 5 4 8 11 2 3 4 6 9 12 1 8 5 14 2 11 17 normalized distortion = 8/6 = 1.33 normalized distortion = 8/7 = 1.14
Dynamic Time Warping (DTW) Example heuristic paths: 3 2 2 2 2 3 1 3 2 1 1 1 3 P1=(1,0) P2=(1,1) P3=(0,1) 1 1 2 2 1 2 2 2 2 8 2 1 3 9 2 1 1 2 3 8 begin at (1,1), end at (6,6) 1 2 3 3 3 3 12 11 12 12 13 13 D(1,1) = 1 D(2,1) = 3 D(3,1) = 6 D(4,1) = 9 … D(1,2) = 3 D(2,2) = 2 D(3,2) =10 D(4,2) = 7 … D(1,3) = 5 D(2,3) = 10 D(3,3) =11 D(4,3) = 9 … D(1,4) = 7 D(2,4) = 7 D(3,4) =9 D(4,4) = 10 … D(1,5) = 10 D(2,5) = 9 D(3,5) =10 D(4,5) = 10 … D(1,6) = 13 D(2,6) = 11 D(3,6) =12 D(4,6) = 12 … normalized distortion = 13/6 = 2.17 10 11 10 9 11 10 7 7 9 10 10 10 10 5 11 8 9 11 2 12 9 3 10 7 12 1 3 9 6 15
Dynamic Time Warping (DTW) Example heuristic paths: 9 8 3 1 2 5 ½ 7 7 1 3 4 4 ½ P1=(1,1)(1,0) P2=(1,1) P3=(1,1)(0,1) 1 ½ 8 6 2 3 5 1 ½ 7 5 1 3 4 2 5 3 2 4 6 2 begin at (1,1), end at (6,6) 4 2 1 3 7 3 D(1,1) = D(2,1) = D(3,1) = D(4,1) = D(1,2) = D(2,2) = D(3,2) = D(4,2) = D(2,3) = D(3,3) = D(4,3) = D(5,3) = D(3,4) = D(4,4) = D(5,4) = D(6,4) = D(3,5) = D(4,5) = D(5,5) = D(6,5) = D(3,6) = D(4,6) = D(5,6) = D(6,6) =
Singularities • Assumption • The minimum distance comparing two signals only depends on the previous adjacent entries • The cost function accounts for the varied length of a particular phoneme, which causes the cost in particular array indices to no longer be well-defined • Problem: The algorithm can compute incorrectly due to mismatched alignments • Possible solutions: • Compare based on the change of feature values between windows instead of the values themselves • Pre-process to eliminate the causes of the mismatches
Possible Preprocessing • Normalize the energy of voiced audio: • Compute the energy of both signals • Multiply the larger by the percentage difference • Brick Wall Normalize the peaks and valleys: • Find the average peak and valley value • Set values larger than the average equal to the average • Normalize the pitch:Use PSOLA to align the pitch of the two signals • Remove duplicate frames: Auto correlate frames at pitch points • Implement a noise removal algorithm • Normalize the speaking rate
Dynamic Time Warping/Hidden Markov Models • Dynamic Time Warping (DTW) • Comparing speech with a number of templates • The algorithm selects the template with the lowest normalized distortion to determine the recognized word. • Hidden Markov Models (HMMs) • Refines the DTW technology. • HMMs compare speech against “probabilistic templates” • HMMs compute most likely paths using probabilities
Phoneme Marking • Goal:Mark the start and end of phoneme boundaries • Research • Unsupervised text (language) independent algorithms have been proposed • Accuracy:75% to80%, which is 5-10% lower than supervised algorithms that make assumptions about the language • If successful, a database of phonemes can be used in conjunction with dynamic time warping to simplify the speech recognition problem