Dynamic Time Warping and Minimum Distance Paths for Speech Recognition
Isolated word recognition:
• Task:
  • Want to build an isolated ‘word’ recogniser, e.g. voice dialling on mobile phones
• Method:
  • Record, parameterise and store a vocabulary of reference words
  • Record the test word to be recognised and parameterise it
  • Measure the distance between the test word and each reference word
  • Choose the reference word ‘closest’ to the test word
Words are parameterised on a frame-by-frame basis
Choose a frame length over which speech remains reasonably stationary
Overlap frames, e.g. 40ms frames, 10ms frame shift
[Figure: overlapping frames, with durations marked 40ms and 20ms]
We want to compare frames of test and reference words, i.e. calculate distances between them
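The framing step can be sketched in Python as follows. This is a minimal illustration rather than the slides' own code: the function name `split_into_frames`, the 8 kHz sample rate and the choice to drop a trailing partial frame are all assumptions.

```python
def split_into_frames(samples, sample_rate, frame_ms=40, shift_ms=10):
    """Cut a list of samples into overlapping frames.

    frame_ms and shift_ms default to the 40ms / 10ms example on the slide.
    A final frame that would run past the end of the signal is dropped
    (an assumption; padding it would be another reasonable choice).
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of dummy "audio" at an assumed 8 kHz sample rate.
samples = list(range(8000))
frames = split_into_frames(samples, sample_rate=8000)
print(len(frames), len(frames[0]))  # 97 320
```

Each frame would then be parameterised (turned into a feature vector) before any distances are computed; that step is not shown here.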
Calculating Distances
• Easy:
  • Sum differences between corresponding frames
• Problem:
  • Number of frames won’t always correspond
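A small Python sketch of the "easy" approach, assuming each frame is reduced to a single number and the difference is taken as an absolute value (both are simplifications for illustration; `framewise_distance` is a hypothetical name):

```python
def framewise_distance(test, ref):
    """Sum of absolute differences between corresponding frames.

    Only defined when both words have the same number of frames,
    which is exactly the limitation noted on the slide.
    """
    if len(test) != len(ref):
        raise ValueError("test and reference have different numbers of frames")
    return sum(abs(t - r) for t, r in zip(test, ref))


print(framewise_distance([5, 3, 9], [4, 7, 4]))  # 10

try:
    framewise_distance([5, 3, 9, 7, 3], [4, 7, 4])
except ValueError as err:
    print("cannot compare:", err)
```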
Solution 1: Linear Time Warping
• Stretch shorter sound
• Problem?
  • Some sounds stretch more than others
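One way to picture linear time warping in Python is to repeat frames of the shorter word at a uniform rate until the lengths match. The index mapping below is an assumed, simple choice (interpolation would be another), and `linear_stretch` is a hypothetical name:

```python
def linear_stretch(frames, target_len):
    """Stretch a frame sequence to target_len by repeating frames uniformly."""
    n = len(frames)
    # Output position i is taken from input position floor(i * n / target_len),
    # so every part of the word is stretched by the same factor.
    return [frames[i * n // target_len] for i in range(target_len)]


test = [5, 3, 9, 7, 3]
ref = [4, 7, 4]

stretched = linear_stretch(ref, len(test))
print(stretched)  # [4, 4, 7, 7, 4]

# With equal lengths, corresponding frames can now be compared directly.
distance = sum(abs(t - r) for t, r in zip(test, stretched))
print(distance)  # 5
```

Because the stretch factor is the same everywhere, this cannot model a word in which, say, a vowel is drawn out while the consonants keep their length.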
Solution 2: Dynamic Time Warping (DTW)
Test: 5 3 9 7 3
Reference: 4 7 4
• Using a dynamic alignment, make the most similar frames correspond
• Find the distance between the two utterances using these corresponding frames
Digression: Dynamic Programming
• The shortest route from Dublin to Limerick goes through:
  • Kildare
  • Monasterevin
  • Portlaoise
  • Mountrath
  • Roscrea
  • Nenagh
• Now consider the shortest route from Dublin to Nenagh
• What towns does the route go through?
Place the distance between frame r of Test and frame c of Reference in cell(r,c) of the distance matrix
Compute the minimum cumulative distance to each point and place it in the mindist matrix:
mindist(5,3) = min{1 + mindist(5,2), 1 + mindist(4,2), 1 + mindist(4,3)}
[Figure: distance matrix and mindist matrix for Test against Reference]
We can also find the path through the grid that minimizes the total cost of the path
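The recurrence can be sketched in Python using the Test (5 3 9 7 3) and Reference (4 7 4) values from the DTW slide. This is a minimal sketch: it assumes one number per frame, absolute difference as the frame distance, and equal weight on all three moves; the function name `dtw` is hypothetical.

```python
def dtw(test, ref):
    """Return (mindist matrix, minimum-cost path as 1-based (r, c) cells)."""
    rows, cols = len(test), len(ref)
    # dist[r][c]: distance between frame r of Test and frame c of Reference.
    dist = [[abs(t - x) for x in ref] for t in test]

    # mindist[r][c]: cheapest cumulative distance of any path reaching (r, c).
    mindist = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            previous = []
            if c > 0:
                previous.append(mindist[r][c - 1])
            if r > 0 and c > 0:
                previous.append(mindist[r - 1][c - 1])
            if r > 0:
                previous.append(mindist[r - 1][c])
            mindist[r][c] = dist[r][c] + (min(previous) if previous else 0)

    # Trace back from the final cell, always stepping to the cheapest neighbour.
    r, c = rows - 1, cols - 1
    path = [(r + 1, c + 1)]
    while (r, c) != (0, 0):
        steps = []
        if r > 0 and c > 0:
            steps.append((mindist[r - 1][c - 1], r - 1, c - 1))
        if r > 0:
            steps.append((mindist[r - 1][c], r - 1, c))
        if c > 0:
            steps.append((mindist[r][c - 1], r, c - 1))
        _, r, c = min(steps)
        path.append((r + 1, c + 1))
    path.reverse()
    return mindist, path


mindist, path = dtw([5, 3, 9, 7, 3], [4, 7, 4])
print(mindist[4][2])  # 5, i.e. mindist(5,3) = 1 + mindist(4,2)
print(path)           # [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3)]
```

The matrices are 0-indexed inside the code, while the printed path uses the 1-based cell(r,c) numbering of the slide.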
Examples so far are uni-dimensional
Speech is multi-dimensional
e.g. two dimensions, using points (4,3) and (5,2)
[Figure: the two points plotted on axes numbered 1 to 5]
Distance equation for 2 dimensions:
Distance equation for multi-dimensional:
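The equations themselves are missing from the text above. Assuming the usual Euclidean distance was intended, the two-dimensional case is d = sqrt((x1 - x2)^2 + (y1 - y2)^2), and the multi-dimensional case sums the squared differences over every coordinate before taking the square root. A short Python sketch under that assumption (`euclidean_distance` is a hypothetical name):

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The two points from the slide: sqrt((4-5)^2 + (3-2)^2) = sqrt(2)
print(euclidean_distance((4, 3), (5, 2)))

# The same function handles any number of dimensions.
print(euclidean_distance((1, 2, 3, 4), (1, 2, 3, 6)))  # 2.0
```

In a DTW grid for multi-dimensional speech frames, a function like this would take the place of the absolute difference used in the uni-dimensional examples.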
Constraints
• Global
  • Endpoint detection
  • Path should be close to diagonal
• Local
  • Must always travel upwards or eastwards
  • No jumps
  • Slope weighting
  • Consecutive moves upwards/eastwards
Local Constraints
[Figure: cell mindist(r,c) reached from its neighbours mindist(r,c-1), mindist(r-1,c-1) and mindist(r-1,c); the three moves carry weights 1, 1 and 2]
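Slope weighting can be sketched in Python by multiplying the frame distance by a per-move weight. The text does not say which move carries the weight of 2; the sketch assumes the common convention of weighting the diagonal move by 2 and the other two by 1. The function name, the parameter names and the choice to leave the first cell unweighted are likewise assumptions.

```python
import math


def weighted_dtw(test, ref, w_from_left=1, w_from_below=1, w_diagonal=2):
    """DTW total cost where each move scales the local distance by a weight."""
    rows, cols = len(test), len(ref)
    # Pad with an extra row and column of infinity so border cells need no
    # special handling; only the very first cell is set directly.
    m = [[math.inf] * (cols + 1) for _ in range(rows + 1)]
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            d = abs(test[r - 1] - ref[c - 1])
            if r == 1 and c == 1:
                m[r][c] = d  # start cell: unweighted (an assumption)
                continue
            m[r][c] = min(
                m[r][c - 1] + w_from_left * d,      # from mindist(r, c-1)
                m[r - 1][c] + w_from_below * d,     # from mindist(r-1, c)
                m[r - 1][c - 1] + w_diagonal * d,   # from mindist(r-1, c-1)
            )
    return m[rows][cols]


test, ref = [5, 3, 9, 7, 3], [4, 7, 4]
print(weighted_dtw(test, ref))                # 8 with weights 1, 1, 2
print(weighted_dtw(test, ref, w_diagonal=1))  # 5 with all weights equal to 1
```

Weighting the diagonal by 2 makes one diagonal step cost the same as a step along each axis, so paths are not favoured merely for containing fewer moves.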
Points to Note
• DTW really only suitable for small vocabularies and/or speaker dependent recognition
• Should normalise for reference length
• Can use multiple utterances and cluster them
• Poor performance if recording environment changes
• High computation cost
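A minimal Python sketch of an isolated-word recogniser that normalises for reference length. Dividing the DTW cost by the number of reference frames is one plausible reading of "normalise", not necessarily the one intended; the vocabulary words and their feature values are invented for illustration (only the 4 7 4 and 5 3 9 7 3 sequences come from the DTW slide), and the function names are hypothetical.

```python
import math


def dtw_distance(test, ref):
    """Unweighted DTW cost between two one-number-per-frame sequences."""
    m = [[math.inf] * (len(ref) + 1) for _ in range(len(test) + 1)]
    m[0][0] = 0
    for r in range(1, len(test) + 1):
        for c in range(1, len(ref) + 1):
            d = abs(test[r - 1] - ref[c - 1])
            m[r][c] = d + min(m[r][c - 1], m[r - 1][c - 1], m[r - 1][c])
    return m[-1][-1]


def recognise(test, vocabulary):
    """Return the vocabulary word whose reference is closest to the test word."""
    def normalised(word):
        ref = vocabulary[word]
        # Without this division, longer references tend to accumulate
        # larger totals simply because their paths contain more cells.
        return dtw_distance(test, ref) / len(ref)
    return min(vocabulary, key=normalised)


# Hypothetical two-word vocabulary.
vocabulary = {
    "call": [4, 7, 4],
    "stop": [1, 1, 2, 8, 9, 9],
}
print(recognise([5, 3, 9, 7, 3], vocabulary))  # call
```

The nested loops visit every cell of the grid for every reference word, which is where the high computation cost noted above comes from.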
Evaluation
• Performance of designs is only comparable by evaluation
• Use a test set
• For single word recognition we can simply quote % accuracy:
  % accuracy = (number of words correctly recognised / number of words tested) × 100
• In error analysis, it can be helpful to use a confusion matrix
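Both measures can be computed in a few lines of Python. The test-set labels below are invented purely for illustration, and `evaluate` is a hypothetical name; the confusion matrix is stored as counts of (spoken word, recognised word) pairs.

```python
from collections import Counter


def evaluate(spoken, recognised):
    """Return (% accuracy, confusion counts) for a single-word test set."""
    pairs = list(zip(spoken, recognised))
    correct = sum(1 for truth, guess in pairs if truth == guess)
    accuracy = 100.0 * correct / len(pairs)
    # confusion[(truth, guess)] = how often `truth` was recognised as `guess`
    confusion = Counter(pairs)
    return accuracy, confusion


# Hypothetical test set of four utterances.
spoken = ["one", "two", "two", "three"]
recognised = ["one", "two", "three", "three"]

accuracy, confusion = evaluate(spoken, recognised)
print(accuracy)                      # 75.0
print(confusion[("two", "three")])   # 1: "two" was once heard as "three"
```

The off-diagonal entries, where the spoken and recognised words differ, show which pairs of words the recogniser tends to confuse.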