10 likes | 173 Views
Improved Spoken Term Detection with Graph-Based Re-Ranking in Feature Space Yun-Nung Chen, Chia -Ping Chen, Hung-Yi Lee , Chun-an Chan, and Lin- shan Lee. Query Q. Corpus. MFCC Extraction. ASR. Retrieval. Lattice. Re-Ranking. Original Relevance Score S Q ( x ). v ( i ).
E N D
Improved Spoken Term Detection with Graph-Based Re-Ranking in Feature Space Yun-Nung Chen, Chia-Ping Chen, Hung-Yi Lee , Chun-an Chan, and Lin-shan Lee Query Q Corpus MFCC Extraction ASR Retrieval Lattice Re-Ranking Original Relevance Score SQ(x) v (i) Acoustic Distance 1st-Pass Results XQ Graph-Based Re-Computation d(xi ,xj) Utterance Pair xi, xj White blocks: original spoken term detection Lattice w2: 0.3 Q: 0.5 w1: 0.1 w2: 0.6 Q: 0.7 MFCC sequence of x w1: 0.2 w2: 0.1 x1 x5 DTW Hit Region of the utterance xi d(xi ,xj) Hit Region of the utterance xj Out(vi)= {x3, x5, x6} x4 x3 v(4) x2 x6 National Taiwan University Summary Main Idea Acoustic Feature Space • Using graph-based re-ranking to improve spoken term detection with acoustic similarity. • With MLLR acoustic model, MAP improves from 55.54% to 67.38%. The relative improvement rate is 22.04% relevant irrelevant • Relevant utterances are similar to each other in feature space. • Utterances similar to more utterances with higher scores should be given higher relevance scores. Acoustic Distance • Compute acoustic distance d(xi ,xj) for each utterance pair xi ,xjin first-pass result. • Node: first-pass retrieved utterance • Edge: weighted by the similarity between two utterances • Nodes connected to more nodes with higher scores are given higher scores. • Considering global similarity among first-pass retrieved utterances. • “Hit Region”: the corresponding MFCC sequence which is most possible to be the query in the lattice. Experiments • Corpus: 33 hours of course lectures (single instructor) • Language: primarily in Mandarin Chinese • Acoustic Model: SI, MLLR, SD Graph-Based Re-Ranking with Acoustic Feature • Modified Random Walk • PRF: pseudo-relevance feedback in feature space (InterSpeech 2010) • Our approach performs better, specially for the relatively poorer acoustic models (SI and MLLR). • Scores propagated from neighbors of node i • Normalized original relevance score • Node: first-pass retrieved utterance • Edge: weighted by the similarity between the two utterances evaluated in feature space • v(i)is higher when • Higher original relevance score • Similar to more utterances with higher scores (matrix form) • Compute similarity from distance 0 • Solution: dominant eigenvector of P’ • Re-Ranking 1 • Normalized similarity • The performance is optimized at α = 0.9 • Better retrieval relies primarily on global similarity. • The graphic structure provides significant information in ranking. • Integrated with original relevance score