Spoken Content Retrieval: Beyond Cascading Speech Recognition and Text Retrieval
Speaker: Hung-yi Lee
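Text Retrieval
(Figure: the user searches text for “Obama”; the query may also be entered by Voice Search.)
Spoken Content Retrieval
(Figure: the user searches spoken content for “Obama”. Spoken content: lectures, broadcast programs, multimedia content.)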
Spoken Content Retrieval – Goal
• Basic goal: identify the time spans in which the query term occurs in the audio database. This is called “Spoken Term Detection”. (Figure: the user enters “Obama” and gets back time spans such as time 1:01, time 2:05, …)
• Advanced goal: semantic retrieval of spoken content. (Figure: the user enters “US President”; the retrieval system knows that the user is looking for “Obama”.)
People think ……
Spoken Content Retrieval = Speech Recognition + Text Retrieval
People think ……
• Transcribe spoken content into text by speech recognition (treated as a black box built from acoustic models and a language model).
• Use a text retrieval approach to search the transcriptions.
• For spoken queries, transcribe them into text by speech recognition as well.
• But speech recognition has errors.
Lattices
• Keep the possible recognition outputs.
• Each path has a weight (the confidence that it is correct).
M. Larson and G. J. F. Jones, “Spoken content retrieval: A survey of techniques and technologies,” Foundations and Trends in Information Retrieval, vol. 5, no. 4-5, 2012.
Searching on Lattices
• Consider the basic goal: Spoken Term Detection.
• Find the arcs hypothesized to be the query term (e.g., arcs x1 and x2 labeled “Obama” for the query “Obama”).
• Posterior probabilities computed from lattices are usually used as confidence scores, e.g., R(x1) = 0.9 and R(x2) = 0.3.
• The results are ranked according to the scores and shown to the user: x1 (0.9), x2 (0.3), …
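The scoring and ranking steps for searching on lattices could be sketched as follows. This is only an illustration under stated assumptions: the lattice is a small hand-made graph, arc weights are plain (non-log) likelihoods, and the names `arc_posteriors` and `rank_hits` are hypothetical, not taken from the talk. Arc posteriors are computed with a standard forward-backward pass.

```python
def arc_posteriors(arcs, num_nodes):
    """Posterior of each arc, by forward-backward over a word lattice.

    arcs: list of (src, dst, word, weight). Nodes are assumed to be numbered
    in topological order (src < dst); node 0 is the start and node
    num_nodes - 1 is the end. Weights are plain likelihoods (an assumption).
    """
    forward = [0.0] * num_nodes
    forward[0] = 1.0
    for src, dst, _, weight in sorted(arcs, key=lambda a: a[0]):
        forward[dst] += forward[src] * weight

    backward = [0.0] * num_nodes
    backward[-1] = 1.0
    for src, dst, _, weight in sorted(arcs, key=lambda a: a[0], reverse=True):
        backward[src] += weight * backward[dst]

    total = forward[-1]  # summed weight of all complete paths
    return [forward[s] * w * backward[d] / total for s, d, _, w in arcs]


def rank_hits(arcs, posteriors, query):
    """Keep the arcs hypothesized to be the query and sort by posterior."""
    hits = [(arc, p) for arc, p in zip(arcs, posteriors) if arc[2] == query]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy lattice (illustrative numbers only): two places where "obama" competes
# with another word hypothesis.
lattice = [
    (0, 1, "obama", 0.9),
    (0, 1, "alabama", 0.1),
    (1, 2, "said", 1.0),
    (2, 3, "obama", 0.3),
    (2, 3, "a", 0.7),
]
posteriors = arc_posteriors(lattice, num_nodes=4)
ranked = rank_hits(lattice, posteriors, "obama")
for arc, score in ranked:
    print(arc, round(score, 3))
```

In a real system the weights would come from the recognizer's acoustic and language model scores, usually in the log domain.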
Is the problem solved?
• Do lattices solve the problem?
• High-quality recognition models are needed to estimate the confidence scores accurately.
• In real applications, such high-quality recognition models are often not available.
• Spoken content retrieval is still difficult even with lattices.
Is the problem solved?
• Hope for spoken content retrieval:
  • Accurate spoken content retrieval, even under poor speech recognition
  • Don’t rely completely on accurate speech recognition
• Is the cascading of speech recognition and text retrieval the only solution for spoken content retrieval?
My point in this talk
Spoken Content Retrieval ≠ Speech Recognition + Text Retrieval
New Directions
1. Incorporating Information Lost in Standard Speech Recognition
2. Improving Recognition Models for Retrieval Purposes
3. Special Semantic Retrieval Techniques Designed for Spoken Content Retrieval
4. No Speech Recognition!
5. Speech is hard to browse!
  • Interactive retrieval
  • Visualization
New Direction 1: Incorporating Information Lost in Standard Speech Recognition
Information beyond Speech Recognition Output
• Incorporate information lost in standard speech recognition to help text retrieval.
(Figure: the speech recognition and text retrieval cascade, with speech recognition drawn as a black box.)
Conventional Speech Recognition
• Is it truly “Obama”? The decision is made by the recognition models (acoustic models and a language model).
• These models rest on many inaccurate assumptions.
Exemplar-based speech recognition [Kris Demuynck, et al., ICASSP 2011] [Georg Heigold, et al., ICASSP 2012]
• Is it truly “Obama”? Decide by the similarity to audio examples of “Obama”.
Exemplar-based spoken term detection
• Judge whether an audio segment is the query term by the audio examples of the query.
• The queries are usually special terminologies.
• It is not realistic to find examples for all queries.
• Use Pseudo-relevance Feedback (PRF).
Pseudo Relevance Feedback (PRF)
• The retrieval system searches the lattices with query Q and produces the first-pass retrieval result (x1, x2, x3, …), which is not shown to the user.
• Each result has a confidence score from the lattices: R(x1), R(x2), R(x3).
• Assume the results with high confidence scores are correct; they are considered as examples of Q.
• Compare every result with the examples of Q: similar or dissimilar.
• Rank according to the new scores.
(Figure: ranked lists of time spans, e.g., time 1:01, time 2:05, time 1:45, … and time 2:16, time 7:22, time 9:01.)
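The PRF re-ranking loop could be sketched roughly as below. The slides do not give the exact formula for the new scores, so the linear interpolation, the `top_k` and `weight` parameters, and the function name `prf_rescore` are all assumptions made for illustration; the pairwise similarities are made-up numbers standing in for an audio similarity measure.

```python
def prf_rescore(scores, similarity, top_k=2, weight=0.5):
    """Re-score first-pass results with pseudo-relevance feedback.

    scores: dict mapping result id -> first-pass confidence score.
    similarity: function (a, b) -> similarity in [0, 1] between two results.
    top_k, weight: hypothetical tuning parameters (not from the talk).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    examples = ranked[:top_k]  # assumed correct, used as examples of Q
    new_scores = {}
    for result in scores:
        others = [e for e in examples if e != result]
        if others:
            avg_sim = sum(similarity(result, e) for e in others) / len(others)
        else:
            avg_sim = 0.0
        # Assumed combination rule: interpolate old score and similarity.
        new_scores[result] = (1 - weight) * scores[result] + weight * avg_sim
    return new_scores


# Illustrative first-pass scores and pairwise similarities.
first_pass = {"x1": 0.9, "x2": 0.8, "x3": 0.5, "x4": 0.4}
pair_sim = {
    frozenset(("x1", "x2")): 0.9,
    frozenset(("x1", "x3")): 0.2,
    frozenset(("x2", "x3")): 0.1,
    frozenset(("x1", "x4")): 0.8,
    frozenset(("x2", "x4")): 0.9,
    frozenset(("x3", "x4")): 0.1,
}

rescored = prf_rescore(first_pass, lambda a, b: pair_sim[frozenset((a, b))])
reranked = sorted(rescored, key=rescored.get, reverse=True)
print(reranked)
```

With these toy numbers, x4 sounds like the pseudo-relevant examples x1 and x2, so it moves above x3 even though its first-pass score was lower.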
Similarity between Retrieved Results
• Measured by Dynamic Time Warping (DTW)
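A minimal sketch of DTW between two feature sequences is given below. It is the textbook dynamic-programming recurrence, not necessarily the variant used in the talk; the Euclidean frame distance and the conversion from distance to similarity in `dtw_similarity` are assumptions.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def dtw_similarity(a, b):
    """Map DTW distance to (0, 1]; this mapping is an assumption."""
    return 1.0 / (1.0 + dtw_distance(a, b))


# One-dimensional toy "features"; real systems would use e.g. MFCC frames.
seq_a = [(0.0,), (1.0,), (2.0,)]
seq_b = [(0.0,), (1.0,), (1.0,), (2.0,)]  # same shape, spoken more slowly
seq_c = [(5.0,), (5.0,), (5.0,)]

print(dtw_distance(seq_a, seq_b), dtw_distance(seq_a, seq_c))
```

The warping lets two segments of different durations be compared, which is why seq_a and seq_b align with zero cost here.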
Pseudo Relevance Feedback (PRF) - Experiments
• Digital Speech Processing (DSP) of NTU, based on lattices
• Evaluation measure: MAP (Mean Average Precision)
• (A) and (B) use different speech recognition systems:
  • (A): speaker dependent (84% recognition accuracy)
  • (B): speaker independent (50% recognition accuracy)
• PRF (red bars) improved the first-pass retrieval results with lattices (blue bars).
(Figure: bar chart of MAP for (A) and (B).)
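For reference, MAP can be computed as sketched below. This follows the common definition of average precision and assumes every relevant item appears somewhere in the ranked list; the relevance labels are invented for illustration and are not results from the talk.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans in ranked order, True if the result
    at that rank is relevant. Assumes all relevant items are in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precisions."""
    values = [average_precision(r) for r in per_query_relevance]
    return sum(values) / len(values)


# Two toy queries with made-up relevance judgments.
queries = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
map_score = mean_average_precision(queries)
print(round(map_score, 4))
```

Re-ranking methods such as PRF raise MAP when they move correct results toward the top of each list.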
Graph-based Approach
• PRF
  • Each result considers its similarity to the audio examples.
  • Some assumptions are made to find the examples.
• Graph-based approach
  • Does not assume that some results are correct.
  • Considers the similarity between all results.
Graph Construction
• The first-pass results are considered as a graph.
• Each retrieval result is a node.
• Nodes are connected if their retrieval results are similar.
• DTW similarities are used as edge weights.
(Figure: the first-pass retrieval results x1, …, x5 from lattices drawn as a graph.)
Changing Confidence Scores by Graph
• The score of each node depends on its neighbors. (Figure: x1 with neighbors x2 and x3 whose scores R(x2) and R(x3) are both high, then both low.)
• The score of each node depends on its connected nodes.
  • The score of x1 depends on the scores of x2 and x3.
  • The score of x2 depends on the scores of x1 and x3.
  • ……
• None of the results can decide its score individually, so the scores are computed with a random walk algorithm.
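One way to realize such a random walk is sketched below. The slides do not specify the update rule, so the interpolation with the first-pass scores, the `alpha` parameter, the row normalization, and the name `random_walk_rescore` are assumptions in the style of PageRank; the edge weights stand in for DTW similarities.

```python
def random_walk_rescore(initial, edges, alpha=0.9, iterations=100):
    """Propagate confidence scores over a similarity graph.

    initial: dict node -> first-pass score.
    edges: dict (node_a, node_b) -> similarity weight (undirected).
    alpha: hypothetical weight on scores received from neighbors.
    """
    nodes = list(initial)
    total = sum(initial.values())
    prior = {n: initial[n] / total for n in nodes}

    weight = {n: {} for n in nodes}
    for (a, b), w in edges.items():
        weight[a][b] = w
        weight[b][a] = w
    out_sum = {n: sum(weight[n].values()) for n in nodes}

    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor m passes on a share of its score proportional
            # to the edge weight between m and n.
            incoming = sum(
                scores[m] * weight[m][n] / out_sum[m] for m in weight[n]
            )
            updated[n] = (1 - alpha) * prior[n] + alpha * incoming
        scores = updated
    return scores


# Illustrative graph: x1, x2, x3 are mutually similar; x4 and x5 are similar.
first_pass = {"x1": 0.3, "x2": 0.9, "x3": 0.8, "x4": 0.6, "x5": 0.2}
similar = {
    ("x1", "x2"): 1.0,
    ("x1", "x3"): 1.0,
    ("x2", "x3"): 1.0,
    ("x4", "x5"): 1.0,
}
new_scores = random_walk_rescore(first_pass, similar)
new_ranking = sorted(new_scores, key=new_scores.get, reverse=True)
print(new_ranking)
```

In this toy graph x1 starts below x4, but because its neighbors x2 and x3 have high scores it ends up ranked above x4.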
Graph-based Approach - Experiments
• Digital Speech Processing (DSP) of NTU, based on lattices
  • (A): speaker dependent (high recognition accuracy)
  • (B): speaker independent (low recognition accuracy)
• Graph-based re-ranking (green bars) outperformed PRF (red bars).
(Figure: bar chart for (A) and (B).)
Graph-based Approach – Experiments from Other Groups
• Johns Hopkins workshop 2012
  • 13% relative improvement on OOV queries [Aren Jansen, ICASSP 2013][Atta Norouzian, Richard Rose, ICASSP 2013]
• AMI Meeting Corpus
  • 14% relative improvement [Atta Norouzian, Richard Rose, Interspeech 2013]
Graph-based Approach – Experiments on Babel Program
• Joined the Babel program at MIT
• An evaluation program for spoken term detection
• More than 30 research groups divided into 4 teams
• The spoken content to be retrieved is in special languages
New Direction 2: Improving Recognition Models for Retrieval Purposes
Motivation
• Intuition: higher recognition accuracy gives better retrieval performance.
• This is not always true!
• Example: the spoken content is “In Taiwan, the need of …”
  • Recognition System A outputs “In Taiwan, a need of …”: not too much influence on retrieval.
  • Recognition System B outputs “In Thailand, the need of …”: hurts retrieval.
  • Both systems have the same recognition accuracy.
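The Taiwan/Thailand example can be checked with a small word-level edit distance. This is a sketch under simple assumptions: punctuation and the trailing “…” are dropped, words are lowercased, and the helper name `word_errors` is hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1]


reference = "in taiwan the need of"
system_a = "in taiwan a need of"
system_b = "in thailand the need of"

errors_a = word_errors(reference, system_a)
errors_b = word_errors(reference, system_b)
found_by_a = "taiwan" in system_a.split()
found_by_b = "taiwan" in system_b.split()
print(errors_a, errors_b, found_by_a, found_by_b)
```

Both outputs contain one word error out of five words, yet only System A's transcript can still be found with the query “Taiwan”, which is the point of the slide.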