Speaker: Hung-yi Lee

Spoken Content Retrieval Beyond Cascading Speech Recognition and Text Retrieval Speaker: Hung-yi Lee

Text Retrieval Obama Voice Search

Spoken Content Retrieval Obama Spoken Content Lectures Broadcast Program Multimedia Content

Spoken Content Retrieval – Goal • Basic goal: Identify the time spans in which the query term occurs in the audio database • This is called “Spoken Term Detection” time 1:01 time 2:05 … user “Obama” ……

Spoken Content Retrieval – Goal • Basic goal: Identify the time spans in which the query term occurs in the audio database • This is called “Spoken Term Detection” • Advanced goal: Semantic retrieval of spoken content “US President” I know that the user is looking for “Obama”. user Retrieval system

People think …… Spoken Content Retrieval Speech Recognition + Text Retrieval =

People think …… Speech Recognition Spoken Content Acoustic Models Models Text Language Model • Transcribe spoken content into text by speech recognition

People think …… Black Box Speech Recognition Spoken Content Models Retrieval Result Text Text Retrieval Query user • Transcribe spoken content into text by speech recognition • Use text retrieval approach to search the transcriptions

People think …… Black Box Speech Recognition Spoken Content Models Retrieval Result Text Text Retrieval user • For spoken queries, transcribe them into text by speech recognition.

People think …… Speech Recognition Spoken Content Models Retrieval Result Text Text Retrieval user Speech Recognition has errors.

Lattices Speech Recognition Spoken Content Models Retrieval Result Text Text Retrieval M. Larson and G. J. F. Jones, “Spoken content retrieval: A survey of techniques and technologies,” Foundations and Trends in Information Retrieval , vol. 5, no. 4-5, 2012. Query learner • Keep the possible recognition output • Each path has a weight (confidence to be correct) Lattices

Searching on Lattices • Consider the basic goal: Spoken Term Detection user “Obama”

Searching on Lattices • Consider the basic goal: Spoken Term Detection • Find the arcs hypothesized to be the query term user Obama Obama x1 “Obama” x2

Searching on Lattices • Consider the basic goal: Spoken Term Detection • Posterior probabilities computed from lattices are usually used as confidence scores R(x1)=0.9 Obama R(x2)=0.3 Obama x1 x2

Searching on Lattices • Consider the basic goal: Spoken Term Detection • The results are ranked according to the scores x1 0.9 x2 0.3 … R(x1)=0.9 Obama R(x2)=0.3 Obama x1 x2

Searching on Lattices • Consider the basic goal: Spoken Term Detection • The results are ranked according to the scores x1 0.9 x2 0.3 … R(x1)=0.9 user Obama R(x2)=0.3 Obama x1 x2
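The ranking step described on these slides can be sketched in a few lines of Python. This is only an illustration: the `Hit` record, its field names, and the optional `threshold` parameter are assumptions made for the example and do not come from the talk. The scores 0.9 and 0.3 are the values shown for x1 and x2.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One lattice arc hypothesized to be the query term (hypothetical record)."""
    result_id: str    # label of the hypothesized region, e.g. "x1"
    start: float      # start time in seconds (illustrative values below)
    end: float        # end time in seconds
    posterior: float  # confidence score R(x) computed from the lattice


def rank_hits(hits, threshold=0.0):
    """Keep hits scoring at least `threshold`, highest posterior first."""
    kept = [h for h in hits if h.posterior >= threshold]
    return sorted(kept, key=lambda h: h.posterior, reverse=True)


hits = [Hit("x2", 125.0, 125.6, 0.3), Hit("x1", 61.0, 61.5, 0.9)]
ranked = rank_hits(hits)
print([h.result_id for h in ranked])
```

A threshold is one common way to trade recall for precision before ranking; whether the system in the talk applies one is not stated.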

Is the problem solved? • Do lattices solve the problem? • Need high quality recognition models to accurately estimate the confidence scores • In real applications, such high quality recognition models are not available • Spoken content retrieval is still difficult even with lattices.

Is the problem solved? • Hope for spoken content retrieval • Accurate spoken content retrieval, even under poor speech recognition • Don’t completely rely on accurate speech recognition • Is the cascading of speech recognition and text retrieval the only solution to spoken content retrieval?

My point in this talk Spoken Content Retrieval Speech Recognition + Text Retrieval =

Beyond Cascading Speech Recognition and Text Retrieval

New Directions 1. Incorporating Information Lost in Standard Speech Recognition 2. Improving Recognition Models for Retrieval Purposes 3. Special Semantic Retrieval Techniques designed for Spoken Content Retrieval 4. No Speech Recognition! 5. Speech is hard to browse! • Interactive retrieval • Visualization

New Direction 1: Incorporating Information Lost in Standard Speech Recognition

Information beyond Speech Recognition Output Black Box Speech Recognition Spoken Content Models Retrieval Result Text Text Retrieval user Query Incorporating information lost in standard speech recognition to help text retrieval

Conventional Speech Recognition Obama Acoustic Models Recognition Models Is it truly “Obama”? Language Model Lots of inaccurate assumptions

Exemplar-based speech recognition “Obama” “Obama” similarity Is it truly “Obama”? “Obama” [Kris Demuynck, et al., ICASSP 2011] [Georg Heigold, et al., ICASSP 2012] • Exemplar-based spoken term detection • Judge whether an audio segment is the query term by comparing it with audio examples of the query • The queries are usually specialized terms. • It is not realistic to find examples for all queries. Use Pseudo-relevance Feedback (PRF)

Pseudo Relevance Feedback (PRF) Retrieval System Query Q Lattices First-pass Retrieval Result x2 x3 x1

Pseudo Relevance Feedback (PRF) Retrieval System Query Q Lattices First-pass Retrieval Result x2 x3 x1 R(x1) R(x3) R(x2) Not shown to the user Confidence scores from lattices

Pseudo Relevance Feedback (PRF) Retrieval System Query Q Lattices First-pass Retrieval Result x2 x3 x1 R(x1) R(x3) R(x2) Examples of Q Assume the results with high confidence scores as correct Considered as examples of Q

Pseudo Relevance Feedback (PRF) Retrieval System Query Q Lattices First-pass Retrieval Result x2 x3 x1 R(x1) R(x3) R(x2) Examples of Q similar dissimilar

Pseudo Relevance Feedback (PRF) Retrieval System Query Q Lattices First-pass Retrieval Result time 1:01 time 2:05 time 1:45 … time 2:16 time 7:22 time 9:01 x2 x3 x1 R(x1) R(x3) R(x2) Examples of Q Rank according to new scores
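The PRF loop on these slides (take the top first-pass results as examples of Q, compare every result with them, then re-rank) might look roughly like the sketch below. The slides do not say how the new scores are formed, so the linear interpolation with weight `alpha`, the `top_k` cutoff (which must be at least 1), and the toy similarity table are all assumptions for illustration.

```python
def prf_rescore(scores, similarity, top_k=1, alpha=0.5):
    """Re-score first-pass results using pseudo-relevant examples.

    scores:     dict mapping result id -> first-pass confidence R(x)
    similarity: function (id_a, id_b) -> acoustic similarity in [0, 1]
    top_k:      number of top results assumed correct (assumed parameter)
    alpha:      weight of the similarity term (assumed parameter)
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    examples = ranked[:top_k]  # treated as examples of Q, never shown to the user
    new_scores = {}
    for x, first_pass in scores.items():
        # Average similarity between this result and the pseudo-relevant examples.
        sim = sum(similarity(x, e) for e in examples) / len(examples)
        new_scores[x] = (1 - alpha) * first_pass + alpha * sim
    return new_scores


# Toy pairwise similarities (made-up numbers, not from the talk).
toy_sim = {
    frozenset(("x1", "x2")): 0.8,
    frozenset(("x1", "x3")): 0.1,
    frozenset(("x2", "x3")): 0.2,
}


def similarity(a, b):
    return 1.0 if a == b else toy_sim[frozenset((a, b))]


first_pass = {"x1": 0.9, "x2": 0.5, "x3": 0.6}
rescored = prf_rescore(first_pass, similarity)
print(sorted(rescored, key=rescored.get, reverse=True))
```

In this toy case x2 starts below x3 in the first pass but moves above it after re-scoring, because x2 sounds similar to the high-confidence example x1 and x3 does not.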

Similarity between Retrieved Results Dynamic Time Warping (DTW)
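For readers unfamiliar with DTW, the following is a minimal textbook version that aligns two sequences of feature vectors and returns the accumulated frame distance. It is not necessarily the variant used in the talk: the Euclidean frame distance, the unconstrained warping path, and the conversion to a similarity in `dtw_similarity` are assumptions.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def dtw_similarity(a, b):
    """Map distance to (0, 1]; this particular mapping is an assumption."""
    return 1.0 / (1.0 + dtw_distance(a, b))


slow = [[0.0], [1.0], [1.0], [2.0]]  # same pattern, one frame stretched
fast = [[0.0], [1.0], [2.0]]
print(dtw_distance(fast, slow))
```

Because the warping path may repeat frames, the two toy sequences align with zero distance even though their lengths differ, which is the property that makes DTW useful for comparing spoken segments of different durations.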

Pseudo Relevance Feedback (PRF) - Experiments • Digital Speech Processing (DSP) of NTU based on lattices Evaluation Measure: MAP (Mean Average Precision) (B) (A)

Pseudo Relevance Feedback (PRF) - Experiments (A) and (B) use different speech recognition systems (A): speaker dependent (84% recognition accuracy) (B): speaker independent (50% recognition accuracy) (B) (A)

Pseudo Relevance Feedback (PRF) - Experiments • PRF (red bars) improved the first-pass retrieval results with lattices (blue bars) (B) (A)
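MAP, the evaluation measure named on these slides, can be computed as in the sketch below. It assumes each query's ranked list contains every correct hit, so average precision is normalized by the number of correct hits found in the list. The relevance lists are made up for illustration and are not results from the talk.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans in ranked order, True = correct hit.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in per_query) / len(per_query)


queries = [[True], [False, True]]  # toy judgments for two queries
print(mean_average_precision(queries))
```

Moving a correct hit higher in the list raises the precision at its rank, which is why re-ranking methods such as PRF can improve MAP without finding any new hits.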

Graph-based Approach • PRF • Each result considers the similarity to the audio examples • Makes some assumptions to find the examples • Graph-based approach • Does not assume that some results are correct • Considers the similarity between all results

Graph Construction • The first-pass results are considered as a graph. • Each retrieval result is a node First-pass Retrieval Result from lattices x1 x1 x2 x5 x2 x3 x4 x3

Graph Construction • The first-pass results are considered as a graph. • Nodes are connected if their retrieval results are similar. • DTW similarities are considered as edge weights x1 similar Dynamic Time Warping (DTW) Similarity x2 x5 x3 x4

Changing Confidence Scores by Graph • The score of each node depends on its neighbors. R(x1) x1 R(x2) high x2 x5 R(x3) high x3 x4 R(x5) R(x4)

Changing Confidence Scores by Graph • The score of each node depends on its neighbors. R(x1) x1 R(x2) low x2 x5 R(x3) low x3 x4 R(x5) R(x4)

Changing Confidence Scores by Graph • The score of each node depends on its connected nodes. • Score of x1 depends on the scores of x2 and x3 x1 x2 x5 x3 x4

Changing Confidence Scores by Graph • The score of each node depends on its connected nodes. • Score of x1 depends on the scores of x2 and x3 x1 x2 x5 • Score of x2 depends on the scores of x1 and x3 x3 x4

Changing Confidence Scores by Graph • The score of each node depends on its connected nodes. • Score of x1 depends on the scores of x2 and x3 x1 x2 x5 • Score of x2 depends on the scores of x1 and x3 x3 • …… x4 None of the results can decide their scores individually, so the scores are computed jointly with a random walk algorithm.
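One way to let every node's score depend on its neighbors is a random walk with restart, sketched below. The slides name the random walk but give no formula, so the update rule, the interpolation weight `alpha`, the iteration count, and the toy graph (which edges exist and all numeric values) are assumptions for illustration.

```python
def random_walk_rescore(initial, edges, alpha=0.9, iterations=50):
    """Propagate first-pass scores over a similarity graph.

    initial: dict node -> first-pass confidence score (positive total)
    edges:   dict (node_a, node_b) -> positive similarity weight, undirected
    alpha:   how much a score comes from neighbors vs. the first pass (assumed)
    """
    nodes = list(initial)
    weights = {n: {} for n in nodes}
    for (a, b), w in edges.items():
        weights[a][b] = w
        weights[b][a] = w
    total = sum(initial.values())
    prior = {n: initial[n] / total for n in nodes}  # normalized first-pass scores
    out_sum = {n: sum(weights[n].values()) for n in nodes}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor m passes on its score in proportion to the edge weight.
            incoming = sum(scores[m] * w / out_sum[m] for m, w in weights[n].items())
            updated[n] = (1 - alpha) * prior[n] + alpha * incoming
        scores = updated
    return scores


first_pass = {"x1": 0.3, "x2": 0.9, "x3": 0.8, "x4": 0.5, "x5": 0.1}
edges = {("x1", "x2"): 0.9, ("x1", "x3"): 0.8, ("x2", "x3"): 0.7, ("x4", "x5"): 0.6}
final = random_walk_rescore(first_pass, edges)
print(sorted(final, key=final.get, reverse=True))
```

In this toy graph x1 has a lower first-pass score than x4, yet it ends up above x4 because it is strongly connected to the high-scoring x2 and x3, which mirrors the "high neighbors raise the score" picture on the earlier slides.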

Graph-based Approach - Experiments • Digital Speech Processing (DSP) of NTU based on lattices (A): speaker dependent (high recognition accuracy) (B): speaker independent (low recognition accuracy) (A) (B)

Graph-based Approach - Experiments • Graph-based re-ranking (green bars) outperformed PRF (red bars) (A) (B)

Graph-based Approach – Experiments from Other Groups • Johns Hopkins workshop 2012 • 13% relative improvement on OOV queries [Aren Jansen, ICASSP 2013][Atta Norouzian, Richard Rose, ICASSP 2013] • AMI Meeting Corpus • 14% relative improvement [Atta Norouzian, Richard Rose, Interspeech 2013]

Graph-based Approach – Experiments on Babel Program • Joined the Babel program at MIT • Evaluation program of spoken term detection • More than 30 research groups divided into 4 teams • Spoken content to be retrieved is in special languages

Graph-based Approach – Experiments on Babel Program

New Direction 2: Improving Recognition Models for Retrieval Purposes

Motivation • Intuition: Higher recognition accuracy, better retrieval performance • This is not always true! In Taiwan, the need of … Recognition System A Recognition System B In Taiwan, a need of … In Thailand, the need of … Same recognition accuracy

Motivation • Intuition: Higher recognition accuracy, better retrieval performance • This is not always true! In Taiwan, the need of … Recognition System A Recognition System B In Taiwan, a need of … In Thailand, the need of … Not too much influence Hurt retrieval
