Speech and Language Technologies for Audio Indexing and Retrieval. JOHN MAKHOUL, FELLOW, IEEE, FRANCIS KUBALA, TIMOTHY LEEK, DABEN LIU, LONG NGUYEN, RICHARD SCHWARTZ, AND AMIT SRIVASTAVA, MEMBER, IEEE. PROCEEDINGS OF THE IEEE, VOL. 88, NO. 8, AUGUST 2000.
Chin-Kai Wu, CS, NTHU
Outline
• Introduction
• Indexing and Browsing with Rough’n’Ready
  • Rough’n’Ready System
  • Indexing and Browsing
• Statistical Modeling Paradigm
• Speech Recognition
• Speaker Recognition
  • Segmentation
  • Clustering
  • Identification
Introduction
• Much information will be in the form of speech from various sources.
• It is now possible to start building automatic content-based indexing and retrieval tools.
• The Rough’n’Ready system provides a rough transcription of the speech that is ready for browsing.
• The technologies incorporated in the system include speech recognition, speaker recognition, name spotting, topic classification, story segmentation, and information retrieval.
Rough’n’Ready System
[Figure: system diagram. Labels: Dual P733-MHz; MP3; Collect/Manage Archive; Interact with browser; ActiveX controls]
Indexing and Browsing
Indexing and Browsing (Cont’d)
[Figure labels: Place; People; Speaker; Topic Labels; Organization]
Indexing and Browsing (Cont’d)
[Figure note: selected from over 5500 topic labels]
Statistical Modeling Paradigm
• Output: the desired recognized sequence of the data
• Maximize P(output | input, model)
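The maximization above can be sketched as a search over candidate outputs. This is a minimal illustration, assuming the posterior is scored via Bayes' rule as log P(input | output) + log P(output); the function name `map_decode` and all numbers are hypothetical and not from the paper.

```python
import math

def map_decode(candidates, log_likelihood, log_prior):
    """Return the candidate whose log likelihood plus log prior is highest."""
    return max(candidates, key=lambda c: log_likelihood[c] + log_prior[c])

# Toy scores, invented for illustration only.
log_likelihood = {
    "recognize speech": math.log(0.020),    # P(input | output)
    "wreck a nice beach": math.log(0.025),
}
log_prior = {
    "recognize speech": math.log(0.60),     # P(output)
    "wreck a nice beach": math.log(0.05),
}

best = map_decode(list(log_likelihood), log_likelihood, log_prior)
print(best)
```

The second candidate fits the input slightly better, but the prior outweighs that difference, so the first candidate wins.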
Speech Recognition
• Statistical models: acoustic models and language models
• Acoustic model
  • Describes the time-varying evolution of feature vectors for each sound or phoneme
  • Employs hidden Markov models (HMMs)
  • A Gaussian mixture models the feature vectors for each HMM state
  • Special acoustic models for nonspeech events: music, silence/noise, laughter, breath, and lip-smack
• Language model: N-gram language model
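As an illustration of the N-gram language model, here is a minimal bigram (N = 2) sketch. The add-one smoothing, the `<s>`/`</s>` boundary tokens, and the function names are assumptions made for this example; the slides do not state the order or smoothing used in the system.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab

def bigram_logprob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    histories, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab)))
    return total

model = train_bigram(["the news is on", "the news is late"])
print(bigram_logprob("the news is on", model))
print(bigram_logprob("on is news the", model))
```

The word order seen in training scores higher than the scrambled order, which is the property the recognizer relies on when ranking word sequences.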
Speech Recognition (Cont’d)
• Multipass recognition search strategy
  • Fast-match pass
    • Narrows the search space
    • Followed by other passes that apply more accurate models to the smaller search space
  • Backward pass
    • Generates the top-scoring N-best word sequences (100 <= N <= 300)
  • N-best rescoring pass: Tree Rescoring algorithm
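The idea of rescoring an N-best list can be sketched as follows. This is plain list rescoring, not the Tree Rescoring algorithm named on the slide; the function name, the score weighting, and the toy scores are assumptions for illustration.

```python
def rescore_nbest(nbest, second_pass_score, weight=1.0):
    """Re-rank hypotheses by first-pass score plus a weighted second-pass score.

    nbest: list of (hypothesis, first_pass_log_score) pairs.
    second_pass_score: function mapping a hypothesis to a log score
    from a more accurate (and more expensive) model.
    """
    rescored = [(hyp, first + weight * second_pass_score(hyp)) for hyp, first in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Toy N-best list (N = 2) with invented log scores.
nbest = [("wreck a nice beach", -10.0), ("recognize speech", -10.5)]
better_model = {"wreck a nice beach": -6.0, "recognize speech": -2.0}

ranked = rescore_nbest(nbest, better_model.get)
print(ranked[0][0])
```

Because the expensive model only sees the short list, its cost scales with N rather than with the full search space.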
Speech Recognition (Cont’d)
• Speedup algorithms
  • Fast Gaussian Computation (FGC)
  • Grammar Spreading
  • N-Best Tree Rescoring
• Word error rate (PII 450-MHz processor, 60000-word vocabulary)
  • 3 x RT (real time): 21.4%
  • 10 x RT: 17.5%
  • 230 x RT: 14.8%
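For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and a non-empty reference; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))       # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]                          # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (r != h),    # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```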
Speaker Recognition
• Speaker segmentation
  • Segregates audio streams based on the speaker
• Speaker clustering
  • Groups together audio segments that are from the same speaker
• Speaker identification
  • Recognizes those speakers of interest whose voices are known to the system
Speaker Segmentation
• Two-stage approach to speaker change detection
  • First: detect speech/nonspeech boundaries
  • Second: perform the actual speaker segmentation within the speech segments
• First stage
  • Collapse the phonemes into three broad classes (vowels, fricatives, and obstruents)
  • Include five nonspeech models (music, silence/noise, laughter, breath, and lip-smack)
  • 5-state HMM
  • Detection is reliable over 90% of the time
Speaker Segmentation (Cont’d)
• Second stage
  • Hypothesize a speaker change boundary at every phone boundary located in the first stage
  • The speaker change decision takes the form of a likelihood ratio (λ) test
[Figure: decision rule. In a nonspeech region, λ is compared with the threshold t (λ <= t versus λ > t); in a speech region, with t + α (λ <= t + α versus λ > t + α). The outcome is either "same speaker" or otherwise.]
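A likelihood ratio test with a region-dependent threshold can be sketched as below. Several things here are assumptions, not taken from the slides: features are 1-D, each segment is modeled by a single Gaussian, the statistic is computed in the log domain, and a larger value is taken to mean "same speaker". The function names and the thresholds are hypothetical.

```python
import math

def gaussian_loglik(samples, var_floor=1e-6):
    """Log likelihood of samples under their own maximum-likelihood 1-D Gaussian.

    Uses the closed form at the ML estimate; the floor only guards against log(0).
    """
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((s - mean) ** 2 for s in samples) / n, var_floor)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def log_likelihood_ratio(left, right):
    """log of L(pooled) / (L(left) * L(right)); at most 0, near 0 when the segments are alike."""
    return gaussian_loglik(left + right) - gaussian_loglik(left) - gaussian_loglik(right)

def same_speaker(left, right, in_speech_region, t, alpha):
    """Assumed rule: compare against t in nonspeech regions and t + alpha in speech regions."""
    threshold = t + alpha if in_speech_region else t
    return log_likelihood_ratio(left, right) > threshold

a = [0.0, 0.1, -0.1, 0.05]
b = [0.02, -0.05, 0.08, -0.02]   # similar to a
c = [5.0, 5.1, 4.9, 5.05]        # clearly different from a

print(same_speaker(a, b, in_speech_region=False, t=-5.0, alpha=0.0))
print(same_speaker(a, c, in_speech_region=False, t=-5.0, alpha=0.0))
```

The bias α lets the test be stricter or more lenient inside speech regions than at nonspeech boundaries without changing the statistic itself.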
Speaker Clustering
• The likelihood ratio test is used repeatedly to group the cluster pairs that are deemed most similar, until all segments are grouped into one cluster and a complete cluster tree is generated
• The cut of the tree is then chosen to be optimal under a criterion with the following terms
  • K: number of clusters for any particular cut of the tree
  • Nj: number of feature vectors in cluster j
  • Log of the determinant of the within-cluster dispersion matrix
  • A term that compensates for the previous term
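The bottom-up construction of the cluster tree can be sketched as follows. As a stand-in for the likelihood ratio test, this example measures similarity by the absolute difference of cluster means on 1-D features; that substitution, the function name, and the data are assumptions. The cut criterion is not implemented here.

```python
from statistics import mean

def build_cluster_tree(segments):
    """Merge the closest pair of clusters until one remains.

    Returns the merge history as (child_a, child_b, new_cluster_id) tuples,
    which together describe the complete cluster tree.
    """
    clusters = {i: list(seg) for i, seg in enumerate(segments)}
    merges = []
    next_id = len(segments)
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Stand-in distance: gap between cluster means (not the likelihood ratio).
        a, b = min(
            ((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
            key=lambda pair: abs(mean(clusters[pair[0]]) - mean(clusters[pair[1]])),
        )
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))
        next_id += 1
    return merges

segments = [[0.0, 0.2], [5.0, 5.2], [0.1, 0.3]]
merges = build_cluster_tree(segments)
print(merges)
```

Cutting the tree after a given number of merges yields a particular K; the criterion on the slide is what selects among those cuts.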
Speaker Clustering (Cont’d)
• The algorithm performs well regardless of the true number of speakers, producing clusters of high purity
• Purity is defined as the percentage of frames that are correctly clustered; it was measured at 95.8%
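One common way to compute purity is shown below. It assumes a frame counts as "correctly clustered" when its true speaker is the majority speaker of its cluster; the slides do not spell out this detail, and the function name and labels are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, speaker_ids):
    """Fraction of frames whose speaker is the majority speaker of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, speaker in zip(cluster_ids, speaker_ids):
        by_cluster[cluster][speaker] += 1
    correct = sum(max(counts.values()) for counts in by_cluster.values())
    return correct / len(cluster_ids)

# Six frames: cluster 0 holds two frames of A and one of B; cluster 1 holds three of B.
purity = cluster_purity([0, 0, 0, 1, 1, 1], ["A", "A", "B", "B", "B", "B"])
print(purity)
```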
Speaker Identification
• Every speaker cluster created in the speaker clustering stage is identified by gender
• The gender of a speaker segment is determined by computing the log likelihood ratio between the male and female models
• This approach resulted in a 2.3% error rate in gender detection
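A log likelihood ratio between two gender models can be sketched as below. The single 1-D Gaussian per gender, the model parameters, and the zero decision threshold are all assumptions chosen to keep the example small; they are not the models used in the system.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gender_llr(frames, male, female):
    """Sum over frames of log P(frame | male) - log P(frame | female)."""
    return sum(gaussian_logpdf(x, *male) - gaussian_logpdf(x, *female) for x in frames)

def classify_gender(frames, male, female):
    """Assumed rule: a positive log likelihood ratio means male."""
    return "male" if gender_llr(frames, male, female) > 0 else "female"

# Hypothetical (mean, variance) pairs for a toy 1-D feature.
MALE_MODEL = (120.0, 400.0)
FEMALE_MODEL = (210.0, 900.0)

print(classify_gender([115.0, 130.0, 125.0], MALE_MODEL, FEMALE_MODEL))
print(classify_gender([200.0, 220.0, 205.0], MALE_MODEL, FEMALE_MODEL))
```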
Speaker Identification (Cont’d)
• In the DARPA Broadcast News corpus, 20% of the speaker segments are from 20 known speakers
• This is an open set problem: the data contains both known and unknown speakers, so the system has to determine the identity of the known-speaker segments and reject the unknown-speaker segments
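One simple way to handle an open set is to pick the best-scoring known speaker and reject the segment when even that score is too low. This thresholding rule is an assumption for illustration; the slides do not describe how the system makes the rejection decision. The function name, speaker names, and scores are hypothetical.

```python
def identify_open_set(scores, threshold):
    """Return the best-scoring known speaker, or "unknown" if the score is below threshold.

    scores: dict mapping each known speaker to a log score for the segment.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

print(identify_open_set({"ann": -1.0, "bob": -3.0}, threshold=-2.0))
print(identify_open_set({"ann": -5.0, "bob": -6.0}, threshold=-2.0))
```

Raising the threshold trades false acceptances for false rejections, which is why the two rates are reported separately.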
Speaker Identification (Cont’d)
• The system makes three types of errors
  • False identification (rate of 0.1%): a known-speaker segment is mistaken for another known speaker
  • False rejection (rate of 3.0%): a known-speaker segment is classified as unknown
  • False acceptance (rate of 0.8%): an unknown-speaker segment is classified as coming from one of the known speakers
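The three error types can be tallied from per-segment decisions as sketched below. The choice of denominators is an assumption: false identification and false rejection are taken over known-speaker segments, and false acceptance over unknown-speaker segments. The function name and labels are hypothetical, and both groups are assumed non-empty.

```python
def open_set_error_rates(truth, predicted, unknown="unknown"):
    """Compute false identification, false rejection, and false acceptance rates."""
    pairs = list(zip(truth, predicted))
    known = [(t, p) for t, p in pairs if t != unknown]
    unknowns = [(t, p) for t, p in pairs if t == unknown]
    false_id = sum(1 for t, p in known if p != unknown and p != t)   # wrong known speaker
    false_rej = sum(1 for t, p in known if p == unknown)             # known called unknown
    false_acc = sum(1 for t, p in unknowns if p != unknown)          # unknown called known
    return {
        "false_identification": false_id / len(known),
        "false_rejection": false_rej / len(known),
        "false_acceptance": false_acc / len(unknowns),
    }

truth = ["ann", "ann", "bob", "unknown", "unknown"]
predicted = ["ann", "bob", "unknown", "unknown", "ann"]
rates = open_set_error_rates(truth, predicted)
print(rates)
```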