AdvAIR: An Advanced Audio Information Retrieval System. Supervised by Prof. Michael R. Lyu. Prepared by Alex Fok, Shirley Ng. 2002 Fall.
Outline • Introduction • System Overview • Applications • Experiment • Future Work • Q&A
Motivation • Rapid expansion of audio information due to the growth of the Internet • Little attention has been paid to audio mining • Lack of a framework for generic audio information processing
Targets • Open platform that can provide a basis for various voice-oriented applications • Improve the performance of audio information retrieval with guaranteed accuracy • Generic speech analysis tools for data mining
Approaches • Robust low-level sound information preprocessing module • Speed-oriented but accurate algorithms • Generalized model concept for various uses • A visual framework for presentation
System Flow Chart [Figure: flow chart with the boxes Audio Signal, Preprocessing, Features Extraction, Segmentation and clustering, Training and Modeling, Database Storage, Scene Cutting (implements Video Scene Change and Speaker Tracking), Speaker Identification and Linguistic Identification, grouped into Core Platform and Extended tools]
Features Extraction • Energy Measurement • Zero Crossing Rate • Pitch • Humans resolve frequencies non-linearly across the audio spectrum • MFCC (mel-frequency cepstral coefficients) approach • Simulates the vocal tract shape
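The first two features listed above can be computed per frame with a few lines of code. The following is a minimal sketch, not the AdvAIR implementation: the function names and the mean-square definition of energy are assumptions made for illustration.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame (assumed definition of 'energy')."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# A toy frame that flips sign on every sample.
frame = [0.5, -0.5, 0.5, -0.5]
print(short_time_energy(frame))   # 0.25
print(zero_crossing_rate(frame))  # 1.0
```

Noisy or unvoiced frames tend to show a high zero crossing rate with low energy, which is why the two measurements are usually read together.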
Features Extraction (cont'd) • A filter-bank approximates the non-linear frequency resolution of human hearing • Each bin holds a weighted sum representing the spectral magnitude of its channel • Lower and upper frequency cut-offs bound the filter-bank [Figure: filter-bank, magnitude against frequency]
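The filter-bank idea can be sketched with triangular filters spaced evenly on the mel scale. This is an illustration under stated assumptions, not the AdvAIR code: the slides do not give the mel formula, so the common 2595·log10(1 + f/700) mapping is assumed, and the function names, filter count, FFT size and sample rate are all hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    # Assumed mel mapping; the slides do not state which one is used.
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate, f_low, f_high):
    """Triangular filters between the lower and upper cut-off frequencies."""
    # n_filters + 2 edge points, equally spaced on the mel scale.
    hz_points = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2))
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(n_filters):
        left, centre, right = hz_points[k], hz_points[k + 1], hz_points[k + 2]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


# Example: 20 filters over a 512-point FFT at 16 kHz (illustrative values).
bank = mel_filterbank(20, 512, 16000, 0.0, 8000.0)
frame = np.sin(2 * np.pi * 440.0 * np.arange(512) / 16000)
bins = bank @ np.abs(np.fft.rfft(frame))  # one weighted sum per filter
```

Each entry of `bins` is the weighted sum of spectral magnitudes for one channel; an MFCC front end would typically take the logarithm of these values and apply a discrete cosine transform.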
Segmentation • Segmentation cuts the audio stream at acoustic change points • BIC (Bayesian Information Criterion) is used • It is threshold-free and robust • The input audio stream is modeled as Gaussians [Figure: Gaussian, labeled with its mean]
Segmentation • Notation for an audio stream: • $N$: number of frames • $X = \{x_i : i = 1, 2, \dots, N\}$: the set of feature vectors • $\mu$ is the mean • $\Sigma$ is the full covariance matrix
Segmentation for single change pt. • Assume the change point is at frame $i$ • $H_0$, $H_1$: two different models • $H_0$ models the data as one Gaussian: $x_1, \dots, x_N \sim N(\mu, \Sigma)$ • $H_1$ models the data as two Gaussians: $x_1, \dots, x_i \sim N(\mu_1, \Sigma_1)$ and $x_{i+1}, \dots, x_N \sim N(\mu_2, \Sigma_2)$ [Figure: audio stream from Frame 1 to Frame N with the change point at Frame i]
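Segmentation for single change pt. (cont'd) • The maximum likelihood ratio statistic is $R(i) = N \log|\Sigma| - N_1 \log|\Sigma_1| - N_2 \log|\Sigma_2|$, where $N_1 = i$ and $N_2 = N - i$ are the frame counts of the two segments [Figure: audio stream from Frame 1 to Frame N with the change point at Frame i]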
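Segmentation for single change pt. (cont'd) • $BIC(i) = R(i) - \lambda P$, where $P$ is the penalty for the extra parameters of the two-Gaussian model and $\lambda$ is the penalty weight • $BIC(i)$ is positive: $i$ is a change point • $BIC(i)$ is negative: $i$ is not a change point • The sign answers the question: which model fits the data better, a single Gaussian ($H_0$) or two Gaussians ($H_1$)? [Figure: model H0 against model H1]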
Segmentation for single change pt. (cont'd) • To detect a single change point, calculate $BIC(i)$ for all $i = 1, 2, \dots, N$ • The frame $i$ with the largest positive BIC value is the change point • $O(N)$ evaluations of $BIC(i)$ are needed to detect a single change point
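The single change point test can be sketched as below. This is a hedged illustration rather than the AdvAIR engine: the slides do not define $P$, so the commonly used form $P = \frac{1}{2}\left(d + \frac{1}{2}d(d+1)\right)\log N$ for $d$-dimensional features is assumed, as are $\lambda = 1$, the `margin` that keeps both segments large enough for a covariance estimate, the small ridge added to each covariance, and the function names.

```python
import numpy as np


def log_det_cov(X):
    """Log-determinant of the maximum-likelihood full covariance of the frames in X."""
    d = X.shape[1]
    cov = np.cov(X, rowvar=False, bias=True).reshape(d, d)
    cov += 1e-6 * np.eye(d)  # small ridge (assumption) to keep the matrix invertible
    return np.linalg.slogdet(cov)[1]


def bic_change_point(X, lam=1.0, margin=10):
    """Return (i, BIC(i)) for the best split, or (None, 0.0) if no BIC(i) is positive.

    i is the number of frames in the first segment, so the split is
    frames 1..i against frames i+1..N.
    """
    N, d = X.shape
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)  # P, assumed form
    full = N * log_det_cov(X)
    best_i, best_bic = None, 0.0
    for i in range(margin, N - margin + 1):
        r = full - i * log_det_cov(X[:i]) - (N - i) * log_det_cov(X[i:])
        bic = r - lam * penalty
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i, best_bic


# Synthetic stream: 200 frames from one source followed by 150 from another.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (150, 2))])
i, bic = bic_change_point(X)
```

The sketch recomputes each covariance from scratch for clarity; a faster version would update running sums so that each $BIC(i)$ costs constant time per frame.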
Segmentation for multiple change pt. • Step 1: Initialize the interval [a,b] with a = 1, b = 2 • Step 2: Detect a change point in the interval [a,b] with the BIC single change point detection algorithm • Step 3: If there is no change point in [a,b], set b = b+1; otherwise let t be the detected change point and set a = t+1, b = t+2 • Step 4: Go to Step 2
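The four steps can be written as a short loop. The sketch below is an illustration with assumptions: the detector is passed in as a callback `detect(a, b)`, the loop stops once the interval would run past the last frame (the slide leaves termination implicit), and a toy detector based on known labels stands in for the BIC test. All names are hypothetical.

```python
def find_change_points(n_frames, detect):
    """Growing-interval search over 1-based frames.

    detect(a, b) returns a change frame t with a <= t < b, or None.
    """
    changes = []
    a, b = 1, 2                      # Step 1
    while b <= n_frames:             # stop condition is an assumption
        t = detect(a, b)             # Step 2
        if t is None:
            b += 1                   # Step 3, no change found: widen the interval
        else:
            changes.append(t)        # Step 3, change found: restart after it
            a, b = t + 1, t + 2
    return changes


def make_toy_detector(labels):
    """Stand-in for the BIC test: reports the last frame before a label change."""
    def detect(a, b):
        for t in range(a, b):
            if labels[t - 1] != labels[t]:
                return t
        return None
    return detect


labels = "AAAABBBBBCC"  # frames 1-4, 5-9 and 10-11 belong to three sources
print(find_change_points(len(labels), make_toy_detector(labels)))  # [4, 9]
```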
Enhanced Implementation Algorithm • Original multiple change point detection algorithm: • Starts detecting change points within an interval of 2 frames • Increases the investigation interval by 1 frame each time • Enhanced implementation algorithm: • The minimum processing interval used in our engine is 100 frames • Increases the investigation interval by 100 frames each time
Enhanced Implementation Algorithm (cont'd) • Why do we choose to increase the interval by 100 frames? • If the increment is too large, a scene change may be missed • It must be smaller than 170 frames because there are around 170 frames in 1 second • If the increment is too small, processing is too slow
Enhanced Implementation Algorithm (cont'd) • Advantage: speed-up • Trade-off: the detected change point is less accurate • To compensate: • Investigate the frames around the detected change point again • Increment the investigation interval by 1 frame to locate a more accurate change point
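One possible reading of the coarse search followed by a fine re-investigation is sketched below. It is an interpretation, not the engine's code: the interval grows by `step` frames until the detector fires, then it is regrown one frame at a time from the last interval that showed no change. The callback signature, the counting toy detector and every name are assumptions.

```python
def coarse_to_fine(n_frames, detect, step=100):
    """detect(a, b) returns a change frame t with a <= t < b, or None (1-based frames)."""
    changes = []
    a, b = 1, min(step, n_frames)
    while a < n_frames:
        t = detect(a, b)
        if t is None:
            if b >= n_frames:
                break
            b = min(b + step, n_frames)          # coarse growth
            continue
        # Fine pass: regrow frame by frame from the last interval with no change.
        for fine_b in range(max(a + 1, b - step + 1), b + 1):
            fine_t = detect(a, fine_b)
            if fine_t is not None:
                t = fine_t
                break
        changes.append(t)
        a = t + 1
        b = min(a + step - 1, n_frames)
    return changes


def make_counting_detector(labels):
    """Toy stand-in for the BIC test that also counts how often it is called."""
    calls = [0]

    def detect(a, b):
        calls[0] += 1
        for t in range(a, b):
            if labels[t - 1] != labels[t]:
                return t
        return None

    return detect, calls


labels = "A" * 250 + "B" * 450 + "C" * 300
detect, calls = make_counting_detector(labels)
print(coarse_to_fine(len(labels), detect, step=100))  # [250, 700]
```

On this toy stream the detector is called far fewer times than the roughly 1000 calls a one-frame increment would need, which is the speed-up the slide describes.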
Training and Modeling • Training and modeling are needed before any identification can be done • A probability-based model, the Gaussian Mixture Model (GMM), is used • GMMs are used for language identification, gender identification and speaker identification • A GMM is composed of many different Gaussian distributions • A Gaussian distribution is represented by its mean and variance
Gaussian Mixture Model (GMM) • Training a model means calculating the mean, variance and weight of each Gaussian distribution; together these parameters form the model λ [Figure: Model for Speaker i]
Training of speaker GMMs • Collect sound clips that are long enough for each speaker (e.g. 20 minutes of sound clips) • Steps for training one speaker model: • Step 1. Start with an initial model λ • Step 2. Calculate a new mean, variance and weighting (new λ) by training • Step 3. Use the new λ if it represents the data better than the old λ • Step 4. Repeat Step 2 to Step 3 • Finally, we get a λ that can represent the speaker
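The training steps match the shape of expectation-maximization (EM) for a GMM, so a small EM loop is sketched here. The slides do not name the training algorithm, so EM itself is an assumption, as are the diagonal variances, the initial model built from equal chunks of frames sorted on the first feature, the variance floor, the stopping tolerance and all function names.

```python
import numpy as np


def weighted_log_prob(X, weights, means, variances):
    """(N, K) array of log(w_k) + log N(x_n | mean_k, diag(var_k))."""
    diff = X[:, None, :] - means[None, :, :]
    exponent = np.sum(diff ** 2 / variances, axis=2)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return np.log(weights) - 0.5 * (exponent + log_norm)


def log_likelihood(X, weights, means, variances):
    log_p = weighted_log_prob(X, weights, means, variances)
    peak = log_p.max(axis=1, keepdims=True)
    return float(np.sum(peak[:, 0] + np.log(np.exp(log_p - peak).sum(axis=1))))


def em_step(X, weights, means, variances, var_floor=1e-3):
    """One re-estimation of weights, means and variances (Step 2)."""
    log_p = weighted_log_prob(X, weights, means, variances)
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)      # responsibility of each Gaussian
    nk = resp.sum(axis=0) + 1e-10
    new_means = resp.T @ X / nk[:, None]
    new_vars = resp.T @ X ** 2 / nk[:, None] - new_means ** 2
    return nk / len(X), new_means, np.maximum(new_vars, var_floor)


def train_gmm(X, n_components=2, max_iter=100, tol=1e-4):
    # Step 1: initial model from equal chunks of the sorted frames (assumption).
    chunks = np.array_split(X[np.argsort(X[:, 0])], n_components)
    means = np.array([c.mean(axis=0) for c in chunks])
    variances = np.array([np.maximum(c.var(axis=0), 1e-3) for c in chunks])
    model = (np.full(n_components, 1.0 / n_components), means, variances)
    best = log_likelihood(X, *model)
    for _ in range(max_iter):                    # Step 4: repeat
        candidate = em_step(X, *model)           # Step 2: new model
        score = log_likelihood(X, *candidate)
        if score - best < tol:                   # Step 3: keep it only if it is better
            break
        model, best = candidate, score
    return model, best
```

A call such as `train_gmm(features, n_components=16)` on the feature vectors of one speaker would return that speaker's (weights, means, variances) together with the final log-likelihood; the component count is illustrative.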
Applications • Video scene change and speaker tracking • Speaker Identification • Telephony message notification
Video scene change and Speaker tracking [Figure: block diagram with the boxes Multimedia Presentation, Video Clip, AdvAIR core, Segmentation, Timing and Speaker Information, Video Playing Mechanism and Speakers Index Information]
Usage • Speaker tracking enhances data mining about a particular person (e.g. a politician at a conference) • Audio information indexing and sorting for audio library storage • An auxiliary tool for video cutting and editing applications
Screenshot [Figure: screenshot showing the input clip, the multimedia player, and the time information and indexing]
Speaker Identification [Figure: block diagram. Training stage: Sound source, Preprocessed Speaker clip, GMM Model Training, Speaker Models Database. Testing stage: Speaker Comparison Mechanism, Speaker Identity]
Usage • Security authentication • Speaker identification for telephone-based systems • Criminal investigation (for example, used like a fingerprint)
Screenshot [Figure: screenshot showing the input source, flexible-length comparison, the media player for visual verification, and the speaker identity]
Telephony Message Notification [Figure: block diagram with the boxes Caller phone, User can't listen, Record the message left by the caller, AdvAIR segmentation, GMM model comparison, Model database, Desired group, Non-desired group, Messaging API, Short Message System and E-mail system]
Threshold-free BIC criterion • Background noise affects accuracy
Enhanced Implementation • The speed-up depends on the number of change points relative to the length of the clip
GMM model closed-set speaker identification • Training stage: 10 speakers (5 male, 5 female), 20 minutes for each speaker • Testing stage: 50 sound clips of 5 seconds each • 48 sound clips are identified correctly, i.e. 96%
GMM model open-set speaker identification • The result is Accept or Reject • Same setting as closed-set, i.e. 10 speakers with 20 minutes each • Correct: 45/50 = 90% • False reject: 3/50 = 6% • False accept: 2/50 = 4%
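The difference between the closed-set and open-set decisions can be sketched as follows. This is an assumption-laden illustration: the slides do not describe the scoring or the accept/reject function, so the average per-frame log-likelihood, the fixed `threshold`, the toy single-Gaussian models and all names are hypothetical.

```python
import numpy as np


def avg_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of clip X under a diagonal GMM."""
    diff = X[:, None, :] - means[None, :, :]
    log_p = np.log(weights) - 0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2)
    peak = log_p.max(axis=1, keepdims=True)
    return float(np.mean(peak[:, 0] + np.log(np.exp(log_p - peak).sum(axis=1))))


def identify(X, models, threshold=None):
    """Closed-set: return the best-scoring speaker.

    Open-set: additionally reject (return None) if the best score is below threshold.
    """
    scores = {name: avg_log_likelihood(X, *params) for name, params in models.items()}
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None, scores
    return best, scores


def one_gaussian(mean):
    """Toy single-component model with unit variances (illustration only)."""
    mean = np.atleast_2d(np.asarray(mean, dtype=float))
    return np.ones(1), mean, np.ones_like(mean)


models = {"speaker_a": one_gaussian([0.0, 0.0]), "speaker_b": one_gaussian([5.0, 5.0])}
rng = np.random.default_rng(2)
clip_a = rng.normal(0.0, 1.0, (100, 2))      # resembles speaker_a
stranger = rng.normal(20.0, 1.0, (100, 2))   # resembles nobody in the database
```

With no threshold, `identify(stranger, models)` is forced to name the closest enrolled speaker; with a threshold such as -10.0 the same clip is rejected, which is where the false accept and false reject rates above come from.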
Problems and limitations • Accuracy is affected by background noise • Some speakers have very similar sound features • The open-set speaker identification decision function is not very accurate if the clip duration is short • Segmentation is still a time-consuming process
Future Work • Speaker gender identification • Robust open-set speaker identification • Speech content recognition • Music pattern matching • Distributed system for segmentation