FYP0202 Advanced Audio Information Retrieval System By Alex Fok, Shirley Ng
Outline • Overview • Read in the raw speech • MFCC processing • Detect the audio scene change • Audio Clustering • Interleave Audio Clustering • Conclusion
Overview • Automatic segmentation of an audio stream and automatic clustering of audio segments have received considerable attention in recent years. • For example, in the task of automatic transcription of broadcast news, the data contains clean speech, telephone speech, music segments, and speech corrupted by music or noise.
Overview (cont’) • We would like to SEGMENT the audio stream into homogeneous regions according to speaker identity. • We would like to CLUSTER speech segments into homogeneous clusters according to speaker identity.
Step 1: Read in the raw speech • Read in an MPEG file as input • Convert the file from .mpeg format to .wav format • This is necessary because the MFCC library only processes .wav files
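The slide does not say how the conversion is done; a common approach is to shell out to ffmpeg. The sketch below is one hypothetical way to do it (the filenames, sample rate, and helper names are assumptions, not part of the original system), assuming ffmpeg is installed on the machine:

```python
import subprocess

def build_wav_conversion_cmd(mpeg_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command line that decodes an MPEG file
    to a mono, 16-bit PCM .wav file at the given sample rate."""
    return [
        "ffmpeg", "-i", mpeg_path,
        "-ac", "1",                   # downmix to mono
        "-ar", str(sample_rate),      # resample (16 kHz is typical for speech)
        "-acodec", "pcm_s16le",       # 16-bit little-endian PCM
        wav_path,
    ]

def convert_to_wav(mpeg_path, wav_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_wav_conversion_cmd(mpeg_path, wav_path), check=True)
```

Separating command construction from execution makes the conversion easy to test without actually invoking ffmpeg.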
Step 2: MFCC processing • A .wav file is viewed as a sequence of frames (Frame 1, Frame 2, Frame 3, …), each described by a set of features • We make use of the MFCC library to convert the .wav data into MFCC features for processing • We extract 24 features for each frame • The results are stored in feature vectors, one per frame
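The framing step that precedes MFCC extraction can be sketched in plain Python. This is a simplified illustration, not the MFCC library's actual interface: it only splits the sample stream into overlapping frames (e.g. 25 ms frames with a 10 ms hop at 16 kHz); in the real pipeline each frame would then be passed to the MFCC routine, which returns the 24 coefficients per frame mentioned above.

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a sequence of audio samples into overlapping frames.

    With a 16 kHz sample rate, frame_len=400 and hop=160 correspond
    to 25 ms frames advanced every 10 ms, a common speech setup.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames
```

Each element of the returned list stands in for one frame's feature vector x_i used in the change-detection step that follows.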
Step 3: Detect the audio scene change • Make use of the feature vectors to detect audio scene changes • The input audio stream is modeled as a Gaussian process • A model selection criterion called BIC (Bayesian Information Criterion) is used to detect the change points
Step 3: Detect the audio scene change (cont’) • Denote xi (i = 1,…,N) as the feature vector of frame i • N is the total number of frames • μi : mean vector of frame i • Σi : full covariance matrix of frame i • R(i) = N log |Σ| − N1 log |Σ1| − N2 log |Σ2| • Σ, Σ1, Σ2 are the sample covariance matrices estimated from all the data, from {x1,…,xi}, and from {xi+1,…,xN} respectively, with N1 = i and N2 = N − i
Step 3: Detect the audio scene change (cont’) • BIC(i) = R(i) − P, where P is a penalty term that grows with model complexity • If there is only one change point, the frame with the highest (positive) BIC score is the change point • If there is more than one change point, the algorithm is extended by applying the same single-change test repeatedly to successive sub-segments of the stream
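A minimal sketch of the single-change-point search, under a simplifying assumption not made in the slides: the features are one-dimensional, so the determinant |Σ| reduces to the sample variance. The penalty value is left as a caller-supplied parameter rather than the specific complexity term the system uses.

```python
import math

def variance(xs):
    """Biased sample variance (plays the role of |Sigma| in 1-D)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def bic_change_point(xs, penalty):
    """Return (best_index, best_bic) for a single candidate change point.

    Scalar simplification of R(i) = N log|S| - N1 log|S1| - N2 log|S2|,
    with BIC(i) = R(i) - penalty.  A change is declared only when the
    best BIC score is positive; the caller checks best_bic > 0.
    """
    n = len(xs)
    total = n * math.log(variance(xs))
    best_i, best_bic = None, float("-inf")
    for i in range(2, n - 2):  # keep both halves non-degenerate
        left, right = xs[:i], xs[i:]
        r = total - i * math.log(variance(left)) - (n - i) * math.log(variance(right))
        bic = r - penalty
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i, best_bic
```

On a stream whose statistics shift partway through, the maximizing index lands at the boundary between the two regimes, matching the "frame with highest BIC score" rule above.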
Step 4: Audio Clustering • To speed up change detection, we only locate the change points roughly. • As a result, some change points may be calculated wrongly. • In this step, we try to combine wrongly segmented neighboring segments: • Compare each segment with its neighbors; if they are speech from the same person, combine them.
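The neighbor-merging pass can be sketched as follows. This is an illustration under an assumed segment representation of (speaker, start, end) tuples; in the real system the "same speaker" decision would come from comparing the segments' acoustic statistics (e.g. with a BIC test), not from pre-existing labels.

```python
def merge_adjacent(segments):
    """Merge neighboring segments that belong to the same speaker.

    Each segment is a (speaker, start, end) tuple.  When two adjacent
    segments share a speaker, they are fused into one segment spanning
    both, repairing spurious change points from the rough detection pass.
    """
    merged = []
    for seg in segments:
        if merged and merged[-1][0] == seg[0]:
            spk, start, _ = merged[-1]
            merged[-1] = (spk, start, seg[2])  # extend previous segment
        else:
            merged.append(seg)
    return merged
```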
Step 5: Interleave Audio Clustering • Group all the segments of the same speaker into one node. • Before: Speaker 1, Speaker 2, Speaker 1, Speaker 1 (separate segments) • After: Combined Speaker 1 (all Speaker 1 segments in one node), Speaker 2
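Unlike Step 4, this pass also merges non-adjacent (interleaved) segments of the same speaker. A minimal sketch, again assuming labeled (speaker, start, end) tuples rather than the acoustic comparison the real system would use:

```python
def group_by_speaker(segments):
    """Collect all segments of each speaker into one node.

    Returns a dict mapping speaker -> list of (start, end) intervals,
    so interleaved segments of one speaker end up grouped together.
    """
    groups = {}
    for spk, start, end in segments:
        groups.setdefault(spk, []).append((start, end))
    return groups
```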
Conclusion • We would like to build a precise and fast engine that recognizes the identity of each speaker in a wave file. • We would like to group all segments of the same speaker in the wave file.
Conclusion (cont’) • Instead of making local decisions based on the distance between fixed-size samples, we expand the decision window as wide as possible • We avoid repeated calculation by using dynamic programming • The detection algorithm can detect acoustic change points with reasonable detectability