This presentation outlines the project objectives, the ViaVoice recognition experiments, the speech information processor, and audio information retrieval methods for a digital video library. It reviews last term's work, including audio extraction and segmentation, and describes the experiments conducted with IBM ViaVoice for real-time dictation. The experimental results and conclusions are presented, along with the development of a speech information processor for media playback, real-time dictation, timing information retrieval, and audio scene change detection. The presentation highlights the challenges faced in speech recognition and outlines future approaches.
LYU0103 Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo
Outline of Presentation • Project objectives • ViaVoice recognition experiments • Speech information processor • Audio information retrieval • Summary
Our Project Objectives • Speech recognition • Audio information retrieval
Last Term’s Work • Extract the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz) • Segment the wave files into sentences by detecting their frame energy (see the sketch below) • Real-time dictation with IBM ViaVoice (a speech recognition engine developed by IBM) • Developed a visual training tool
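A minimal sketch of the frame-energy segmentation step, assuming the wave file has already been loaded as a mono 22 kHz NumPy array; the frame size, energy threshold, and minimum pause length are illustrative values, not necessarily the ones used last term.

```python
# Sentence segmentation by frame energy: cut the stream wherever the energy
# stays below a threshold long enough to count as a pause between sentences.
import numpy as np

def segment_by_energy(samples, sr=22050, frame_ms=20,
                      energy_thresh=1e-4, min_pause_frames=15):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)   # per-frame energy

    segments, start, silent_run = [], None, 0
    for i, e in enumerate(energy):
        if e >= energy_thresh:
            if start is None:
                start = i                                    # speech begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:               # long pause: close the segment
                segments.append((start * frame_len, (i - silent_run + 1) * frame_len))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments   # list of (start_sample, end_sample)

# Example: one second of "speech" (noise) between two seconds of silence
audio = np.concatenate([np.zeros(22050), np.random.randn(22050) * 0.1, np.zeros(22050)])
print(segment_by_energy(audio))
```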
Visual Training Tool • Video window • Dictation window • Text editor
IBM ViaVoice Experiments • Employed 7 student helpers • Produced transcripts of 77 news video clips • Four experiments: • Baseline measurement • Trained model measurement • Slow-down measurement • Indoor news measurement
Baseline Measurement • Measure the ViaVoice recognition accuracy on TVB news video • Testing set: 10 video clips • The segmented wave files are dictated • Employ the Hidden Markov Model Toolkit (HTK) to score the recognition accuracy (an equivalent accuracy computation is sketched below)
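The scoring itself was done with HTK; the sketch below is only an illustrative stand-in for that tool. It aligns a recognized character sequence against the reference transcript with edit distance and reports %Correct and %Accuracy the way HTK's HResults defines them. The toy Cantonese example strings are hypothetical.

```python
def align_counts(ref, hyp):
    # Levenshtein alignment counting substitutions (S), deletions (D), insertions (I).
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:                          # backtrack along an optimal path
        cost = 1 if (i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]) else 0
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            S += cost
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D, i = D + 1, i - 1
        else:
            I, j = I + 1, j - 1
    return S, D, I

# HTK-style figures on a toy example (characters as the word units)
ref = list("香港今日天氣良好")
hyp = list("香港今天氣良好嗎")
S, D, I = align_counts(ref, hyp)
N = len(ref)
print(f"%Correct  = {100.0 * (N - S - D) / N:.1f}")     # HTK %Corr = (N - S - D) / N
print(f"%Accuracy = {100.0 * (N - S - D - I) / N:.1f}")  # HTK %Acc  = (N - S - D - I) / N
```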
Trained Model Measurement • Measure the accuracy of ViaVoice after training it with its own correctly recognized words • 10 video clips are segmented and dictated • The correctly dictated words of the training set are fed back to ViaVoice through the SMAPI function SmWordCorrection • Repeat the procedure of the baseline measurement after training to obtain the recognition performance • Repeat the procedure using 20 video clips
Slow Down Measurement • Investigate the effect of slowing down the audio channel • Resample the segmented wave files in the testing set by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6 (see the sketch below) • Repeat the procedure of the baseline measurement
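A minimal sketch of one way to realize the slow-down step, under the assumption that "resampling by a ratio" means stretching the waveform by that factor while keeping the original sample rate (so a ratio of 1.1 makes the clip 10% longer and lowers the pitch slightly); linear interpolation is an illustrative choice, not necessarily the tool the project used.

```python
import numpy as np

def slow_down(samples, ratio):
    n_out = int(len(samples) * ratio)                    # stretched length
    old_t = np.arange(len(samples))                      # original sample grid
    new_t = np.linspace(0, len(samples) - 1, n_out)      # denser grid
    return np.interp(new_t, old_t, samples)              # interpolated samples

audio = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)   # 1 s, 440 Hz tone
for r in (1.05, 1.1, 1.2, 1.4, 1.6):
    print(r, len(slow_down(audio, r)) / 22050, "seconds")
```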
Indoor News Measurement • Eliminate the effect of noise • Select only the indoor news reporter sentences • Dictate the test set using the untrained model • Repeat the procedure using the trained model
Experimental Results • Overall recognition results (ViaVoice, TVB News) [results chart not reproduced]
Experimental Results (cont.) • Results of the trained model with different numbers of training videos • Results of different slow-down ratios
Analysis of Experimental Results • Trained model: about 1% accuracy improvement • Slowing down the speech: about 1% accuracy improvement • Indoor speech is recognized much better • Mandarin: estimated baseline accuracy is about 70%, far higher than for Cantonese
Experiment Conclusions • Four reasons for the low accuracy: • Language model mismatch • Voice channel mismatch • The broadcast speech is very fast and some characters are not clearly pronounced • The audio volume of the video clips is too loud • The first two reasons are the most critical
Speech Recognition Approach • We cannot do much acoustic model training through the ViaVoice API • Training is speaker dependent • There is a great difference between the news audio and the speech ViaVoice was trained on • The tool for adapting the acoustic model is not currently available • Manual editing is therefore necessary to produce correct subtitles
Speech Information Processor (SIP) • Media player • Text editor • Audio information panel
Main Features • Media playback • Real-time dictation • Word timing information • Dynamic recognition text editing • Audio scene change detection • Audio segment classification • Gender classification
Timing Information Retrieval • Use the ViaVoice Speech Manager API (SMAPI) • Asynchronous callbacks • The recognized text is organized into basic units called “firm words” • SIP builds an index storing the position and time of each firm word (a sketch of such an index follows) • The corresponding firm word is highlighted during video playback
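A minimal sketch of a firm-word time index, assuming only what the slide states (each firm word carries a character position and start/end times); the class and field names are hypothetical and this is not the actual SMAPI data structure. The lookup shows how the word under the current playback time can be found for highlighting.

```python
import bisect
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str        # recognized text of the firm word
    char_pos: int    # character offset of the word in the text editor
    start_ms: int    # start time in the audio (milliseconds)
    end_ms: int      # end time in the audio (milliseconds)

class FirmWordIndex:
    def __init__(self):
        self.words = []        # firm words in recognition order
        self._starts = []      # parallel list of start times for bisect

    def append(self, word: FirmWord):
        # Firm words arrive in time order from the recognizer callback.
        self.words.append(word)
        self._starts.append(word.start_ms)

    def word_at_time(self, t_ms: int):
        # Find the last firm word whose start time is <= the playback position.
        i = bisect.bisect_right(self._starts, t_ms) - 1
        if 0 <= i < len(self.words) and t_ms <= self.words[i].end_ms:
            return self.words[i]   # highlight this word
        return None

# Example: highlight lookup at playback position 1.25 s
idx = FirmWordIndex()
idx.append(FirmWord("香港", char_pos=0, start_ms=1000, end_ms=1400))
idx.append(FirmWord("新聞", char_pos=2, start_ms=1400, end_ms=1900))
print(idx.word_at_time(1250).text)
```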
Dynamic Index Alignment • Editing the recognized result may change the firm word structure • The word index needs to be updated accordingly • SIP captures the WM_CHAR events of the text editor • It then searches for the modified words and updates the corresponding index entries (see the sketch below) • In practice, binary search gives good response time
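A sketch of the update step under the same assumptions: given the character position reported by an edit event, binary-search the index for the affected firm word, replace its text, and shift the character offsets of the later entries. The data layout and helper names are illustrative, not SIP's actual code.

```python
import bisect

# Index entries as (char_pos, text); timing fields omitted for brevity.
entries = [(0, "香港"), (2, "新聞"), (4, "報道")]

def find_entry(char_pos):
    # Binary search for the last entry starting at or before char_pos.
    positions = [p for p, _ in entries]
    return bisect.bisect_right(positions, char_pos) - 1

def apply_edit(char_pos, new_text):
    i = find_entry(char_pos)
    old_pos, old_text = entries[i]
    shift = len(new_text) - len(old_text)
    entries[i] = (old_pos, new_text)
    # Shift the character offsets of all later entries so the index stays aligned.
    for j in range(i + 1, len(entries)):
        p, t = entries[j]
        entries[j] = (p + shift, t)

apply_edit(2, "財經新聞")   # the user retypes the second firm word
print(entries)
```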
Time Index Alignment Example • Before editing • Editing • After editing (screenshots not reproduced)
Audio Information Panel • The entire clip is divided into segments separated by audio scene changes • SIP classifies the segments into three categories: male, female, and non-speech • Click a segment to preview it
Detection of Audio Scene Changes: Motivations • Segments with different properties can be handled differently • Unsupervised learning can be applied to the different clusters • It serves as an assisting tool for video scene change detection
Bayesian Information Criterion (BIC) • Gaussian distributions model the input stream • Maximum likelihood detects the turns • BIC makes the decision
Principle of BIC • The Bayesian information criterion (BIC) is a likelihood criterion • Its main principle is to penalize the likelihood of a model by the model’s complexity
Detection of a Single Point Change Using BIC • H0 (no change): x1, x2, …, xN ~ N(μ, Σ) • H1 (change at point i): x1, …, xi ~ N(μ1, Σ1) and xi+1, …, xN ~ N(μ2, Σ2) • The maximum likelihood ratio is R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|, where N1 = i and N2 = N − i
Detection of a Single Point Change Using BIC (cont.) • The difference between the BIC values of the two models is BIC(i) = R(i) − λP, with penalty P = (1/2)(d + d(d+1)/2) log N, where d is the dimension of the feature vectors and λ is the penalty weight (typically 1) • If BIC(i) > 0, a scene change is detected at point i
Detection of Multiple Point Changes by BIC • a. Initialize the interval [a, b] with a = 1, b = 2 • b. Detect whether there is a single change point in the interval [a, b] using BIC • c. If there is no change in [a, b], set b = b + 1; otherwise let t be the detected change point and set a = t + 1, b = a + 1 • d. Go to step (b) while data remain (the whole procedure is sketched below)
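A minimal sketch of the two procedures above, assuming the audio has already been converted into a sequence of d-dimensional feature vectors (for example MFCC frames) stored as an array of shape (N, d); the penalty weight λ, the window growth step, and the feature choice are illustrative assumptions, not the project's exact settings.

```python
import numpy as np

def bic_single_change(X, lam=1.0):
    # Evaluate R(i) and BIC(i) for every candidate change point in window X and
    # return (best index, best BIC value); (None, 0.0) means no point has BIC > 0.
    N, d = X.shape
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)        # P from the slide
    logdet_full = np.linalg.slogdet(np.cov(X.T, bias=True))[1]  # log|Sigma|
    best_i, best_bic = None, 0.0
    for i in range(d + 1, N - d - 1):          # keep both halves non-degenerate
        X1, X2 = X[:i], X[i:]
        logdet1 = np.linalg.slogdet(np.cov(X1.T, bias=True))[1]
        logdet2 = np.linalg.slogdet(np.cov(X2.T, bias=True))[1]
        R = N * logdet_full - i * logdet1 - (N - i) * logdet2
        bic = R - lam * penalty
        if bic > best_bic:                     # decision rule: BIC(i) > 0
            best_i, best_bic = i, bic
    return best_i, best_bic

def detect_changes(X, step=50, lam=1.0):
    # Growing-window search: extend [a, b) until a change is found, then restart
    # the window just after the detected point.
    changes, a, b = [], 0, 2 * step
    while b <= len(X):
        t, _ = bic_single_change(X[a:b], lam)
        if t is None:
            b += step                          # no change: grow the window
        else:
            changes.append(a + t)              # change found at absolute frame a + t
            a = a + t + 1
            b = a + 2 * step
    return changes

# Example: two synthetic "audio scenes" with different statistics
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 12)), rng.normal(3, 2, (300, 12))])
print(detect_changes(X))
```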
Advantages of BIC approach • Robustness • Thresholding-free • Optimality
Gender Classification: Motivation and Purpose • Allows different speech analysis algorithms for each gender • Facilitates speech recognition by cutting the search space in half • Helps to build gender-dependent recognition models and to train the system better
Gender Classification • Male and female examples (figures not reproduced)
Speech/Non-Speech Classification • Motivation: distinguish speech segments from non-speech audio • One method we used: pitch tracking (see the sketch below)
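A sketch of pitch-based speech/non-speech classification. The slide names pitch tracking as one method used, but the autocorrelation tracker, the thresholds, and the gender guess by mean pitch below are illustrative assumptions rather than SIP's actual implementation.

```python
import numpy as np

def frame_pitch(frame, sr=22050, fmin=60.0, fmax=400.0):
    # Autocorrelation pitch estimate; returns 0.0 for unvoiced or silent frames.
    frame = frame - frame.mean()
    if np.max(np.abs(frame)) < 1e-3:
        return 0.0
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)          # lag range for 60-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < 0.3 * ac[0]:                        # weak periodicity: not voiced
        return 0.0
    return sr / lag

def classify(samples, sr=22050, frame_ms=40):
    flen = int(sr * frame_ms / 1000)
    pitches = [frame_pitch(samples[i:i + flen], sr)
               for i in range(0, len(samples) - flen, flen)]
    voiced = [p for p in pitches if p > 0]
    if len(voiced) < 0.3 * max(len(pitches), 1):     # too few voiced frames
        return "non-speech"
    # Mean pitch above ~165 Hz suggests a female speaker (one common heuristic).
    return "female speech" if np.mean(voiced) > 165 else "male speech"

# Example: a 150 Hz periodic signal looks like (male) speech; white noise does not.
t = np.arange(22050) / 22050
print(classify(np.sin(2 * np.pi * 150 * t)))
print(classify(np.random.default_rng(1).standard_normal(22050) * 0.1))
```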
Speech/Non-Speech Classification Example • Speech • Non-speech (figures not reproduced)
Summary • ViaVoice training experiments • Speech recognition editing • Dynamic index alignment • Audio scene change detection • Speech classification • Integrated the above functions into a speech processor