This presentation outlines the project objectives, the ViaVoice recognition experiments, the speech information processor, and audio information retrieval methods for a digital video library. It reviews last term's work, including audio extraction and segmentation, and describes the experiments conducted with IBM ViaVoice for real-time dictation. The experimental results and conclusions are presented, along with the development of a speech information processor for media playback, real-time dictation, timing information retrieval, and audio scene change detection. The presentation highlights the challenges faced in speech recognition and outlines future approaches.
LYU0103 Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo
Outline of Presentation • Project objectives • ViaVoice recognition experiments • Speech information processor • Audio information retrieval • Summary
Our Project Objectives • Speech recognition • Audio information retrieval
Last Term’s Work • Extract the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz) • Segment the wave files into sentences by detecting their frame energy (see the sketch below) • Real-time dictation with IBM ViaVoice (a speech recognition engine developed by IBM) • Developed a visual training tool
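A minimal sketch of the frame-energy segmentation step, assuming the wave file has already been loaded as a mono 22 kHz NumPy array; the frame size, energy threshold, and minimum pause length are illustrative values, not necessarily the ones used last term.

```python
# Sentence segmentation by frame energy: cut the stream wherever the energy
# stays below a threshold long enough to count as a pause between sentences.
import numpy as np

def segment_by_energy(samples, sr=22050, frame_ms=20,
                      energy_thresh=1e-4, min_pause_frames=15):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)   # per-frame energy

    segments, start, silent_run = [], None, 0
    for i, e in enumerate(energy):
        if e >= energy_thresh:
            if start is None:
                start = i                                    # speech begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:               # long pause: close the segment
                segments.append((start * frame_len, (i - silent_run + 1) * frame_len))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments   # list of (start_sample, end_sample)

# Example: one second of "speech" (noise) between two seconds of silence
audio = np.concatenate([np.zeros(22050), np.random.randn(22050) * 0.1, np.zeros(22050)])
print(segment_by_energy(audio))
```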
Visual Training Tool • Video window • Dictation window • Text editor
IBM ViaVoice Experiments • Employed 7 student helpers • Produced transcripts of 77 news video clips • Four experiments: • Baseline measurement • Trained model measurement • Slow-down measurement • Indoor news measurement
Baseline Measurement • Measure the ViaVoice recognition accuracy on TVB news video • Testing set: 10 video clips • The segmented wave files are dictated • Employ the Hidden Markov Model Toolkit (HTK) to score the recognition accuracy (an equivalent accuracy computation is sketched below)
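The scoring itself was done with HTK; the sketch below is only an illustrative stand-in for that tool. It aligns a recognized character sequence against the reference transcript with edit distance and reports %Correct and %Accuracy the way HTK's HResults defines them. The toy Cantonese example strings are hypothetical.

```python
def align_counts(ref, hyp):
    # Levenshtein alignment counting substitutions (S), deletions (D), insertions (I).
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:                          # backtrack along an optimal path
        cost = 1 if (i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]) else 0
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            S += cost
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D, i = D + 1, i - 1
        else:
            I, j = I + 1, j - 1
    return S, D, I

# HTK-style figures on a toy example (characters as the word units)
ref = list("香港今日天氣良好")
hyp = list("香港今天氣良好嗎")
S, D, I = align_counts(ref, hyp)
N = len(ref)
print(f"%Correct  = {100.0 * (N - S - D) / N:.1f}")     # HTK %Corr = (N - S - D) / N
print(f"%Accuracy = {100.0 * (N - S - D - I) / N:.1f}")  # HTK %Acc  = (N - S - D - I) / N
```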
Trained Model Measurement • Measure the accuracy of ViaVoice after training it with its own correctly recognized words • 10 video clips are segmented and dictated • The correctly dictated words of the training set are fed back to ViaVoice through the SMAPI function SmWordCorrection • Repeat the procedure of the baseline measurement after training to obtain the recognition performance • Repeat the procedure using 20 video clips
Slow Down Measurement • Investigate the effect of slowing down the audio channel • Resample the segmented wave files in the testing set by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6 (see the sketch below) • Repeat the procedure of the baseline measurement
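A minimal sketch of one way to realize the slow-down step, under the assumption that "resampling by a ratio" means stretching the waveform by that factor while keeping the original sample rate (so a ratio of 1.1 makes the clip 10% longer and lowers the pitch slightly); linear interpolation is an illustrative choice, not necessarily the tool the project used.

```python
import numpy as np

def slow_down(samples, ratio):
    n_out = int(len(samples) * ratio)                    # stretched length
    old_t = np.arange(len(samples))                      # original sample grid
    new_t = np.linspace(0, len(samples) - 1, n_out)      # denser grid
    return np.interp(new_t, old_t, samples)              # interpolated samples

audio = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)   # 1 s, 440 Hz tone
for r in (1.05, 1.1, 1.2, 1.4, 1.6):
    print(r, len(slow_down(audio, r)) / 22050, "seconds")
```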
Indoor News Measurement • Eliminate the effect of noise • Select only the indoor news reporter sentences • Dictate the test set using the untrained model • Repeat the procedure using the trained model
Experimental Results • Overall recognition results (ViaVoice, TVB News) [results chart not reproduced]
Experimental Results (cont.) • Results of the trained model with different numbers of training videos • Results of different slow-down ratios
Analysis of Experimental Results • Trained model: about 1% accuracy improvement • Slowing down the speech: about 1% accuracy improvement • Indoor speech is recognized much better • Mandarin: estimated baseline accuracy is about 70%, far higher than for Cantonese
Experiment Conclusions • Four reasons for the low accuracy: • Language model mismatch • Voice channel mismatch • The broadcast speech is very fast and some characters are not clearly pronounced • The audio volume of the video clips is too loud • The first two reasons are the most critical
Speech Recognition Approach • We cannot do much acoustic model training through the ViaVoice API • Training is speaker dependent • There is a great difference between the news audio and the speech ViaVoice was trained on • The tool for adapting the acoustic model is not currently available • Manual editing is therefore necessary to produce correct subtitles
Speech Information Processor (SIP) • Media player • Text editor • Audio information panel
Main Features • Media playback • Real-time dictation • Word timing information • Dynamic recognition text editing • Audio scene change detection • Audio segment classification • Gender classification
Timing Information Retrieval • Use the ViaVoice Speech Manager API (SMAPI) • Asynchronous callbacks • The recognized text is organized into basic units called “firm words” • SIP builds an index storing the position and time of each firm word (a sketch of such an index follows) • The corresponding firm word is highlighted during video playback
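A minimal sketch of a firm-word time index, assuming only what the slide states (each firm word carries a character position and start/end times); the class and field names are hypothetical and this is not the actual SMAPI data structure. The lookup shows how the word under the current playback time can be found for highlighting.

```python
import bisect
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str        # recognized text of the firm word
    char_pos: int    # character offset of the word in the text editor
    start_ms: int    # start time in the audio (milliseconds)
    end_ms: int      # end time in the audio (milliseconds)

class FirmWordIndex:
    def __init__(self):
        self.words = []        # firm words in recognition order
        self._starts = []      # parallel list of start times for bisect

    def append(self, word: FirmWord):
        # Firm words arrive in time order from the recognizer callback.
        self.words.append(word)
        self._starts.append(word.start_ms)

    def word_at_time(self, t_ms: int):
        # Find the last firm word whose start time is <= the playback position.
        i = bisect.bisect_right(self._starts, t_ms) - 1
        if 0 <= i < len(self.words) and t_ms <= self.words[i].end_ms:
            return self.words[i]   # highlight this word
        return None

# Example: highlight lookup at playback position 1.25 s
idx = FirmWordIndex()
idx.append(FirmWord("香港", char_pos=0, start_ms=1000, end_ms=1400))
idx.append(FirmWord("新聞", char_pos=2, start_ms=1400, end_ms=1900))
print(idx.word_at_time(1250).text)
```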
Dynamic Index Alignment • Editing the recognized result may change the firm word structure • The word index needs to be updated accordingly • SIP captures the WM_CHAR events of the text editor • It then searches for the modified words and updates the corresponding index entries (see the sketch below) • In practice, binary search gives good response time
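A sketch of the update step under the same assumptions: given the character position reported by an edit event, binary-search the index for the affected firm word, replace its text, and shift the character offsets of the later entries. The data layout and helper names are illustrative, not SIP's actual code.

```python
import bisect

# Index entries as (char_pos, text); timing fields omitted for brevity.
entries = [(0, "香港"), (2, "新聞"), (4, "報道")]

def find_entry(char_pos):
    # Binary search for the last entry starting at or before char_pos.
    positions = [p for p, _ in entries]
    return bisect.bisect_right(positions, char_pos) - 1

def apply_edit(char_pos, new_text):
    i = find_entry(char_pos)
    old_pos, old_text = entries[i]
    shift = len(new_text) - len(old_text)
    entries[i] = (old_pos, new_text)
    # Shift the character offsets of all later entries so the index stays aligned.
    for j in range(i + 1, len(entries)):
        p, t = entries[j]
        entries[j] = (p + shift, t)

apply_edit(2, "財經新聞")   # the user retypes the second firm word
print(entries)
```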
Time Index Alignment Example • Before editing • Editing • After editing (screenshots not reproduced)
Audio Information Panel • The entire clip is divided into segments separated by audio scene changes • SIP classifies the segments into three categories: male, female, and non-speech • Click a segment to preview it
Detection of Audio Scene Changes: Motivations • Segments with different properties can be handled differently • Unsupervised learning can be applied to the different clusters • It serves as an assisting tool for video scene change detection
Bayesian Information Criterion (BIC) • Gaussian distributions model the input stream • Maximum likelihood detects the turns • BIC makes the decision
Principle of BIC • The Bayesian information criterion (BIC) is a likelihood criterion • Its main principle is to penalize the likelihood of a model by the model’s complexity
Detection of a Single Point Change Using BIC • H0 (no change): x1, x2, …, xN ~ N(μ, Σ) • H1 (change at point i): x1, …, xi ~ N(μ1, Σ1) and xi+1, …, xN ~ N(μ2, Σ2) • The maximum likelihood ratio is R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|, where N1 = i and N2 = N − i
Detection of a Single Point Change Using BIC (cont.) • The difference between the BIC values of the two models is BIC(i) = R(i) − λP, with penalty P = (1/2)(d + d(d+1)/2) log N, where d is the dimension of the feature vectors and λ is the penalty weight (typically 1) • If BIC(i) > 0, a scene change is detected at point i
Detection of Multiple Point Changes by BIC • a. Initialize the interval [a, b] with a = 1, b = 2 • b. Detect whether there is a single change point in the interval [a, b] using BIC • c. If there is no change in [a, b], set b = b + 1; otherwise let t be the detected change point and set a = t + 1, b = a + 1 • d. Go to step (b) while data remain (the whole procedure is sketched below)
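A minimal sketch of the two procedures above, assuming the audio has already been converted into a sequence of d-dimensional feature vectors (for example MFCC frames) stored as an array of shape (N, d); the penalty weight λ, the window growth step, and the feature choice are illustrative assumptions, not the project's exact settings.

```python
import numpy as np

def bic_single_change(X, lam=1.0):
    # Evaluate R(i) and BIC(i) for every candidate change point in window X and
    # return (best index, best BIC value); (None, 0.0) means no point has BIC > 0.
    N, d = X.shape
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)        # P from the slide
    logdet_full = np.linalg.slogdet(np.cov(X.T, bias=True))[1]  # log|Sigma|
    best_i, best_bic = None, 0.0
    for i in range(d + 1, N - d - 1):          # keep both halves non-degenerate
        X1, X2 = X[:i], X[i:]
        logdet1 = np.linalg.slogdet(np.cov(X1.T, bias=True))[1]
        logdet2 = np.linalg.slogdet(np.cov(X2.T, bias=True))[1]
        R = N * logdet_full - i * logdet1 - (N - i) * logdet2
        bic = R - lam * penalty
        if bic > best_bic:                     # decision rule: BIC(i) > 0
            best_i, best_bic = i, bic
    return best_i, best_bic

def detect_changes(X, step=50, lam=1.0):
    # Growing-window search: extend [a, b) until a change is found, then restart
    # the window just after the detected point.
    changes, a, b = [], 0, 2 * step
    while b <= len(X):
        t, _ = bic_single_change(X[a:b], lam)
        if t is None:
            b += step                          # no change: grow the window
        else:
            changes.append(a + t)              # change found at absolute frame a + t
            a = a + t + 1
            b = a + 2 * step
    return changes

# Example: two synthetic "audio scenes" with different statistics
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 12)), rng.normal(3, 2, (300, 12))])
print(detect_changes(X))
```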
Advantages of BIC approach • Robustness • Thresholding-free • Optimality
Gender Classification: Motivation and Purpose • Allows different speech analysis algorithms for each gender • Facilitates speech recognition by cutting the search space in half • Helps to build gender-dependent recognition models and to train the system better
Gender Classification • Male and female examples (figures not reproduced)
Speech/Non-Speech Classification • Motivation: distinguish speech segments from non-speech audio • One method we used: pitch tracking (see the sketch below)
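A sketch of pitch-based speech/non-speech classification. The slide names pitch tracking as one method used, but the autocorrelation tracker, the thresholds, and the gender guess by mean pitch below are illustrative assumptions rather than SIP's actual implementation.

```python
import numpy as np

def frame_pitch(frame, sr=22050, fmin=60.0, fmax=400.0):
    # Autocorrelation pitch estimate; returns 0.0 for unvoiced or silent frames.
    frame = frame - frame.mean()
    if np.max(np.abs(frame)) < 1e-3:
        return 0.0
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)          # lag range for 60-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < 0.3 * ac[0]:                        # weak periodicity: not voiced
        return 0.0
    return sr / lag

def classify(samples, sr=22050, frame_ms=40):
    flen = int(sr * frame_ms / 1000)
    pitches = [frame_pitch(samples[i:i + flen], sr)
               for i in range(0, len(samples) - flen, flen)]
    voiced = [p for p in pitches if p > 0]
    if len(voiced) < 0.3 * max(len(pitches), 1):     # too few voiced frames
        return "non-speech"
    # Mean pitch above ~165 Hz suggests a female speaker (one common heuristic).
    return "female speech" if np.mean(voiced) > 165 else "male speech"

# Example: a 150 Hz periodic signal looks like (male) speech; white noise does not.
t = np.arange(22050) / 22050
print(classify(np.sin(2 * np.pi * 150 * t)))
print(classify(np.random.default_rng(1).standard_normal(22050) * 0.1))
```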
Speech/Non-Speech Classification Example • Speech • Non-speech (figures not reproduced)
Summary • ViaVoice training experiments • Speech recognition editing • Dynamic index alignment • Audio scene change detection • Speech classification • Integrated the above functions into a speech processor