Computer Science Department

A Speech / Music Discriminator using RMS and Zero-crossings
Costas Panagiotakis and George Tziritas
Department of Computer Science, University of Crete, Heraklion, Greece

EUSIPCO 2002, Toulouse, France

Presentation Organization
• Introduction
• Segmentation
• Classification
• Results
• Conclusion

Introduction (1/3)
• Input: Figure 1: Original sound signal (44100 Hz or 22050 Hz sample rate)
• Output: Figure 2: Real-time segmentation and classification (Speech, Music, Silence)

Introduction (2/3)
Approaches
• Feature extraction (energy, frequency)
• Feature-based segmentation and classification
Basic purpose
• Real-time segmentation and classification
• Algorithmic and computational constraints
• Low number of features
• Low change-detection error (20 msec)
• Low minimum distance between two changes (1 sec)
• High accuracy (95%)

Introduction (3/3)
Basic Features
• Computed every 20 msec
• Independent characteristics
Root Mean Square (RMS)
• Signal energy
• A = [equation not preserved]
• Figure 3: RMS in music
• Figure 4: RMS in speech
Zero Crossings (ZC)
• Mean frequency
• Figure 5: ZC in music
• Figure 6: ZC in speech
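The two basic features can be sketched as follows. This is a minimal illustration, assuming non-overlapping 20 msec frames, the standard RMS definition (square root of the mean squared sample) and a zero crossing counted as a sign change between consecutive samples. The function name `frame_features` is hypothetical and not taken from the slides.

```python
import math

def frame_features(samples, sample_rate, frame_sec=0.020):
    """Return (rms_list, zc_list), one value per non-overlapping frame.

    Assumptions: `samples` is a sequence of floats; a trailing partial
    frame is dropped.
    """
    n = int(round(sample_rate * frame_sec))  # samples per 20 msec frame
    rms, zc = [], []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        # Root mean square of the frame (signal energy measure)
        rms.append(math.sqrt(sum(x * x for x in frame) / n))
        # Number of sign changes between consecutive samples
        zc.append(sum(1 for a, b in zip(frame, frame[1:])
                      if (a >= 0) != (b >= 0)))
    return rms, zc
```

At 44100 Hz each frame holds 882 samples, so one second of audio yields 50 RMS values and 50 ZC values.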

Segmentation (1/3)
Basic characteristics
• RMS based
• The χ² distribution fits the RMS histograms well
• Density: [equation not preserved; only the factor Γ(a + 1) survives], where m is the mean and s² the variance
• Figure 7: Histogram of RMS in speech, approximation by the χ² distribution
• Figure 8: Histogram of RMS in speech, approximation by the χ² distribution
Two-stage algorithm
• Stage 1: 1 sec accuracy (low computation cost)
• Stage 2: 20 msec accuracy (high computation cost)
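The density formula on this slide did not survive, so the following is only one plausible reading: a gamma-type (generalized χ²) density p(x) = b^(a+1) x^a e^(-bx) / Γ(a + 1), with a and b set by moment matching from the mean m and variance s². Both the parameterization and the names `fit_gamma_moments` and `gamma_pdf` are assumptions made for illustration.

```python
import math

def fit_gamma_moments(values):
    """Moment matching (assumed): a = m^2/s^2 - 1, b = m/s^2."""
    m = sum(values) / len(values)
    s2 = sum((v - m) ** 2 for v in values) / len(values)
    if s2 == 0:
        raise ValueError("variance is zero; the fit is undefined")
    return m * m / s2 - 1.0, m / s2

def gamma_pdf(x, a, b):
    """p(x) = b^(a+1) x^a exp(-b x) / Gamma(a+1) for x > 0, else 0."""
    if x <= 0:
        return 0.0
    log_p = (a + 1) * math.log(b) + a * math.log(x) - b * x - math.lgamma(a + 1)
    return math.exp(log_p)
```

Under this parameterization the fitted density has mean (a + 1)/b = m and variance (a + 1)/b² = s², so two numbers per frame summarize the whole RMS histogram.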

Segmentation (2/3)
Stage 1
• Partitioning into 1 sec frames (50 RMS values each)
• A change in frame i implies that frames i-1 and i+1 must differ
• Computation of the frame distance D (Matusita distance) from the frame similarity ρ(p1, p2)
• Frame i is a candidate for Stage 2 (there is a change) if D(i) > threshold and D(i) is a local maximum
Figure: RMS over time, partitioned into 1 sec frames (frames i-1 to i+2, LOW and HIGH levels, change in frame i), with the distance D
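Stage 1 can be sketched as below. The sketch assumes each 1 sec frame is summarized by a moment-matched gamma-type density (shape k, rate b), that the similarity is the integral of sqrt(p1·p2), and that the distance is the standard Matusita form sqrt(2 - 2ρ). The slides do not show the authors' exact normalization, and all function names and the threshold value are hypothetical.

```python
import math

def fit(values):
    """Moment-matched (shape, rate) of a gamma-type density (assumption)."""
    m = max(sum(values) / len(values), 1e-9)   # guard against all-zero frames
    s2 = max(sum((v - m) ** 2 for v in values) / len(values), 1e-12)
    return m * m / s2, m / s2

def similarity(p, q):
    """Closed form of the integral of sqrt(p*q) for two gamma densities."""
    (k1, b1), (k2, b2) = p, q
    k = 0.5 * (k1 + k2)
    log_rho = (0.5 * k1 * math.log(b1) + 0.5 * k2 * math.log(b2)
               + math.lgamma(k) - 0.5 * (math.lgamma(k1) + math.lgamma(k2))
               - k * math.log(0.5 * (b1 + b2)))
    return min(1.0, math.exp(log_rho))

def matusita(p, q):
    return math.sqrt(max(0.0, 2.0 - 2.0 * similarity(p, q)))

def stage1_candidates(rms, threshold, per_frame=50):
    """Return (candidate frame indices, distance per frame)."""
    frames = [rms[i:i + per_frame]
              for i in range(0, len(rms) - per_frame + 1, per_frame)]
    params = [fit(f) for f in frames]
    d = [0.0] * len(frames)
    for i in range(1, len(frames) - 1):
        # Compare the neighbours of frame i, not frame i itself
        d[i] = matusita(params[i - 1], params[i + 1])
    cands = [i for i in range(1, len(frames) - 1)
             if d[i] > threshold and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    return cands, d
```

Comparing frames i-1 and i+1 rather than i and i+1 keeps the frame that contains the change out of the comparison, since its own RMS values mix the two regimes.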

Segmentation (3/3)
Stage 2
• 20 msec accuracy
• For each candidate frame i from Stage 1:
  1. Slide two successive frames (1 sec each) located before and after frame i
  2. Find the time instant where the two successive frames have the maximum Matusita distance in RMS distribution
• Possible oversegmentation
• Figure 10: The RMS data and the distance D
• Figure 11: The segmentation result and the RMS data
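The refinement step can be illustrated as follows. To keep the sketch short it swaps in a histogram-based estimate of the Matusita distance, sqrt(Σ(√p - √q)²), in place of a parametric fit; that substitution, the bin count and the function names are assumptions, not the authors' implementation.

```python
import math

def hist(values, lo, hi, bins=20):
    """Normalized histogram of `values` over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0   # avoid zero width for constant data
    for v in values:
        counts[min(bins - 1, max(0, int((v - lo) / width)))] += 1
    return [c / len(values) for c in counts]

def matusita_hist(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)))

def refine_change(rms, frame_index, per_frame=50):
    """Return (t, d): the RMS index (20 msec units) inside the candidate
    frame where the 1 sec windows before and after t differ most."""
    lo, hi = min(rms), max(rms)
    start = max(per_frame, frame_index * per_frame)
    stop = min(len(rms) - per_frame, (frame_index + 1) * per_frame)
    best_t, best_d = None, -1.0
    for t in range(start, stop + 1):
        d = matusita_hist(hist(rms[t - per_frame:t], lo, hi),
                          hist(rms[t:t + per_frame], lo, hi))
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d
```

Each candidate costs about 50 distance evaluations, which is why this step is reserved for the few frames that pass Stage 1.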

Classification (1/4)
Basic purpose
• Classify each segment into one of the following classes: Music, Speech, Silence
Main algorithm
• Hypothesis: segmentation gives homogeneous segments
• Input: basic characteristics RMS, ZC
• Computation of the actual features of the segment
• Classification based on the actual feature values

Classification (2/4)
Actual features specification
• Normalized RMS variance, σ²_A
  • σ²_A = [equation not preserved]
  • Usually (86%) σ²_A(music) < σ²_A(speech)
• The probability of null ZC, ZC0
  • Always ZC0(music) = 0
  • Usually (40%) ZC0(speech) > 0
• Maximal mean frequency, max(ZC)
  • Almost always in speech max(ZC) < 2.4 kHz
  • In 2% of the cases in music max(ZC) > 2.4 kHz
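A rough sketch of these three segment-level features follows. Since the σ²_A formula is missing from the slide, the code assumes the usual normalization (variance divided by the squared mean), and it assumes a zero-crossing count converts to a mean frequency as ZC / (2 · frame length). The function name and dictionary keys are hypothetical.

```python
def actual_features(rms, zc, frame_sec=0.020):
    """Segment-level features from per-frame RMS and ZC values."""
    n = len(rms)
    mean = sum(rms) / n
    var = sum((v - mean) ** 2 for v in rms) / n
    # Assumed normalization: variance over squared mean
    norm_var = var / (mean * mean) if mean > 0 else 0.0
    # Fraction of frames with no zero crossing at all
    zc0 = sum(1 for z in zc if z == 0) / len(zc)
    # Two zero crossings per cycle (assumption), result in Hz
    max_zc_hz = max(zc) / (2.0 * frame_sec)
    return {"norm_rms_var": norm_var, "zc0": zc0, "max_zc_hz": max_zc_hz}
```

Under the assumed conversion, the 2.4 kHz limit corresponds to 96 zero crossings in a 20 msec frame.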

Classification (3/4)
Actual features specification
• Joint RMS/ZC measure, Cz
  • Speech: high correlation between RMS and ZC; many void intervals imply low RMS and ZC
  • Music: essentially independent RMS and ZC
• Void intervals frequency, Fu
  • Void interval detection (20 msec): (RMS < T1) && (RMS < 0.1·max(RMS(i))) && (RMS < T2) || (ZC = 0)
  • Group neighboring silent intervals
  • Fu: frequency of grouped silent intervals
  • Always in speech Fu > 0.6
  • In at least 65% of music Fu < 0.6
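The void-interval feature might be computed along these lines. The sketch reads the detection rule exactly as written on the slide, with && binding tighter than ||, and assumes Fu counts grouped void intervals per second. The thresholds T1 and T2 are not given in the slides, so `t1` and `t2` are caller-supplied placeholders.

```python
def void_frames(rms, zc, t1, t2):
    """Flag each 20 msec frame as void, following the slide's rule."""
    peak = max(rms)
    return [(r < t1 and r < 0.1 * peak and r < t2) or z == 0
            for r, z in zip(rms, zc)]

def void_interval_frequency(is_void, frame_sec=0.020):
    """Grouped void intervals per second (assumed meaning of Fu)."""
    groups, prev = 0, False
    for v in is_void:
        if v and not prev:   # a new run of void frames starts here
            groups += 1
        prev = v
    return groups / (len(is_void) * frame_sec)
```

Grouping adjacent void frames before counting means a long pause counts once, so Fu reflects how often pauses occur rather than how long they last.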

Computer Science Department A i A Silence segment check Silence Actual features check speech music ομιλία EUSIPCO 2002, Toulouse France 11 Classification (4/4) Silence segment recognition Segment is silence  E < Threshold • Decision making algorithm

Results
Data
• 11,328 sec speech
• 3,131 sec music
Data source
• 70% audio CDs
• 15% WWW
• 15% recordings
Segmentation performance
• 97% detection probability
• Change accuracy ~ 0.2 sec
Actual features performance
• Figure: accuracy for individual features and for combinations of σ²_A, ZC0, Cz and Fu, up to all features (axes: Accuracy, Features)

Conclusion
Complexity
• Minimum complexity O(N)
• Low computation cost
Summary
• Real-time segmentation and classification into three classes
• Energy distribution (RMS) suffices for segmentation
• RMS and ZC suffice for classification
• Purpose: minimum cost and high performance
Future extension
• Content-based indexing and retrieval of audio signals
• Pre-processing stage for speech recognition

Segmentation - Classification Demo

Sound Player Demo
