IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop
Prosodic and Phonetic Features for Speaking Styles Classification and Detection
November 21-23 2012, Universidad Autónoma de Madrid, Madrid, SPAIN
Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão
Summary
• Objective
• Characterization of the corpus
• Features
• Methods
  • Automatic segmentation
  • Classification
• Results
  • Automatic detection
    • Segmentation
    • Speech versus Non-speech
    • Read versus Spontaneous
  • Classification
    • Speech versus Non-speech
    • Read versus Spontaneous
• Conclusions and future work
Objective
• Automatic detection of speaking styles for segmentation of multimedia data
• What is the style of a speech segment?
• Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared versus unprepared speech)
• Using a combination of phonetic and prosodic features
• Also explore speech/non-speech segmentation
[Figure: continuum of speaking styles, from clear, slow, planned, prepared speech to informal, casual, fast, spontaneous, unprepared speech]
Characterization of the corpus
• Broadcast News audio corpus
  • TV Broadcast News MP4 podcasts, downloaded daily
  • Extract the audio stream and downsample from 44.1 kHz to 16 kHz (see the sketch below)
• 30 daily news programs (~27 hours) were manually segmented and annotated at 4 levels:
  • Level 1 – dominant signal: speech, noise, music, silence, clapping, …
  • For speech:
    • Level 2 – acoustic environment: clean, music, road, crowd, …
    • Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
    • Level 4 – speaker info: BN anchor, gender, public figures, …
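As an aside, a minimal sketch of this extraction/downsampling step using the ffmpeg command-line tool (file paths are hypothetical; the exact tooling used by the authors is not stated on the slide):

```python
import subprocess

def extract_audio(mp4_path, wav_path):
    """Extract the audio stream of a broadcast-news MP4 podcast and
    downsample it to a 16 kHz mono WAV file using the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg",
         "-i", mp4_path,   # input podcast (44.1 kHz audio inside the MP4)
         "-vn",            # discard the video stream
         "-ac", "1",       # mono
         "-ar", "16000",   # resample to 16 kHz
         wav_path],
        check=True)
```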
Characterization of the corpus
• From Level 1 – speech versus non-speech segments
• From Level 3 – read speech (prepared) versus spontaneous speech
• For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed
Features
• Phonetic (size of the parameter vector for each segment: 214)
  • Based on the output of a free phone-loop speech recognizer
  • Phone duration and recognition log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
  • Silence and speech rates
• Prosodic (size of the parameter vector for each segment: 108)
  • Based on the pitch (F0) and harmonic-to-noise ratio (HNR) envelopes
  • First- and second-order statistics
  • Polynomial fits of first and second order
  • Reset rate (rate of voiced portions)
  • Voiced and unvoiced duration rates
  (see the feature-assembly sketch below)
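A minimal sketch of how such per-segment statistics could be assembled, assuming the phone-loop recognizer output and the F0 contour are already available as arrays; the names and the selection of measures are illustrative and do not reproduce the full 322-dimensional vector:

```python
import numpy as np

def functionals(x):
    """The five statistical functions applied to each base measure."""
    x = np.asarray(x, dtype=float)
    return [x.mean(), np.median(x), x.max(), x.min(), x.std()]

def segment_features(phone_durations, phone_loglikelihoods, f0_voiced):
    """Assemble a (partial) per-segment feature vector.

    phone_durations, phone_loglikelihoods: per-phone outputs of a free
    phone-loop recognizer for the segment (hypothetical inputs).
    f0_voiced: F0 values over the voiced frames of the segment.
    """
    feats = []
    feats += functionals(phone_durations)       # phone-duration statistics (phonetic)
    feats += functionals(phone_loglikelihoods)  # recognition log-likelihood statistics (phonetic)
    feats += functionals(f0_voiced)             # pitch statistics (prosodic)
    # first- and second-order polynomial fits of the F0 contour (prosodic)
    t = np.arange(len(f0_voiced))
    feats += list(np.polyfit(t, f0_voiced, 1))
    feats += list(np.polyfit(t, f0_voiced, 2))
    return np.array(feats)
```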
Methods
• Automatic detection implies automatic segmentation and automatic classification
• Automatic segmentation based on a modified BIC (Bayesian Information Criterion): DISTBIC
• Binary classification: SVM classifiers
Methods
• Automatic segmentation
  • DISTBIC – uses a Kullback-Leibler distance in the first step and delta BIC (ΔBIC) to validate the candidate marks (a mark is kept if ΔBIC > 0 and discarded if ΔBIC < 0)
  [Figure: sequence of consecutive segments s_{i-1}, s_i, s_{i+1}, s_{i+2} with candidate change marks validated by ΔBIC]
  • Parameters:
    • Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and log energy (25 ms windows, 10 ms step)
    • A threshold of 0.6 times the standard deviation of the distance curve was used to select significant local maxima; window size: 2000 ms, step 100 ms
    • Silence segments longer than 0.5 seconds are detected and removed before the DISTBIC process (see the sketch below)
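A minimal sketch of the first DISTBIC pass, assuming the MFCC + log-energy frames (10 ms step) are stacked in a NumPy array; the peak-selection criterion below is a simplification of the significance test described above:

```python
import numpy as np

def sym_kl(x, y):
    """Symmetric Kullback-Leibler divergence between two full-covariance
    Gaussians estimated from frame matrices x and y (frames x dims)."""
    mx, my = x.mean(0), y.mean(0)
    Sx, Sy = np.cov(x.T), np.cov(y.T)
    iSx, iSy = np.linalg.inv(Sx), np.linalg.inv(Sy)
    d = mx - my
    return 0.5 * (np.trace(iSy @ Sx) + np.trace(iSx @ Sy)
                  + d @ (iSx + iSy) @ d - 2 * x.shape[1])

def distbic_first_pass(feats, win=200, step=10, alpha=0.6):
    """First DISTBIC pass: slide two adjacent windows of `win` frames each
    (2000 ms at a 10 ms frame step) over the frames, advancing by `step`
    frames (100 ms), and keep local maxima of the distance curve that
    exceed the curve minimum by more than alpha * std(distance)."""
    centers = list(range(win, len(feats) - win, step))
    dist = np.array([sym_kl(feats[c - win:c], feats[c:c + win]) for c in centers])
    threshold = dist.min() + alpha * dist.std()
    candidates = [centers[i] for i in range(1, len(dist) - 1)
                  if dist[i] >= dist[i - 1] and dist[i] >= dist[i + 1]  # local maximum
                  and dist[i] > threshold]                              # significant peak
    return candidates  # candidate marks, validated afterwards with delta-BIC (see appendix)
```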
Methods
• Classification
  • SVM classifiers (WEKA tool – SMO, linear kernel, C=14):
    • speech / non-speech
    • read / spontaneous
  • 2-step classification approach: a first classifier separates speech from non-speech; segments classified as speech are then classified as read or spontaneous (see the sketch below)
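A minimal sketch of the two-step decision, using scikit-learn's SVC only as a stand-in for WEKA's SMO (linear kernel, C=14); the label encodings and function names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-ins for the two linear-kernel SVMs (C=14).
svm_speech = SVC(kernel="linear", C=14.0)  # speech vs. non-speech
svm_style = SVC(kernel="linear", C=14.0)   # read vs. spontaneous

def train(X, is_speech, is_read):
    """X: array of per-segment feature vectors; is_speech, is_read: 0/1 labels."""
    X, is_speech, is_read = map(np.asarray, (X, is_speech, is_read))
    svm_speech.fit(X, is_speech)                               # trained on all segments
    svm_style.fit(X[is_speech == 1], is_read[is_speech == 1])  # trained on speech segments only

def classify(x):
    """Two-step decision for one segment feature vector."""
    if svm_speech.predict([x])[0] == 0:
        return "non-speech"
    return "read" if svm_style.predict([x])[0] == 1 else "spontaneous"
```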
Results
• Performance measures
  • Segmentation only:
    • Collar (detection tolerance) range: 0.5 s to 2.0 s
    • A detected mark is counted as correct if there is a reference mark closer than the collar (see the scoring sketch below)
  • Automatic detection / classification only:
    • "AT" – agreement time = % of frames correctly classified
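A minimal sketch of both measures, assuming boundary marks are given as lists of times in seconds and frame-level labels as equal-length sequences; the authors' exact matching rule may differ:

```python
import numpy as np

def boundary_scores(ref_marks, det_marks, collar):
    """Segmentation scoring: a detected mark is correct if an unmatched
    reference mark lies within `collar` seconds of it."""
    ref = list(ref_marks)
    hits = 0
    for d in det_marks:
        match = next((r for r in ref if abs(d - r) <= collar), None)
        if match is not None:
            hits += 1
            ref.remove(match)  # each reference mark can be matched only once
    precision = hits / len(det_marks) if det_marks else 0.0
    recall = hits / len(ref_marks) if ref_marks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

def agreement_time(ref_frames, hyp_frames):
    """AT: percentage of frames whose hypothesised class label agrees
    with the reference label."""
    ref_frames, hyp_frames = np.asarray(ref_frames), np.asarray(hyp_frames)
    return 100.0 * np.mean(ref_frames == hyp_frames)
```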
Results
• Segmentation performance
  • [Figure: F1-score as a function of the collar, for collar values from 0.5 s to 2.0 s]
Results
• Segmentation performance
  • [Figure: Recall as a function of the collar, for collar values from 0.5 s to 2.0 s]
Results
• Automatic detection
  • Speech / non-speech detection
  • Read / spontaneous detection
  • "AT" – agreement time = % of frames correctly classified
Results
• Classification only (using the given manual segmentation)
  • Speech / non-speech classifier
  • Read / spontaneous classifier
  • "Acc." – Accuracy
Conclusions and future work
• Read speech can be differentiated from spontaneous speech with reasonable accuracy.
• Good results were obtained with only a few simple measures of the speech signal.
• The combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).
• We have already implemented several related components, such as hesitation detection, aspiration detection using word-spotting techniques, speaker identification using GMMs and jingle detection based on audio fingerprinting.
• We intend to automatically segment all audio genres and speaking styles.
THANK YOU
Appendix – BIC
• BIC (Bayesian Information Criterion)
  • Dissimilarity measure between 2 consecutive segments
  • Two hypotheses for the merged segment X = X1 ∪ X2 (N frames of dimension p):
    • H0 – no change of signal characteristics. Model: 1 Gaussian, X ~ N(μ, S)
    • H1 – change of characteristics. 2 Gaussians: X1 ~ N(μ1, S1), X2 ~ N(μ2, S2)
    • μ – mean vector; S – covariance matrix
  • Maximum likelihood ratio between H0 and H1:
    R = (N/2)·log|S| − (N1/2)·log|S1| − (N2/2)·log|S2|
Appendix – BIC
• P – complexity penalty: P = ½·(p + p(p+1)/2)·log N
• λ – penalization factor (ideal value: 1.0)
• A change is detected if ΔBIC = R − λ·P > 0 (see the sketch below)
• Parameters used in this work:
  • p = 16; λ = 1.3; frame rate = 100; N = 200; M = 10
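A minimal sketch of this ΔBIC computation, assuming the two adjacent segments are given as frame matrices of the acoustic vectors; np.cov's default normalisation stands in for the strict maximum-likelihood covariance estimate:

```python
import numpy as np

def delta_bic(x1, x2, lam=1.3):
    """Delta-BIC between two adjacent segments (each of shape frames x p);
    positive values support H1, i.e. a change point between the segments."""
    def logdet(m):
        return np.log(np.linalg.det(np.cov(m.T)))

    x = np.vstack([x1, x2])
    n, p = x.shape
    n1, n2 = len(x1), len(x2)
    # maximum-likelihood ratio R between H0 (one Gaussian) and H1 (two Gaussians)
    r = 0.5 * (n * logdet(x) - n1 * logdet(x1) - n2 * logdet(x2))
    # model-complexity penalty P
    pen = 0.5 * (p + 0.5 * p * (p + 1)) * np.log(n)
    return r - lam * pen
```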