
Prosodic and Phonetic Features for Speaking Styles Classification and Detection






Presentation Transcript


1. IberSPEECH 2012 – VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop
Prosodic and Phonetic Features for Speaking Styles Classification and Detection
November 21-23, 2012, Universidad Autónoma de Madrid, Madrid, Spain
Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão

2. Summary
• Objective
• Characterization of the corpus
• Features
• Methods
  • Automatic segmentation
  • Classification
• Results
  • Automatic detection
    • Segmentation
    • Speech versus non-speech
    • Read versus spontaneous
  • Classification
    • Speech versus non-speech
    • Read versus spontaneous
• Conclusions and future work

3. Objective
• Automatic detection of speaking styles, for the purpose of segmenting multimedia data
• What is the style of a speech segment?
• Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared versus unprepared speech)
• Use a combination of phonetic and prosodic features
• Also explore speech/non-speech segmentation
[Figure: word cloud of style descriptors — clear, slow, informal, casual, fast, planned, prepared, …, spontaneous, unprepared]

4. Characterization of the corpus
• Broadcast News audio corpus: TV broadcast news MP4 podcasts, downloaded daily
• The audio stream is extracted and downsampled from 44.1 kHz to 16 kHz
• 30 daily news programs (~27 hours) were manually segmented and annotated at 4 levels:
  • Level 1 – dominant signal: speech, noise, music, silence, clapping, …
  • For speech:
    • Level 2 – acoustic environment: clean, music, road, crowd, …
    • Level 3 – speech style: prepared speech, Lombard speech, and 3 levels of unprepared speech (as a function of spontaneity)
    • Level 4 – speaker info: BN anchor, gender, public figures, …

5. Characterization of the corpus
• From Level 1: speech versus non-speech
• From Level 3: read speech (prepared) versus spontaneous speech
• For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed

6. Features
• Phonetic (parameter vector size per segment: 214)
  • Based on the output of a free phone-loop speech recognizer
  • Phone duration and recognition log-likelihood: 5 statistical functions (mean, median, maximum, minimum, and standard deviation)
  • Silence and speech rate
• Prosodic (parameter vector size per segment: 108)
  • Based on the pitch (F0) and harmonics-to-noise ratio (HNR) envelopes
  • First- and second-order statistics
  • Polynomial fits of first and second order
  • Reset rate (rate of voiced portions)
  • Voiced and unvoiced duration rates
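The five statistical functions applied to each per-segment measurement (e.g. phone durations) can be sketched as follows; the function and variable names are illustrative, not from the paper:

```python
import statistics

def five_stats(values):
    """Mean, median, maximum, minimum, and standard deviation of one
    per-segment measurement (e.g. phone durations), as used to build
    the phonetic feature vector. Names here are illustrative."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.stdev(values),
    }

# Phone durations (in seconds) for one hypothetical segment
durations = [0.05, 0.08, 0.12, 0.07, 0.10]
print(five_stats(durations))
```

Applying such functions to each base measurement is what expands a handful of raw quantities into the 214-dimensional phonetic vector.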

7. Methods
• Automatic detection implies automatic segmentation plus automatic classification
• Automatic segmentation: based on a modified BIC (Bayesian Information Criterion) approach, DISTBIC
• Binary classification: SVM classifiers

8. Methods – Automatic segmentation
• DISTBIC: uses a Kullback-Leibler distance in a first step and delta BIC (ΔBIC) to validate candidate marks (ΔBIC > 0 confirms a change point; ΔBIC < 0 discards it)
• Parameters:
  • Acoustic vector: 16 Mel-frequency cepstral coefficients (MFCCs) and log energy (25 ms windows, 10 ms step)
  • A threshold of 0.6 times the standard deviation of the distance curve selects significant local maxima; window size 2000 ms, step 100 ms
  • Silence segments longer than 0.5 s are detected and removed before the DISTBIC process
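The ΔBIC validation step can be sketched for the single-feature case; this is the standard formulation with the paper's λ = 1.3, not the authors' code, and the toy data is hypothetical:

```python
import math
import statistics

def delta_bic(x1, x2, lam=1.3):
    """Delta-BIC between two adjacent 1-D segments of frames.
    Positive values support a change point between them.
    A sketch of the usual single-feature formulation; lam = 1.3
    matches the penalization factor used in the paper."""
    x = x1 + x2
    n, n1, n2 = len(x), len(x1), len(x2)
    var = statistics.pvariance
    # Data term: log-likelihood ratio of one Gaussian vs. two Gaussians
    data = (0.5 * n * math.log(var(x))
            - 0.5 * n1 * math.log(var(x1))
            - 0.5 * n2 * math.log(var(x2)))
    # Complexity penalty for p = 1 feature: (1/2)(p + p(p+1)/2) log n
    penalty = 0.5 * (1 + 1) * math.log(n)
    return data - lam * penalty

same = delta_bic([0.1, -0.2, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1],
                 [0.2, -0.1, 0.0, 0.3, -0.2, 0.1, -0.1, 0.2])
diff = delta_bic([0.1, -0.2, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1],
                 [5.2, 4.9, 5.0, 5.3, 4.8, 5.1, 4.9, 5.2])
print(same < 0, diff > 0)  # similar segments rejected, shifted ones accepted
```

In the actual system the same test is applied to 17-dimensional MFCC+energy vectors with full covariance matrices.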

9. Methods – Classification
• SVM classifiers (WEKA tool – SMO, linear kernel, C=14) for:
  • speech / non-speech
  • read / spontaneous
• Two-step classification: segments are first classified as speech or non-speech; speech segments are then classified as read or spontaneous
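The two-step cascade can be sketched as below. The stand-in threshold classifiers and the two illustrative features are hypothetical; the paper trained SVMs (WEKA's SMO, linear kernel) on the full 322-dimensional vectors:

```python
def classify_segment(features, speech_clf, style_clf):
    """Two-step cascade: first speech/non-speech, then read/spontaneous.
    speech_clf and style_clf stand in for the trained SVM classifiers."""
    if not speech_clf(features):
        return "non-speech"
    return "read" if style_clf(features) else "spontaneous"

# Hypothetical stand-ins over two illustrative features:
# features = (voiced_ratio, speech_rate)
speech_clf = lambda f: f[0] > 0.3   # enough voiced frames -> speech
style_clf = lambda f: f[1] < 5.0    # slower, steadier rate -> read

print(classify_segment((0.1, 0.0), speech_clf, style_clf))  # non-speech
print(classify_segment((0.8, 4.0), speech_clf, style_clf))  # read
print(classify_segment((0.8, 7.5), speech_clf, style_clf))  # spontaneous
```

The cascade keeps each binary problem simple and lets the style classifier ignore non-speech material entirely.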

10. Results – Performance measures
• Segmentation only:
  • Collar (detection tolerance) ranging from 0.5 s to 2.0 s
  • A detected mark is counted as correct if a reference mark lies within the collar distance
• Automatic detection: "AT" – agreement time, the percentage of frames correctly classified
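The collar-based scoring of segmentation marks can be sketched as follows; the function name and toy mark positions are illustrative, and ties are resolved greedily, which the paper does not specify:

```python
def collar_match(detected, reference, collar):
    """Score detected boundary marks (in seconds) against reference
    marks: a detected mark is correct if it lies within `collar`
    seconds of an unused reference mark. Returns precision, recall, F1."""
    matched = 0
    used = set()
    for d in detected:
        for i, r in enumerate(reference):
            if i not in used and abs(d - r) <= collar:
                used.add(i)
                matched += 1
                break
    precision = matched / len(detected)
    recall = matched / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Two of three detected marks fall within a 0.5 s collar of a reference mark
print(collar_match([1.2, 4.9, 9.0], [1.0, 5.0, 7.0], collar=0.5))
```

Sweeping the collar from 0.5 s to 2.0 s, as in the next two slides, shows how tolerant the evaluation is to small boundary offsets.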

11. Results – Segmentation performance
[Figure: F1-score versus collar, for collars from 0.5 s to 2.0 s]

12. Results – Segmentation performance
[Figure: Recall versus collar, for collars from 0.5 s to 2.0 s]

13. Results – Automatic detection
• Speech / non-speech detection
• Read / spontaneous detection
• "AT" – agreement time: percentage of frames correctly classified

14. Results – Classification only (using the given manual segmentation)
• Speech / non-speech classifier
• Read / spontaneous classifier
• "Acc." – accuracy

15. Conclusions and future work
• Read speech can be differentiated from spontaneous speech with reasonable accuracy.
• Good results were obtained with only a few simple measures of the speech signal.
• A combination of phonetic and prosodic features provided the best results (the two sets appear to carry important, complementary information).
• Several related features have already been implemented: hesitation detection, aspiration detection using word-spotting techniques, speaker identification using GMMs, and jingle detection based on audio fingerprinting.
• We intend to automatically segment all audio genres and speaking styles.

16. THANK YOU

17. Appendix – BIC
• BIC (Bayesian Information Criterion): a dissimilarity measure between two consecutive segments
• Two hypotheses:
  • H0 – no change in signal characteristics; the joined segment X is modeled by a single Gaussian: X ~ N(μ, Σ)
  • H1 – change in characteristics; the two segments are modeled by two Gaussians: X1 ~ N(μ1, Σ1), X2 ~ N(μ2, Σ2)
  • μ – mean vector; Σ – covariance matrix
• ΔBIC is based on the maximum-likelihood ratio between H0 and H1

18. Appendix – BIC
• P – complexity penalty
• λ – penalization factor (ideally 1.0)
• A change point is declared if ΔBIC > 0
• Parameters used in this work: p = 16; λ = 1.3; frame rate = 100; N = 200; M = 10
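The equations on these appendix slides were images and did not survive transcription. The standard ΔBIC formulation consistent with the symbols above (a reconstruction of the usual criterion, not the slides verbatim) is:

```latex
\Delta\mathrm{BIC} = \frac{N}{2}\log\lvert\Sigma\rvert
  - \frac{N_1}{2}\log\lvert\Sigma_1\rvert
  - \frac{N_2}{2}\log\lvert\Sigma_2\rvert
  - \lambda P,
\qquad
P = \frac{1}{2}\left(p + \frac{p(p+1)}{2}\right)\log N
```

where N = N1 + N2 is the total number of frames, p is the feature dimension (16 here), and a change point is hypothesized when ΔBIC > 0.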
