IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop
Prosodic and Phonetic Features for Speaking Styles Classification and Detection
November 21-23 2012, Universidad Autónoma de Madrid, Madrid, SPAIN
Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão
Summary
• Objective
• Characterization of the corpus
• Features
• Methods
  • Automatic segmentation
  • Classification
• Results
  • Automatic detection
    • Segmentation
    • Speech versus Non-speech
    • Read versus Spontaneous
  • Classification
    • Speech versus Non-speech
    • Read versus Spontaneous
• Conclusions and future work
Objective
• Automatic detection of speaking styles for segmentation of multimedia data
• What is the style of a speech segment?
• Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared versus unprepared speech)
• Using a combination of phonetic and prosodic features
• Also explore speech/non-speech segmentation
[Figure: continuum of speaking styles, from clear, slow, planned, prepared speech to informal, casual, fast, spontaneous, unprepared speech]
Characterization of the corpus
• Broadcast News audio corpus
  • TV Broadcast News MP4 podcasts, downloaded daily
  • Extract the audio stream and downsample from 44.1 kHz to 16 kHz (see the sketch below)
• 30 daily news programs (~27 hours) were manually segmented and annotated at 4 levels:
  • Level 1 – dominant signal: speech, noise, music, silence, clapping, …
  • For speech:
    • Level 2 – acoustic environment: clean, music, road, crowd, …
    • Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
    • Level 4 – speaker info: BN anchor, gender, public figures, …
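As an aside, a minimal sketch of this extraction/downsampling step using the ffmpeg command-line tool (file paths are hypothetical; the exact tooling used by the authors is not stated on the slide):

```python
import subprocess

def extract_audio(mp4_path, wav_path):
    """Extract the audio stream of a broadcast-news MP4 podcast and
    downsample it to a 16 kHz mono WAV file using the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg",
         "-i", mp4_path,   # input podcast (44.1 kHz audio inside the MP4)
         "-vn",            # discard the video stream
         "-ac", "1",       # mono
         "-ar", "16000",   # resample to 16 kHz
         wav_path],
        check=True)
```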
Characterization of the corpus
• From Level 1 – speech versus non-speech segments
• From Level 3 – read speech (prepared) versus spontaneous speech
• For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed
Features
• Phonetic (size of the parameter vector for each segment: 214)
  • Based on the output of a free phone-loop speech recognizer
  • Phone duration and recognition log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
  • Silence and speech rates
• Prosodic (size of the parameter vector for each segment: 108)
  • Based on the pitch (F0) and harmonic-to-noise ratio (HNR) envelopes
  • First- and second-order statistics
  • Polynomial fits of first and second order
  • Reset rate (rate of voiced portions)
  • Voiced and unvoiced duration rates
  (see the feature-assembly sketch below)
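A minimal sketch of how such per-segment statistics could be assembled, assuming the phone-loop recognizer output and the F0 contour are already available as arrays; the names and the selection of measures are illustrative and do not reproduce the full 322-dimensional vector:

```python
import numpy as np

def functionals(x):
    """The five statistical functions applied to each base measure."""
    x = np.asarray(x, dtype=float)
    return [x.mean(), np.median(x), x.max(), x.min(), x.std()]

def segment_features(phone_durations, phone_loglikelihoods, f0_voiced):
    """Assemble a (partial) per-segment feature vector.

    phone_durations, phone_loglikelihoods: per-phone outputs of a free
    phone-loop recognizer for the segment (hypothetical inputs).
    f0_voiced: F0 values over the voiced frames of the segment.
    """
    feats = []
    feats += functionals(phone_durations)       # phone-duration statistics (phonetic)
    feats += functionals(phone_loglikelihoods)  # recognition log-likelihood statistics (phonetic)
    feats += functionals(f0_voiced)             # pitch statistics (prosodic)
    # first- and second-order polynomial fits of the F0 contour (prosodic)
    t = np.arange(len(f0_voiced))
    feats += list(np.polyfit(t, f0_voiced, 1))
    feats += list(np.polyfit(t, f0_voiced, 2))
    return np.array(feats)
```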
Methods
• Automatic detection implies automatic segmentation and automatic classification
• Automatic segmentation based on a modified BIC (Bayesian Information Criterion): DISTBIC
• Binary classification: SVM classifiers
Methods
• Automatic segmentation
  • DISTBIC – uses a Kullback-Leibler distance in the first step and delta BIC (ΔBIC) to validate the candidate marks (a mark is kept if ΔBIC > 0 and discarded if ΔBIC < 0)
  [Figure: sequence of consecutive segments s_{i-1}, s_i, s_{i+1}, s_{i+2} with candidate change marks validated by ΔBIC]
  • Parameters:
    • Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and log energy (25 ms windows, 10 ms step)
    • A threshold of 0.6 times the standard deviation of the distance curve was used to select significant local maxima; window size: 2000 ms, step 100 ms
    • Silence segments longer than 0.5 seconds are detected and removed before the DISTBIC process (see the sketch below)
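A minimal sketch of the first DISTBIC pass, assuming the MFCC + log-energy frames (10 ms step) are stacked in a NumPy array; the peak-selection criterion below is a simplification of the significance test described above:

```python
import numpy as np

def sym_kl(x, y):
    """Symmetric Kullback-Leibler divergence between two full-covariance
    Gaussians estimated from frame matrices x and y (frames x dims)."""
    mx, my = x.mean(0), y.mean(0)
    Sx, Sy = np.cov(x.T), np.cov(y.T)
    iSx, iSy = np.linalg.inv(Sx), np.linalg.inv(Sy)
    d = mx - my
    return 0.5 * (np.trace(iSy @ Sx) + np.trace(iSx @ Sy)
                  + d @ (iSx + iSy) @ d - 2 * x.shape[1])

def distbic_first_pass(feats, win=200, step=10, alpha=0.6):
    """First DISTBIC pass: slide two adjacent windows of `win` frames each
    (2000 ms at a 10 ms frame step) over the frames, advancing by `step`
    frames (100 ms), and keep local maxima of the distance curve that
    exceed the curve minimum by more than alpha * std(distance)."""
    centers = list(range(win, len(feats) - win, step))
    dist = np.array([sym_kl(feats[c - win:c], feats[c:c + win]) for c in centers])
    threshold = dist.min() + alpha * dist.std()
    candidates = [centers[i] for i in range(1, len(dist) - 1)
                  if dist[i] >= dist[i - 1] and dist[i] >= dist[i + 1]  # local maximum
                  and dist[i] > threshold]                              # significant peak
    return candidates  # candidate marks, validated afterwards with delta-BIC (see appendix)
```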
Methods
• Classification
  • SVM classifiers (WEKA tool – SMO, linear kernel, C=14):
    • speech / non-speech
    • read / spontaneous
  • 2-step classification approach: a first classifier separates speech from non-speech; segments classified as speech are then classified as read or spontaneous (see the sketch below)
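A minimal sketch of the two-step decision, using scikit-learn's SVC only as a stand-in for WEKA's SMO (linear kernel, C=14); the label encodings and function names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-ins for the two linear-kernel SVMs (C=14).
svm_speech = SVC(kernel="linear", C=14.0)  # speech vs. non-speech
svm_style = SVC(kernel="linear", C=14.0)   # read vs. spontaneous

def train(X, is_speech, is_read):
    """X: array of per-segment feature vectors; is_speech, is_read: 0/1 labels."""
    X, is_speech, is_read = map(np.asarray, (X, is_speech, is_read))
    svm_speech.fit(X, is_speech)                               # trained on all segments
    svm_style.fit(X[is_speech == 1], is_read[is_speech == 1])  # trained on speech segments only

def classify(x):
    """Two-step decision for one segment feature vector."""
    if svm_speech.predict([x])[0] == 0:
        return "non-speech"
    return "read" if svm_style.predict([x])[0] == 1 else "spontaneous"
```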
Results
• Performance measures
  • Segmentation only:
    • Collar (detection tolerance) range: 0.5 s to 2.0 s
    • A detected mark is counted as correct if there is a reference mark closer than the collar (see the scoring sketch below)
  • Automatic detection / classification only:
    • "AT" – agreement time = % of frames correctly classified
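A minimal sketch of both measures, assuming boundary marks are given as lists of times in seconds and frame-level labels as equal-length sequences; the authors' exact matching rule may differ:

```python
import numpy as np

def boundary_scores(ref_marks, det_marks, collar):
    """Segmentation scoring: a detected mark is correct if an unmatched
    reference mark lies within `collar` seconds of it."""
    ref = list(ref_marks)
    hits = 0
    for d in det_marks:
        match = next((r for r in ref if abs(d - r) <= collar), None)
        if match is not None:
            hits += 1
            ref.remove(match)  # each reference mark can be matched only once
    precision = hits / len(det_marks) if det_marks else 0.0
    recall = hits / len(ref_marks) if ref_marks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

def agreement_time(ref_frames, hyp_frames):
    """AT: percentage of frames whose hypothesised class label agrees
    with the reference label."""
    ref_frames, hyp_frames = np.asarray(ref_frames), np.asarray(hyp_frames)
    return 100.0 * np.mean(ref_frames == hyp_frames)
```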
Results
• Segmentation performance
  • [Figure: F1-score as a function of the collar, for collar values from 0.5 s to 2.0 s]
Results
• Segmentation performance
  • [Figure: Recall as a function of the collar, for collar values from 0.5 s to 2.0 s]
Results
• Automatic detection
  • Speech / non-speech detection
  • Read / spontaneous detection
  • "AT" – agreement time = % of frames correctly classified
Results
• Classification only (using the given manual segmentation)
  • Speech / non-speech classifier
  • Read / spontaneous classifier
  • "Acc." – Accuracy
Conclusions and future work
• Read speech can be differentiated from spontaneous speech with reasonable accuracy.
• Good results were obtained with only a few simple measures of the speech signal.
• The combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).
• We have already implemented several related components, such as hesitation detection, aspiration detection using word-spotting techniques, speaker identification using GMMs and jingle detection based on audio fingerprinting.
• We intend to automatically segment all audio genres and speaking styles.
THANK YOU
Appendix – BIC
• BIC (Bayesian Information Criterion)
  • Dissimilarity measure between 2 consecutive segments
  • Two hypotheses for the merged segment X = X1 ∪ X2 (N frames of dimension p):
    • H0 – no change of signal characteristics. Model: 1 Gaussian, X ~ N(μ, S)
    • H1 – change of characteristics. 2 Gaussians: X1 ~ N(μ1, S1), X2 ~ N(μ2, S2)
    • μ – mean vector; S – covariance matrix
  • Maximum likelihood ratio between H0 and H1:
    R = (N/2)·log|S| − (N1/2)·log|S1| − (N2/2)·log|S2|
Appendix – BIC
• P – complexity penalty: P = ½·(p + p(p+1)/2)·log N
• λ – penalization factor (ideal value: 1.0)
• A change is detected if ΔBIC = R − λ·P > 0 (see the sketch below)
• Parameters used in this work:
  • p = 16; λ = 1.3; frame rate = 100; N = 200; M = 10
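A minimal sketch of this ΔBIC computation, assuming the two adjacent segments are given as frame matrices of the acoustic vectors; np.cov's default normalisation stands in for the strict maximum-likelihood covariance estimate:

```python
import numpy as np

def delta_bic(x1, x2, lam=1.3):
    """Delta-BIC between two adjacent segments (each of shape frames x p);
    positive values support H1, i.e. a change point between the segments."""
    def logdet(m):
        return np.log(np.linalg.det(np.cov(m.T)))

    x = np.vstack([x1, x2])
    n, p = x.shape
    n1, n2 = len(x1), len(x2)
    # maximum-likelihood ratio R between H0 (one Gaussian) and H1 (two Gaussians)
    r = 0.5 * (n * logdet(x) - n1 * logdet(x1) - n2 * logdet(x2))
    # model-complexity penalty P
    pen = 0.5 * (p + 0.5 * p * (p + 1)) * np.log(n)
    return r - lam * pen
```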