Analyzing Lexical Tone, Intonation, and Discourse Prosody in the Fundamental Frequency Signal of Continuous Mandarin Speech

蘇昭宇 (Chao-yu Su), Phonetics Lab, Institute of Linguistics, Academia Sinica morison@gate.sinica.edu.tw http://phslab.ling.sinica.edu.tw/ NeGSST 2008

Outline
• What is fluent speech prosody? Tones and intonation?
• Why the HPG framework (Tseng, 2004-2008)?
• How to decompose F0 contours?
• The nature of the Fujisaki model
• Automatic extraction of the Fujisaki parameters
• Calculating layered contribution by the HPG
• Modeling tones, intonation and additional components
• Investigating prosodic style variation
• Further evidence of the HPG framework
• Domains and units of boundaries (acoustic features) in fluent speech
• Pause or no pause for boundaries?
• Boundary effects and the relative-ness of supra-segmental signals

The HPG framework: Speech Paragraph (context of cross-over & adjacency) (Tseng 2005)
[Figure: prosodic hierarchy tree running from Discourse through PG, BG, PPh (with DM/PF nodes) and PW down to SYL, with adjacency and cross-over relations marked.]
Prosodic units: Syllables (SYL), Prosodic Words (PW), Prosodic Phrases (PPh), Breath Groups (BG), Prosodic phrase Groups (PG) and the corresponding Boundary Breaks B1, B2, B3, B4 and B5, where SYL/B1 < PW/B2 < PPh/B3 < BG/B4 < PG/B5
Output prosody is super-positional and cumulative (Tseng et al, 2004, 2005, 2006)

Prosodic Units and Boundaries in the Framework
[Figure: layered diagram. Prosodic Group (B5); Breath Group (B4); PG-Initial, PG-Medial and PG-Final phrases (B3); sequences of PW (B2).]

Fujisaki Model (Fujisaki 1984)
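
The equations on this slide are not preserved in the transcript. As a reference point, the sketch below follows the formulation of the Fujisaki model as it is commonly stated: ln F0(t) is the sum of a base value ln Fb, phrase components Ap * Gp(t - T0), and accent (tone) components Aa * [Ga(t - T1) - Ga(t - T2)]. The function names and the default constants alpha, beta and gamma are illustrative assumptions, not values taken from the talk.

```python
import math

def phrase_response(t, alpha=2.0):
    """Phrase control response: Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent control response: Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0, else 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    phrase_cmds: list of (T0, Ap) impulse commands.
    accent_cmds: list of (T1, T2, Aa) step commands with onset T1 and offset T2.
    alpha, beta, gamma: assumed illustrative constants.
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)
```

Sampling `fujisaki_f0` over a time grid gives a synthetic contour that can be compared with a measured one; Ap and Aa are the command amplitudes referred to on the later slides.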

Auto-extraction based on Mixdorff's method (2000, 2003)
[Figure: the original F0 contour is passed through a high-pass filter (stop frequency at 0.5 Hz), which separates it into a high-frequency contour (HFC) and a low-frequency contour (LFC).]
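
The split into HFC and LFC can be sketched as follows. This is a minimal illustration, assuming the input is an already interpolated, evenly sampled (log) F0 contour, and using a brick-wall DFT filter as a stand-in for whatever filter design the original extraction used. The function name `split_contour` is hypothetical; only the 0.5 Hz cutoff comes from the slide.

```python
import cmath
import math

def split_contour(f0, sample_rate, cutoff_hz=0.5):
    """Split a contour into (hfc, lfc) with a brick-wall DFT high-pass filter.

    Components below cutoff_hz, including the mean, end up in the LFC;
    everything else forms the HFC. This filter shape is an assumption.
    """
    n = len(f0)
    # Forward DFT, O(n^2): adequate for a short illustrative contour.
    spectrum = [
        sum(f0[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]
    # Zero every bin whose absolute frequency lies below the cutoff.
    for m in range(n):
        freq = min(m, n - m) * sample_rate / n
        if freq < cutoff_hz:
            spectrum[m] = 0
    # The inverse DFT of what remains is the high-frequency contour.
    hfc = [
        (sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n).real
        for k in range(n)
    ]
    # The low-frequency contour is the remainder of the original.
    lfc = [x - h for x, h in zip(f0, hfc)]
    return hfc, lfc
```

In this reading, the HFC carries the fast, syllable-scale movements that the accent (tone) commands are fitted to, while the LFC carries the slow movements that the phrase commands are fitted to.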

The decision of commands
[Figure: optimization of commands between B3 boundaries against the low-frequency contour (LFC) from Mixdorff's filter. Optimization criterion: Min (expression lost in extraction).]

Results of auto-extraction based on Mixdorff's method for Mandarin (Phonetics Lab, Academia Sinica)

Model additional component of F0 contour by HPG
• Tone (SYL) & above (PW): single tone model (e.g. Tone1, Tone3) with PW information, giving the PW model
• Prosodic phrase (PPh) & above (BG & PG): single phrase model with PG information, giving the PG model (multiple phrases)
• Final output of F0 contour by HPG

Examples without PG-effect and with PG-effect
A single expected cell mean does not approximate the LFC well, especially at PG-initial and PG-final positions.

Model additional component of F0 contour by HPG
• Tone (SYL) & above (PW)
• Prosodic phrase (PPh) & above (BG & PG)

Layered Contributions
[Figure: residues are passed upward through the hierarchy, from SYL to PW, PPh and BG.]
Linear regression
• Predict syllable duration from SYL category, starting at the bottom of the HPG
• The error between the prediction and the real value is regarded as the effect of the PW layer rather than as unpredicted variation
• The same prediction is repeated at each HPG layer, from SYL upward to PG
• The final output is the sum of the predictions from all prosodic layers, and the prediction accuracy of each layer is regarded as its layered contribution
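
The procedure above can be sketched as a chain of residual fits. This is a minimal illustration under stated assumptions: each layer is fitted with per-category means (the least-squares solution for a single categorical predictor), and a layer's contribution is measured as the share of total variance it removes. The function name, the input layout and that accuracy measure are hypothetical, not the method's actual definitions.

```python
from collections import defaultdict

def layered_contributions(values, layer_labels):
    """Fit each layer to the residual left by the layers below it.

    values: one observed value per syllable (e.g. duration).
    layer_labels: list of (layer_name, labels) ordered bottom-up, where
        labels holds one category per syllable for that layer.
    Returns (prediction, contributions); contributions maps each layer name
    to the share of total variance it explains (an assumed measure).
    """
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    residual = [v - grand_mean for v in values]
    prediction = [grand_mean] * n
    contributions = {}
    for name, labels in layer_labels:
        # Mean residual per category at this layer.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for r, lab in zip(residual, labels):
            sums[lab] += r
            counts[lab] += 1
        means = {lab: sums[lab] / counts[lab] for lab in sums}
        before = sum(r * r for r in residual)
        # Add this layer's prediction and hand the remaining error upward.
        for i, lab in enumerate(labels):
            prediction[i] += means[lab]
            residual[i] -= means[lab]
        after = sum(r * r for r in residual)
        contributions[name] = 0.0 if total_ss == 0 else (before - after) / total_ss
    return prediction, contributions
```

Because each layer only sees what the lower layers failed to predict, the per-layer shares add up to the overall accuracy of the summed prediction.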

Speech material
• Two types of Mandarin speech corpus, both read speech
  • (1) plain text of 26 discourse pieces, read by one male (M051) and one female (F051) speaker
  • (2) three rhyme formats of Chinese Classics, read by one male (M056) and one female (F054) speaker
• Pre-analysis annotation
  • Segmental identities labeled automatically with the HTK toolkit
  • Subsequent manual tagging with the Sinica COSPRO Toolkit
  • Spot-checking of annotated segments
Table 1 Summary of speech data by corpus type

Aa prediction & Tone Model
[Figure panels: cumulative accuracy of Aa prediction; tone model.]

Boundary effect above PPh for Aa prediction
[Figure: cumulative accuracy of Aa prediction.]

PW model

Ap & Aa Cumulative Accuracy
[Figure panels: Ap cumulative accuracy; final accuracy of total F0 contour prediction.]
The average of the Aa and Ap prediction accuracies was used as the final accuracy of the total F0 contour prediction.

PG Model for Ap & average duration
An example of average Ap by PG-position for two adjacent PGs, by speaker and by speech data type. The horizontal axis represents the PG-position index. The vertical axis represents the average Ap values.
The tempo of the same examples used in Figure 1, plotted by speaker and by speech data type. The horizontal axis represents the PG-position index. The vertical axis represents the mean syllable duration values.

Significance test of PPh features
Comparison of Ap by pairs of PG positions (initial/medial, initial/final and medial/final) and by speaker. The asterisk * denotes statistically significant differences.
Comparison of mean syllable duration by pairs of PG positions (initial/medial, initial/final and medial/final) and by speaker. The asterisk * denotes statistically significant differences.

Classification of Stylistic Variations: regular, semi-regular, irregular

Relationship between prosodic styles & HPG contribution distribution
• Higher-level contribution varies with prosodic style, as shown from R, SMR and IR to WIR
• Various prosodic styles can be explained systematically by the HPG framework
• The more regular the prosodic style, the larger the prosodic domain and the greater the contribution from higher-level information

Layered Contributions of Duration, Intensity and Boundary Pause (Tseng, 2004, 2005)
[Figure: duration, pause and intensity patterns of the PW and PPh layers for speaker F051P.]

Layered Contributions of Duration, Intensity and Boundary Pause (Tseng, 2004, 2005)
[Figure: duration, pause and intensity patterns of the PW and PPh layers for speaker F051P. Panels: Duration Pattern, Intensity Pattern, Pause Pattern; BG layer.]

Comparison of Cumulative Predictions and Speech Data
[Figure: comparison between speech data and predictions for M051.]

Previous and Revised Models -- Duration Patterns at PPh Layer
• The general PPh pattern derived from the revised model is more contrastive than the patterns obtained from the previous model

Previous and Revised Models -- Intensity Patterns at PPh Layer
• The PPh patterns from the revised model decay more drastically towards the boundary and thus match the tendency of intensity attenuation in PPh-final weakening, especially for M051P

Discourse boundary discrimination
• Boundary properties and respective discourse identities
  • Only from pause duration?
  • Pre-boundary syllable lengthening? (in other intonation studies)
  • Relative acoustic features?
• Boundary discrimination
  • Detect topic & discourse organization
• Discourse prosody context
  • Cross-over & adjacency

Experiment 1
Q: Is the syllable domain alone sufficient to discriminate discourse boundary identities?
Goal: Test whether singular or relative acoustic factors at the syllable layer are sufficient to differentiate B3, B4 & B5
Acoustic features:
• Singular acoustic factors: (1) boundary pause (BP), (2) pre-boundary syllable duration (PrDu) and (3) pre-boundary syllable intensity (PrIn)
• Relative acoustic factors: (4) between-boundary syllable duration contrast (DuCon) and (5) between-boundary syllable intensity contrast (InCon)
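
The five features can be sketched as follows. The slide does not define the contrast measures, so this sketch assumes that DuCon and InCon are the pre-boundary value minus the post-boundary value, and that "normalized" means z-scoring over all boundaries. Both are assumptions, and every function name and dictionary key is hypothetical.

```python
from statistics import mean, pstdev

def boundary_features(pre, post, pause):
    """Features for one boundary.

    pre, post: dicts with 'dur' (seconds) and 'int' (dB) for the syllables
        immediately before and after the boundary.
    pause: silent interval at the boundary, in seconds.
    The contrast definitions (pre minus post) are assumptions.
    """
    return {
        "BP": pause,
        "PrDu": pre["dur"],
        "PrIn": pre["int"],
        "DuCon": pre["dur"] - post["dur"],
        "InCon": pre["int"] - post["int"],
    }

def zscore(values):
    """Normalize to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]

def mean_by_boundary(rows, feature):
    """Average the z-scored feature within each boundary type.

    rows: list of (boundary_label, features_dict) pairs, e.g. ('B3', {...}).
    """
    z = zscore([feats[feature] for _, feats in rows])
    groups = {}
    for (label, _), value in zip(rows, z):
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}
```

Comparing the per-boundary means of one feature across B3, B4 and B5 shows whether that feature alone separates the three boundary types.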

Results of Experiment 1 (1/2)
Cross-boundary discrimination by single acoustic features. Each panel shows one acoustic feature. The horizontal axis shows the prosodic boundary indexes. The vertical axis shows the coefficient of normalized values of boundary pause (BP), pre-boundary duration (PrDu) and pre-boundary intensity (PrIn), respectively.

Results of Experiment 1 (2/2)
Cross-boundary discrimination by single contrastive factors. Each panel shows one contrastive feature. The horizontal axis shows the prosodic boundary indexes. The vertical axis shows the coefficient of normalized values of between-boundary duration contrasts (DuCon) and between-boundary intensity contrasts (InCon).

Experiment 2
Q: What scale of boundary context is needed to discriminate discourse boundary identities?
Goal: Account for boundary context in the acoustic signals by discourse specifications
Acoustic features: average duration of prosodic units at different scales
• Syllable
• Prosodic word
• Prosodic phrase

Results of Experiment 2 (1/2): Lengthening Patterns by Discourse Units
Cross-boundary comparison of duration patterns by prosodic unit: the syllable (SYL), the PW and the PPh. The horizontal axis represents indexes of the speech data and speaker. The vertical axis denotes the normalized average duration of prosodic units.

Results of Experiment 2 (2/2): Lengthening Patterns by Discourse Units
Cross-boundary duration patterns by boundary break. Each panel shows the result for one prosodic unit. Each curve denotes one speech data set. The horizontal axis represents the prosodic boundary index. The vertical axis denotes the normalized average duration for the given prosodic unit.

Conclusions of Fluent Narrative Speech
• Prosody and prosody context: more information beyond tones and intonation
  • adjacency and cross-over associations
• Tone and intonation variation can be explained systematically by the HPG framework
  • Tone variation: lower level of the HPG
  • Various prosodic styles: higher level of the HPG
• Prosody context boundaries across fluent speech
  • discourse specified
• Possible application to ASR
  • topic & discourse organization detection
