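Application and Analysis of the Fujisaki Model Mapped onto the Hierarchical Discourse Prosody Framework HPG in Mandarin
Institute of Linguistics, Academia Sinica
Chao-yu Su (蘇昭宇), morison@phslab.ihp.sinica.edu.tw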
Outline
• Hierarchical Framework of Discourse Prosody (HPG)
• Introduction
• The HPG framework
• Prosodic features and templates of Mandarin fluent speech prosody
• Corpus approach and quantitative evidence
• Fujisaki Model (F0 model)
• Auto-extraction
• Phrase components
• Accent components
• Predicting cross-phrase F0 patterns with higher level discourse information using the Fujisaki model
• Experiment & results
• Conclusion
Negsst2007
References
• Tseng, Chiu-yu (2006). "Prosody Analysis", in Advances in Chinese Spoken Language Processing, edited by Chin-Hui Lee, Haizhou Li, Lin-shan Lee, Ren-Hua Wang and Qiang Huo, World Scientific Publishing, Singapore, pp. 57-76.
• Tseng, Chiu-yu, Pin, Shao-huang, Lee, Yeh-lin, Wang, Hsin-min and Chen, Yong-cheng (2005). "Fluent speech prosody: framework and modeling", Speech Communication, Vol. 46, issues 3-4 (July 2005), Special Issue on Quantitative Prosody Modelling for Natural Speech Description and Generation, pp. 284-309.
• Fujisaki, H. and Hirose, K. (1984). "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", J. Acoust. Soc. Jpn. (E), 5(4), pp. 233-242.
• Mixdorff, H. (2000). "A Novel Approach to the Fully Automatic Extraction of Fujisaki Model Parameters", Proceedings of ICASSP 2000, vol. 3, pp. 1281-1284, Istanbul, Turkey.
• Mixdorff, H., Hu, Y. and Chen, G. (2003). "Towards the Automatic Extraction of Fujisaki Model Parameters for Mandarin", Proceedings of Eurospeech 2003, Geneva.
• Gu, Wentao, Hirose, K. and Fujisaki, H. (2006). "Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech", ISCSLP 2006, pp. 31-42.
HPG (Hierarchical Prosodic Phrase Grouping) Framework of Discourse Prosody: Fluent Speech Prosody
Introduction of HPG (1/2)
1. From the bottom up, output fluent speech prosody includes lexical prosody (tone), syntactic prosody (intonation) and discourse prosody (cross-phrase semantic associations).
2. From the top down, the HPG framework represents hierarchical constraints from discourse, syntactic and lexical information. Thus, higher level prosodic units constrain and govern lower level ones; lower level units are subject to, and associated through, higher level units.
3. Phrases in speech flow should NOT be treated as independent, unrelated prosodic units. Rather, intonation units are subordinate prosodic units subject to HPG specifications.
Introduction of HPG (2/2)
4. Output fluent speech prosody results from cumulative layered contributions from lexical, syntactic and discourse information. Therefore, prosody does NOT stop at phrase intonation.
5. According to HPG specifications, variations of phrase intonation across speech flow are systematic and predictable.
HPG (Hierarchical Prosodic Phrase Grouping): Discourse Prosody Hierarchy (units and constraints)
[Figure: a schematic representation of how PGs form spoken discourse]
Speech data annotation
[Figure: prosodic hierarchy with break labels: Prosodic Group (B5), Breath Group (B4), initial, medial and final Prosodic Phrases (B3), Prosodic Words (B2)]
• The speech data were manually labeled by independent transcribers for perceived boundaries and breaks (pauses), using a 5-step break labeling system corresponding to our framework.
Hand-Labeling Perceived Boundaries (Tseng et al., 1999) in Relation to Prosody Organization: Systematic and Predictable
COSPRO http://www.myet.com/corpora
Flow chart of speech data processing and annotation: read speech
(task: output file; notes)
• Designing text for narration: *.text; text for speakers to read. Serial numbers for text and wav files are identical.
• Recording speech in soundproof chambers: *.wav; sampling rate 16000 Hz, 1 channel, 16-bit linear.
• Editing text to match speech files; recorded speech is mapped to the text by hand.
• Converting text to SAMPA: *.SAMPA; mismatches are corrected by hand.
• Segmenting speech files using HTK: *.phn
• Spot-checking by hand: *.adjust; adjustments cover segment boundaries and characters with multiple pronunciations.
• Hand-labeling perceived prosodic boundaries: *.break
• Analyzing labeled speech data: PG model
Cross-Phrase Prosodic Features and Templates
Corpus investigations and quantitative analyses enabled us to
1. obtain quantitative evidence of the cumulative contributions of prosodic layers to output prosody,
2. derive cross-phrase hierarchical templates corresponding to every prosodic layer for the following 4 acoustic correlates (Tseng et al., 2004; 2005; 2006):
• F0 contour templates
• Duration cadence templates
• Intensity distribution patterns
• Pause cadence templates
Quantitative Analysis and Predictions: F0, Duration, Intensity and Breaks
[Figure: prosodic layers SYL, PW, PPh and BG, with residues passed from each layer to the next; the F0 contour is converted to Fujisaki parameters by auto-extraction]
• Hierarchical linear model
• Fujisaki parameters
• Pause
• Duration
• Intensity
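The layer-by-layer treatment of residues can be illustrated with a minimal sketch. It assumes that each layer predicts a value by the mean of its label group and passes what is left over to the next layer; the source does not state the predictors of its hierarchical linear model, so the group means, the function names and the label format here are all hypothetical.

```python
from collections import defaultdict

def fit_layer(values, labels):
    """Predict each value by the mean of its label group (an assumed
    stand-in for the linear model); return predictions and residues."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, lab in zip(values, labels):
        sums[lab] += v
        counts[lab] += 1
    preds = [sums[lab] / counts[lab] for lab in labels]
    residues = [v - p for v, p in zip(values, preds)]
    return preds, residues

def layered_contributions(values, layers):
    """layers: list of (name, labels), ordered from the lowest layer up.
    Returns {name: share of the total sum of squares removed at that layer}.
    Assumes the values are not all identical (total must be non-zero)."""
    mean = sum(values) / len(values)
    residues = [v - mean for v in values]
    total = sum(r * r for r in residues)
    shares = {}
    for name, labels in layers:
        before = sum(r * r for r in residues)
        _, residues = fit_layer(residues, labels)  # residues move up a layer
        shares[name] = (before - sum(r * r for r in residues)) / total
    return shares
```

Each share measures how much of the variation a layer accounts for after the layers below it have been fitted, which is one simple way to read "cumulative contributions".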
The Fujisaki Model
Fujisaki Model (1984): Intonation model
• Unit: a simple sentence as defined by syntax
• The F0 curve corresponding to a single simple phrase, as defined by syntax, can be generated
• The F0 curve, with its gradually declining baseline, can be decomposed into phrase components (Ap) and accent components (Aa)
Evidence obtained for: Japanese, English, German, Mandarin, Thai, Vietnamese, etc.
The Fujisaki Model (1/2)
F0 = Base frequency + Phrase components + Accent components (summed on a logarithmic F0 scale)
The Fujisaki Model (2/2)
[Equation: F0 expressed through the base frequency, phrase components and accent components; the symbols were not preserved]
Quantities defined on the slide:
• timing of the ith phrase command
• onset of the first accent command in the jth command pair
• duration of the first accent command in the jth command pair
• magnitude of the ith phrase command
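For reference, the standard formulation of the model from Fujisaki and Hirose (1984) can be written as below. The symbols are those of the general literature, not necessarily those of the slide: F_b is the base frequency, A_{p_i} and T_{0i} are the magnitude and timing of the ith phrase command, A_{a_j}, T_{1j} and T_{2j} are the magnitude, onset and offset of the jth accent command, and α, β, γ are the constants of the two control mechanisms. The slide describes accent commands in pairs with an onset and a duration, which this standard form does not show.

```latex
\begin{align}
\ln F_0(t) &= \ln F_b
  + \sum_{i=1}^{I} A_{p_i}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{a_j} \bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr] \\
G_p(t) &=
  \begin{cases}
    \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\
    0, & t < 0
  \end{cases} \\
G_a(t) &=
  \begin{cases}
    \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\
    0, & t < 0
  \end{cases}
\end{align}
```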
Phrase components = 0.01~0.05
Accent components = 0.1~0.5
Simulation of Mandarin Prosody with Fujisaki Model
[Figure: simulated result, with accent components (Aa), phrase components (Ap) and Fb]
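How a contour is built from Fb, phrase commands and accent commands can be sketched as follows. This is a minimal illustration of the standard Fujisaki formulation, not the simulation code behind the slide: the function names are hypothetical, and the default constants (alpha = 3.0, beta = 20.0, gamma = 0.9) are values commonly used in the literature, assumed here rather than taken from the source.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t), capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds,
                alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    phrase_cmds: list of (T0, Ap) pairs.
    accent_cmds: list of (T1, T2, Aa) triples (onset, offset, magnitude).
    The components are summed in the log domain, then exponentiated.
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)

# Hypothetical command values, for illustration only.
contour = [fujisaki_f0(i / 100.0, 100.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
           for i in range(200)]
```

Before the first command the contour sits at Fb; each phrase command adds a slowly decaying hump and each accent command adds a local rise between its onset and offset.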
Simulating/Generating F0 Curves with Fujisaki Model: Auto-extraction of Parameters (other approaches vs. our approach)
Mixdorff (2000, 2003): Interpolation and Smoothing (1/3)
• Intermediate F0 values are interpolated for unvoiced speech segments.
• Microprosodic variations are smoothed out.
• Feature: very close simulation, one phrase at a time.
Mixdorff (2000, 2003): High-Pass Filtering and Component Separation (2/3)
• High-pass filter (stop frequency at 0.5 Hz)
• High frequency contour (HFC): the output of the high-pass filter
• Low frequency contour (LFC): the remainder, containing the sum of the phrase components and Fb
Component Separation
• Fb: the overall minimum of the LFC
• Phrase components: the residual after Fb is subtracted from the LFC
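The separation step can be sketched as below. This is an illustration under stated assumptions, not Mixdorff's implementation: a zero-phase one-pole smoother stands in for the filter design, which the slide does not specify; the input is assumed to be an interpolated log-F0 contour sampled at 100 frames per second; and all function names are hypothetical.

```python
import math

def low_pass(x, cutoff_hz, frame_rate_hz):
    """Zero-phase one-pole low-pass: one forward and one backward pass."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / frame_rate_hz)

    def one_pass(seq):
        out, y = [], seq[0]
        for v in seq:
            y += a * (v - y)
            out.append(y)
        return out

    forward = one_pass(x)
    return one_pass(forward[::-1])[::-1]

def separate_components(log_f0, frame_rate_hz=100.0, cutoff_hz=0.5):
    """Split a log-F0 contour into HFC, LFC, Fb and the phrase residual."""
    lfc = low_pass(log_f0, cutoff_hz, frame_rate_hz)  # slow movements
    hfc = [v - l for v, l in zip(log_f0, lfc)]        # what a high-pass keeps
    fb = min(lfc)                                     # overall minimum of the LFC
    phrase = [l - fb for l in lfc]                    # LFC with Fb subtracted
    return hfc, lfc, fb, phrase
```

Computing the HFC as the input minus the LFC guarantees that the two contours add back up to the original, so nothing is lost between the accent-related and phrase-related parts.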
Mixdorff (2000, 2003): Optimizing the Simulated F0 Curve (3/3)
• Hill-climbing methodology
• Construct a sub-optimal solution that meets the constraints of the problem
• Take the solution and make an improvement upon it
• Repeatedly improve the solution until no more improvements are necessary or possible
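The three hill-climbing steps can be sketched generically as follows. This is a coordinate-wise variant chosen for brevity, assumed here rather than taken from the cited papers; `cost` is a hypothetical function that would, in this setting, return the error between the simulated and the measured F0 contour for a given parameter list.

```python
def hill_climb(cost, start, step=0.1, min_step=1e-4):
    """Coordinate-wise hill climbing.

    Starts from a feasible but sub-optimal solution, accepts any
    single-parameter move that lowers the cost, and halves the step
    when no neighbor is better, until the step falls below min_step.
    """
    current = list(start)
    best = cost(current)
    while step >= min_step:
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = list(current)
                candidate[i] += delta
                c = cost(candidate)
                if c < best:  # strict improvement only, so no cycling
                    current, best, improved = candidate, c, True
        if not improved:
            step /= 2.0  # refine the search around the current solution
    return current, best

# Toy cost with a known minimum at (0.4, 1.2); values are illustrative only.
toy_cost = lambda p: (p[0] - 0.4) ** 2 + (p[1] - 1.2) ** 2
solution, error = hill_climb(toy_cost, [0.0, 0.0])
```

Hill climbing only guarantees a local optimum, which is why the quality of the initial solution matters.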
Gu (Gu Wentao 顧文濤, 2006): Generating F0 Curves Using a Speech Sample from COSPRO_05
1. Gu did NOT consider information above the phrase level.
2. Gu compared generation results with HPG-labeled results.
Gu (2006): Simulation of F0 Curves without Higher Level and Boundary Information
Features:
• Local minima of the LFC are taken as the positions where phrase commands (Ap) are inserted
• F0 curves and boundaries are generated
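Locating the local minima of the LFC can be sketched as below. The function name and the tie-breaking rule for flat stretches are assumptions made for this illustration; the source does not describe how minima are detected.

```python
def local_minima(lfc):
    """Indices where the LFC is lower than the previous frame and not
    higher than the next one (first frame of a flat valley counts)."""
    return [i for i in range(1, len(lfc) - 1)
            if lfc[i] < lfc[i - 1] and lfc[i] <= lfc[i + 1]]

def candidate_command_times(lfc, frame_rate_hz=100.0):
    """Convert minima indices to times in seconds (frame rate is assumed)."""
    return [i / frame_rate_hz for i in local_minima(lfc)]
```

Each returned time is a candidate position for a phrase command; the command magnitudes would still have to be estimated separately.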
Gu (2006) observed that large variations of Ap exist
1. between two speakers,
2. among boundaries.
We observed:
1. The magnitudes of Ap inserted at larger boundaries (B4, B5) are similar.
2. Similar patterns exist in BGs or PGs.
Why Higher Level Discourse Information? (1/2)
Gu (2006)'s traditional approach without higher level information
Focus:
1. Isolated phrase intonations and boundaries are generated one at a time.
2. Simulation and fine tuning of each generation.
Problems:
1. Large variations of Ap exist between speakers and among boundaries.
2. Variations cannot be predicted or resolved; concatenating the individual generations cannot yield patterns for technological implementation.
Why Higher Level Discourse Information? (2/2)
Tseng et al.'s approach with higher level discourse information (HPG)
Focus:
• Prediction of fluent speech prosody, i.e., cross-phrase F0 curves and boundary breaks
Advantages:
1. Multiple phrase intonations and boundaries can be predicted according to HPG specifications.
2. Output prosody is NOT a concatenation of independent, isolated phrase intonations.
3. Between-speaker and among-boundary Ap variations are systematic and predictable, and are therefore NOT treated as variations by the HPG framework.
4. Useful for technology development (speech synthesis).
2 Experiments
Hypothesis
• Predictions of phrase intonation curves can be improved with higher level information because HPG specifies cross-phrase associations.
• Cumulative contributions from prosodic layers can provide useful information.
Implications
• Technology development
Speech Data
• Sinica COSPRO 08
• Carrier paragraph:
• A 30-syllable, 3-phrase complex sentence representing a short PG was constructed.
• A single target syllable was embedded in three PG positions, i.e., PG-initial, PG-medial and PG-final.
• "△是一個常見的字,一般人常把△字掛在嘴邊,講話時動不動就會提到△" ("△ is a common character; people often have the character △ on their lips and bring up △ at every turn when they talk")
• Speaking rates: 289 and 308 ms/syllable for M054C and F054C
• Target syllable analyzed: Tone 1
Experiment 1
Goals:
1. Patterns of Ap could be derived from speech data.
2. Evidence of interaction between phrase commands and higher-level prosodic units could be found.
3. The evidence found could predict cross-phrase F0 allocation in speech flow.
Distribution of speech data
The ranges of Ap values from phrases produced by female speaker F054C in the three PG-related positions are presented.
Distribution of speech data
[Figure: a schematic representation of the distribution of Ap for F054C; the horizontal axis represents values of Ap and the vertical axis represents the number of Ap occurrences]
Results
The expected cell means of predictions with and without the PG effect.
[Figure: a schematic representation of the phrase patterns after the PG effect is taken into consideration]
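The contrast between predicting with and without the PG effect can be sketched as a comparison of cell means. In this minimal illustration, "without the PG effect" is taken to mean one grand mean of Ap for all phrases, and "with the PG effect" one mean per PG position; this reading, the function names and the sample values are assumptions, and the numbers are made up for illustration rather than taken from the corpus.

```python
from collections import defaultdict

def cell_means(samples):
    """samples: list of (pg_position, ap).
    Returns (grand_mean, {position: mean Ap for that position})."""
    groups = defaultdict(list)
    for pos, ap in samples:
        groups[pos].append(ap)
    grand = sum(ap for _, ap in samples) / len(samples)
    return grand, {pos: sum(v) / len(v) for pos, v in groups.items()}

def sse(samples, predict):
    """Sum of squared errors of a position-to-Ap predictor."""
    return sum((ap - predict(pos)) ** 2 for pos, ap in samples)

# Made-up Ap values, for illustration only.
samples = [("initial", 0.6), ("initial", 0.8),
           ("medial", 0.3), ("medial", 0.5),
           ("final", 0.1), ("final", 0.1)]
grand, by_pos = cell_means(samples)
error_without_pg = sse(samples, lambda pos: grand)
error_with_pg = sse(samples, lambda pos: by_pos[pos])
```

When Ap differs systematically by position, the per-position means leave a smaller error than the single grand mean, which is the pattern the slide reports for PG-initial and PG-final phrases.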
Examples
[Figure: two panels, without the PG effect and with the PG effect]
A single expected cell mean cannot approximate the LFC well, especially in the PG-initial and PG-final positions.
Superimposed F0 according to the HPG Framework
[Figure: F0 over time t at three layers: Syl, PPh and PG]
What Does Higher Level Discourse Information Mean?
Swapping PG-initial and PG-final
[Figure: F0 over time t for the original and the exchanged versions]
Further Evidence for HPG, Systematic and Predictable: Same Base Form and Different Distribution Yield Different Output Prosody Styles
Speech Data
• Mandarin rhymed classical writing; styles: regular, semi-regular, irregular
• Weather broadcast; style: irregular
Classification of Stylistic Variations
[Figure: three classes: regular, semi-regular, irregular]
Predictions of Ap from Higher Level Information (B3, B4, B5)
[Figure: panels for PPh, BG and PG]
Distributions of Layered Contributions in Each Style (Male)
[Figure: rhymed classical writing, m056 (regular, semi-regular, irregular); weather broadcast, m054 (irregular)]
The more regular the style, the bigger the planning templates and the stronger the governing by higher level information.
Distributions of Layered Contributions in Each Style (Female)
[Figure: rhymed classical writing, f054 (regular, semi-regular, irregular); weather broadcast, f054 (irregular)]
The more regular the style, the bigger the planning templates and the stronger the governing by higher level information.
PPh Contributions in Different Styles
Conclusions
1. Lexical, syntactic and discourse prosody ALL contribute to output prosody. Interactions are necessary, systematic and predictable from higher level considerations. HPG accounts for the prosody of fluent continuous speech.
2. How a semantically complete speech paragraph begins, holds and ends across the phrases within it is specified by HPG-related positions: PG-initial, PG-medial and PG-final.
3. Further evidence from Mandarin rhymed classics substantiates HPG as a base form for both the planning and the processing of fluent speech prosody.
4. Stylistic variations are built on the same base form with varied distributions of contributions.