Modelling Prosody for Speech Synthesis : example from Polish

Dominika Oliver • IGK Colloquium • 22 July 2004

Outline • Goal: prosodic modelling for TTS • Review of past studies: intonational investigations • Current state: latest modelling results

TTS Cycle • Text Input (raw or annotated) • Text Processing: Text Normalisation (names, abbreviations, numbers); Linguistic Analysis (morphology, syntax, semantics) • Phonetic Analysis: Grapheme-to-Phoneme Conversion (rules, dictionary) • Prosodic Analysis: Pitch, Phrasing & Duration Modelling • Speech Synthesis: Voice Rendering

TTS Cycle • Prosodic analysis/modelling • Prosodic components (focus, stress, duration etc.) • Prosodic phrasing • Intonation: accent types, pitch contour

Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements

Procedure • Prosodic modelling shopping list: • Language-specific intonation description • Methods for accent type and placement prediction and for F0 generation • Research and evaluation tool (Festival)

Language-specific intonation description • Quantitative analysis of Polish intonation (accent types) • Standard description of Polish intonation (Jassem, 1961, 1984; Demenko, 1999) • Falling: HL, HM, ML, xL • Rising: LM, MH, LH • Level: MM • Rise-fall: LHL • Broad/narrow focus and peak alignment study (Andreeva and Oliver, 2003)
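
The accent inventory above can be written down as a small lookup table. This is only an illustrative sketch: the labels and their grouping are taken from the slide, while the names `ACCENT_CLASSES` and `accent_class` are hypothetical and not part of any PoInt or Festival tooling.

```python
# Contour classes and their tone-sequence labels, as listed on the slide.
# The dictionary layout and function name are illustrative assumptions.
ACCENT_CLASSES = {
    "falling": {"HL", "HM", "ML", "xL"},
    "rising": {"LM", "MH", "LH"},
    "level": {"MM"},
    "rise-fall": {"LHL"},
}


def accent_class(label):
    """Return the contour class of a tone-sequence label, or None if unlisted."""
    for name, labels in ACCENT_CLASSES.items():
        if label in labels:
            return name
    return None
```

For example, `accent_class("MH")` yields `"rising"`, and a label outside the inventory yields `None`.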

Accent types • Falling

Accent types • Rising

Resources • Speech corpora: PoInt (Polish Intonation Database) (Karpiński, 2001) • 350 MB, multi-speaker (~40) • read and (semi-)spontaneous speech • Transcribed • Syllable-based IPA segmental transcription • Syllable-based prosodic annotation

Resources • PoInt prosodic transcription • Tone heights: xH, H, M, L, xL • Phrase boundary indication

Resources • Falling

Resources • Rising

Resources • Festival TTS (Black & Taylor, 1998) • a general multi-lingual speech synthesis system • offers a full text to speech system • environment for development and research of speech synthesis techniques

Modelling techniques • Default prosodic assignment from simple text analysis • Hand-built rule-based systems: hard to modify and adapt to new domains • Corpus-based approaches (Sproat et al., 1992) • Train prosodic variation on large labelled corpora using machine learning techniques

Modelling techniques – accent type/placement prediction • Classification and regression trees (CART) (Breiman, Friedman, Olshen & Stone, 1984, 1993) • Widely used in speech synthesis to model • segment durations (e.g. Riley, 1992) • accent prediction (Syrdal, Hirschberg, McGory & Beckman, 2001) • pitch contour generation (Dusterhoff, 1997; Dusterhoff, Black & Taylor, 1999)
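
The core step of growing a CART is choosing, at each node, the binary split that gives the purest child nodes. The sketch below shows that single step using Gini impurity on numeric features. It is a minimal illustration under stated assumptions, not the tree builder used in this work: the function names and the toy features (stress flag, syllable position) are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) for the best split.

    A split sends rows with row[feature] <= threshold to the left child
    and all other rows to the right child. Returns None if no split
    separates the data into two non-empty parts.
    """
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Toy data (hypothetical): (stress flag, syllable position) -> accent label
rows = [(1, 0), (1, 1), (0, 2), (0, 3)]
labels = ["H", "H", "none", "none"]
split = best_split(rows, labels)
```

A full tree repeats this search recursively on each child node until a stopping criterion is met.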

Modelling techniques - F0 prediction • Linear regression (Black & Hunt, 1996), used e.g. for F0 contour prediction/generation • find the appropriate F0 target per syllable based on available features trained from data • the predicted variable ($p$) can be modelled as a sum of a set of weighted real-valued factors: $p = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \dots + w_n f_n$ • factors ($f_i$): parameterised properties of the data • weights ($w_i$): usually trained using a stepwise least squares technique
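
Once the weights are trained, predicting an F0 target is just the weighted sum described above. The sketch below shows only that prediction step; the function name, the three features, and all weight values are hypothetical and chosen purely for illustration, not taken from the trained model.

```python
def predict_f0(intercept, weights, features):
    """Return intercept + sum of weight * feature over all factors."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return intercept + sum(w * f for w, f in zip(weights, features))


# Hypothetical factors: accented (0/1), stressed (0/1),
# relative position of the syllable in the phrase (0.0 to 1.0).
weights = [25.0, 10.0, -8.0]      # illustrative values, not trained weights
features = [1.0, 1.0, 0.5]
target_hz = predict_f0(110.0, weights, features)  # 110 + 25 + 10 - 4
```

In a setup with several F0 targets per syllable, one such weight set would be trained per target point.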

Prerequisite • F0 normalisation (Ladd, 1995; Clark, 2003) • (PoInt: 40 speakers, mixed sex) • each utterance is normalised using μ and σ, where μ is the F0 mean and σ is the F0 standard deviation of the utterance • the rescaling uses the standard deviation and mean F0 of the database
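
The normalisation formula itself is not reproduced in this text. A common way to realise the description (utterance mean and standard deviation for normalising, database mean and standard deviation for rescaling) is a z-score transform, sketched below. The z-score form, the function name, and the use of the population standard deviation are assumptions.

```python
import statistics


def normalise_f0(utt_f0, db_mean, db_sd):
    """Z-score the F0 values of one utterance, then rescale to the database.

    Assumed form: (f - mu_utt) / sigma_utt * sigma_db + mu_db.
    """
    mu = statistics.mean(utt_f0)
    sigma = statistics.pstdev(utt_f0)
    if sigma == 0:
        # Flat contour: every value maps to the database mean.
        return [float(db_mean)] * len(utt_f0)
    return [(f - mu) / sigma * db_sd + db_mean for f in utt_f0]


# Illustrative values in Hz; the database statistics are made up.
rescaled = normalise_f0([100.0, 120.0, 140.0], db_mean=200.0, db_sd=30.0)
```

After the transform, every utterance has the database mean and standard deviation, which removes gross speaker and sex differences in pitch level and range.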

Modelling • Steps • Building the utterance structure of the database speech files • Incorporating database intonation labelling • Extracting features for accent prediction and F0 generation • Building the CART model (PoInt intonation labels) • Building the LR model (3 points per syllable) • Incorporating model parameters into the voice description

Modelling - accent type/placement prediction • Model based on PoInt • multiple speakers (male, female) • Accent inventory (L, H, M) • Accent prediction method: CART • Features (31) • POS window • Position of candidate syllable in word and sentence • Stress information window etc.
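
The kinds of features listed above (positions and context windows around the candidate syllable) can be extracted along the following lines. This is a sketch only: the syllable record layout, the key names, the padding values, and the choice of a syllable-level POS window are all assumptions, and it covers just a handful of the 31 features.

```python
def syllable_features(syllables, i, window=1):
    """Build a feature dict for the candidate syllable at index i.

    Each syllable is a dict with hypothetical keys:
      'pos'      POS tag of the word the syllable belongs to
      'stress'   1 if lexically stressed, else 0
      'in_word'  0-based index of the syllable within its word
      'word_len' number of syllables in that word
    """
    syl = syllables[i]
    feats = {
        "pos_in_word": syl["in_word"],
        "from_word_end": syl["word_len"] - 1 - syl["in_word"],
        "pos_in_sentence": i,
        "from_sentence_end": len(syllables) - 1 - i,
    }
    # Context windows; positions outside the sentence are padded.
    for off in range(-window, window + 1):
        j = i + off
        inside = 0 <= j < len(syllables)
        feats[f"stress[{off:+d}]"] = syllables[j]["stress"] if inside else 0
        feats[f"pos[{off:+d}]"] = syllables[j]["pos"] if inside else "none"
    return feats


# "dobry dzień": do-bry (adjective, penultimate stress) + dzień (noun)
sentence = [
    {"pos": "adj", "stress": 1, "in_word": 0, "word_len": 2},
    {"pos": "adj", "stress": 0, "in_word": 1, "word_len": 2},
    {"pos": "noun", "stress": 1, "in_word": 0, "word_len": 1},
]
features = syllable_features(sentence, 2)
```

Each feature dict then becomes one training row for the tree builder, paired with the accent label of that syllable.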

Results – accent prediction • Training set: 897 of 963 correct (93.146%) • Test set: 996 of 1070 correct (93.084%)

Modelling - F0 prediction/generation • F0 generation: linear regression • Features • accent type • POS window • Position of candidate syllable in word and sentence • Stress information window etc.

Results – F0 shape prediction

Potential problems • Data • not enough tokens to learn from • annotation inconsistencies (noisy data, messy accent class assignment) • Inappropriate technique / suboptimal feature set

Potential data problems

PoInt Analysis • Peak alignment

Addressing data issues • F0 tracking errors • Identifying outliers / annotation inconsistencies • Re-classifying accent types

When everything else fails – blame it on the data • Labelling errors • Unmarked disfluencies/wrong reading • Phonemic labelling • Missing phrasing • No indication of sentence mode in annotation • Inconsistent labelling • Misleading transcription description • No independent labellers

Data fixes • Automatically identifying outliers / annotation inconsistencies • Statistical analysis of acoustic parameters • Manual data inspection • Insertion of phrase boundaries • Marking of disfluencies • Aligning speech with text • Deriving a Gold Standard (hard)
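
One simple way to identify outlier candidates automatically from an acoustic parameter is to flag values that lie far from the mean in standard-deviation units. The sketch below assumes this z-score criterion and a threshold of 2.0; both are illustrative choices, not the procedure actually used on PoInt.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` SDs from the mean."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sd > threshold]


# Made-up F0 peaks in Hz; the last one looks like a tracking error.
peaks = [100, 102, 98, 101, 99, 300]
suspects = flag_outliers(peaks)
```

Flagged tokens are only candidates: they still need manual inspection to decide whether they are F0 tracking errors, labelling errors, or genuine extremes.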

Accent classification studies • Hierarchical clustering (Klabbers & van Santen, 2004) • Linear regression (Keller & Zellner Keller, 2003) • EM, bagging & boosting (Sun, 2002) • HMMs • (Kumpf & King, 2004) • (Blackburn, Vonwiller & King, 1993) • (Batliner et al., 1999, 2001) • (Maragoudakis, 2003; Zervas, 2004) • (Chan, Feng, Heinen & Niederjohn, 1994)

Accent type re-classification • Two-stage procedure • Self-organising maps (Kohonen, 1982, 1995; Kaski, 1997; Vesanto & Alhoniemi, 2000) • create a set of prototype vectors representative of the data • project the prototypes onto a low-dimensional space • Hierarchical agglomerative clustering (HAC) • finds good candidate clusters of map units: cut the dendrogram where there is a large distance between two clusters
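
The second stage can be sketched as follows: prototype vectors are merged bottom-up, and the dendrogram is cut just before the merge with the largest jump in distance. This is a minimal illustration, not the implementation used here. The prototypes are assumed to come from an already trained map, and average linkage and Euclidean distance are assumptions, since the slide names neither.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclid(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac_cut_at_largest_gap(points):
    """Cluster point indices and cut where the merge distance jumps most."""
    clusters = [[i] for i in range(len(points))]
    partitions = [[list(c) for c in clusters]]  # partition after k merges
    merge_dists = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        merge_dists.append(d)
        partitions.append([list(c) for c in clusters])
    if len(merge_dists) < 2:
        return partitions[0]
    gaps = [merge_dists[k + 1] - merge_dists[k]
            for k in range(len(merge_dists) - 1)]
    keep = gaps.index(max(gaps)) + 1  # number of merges kept before the cut
    return partitions[keep]


# Four made-up 2-D prototypes forming two well-separated groups.
prototypes = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
groups = hac_cut_at_largest_gap(prototypes)
```

Clustering the prototypes rather than the raw tokens keeps the pairwise distance computation small, which is the usual motivation for the two-stage design.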

Acoustic data parameterisation • Accent type classification: • (Demenko, 1999) • Difference between start F0 (first vowel) and F0 extreme value (on a vowel or consonant) • Difference between F0 extreme value and end point F0 • Difference between F0 max and F0 min • Difference between utterance mean F0 and mean F0 for all utterances by the same voice • Difference between utterance min F0 and global mean min F0 for the same voice
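
The five difference parameters above can be computed from an accent's F0 contour plus a few utterance and voice statistics. The sketch below is a loose reading of the list: the function and key names are hypothetical, the first contour value is assumed to be the F0 at the first vowel, the "extreme value" is assumed to be the point deviating most from the start F0, and the sign conventions are assumptions.

```python
def accent_parameters(contour, utt_mean, voice_mean, utt_min, voice_mean_min):
    """Return the five difference parameters for one accent (values in Hz).

    contour         F0 values over the accent, first vowel first
    utt_mean        mean F0 of the utterance
    voice_mean      mean F0 over all utterances by the same voice
    utt_min         minimum F0 of the utterance
    voice_mean_min  mean of utterance F0 minima for the same voice
    """
    start, end = contour[0], contour[-1]
    # Assumption: the extreme is the point farthest from the start value.
    extreme = max(contour, key=lambda f: abs(f - start))
    return {
        "start_to_extreme": extreme - start,
        "extreme_to_end": end - extreme,
        "f0_range": max(contour) - min(contour),
        "utt_mean_offset": utt_mean - voice_mean,
        "utt_min_offset": utt_min - voice_mean_min,
    }


# Made-up rise-fall contour and statistics.
params = accent_parameters([120.0, 150.0, 180.0, 140.0, 110.0],
                           utt_mean=140.0, voice_mean=130.0,
                           utt_min=110.0, voice_mean_min=100.0)
```

Vectors of this kind are the input that a clustering method would group into accent classes.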

Accent type re-classification • Clusters description

Accent type re-classification • Clusters characteristics

Accent type re-classification

New results – Accent placement prediction • train data • test data

New results – Accent type prediction • train data • test data

Evaluation • Self-organising maps: a potential method for categorisation • the results are relatively successful and consistent • data pre-processing is the most critical phase • the automatic training phase requires solid and consistent (manual) preparation

Need for better data • Based on problems encountered • Further analysis of clusters • A large amount of data from a single speaker (primary need) • A large amount of prosodic variation • A balanced set of pitch events • Clear speech which can be easily tracked • Complex prosodic structure

Suggested improvements • Model modification • More data, e.g. from the Peak Alignment study • Separate models for different sentence types (yes/no questions vs. statements) • Re-estimation of parameters based on new intonationally rich data

Next • Closer inspection of automatically assigned accent classes (clusters) • Evaluation: perception experiments

The End
