Modelling Prosody for Speech Synthesis: example from Polish Dominika Oliver IGK Colloquium 22 July 2004
Outline • Goal • prosodic modelling for TTS • Review of past studies • intonational investigations • Current state • latest modelling results
TTS Cycle • Text Input (raw or annotated) • Text Processing: Text Normalisation (names, abbreviations, numbers), Linguistic Analysis (morphology, syntax, semantics) • Phonetic Analysis: Grapheme-to-Phoneme Conversion (rules, dictionary) • Prosodic Analysis: Pitch, Phrasing & Duration Modelling • Speech Synthesis: Voice Rendering
TTS Cycle • Prosodic analysis/modelling • Prosodic components (focus, stress, duration etc.) • Prosodic phrasing • Intonation: accent types, pitch contour
Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements
Procedure • Prosodic modelling shopping list: • Language specific intonation description • Accent type and placement prediction & F0 generation methods • Research and evaluation tool (Festival)
Language specific intonation description • Quantitative analysis of Polish intonation (accent types) • Standard description of Polish intonation (Jassem, 1961, 1984, Demenko, 1999) • Falling: HL, HM, ML, xL • Rising: LM, MH, LH • Level: MM • Rise-fall: LHL • Broad-Narrow Focus/Peak alignment study (Andreeva and Oliver, 2003)
Accent types • Falling
Accent types • Rising
Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements
Resources • Speech corpora: PoInt (Polish Intonation Database) (Karpiński, 2001) • 350MB, multi-speaker (~40) • read, (semi)-spontaneous • Transcribed • Syllable based IPA segmental • Syllable based prosodic annotation
Resources • PoInt Prosodic transcription • Tone heights : xH, H, M, L, xL • Phrase boundary indication
Resources • Falling
Resources • Rising
Resources • Festival TTS (Black & Taylor, 1998) • a general multi-lingual speech synthesis system • offers a full text to speech system • environment for development and research of speech synthesis techniques
Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements
Modelling techniques • Default prosodic assignment from simple text analysis • Hand-built rule-based system: hard to modify and adapt to new domains • Corpus-based approaches (Sproat et al ’92) • Train prosodic variation on large labeled corpora using machine learning techniques
Modelling techniques – accent type/placement prediction • Classification and regression trees (CART) (Breiman, Friedman, Olshen & Stone 1984, 1993) • In speech synthesis widely used to model • segment durations (e.g. Riley 1992) • accent prediction (Syrdal, Hirschberg, McGory, Beckman 2001) • pitch contour generation (Dusterhoff 1997, Dusterhoff, Black, Taylor 1999)
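To illustrate how a CART-style classifier picks splits, here is a minimal sketch that grows a tree by greedily minimising weighted Gini impurity. It is not the Festival tooling used in the talk; the function names (`gini`, `best_split`, `grow`, `classify`), the two toy features (stress flag, part of speech) and the label `"none"` for unaccented syllables are all assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, value) of the equality test that lowers
    weighted Gini impurity the most, or None if no test improves it."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for v in sorted(set(r[f] for r in rows)):
            left = [l for r, l in zip(rows, labels) if r[f] == v]
            right = [l for r, l in zip(rows, labels) if r[f] != v]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, v), score
    return best

def grow(rows, labels):
    """Recursively grow a tree; leaves are the majority label (a string)."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, v = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
    no = [(r, l) for r, l in zip(rows, labels) if r[f] != v]
    return (f, v, grow(*zip(*yes)), grow(*zip(*no)))

def classify(tree, row):
    """Follow the equality tests down to a leaf label."""
    while isinstance(tree, tuple):
        f, v, yes, no = tree
        tree = yes if row[f] == v else no
    return tree

# Hypothetical toy data: (lexically stressed?, part of speech) -> accent label.
rows = [(1, "NOUN"), (1, "VERB"), (0, "NOUN"), (0, "VERB"), (1, "ADJ"), (0, "FUNC")]
labels = ["H", "L", "none", "none", "H", "none"]
tree = grow(rows, labels)
```

A production CART would also handle numeric thresholds, stopping criteria and pruning, all of which this sketch omits.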
Modelling techniques - F0 prediction • Linear regression (Black & Hunt, 1996) used e.g. for F0 contour prediction/generation • find the appropriate F0 target per syllable based on available features trained from data • predicted variable ($p$) can be modelled as a sum of a set of weighted real-valued factors: $p = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \dots + w_n f_n$ • factors ($f_i$): parameterised properties of the data • weights ($w_i$): usually trained using a stepwise least squares technique
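The weighted-sum model above can be sketched in a few lines. This example fits the weights by ordinary least squares through the normal equations; it does not implement the stepwise feature selection mentioned on the slide. The feature set (accented?, stressed?, relative position in phrase), the F0 targets in Hz and the function names are hypothetical.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares: solve (X^T X) w = X^T y for w."""
    # Prepend a constant 1.0 so that w[0] plays the role of the intercept w0.
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(weights, features):
    """p = w0 + w1*f1 + ... + wn*fn"""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

# Hypothetical syllable features: (accented?, stressed?, relative position).
rows = [(1, 1, 0.0), (0, 1, 0.5), (0, 0, 1.0),
        (1, 0, 0.25), (0, 1, 0.75), (1, 1, 0.5)]
# Toy F0 targets (Hz) generated from 120 + 30*accented + 10*stressed - 20*position.
targets = [160.0, 120.0, 100.0, 145.0, 115.0, 150.0]

weights = fit_linear_regression(rows, targets)
```

Because the toy targets were generated from known weights, the fit recovers them, which makes the sketch easy to check; real F0 data would of course leave a residual.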
Prerequisite • F0 normalisation (Ladd, 1995, Clark, 2003) • (PoInt: 40 speakers, mixed sex) • normalisation uses the F0 mean and the F0 standard deviation of the utterance • the rescaling uses the standard deviation and mean F0 of the database
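One plausible reading of this normalisation is a z-score per utterance followed by rescaling to database-wide statistics. The formulas themselves did not survive on this slide, so the z-score form below, the use of the population standard deviation, and the function name `normalise_f0` are assumptions, not the author's confirmed method.

```python
from statistics import mean, pstdev

def normalise_f0(utt_f0, db_mean, db_sd):
    """Z-score each F0 value within its utterance, then rescale to the
    database mean and standard deviation (assumed form, see lead-in)."""
    mu = mean(utt_f0)        # utterance F0 mean
    sigma = pstdev(utt_f0)   # utterance F0 standard deviation
    if sigma == 0:
        raise ValueError("flat F0 contour cannot be z-scored")
    return [(f - mu) / sigma * db_sd + db_mean for f in utt_f0]

# Hypothetical voiced-frame F0 values (Hz) for one utterance,
# rescaled to a hypothetical database mean of 170 Hz and sd of 30 Hz.
normalised = normalise_f0([100.0, 120.0, 140.0], db_mean=170.0, db_sd=30.0)
```

After this step every utterance shares the same mean and spread, which is what allows male and female speakers to be pooled in one model.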
Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements
Modelling • Steps • Building the utterance structure of the database speech files • Incorporating database intonation labelling • Extracting features for accent prediction and f0 generation • Building CART model • PoInt intonation labels • Building LR model • 3 points per syllable • Incorporating model parameters into voice description
Modelling - accent type/placement prediction • Model based on PoInt • multiple speakers (male, female) • Accent inventory (L, H, M) • Accent prediction method: CART • Features (31) • POS window • Position of candidate syllable in word and sentence • Stress information window etc.
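The kinds of features listed above can be sketched as follows. This is only an illustration of windowed features; the input layout (a list of `(POS tag, per-syllable stress flags)` pairs), the window size of one, the padding values and all feature names are assumptions, and the sketch covers a handful of features rather than the 31 used in the model.

```python
def window(seq, i, size, pad):
    """Items from i-size to i+size, padded where the window leaves the sequence."""
    return [seq[j] if 0 <= j < len(seq) else pad
            for j in range(i - size, i + size + 1)]

def syllable_features(words, pad_pos="NONE", pad_stress=-1):
    """Build one feature dict per syllable of a sentence.

    words: list of (pos_tag, [stress flag per syllable]) pairs (hypothetical layout).
    """
    pos_tags = [pos for pos, _ in words]
    stresses = [s for _, sylls in words for s in sylls]
    feats = []
    k = 0  # index of the syllable within the sentence
    for w, (_, sylls) in enumerate(words):
        for s in range(len(sylls)):
            prev_pos, cur_pos, next_pos = window(pos_tags, w, 1, pad_pos)
            prev_st, cur_st, next_st = window(stresses, k, 1, pad_stress)
            feats.append({
                "pos": cur_pos, "prev_pos": prev_pos, "next_pos": next_pos,
                "stress": cur_st, "prev_stress": prev_st, "next_stress": next_st,
                "syl_in_word": s,
                "syls_to_word_end": len(sylls) - 1 - s,
                "syl_in_sentence": k,
                "syls_to_sentence_end": len(stresses) - 1 - k,
            })
            k += 1
    return feats

# Hypothetical two-word sentence: a 2-syllable noun and a 3-syllable verb.
features = syllable_features([("NOUN", [1, 0]), ("VERB", [0, 1, 0])])
```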
Results – accent prediction • train set: 897 of 963 correct (93.146%) • test set: 996 of 1070 correct (93.084%)
Modelling - F0 prediction/generation • F0 generation: linear regression • Features • accent type • POS window • Position of candidate syllable in word and sentence • Stress information window etc.
Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements
Potential problems • Data • not enough tokens to learn from • Annotation inconsistencies (noisy data, messy accent class assignment) • Inappropriate technique / suboptimal feature set
PoInt Analysis • Peak alignment
Addressing data issues • F0 tracking errors • Identifying outliers / annotation inconsistencies • Re-classifying accent types
When everything else fails – blame it on the data • Labelling errors • Unmarked disfluencies/wrong reading • Phonemic labelling • Missing phrasing • No indication of sentence mode in annotation • Inconsistent labelling • Misleading transcription description • No independent labellers
Data fixes • Automatically identifying outliers / annotation inconsistencies • Statistical analysis of acoustic parameters • Manual data inspection • Insertion of phrase boundaries • Marking of disfluencies • Aligning speech with text • Deriving Gold Standard (hard)
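The slide does not say which statistic was used to identify outliers. As one possible approach, the sketch below flags values whose modified z-score (based on the median and the median absolute deviation) exceeds a threshold. The method, the threshold of 3.5 and the function name `flag_outliers` are assumptions chosen for illustration.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, which is less
    distorted by the outliers themselves than a mean/sd z-score.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure deviations against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical per-syllable F0 peaks (Hz); 420 looks like an octave-jump tracking error.
peaks = [180.0, 175.0, 190.0, 185.0, 178.0, 420.0, 182.0]
suspects = flag_outliers(peaks)
```

Flagged items would still need the manual inspection listed on the slide, since a genuine emphatic peak can look like a tracking error.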
Accent classification studies • Hierarchical clustering (Klabbers & van Santen 2004) • Linear regression (Keller & Zellner Keller, 2003) • EM bagging & boosting (Sun, 2002) • HMMs • (Kumpf, King 2004) • (Blackburn, Vonwiller, and King, 1993) • (Batliner et al 1999, 2001) • (Maragoudakis 2003, Zervas 2004) • (Chan, Feng, Heinen, and Niederjohn 1994)
Accent type re-classification • Two-stage procedure • Self-organising maps (Kohonen 1982, 1995) (Kaski, 1997) (Vesanto & Alhoniemi, 2000) • create a set of prototype vectors representative of the data • project the prototypes onto a low-dimensional space • Hierarchical agglomerative clustering (HAC) • yields good candidate clusters of map units: cut the dendrogram where there is a large distance between two clusters
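The second stage can be sketched as follows, assuming the prototype vectors from the self-organising map are already available (the map training itself is not shown). The sketch merges clusters bottom-up and cuts the dendrogram at the widest gap between successive merge distances. Average linkage, Euclidean distance, the 2-D toy prototypes and the function names are assumptions; the talk does not specify them.

```python
import math

def average_linkage(points, a, b):
    """Mean Euclidean distance between all members of clusters a and b."""
    return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def cluster_prototypes(points):
    """Agglomerative clustering, cut where merge distance jumps the most."""
    clusters = [[i] for i in range(len(points))]
    history = []  # (merge distance, partition just before that merge)
    while len(clusters) > 1:
        d, x, y = min(
            (average_linkage(points, clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        history.append((d, [list(c) for c in clusters]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    if len(history) < 2:
        return [list(range(len(points)))]  # too few merges to find a gap
    gaps = [history[k + 1][0] - history[k][0] for k in range(len(history) - 1)]
    widest = max(range(len(gaps)), key=lambda k: gaps[k])
    # Keep every merge up to the gap and reject the first merge after it.
    return history[widest + 1][1]

# Hypothetical 2-D prototype vectors forming two well-separated groups.
prototypes = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = cluster_prototypes(prototypes)
```

Clustering a few dozen prototypes rather than every data point is what keeps this quadratic-per-merge procedure cheap in practice.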
Acoustic data parameterisation • Accent type classification: • (Demenko, 1999) • Difference between start F0 (first vowel) and F0 extreme value (on a vowel or consonant) • Difference between F0 extreme value and end point F0 • Difference between F0 max and F0 min • Difference between utterance mean F0 and mean F0 for all utterances by the same voice • Difference between utterance min F0 and global mean min F0 for the same voice
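The five difference parameters listed above can be computed as in the sketch below. The slide does not define the "F0 extreme value" precisely, so this sketch takes it to be the sample that deviates most from the start value; that choice, the sign conventions, and every name in the code are assumptions.

```python
def accent_parameters(f0, utt_mean, voice_mean, utt_min, voice_mean_min):
    """Difference-based parameters for one accent's F0 samples (Hz).

    f0: F0 samples across the accent, first sample on the first vowel.
    utt_mean, utt_min: mean and minimum F0 of the containing utterance.
    voice_mean, voice_mean_min: mean F0 and mean of utterance minima
    over all utterances by the same voice.
    """
    start, end = f0[0], f0[-1]
    # Assumed definition: the extreme is the sample farthest from the start value.
    extreme = max(f0, key=lambda v: abs(v - start))
    return {
        "start_minus_extreme": start - extreme,
        "extreme_minus_end": extreme - end,
        "max_minus_min": max(f0) - min(f0),
        "utt_mean_minus_voice_mean": utt_mean - voice_mean,
        "utt_min_minus_voice_min": utt_min - voice_mean_min,
    }

# Hypothetical rise-fall contour and speaker statistics.
params = accent_parameters([120.0, 150.0, 180.0, 140.0, 110.0],
                           utt_mean=140.0, voice_mean=135.0,
                           utt_min=105.0, voice_mean_min=100.0)
```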
Accent type re-classification • Clusters description
Accent type re-classification • Clusters characteristics
New results – Accent placement prediction • train data • test data
New results – Accent type prediction • train data • test data
Evaluation • self-organising maps are a potential method for accent categorisation • the results are relatively successful and consistent • data pre-processing is the most critical phase • the automatic training phase requires solid and consistent (manual) preparation
Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements
Need for better data • Based on problems encountered • Further analysis of clusters • A large amount of data from a single speaker (primary need) • A large amount of prosodic variation • A balanced set of pitch events • Clear speech which can be easily tracked • Complex prosodic structure
Suggested improvements • Model modification • More data, e.g. from the Peak Alignment study • Separate models for different sentence types (yes/no questions vs. statements) • Re-estimation of parameters based on new intonationally rich data
Next • Closer inspection of automatically assigned accent classes (clusters) • Evaluation: perception experiments