
Modelling Prosody for Speech Synthesis: example from Polish

Dominika Oliver, IGK Colloquium, 22 July 2004


Presentation Transcript


  1. Modelling Prosody for Speech Synthesis: example from Polish • Dominika Oliver • IGK Colloquium • 22 July 2004

  2. Outline • Goal: prosodic modelling for TTS • Review of past studies: intonational investigations • Current state: latest modelling results

  3. TTS Cycle • Text Input (raw or annotated) • Text Processing • Text Normalisation: names, abbreviations, numbers • Linguistic Analysis: morphology, syntax, semantics • Phonetic Analysis • Grapheme-to-Phoneme Conversion: rules, dictionary • Prosodic Analysis • Pitch, Phrasing & Duration Modelling • Speech Synthesis • Voice Rendering

  4. TTS Cycle • Prosodic analysis/modelling • Prosodic components (focus, stress, duration etc.) • Prosodic phrasing • Intonation: accent types, pitch contour

  5. Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements

  6. Procedure • Prosodic modelling shopping list: • Language specific intonation description • Accent type and placement prediction & F0 generation methods • Research and evaluation tool (Festival)

  7. Language-specific intonation description • Quantitative analysis of Polish intonation (accent types) • Standard description of Polish intonation (Jassem, 1961, 1984; Demenko, 1999) • Falling: HL, HM, ML, xL • Rising: LM, MH, LH • Level: MM • Rise-fall: LHL • Broad-Narrow Focus/Peak alignment study (Andreeva and Oliver, 2003)

  8. Accent types • Falling

  9. Accent types • Rising

  10. Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements

  11. Resources • Speech corpora: PoInt (Polish Intonation Database) (Karpiński, 2001) • 350 MB, multi-speaker (~40) • read and (semi-)spontaneous speech • Transcribed • Syllable-based IPA segmental transcription • Syllable-based prosodic annotation

  12. Resources • PoInt Prosodic transcription • Tone heights : xH, H, M, L, xL • Phrase boundary indication

  13. Resources • Falling

  14. Resources • Rising

  15. Resources • Festival TTS (Black & Taylor, 1998) • a general multi-lingual speech synthesis system • offers a full text to speech system • environment for development and research of speech synthesis techniques

  16. Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements

  17. Modelling techniques • Default prosodic assignment from simple text analysis • Hand-built rule-based system: hard to modify and adapt to new domains • Corpus-based approaches (Sproat et al ’92) • Train prosodic variation on large labeled corpora using machine learning techniques

  18. Modelling techniques – accent type/placement prediction • Classification and regression trees (CART) (Breiman, Friedman, Olshen & Stone 1984, 1993) • Widely used in speech synthesis to model: • segment durations (e.g. Riley 1992) • accent prediction (Syrdal, Hirschberg, McGory, Beckman 2001) • pitch contour generation (Dusterhoff 1997; Dusterhoff, Black, Taylor 1999)

  19. Modelling techniques - F0 prediction • Linear regression (Black & Hunt, 1996), used e.g. for F0 contour prediction/generation • find the appropriate F0 target per syllable based on available features trained from data • the predicted variable (p) can be modelled as a sum of a set of weighted real-valued factors: $p = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \dots + w_n f_n$ • factors ($f_i$): parameterised properties of the data • weights ($w_i$): usually trained using a stepwise least squares technique
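The weighted-sum model on this slide can be illustrated with a minimal sketch. It assumes plain ordinary least squares rather than the stepwise technique the slide mentions, and the function names and the two toy features are hypothetical, not taken from the talk or from Festival.

```python
def predict_f0(weights, features):
    """Return p = w0 + w1*f1 + ... + wn*fn; weights[0] is the intercept w0."""
    if len(weights) != len(features) + 1:
        raise ValueError("expected one weight per feature plus an intercept")
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def fit_least_squares(rows, targets):
    """Fit the weights by ordinary least squares (normal equations, Gauss-Jordan).

    Assumption: plain OLS stands in for the stepwise procedure named on the slide.
    """
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n = len(design[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [
        [sum(x[i] * x[j] for x in design) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(design, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


# Toy data (hypothetical): f1 = stress level, f2 = syllables from phrase start.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [100, 120, 95, 115, 135]  # F0 targets in Hz
weights = fit_least_squares(rows, targets)
```

Calling `predict_f0(weights, [1, 2])` then evaluates the fitted sum for a new syllable. In the talk's setting, one such model would be trained for each F0 target point of the syllable.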

  20. Prerequisite • F0 normalisation (Ladd, 1995; Clark, 2003) • (PoInt: 40 speakers, mixed sex) • normalisation uses the F0 mean and the F0 standard deviation of the utterance • the rescaling uses the standard deviation and mean F0 of the database
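The formulas for this step are not preserved in the transcript. One common reading, assumed here, is a per-utterance z-score followed by rescaling with database-level statistics. The function names and the use of the population standard deviation are assumptions of this sketch.

```python
from statistics import mean, pstdev


def normalise_f0(f0_values):
    """Z-score each F0 value using the utterance's own mean and standard deviation."""
    mu = mean(f0_values)
    sigma = pstdev(f0_values)  # population std; the talk does not say which variant
    if sigma == 0:
        raise ValueError("flat contour: standard deviation is zero")
    return [(v - mu) / sigma for v in f0_values]


def rescale_f0(z_values, db_mean, db_std):
    """Map z-scores back to Hz using database-level mean and standard deviation."""
    return [z * db_std + db_mean for z in z_values]


utterance = [180.0, 210.0, 240.0, 200.0, 170.0]  # toy F0 samples in Hz
z = normalise_f0(utterance)
rescaled = rescale_f0(z, db_mean=150.0, db_std=25.0)  # hypothetical database stats
```

Normalising per utterance removes speaker-dependent register differences, which matters for a mixed-sex, multi-speaker corpus. Rescaling then places every contour in one shared range.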

  21. Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements

  22. Modelling • Steps • Building the utterance structure of the database speech files • Incorporating database intonation labelling • Extracting features for accent prediction and F0 generation • Building the CART model (PoInt intonation labels) • Building the LR model (3 points per syllable) • Incorporating model parameters into voice description

  23. Modelling - accent type/placement prediction • Model based on PoInt • multiple speakers (male, female) • Accent inventory (L, H, M) • Accent prediction method: CART • Features (31) • POS window • Position of candidate syllable in word and sentence • Stress information window etc.
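The windowed features listed on this slide could be extracted along the following lines. This is a minimal sketch: the syllable record layout (`pos`, `stress`, `word_pos`), the feature names, and the padding values are hypothetical. The talk does not enumerate its 31 features.

```python
def syllable_features(syllables, index, window=1):
    """Build a feature dict for the candidate syllable at `index`.

    `syllables` is a list of dicts with hypothetical keys:
    'pos' (part of speech of the containing word), 'stress' (0 or 1)
    and 'word_pos' (position of the syllable within its word).
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = index + offset
        inside = 0 <= j < len(syllables)
        # Pad with neutral values when the window runs past the sentence edge.
        feats[f"pos[{offset:+d}]"] = syllables[j]["pos"] if inside else "PAD"
        feats[f"stress[{offset:+d}]"] = syllables[j]["stress"] if inside else 0
    feats["position_in_word"] = syllables[index]["word_pos"]
    feats["position_in_sentence"] = index
    return feats


sentence = [
    {"pos": "noun", "stress": 1, "word_pos": 0},
    {"pos": "noun", "stress": 0, "word_pos": 1},
    {"pos": "verb", "stress": 1, "word_pos": 0},
]
features = syllable_features(sentence, 0)
```

Each syllable's feature dict, paired with its accent label, would form one training row for a tree learner such as CART.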

  24. Results – accent prediction • train set: 897 of 963 correct (93.146%) • test set: 996 of 1070 correct (93.084%)
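The reported percentages follow directly from the raw counts. A small check (the helper name is illustrative):

```python
def accuracy_percent(correct, total):
    """Return classification accuracy as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


train_acc = round(accuracy_percent(897, 963), 3)   # train set counts from the slide
test_acc = round(accuracy_percent(996, 1070), 3)   # test set counts from the slide
```

Both values round to the figures given on the slide.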

  25. Modelling - F0 prediction/generation • F0 generation: linear regression • Features • accent type • POS window • Position of candidate syllable in word and sentence • Stress information window etc.

  26. Results – F0 shape prediction

  27. Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements

  28. Potential problems • Data • not enough tokens to learn from • Annotation inconsistencies (noisy data, messy accent class assignment) • Inappropriate technique / suboptimal feature set

  29. Potential data problems

  30. Potential data problems

  31. Potential data problems

  32. PoInt Analysis • Peak alignment

  33. Addressing data issues • F0 tracking errors • Identifying outliers / annotation inconsistencies • Re-classifying accent types

  34. When everything else fails – blame it on the data • Labelling errors • Unmarked disfluencies/wrong reading • Phonemic labelling • Missing phrasing • No indication of sentence mode in annotation • Inconsistent labelling • Misleading transcription description • No independent labellers

  35. Data fixes • Automatically identifying outliers / annotation inconsistencies • Statistical analysis of acoustic parameters • Manual data inspection • Insertion of phrase boundaries • Marking of disfluencies • Aligning speech with text • Deriving Gold Standard (hard)

  36. Accent classification studies • Hierarchical clustering (Klabbers & van Santen 2004) • Linear regression (Keller & Zellner Keller, 2003) • EM bagging & boosting (Sun, 2002) • HMMs • (Kumpf, King 2004) • (Blackburn, Vonwiller, and King, 1993) • (Batliner et al 1999, 2001) • (Maragoudakis 2003, Zervas 2004) • (Chan, Feng, Heinen, and Niederjohn 1994)

  37. Accent type re-classification • Two-stage procedure • Stage 1: self-organising maps (Kohonen 1982, 1995; Kaski, 1997; Vesanto & Alhoniemi, 2000) • create a set of prototype vectors representative of the data • project the prototypes onto a low-dimensional space • Stage 2: hierarchical agglomerative clustering (HAC) • finds good candidate clusters of map units: cut the dendrogram where there is a large distance between two clusters
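The two-stage procedure can be sketched in a few lines. This is an illustrative toy, not the configuration used in the talk. It assumes a 1-D map, a Gaussian neighbourhood, linear decay schedules, average linkage, and a cut at the single largest jump in merge distance.

```python
import math
import random


def train_som(data, n_units=6, epochs=50, seed=0):
    """Stage 1: train a 1-D self-organising map and return its prototype vectors."""
    rng = random.Random(seed)
    protos = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                        # learning rate decays linearly
        radius = max(1.0, (n_units / 2) * frac)  # neighbourhood shrinks, floor of 1
        for x in rng.sample(data, len(data)):
            # Best-matching unit: the prototype closest to the sample.
            bmu = min(range(n_units), key=lambda i: math.dist(protos[i], x))
            for i in range(n_units):
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                protos[i] = [p + rate * h * (v - p) for p, v in zip(protos[i], x)]
    return protos


def cluster_prototypes(protos):
    """Stage 2: average-linkage HAC, cut at the largest jump in merge distance."""
    clusters = [[i] for i in range(len(protos))]
    partitions, heights = [], []

    def linkage(a, b):
        total = sum(math.dist(protos[i], protos[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        partitions.append([c[:] for c in clusters])  # partition before this merge
        heights.append(d)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    if len(heights) < 2:
        return partitions[0] if partitions else clusters
    # The merge that follows the largest jump is the one to undo.
    k = max(range(1, len(heights)), key=lambda t: heights[t] - heights[t - 1])
    return partitions[k]


# Toy 2-D accent vectors (hypothetical values).
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0]]
prototypes = train_som(data, n_units=4, epochs=20)
groups = cluster_prototypes(prototypes)
```

Clustering the prototypes instead of the raw vectors keeps the agglomerative step cheap, since the map has far fewer units than there are data points. The groups returned are lists of map-unit indices.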

  38. Acoustic data parameterisation • Accent type classification: • (Demenko, 1999) • Difference between start F0 (first vowel) and F0 extreme value (on a vowel or consonant) • Difference between F0 extreme value and end point F0 • Difference between F0 max and F0 min • Difference between utterance mean F0 and mean F0 for all utterances by the same voice • Difference between utterance min F0 and global mean min F0 for the same voice
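The five difference parameters listed on this slide can be computed from a sampled contour roughly as follows. This is a sketch under stated assumptions. The "F0 extreme value" is taken to be the sample that deviates most from the start value. A single contour stands in for both the accent and the utterance. The feature names and the two voice-level statistics passed in are hypothetical.

```python
from statistics import mean


def accent_features(f0, voice_mean_f0, voice_mean_min_f0):
    """Compute five F0 difference parameters (Hz) for one contour.

    `f0` is a list of F0 samples; the two voice-level values are the mean F0
    and the mean of per-utterance F0 minima for the same voice.
    """
    if len(f0) < 2:
        raise ValueError("need at least two F0 samples")
    start, end = f0[0], f0[-1]
    # Assumption: the extreme is the sample farthest from the start value.
    extreme = max(f0, key=lambda v: abs(v - start))
    return {
        "start_minus_extreme": start - extreme,
        "extreme_minus_end": extreme - end,
        "max_minus_min": max(f0) - min(f0),
        "mean_minus_voice_mean": mean(f0) - voice_mean_f0,
        "min_minus_voice_mean_min": min(f0) - voice_mean_min_f0,
    }


contour = [120.0, 150.0, 180.0, 140.0, 100.0]  # toy rise-fall contour in Hz
params = accent_features(contour, voice_mean_f0=130.0, voice_mean_min_f0=95.0)
```

Feature vectors of this kind would be the input to the re-classification step.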

  39. Accent type re-classification • Clusters description

  40. Accent type re-classification • Clusters characteristics

  41. Accent type re-classification

  42. New results – Accent placement prediction • train data • test data

  43. New results – Accent type prediction • train data • test data

  44. Evaluation • self-organising maps are a potential method for categorisation • the results are relatively successful and consistent • data pre-processing is the most critical phase • the automatic training phase requires solid and consistent (manual) preparation

  45. Overview • Procedure • Resources • Modelling techniques • Modelling prosody • Problems & solutions • Suggested improvements

  46. Need for better data • Based on problems encountered • Further analysis of clusters • A large amount of data from a single speaker (primary need) • A large amount of prosodic variation • A balanced set of pitch events • Clear speech which can be easily tracked • Complex prosodic structure

  47. Suggested improvements • Model modification • More data, e.g. from the Peak Alignment study • Separate models for different sentence types (yes/no questions vs. statements) • Re-estimation of parameters based on new intonationally rich data

  48. Next • Closer inspection of automatically assigned accent classes (clusters) • Evaluation: perception experiments

  49. The End
