Modelling Polish Intonation for Speech Synthesis Dominika Oliver 23 May 2002
Plan • Aims & Objectives • Reasons • Methodology
Building TTS systems • Basic building blocks: • pre-processing: analysis of raw and labelled text into identifiable words. • Text normalisation (abbreviations, dates, money, time indications, addresses, telephone numbers, bank accounts, etc.) • tokenization, mapping tokens to words, resolving mark-up languages • linguistic module: from words to segments: • Orthographic to phonetic conversion of words (morphological analysis, g2p, syllabification, stress assignment) • Sentence analysis (resolve pronunciation ambiguities, syntactic, lexical and semantic analysis)
Building TTS systems (cd) • phonetic module • F0 and durations (and anything else appropriate for waveform synthesis) • Prosodic modelling (generation of the intonation contour by an intonation model; prosodic phrase, accent and F0 prediction) • acoustic module (waveform synthesis) • Conversion into a digital speech signal • From segments, F0 and duration to a waveform. • There are many techniques for this: concatenative synthesis (diphone, unit selection), formant synthesis and articulatory synthesis.
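The pre-processing block listed above (tokenization, then text normalisation) can be illustrated with a minimal Python sketch. The abbreviation table, the digit-by-digit number reading and the function names are assumptions made for illustration only; they are not taken from any particular TTS system, and a real front end would handle dates, money, mark-up and much more.

```python
import re

# Hypothetical abbreviation table; a real system would use a much larger,
# language-specific lexicon.
ABBREVIATIONS = {"dr": "doctor", "st": "street", "etc": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def tokenize(text):
    """Split raw text into alphabetic tokens and digit strings."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


def normalise(text):
    """Map tokens to lower-case words, expanding abbreviations and digits."""
    words = []
    for token in tokenize(text):
        lower = token.lower()
        if token.isdigit():
            # Read numbers digit by digit (a simplifying assumption).
            words.extend(DIGITS[int(d)] for d in token)
        elif lower in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lower])
        else:
            words.append(lower)
    return words


print(normalise("Dr Smith, room 42"))
```

The output word list is what the linguistic module would then convert to phonetic segments.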
Terminology • Stress - lexically specified distinction between strong and weak syllables; a stressed syllable is louder and longer than an unstressed one • Tone - lexically specified pitch movement, property of a syllable • Accent - post-lexical pitch movement, linked to a stressed syllable • Pitch accent - lexical pitch movement, property of a word
Intonation in TTS • Intonation prediction can be split into two tasks: • Prediction of accents (and/or tones): this is done on a per-syllable basis, identifying which syllables are to be accented as well as what type of accent is required (if appropriate for the theory). • Realization of the F0 contour: given the accents/tones, generate an F0 contour.
Why is it important? • In the task of rendering natural sounding speech from raw text, one of the many tasks is generating natural sounding intonation. • A number of intonation theories have been utilised in various systems to try to do this task. • As the quality of speech synthesis improves, a greater demand is put on the intonation system to produce more varied intonation tunes.
Models of Intonation • Linear or tone sequence models - generate values from left to right as a sequence of values or movements. • British school - based on auditory analysis • Pierrehumbert 1980 - predominantly acoustic analysis • Dutch school - ’t Hart, Collier and Cohen 1990 - perceptual data • Tilt - Taylor 1998 - phonetic • Superpositional or hierarchical models - generate a contour by modelling factors separately (phone, syllable, word, phrase, sentence) and then combining the partial models. • Fujisaki 1983, Grønnum 1992, Möbius et al. 1993
Techniques of intonation modelling: using Tilt & ToBI • Tilt and ToBI typify two major classes of intonation systems. • Tilt comes from a data-driven approach, attempting to form an abstraction of the natural contour directly from the waveform. • ToBI takes a more linguistic or phonological approach, specifying a small set of discrete labels which identify the intonational space of accents and tones. • Also prosodic labelling systems
ToBI (Pierrehumbert, 1980) • Autosegmental-metrical approach: pitch movements are decomposed into pitch levels. • Intonation phrases are modelled as sequences of high (H) and low (L) pitch levels. • ToBI offers a well-defined intonation phonology for labelled speech. It is the most widely available standard labelling system. • The ToBI labelling system itself does not define a mechanism to go from the labels to an F0 contour, or the reverse. However, there are both hand-written rule systems (e.g. M. Anderson, J. Pierrehumbert, and M. Liberman 1984) and statistically trained methods (e.g. A. Black and A. Hunt 1996) which do this task. • Machine readable. • Increase in descriptive power: transcriptions can be compared across dialects and languages (ToBI for English, GToBI for German, SCToBI for Serbo-Croatian, ToDI for Dutch, etc.).
Tilt (Taylor 1998) • Tilt is a phonetic model of intonation that represents intonation as a sequence of continuously parameterised events (pitch accents or boundary tones). • These parameters are called tilt parameters and are determined directly from the F0 contour. • They are: duration, amplitude and tilt. • Imposes no categorial classification on events.
Tilt (cd) • Duration is the sum of the rise and fall durations. • Amplitude is the sum of the magnitudes of the rise and fall amplitudes. • The tilt parameter expresses the overall shape of the event: the difference between the rise and fall amplitudes divided by their sum. • The tilt parameter ranges from -1 to 1: -1 is a pure fall, 1 a pure rise, and 0 equal portions of rise and fall.
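The three parameters defined above can be computed directly from the rise and fall measurements of a single event. The following Python sketch follows the amplitude-based definition given on this slide; the function name, argument names and units (Hz, seconds) are assumptions made for illustration.

```python
def tilt_parameters(rise_amp, fall_amp, rise_dur, fall_dur):
    """Return (duration, amplitude, tilt) for one intonational event.

    rise_amp, fall_amp: F0 excursions in Hz (sign is ignored, magnitudes are used)
    rise_dur, fall_dur: durations in seconds
    """
    a_rise, a_fall = abs(rise_amp), abs(fall_amp)
    duration = rise_dur + fall_dur      # sum of rise and fall durations
    amplitude = a_rise + a_fall         # sum of the magnitudes
    if amplitude == 0:
        raise ValueError("event has no pitch movement, tilt is undefined")
    tilt = (a_rise - a_fall) / amplitude  # -1 pure fall ... +1 pure rise
    return duration, amplitude, tilt


# A pure rise, a symmetric rise-fall, and a pure fall.
print(tilt_parameters(20.0, 0.0, 0.25, 0.0))
print(tilt_parameters(10.0, -10.0, 0.125, 0.125))
print(tilt_parameters(0.0, -15.0, 0.0, 0.5))
```

An event with no rise and no fall has zero amplitude, so the tilt ratio is undefined; the sketch raises an error in that case rather than guessing a value.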
Examples of intonation control • Information provided by intonation: • Focus or given/new information • Emotions, word emphasis, syntactic disambiguation • Examples from Mary TTS (DFKI): • Gehen wir nach Hause !/? • Der Zug fährt nach Frankfurt, oder? • Ist die Nummer 180? Nein, die Nummer ist 100 80.
Prosodic Labelling Systems • ToBI (Tones and Break Indices) • ToBI is an intonational labelling standard for speech databases, loosely based on Janet Pierrehumbert's thesis (Pierrehumbert 1980). • Labelling is done on the basis of a speech wave and F0 trace. • The labelling scheme consists of: • (1) Orthographic tier: words spoken • (2) Break-index tier: the degree of juncture between words • (3) Tone tier: intonation • (4) Miscellaneous tier: comments
Prosodic Labelling Systems • ToBI (cd) • discrete intonation accent types: H*, H+!H*, L*, L*+H and L+H* • phrase accent types: H- and L- • boundary tones: L-L%, L-H%, H-L% and H-H% • break levels: 0, 1, 3, and 4 (2 reserved for special cases)
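Because the ToBI inventory is a small closed set of labels, a labelled file can be checked mechanically. The Python sketch below is one hypothetical way to classify a label against the inventory listed above; the function name and the returned category strings are assumptions, and the downstepped accent is written in its standard ToBI form H+!H*.

```python
# Label inventory as listed on the slide.
PITCH_ACCENTS = {"H*", "H+!H*", "L*", "L*+H", "L+H*"}
PHRASE_ACCENTS = {"H-", "L-"}
BOUNDARY_TONES = {"L-L%", "L-H%", "H-L%", "H-H%"}
BREAK_INDICES = {"0", "1", "2", "3", "4"}  # 2 is reserved for special cases


def classify_tobi_label(label):
    """Return the category of a ToBI label, or 'unknown' if it is not listed."""
    label = label.strip()
    if label in PITCH_ACCENTS:
        return "pitch accent"
    if label in PHRASE_ACCENTS:
        return "phrase accent"
    if label in BOUNDARY_TONES:
        return "boundary tone"
    if label in BREAK_INDICES:
        return "break index"
    return "unknown"


for lab in ["L+H*", "H-", "L-H%", "4", "X*"]:
    print(lab, "->", classify_tobi_label(lab))
```

In practice tone labels and break indices live on separate tiers, so a checker would normally be applied per tier rather than to a mixed list as in this demonstration.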
Prosodic Labelling Systems (cd) • Tilt • A Tilt labelling for an utterance consists of an assignment of one of four basic intonational events (labelled a, b, c and sil respectively): • pitch accents, • boundary tones, • connections, • silence.
Polish synthesis (examples) • What is available: • Festival (University of Edinburgh, CSTR) • Realspeak (Scansoft) • Spiker (IVO Software) • SynTalk (Neurosoft)
Polish intonation model • British school (Jassem 1984; Demenko 1999) • The description of accent and intonation at the linguistic level is based on the main features of the British English system developed essentially by O’Connor and Arnold (1973) and Jassem (1984). • An intonational phrase is defined as a sequence of (optional) pre-nuclear, (constitutive) nuclear, and (optional) post-nuclear accents. • [prehead [head [[nucleus] tail]]] (O'Connor & Arnold) • [anacrusis][[prenuclear intonation[nuclear intonation]]] (Jassem) • e.g. To jest naj'lepsza 'pora "dnia. • To jest naj'lepsza po"radnia. • "Co mówiłeś?
Intro - Polish intonation structure • A Polish phrase includes only one ictic accent, which is also referred to as the nuclear accent. • Pre-ictic accents are referred to as pre-nuclear accents and post-ictic accents as post-nuclear accents. • The pre-nuclear and the nuclear accents are mainly determined by specific pitch relations, whilst the post-nuclear accent (if any) is essentially durational.
Intro - Polish intonation structure (cd) • 2 classes of pre-nuclear accents: H (high) and L (low) • 9 classes of nuclear accents have been distinguished: HL, ML, xL, HM, LM, LH, MH, MM, and LHL, where H is High, M Medium, L Low and xL extra-Low relative to the particular speaker’s average and mean-Low pitch; e.g., LH means “rising from Low to High”. • e.g. ``Znowu ten wariat. (HL) ,, Znowu ten wariat? (LH)
Platform • Festival is a speech synthesis application developed at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh • Multilingual text to speech • (English, Spanish, German, Welsh, Catalan, Polish) • Allows addition of new languages • Synthesis research and development environment • Tools for development - support for extracting information from speech databases in a way suitable for building models (models for accent prediction, F0 generation, duration, vowel reduction, homograph disambiguation, phrase break assignment and unit selection) • Free software
Platform (cd) - direct route from research to use • Multi-lingual text to speech: for those who have little interest in the internal workings of the system, and just want speech output. • Synthesis for language system: for applications that generate text from known forms. In this type of system perhaps telephone numbers, addresses, etc. can be explicitly marked, language type, even intonational forms can be specified. This form of access requires more knowledge about the synthesis internals but still not its low level details. • Synthesis development environment: In this mode, new synthesis modules, intonation, waveform synthesizers, etc. can be developed and compared in a software environment that provides the right basic tools so that development may concentrate on the theory not the implementation.
Intonation in Festival • Task: • Prediction of accents & realisation of F0 contour • Method: • Statistical and rule based • Tilt • ToBI
Intonation in Festival (cd) • ToBI: Accents and boundary types are predicted by a CART (classification and regression tree), and the F0 contour is generated by a statistically trained method. • Three F0 values are predicted for each syllable: at the start, mid vowel and end. They are predicted using linear regression based on a number of features including ToBI accent type, phrase position, and syllable position with contexts. • Although a three-point prediction system cannot capture all the variability in natural intonation, experiments have found it to be sufficient to produce reasonable F0 contours (Black 1998).
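The three-point scheme described above can be sketched as three linear models, one per target point, applied to the same syllable features. The Python example below is only an illustration: the feature names and all coefficient values are hypothetical placeholders (a real system would estimate them from a labelled speech database), and it does not reproduce Festival's actual feature set or API.

```python
# Hypothetical regression coefficients in Hz, one linear model per F0 target.
MODELS = {
    "start": {"intercept": 120.0, "accent_H*": 10.0, "accent_L*": -8.0,
              "phrase_position": -15.0},
    "mid":   {"intercept": 120.0, "accent_H*": 25.0, "accent_L*": -15.0,
              "phrase_position": -15.0},
    "end":   {"intercept": 120.0, "accent_H*": 12.0, "accent_L*": -6.0,
              "phrase_position": -15.0},
}


def syllable_features(accent, phrase_position):
    """Encode a syllable: accent as a 0/1 indicator, position as 0.0 to 1.0."""
    features = {"phrase_position": phrase_position}
    if accent is not None:
        features["accent_" + accent] = 1.0
    return features


def predict_targets(features):
    """Apply each linear model: intercept + sum(weight * feature value)."""
    targets = {}
    for point, weights in MODELS.items():
        value = weights["intercept"]
        for name, x in features.items():
            value += weights.get(name, 0.0) * x
        targets[point] = value
    return targets


# An H*-accented syllable at the start of a phrase, then an unaccented
# syllable at the end of the phrase.
print(predict_targets(syllable_features("H*", 0.0)))
print(predict_targets(syllable_features(None, 1.0)))
```

Joining the predicted start, mid-vowel and end targets of successive syllables, for example by linear interpolation, would give a full F0 contour; the negative weight on phrase position here simply mimics a gradual pitch decline across the phrase.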
Intonation in Festival (cd) • The Tilt intonation theory takes a bottom-up approach. Its intention is to build a parameterization of the F0 contour that is abstract enough to be predictable in a text-to-speech system. • It has been shown that a good representation of a natural F0 contour can be made automatically from the raw signal (though it is better if the accents and boundaries are hand labelled). Dusterhoff 1997 further shows how that parameterization can be predicted from text.
Future work: pilot study • Immediate Plans • ToBI description of the Polish intonation phrase (Polish Intonation Database, Karpiński 2000) • Future Work • Synthesis assessment: visually impaired • Potential Applications • free Polish-English talking dictionary (EU project)