
Bootstrapping a Language-Independent Synthesizer

Bootstrapping a Language-Independent Synthesizer. Craig Olinsky Media Lab Europe / University College Dublin 15 January 2002. Introducing the Problem. Given a set of recordings and transcriptions in an arbitrary language, can we quickly and easily build a speech synthesizer?


Presentation Transcript


  1. Bootstrapping a Language-Independent Synthesizer Craig Olinsky Media Lab Europe / University College Dublin 15 January 2002

  2. Introducing the Problem Given a set of recordings and transcriptions in an arbitrary language, can we quickly and easily build a speech synthesizer? YES, if we know something about the language. However, for the majority of languages for which such resources don’t exist…

  3. Starting from Sample PROS: The existing synthesizer provides a store of “linguistic” knowledge we can start from. Analogous to speaker adaptation in speech recognition systems. Overall, quality should be better. CONS: Difficulty is related to the degree of difference between the sample and target languages. Best as a gradual process: accent/dialect, not language.

  4. Starting from Scratch PROS: Difficulty is directly proportional to the complexity of the language. Common procedure based upon machine learning from recordings and transcripts. CONS: We don’t have a great deal of relevant knowledge to apply to the task. If not using a principled phone set, it is necessary to segment / label recordings cleanly.

  5. The Obvious Compromise Take what we do know from building speech synthesis, and generalize it to an existing framework. -- we’re not specifically learning from “scratch” -- at the same time, we’re not making linguistic assumptions pre-coded into the source voices

  6. “Generic” Synthesis Framework/Toolkit • Set of scripts, utilities, and definition files to help automate the creation of reasonable speech synthesis voices from an arbitrary language without the need for linguistic or language-specific information. • Built on top of the Festival Speech Synthesis System and FestVox toolkit (for waveform synthesis; most of the text processing and pronunciation handling is externalized to locally-developed tools)

  7. Language-Dependent Synthesis Components • Phone set • Word pronunciation (lexicon and/or letter-to-sound rules) • Token processing rules (numbers etc) • Durations • Intonation (accents and F0 contour) • Prosodic phrasing method
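One way to keep track of the components listed on slide 7 is a simple per-language record. The sketch below is illustrative only: the class name `VoiceDefinition`, its field names, and the `missing_components` helper are hypothetical and are not part of Festival, FestVox, or the toolkit described in the talk.

```python
from dataclasses import dataclass, field, fields
from typing import Callable, Dict, List, Optional


@dataclass
class VoiceDefinition:
    """Hypothetical container for the language-dependent parts of a voice."""

    phone_set: List[str]
    # Word pronunciation: an explicit lexicon and/or letter-to-sound rules.
    lexicon: Dict[str, List[str]] = field(default_factory=dict)
    letter_to_sound: Optional[Callable[[str], List[str]]] = None
    # Token processing rules (numbers etc.): token -> list of words.
    token_rules: Optional[Callable[[str], List[str]]] = None
    # Placeholders for models that would be trained or hand-written.
    duration_model: Optional[object] = None
    intonation_model: Optional[object] = None
    phrasing_method: Optional[object] = None

    def missing_components(self) -> List[str]:
        """Return the names of components that are still empty or unset."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


voice = VoiceDefinition(phone_set=["a", "c", "t"])
print(voice.missing_components())
```

A record like this makes it explicit which pieces still have to be supplied or learned for a new language before a voice can be built.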

  8. Phoneme Sets • If we rely on a pre-existing set of pronunciation rules, lexicon, etc., we are automatically limited to using the phone-set used in those resources (or something which they can be mapped to); most likely something language-dependent. • IPA, SAMPA: something language-universal? • We need to generate pronunciations: how do we create the relationship between our training database / phonetic representation / orthography?

  9. “Multilingual” Phoneme Sets: IPA, SAMPA We don’t want to be stuck with a set of phonemes targeted at a specific language, so we instead use a phoneme definition designed to be inclusive of all languages. But… this still assumes we know the relationship between the phone set and the orthography of the language; i.e. for any given text we can generate a pronunciation. This approach still assumes linguistic knowledge!

  10. Orthography as Pronunciation cf. R. Singh, B. Raj and R.M. Stern, “Automatic Generation of Phone Sets and Lexical Transcriptions.” Suppose we begin with the orthography of the written language, e.g. CAT = [c] [a] [t], DOG = [d] [o] [g]. This implies • A relation between the number of characters in a spelling and the length of the pronunciation • That the orthography of the language is consistent / efficient
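The orthography-as-pronunciation idea on slide 10 can be sketched in a few lines: every letter becomes its own pseudo-phone, and the phone set is whatever letters occur in the transcripts. This is a minimal illustration under stated assumptions, not the procedure used in the talk or in the cited paper. The function names are hypothetical, and the choices to lowercase, apply NFC normalization, and drop non-letters are assumptions made for the example.

```python
import unicodedata


def grapheme_pronunciation(word):
    """Return one pseudo-phone per letter, e.g. 'CAT' -> ['c', 'a', 't']."""
    normalized = unicodedata.normalize("NFC", word).lower()
    # Assumption: punctuation and digits carry no pronunciation here.
    return [ch for ch in normalized if ch.isalpha()]


def build_lexicon(transcript_lines):
    """Map each cleaned word in the transcripts to its letter-based phones."""
    lexicon = {}
    for line in transcript_lines:
        for token in line.split():
            phones = grapheme_pronunciation(token)
            if phones:
                lexicon.setdefault("".join(phones), phones)
    return lexicon


def phone_set(lexicon):
    """The induced phone set is the set of letters seen in the lexicon."""
    return sorted({p for phones in lexicon.values() for p in phones})


lexicon = build_lexicon(["The cat", "a dog."])
print(lexicon["cat"])
print(phone_set(lexicon))
```

Note that the pronunciation length always equals the spelling length, which is exactly the first implication listed on the slide.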

  11. Orthography as Pronunciation

  12. Implications for Data Labeling and Training

  13. Non-Roman Orthography: Questions of Transcription

  14. Difficulties in Machine Learning of Pronunciation “But there is a much more fundamental problem … in that it crucially assumes that letter-to-phoneme correspondences can in general be determined on the basis of information local to a particular portion of the letter string. While this is clearly true in some languages (e.g. Spanish), it is simply false for others….” “…It is unreasonable to expect that good results will be obtained from a system trained with no guidance of this kind, or … with data that is simply insufficient to the task.” – Sproat et al., Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, pp. 76-77
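The locality assumption criticized in the quotation on slide 14 can be made concrete by looking at what a letter-to-sound learner actually sees: a fixed window of letters around the one being pronounced. The sketch below is a hypothetical helper, and the English pair "mat" / "mate" is an illustration chosen here rather than an example from the source. With one letter of context on each side, the "a" in both words looks identical, although it is pronounced differently; the distinguishing "e" only becomes visible with a wider window.

```python
def letter_windows(word, width=1, pad="#"):
    """Return (left context, letter, right context) for each letter in word."""
    padded = pad * width + word.lower() + pad * width
    return [
        (padded[i - width:i], padded[i], padded[i + 1:i + 1 + width])
        for i in range(width, len(padded) - width)
    ]


# Index 1 is the letter 'a' in both words.
narrow_mat = letter_windows("mat", width=1)[1]
narrow_mate = letter_windows("mate", width=1)[1]
wide_mat = letter_windows("mat", width=2)[1]
wide_mate = letter_windows("mate", width=2)[1]

print(narrow_mat == narrow_mate)  # same local evidence for both words
print(wide_mat == wide_mate)      # a wider window separates them
```

How wide the window has to be depends on the language, which is one reason a purely data-driven learner with no guidance may need far more training data than is available.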

  15. Lexicon / Letter-to-Sound Rules

  16. Token Processing

  17. Duration and Stress Modeling

  18. Intonation and Phrasing

  19. Unit Selection and Waveform Synthesis

  20. Overview: Adaptation for Accent and Dialect

  21. Final Points
