Creating a Voice for Festival Presentation by Matthew Hood Supervisors: S. Bangay A. Lobb Voice: cmu_uk_rab_diphone
Presentation Overview • About the project • Festival • About Text to Speech • 3 layer approach • Waveform Generation • Languages, phones and diphones • Making a voice • Recording Diphones • Labelling • Results
About the Project • Text-to-speech programs have been around for many years without generating much excitement. • Many new applications have arisen, sparking new interest. • One factor limiting their usefulness is the small number of voices available (fewer than 10?). • Creating a voice is a long, tedious process, but a greater problem is the lack of documentation. • This project aims to give a comprehensive overview of how to make a voice in Festival, pointing out all the pitfalls ahead of time.
Festival • Festival is an open source TTS system developed at the University of Edinburgh in the late 90s. • “It offers a free, portable, language independent, run-time speech synthesis engine for various platforms under various APIs.” [Black et al] • Supported by the FestVox toolkit. • Documented in “Building Synthetic Voices” [Black et al]
General Text to Speech • Text Analysis: words and utterances are identified. • Linguistic Analysis: words are analysed in context and a pronunciation is generated, e.g. for “1990”. • Waveform Generation: utterances are turned into sound and the words are “spoken”. Because the previous layers are abstracted away, this is the only layer where the voice is used.
Waveform Generation • Festival is a concatenative synthesis system. • This means sound clips are joined together to generate speech, e.g. talking clocks. Recorded sound set: “The time is”; “past”; “o’clock”; numbers etc. Generated output: “The time is” – “half” – “past” – “three”. Voice: cmu_us_kal_diphone
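The talking-clock idea can be sketched in a few lines of Python. This is only an illustration and not Festival code: the clip inventory and the `clock_clips` function are hypothetical, and a real system would join the matching audio recordings, not strings.

```python
# Hypothetical clip inventory for a talking clock; each string stands in
# for one pre-recorded sound clip.
NUMBERS = ["twelve", "one", "two", "three", "four", "five", "six",
           "seven", "eight", "nine", "ten", "eleven"]


def clock_clips(hour, minute):
    """Return the ordered clips to play for a given time.

    Only o'clock and half past are handled, to keep the sketch short.
    """
    name = NUMBERS[hour % 12]
    if minute == 0:
        return ["The time is", name, "o'clock"]
    if minute == 30:
        return ["The time is", "half", "past", name]
    raise ValueError("this sketch only covers :00 and :30")


# Concatenating the chosen clips in order gives the spoken output.
print(" - ".join(clock_clips(3, 30)))  # The time is - half - past - three
```

The limitation is visible straight away: anything outside the recorded inventory cannot be said.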
Waveform Generation • For a more general system it is not feasible to record everything that could be said. • Speech needs to be broken down into smaller units. • A phone is a single phonetic sound that is generated by a human when speaking. Examples: eh (get, feather); s (sit, mass); zh (vision, casual).
Languages • A language is defined by its phone set. • A phone set is a collection of every phonetic sound used in any word in the language (including silence). • The US English phone set used in Festival has 44 phones. • But it is not enough to record every phone in the phone set.
Diphones • We do not always pronounce a phone the same way. • Its pronunciation depends on its neighbouring phones. This is known as the co-articulatory effect. • Festival relies on the simplifying assumption that the co-articulatory effect does not extend across more than a pair of phones. • These pairs of adjacent phones are known as diphones.
Diphones • By combining recorded diphones, we can now “say” any word in the language. • E.g. Jack (jh-ae-k) is built from the diphones __-jh, jh-ae, ae-k and k-__, where __ is silence.
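The decomposition of a word into diphones can be sketched as follows. This is a minimal illustration, not Festival's implementation: the `to_diphones` name is hypothetical, and `__` is used for silence to match the slide.

```python
def to_diphones(phones, silence="__"):
    """Pad a phone sequence with silence and return each adjacent pair."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# "Jack" is the phone sequence jh-ae-k.
print(to_diphones(["jh", "ae", "k"]))  # ['__-jh', 'jh-ae', 'ae-k', 'k-__']
```

A word of n phones always yields n + 1 diphones, because the silence on either side contributes one extra pair.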
Recording Diphones • Because of the co-articulatory effect, it is nearly impossible to pronounce a diphone accurately on its own. • Using made-up words is preferable to using real words. Examples: us_006 “pau t aa k aa k aa pau” gives “k-aa” and “aa-k”; us_603 “pau t aa t ey ah t aa pau” gives “ey-ah”.
Recording Diphones • In theory the number of diphones needed to speak a language is the number of phones squared. • But we don’t actually use every combination when we speak. • The standard US diphone list used by Festival contains 1396 diphones. • It is often worth extending this list to take into account strong accents or common foreign words.
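As a quick check of the arithmetic, the theoretical upper bound can be compared with the size of the standard list. Both input figures (44 phones, 1396 diphones) come from the slides; the variable names are just for illustration.

```python
PHONES = 44    # size of the US English phone set quoted in the slides
LISTED = 1396  # diphones in the standard US list quoted in the slides

# Every ordered pair of phones is a potential diphone.
upper_bound = PHONES ** 2
coverage = LISTED / upper_bound

print(upper_bound)         # 1936
print(round(coverage, 2))  # 0.72
```

So the standard list covers roughly 72% of the theoretical pairs; the rest are combinations assumed not to occur.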
Recording Diphones • Because pronouncing the words can be a bit tricky, especially the first few times you try, FestVox provides a prompting tool.
Recording Environment • The better the recording, the better the voice. • With a decent sound card it is possible to record straight onto the PC. • Background noise must be kept to a minimum. • It takes approximately 1.5 hours to record all the diphones. • The environment must be repeatable.
Labelling • Labelling is the hardest and one of the most important parts of creating a voice. • A label file consists of a series of boundary times. • Emu label is an open source program that graphically shows where in the wave file the phones are marked. • It is part of the Emu Speech Tools, available on SourceForge.
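To make "a series of boundary times" concrete, here is a small parser sketch. The file layout is an assumption, not something the slides confirm: optional header lines ending with a line containing only `#`, then one `<time> <colour> <label>` entry per line. The `parse_labels` name and the example values are invented for illustration.

```python
def parse_labels(text):
    """Parse label text into (boundary_time_seconds, label) tuples.

    Assumed layout: an optional header terminated by a '#' line, then
    one '<time> <colour> <label>' entry per line.
    """
    lines = [line.strip() for line in text.splitlines()]
    if "#" in lines:
        # Drop the header, including the '#' terminator itself.
        lines = lines[lines.index("#") + 1:]
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        entries.append((float(fields[0]), fields[2]))
    return entries


# Invented example data, not taken from a real recording.
EXAMPLE = """separator ;
nfields 1
#
0.20 26 pau
0.31 26 t
0.45 26 aa
"""

print(parse_labels(EXAMPLE))  # [(0.2, 'pau'), (0.31, 't'), (0.45, 'aa')]
```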
Hand Labelling Us-0603 “ey-ah” • The display shows the phones, frequency and waveform. • Sound is extracted from the midpoint of the labels. • It is worth moving further into the phone when recording eh-__.
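Midpoint extraction can be sketched as below, reading the slide as meaning that a diphone runs from the middle of its first phone to the middle of its second. The `diphone_span` function and the segment times are hypothetical; times are in integer milliseconds to keep the arithmetic exact.

```python
def diphone_span(segments, left, right):
    """Return (cut_start, cut_end) in ms for the first left-right pair.

    segments is a time-ordered list of (label, start_ms, end_ms).
    The cut runs from the midpoint of the left phone to the midpoint
    of the right phone.
    """
    for (l1, s1, e1), (l2, s2, e2) in zip(segments, segments[1:]):
        if l1 == left and l2 == right:
            return ((s1 + e1) / 2, (s2 + e2) / 2)
    raise KeyError(f"{left}-{right} not found")


# Invented boundaries for the prompt "pau t aa t ey ah t aa pau".
SEGMENTS = [
    ("pau", 0, 200), ("t", 200, 300), ("aa", 300, 500),
    ("t", 500, 600), ("ey", 600, 800), ("ah", 800, 1000),
    ("t", 1000, 1100), ("aa", 1100, 1300), ("pau", 1300, 1500),
]

print(diphone_span(SEGMENTS, "ey", "ah"))  # (700.0, 900.0)
```

This also shows why label accuracy matters: shifting one boundary moves the midpoint and therefore the cut.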
Auto Labelling - results • FestVox provides an auto labeller. • 1.6% failure rate. • 8 – 15% error rate. • 70% usable diphones. (400+ hand corrections)
Auto labeller • Test, test and retest. • Created splittest.pl • Hand label any problem phones. • Remove DB markers.
Finishing voice • Once happy with labels. • Optional pitchmark extraction. • Volume levelling. • Load the voice into festival and test with actual speech. • Build final voice database. • Create symbolic link.
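One simple way to think about the volume levelling step is to scale every recording to the same RMS level. The sketch below is an assumption about what levelling could look like, not the procedure FestVox uses; `rms` and `level` are hypothetical helper names, and samples are plain lists of floats.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level(recordings, target_rms):
    """Scale each recording so that its RMS matches target_rms."""
    levelled = []
    for samples in recordings:
        current = rms(samples)
        # Leave silent recordings untouched to avoid dividing by zero.
        factor = target_rms / current if current > 0 else 1.0
        levelled.append([s * factor for s in samples])
    return levelled


loud = [1.0, -1.0, 1.0, -1.0]
quiet = [0.25, -0.25, 0.25, -0.25]
for samples in level([loud, quiet], target_rms=0.5):
    print(rms(samples))
```

Joining diphones recorded at different volumes produces audible jumps in loudness, which is why a step like this comes before the final database is built.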
What I have learnt & achieved • Learnt a lot about speech and speech synthesis. • Learnt a lot about Linux and sound editing. • Created a number of variations of ru_us_matt_diphone, used to test different labelling methods, how recordings affect results etc. • Final paper giving step by step guide and helpful hints. • There is much room for future work, including voice adaptation. • Am sick of the sound of my own voice. Voice: ru_us_matt_diphone