Building a Limited Domain Voice Using Festvox
(Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)
Kishore Prahallad
Email: kishore@iiit.ac.in
International Institute of Information Technology (IIIT) Hyderabad, India & Language Technologies Institute, Carnegie Mellon University
Objective
• Objective: To introduce the inner workings of the Festival speech synthesis system
• Best resources: The Festival, Festvox and Speech Tools documentation and their mailing lists
• Topics:
  • Festival, Festvox and Speech Tools
  • Modules and data structures in Festival
  • Synthesis flow
  • Building a limited domain voice
Festival & Speech Tools
• Festival
  • Full text-to-speech system
  • Multi-lingual
  • A general framework for building new voices in existing and new languages
  • APIs: shell level, C++ library, Emacs interface
• Speech Tools
  • A set of modules for common tasks found in speech processing
  • Example: feature extraction
  • Interface: stand-alone executables and a set of library calls linked into user programs
Festvox
• Voice building tool
• Interface created on top of Festival and Speech Tools to build voices
How Festival, Festvox & Speech Tools Are Related
[Diagram: Speech Tools; Festival (multi-lingual synthesis engine); Festvox (environment to build voices)]
Output of Festvox
• Festvox uses Speech Tools and Festival to create a new voice
• The voice created is put back into the Festival framework to synthesize text
[Diagram: Speech Tools; Festival (multi-lingual synthesis engine); Festvox (environment to build voices); Voice]
User Interface with Festival
[Diagram: User World; Speech Tools; Festival (multi-lingual synthesis engine); Festvox (environment to build voices); Voice]
Some Festival-Specific Terminology
• Utterance: *Name* of a data structure used in Festival
• Segment: A phone is referred to as a segment
Tokens
• White-space separated
  • European languages: space, carriage return, newline, tab, vertical tab, etc.
  • Asian languages: no white-space separators, so dictionaries are used
• Punctuation:
  • The boy----was usually late-----but arrived on time!!
  • We have orange/apple/banana flavors
Basic Modules of Festival TTS System
There are many modules in the Festival system. The basic modules used for text-to-speech are:
• Token_POS
  • Basic token identification
• Token
  • Applies the token-to-word rules (handles non-standard words)
• POS
  • A standard part-of-speech tagger
• Phrasify
  • A chunker that detects the phrase boundaries
• Word
  • Implements letter-to-sound rules
Basic Modules of Festival TTS System (contd.)
• Pauses
  • Prediction of pauses, inserting silences
• Intonation
  • Prediction of accents: which syllables have an accent (stress)
• PostLex
  • Post-lexical rules that can modify segments based on their context; used for things like vowel reduction, contractions, etc.
• Duration
  • Prediction of durations of segments
• Int_Targets
  • Realization of the F0 contour: given the accents/tones, generate an F0 contour
• Wave_Synth
  • A general function that in turn calls the appropriate method to actually generate the waveform
Data Structure in Festival
• Utterance: A dashboard data structure (all modules read from and write to a common memory)
• An *Utterance* is the input and the output of every module in Festival
[Diagram: Utterance → Module → Utterance]
What Does an Utterance Consist Of?
• *Items* and *Relations*
• Item:
  • An object that stores strings representing a word, a segment, etc.
• Relation:
  • A graph that links the items
  • For example, "syllable" is a relation that links together the items storing segment names
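The items-and-relations idea can be sketched with a toy model. This is not Festival's C++ or Scheme API; the class names `Item` and `Utterance` and the way a relation is stored here (a plain Python list) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Toy item: a bag of named string features, e.g. {"name": "jh"}."""
    features: dict = field(default_factory=dict)


@dataclass
class Utterance:
    """Toy utterance: a dictionary of named relations over shared items."""
    relations: dict = field(default_factory=dict)


# Segment items for the word "June" (phone names as written on the slides).
segments = [Item({"name": p}) for p in ("jh", "uu", "n")]
word = Item({"name": "June"})

utt = Utterance()
utt.relations["Word"] = [word]
utt.relations["Segment"] = segments
# A "Syllable" relation that links segment items together:
# here one syllable grouping all three segments (hypothetical encoding).
utt.relations["Syllable"] = [segments]

# The same Item objects are reachable through more than one relation.
first_syllable = utt.relations["Syllable"][0]
print([seg.features["name"] for seg in first_syllable])  # ['jh', 'uu', 'n']
```

The point of the sketch is that relations do not copy items: the segment reached through "Syllable" is the same object as the one in "Segment", so a feature written by one module is visible to every later module.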
What Each Module Does to an Utterance
• Each module accesses *items* and *relations* in an utterance and generates new features, items and relations in the same utterance
• For example: Token_POS
  • Input: Utterance with one item, a string representing a sentence
  • Output: Utterance with multiple items, each item representing a token
• The synthesis process in Festival is viewed as applying a set of modules to an utterance
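A minimal sketch of "synthesis as applying modules to an utterance", assuming a toy utterance stored as a Python dictionary. The functions `token_module` and `word_module` are hypothetical stand-ins, not Festival's actual modules; the feature names `prepunctuation` and `punc` are used here only as illustrative labels.

```python
import string


def token_module(utt):
    """Toy tokenizer: split the text on white space and move the
    surrounding punctuation of each token into separate features."""
    tokens = []
    for raw in utt["Text"].split():
        core = raw.strip(string.punctuation)
        if not core:  # the chunk was punctuation only; skip it
            continue
        start = len(raw) - len(raw.lstrip(string.punctuation))
        tokens.append({
            "name": core,
            "prepunctuation": raw[:start],
            "punc": raw[start + len(core):],
        })
    utt["Token"] = tokens  # new relation added to the same utterance
    return utt


def word_module(utt):
    """Toy word module: one lower-cased word per token."""
    utt["Word"] = [{"name": tok["name"].lower()} for tok in utt["Token"]]
    return utt


# Input: an utterance holding a single string.
utt = {"Text": "The boy was usually late, but arrived on time!!"}

# Synthesis is viewed as applying a sequence of modules to the utterance.
for module in (token_module, word_module):
    utt = module(utt)

print(utt["Token"][-1])  # {'name': 'time', 'prepunctuation': '', 'punc': '!!'}
```

White-space splitting alone leaves a string such as "orange/apple/banana" as a single token, which is why a later token-to-word step is needed for non-standard words.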
Synthesis Flow
[Diagram built up over three slides, with a Relations column and a Modules column; example input "June 25"]
• Text: June 25
• Tokenize → Token: June | 25
• Token2Word → Word: June | Twenty | Fifth
• POS → Noun | Num | Num
• Word → Syllable: 1 1 1 0
• Segment: jh uu n t w e n t ii f i f th
• Wave Synthesize → Wave
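The Token2Word step for the "June 25" example can be sketched as follows. This is a toy rule written for illustration, not Festival's token-to-word rule set; the function names and the "number after a month name is a day" rule are assumptions.

```python
ORDINAL_ONES = {1: "first", 2: "second", 3: "third", 4: "fourth", 5: "fifth",
                6: "sixth", 7: "seventh", 8: "eighth", 9: "ninth"}
ORDINAL_TEENS = {10: "tenth", 11: "eleventh", 12: "twelfth", 13: "thirteenth",
                 14: "fourteenth", 15: "fifteenth", 16: "sixteenth",
                 17: "seventeenth", 18: "eighteenth", 19: "nineteenth"}
TENS = {20: ("twenty", "twentieth"), 30: ("thirty", "thirtieth")}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def day_to_words(n):
    """Expand a day of the month (1..31) into ordinal words."""
    if n in ORDINAL_ONES:
        return [ORDINAL_ONES[n]]
    if n in ORDINAL_TEENS:
        return [ORDINAL_TEENS[n]]
    ones = n % 10
    cardinal, ordinal = TENS[n - ones]
    return [ordinal] if ones == 0 else [cardinal, ORDINAL_ONES[ones]]


def token_to_words(tokens):
    """Toy rule: a number in 1..31 right after a month name is read as a day."""
    words = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1].lower() if i > 0 else ""
        if tok.isdigit() and prev in MONTHS and 1 <= int(tok) <= 31:
            words.extend(day_to_words(int(tok)))
        else:
            words.append(tok)
    return words


print(token_to_words(["June", "25"]))  # ['June', 'twenty', 'fifth']
```

One token ("25") becomes two words here, which is why tokens and words are kept as separate relations in the utterance.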
Installation of Festival & Festvox
• Step 1: Install Speech Tools
• Step 2: Install Festival
  • Synthesize text in English to check the sound card, rate of speech, etc.
• Step 3: Install Festvox
• Detailed notes are available from the course web site
Building Limited Domain
• Unit selection is applied to a limited domain with a restricted vocabulary
• High-quality speech systems
• Units are words
• Implementation in Festival:
  • The units are still phones, but each is restricted to come from a specific word
  • /p/ from "Pennsylvania" is differentiated from /p/ from "Pittsburgh"
  • To synthesize "Pittsburgh", all the phones should come from the word "Pittsburgh" (there may be many examples of the same word)
• Example domains: talking clock, weather prediction, rail/air inquiry systems
• http://www.cs.cmu.edu/~awb/papers/ICSLP2000_ldom/index.html
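The word-restricted units described above can be sketched by tagging every phone with the word it came from. The naming scheme `phone_word`, the helper names, and the phone strings below are assumptions for illustration only; this does not reproduce Festival's clustering or selection code.

```python
from collections import defaultdict


def unit_type(phone, word):
    """Tag a phone with its source word, e.g. 'p_pittsburgh'."""
    return f"{phone}_{word.lower()}"


def build_inventory(recordings):
    """recordings: list of (word, phones) pairs taken from the database.
    Returns unit type -> list of (recording index, phone position)."""
    inventory = defaultdict(list)
    for rec_id, (word, phones) in enumerate(recordings):
        for pos, phone in enumerate(phones):
            inventory[unit_type(phone, word)].append((rec_id, pos))
    return inventory


def candidates(inventory, word, phones):
    """Candidate units for each phone of `word`, drawn only from
    recorded examples of that same word."""
    return [inventory.get(unit_type(p, word), []) for p in phones]


# Illustrative phone strings, not taken from a real lexicon.
recordings = [
    ("Pittsburgh", ["p", "i", "t", "s", "b", "er", "g"]),
    ("Pennsylvania", ["p", "e", "n", "s", "i", "l"]),
    ("Pittsburgh", ["p", "i", "t", "s", "b", "er", "g"]),
]
inventory = build_inventory(recordings)

# /p/ for "Pittsburgh" has two candidates, both from "Pittsburgh" recordings.
print(candidates(inventory, "Pittsburgh", ["p"]))  # [[(0, 0), (2, 0)]]
```

Because the word is part of the unit type, the /p/ recorded in "Pennsylvania" can never be chosen when synthesizing "Pittsburgh", while multiple recordings of the same word still give the selector a choice.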
Limited Domain Setup (http://festvox.org/bsv/bsv-ldom-ch.html)
• 1. Set up the environment:
    $FESTVOXDIR/src/ldom/setup_ldom iiit time pra
    # This gives a talking clock setup.
    # To change it to another domain, replace "etc/time.data"
    # with the domain-specific training sentences.
    # For non-English languages, these sentences are written in English transliteration.
• 2. Generate prompts
  • Synthesize the sentences which *you* are going to speak
  • How can you synthesize? This is mostly applicable to English only
  • Why synthesize at all? To *prompt* you what to speak
    festival -b festvox/build_ldom.scm '(build_prompts "etc/txt.done.data")'
• 3. Record prompts
  • For new languages, switch off the *playing of the prompt* by commenting out na_play in bin/prompt_them
    bin/prompt_them etc/txt.done.data
• 4. Label automatically
  • Uses dynamic programming for labeling the speech
  • Labeling builds the correspondence between the text and the speech
    bin/make_labs prompt-wav/*.wav
• 4.1 Manually correct the labeling errors
    emulabel etc/emu_lab time0001
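Replacing the training sentences means editing the prompt file, so it helps to know its shape. The sketch below assumes the usual Festvox prompt-file layout of one `( utterance_id "text" )` entry per line; the two sample sentences are made up for illustration and are not the contents of the real etc/time.data.

```python
import re

# One prompt per line: ( utterance_id "text to be spoken" )
PROMPT_LINE = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)\s*$')


def parse_prompts(text):
    """Return a list of (utterance_id, sentence) pairs; skip other lines."""
    prompts = []
    for line in text.splitlines():
        match = PROMPT_LINE.match(line.strip())
        if match:
            prompts.append((match.group(1), match.group(2)))
    return prompts


# Hypothetical sample in the assumed format.
sample = '''( time0001 "The time is now, exactly five past one, in the morning." )
( time0002 "The time is now, just after ten past two, in the morning." )'''

for utt_id, sentence in parse_prompts(sample):
    print(utt_id, "->", sentence)
```

The utterance id (for example time0001) is the same name passed to emulabel in step 4.1, so a check like this can catch malformed entries before recording starts.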
Limited Domain Setup (contd.)
• 5. Generate pitch markers
    bin/make_pm_wave wav/*.wav
• 6. Correct the pitch markers
    bin/make_pm_fix pm/*.pm
• 7. Generate Mel cepstral coefficients
    bin/make_mcep wav/*.wav
• 8. Generate the utterance structure
    festival -b festvox/build_ldom.scm '(build_utts "etc/txt.done.data")'
• 9. Cluster the units
    festival -b festvox/build_ldom.scm '(build_clunits "etc/txt.done.data")'
• 10. Test the voice
    festival festvox/iiit_time_pra_ldom.scm '(voice_iiit_time_pra_ldom)'
  • To see the units selected:
    (set! utt (SayText "abhii samaya hai...."))
    (clunits::units_selected utt "-")
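Steps 2 to 9 must run in order, so a small dry-run checklist can help. The sketch below only prints the commands quoted on the slides and never executes them; the helper `festival_batch` and the list `BUILD_STEPS` are hypothetical names, and the prompt file name is taken from the slides as-is.

```python
import shlex

PROMPTS = "etc/txt.done.data"  # prompt file name as used on the slides


def festival_batch(function_name):
    """Build the 'festival -b' command line for one build function."""
    return f"festival -b festvox/build_ldom.scm '({function_name} \"{PROMPTS}\")'"


# (description, command) pairs in the order given on the slides.
BUILD_STEPS = [
    ("Generate prompts", festival_batch("build_prompts")),
    ("Record prompts", f"bin/prompt_them {PROMPTS}"),
    ("Label automatically", "bin/make_labs prompt-wav/*.wav"),
    ("Generate pitch markers", "bin/make_pm_wave wav/*.wav"),
    ("Correct the pitch markers", "bin/make_pm_fix pm/*.pm"),
    ("Generate Mel cepstral coefficients", "bin/make_mcep wav/*.wav"),
    ("Generate utterance structure", festival_batch("build_utts")),
    ("Cluster the units", festival_batch("build_clunits")),
]

# Dry run: list the steps without executing anything.
for number, (description, command) in enumerate(BUILD_STEPS, start=2):
    print(f"{number}. {description}: {command}")

# The single quotes keep the Scheme expression together as one shell argument.
print(shlex.split(BUILD_STEPS[0][1])[-1])  # (build_prompts "etc/txt.done.data")
```

Manual label correction (step 4.1) is interactive and is deliberately left out of the list.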
References
• http://festvox.org
• 11-752 CMU course slides: http://festvox.org/festtut/
• 11-752 CMU course lecture notes: http://festvox.org/festtut/notes/festtut_toc.html
• Building Synthetic Voices: http://www.festvox.org/bsv/
• The Festival Speech Synthesis System: http://www.festvox.org/docs/manual-1.4.3/festival_toc.html
• Edinburgh Speech Tools Library: http://www.festvox.org/docs/speech_tools-1.2.0/book1.htm