EXPERIMENTS WITH UNIT SELECTION SPEECH DATABASES FOR INDIAN LANGUAGES S P Kishore*+, Alan W Black#, Rohit Kumar*, Rajeev Sangal* * Language Technologies Research Center International Institute of Information Technology, Hyderabad # Language Technologies Institute, Carnegie Mellon University + Institute of Software Research International, Carnegie Mellon University
ORGANIZATION OF THE TALK • Role of Language Technologies • Text to Speech Systems • Text Processing Front End • Speech Generation Component • Unit Selection Approach • Experiments • Choice of Unit Size • Generation of Databases – Content & Size of Database • Evaluation of Hindi Speech Synthesis System • Applications • Conclusion
ROLE OF LANGUAGE TECHNOLOGIES • Natural Interfaces for Information Access • Crucial Role for Multilingual Societies • Integration of Speech Recognition, Machine Translation and Speech Synthesis • For interaction between two people speaking different languages
INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS • A Text to Speech System converts arbitrary given text into a corresponding spoken waveform. • Why Text to Speech Synthesis? • Basic Blocks of a Text to Speech System: Text → Text Processing Front End → Basic Units Sequence + Prosody Information → Speech Generation Component → Speech
INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: TEXT PROCESSING FRONT END • Nature of Indian Scripts • Basic units of the Indian writing system are Aksharas • An Akshara is typically of the form V, CV, CCV • Common Phonetic Base • About 35 Consonants and 18 Vowels • Phonetic nature of the languages: what is written is what is spoken • Exception: Schwa Deletion (Inherent Vowel Suppression)
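The V / CV / CCV pattern above can be illustrated with a minimal sketch that groups a romanized phone sequence into aksharas by closing a group at each vowel. The vowel inventory, the romanization, and the function name `split_aksharas` are assumptions made for illustration; they are not taken from the talk.

```python
# Hypothetical romanized vowel inventory (an assumption, not the talk's phone set).
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}

def split_aksharas(phones):
    """Group phones into akshara-like units of the form C*V.

    Consonants are accumulated until a vowel is seen; the vowel closes
    the current group. Trailing consonants with no vowel form their own group.
    """
    aksharas, current = [], []
    for phone in phones:
        current.append(phone)
        if phone in VOWELS:
            aksharas.append(current)
            current = []
    if current:
        aksharas.append(current)
    return aksharas
```

For instance, `split_aksharas(["a", "k", "sh", "a", "r", "a"])` yields one V group, one CCV group, and one CV group.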
INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: TEXT PROCESSING FRONT END • Format of Input Text • ISCII, Unicode, Various Fonts • Can be handled by appropriate conversion module(s) • Mapping Non-Standard Words to Standard Words • NSW: symbols, digits, initials, abbreviations, punctuation, non-native words, etc.
INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: TEXT PROCESSING FRONT END • Standard Words to Phoneme Sequence • For English, this involves lexicon lookup and letter-to-sound rules • Due to the phonetic nature of Indian scripts, simple letter-to-sound rules can be used • Problems with some languages • Inherent Vowel Suppression (schwa deletion) • e.g. ratana (rtana) is spoken as ratan • Presently we are using a set of heuristic rules
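The talk does not list its heuristic rules, so the following is only a sketch of one commonly described rule: drop a word-final inherent vowel when it follows a consonant and the word has more than one vowel. The romanization, the vowel set, and the name `delete_final_schwa` are assumptions; a real rule set would also need to handle word-medial schwas.

```python
# Hypothetical romanized vowel inventory (an assumption for illustration).
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}

def delete_final_schwa(phones, schwa="a"):
    """Drop a word-final schwa that follows a consonant.

    Words with a single vowel are left unchanged so that a lone CV word
    keeps its only vowel. Returns a new list in every case.
    """
    vowel_count = sum(1 for p in phones if p in VOWELS)
    if (
        len(phones) >= 2
        and phones[-1] == schwa
        and phones[-2] not in VOWELS
        and vowel_count > 1
    ):
        return list(phones[:-1])
    return list(phones)
```

Applied to the slide's example, `delete_final_schwa(list("ratana"))` returns the phones of "ratan".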
INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: SPEECH GENERATION COMPONENT • ARTICULATORY MODEL BASED SYNTHESIS: • Involves simplistic modeling of the human speech production mechanism • Difficult to accurately model the motion of the articulators • PARAMETER BASED SYNTHESIS: • Speech segments are parameterized in terms of formant frequencies or linear prediction coefficients • Difficult to devise the large number of rules needed to model co-articulation and prosody accurately • CONCATENATION BASED SYNTHESIS: • Uses an inventory of recorded speech segments (units) • Prosodic Variations: • Intonation and duration could be acquired and incorporated in the form of rules • Store multiple realizations of units with differing prosody
INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: SPEECH GENERATION COMPONENT • Unit Selection (Data Driven) Approach • Multiple realizations of basic units with varying prosodic features are stored in the speech database • Storing and retrieving a large number of recorded units in real time is feasible thanks to cheap memory and computation power
UNIT SELECTION APPROACH • Building Speech Databases • Collection of optimal text corpora • Recording the text corpora • Automatic labeling followed by manual correction of labels • Extraction of unit features • Clustering units to facilitate selection
UNIT SELECTION APPROACH: ISSUES INVOLVED • Choice of Unit Size • Sub-word units: half phone, phone, diphone, syllable • The larger the unit size, the fewer the joins and the fewer the discontinuities • Wide coverage of units in various contexts is also desirable • Generation of Speech Databases • Approach for Optimal Selection of Utterances • Criteria for Unit Selection • The most suitable units are selected from the database by minimizing target and concatenation costs
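Minimizing the sum of target and concatenation costs over a sequence is usually done with a Viterbi-style dynamic program. The sketch below is a generic version under stated assumptions: the cost functions, the toy pitch values, and all names (`select_units`, `target_cost`, `join_cost`) are hypothetical and do not describe the system used in the talk.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target so that the total of target costs
    plus join costs between neighbours is minimal (Viterbi search).

    candidates[i] is a non-empty list of database units for targets[i].
    Returns (chosen units, total cost).
    """
    # best[i][j] = (cheapest cumulative cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, cand), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], cand), back))
        best.append(row)

    # Backtrace from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

# Toy example: a unit is (id, pitch in Hz); a target is a desired pitch.
# All values are made up for illustration.
targets = [120, 110]
candidates = [
    [("ra_1", 100), ("ra_2", 125)],
    [("tan_1", 112), ("tan_2", 150)],
]

def target_cost(target_pitch, unit):
    return abs(target_pitch - unit[1])

def join_cost(prev_unit, unit):
    return 0.5 * abs(prev_unit[1] - unit[1])

path, total = select_units(targets, candidates, target_cost, join_cost)
```

In this toy setup the search trades off how well each unit matches its target against how smoothly neighbouring units join; real systems use richer features (spectral, duration, context) for both costs.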
EXPERIMENTS: CHOICE OF UNIT SIZE • Hindi synthesizers built using different unit sizes • Syllable, diphone, phone, half phone • 24 sentences from a Hindi news bulletin synthesized • Perceptual test conducted on native Hindi-speaking subjects • AB test • Results: • Syllables performed better than diphones, phones and half phones • Half phones performed better than diphones and phones • Ref.: S. P. Kishore, Alan W. Black, “Unit Size in Unit Selection Speech Synthesis”, Eurospeech 2003, Geneva
EXPERIMENTS: CHOICE OF UNIT SIZE Example Utterances • Half Phones ★★★★ • Phones ★★ • Diphones ★★★ • Syllables ★★★★★
GENERATION OF SPEECH DATABASES • Selection of utterances with wide phonetic and prosodic coverage • High Frequency Syllables: • Syllables with relatively high occurrence in a corpus • A sentence is selected if it has at least one high frequency syllable not present in the previously selected sentences • Utterances Recorded and Labeled
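The selection rule above can be sketched as a single greedy pass over the corpus. The talk does not say how "relatively high occurrence" is defined, so the `min_count` threshold below is an assumption, as are the function name and the toy syllabified corpus.

```python
from collections import Counter

def select_sentences(sentences, min_count=2):
    """Greedy utterance selection for syllable coverage.

    sentences: list of sentences, each a list of syllables.
    A syllable counts as high frequency if it occurs at least
    min_count times in the corpus (hypothetical threshold).
    A sentence is kept if it adds at least one high frequency
    syllable not covered by the sentences kept so far.
    Returns (indices of selected sentences, covered syllables).
    """
    counts = Counter(syl for sent in sentences for syl in sent)
    high_freq = {syl for syl, n in counts.items() if n >= min_count}

    covered, selected = set(), []
    for idx, sent in enumerate(sentences):
        new = (set(sent) & high_freq) - covered
        if new:
            selected.append(idx)
            covered |= new
    return selected, covered

# Toy corpus, invented for illustration.
corpus = [["ra", "tan"], ["ra", "ma"], ["tan", "ma"], ["ki", "ra"]]
selected, covered = select_sentences(corpus)
```

Here the third and fourth sentences are skipped because every high frequency syllable they contain is already covered by the first two. Raising or lowering `min_count` changes how many syllables must be covered, and therefore the size of the recorded database.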
GENERATION OF SPEECH DATABASES: SYLLABLE COVERAGE AND DURATION OF SPEECH DATABASES • To Study: Dependency of Quality on Coverage
EXPERIMENTS: GENERATION OF SPEECH DATABASES • 6 databases with varying syllable coverage built
EXPERIMENTS: GENERATION OF SPEECH DATABASES • PERCEPTUAL TESTS • 5 subjects were asked to listen to 5 sentences and score them on a scale of 0 (worst) to 5 (best).
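Scores from such a test are typically summarized per database by their mean and variance. The sketch below shows one way to do that with the Python standard library; the database names and score values are placeholders invented for illustration and are not results from the talk.

```python
from statistics import mean, variance

def summarize_scores(scores_by_db):
    """Return {database: (mean score, sample variance)} for 0 to 5 ratings."""
    return {
        db: (mean(scores), variance(scores))
        for db, scores in scores_by_db.items()
    }

# Placeholder ratings (NOT data from the talk), one list per database.
placeholder_scores = {
    "db_A": [2, 3, 1, 4, 2],
    "db_B": [4, 4, 5, 4, 4],
}
summary = summarize_scores(placeholder_scores)
```

A higher mean indicates better perceived quality, while a lower variance indicates that listeners and sentences agree more closely on that quality.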
EVALUATION OF HINDI SPEECH SYNTHESIS SYSTEM • Text Processing Front End developed • Supports Hindi text in Unicode • Handles non-standard words such as dates, currency, digits, address abbreviations, etc. • Schwa deletion using heuristic rules • 200 sentences synthesized • 9 native Hindi-speaking subjects evaluated the perceptual quality of the synthesizer • Each subject evaluated nearly 40 of the 200 sentences • Scoring on a scale of 0 (worst) to 5 (best) • Words “Not Sounding Natural” were marked
EVALUATION OF HINDI SPEECH SYNTHESIS SYSTEM OBSERVATIONS • 30% of “Not Sounding Natural” words were loan words from English • Proper nouns were not pronounced correctly • Schwa deletion rules failed to delete the schwa in some places • Some punctuation characters were not handled properly LESSONS • Additional phonetic coverage for proper nouns and loan words required • Good text processing component needed for high quality speech synthesis
APPLICATIONS • Talking Tourists Aid • Limited Domain Synthesis • Allows a person to communicate queries about city, travel, accommodation, etc. • News Reader • Reading news from a Hindi News Portal • Screen Reader for the Visually Impaired
CONCLUSION • Syllables are better units for Indian Language Speech Synthesis • Syllable > Half Phone > Diphone > Phone • High coverage of units produces high quality speech, with lower variance in scores indicating more consistent results • Effects of loan words should be considered in the design of the speech corpus • Good text processing front end needed for high quality synthesis
QUESTIONS