Workshop: 2005/11/02 (Wed), W3C Workshop on Internationalizing SSML
SSML Extension for Korean
Sang-Jin Kim
sangjin@icu.ac.kr
Contents
• Characteristic of Korean
• SSML Extension for Chinese Characters in Korean
• SSML Extension for Homograph Words in Korean
• Conclusion
Characteristic of Korean
• Hangul, the Korean character
• Consists of forty letters
• 21 vowels (including 13 diphthongs) and 19 consonants
• Syllable
• V, CV, VC, and CVC (C: consonant, V: vowel)
• Eojeol, the word phrase, is different from a phrase in English
• Completely different from Japanese except for the grammatical structure
• Completely different from Chinese, although Korean has borrowed many Chinese words and some Chinese characters
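The syllable shapes listed above can be read off a precomposed Hangul syllable with Unicode code point arithmetic. The following is a minimal sketch, not part of the original slides; the function name `syllable_shape` is hypothetical. It works on the written form, so a final consonant cluster counts as a single "C", and the placeholder initial ㅇ is treated as "no initial consonant".

```python
# Sketch: classify a precomposed Hangul syllable as V, CV, VC, or CVC.
# Unicode encodes each syllable as
#   0xAC00 + (initial * 21 + vowel) * 28 + final
# with 19 initials, 21 vowels, and 28 finals (index 0 means "no final").
HANGUL_BASE = 0xAC00
NUM_INITIALS = 19
NUM_VOWELS = 21
NUM_FINALS = 28
SILENT_INITIAL = 11  # index of the placeholder initial ㅇ, which has no sound


def syllable_shape(ch: str) -> str:
    """Return 'V', 'CV', 'VC', or 'CVC' for one Hangul syllable character."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < NUM_INITIALS * NUM_VOWELS * NUM_FINALS:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(index, NUM_VOWELS * NUM_FINALS)
    _vowel, final = divmod(rest, NUM_FINALS)
    shape = "" if initial == SILENT_INITIAL else "C"
    shape += "V"
    if final != 0:
        shape += "C"  # a written final cluster is still counted as one "C"
    return shape


if __name__ == "__main__":
    for syllable in "아가안한":
        print(syllable, syllable_shape(syllable))
```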
Characteristic of Korean
• Vowels in Hangul, the Korean character
• Monophthong vowels classified according to tongue position and height
Characteristic of Korean
• Consonants in Hangul, the Korean character
• Consonants classified according to place and manner of articulation
SSML Extension for Chinese Characters in Korean
• Chinese Characters in Korean
• Present-day Korean and Japanese use many Chinese characters
• But the pronunciation of the characters is different
• The same character is written differently depending on the country
• These simplified characters are not used in Korea
SSML Extension for Chinese Characters in Korean
• Chinese Characters in Korean
• We can write text with Korean characters only
• It is not unusual to use Chinese characters as well
• The pronunciation is exactly the same either way
SSML Extension for Chinese Characters in Korean
• Chinese Characters in Korean TTS
• The input text for a text-to-speech (TTS) system has to be converted into a phonetic list
• If Chinese characters are mixed with Korean characters, they have to be replaced with Korean characters
• We don’t use all Chinese characters; rather, there is a list of 2000 frequently used Chinese characters recommended by the Korean government
• We need to use this list and its pronunciations in the Korean TTS system, since these pronunciations differ from those in Chinese and Japanese
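The substitution step described above can be sketched as a per-character lookup. This is only an illustration under stated assumptions: the table `HANJA_TO_HANGUL` is a tiny hypothetical excerpt, not the government list, and `substitute_hanja` is a made-up name. A real system would also need word-level rules, because some characters have several readings or change reading at the start of a word.

```python
# Hypothetical excerpt of a lexicon that maps Chinese characters to their
# Korean readings; a real table would cover the full frequently-used list.
HANJA_TO_HANGUL = {
    "韓": "한",
    "國": "국",
    "大": "대",
    "學": "학",
}


def substitute_hanja(text: str, lexicon: dict = HANJA_TO_HANGUL) -> str:
    """Replace each Chinese character found in the lexicon with its Korean reading.

    Characters that are not in the lexicon (Hangul, spaces, punctuation,
    or unlisted Chinese characters) are passed through unchanged.
    """
    return "".join(lexicon.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(substitute_hanja("大學에 간다"))
    print(substitute_hanja("韓國"))
```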
SSML Extension for Chinese Characters in Korean
• SSML Extension for Chinese Characters in Korean
• The same Chinese characters have different pronunciations depending on the country
<lexicon xml:lang="ko" uri="http://www.multilingual.org/lexicon.file"/>
<lexicon xml:lang="ko-CN" uri="http://www.multilingual.org/Chinese_lexicon_freq_KR.file"/>
<lexicon xml:lang="ko-CN" uri="http://www.multilingual.org/Chinese_lexicon_technical.file"/>
<lexicon xml:lang="ja-KR" uri="http://www.multilingual.org/Chinese_lexicon_JP.file"/>
<lexicon xml:lang="cn-KR" uri="http://www.multilingual.org/Chinese_lexicon_CN.file"/>
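One way a processor might consume the proposed attribute is to group lexicon URIs by their xml:lang value. The sketch below uses Python's standard `xml.etree.ElementTree`; the `speak` wrapper omits the SSML namespace and version attributes for brevity, the elements are written as empty elements, and `lexicons_by_lang` is a hypothetical helper, not part of any SSML processor.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The lexicon declarations from the slide, wrapped in a bare <speak> element.
SSML = """<speak>
  <lexicon xml:lang="ko" uri="http://www.multilingual.org/lexicon.file"/>
  <lexicon xml:lang="ko-CN" uri="http://www.multilingual.org/Chinese_lexicon_freq_KR.file"/>
  <lexicon xml:lang="ko-CN" uri="http://www.multilingual.org/Chinese_lexicon_technical.file"/>
  <lexicon xml:lang="ja-KR" uri="http://www.multilingual.org/Chinese_lexicon_JP.file"/>
  <lexicon xml:lang="cn-KR" uri="http://www.multilingual.org/Chinese_lexicon_CN.file"/>
</speak>"""


def lexicons_by_lang(ssml: str) -> dict:
    """Group lexicon URIs by the value of the proposed xml:lang attribute."""
    table = {}
    for lexicon in ET.fromstring(ssml).iter("lexicon"):
        table.setdefault(lexicon.get(XML_LANG), []).append(lexicon.get("uri"))
    return table


if __name__ == "__main__":
    for lang, uris in lexicons_by_lang(SSML).items():
        print(lang, uris)
```

With this grouping, a Korean TTS front end could consult the "ko-CN" lexicons when it meets a Chinese character in Korean text.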
SSML Extension for Homograph Words in Korean
• Homograph Words in Korean
• Same word, different pronunciation, different meaning
• The difference is “duration”
SSML Extension for Homograph Words in Korean
• SSML Extension for Homograph Words in Korean
• The only difference between these words is the duration of the pronunciation
• It is necessary to give duration information to the TTS system for these kinds of words
• The SSML recommendation supports the “say-as” and “sub” elements, but these elements cannot handle the above problem successfully
SSML Extension for Homograph Words in Korean
• SSML Extension for Homograph Words in Korean
• We suggest a “tone” element for this problem
• The attribute values ‘long’, ‘short’, and ‘default’ would be enough for Korean
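A processor could handle the proposed element roughly as sketched below. Everything here is an assumption made for illustration: the `type` attribute name follows the Conclusion slide, the scale factors in `DURATION_SCALE` are invented, and the example word 눈 (pronounced with a long vowel for "snow" and a short vowel for "eye") is chosen by the editor rather than taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical duration multipliers for the three proposed attribute values.
DURATION_SCALE = {"long": 1.6, "short": 0.8, "default": 1.0}


def tone_segments(ssml: str) -> list:
    """Split the text of a <speak> element into (text, duration scale) pairs.

    Text inside <tone type="..."> gets the scale for that type; all other
    text, including text inside other child elements, gets 1.0.
    """
    root = ET.fromstring(ssml)
    segments = []
    if root.text and root.text.strip():
        segments.append((root.text.strip(), 1.0))
    for child in root:
        if child.tag == "tone":
            scale = DURATION_SCALE[child.get("type", "default")]
            segments.append((child.text or "", scale))
        else:
            segments.append(("".join(child.itertext()), 1.0))
        if child.tail and child.tail.strip():
            segments.append((child.tail.strip(), 1.0))
    return segments


if __name__ == "__main__":
    # "Snow is falling": the first word is marked as the long-vowel reading.
    print(tone_segments('<speak><tone type="long">눈</tone>이 온다</speak>'))
```

A synthesizer would then lengthen or shorten the vowel of each marked segment by the given factor before generating speech.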
Conclusion
• SSML Extension for Chinese Characters in Korean
• The lexicon element doesn’t support the “xml:lang” attribute
• We suggest the attribute values xml:lang=“ko”, xml:lang=“ko-CN”, xml:lang=“ja-KR”, and xml:lang=“cn-KR”
• SSML Extension for Homograph Words in Korean
• The “say-as” and “sub” elements cannot handle the homograph problem successfully
• We suggest a “tone” element
• The attribute values type=“long”, type=“short”, and type=“default” would be enough for Korean