Presentation discussing diacritics auto-completion, focusing on Polish language diacritics disappearance, quasi-Polish text, and examples such as 'slonce' to 'słońce'. The presentation explores similar diacritic issues in other languages like Czech, German, and French.
R&D Centre, Vocal Services Section • An extension to SSML for diacritics auto-completion • W3C Workshop, Beijing, 2 November 2005
Plan of the presentation • The nature of the problem • Similarities among other languages • Possible solutions • Discussion Regarding: Diacritics auto-completion
Diacritics • A diacritical mark (or diacritic), sometimes called an accent mark, is a mark added to a letter to alter a word's pronunciation or to distinguish between similar words. Example: Polish letters with diacritics: Ą ą Ć ć Ę ę Ł ł Ń ń Ó ó Ś ś Ź ź Ż ż • The Polish alphabet as counted here has 35 letters = 26 basic + 9 with diacritics (the traditional alphabet, which omits Q, V and X, has 32) • Pronounced differently from the letters without diacritics • Included in ISO-8859-2, Unicode, CP-1250, DOS 852… • Not included in the 7-bit US-ASCII code page
Why do Polish diacritics sometimes disappear? • No possibility to enter them while typing • Application / hardware does not support non-US-ASCII characters • Improper regional settings in the OS or firmware • The code page hell • All the code pages differ from each other • Unicode (UTF-8) is still not very popular • Stripped by WWW-SMS gateways • A little bit hard to type • As a combination AltGr+<letter> on a PC keyboard („Polish programmer” variant of the US keyboard) • As the 5th or later letter on a key of a mobile phone keypad (key 2 sequence = „ABC2ĄĆ”)
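The "code page hell" point can be illustrated with a minimal Python sketch. It encodes a single Polish letter in the encodings named on the earlier slide and checks whether a text fits into 7-bit US-ASCII. The helper names `encode_all` and `fits_ascii` are illustrative only and do not come from the presentation.

```python
# Encode one Polish letter in several encodings to show that the byte
# values disagree, and check whether a text survives 7-bit US-ASCII.
LETTER = "ą"
ENCODINGS = ["iso8859_2", "cp1250", "cp852", "utf-8"]


def encode_all(ch):
    """Return a mapping: encoding name -> bytes for the given character."""
    return {name: ch.encode(name) for name in ENCODINGS}


def fits_ascii(text):
    """Return True if the text can be represented in 7-bit US-ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


encoded = encode_all(LETTER)
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

print(fits_ascii("slonce"), fits_ascii("słońce"))
```

A receiver that assumes a different code page than the sender sees a different character, and a channel limited to US-ASCII cannot carry the letter at all.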
quasi-Polish text (without diacritics) • Is not orthographically correct • Does not conform to netiquette • Is not Polish (in fact) • Cannot be transformed into Polish with simple substitution rules • Speech synthesised from this text may be incomprehensible …but: • Sometimes it is the only way to represent text • Is easier to write = can be written faster • Can be read quite easily by a human as if it were written correctly (because of the nature of the „human reading device”) thus: it is widespread in Polish e-mails, SMSes, news posts and chats
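A minimal Python sketch of why simple substitution rules cannot restore Polish: removing diacritics is a many-to-one mapping, so different words collapse into the same quasi-Polish string. The function name `strip_polish_diacritics` is an assumption made for illustration, and the table covers only the 18 Polish letters listed on the earlier slide.

```python
# Map each Polish letter with a diacritic to its base letter.
# Note that ź and ż both map to z, and ł has no combining-mark form,
# so an explicit table is used instead of Unicode normalization.
POLISH_TO_ASCII = str.maketrans(
    "ĄąĆćĘęŁłŃńÓóŚśŹźŻż",
    "AaCcEeLlNnOoSsZzZz",
)


def strip_polish_diacritics(text):
    """Return the quasi-Polish form of a text (diacritics removed)."""
    return text.translate(POLISH_TO_ASCII)


print(strip_polish_diacritics("słońce"))
print(strip_polish_diacritics("zęby"), strip_polish_diacritics("żeby"))
```

Because two distinct words produce the same output, the inverse direction needs a dictionary and context, not a character table.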
Examples • slonce -> słońce (Eng. the Sun): unambiguous mapping • maki -> maki (Eng. poppies, nominative plural) or mąki (Eng. flour, genitive singular). Question: add a diacritic or not? • zeby -> zęby (Eng. teeth, nominative plural) or żeby (Eng. in order that). Question: where to add a diacritic?
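The three examples can be reproduced with a small Python sketch that generates every possible diacritic variant of a word and keeps those found in a word list. This is one possible approach under stated assumptions, not the method of the presentation: the `LEXICON` below is a toy list holding only the words from the slide, `candidates` is a hypothetical helper, and only lowercase input is handled.

```python
from itertools import product

# For each base letter, the letters it may stand for in quasi-Polish text.
VARIANTS = {
    "a": "aą", "c": "cć", "e": "eę", "l": "lł",
    "n": "nń", "o": "oó", "s": "sś", "z": "zźż",
}

# Toy lexicon (assumption): a real system would need a full dictionary.
LEXICON = {"słońce", "maki", "mąki", "zęby", "żeby"}


def candidates(word, lexicon=LEXICON):
    """Return all lexicon words that reduce to `word` without diacritics."""
    options = [VARIANTS.get(ch, ch) for ch in word]
    found = set()
    for combo in product(*options):
        variant = "".join(combo)
        if variant in lexicon:
            found.add(variant)
    return sorted(found)


for quasi in ("slonce", "maki", "zeby"):
    print(quasi, "->", candidates(quasi))
```

A single candidate means the mapping is unambiguous. Several candidates mean that a lexicon alone is not enough and the surrounding context has to decide.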
Other languages • Czech, Slovak • The problem with diacritics is very similar to Polish • German • Umlauts „ä ü ö” and sharp s „ß” • Russian • Volapuk encoding: informal romanization used in SMSes • e.g.: „Ж” = „}” + „|” + „{” • French • Accents strongly affecting pronunciation, e.g.: „è” „é” „ê” • Other diacritics: „ë” „ï” „ô” „û” • … and many others
How to classify the problem? • a new dialect? • an alternative spelling (context dependent orthography)? • an erroneous text that requires correction (jargon)?
Example: Multi-channel access to Instant Messaging [Diagram labels: Home IM user (correct text), Mobile user via SMS gateway (text without diacritics), Message user text, IM Server, Text Processing, Speech Synthesis, Visually impaired user. Sample message: From: chris, Date: 2nd Nov 05, Time: 10h15, Msg: <msg content>]
Variant 1: correction by IM server • Do everything on the server side • The SSML content developer takes care of correct spelling in the text sent to the TTS • Text processing (correction software) is tied to the IM Server vendor, which may lead to proprietary solutions • The TTS is given correct text, so it has no problem rendering it • No need for data exchange format standardization [Diagram labels: IM Server, Proprietary Text Processing Rules, Message in SSML 1.0, TTS engine, Built-in Text Processing, Speech Synthesis]
Variant 2: correction by TTS engine • The IM server does not do anything and lets the TTS engine render the text • No additional work is required from the SSML content developer • The TTS must recognize the scope of the quasi-correct part of the text (no tags for this in current SSML) • The TTS must complete the diacritics to pronounce the text correctly [Diagram labels: IM Server, Message in SSML 1.0, TTS engine, Built-in Text Processing, Proprietary Text Processing Rules, Speech Synthesis]
Variant 3: use external lexicons • Use a special lexicon file to render the text properly: • Quite simple and easy for the SSML developer • The lexicon affects the whole file: both correct and quasi-correct parts • No context dependent rules in PLS (req. 7.3) • No prefix/suffix morphological rules in PLS (req. 7.2) • The lack of diacritics is not a pronunciation exception but a spelling error [Diagram labels: IM Server, Message in SSML 1.0, TTS engine, Lexicon-based built-in Text Processing, Text Processing Lexicons, Lexicons in PLS 1.0, Speech Synthesis]
Recommendation • Use a separate (external) correction unit for jargon • Enclose quasi-correct text in tags • Still easy for SSML developers • The text correction software knows which part of the text should be specifically pre-processed • For diacritics completion an external program can be used • For simpler cases, just a dedicated lexicon can be used • SSML needs to be extended [Diagram labels: IM Server, Message in enhanced SSML 1.0, TTS engine, Jargon Text Correction, Lexicon-based built-in Text Processing, Text Processing Lexicons, Lexicons in PLS 1.0, Speech Synthesis]
Example of SSML document (JSP)
<speak>
  ...
  User <%= sSender %> writes:
  <say-as interpret-as="jargon" format="im"> <%= sMessageContent %> </say-as>
  The message has been sent:
  <say-as interpret-as="date"> <%= sDate %> </say-as>
  at
  <say-as interpret-as="time"> <%= sTime %> </say-as>
</speak>
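A minimal Python sketch of how an external correction unit could use such markup: it rewrites only the text inside elements marked as jargon and leaves the rest of the document untouched. Several things here are assumptions: the `interpret-as="jargon"` value is the extension proposed in this presentation and is not part of SSML 1.0, the document carries no XML namespace (as in the example above), and `CORRECTIONS` and `correct_jargon` are hypothetical stand-ins for a real diacritics completion program.

```python
import xml.etree.ElementTree as ET

# Toy correction table (assumption): a real unit would call a proper
# diacritics completion program instead of a fixed dictionary.
CORRECTIONS = {"slonce": "słońce"}


def correct_word(word):
    """Return the corrected form of a word, or the word itself if unknown."""
    return CORRECTIONS.get(word, word)


def correct_jargon(ssml):
    """Correct only the text enclosed in <say-as interpret-as="jargon">."""
    root = ET.fromstring(ssml)
    for element in root.iter("say-as"):
        if element.get("interpret-as") == "jargon" and element.text:
            # split() collapses runs of whitespace, which is fine for speech.
            words = element.text.split()
            element.text = " ".join(correct_word(w) for w in words)
    return ET.tostring(root, encoding="unicode")


document = (
    "<speak>User chris writes: "
    '<say-as interpret-as="jargon" format="im">slonce swieci</say-as>'
    " Subject: slonce</speak>"
)
print(correct_jargon(document))
```

The word outside the marked element is left as typed, which is the point of marking the scope: the correction unit does not have to guess which parts of the document are quasi-correct.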
Another example
<speak>
  ...
  User <%= sSender %> writes:
  <jargon format="im"> <%= sMessageContent %> </jargon>
  The message has been sent:
  <say-as interpret-as="date"> <%= sDate %> </say-as>
  at
  <say-as interpret-as="time"> <%= sTime %> </say-as>
</speak>
Conclusions • In modern communication services people use specific language, frequently not conforming to orthographic rules (e.g. without diacritics) • Applying standard phonetization rules to erroneous text may result in incomprehensible speech • TTS for best rendering results should have complete information about the text • One SSML document can have both correct and erroneous text; there is a need to mark it • Correcting erroneous text can be context and application dependent
Questions and doubts • How many types of erroneous input should we consider? • How to handle jargon evolution? • How does the input device affect the text? • New interpret-as value or a new tag? • Scope and structure of the new tag (if applicable)? • Will future TTS be software composed of a complex text processor and an acoustic synthesis engine, or will we be able to freely choose these modules from different vendors?
Dziękujemy / Thank you
Prepared by: • Przemyslaw Zdroik, Vocal Services Section, TP S.A. Research and Development Centre, phone (+48) 22 699 56 06, e-mail Przemyslaw.Zdroik@telekomunikacja.pl • Krzysztof Majewski, Vocal Services Section, TP S.A. Research and Development Centre, phone (+48) 22 699 55 64, e-mail Krzysztof.Majewski@telekomunikacja.pl