Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006. Considerations on using PLS for Slovenian Pronunciation Lexicon Construction. Jerneja Žganec Gros, Alpineon d.o.o., Ljubljana, Slovenia, jerneja.gros@alpineon.com.
Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006 • Considerations on using PLS for Slovenian Pronunciation Lexicon Construction • Jerneja Žganec Gros • Alpineon d.o.o., Ljubljana, Slovenia • jerneja.gros@alpineon.com
Presentation outline • Introduction • SI-PRON lexicon: • word list • lexicon format • phonetic transcription • morpho-syntactic descriptions • Proposed extensions to PLS, SSML • Conclusions
Introduction • Speech technology applications: • automatic speech recognition (ASR) • text-to-speech synthesis (TTS) • require consistent specification of pronunciation • Slovenian: lexical stress position is not fixed, so a pronunciation lexicon is crucial • Pronunciation lexicons: • general • application-specific • word/phrase pronunciations • application-specific proper nouns: personal and location names
Slovenian pronunciation lexicons • General: • S5 (Gros et al., 1996) • Onomastica (Derlić and Kačič, 1997) • SImlex/SIflex (Verdonik et al., 2002) • SI-LC-STAR (Verdonik and Rojc, 2004) • AlpSynth (Gros et al., 2002) • SI-BN (Žibert, 2005; Žgank, 2005) • Application-specific: • Gopolis, SpeechDAT, etc.
Word list • SI-PRON word list: (a) 93,154 lemmas from SSKJ (b) over 1,000,000 word forms derived from (a) by morphological derivation (c) additional word list: • corpus-based search • 20,000 most frequent inflected word forms not covered by SSKJ lemmas (d) collocations, multi-word expressions • SSKJ: Slovar slovenskega knjižnega jezika
Phonetic transcriptions • SSKJ lemmas: • automatic derivation, based on dynamic/tonemic accent information • manual corrections for about 2,500 lemmas (words of foreign origin) • Word forms derived from SSKJ: • automatic: SSKJ lemma pronunciation look-up, inflectional paradigms • Additional corpus-based word list: • automatic lexical stress assignment • AlpSynth grapheme-to-phoneme rule set
GTP rules • 193 context-dependent grapheme-to-phoneme rules:
Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation
$ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporišič 91, p. 49)
= | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis 90, p. 145)
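How a context-dependent rule table of this kind might be applied can be sketched in Python. This is only an illustration, not the AlpSynth rule engine: the rule encoding (regular expressions for the contexts) is an assumption, the context symbols used on the slide are not interpreted, and the two rules are paraphrased from the rule explanations above. Letters that no rule covers are copied through unchanged.

```python
import re

# Hypothetical rule encoding: (left context, grapheme string, right context, output).
# Contexts are regular expressions derived from the rule explanations on the slide,
# not from the slide's own context symbols.
RULES = [
    ("", "er", r"(?![aeiou])", "@r"),  # "er" not followed by a vowel
    ("", "m", r"[fv]", "F"),           # <m> in front of <f> or <v>
]

def transcribe(word):
    """Apply the first matching rule at each position, scanning left to right."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for left, graph, right, phon in RULES:
            if not word.startswith(graph, i):
                continue
            end = i + len(graph)
            left_ok = re.search(f"(?:{left})$", word[:i]) is not None
            right_ok = re.match(right, word[end:]) is not None
            if left_ok and right_ok:
                out.append(phon)
                i = end
                break
        else:
            # No rule matched here: copy the letter unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)
```

With these two rules, transcribe("Gaber") gives "gab@r" and transcribe("Simfonija") gives "siFfonija"; a full rule set would also map the remaining letters to phonetic symbols.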
Transcription accuracy experiment • reference: hand-crafted pronunciation lexicon, 30K lexemes • automatic lexical stress assignment: 25% error rate • with lexical stress and o/e pronunciation known in advance: • transcription success rate 99.01% (0.6% handcrafting errors) • conclusion: • for semi-automatic derivation of Slovenian phonetic transcriptions with a 0.03% error rate, only lexical stress positions and e/o pronunciations need to be manually validated
SI-PRON format • LC-STAR lexicon specs: STTS (Shamas & v Heuvel, 2004) • Pronunciation Lexicon Specification (PLS) • W3C Voice Browser Activity • pronunciation lexicon markup language • Version 1.0, W3C Last Call Working Draft, 31 January 2006 • http://www.w3.org/TR/pronunciation-lexicon/ • Two main applications: • Speech Synthesis (SSML documents) • PLS improves SSML text normalization and grapheme-to-phoneme conversion • Speech Recognition (SRGS grammars) • W3C standard (recommendation)
PLS in SSML • SSML document references an external pronunciation lexicon: • the TTS engine loads the PLS documents and applies them to the SSML document • applications may specify contextual PLS documents, to be used at different points of the interaction (e.g. airports.pls, carriers.pls, …)
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="sl">
  <lexicon uri="http://www.alpineon.com/airports.pls"/>
  <!-- English: "The British Airlines flight arriving from Manchester will be 5 minutes late." -->
  Letalo letalske družbe British Airlines, ki prihaja iz Manchestra, bo imelo 5 minut zamude.
</speak>
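The first step a TTS front end takes here, finding which lexicons an SSML document references, can be sketched in Python with the standard library. This is a minimal sketch, not a description of any particular engine; the second URI (carriers.pls) is a hypothetical address added only to show more than one reference, and the sentence text is shortened.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Sample document; the carriers.pls URI is hypothetical.
SSML_DOC = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="sl">
  <lexicon uri="http://www.alpineon.com/airports.pls"/>
  <lexicon uri="http://www.alpineon.com/carriers.pls"/>
  Letalo bo imelo 5 minut zamude.
</speak>"""

def lexicon_uris(ssml_text):
    """Return the uri attribute of every <lexicon> element, in document order."""
    root = ET.fromstring(ssml_text)
    return [el.get("uri") for el in root.iter(f"{{{SSML_NS}}}lexicon")]
```

An engine would then fetch each URI and load the PLS document before synthesizing the text.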
Phonetic alphabet • SI-SAMPA (Zemljak et al., 2002) • Speech Assessment Methods Phonetic Alphabet • only ASCII characters, not the extended IPA character set • augmented with additional markers for tonemic accents (tonemic acute and tonemic circumflex) and lexical stress accents (acute, circumflex and grave)
PLS • The <lexeme> element is the container of a lexicon entry: • usually only one <grapheme> element • several <phoneme> or <alias> elements
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xml:lang="sl-SI" alphabet="x-sampa-SI-reduced">
  <lexeme>
    <grapheme>dober</grapheme>
    <phoneme>"d/o:-b@r</phoneme>
    <!-- This is an example of the x-sampa-SI-reduced string for the pronunciation of the Slovenian word "dober", meaning "good" in English -->
  </lexeme>
</lexicon>
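Reading such a lexicon into a lookup table can be sketched as follows. The sketch assumes the simplified document shown on the slide; the helper names are hypothetical, and element names are matched by local name so that a document declaring an XML namespace would be handled the same way.

```python
import xml.etree.ElementTree as ET

PLS_DOC = """<lexicon version="1.0" xml:lang="sl-SI" alphabet="x-sampa-SI-reduced">
  <lexeme>
    <grapheme>dober</grapheme>
    <phoneme>"d/o:-b@r</phoneme>
  </lexeme>
</lexicon>"""

def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def load_lexicon(pls_text):
    """Map each grapheme to the list of phoneme strings of its lexeme."""
    lexicon = {}
    for lexeme in ET.fromstring(pls_text).iter():
        if local(lexeme.tag) != "lexeme":
            continue
        graphemes = [c.text for c in lexeme if local(c.tag) == "grapheme"]
        phonemes = [c.text for c in lexeme if local(c.tag) == "phoneme"]
        for g in graphemes:
            lexicon.setdefault(g, []).extend(phonemes)
    return lexicon
```

The order of the phoneme list follows document order, which matters once a lexeme carries several pronunciations.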
Pronunciation variations • multiple pronunciations: • several <phoneme> elements • preferred pronunciation: • indicated by the prefer attribute • usually the first pronunciation from the SSKJ • for some words, two pronunciations are equally preferred • EXAMPLE: masculine Slovenian nouns ending in "ilec", such as /borilec/, /darovalec/ • "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts" • the variants typically correspond to a more fluent ("iUts") or overarticulated ("ilts") pronunciation
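Choosing among several pronunciations can be sketched as below: take the one marked prefer="true", otherwise fall back to the first one listed. The phoneme strings are simplified placeholders built around the "iUts"/"ilts" endings mentioned above, not real SI-SAMPA transcriptions, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder transcriptions; only the endings come from the slide.
LEXEME = """<lexeme>
  <grapheme>borilec</grapheme>
  <phoneme>boriUts</phoneme>
  <phoneme prefer="true">borilts</phoneme>
</lexeme>"""

def preferred(lexeme_xml):
    """Return the phoneme marked prefer="true", else the first phoneme, else None."""
    phonemes = ET.fromstring(lexeme_xml).findall("phoneme")
    for p in phonemes:
        if p.get("prefer") == "true":
            return p.text
    return phonemes[0].text if phonemes else None
```

A single boolean marker cannot express that two pronunciations are equally preferred, which is part of the motivation for the extensions proposed next.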
Extensions… • proposed extension: • a new optional attribute for the <phoneme> element: • pron-style attribute • values: "fluent", "overarticulated" • pron-style also for other elements: • <voice>, <speak>, <p>, <s> • another optional attribute for the above elements: emotion, for expressive TTS?
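How the proposed attribute could be consumed is sketched below. Note that pron-style is only the extension proposed on this slide and is not part of PLS; the sample lexeme, its placeholder transcriptions and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# "pron-style" is the PROPOSED attribute from the slide, not an existing PLS feature.
LEXEME = """<lexeme>
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">boriUts</phoneme>
  <phoneme pron-style="overarticulated">borilts</phoneme>
</lexeme>"""

def pick_by_style(lexeme_xml, style):
    """Return the phoneme whose pron-style equals `style`, else the first phoneme."""
    phonemes = ET.fromstring(lexeme_xml).findall("phoneme")
    for p in phonemes:
        if p.get("pron-style") == style:
            return p.text
    return phonemes[0].text if phonemes else None
```

If the same attribute were allowed on <voice>, <speak>, <p> or <s>, the requested style could be inherited from the enclosing SSML element and passed in as the `style` argument.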
Extensions… • dialects: • user-friendly applications require dialect/sociolect pronunciation variations • another optional attribute for the following elements: <phoneme>, <voice>, <speak>, <p>, <s> • RFC 3066-like identifiers may be used to indicate dialects
Extensions… • source/creator: • currently only the <metadata> element • source of multiple pronunciations: • useful information when merging multiple PLS documents • some sources/creators may be more reliable than others… • additional optional attribute pron-source for the <phoneme> element
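Why per-pronunciation source information helps when merging can be sketched with plain dictionaries. This is an illustration under stated assumptions: the source names "sskj" and "corpus", the reliability ranking and the placeholder transcriptions are hypothetical, and the source label stands in for the proposed pron-source attribute.

```python
def merge_lexicons(lexicons, ranking):
    """Merge per-source lexicons, listing pronunciations from more reliable sources first.

    lexicons: dict mapping source name -> {grapheme: [phoneme strings]}
    ranking:  list of source names, most reliable first
    Returns {grapheme: [(phoneme, source), ...]} without duplicate phonemes.
    """
    merged = {}
    for source in ranking:
        for grapheme, phonemes in lexicons.get(source, {}).items():
            entries = merged.setdefault(grapheme, [])
            for ph in phonemes:
                # Keep only the first (most reliable) occurrence of a pronunciation.
                if all(ph != existing for existing, _ in entries):
                    entries.append((ph, source))
    return merged

# Hypothetical sources and placeholder transcriptions.
LEXICONS = {
    "corpus": {"borilec": ["boriUts", "borilts"]},
    "sskj": {"borilec": ["borilts"]},
}
MERGED = merge_lexicons(LEXICONS, ["sskj", "corpus"])
```

Here MERGED["borilec"] lists "borilts" first, attributed to the higher-ranked source, followed by "boriUts" from the corpus-derived lexicon.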
Extensions… • part-of-speech tags: • Slovenian language: complex inflectional paradigm • including the dual, as in ancient Greek • morphological, syntactic and semantic descriptors welcome in future revisions of the PLS document • proprietary <lemma>, <MSD> elements used in SI-PRON • MULTEXT-East MSDs (Erjavec, 2004)
Conclusion • SI-PRON pronunciation lexicon for Slovenian • proposed extensions to PLS, SSML • pron-style attribute • emotion attribute • annotating dialects/sociolects • source/creator attribute • morpho-syntactic, semantic descriptors
Project Partners • L6-5405 project • Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources • Spoken representation of Slovenian words: • http://bos.zrc-sazu.si/sskj.html • Alpineon • ZRC-SAZU • Fran Ramovš Institute of the Slovenian Language