Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006. Considerations on using PLS for Slovenian Pronunciation Lexicon Construction. Jerneja Žganec Gros, Alpineon d.o.o., Ljubljana, Slovenia, jerneja.gros@alpineon.com.
Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006 • Considerations on using PLS for Slovenian Pronunciation Lexicon Construction • Jerneja Žganec Gros • Alpineon d.o.o., Ljubljana, Slovenia • jerneja.gros@alpineon.com
ALPINEon • SI-PRON lexicon: • word list • lexicon format • phonetic transcription • morpho-syntactic descriptions • Proposed extensions to PLS, SSML • Conclusions
Language specifics • Slovenian language: • Slavic language, 2 million speakers, over 70 dialects • complex inflectional paradigm (common to Slavic languages) • including the "dual" number, as in ancient Greek • lexical stress position is not fixed (unlike some other Slavic languages; e.g. Croatian never carries the accent on the last syllable) • many homographs; POS information usually helps with disambiguation: • example: "On je." (He is / He eats), where "je" is tagged either auxiliary_verb or indicative
Pron lex • Speech technology applications: • automatic speech recognition (ASR) • text-to-speech synthesis (TTS) • both require a consistent specification of pronunciation • Slovenian: lexical stress position not fixed -> pron lex crucial • Pronunciation lexicons: • general: not supposed to be covered by PLS • application-specific: • word/phrase pronunciations • application-specific proper nouns: personal and location names
Slovenian pron lex • General: • S5 (Gros et al., 1996) • Onomastica (Derlić and Kačič, 1997) • SImlex/SIflex (Verdonik et al., 2002) • SI-LC-STAR (Verdonik and Rojc, 2004) • AlpSynth (Gros et al., 2002) • SI-BN (Žibert, 2005; Žgank, 2005) • Application-specific: • Gopolis, SpeechDAT, etc.
Word-list • SI-PRON wordlist: (a) 93,154 lemmas from SSKJ (b) over 1,000,000 word forms derived from (a) by morphological derivation (c) additional word list: • corpus-based search • 20,000 most frequent inflected word forms not covered by SSKJ lemmas (d) collocations, multi-word expressions • SSKJ: Slovar slovenskega knjižnega jezika
Phonetic transcriptions • SSKJ lemmas: • automatic derivation, based on dynamic/tonemic accent information • manual corrections for about 2,500 lemmas (words of foreign origin) • Word forms derived from SSKJ: • automatic: SSKJ lemma pronunciation look-up, inflectional paradigms • Additional corpus-based word list: • automatic lexical stress assignment • AlpSynth grapheme-to-phoneme rule set
GTP rules • 193 context-dependent grapheme-to-phoneme rules:
Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation
$ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporišič 91, p. 49)
= | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis 90, p. 145)
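A minimal sketch of how context-dependent rules of this kind could be applied, not the AlpSynth implementation. The slide does not define its context symbols ($, =, _), so this sketch encodes contexts as regular expressions instead; the `Rule` type, the `transcribe` function, and the letter-to-itself fallback are all assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    left: str       # regex that must match at the end of the text before the graphemes
    graphemes: str  # grapheme string the rule rewrites
    right: str      # regex that must match at the start of the text after the graphemes
    phonemes: str   # phonetic output


# Two toy rules modelled on the slide's examples (the regex encoding is an assumption).
RULES = [
    Rule(left=r"[^aeiou]", graphemes="er", right=r"$", phonemes="@r"),  # Gaber
    Rule(left=r"", graphemes="m", right=r"[fv]", phonemes="F"),         # Simfonija
]


def transcribe(word, rules=RULES):
    """Scan left to right; the first rule whose contexts match wins."""
    w = word.lower()
    out = []
    i = 0
    while i < len(w):
        for rule in rules:
            end = i + len(rule.graphemes)
            if (w.startswith(rule.graphemes, i)
                    and re.search(rule.left + r"$", w[:i])
                    and re.match(rule.right, w[end:])):
                out.append(rule.phonemes)
                i = end
                break
        else:
            # No rule fired: copy the letter unchanged (placeholder default).
            out.append(w[i])
            i += 1
    return "".join(out)


print(transcribe("Gaber"))      # gab@r
print(transcribe("Simfonija"))  # siFfonija
```

A real rule set would also need an explicit rule ordering policy and default mappings for every grapheme; the first-match strategy here is only one plausible choice.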
Transcription accuracy experiment • reference: hand-crafted pron lex, 30K lexemes, no loanwords(!) • automatic lexical stress assignment: 15% error rate • lexical stress & e/o pronunciation known in advance: • transcription success rate 99.1% (0.6% handcrafting errors) • conclusion: • for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate, only lexical stress positions & e/o pronunciations need to be manually validated
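One plausible reading of these figures, stated as an assumption rather than taken from the slide: the 0.6% counts mismatches caused by errors in the hand-crafted reference itself, so the residual error attributable to the automatic transcription is what remains of the 0.9% total mismatch.

```latex
\[
  e_{\text{residual}}
  = \underbrace{(100\% - 99.1\%)}_{\text{all mismatches}}
  - \underbrace{0.6\%}_{\text{reference (handcrafting) errors}}
  = 0.3\%
\]
```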
SI-PRON format • LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004) • Pronunciation Lexicon Specification (PLS) • Version 1.0, W3C Last Call Working Draft 31 January 2006 • http://www.w3.org/TR/pronunciation-lexicon/ • PLS: • Ver 1.0 not designed for TTS internal lexicons • on the other hand, we want to have a stronger link between SSML and the lexicon • we are even thinking of introducing POS attribute into token-like elements! • leave these issues for PLS Ver 2.x or address them now?
Pronunciation variations • multiple pronunciations: • several <phoneme> elements • preferred pronunciation: • indicated by the prefer attribute of <phoneme> • usually the 1st pronunciation from the SSKJ • for some words, 2 prons are equally preferred, e.g.: - masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/ • "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts" • the variants typically correspond to a more fluent "iUts" or an overarticulated "ilts" pronunciation
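A minimal sketch of a PLS lexeme carrying two pronunciations, built with Python's standard `xml.etree.ElementTree`. The `build_lexeme` helper is hypothetical, and the transcription strings are placeholders formed by attaching the slide's endings to the orthographic stem, not verified transcriptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by the W3C Pronunciation Lexicon Specification.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
ET.register_namespace("", PLS_NS)


def build_lexeme(grapheme, pronunciations, preferred=None):
    """Build one <lexeme> with a <phoneme> child per pronunciation."""
    lexeme = ET.Element(f"{{{PLS_NS}}}lexeme")
    ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
    for pron in pronunciations:
        phoneme = ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme")
        phoneme.text = pron
        if pron == preferred:
            phoneme.set("prefer", "true")  # marks the preferred pronunciation
    return lexeme


# Placeholder transcriptions (stem + ending from the slide), for illustration only.
entry = build_lexeme("borilec", ["boriUts", "borilts"], preferred="boriUts")
print(ET.tostring(entry, encoding="unicode"))
```

Note that a single `prefer` flag cannot express "both variants are equally preferred", which is the case the slide raises for the "ilec" nouns.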
Extensions… • proposed extension for PLS/SSML: • a new optional attribute for the <phoneme> element: • pron-style attribute • values: "fluent", "overarticulated" • pron-style also for other elements (linkage SSML-lex!): • <voice>, <speak>, <p>, <s> • another optional attribute for the above elements: emotion, for expressive TTS? • could this be covered by the new role attribute? • similar to <speaking_style>, proposed yesterday
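A rough sketch of how a synthesizer might use the proposed attribute. `pron-style` is the extension proposed on this slide and is not part of PLS 1.0; the `pick_pronunciation` helper, the namespace-free XML fragment, and the transcription strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexeme using the proposed (non-standard) pron-style attribute.
LEXEME_XML = """
<lexeme>
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">boriUts</phoneme>
  <phoneme pron-style="overarticulated">borilts</phoneme>
</lexeme>
"""


def pick_pronunciation(lexeme, style):
    """Return the pronunciation tagged with `style`, else the first one listed."""
    phonemes = lexeme.findall("phoneme")
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    return phonemes[0].text if phonemes else None


lexeme = ET.fromstring(LEXEME_XML)
print(pick_pronunciation(lexeme, "overarticulated"))  # borilts
```

In this reading, the requested style would come from a `pron-style` attribute on an enclosing SSML element such as <voice> or <s>, which is the SSML-lexicon linkage the slide asks for.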
Extensions… • dialects: • user-friendly apps require dialect/sociolect pronunciation variations • another optional attribute for the following elements: <phoneme>, <voice>, <speak>, <p>, <s> - RFC 3066-like identifiers may be used to indicate dialects
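If dialects are identified by RFC 3066-like tags, a lookup can fall back from a specific tag to a more general one. The sketch below assumes lower-case keys and a hypothetical `select_variant` helper; "sl-rozaj" (Resian) is used only as an example of a dialect-level tag, and the values are placeholders.

```python
def select_variant(requested, variants):
    """Return the entry for the most specific tag that matches `requested`.

    `variants` maps lower-case language tags to arbitrary payloads
    (for example pronunciation strings). Subtags are dropped from the
    right until a match is found; None is returned if nothing matches.
    """
    parts = requested.lower().split("-")
    while parts:
        tag = "-".join(parts)
        if tag in variants:
            return variants[tag]
        parts.pop()  # drop the last subtag and retry with a more general tag
    return None


# Placeholder payloads; real entries would hold transcriptions.
variants = {"sl": "standard pronunciation", "sl-rozaj": "dialect pronunciation"}
print(select_variant("sl-rozaj", variants))  # dialect pronunciation
print(select_variant("sl-SI", variants))     # standard pronunciation
```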
Extensions… • PLS source/creator: • currently only via the <metadata> element • source of multiple pronunciations: • useful info when merging multiple PLS documents • some sources/creators may be more reliable than others… - additional optional attribute pron-source for the <phoneme> element
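To illustrate why per-pronunciation provenance helps when merging, here is a small sketch that merges lexicons held as plain dictionaries and records which source contributed each pronunciation. The source names, reliability scores, and the `merge_lexicons` function are hypothetical; `pron-source` is the attribute proposed on this slide, not an existing PLS feature.

```python
def merge_lexicons(lexicons, reliability):
    """Merge {source: {grapheme: [pron, ...]}} into {grapheme: [(pron, source), ...]}.

    Sources are visited from most to least reliable, so each distinct
    pronunciation is attributed to the most reliable source that lists it,
    and pronunciations from better sources come first.
    """
    merged = {}
    ordered = sorted(lexicons, key=lambda s: reliability.get(s, 0), reverse=True)
    for source in ordered:
        for grapheme, prons in lexicons[source].items():
            entry = merged.setdefault(grapheme, [])
            for pron in prons:
                if all(pron != known for known, _ in entry):
                    entry.append((pron, source))
    return merged


# Hypothetical sources and placeholder transcriptions.
lexicons = {
    "lexA": {"borilec": ["borilts"]},
    "lexB": {"borilec": ["boriUts", "borilts"]},
}
print(merge_lexicons(lexicons, {"lexA": 1, "lexB": 2}))
```

Each `(pron, source)` pair corresponds to what a `pron-source` attribute on <phoneme> would carry in the merged document.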
Extensions… • part-of-speech tags: • Slovenian: complex inflectional paradigm • morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification • SSML: POS tags could be defined as an optional attribute of the <token> element • lemma, MSD attributes used in SI-PRON • MULTEXT-East MSDs (Erjavec, 2004): Telri, Concede, Multext-East LRs, http://nl.ijs.si/ME/V3 • EAGLES, TEI P4 compliant
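A minimal sketch of POS-driven homograph lookup, using the "On je." example from earlier in the talk. The tag names follow that slide; the lexicon layout, the `pronounce` function, and the placeholder strings standing in for real transcriptions are assumptions.

```python
# Lexicon keyed by (word form, POS tag). The values are placeholders,
# not real transcriptions of the two readings of "je".
LEXICON = {
    ("je", "auxiliary_verb"): "PRON_JE_AUX",  # "is"
    ("je", "indicative"): "PRON_JE_EAT",      # "eats"
}


def pronounce(token, pos=None):
    """Return candidate pronunciations; a known POS tag narrows them to one."""
    word = token.lower()
    if pos is not None and (word, pos) in LEXICON:
        return [LEXICON[(word, pos)]]
    # Without usable POS information, return every candidate for the word form.
    return [pron for (form, _), pron in LEXICON.items() if form == word]


print(pronounce("je", pos="indicative"))  # ['PRON_JE_EAT']
print(pronounce("je"))                    # ['PRON_JE_AUX', 'PRON_JE_EAT']
```

With a POS attribute on the SSML <token> element, the tag passed to a function like `pronounce` could come straight from the markup instead of from a tagger inside the synthesizer.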
MSDs
MSDs • TTS-internal lexicon (for highly inflected languages) • full-blown form (PLS or other) • compact lexicons: • exception lexicon • derivational scheme/paradigm providing prefix/suffix morphological rules and indications of lexical stress position shifts (hardly an issue for PLS)
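The compact-lexicon idea can be sketched as an exception list consulted before a suffix paradigm. This is an orthography-only illustration under stated assumptions: the `expand` and `forms_for` helpers are hypothetical, the paradigm covers only the distinct singular forms of "miza", the exception entry is deliberately partial, and stress shifts are not modelled.

```python
# Suffix paradigm as (strip, add) pairs: singular forms of an a-stem feminine noun.
PARADIGM_A_FEM_SG = [("a", "a"), ("a", "e"), ("a", "i"), ("a", "o")]

# Exception lexicon: irregular items are listed explicitly (partial, illustrative).
EXCEPTIONS = {"človek": ["človek", "ljudje"]}


def expand(lemma, paradigm):
    """Generate word forms by replacing the lemma's ending per paradigm rule."""
    forms = []
    for strip, add in paradigm:
        if not lemma.endswith(strip):
            raise ValueError(f"{lemma!r} does not fit this paradigm")
        form = lemma[: len(lemma) - len(strip)] + add
        if form not in forms:  # keep each distinct form once
            forms.append(form)
    return forms


def forms_for(lemma, paradigm=PARADIGM_A_FEM_SG):
    """Exception lexicon first, rule-based expansion otherwise."""
    if lemma in EXCEPTIONS:
        return EXCEPTIONS[lemma]
    return expand(lemma, paradigm)


print(forms_for("miza"))    # ['miza', 'mize', 'mizi', 'mizo']
print(forms_for("človek"))  # ['človek', 'ljudje']
```

A full-form PLS lexicon would instead list every generated form as its own lexeme, which is why the slide treats the paradigm machinery as outside the scope of PLS.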
Conclusion • SI-PRON pronunciation lexicon for Slovenian • proposed extensions to PLS, SSML: • pron-style attribute • emotion attribute • annotating dialects/sociolects • source/creator attribute • morpho-syntactic, semantic descriptors
Project Partners • L6-5405 project • Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources • Spoken representation of Slovenian words: • http://bos.zrc-sazu.si/sskj.html • Alpineon • ZRC-SAZU • Fran Ramovš Institute of the Slovenian Language