Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Tim Schlippe, Wolf Quaschningk, Tanja Schultz
Outline
• Motivation and Goals
• Experimental Setup
  • Grapheme-to-phoneme converters
  • Data
• Experiments and Results
  • Single grapheme-to-phoneme converters' performance
  • Phoneme-level combination scheme
  • Adding web-driven grapheme-to-phoneme converters
  • Automatic speech recognition experiments
• Conclusion and Future Work
Motivation
• About 7,100 languages exist in the world (www.ethnologue.com)
  • Only a few languages have speech processing systems
• Pronunciation dictionaries are needed for text-to-speech and automatic speech recognition (ASR)
• Manual production of pronunciations is slow and costly
  • 19.2–30 s per word for Afrikaans (Davel and Barnard, 2004)
• Automatic grapheme-to-phoneme (G2P) conversion
  • But: consistent pronunciations are reached only at about 3.7k word-pronunciation pairs for training (30k phoneme tokens)
• Methods to reduce manual effort are needed
Goals
• Common approaches use their single favorite G2P conversion tool
• Idea:
  • Use synergy effects of multiple G2P converters
  • Converters that are close in performance but differ in their errors
  • Their outputs provide complementary information
• Achieve pronunciations with higher quality through combination of G2P converter outputs
  • Reduce manual effort in semi-automatic methods
  • Impact on ASR performance
Grapheme-to-phoneme converters (according to Bisani and Ney, 2008)
• Example: c a r s → K AX 9r S
Data
• Languages (different grades of G2P relationship):
  • English, German, French, Spanish
• Dictionaries:
  • English: CMU dictionary
  • German, Spanish: GlobalPhone
  • French: Quaero Project
• Data sets (randomly chosen):
  • Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs (different small training data sizes to simulate low resources)
  • Development / test set: 10k word-pronunciation pairs (disjoint)
Analysis of Single G2P Converter Outputs
• Edit distance to reference pronunciations at phoneme level (phoneme error rate (PER))
• Lower PERs with increasing amount of training data
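The PER measure can be sketched in a few lines of Python. This is a minimal illustration, assuming that PER is the Levenshtein distance between hypothesis and reference phoneme sequences divided by the reference length; the function names are hypothetical, and the evaluation in the talk may aggregate errors over the whole test set rather than per word.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER in percent for one word; assumes a non-empty reference."""
    if not ref:
        raise ValueError("reference pronunciation must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Reference for "cars" as used later in the deck
reference = "K AA 9r ZH".split()
print(phoneme_error_rate(reference, "K EH 9r ZH".split()))  # 25.0
print(phoneme_error_rate(reference, "K AE ZH".split()))     # 50.0
```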
Analysis of Single G2P Converter Outputs
• Lowest PERs are achieved with Sequitur and Phonetisaurus for all languages and data sizes; even Moses is very close for de
Analysis of Single G2P Converter Outputs
• For 200 en and fr W-P pairs, Rules outperforms Moses
Phoneme-level combination scheme
• Based on ROVER (Recognizer Output Voting Error Reduction) (Fiscus, 1997), traditionally applied at word level
• Voting Module
  • By frequency of occurrence, since G2P confidence scores are not reliable
Phoneme-level combination scheme
• Example (trained with 200 W-P pairs):
• Reference: cars K AA 9r ZH

Converter          Output        PER
Sequitur G2P       K EH 9r ZH    25%
Phonetisaurus      K AA ZH       25%
CART               K AE ZH       50%
Moses              K AA 9r S     25%
1:1 G2P (Rules)    K AX 9r S     50%
PLC output         K AA 9r ZH    0%
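The combination in this example can be reproduced with a small ROVER-style sketch: the converter outputs are aligned one after another into a network of slots, and each slot is decided by majority vote. This is an illustration under stated assumptions, not the authors' implementation: the function names, the unit costs, the null symbol handling, and the tie-breaking in the backtrace are all choices made here, and a different hypothesis order or tie-breaking rule can change the result.

```python
from collections import Counter

NULL = "@"  # null symbol for "no phoneme in this slot"


def align_to_network(network, hyp):
    """Align one phoneme sequence to the slot network and return the extended network.

    Each slot is a list with one entry per hypothesis seen so far.
    Assumed costs: 0 if the phoneme already occurs in the slot, otherwise 1;
    skipping a slot or opening a new slot costs 1.
    """
    n, m = len(network), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hyp[j - 1] in network[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)

    seen = len(network[0]) if network else 0  # hypotheses already in the network
    merged = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # slot without a matching phoneme: this hypothesis votes for NULL
            merged.append(network[i - 1] + [NULL])
            i -= 1
        elif (i > 0 and j > 0 and cost[i][j] ==
              cost[i - 1][j - 1] + (0 if hyp[j - 1] in network[i - 1] else 1)):
            merged.append(network[i - 1] + [hyp[j - 1]])
            i -= 1
            j -= 1
        else:
            # extra phoneme: new slot, earlier hypotheses vote for NULL
            merged.append([NULL] * seen + [hyp[j - 1]])
            j -= 1
    merged.reverse()
    return merged


def combine(hypotheses):
    """Vote by frequency of occurrence in each slot; slots won by NULL are dropped."""
    network = []
    for hyp in hypotheses:
        network = align_to_network(network, hyp)
    winners = (Counter(slot).most_common(1)[0][0] for slot in network)
    return [phoneme for phoneme in winners if phoneme != NULL]


outputs = [
    "K EH 9r ZH".split(),  # Sequitur G2P
    "K AA ZH".split(),     # Phonetisaurus
    "K AE ZH".split(),     # CART
    "K AA 9r S".split(),   # Moses
    "K AX 9r S".split(),   # 1:1 G2P (Rules)
]
print(combine(outputs))  # ['K', 'AA', '9r', 'ZH']
```

With these five outputs the vote shares per slot are K 5/5, AA 2/5, 9r 3/5 and ZH 3/5, so the combined pronunciation equals the reference even though no single converter produced it.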
Phoneme-level combination
• Relative PER change compared to best single converter output
• In 10 of 16 cases the combination is equal or better
Phoneme-level combination
• Most improvement for de and en (used in the ASR experiments)
Phoneme-level combination
• For es (most regular G2P relationship), the combination never improves
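The relative PER change used on these slides can be written as a one-line function. This is a sketch, assuming the change is measured relative to the best single converter's PER and that negative values denote an improvement; the function name and the sign convention are assumptions, not taken from the slides.

```python
def relative_per_change(per_combined, per_best_single):
    """Relative PER change in percent; negative means the combination is better."""
    if per_best_single == 0:
        raise ValueError("relative change is undefined for a best single PER of 0")
    return 100.0 * (per_combined - per_best_single) / per_best_single


# "cars" example from the deck: best single converter 25% PER, PLC output 0% PER
print(relative_per_change(0.0, 25.0))  # -100.0
```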
Wiktionary
• 39 Wiktionary editions with more than 1k IPA pronunciations (June 2012)
• Growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List_of_Wiktionaries)
• T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation, Speech Communication, vol. 56, pp. 101–118, January 2014
Wiktionary
• Additional G2P converters based on word-pronunciation pairs in Wiktionary
• Figure: internal consistency (PER %), with 4.6k, 1.5k, 3.8k and 3.3k W-P pairs
Data
• Filtered web-derived pronunciations
  • Fully automatic methods from (Schlippe, 2012a, 2012b, 2014)
  • ~15% with each filtering method
Phoneme-level combination
• Relative PER change compared to best single converter output
• PLC-unfiltWDP is already better than w/oWDP
Phoneme-level combination
• Filtering web-derived pronunciations helps
• 23.1% rel. PER reduction
ASR experiments
• Replace dictionaries in de & en recognizers with pronunciations generated with G2P converters
• Train and decode the systems
• Word Error Rate (WER)
• As in PER evaluation: Sequitur and Phonetisaurus very good in most cases
• However: Rules results in lowest WERs for most scenarios
ASR experiments
• In only 1 case is PLC-w/oWDP better than or equal to the best single converter
ASR experiments
• Filtering web-derived word-pronunciation pairs helps
ASR experiments
• Confusion Network Combination (CNC) outperforms PLC
ASR experiments
• In 9 cases, adding a system with PLC helps in CNC
Conclusion and Future Work
• In most cases, PLC comes closer to validated reference pronunciations than the single converters
• Web-derived word-pronunciation pairs can further improve quality (filtering the web data is helpful)
• Weighting single G2P converters' outputs gave no improvement
  • according to performance on dev set
  • according to converters' confidences
• Potential to enhance semi-automatic pronunciation dictionary creation by reducing the human editing effort
Conclusion and Future Work
• The positive impact of the combination in terms of lower PERs had only little influence on the WERs of our ASR systems
• Including systems with pronunciation dictionaries that have been built with PLC in CNC can lead to improvements
• Future work:
  • Embedding PLC and web-derived pronunciations into the semi-automatic pronunciation dictionary creation
  • Further languages and further G2P converters
Thank you for your attention!
ASR experiments
• In 6 cases the system with PLC is better than or equal to the best single converter
Data
• Filtered web-derived pronunciations
  • Threshold for each filtering method dependent on mean µ and standard deviation σ of the measure in focus
• 1st-stage filtering (Len / Eps / M2NAlign): word-pronunciation pairs → prefiltering → filtered word-pronunciation pairs
  • (Black et al., 1998)
  • (Martirosian and Davel, 2007)
  • (Schlippe, 2012a, 2012b)
Data
• Filtered web-derived pronunciations
  • Threshold for each filtering method dependent on mean µ and standard deviation σ of the measure in focus
• 1st-stage filtering (Len / Eps / M2NAlign): word-pronunciation pairs → prefiltering → filtered word-pronunciation pairs
• 2nd-stage filtering (G2P): train "reliable" g2p model → apply g2p model to words → edit distance < threshold → remaining word-pronunciation pairs
Phoneme-level combination scheme
• Example (trained with 200 W-P pairs):
• Reference: cars K AA 9r ZH

Sequitur (25% PER)         K EH 9r ZH
Phonetisaurus (25% PER)    K AA ZH
CART (50% PER)             K AE ZH
Moses (25% PER)            K AA 9r S
1:1 G2P (50% PER)          K AX 9r S
Phoneme-level combination scheme
• Alignment Module

K    EH     9r     ZH
K    AA     @      ZH
K    AE     @      ZH
K    AA     9r     S
K    AX     9r     S

• Voting Module
  • By frequency of occurrence, since G2P confidence scores are not reliable

K    AA     9r     ZH
(1)  (0.4)  (0.6)  (0.6)
Data
• Filtered WDPs
• German: G2PLen
• Length filtering:
  1. Remove a pronunciation if the ratio of grapheme and phoneme tokens is smaller than µLen − σLen or larger than µLen + σLen
• G2P filtering:
  2.1 Train G2P models with the remaining, more "reliable" W-P pairs
  2.2 Apply the G2P models to convert a grapheme string into a most likely phoneme string
  2.3 Remove a pronunciation if the edit distance between the synthesized phoneme string and the pronunciation in question is smaller than µG2P − σG2P or larger than µG2P + σG2P
• PER reduction: 16.74 14.17
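The µ ± σ thresholding shared by the filtering methods can be sketched as follows, here with the length ratio as the measure. This is a minimal illustration: the function names and the toy word-pronunciation pairs are hypothetical, and using the population standard deviation is an assumption, since the slides do not say which estimator was used.

```python
from statistics import mean, pstdev


def filter_by_mean_std(pairs, measure):
    """Keep the pairs whose measure lies within [mu - sigma, mu + sigma]."""
    values = [measure(word, phonemes) for word, phonemes in pairs]
    mu, sigma = mean(values), pstdev(values)  # population std is an assumption
    return [pair for pair, value in zip(pairs, values)
            if mu - sigma <= value <= mu + sigma]


def length_ratio(word, phonemes):
    """Ratio of grapheme tokens to phoneme tokens (the Len measure)."""
    return len(word) / len(phonemes)


# Hypothetical web-derived pairs; the last pronunciation is clearly truncated
pairs = [
    ("cars", ["K", "AA", "9r", "ZH"]),
    ("cat", ["K", "AE", "T"]),
    ("dog", ["D", "AO", "G"]),
    ("sun", ["S", "AH", "N"]),
    ("tree", ["T", "R", "IY"]),
    ("bad", ["B"]),
]
kept = filter_by_mean_std(pairs, length_ratio)
print([word for word, _ in kept])  # ['cars', 'cat', 'dog', 'sun', 'tree']
```

Passing the measure as a function keeps the thresholding separate from the measure itself, so an alignment score or an edit distance to a synthesized pronunciation could be plugged in the same way.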
Data
• Filtered WDPs
• English, Spanish: M2NAlign
  • Perform an m-n G2P alignment (Black et al., 1998)
  • Remove a pronunciation if the alignment score is smaller than µG2P − σG2P or larger than µG2P + σG2P
• English PER reduction: 33.18 26.13
• Spanish PER reduction: 10.25 10.90
Data
• Filtered WDPs
• French: Eps (according to Martirosian and Davel, 2007)
  • Perform a 1-1 G2P alignment (Black et al., 1998); the alignment process involves the insertion of graphemic and phonemic nulls (epsilons) into the lexical entries of words
  • Remove a pronunciation if the proportion of graphemic and phonemic nulls is smaller than µG2P − σG2P or larger than µG2P + σG2P
• PER reduction: 14.96 13.97