
Tim Schlippe, Wolf Quaschningk, Tanja Schultz

Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios


Presentation Transcript


  1. Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Tim Schlippe, Wolf Quaschningk, Tanja Schultz

  2. Outline • Motivation and Goals • Experimental Setup • Grapheme-to-phoneme converters • Data • Experiments and Results • Single grapheme-to-phoneme converters' performance • Phoneme-level combination scheme • Adding web-driven grapheme-to-phoneme converters • Automatic speech recognition experiments • Conclusion and Future Work

  3. Motivation • About 7,100 languages exist in the world (www.ethnologue.com) • Only a few languages have speech processing systems • Pronunciation dictionaries are needed for text-to-speech and automatic speech recognition (ASR) • Manual production of pronunciations is slow and costly • 19.2–30 s per word for Afrikaans (Davel and Barnard, 2004) • Automatic grapheme-to-phoneme (G2P) conversion • But: consistent pronunciations are reached only from about 3.7k word-pronunciation pairs for training (30k phoneme tokens) → Methods to reduce manual effort

  4. Goals • Common approaches use a single favorite G2P conversion tool • Idea: • Use synergy effects of multiple G2P converters • Converters that are close in performance but differ in their errors • They provide complementary information → Achieve pronunciations of higher quality by combining G2P converter outputs • Reduce manual effort in semi-automatic methods • Impact on ASR performance

  5. Grapheme-to-phoneme converters (according to Bisani and Ney, 2008) • [Figure: grapheme sequence c a r s mapped to phoneme sequence K AX 9r S]

  6. Data • Languages (different grades of G2P relationship): • English, German, French, Spanish • Dictionaries: • English: CMU dictionary • German, Spanish: GlobalPhone • French: Quaero Project • Data sets (randomly chosen; different small training data sizes to simulate low resources): • Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs • Development / test set: 10k word-pronunciation pairs (disjoint)

  7. Analysis of Single G2P Converter Outputs • Metric: edit distance to reference pronunciations at phoneme level (phoneme error rate, PER) • Lower PERs with increasing amount of training data
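The PER metric named on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: it assumes pronunciations are given as space-separated phoneme strings and that the edit distance is normalized by the reference length (the usual convention, which the slides do not spell out). The function names are hypothetical; the sample strings are the "cars" outputs from the 200-pair example later in the deck.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER in percent; normalizing by reference length is an assumption."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


reference = "K AA 9r ZH"  # reference pronunciation of "cars" from the deck
outputs = {
    "Sequitur G2P": "K EH 9r ZH",
    "Phonetisaurus": "K AA ZH",
    "CART": "K AE ZH",
    "Moses": "K AA 9r S",
    "1:1 G2P (Rules)": "K AX 9r S",
}
for name, hyp in outputs.items():
    print(f"{name}: {phoneme_error_rate(reference, hyp):.0f}% PER")
```

Under these assumptions the five outputs come out at 25%, 25%, 50%, 25% and 50% PER, the values shown in the deck's example.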

  8. Analysis of Single G2P Converter Outputs • Lowest PERs are achieved with Sequitur and Phonetisaurus for all languages and data sizes; even Moses is very close for de

  9. Analysis of Single G2P Converter Outputs • For 200 en and fr W-P pairs, Rules outperforms Moses

  10. Phoneme-level combination scheme • Based on ROVER (Recognizer Output Voting Error Reduction; Fiscus, 1997), which is traditionally applied at word level • Voting Module • by frequency of occurrence, since G2P confidence scores are not reliable

  11. Phoneme-level combination scheme • Example (trained with 200 W-P pairs): • Reference: cars K AA 9r ZH
      Converter          Output        PER
      Sequitur G2P       K EH 9r ZH    25%
      Phonetisaurus      K AA ZH       25%
      CART               K AE ZH       50%
      Moses              K AA 9r S     25%
      1:1 G2P (Rules)    K AX 9r S     50%
      PLC output         K AA 9r ZH     0%

  12. Phoneme-level combination • Relative PER change compared to best single converter output • In 10 of 16 cases, the combination is equal or better
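The relative PER change used on these result slides can be written as a one-line formula. The sketch below is an assumption about the sign convention (negative means the new PER is lower than the baseline), which the slides do not state; the function name is hypothetical. The sample values 16.74 and 14.17 are the German PERs before and after filtering reported in the backup slides.

```python
def relative_per_change(per_baseline, per_new):
    """Relative change in percent; negative values mean a PER reduction.

    The sign convention is an assumption of this sketch.
    """
    return 100.0 * (per_new - per_baseline) / per_baseline


# German web-derived pronunciations before and after filtering (from the deck).
print(round(relative_per_change(16.74, 14.17), 1))
```

For a comparison against the best single converter, `per_baseline` would be that converter's PER and `per_new` the PER of the combined output.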

  13. Phoneme-level combination • Relative PER change compared to best single converter output • Most improvement for de and en → ASR experiments

  14. Phoneme-level combination • Relative PER change compared to best single converter output • For es (most regular G2P relationship), the combination never improves

  15. Wiktionary • 39 Wiktionary editions with more than 1k IPA pronunciations (June 2012) • [Figure: growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List_of_Wiktionaries)] • T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation, Speech Communication, vol. 56, pp. 101–118, January 2014

  16. Wiktionary • Additional G2P converters based on word-pronunciation pairs in Wiktionary • [Figure: internal consistency (PER %) of the Wiktionary data; data sets of 4.6k, 1.5k, 3.8k and 3.3k W-P pairs]

  17. Data • Filtered web-derived pronunciations • Fully automatic methods from (Schlippe, 2012a, 2012b, 2014) • ~15% with each filtering method

  18. Phoneme-level combination • Relative PER change compared to best single converter output • PLC-unfiltWDP already better than w/oWDP

  19. Phoneme-level combination • Relative PER change compared to best single converter output • 23.1% rel. PER reduction • Filtering web-derived pronunciations helps

  20. ASR experiments • Replace dictionaries in de & en recognizers with pronunciations generated with G2P converters • Train and decode the systems • Metric: Word Error Rate (WER) • As in PER evaluation: Sequitur and Phonetisaurus very good in most cases • However: Rules results in lowest WERs for most scenarios

  21. ASR experiments • In only 1 case is PLC-w/oWDP better than or equal to the best single converter

  22. ASR experiments • Filtering web-derived word-pronunciation pairs helps.

  23. ASR experiments • Confusion Network Combination (CNC) outperforms PLC

  24. ASR experiments • In 9 cases, adding a system with PLC helps in CNC

  25. Conclusion and Future Work • In most cases, PLC comes closer to validated reference pronunciations than the single converters • Web-derived word-pronunciation pairs can further improve quality (filtering the web data is helpful) • Weighting single G2P converters' outputs gave no improvement • according to performance on dev set • according to converters' confidences • Potential to enhance semi-automatic pronunciation dictionary creation by reducing the human editing effort

  26. Conclusion and Future Work • The positive impact of the combination in terms of lower PERs had only little influence on the WERs of our ASR systems • Including systems whose pronunciation dictionaries have been built with PLC in CNC can lead to improvements • Future work: • Embedding PLC and web-derived pronunciations into the semi-automatic pronunciation dictionary creation • Further languages and further G2P converters

  27. Thank you for your attention!

  28. ASR experiments • In 6 cases, the system with PLC is better than or equal to the best single converter

  29. Data • Filtered web-derived pronunciations • Threshold for each filtering method dependent on mean µ and standard deviation σ of the measure in focus • [Diagram: word-pronunciation pairs → prefiltering → 1st-stage filtering (Len / Eps / M2NAlign) → filtered word-pronunciation pairs] • (Black et al., 1998) • (Martirosian and Davel, 2007) • (Schlippe, 2012a, 2012b)

  30. Data • Filtered web-derived pronunciations • Threshold for each filtering method dependent on mean µ and standard deviation σ of the measure in focus • [Diagram: word-pronunciation pairs → prefiltering → 1st-stage filtering (Len / Eps / M2NAlign) → remaining word-pronunciation pairs → 2nd-stage filtering (G2P): train "reliable" G2P model, apply G2P model to words, keep pairs with edit distance < threshold → filtered word-pronunciation pairs]

  31. Phoneme-level combination scheme • Example (trained with 200 W-P pairs): • Reference: cars K AA 9r ZH • Sequitur (25% PER): K EH 9r ZH • Phonetisaurus (25% PER): K AA ZH • CART (50% PER): K AE ZH • Moses (25% PER): K AA 9r S • 1:1 G2P (50% PER): K AX 9r S

  32. Phoneme-level combination scheme • Alignment Module ("@" marks an empty position):
      K  EH  9r  ZH
      K  AA  @   ZH
      K  AE  @   ZH
      K  AA  9r  S
      K  AX  9r  S
      • Voting Module • by frequency of occurrence, since G2P confidence scores are not reliable • Result: K (1) AA (0.4) 9r (0.6) ZH (0.6)
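The two modules on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation and not the ROVER tool itself: instead of building a transition network incrementally, it aligns every hypothesis to the first one (used as a backbone) by minimum edit distance, drops phonemes that a hypothesis inserts relative to the backbone, and breaks ties between equal-cost alignments in favor of leaving a backbone position empty. All function names are hypothetical. Voting is by frequency of occurrence only.

```python
from collections import Counter

NULL = "@"  # placeholder for "no phoneme at this position", as on the slide


def align_to_backbone(backbone, hyp):
    """Return one slot per backbone position: the aligned phoneme of hyp,
    or NULL where hyp has nothing. Insertions in hyp are dropped."""
    n, m = len(backbone), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (backbone[i - 1] != hyp[j - 1]))
    # Backtrace; an empty slot is preferred when several alignments tie.
    slots = [NULL] * n
    i, j = n, m
    while i > 0:
        if d[i][j] == d[i - 1][j] + 1:
            i -= 1                      # backbone phoneme has no partner
        elif j > 0 and d[i][j] == d[i - 1][j - 1] + (backbone[i - 1] != hyp[j - 1]):
            slots[i - 1] = hyp[j - 1]   # match or substitution
            i, j = i - 1, j - 1
        else:
            j -= 1                      # extra phoneme in hyp, ignored here
    return slots


def combine(hypotheses):
    """Align all hypotheses to the first one and vote per position."""
    backbone = hypotheses[0].split()
    rows = [align_to_backbone(backbone, h.split()) for h in hypotheses]
    result = []
    for column in zip(*rows):
        phoneme, count = Counter(column).most_common(1)[0]
        if phoneme != NULL:             # a winning NULL deletes the position
            result.append((phoneme, count / len(rows)))
    return rows, result


rows, result = combine([
    "K EH 9r ZH",  # Sequitur
    "K AA ZH",     # Phonetisaurus
    "K AE ZH",     # CART
    "K AA 9r S",   # Moses
    "K AX 9r S",   # 1:1 G2P (Rules)
])
for row in rows:
    print(" ".join(row))
print(" ".join(f"{p} ({share:g})" for p, share in result))
```

With the five outputs for "cars" this sketch reproduces the aligned rows of the slide and the voted sequence K AA 9r ZH with vote shares 1, 0.4, 0.6 and 0.6. Ties in the vote go to the phoneme seen first, so the order of the converters matters in this sketch.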

  33. Data • Filtered WDPs • German: G2PLen • Length filtering: 1. Remove a pronunciation if the ratio of grapheme and phoneme tokens is smaller than µLen − σLen or larger than µLen + σLen • G2P filtering: 2.1. Train G2P models with the remaining, more "reliable" W-P pairs. 2.2. Apply the G2P models to convert a grapheme string into a most likely phoneme string. 2.3. Remove a pronunciation if the edit distance between the synthesized phoneme string and the pronunciation in question is smaller than µG2P − σG2P or larger than µG2P + σG2P → PER reduction: 16.74 → 14.17
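The length-filtering step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' filter: graphemes are taken to be the characters of the word, phonemes are space-separated tokens, pronunciations are non-empty, and σ is computed as the population standard deviation (the slides do not say which variant was used). The function name and the sample pairs are hypothetical.

```python
from statistics import mean, pstdev


def length_filter(pairs):
    """Keep pairs whose grapheme/phoneme length ratio lies within mu +/- sigma."""
    ratios = [len(word) / len(pron.split()) for word, pron in pairs]
    mu, sigma = mean(ratios), pstdev(ratios)  # population std is an assumption
    low, high = mu - sigma, mu + sigma
    return [pair for pair, ratio in zip(pairs, ratios) if low <= ratio <= high]


# Hypothetical web-derived pairs; the last one pairs an abbreviation with the
# pronunciation of its expanded form, a typical length outlier.
pairs = [
    ("cars", "K AA 9r ZH"),
    ("hat", "HH AE T"),
    ("bad", "B AE D"),
    ("test", "T EH S T"),
    ("tv", "T EH L AX V IH ZH AX N"),
]
for word, pron in length_filter(pairs):
    print(word, pron)
```

The second-stage G2P filter follows the same µ ± σ pattern, with the edit distance between the model output and the web-derived pronunciation as the measure instead of the length ratio.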

  34. Data • Filtered WDPs • English, Spanish: M2NAlign • Perform an m-n G2P alignment (Black et al., 1998) • Remove a pronunciation if the alignment score is smaller than µG2P − σG2P or larger than µG2P + σG2P. → English PER reduction: 33.18 → 26.13 → Spanish PER reduction: 10.25 → 10.90

  35. Data • Filtered WDPs • French: Eps (according to Martirosian and Davel, 2007) • Perform a 1-1 G2P alignment (Black et al., 1998); the alignment process involves the insertion of graphemic and phonemic nulls (epsilons) into the lexical entries of words. • Remove a pronunciation if the proportion of graphemic and phonemic nulls is smaller than µG2P − σG2P or larger than µG2P + σG2P. → PER reduction: 14.96 → 13.97

  36. References Pronunciation Extraction Through Cross-lingual Word-to-Phoneme Alignment
