Can knowledge of ethnic origin classes improve pronunciation accuracy of foreign proper names?
Ariadna Font Llitjos
Advisor: Dr. Alan Black

Introduction and motivation

Our hypothesis is that knowing the origin of an unknown word may allow more specific rules to be applied ([Black et al., 1998] and [Church, 2000]). In some cases we even expect our system to outperform native English speakers, since what we are after is an educated pronunciation of foreign proper names in American English. Chinese names are one such case: few native English speakers know how to pronounce them, yet there are very concrete English rules for pronouncing such names, so adding this information to our system would let it pronounce those names in the Americanized, educated way, achieving higher pronunciation accuracy than the average American speaker. Previous experiments on large lists of words and pronunciations have shown that when one lexicon contains more foreign words than another (CMUDICT vs. OALD in [Black et al., 1998]), speech synthesis accuracy is strongly affected: those experiments report a 16.76% difference in word accuracy due to foreign words, a large proportion of which are likely to be proper names, since these are harder to predict without any higher level of information. There is therefore clearly room for improvement in this domain.

Building Letter Language Models

Once I had collected all these multilingual data, I built a trigram letter language model (LLM) for each language.
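A trigram LLM of this kind can be built by counting letter trigrams (and their bigram histories) over each language's word list. The sketch below is a minimal illustration under my own conventions (the function name, boundary padding symbols, and lower-casing are assumptions, not the original code):

```python
from collections import defaultdict

def train_trigram_llm(words):
    """Count letter trigrams (and their bigram histories) over a word
    list, padding each word with boundary symbols so word-initial and
    word-final letter sequences are modeled as well."""
    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    alphabet = set()
    for word in words:
        letters = ["<s>", "<s>"] + list(word.lower()) + ["</s>"]
        alphabet.update(letters)
        for i in range(2, len(letters)):
            trigram_counts[tuple(letters[i - 2:i + 1])] += 1
            bigram_counts[tuple(letters[i - 2:i])] += 1
    return trigram_counts, bigram_counts, alphabet
```

The alphabet is tracked alongside the counts because add-one smoothing of the trigram estimates needs a vocabulary size.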
I used Laplace smoothing, which only made a significant difference for the proper-name corpora, since there was not enough data to reliably estimate all the trigrams. I then implemented a language identifier which, given a word (or a document), uses the LLMs to determine which language it belongs to, with a certain probability. To build the CART, we decided to add the following features to each name: 1st-language, higher-probability, 2nd-language, 2nd-higher-probability, and the difference between the two highest probabilities:

(zysk ((best-lang slovenian.train) (higher-prob 0.18471)
       (2nd-best-lang czech.train) (2nd-higher-prob 0.18428)
       (prob-difference 0.00043)))

The resulting CART (probCART) therefore had a richer parameter space, since it used all the previous features (previous and next phones) as well as the language-of-origin features.

Baselines

For the baseline, I looked up CMUdict with stress (version 0.4) to extract the pronunciations of a list of 56,000 foreign proper names; every tenth word in the lexicon was held out for testing, and the remaining 90% was used as training data. Based on the techniques described in [Black et al., 1998] and used in [Chotimongkol and Black, 2000], I used decision trees to predict phones from letters and their context. In English, a letter maps to epsilon, one phone, or occasionally two phones, as the following three mappings illustrate:

(a) Monongahela  m ax n oa1 ng g ax hh ey1 l ax
(b) Pittsburgh   p ih1 t s b er g
(c) exchange     ih k-s ch ey1 n jh

The CART trained on proper names only turned out to have a 3.15% higher word accuracy than the CART trained on the whole CMUdict (see table).

Preliminary results

We built the probabilistic CART (probCART) using the stop value that had proven optimal for CMUdict in previous experiments [Black et al., 1998], namely 5.
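The language identifier and the origin features described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the model format (raw trigram/bigram count dictionaries plus an alphabet size), the padding symbols, and the use of raw word likelihoods as the reported probabilities are all assumptions.

```python
import math

def trigram_logprob(word, model, vocab_size):
    """Log-likelihood of a word under one language's letter trigram
    LLM, with Laplace (add-one) smoothing of the trigram counts."""
    trigrams, bigrams = model
    letters = ["<s>", "<s>"] + list(word.lower()) + ["</s>"]
    logp = 0.0
    for i in range(2, len(letters)):
        tri = tuple(letters[i - 2:i + 1])
        bi = tuple(letters[i - 2:i])
        logp += math.log((trigrams.get(tri, 0) + 1)
                         / (bigrams.get(bi, 0) + vocab_size))
    return logp

def origin_features(word, models, vocab_size):
    """Score the word under every language's LLM and emit the CART
    features named above: best language and its probability, the
    runner-up language and its probability, and their difference."""
    scored = sorted(((trigram_logprob(word, m, vocab_size), lang)
                     for lang, m in models.items()), reverse=True)
    (lp1, lang1), (lp2, lang2) = scored[0], scored[1]
    p1, p2 = math.exp(lp1), math.exp(lp2)
    return {"best-lang": lang1, "higher-prob": p1,
            "2nd-best-lang": lang2, "2nd-higher-prob": p2,
            "prob-difference": p1 - p2}
```

Run over a name such as zysk with one model per training corpus, this would yield a feature structure of the kind shown above.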
However, since the parameter space was richer (the tree had more features to split on), we suspected a data-fragmentation problem: there was not actually enough data at the leaves for reliable estimates, so we also built CARTs using a stop value of 8. The word accuracies for all the CARTs are given in the table; the best represents a 7.64% increase in word accuracy over the proper-names baseline.

Multilingual Data collection

The next step was to collect data to build the letter language models (LLMs) for as many different languages as possible, so that I could effectively build a language identifier and add the relevant features to the CART. To that end, I post-processed all the corpora for the languages in the European Corpus Initiative Multilingual Corpus I (size: 255 thousand – 11 billion words): English, French, German, Spanish, Croatian, Czech, Danish, Dutch, Estonian, Hebrew, Italian, Malaysian, Norwegian, Portuguese, Serbian, Slovenian, Swedish, and Turkish. I also built corpora of proper nouns only for a few more languages, by crawling the web automatically (CorpusBuilder [Ghani, Jones and Mladenic, 2001]) and manually (size: 500 – 6,198 names): Catalan, Chinese, Japanese, Korean, Polish, Thai, Tamil, and other-Indian (excluding Tamil).

Some references

- Black, A., Lenzo, K. and Pagel, V. (1998). Issues in Building General Letter to Sound Rules. 3rd ESCA Speech Synthesis Workshop, pp. 77-80, Jenolan Caves, Australia.
- Chotimongkol, A. and Black, A. (2000). Statistically Trained Orthographic to Sound Models for Thai. Beijing, October 2000.
- Church, K. (2000). Stress Assignment in Letter to Sound Rules for Speech Synthesis (Technical Memorandum). AT&T Labs – Research, November 27, 2000.
- Ghani, R., Jones, R. and Mladenic, D. (2001). Building Minority Language Corpora by Learning to Generate Web Search Queries. Technical Report CMU-CALD-01-100. http://www.cs.cmu.edu/~TextLearning/corpusbuilder/