
[domi@ms.sapientia.ro] József DOMOKOS, Ovidiu BUZA, Gavril TODEREAN





  1. 100k+ Words, Machine Readable, Pronunciation Dictionary for the Romanian Language [domi@ms.sapientia.ro] József DOMOKOS, Ovidiu BUZA, Gavril TODEREAN. Communications Department, Technical University of Cluj-Napoca

  2. Outline • Introduction • Motivation • The used grapheme and phoneme set • Dictionary development stages • Tests and validation results • Conclusions • Acknowledgments

  3. Introduction • This paper presents NaviRo, a newly developed pronunciation dictionary for the Romanian language. • The dictionary contains almost 140k words from the DexOnline dictionary, together with their phonetic transcriptions in the machine-readable SAMPA alphabet. • The paper also describes the development of the pronunciation dictionary and the system architecture.

  4. Motivation • Pronunciation dictionaries are very useful resources for spoken language technology. They are widely used in automatic speech recognition (ASR) and text-to-speech (TTS) applications. • For languages considered under-resourced, such as Romanian, the existence of a pronunciation dictionary can considerably speed up ASR and TTS system development. • To the best of our knowledge, no large, machine-readable pronunciation dictionary usable for grapheme-to-phoneme transcription is available for Romanian, comparable to the CMU Pronouncing Dictionary, the OALD, or the BEEP dictionary for English.

  5. The used grapheme and phoneme set
     • Table 1. The 31 graphemes used for modern Romanian writing (according to [8])
       a  ă  â  b  c  d
       e  f  g  h  i  î
       j  k  l  m  n  o
       p  q  r  s  ş  t
       ţ  u  v  w  x  y
       z
     • Table 2. The used phoneme set presented in SAMPA coding
       a    @    1    b    k    d
       e    e_X  f    g    h    i
       i_0  j    l    m    n    o
       o_X  p    r    s    S    t
       ts   tS   u    v    z    Z
       dZ   sil
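The two inventories can be turned into a simple consistency check for dictionary entries. The following is a minimal sketch, assuming that transcriptions are stored as space-separated SAMPA symbols and that the last row of Table 2 reads "dZ" followed by the silence symbol "sil"; the function names are hypothetical and are not part of NaviRo.

```python
from typing import List

# Grapheme inventory from Table 1 (31 symbols).
GRAPHEMES = set("aăâbcdefghiîjklmnopqrsştţuvwxyz")

# Phoneme inventory from Table 2 (SAMPA); "sil" is assumed to mark silence.
PHONEMES = {
    "a", "@", "1", "b", "k", "d",
    "e", "e_X", "f", "g", "h", "i",
    "i_0", "j", "l", "m", "n", "o",
    "o_X", "p", "r", "s", "S", "t",
    "ts", "tS", "u", "v", "z", "Z",
    "dZ", "sil",
}


def unknown_graphemes(word: str) -> List[str]:
    """Return the characters of `word` that are not in Table 1."""
    return [g for g in word.lower() if g not in GRAPHEMES]


def unknown_phonemes(transcription: str) -> List[str]:
    """Return the symbols of a space-separated transcription not in Table 2."""
    return [p for p in transcription.split() if p not in PHONEMES]
```

An entry whose word and transcription both yield empty lists uses only symbols from the two tables; anything else points to a typo or to a symbol outside the inventory.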

  6. Dictionary development • The dictionary was developed in multiple stages: • First, we manually collected a 1k-word dictionary from linguistic resources available in published form [6], containing words transcribed by expert phoneticians. • In the second stage, we developed a 5k-word pronunciation dictionary using an ANN system with a parallel structure of 30 neural networks. This automated grapheme-to-phoneme transcription system was tested on the small, hand-built 1k-word database and presented in a previous work [5]. • The trained system was able to perform grapheme-to-phoneme transcription with an accuracy of 92.83%, calculated at the phoneme level.
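The slides do not state how the phoneme-level accuracy was computed. One common definition, used here purely as an assumption, is one minus the phoneme error rate, where errors are counted with the Levenshtein (edit) distance between the reference and the predicted phoneme sequences. The sketch below illustrates that definition; the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """1 - (total edit errors / total reference phonemes).

    `pairs` holds (reference, prediction) transcriptions, each a string
    of space-separated SAMPA symbols.
    """
    errors = 0
    total = 0
    for ref, hyp in pairs:
        ref_phones, hyp_phones = ref.split(), hyp.split()
        errors += edit_distance(ref_phones, hyp_phones)
        total += len(ref_phones)
    return 1.0 - errors / total
```

With illustrative pairs such as `("k a s @", "k a s a")` and `("m a r e", "m a r e")`, one of eight reference phonemes is wrong, so the function returns 0.875.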

  7. Dictionary development • We recorded and segmented audio samples for the phonemes used. These audio files were used to generate an audible version of each transcription for the words in the word list. We corrected the 5k-word dictionary using Dictionary Maker [7], a software application created to facilitate the creation of an electronic pronunciation dictionary in a target language. • We then built the 140k-word NaviRo dictionary using a modified and extended version of Dictionary Maker, starting from the previously created 5k-word dictionary as the initial dictionary for rule extraction and a 140k-word list from the DexOnline [8] dictionary.

  8. Dictionary development

  9. Testing and validation • We tested the generalization capacity of the created dictionary using 5-fold cross-validation. • In each fold, 80% of the dictionary was used as the training set and 20% for testing. • The average result of the cross-validation test shows 76.3% accuracy measured at the word level (calculated as the percentage of words that are predicted 100% correctly).
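The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the letter-by-letter `train_letter_map` learner is a deliberately trivial, hypothetical stand-in for the rule extraction performed by Dictionary Maker, and the shuffling seed and fold assignment are assumptions.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, str]           # (word, space-separated transcription)
Predictor = Callable[[str], str]  # word -> predicted transcription


def word_accuracy(predict: Predictor, test: List[Entry]) -> float:
    """Fraction of test words whose whole transcription is predicted exactly."""
    correct = sum(1 for word, ref in test if predict(word) == ref)
    return correct / len(test)


def cross_validate(entries: List[Entry],
                   train: Callable[[List[Entry]], Predictor],
                   k: int = 5, seed: int = 0) -> float:
    """Average word-level accuracy over k folds (k=5 gives an 80/20 split)."""
    if len(entries) < k:
        raise ValueError("need at least k entries")
    shuffled = entries[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        scores.append(word_accuracy(train(training), test))
    return sum(scores) / k


def train_letter_map(training: List[Entry]) -> Predictor:
    """Toy learner (hypothetical): most frequent phoneme per letter, learned
    only from entries with exactly one phoneme per letter."""
    counts: Dict[str, Counter] = {}
    for word, transcription in training:
        phones = transcription.split()
        if len(phones) != len(word):
            continue
        for grapheme, phone in zip(word, phones):
            counts.setdefault(grapheme, Counter())[phone] += 1
    table = {g: c.most_common(1)[0][0] for g, c in counts.items()}
    return lambda word: " ".join(table.get(g, "?") for g in word)
```

Calling `cross_validate(entries, train_letter_map)` returns the mean of the five per-fold word accuracies. Because a word counts as correct only if every phoneme matches, word-level accuracy is always at most the phoneme-level accuracy on the same data.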

  10. Conclusions • We have created the first 100k+ word machine-readable pronunciation dictionary for the Romanian language, based on the words from the lexem table of DexOnline. • We tested the generalization capacity of the created dictionary using 5-fold cross-validation and obtained 76.3% accuracy at the word level. • We consider the results very useful, because the dictionary can speed up the development of large-vocabulary speech recognition and text-to-speech systems for Romanian. • We have also developed a 1 million word pronunciation dictionary based on all the inflected word forms from DexOnline, which has not yet been tested and validated.

  11. Conclusions • The NaviRo pronunciation dictionary is freely available on the project website (http://users.utcluj.ro/~jdomokos/naviro/) in HTK and Festival Speech Synthesis System dictionary formats and also in text format. • The grapheme and phoneme sets and the audio samples for the phonemes are also available for download, so that they can be used to bootstrap new dictionary development. • The use of these resources is completely unrestricted for any research purposes, in order to promote Romanian language speech technology research.
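The slides do not show the layout of the distributed files. As an illustration only, the sketch below assumes the simplest HTK-style dictionary line, a word followed by its whitespace-separated phone sequence, with one entry per line; the helper names are hypothetical, and the sample entries are illustrative rather than taken from NaviRo.

```python
from typing import Dict, List, Tuple


def format_entry(word: str, phones: List[str]) -> str:
    """One dictionary line: the word, a tab, then space-separated phones."""
    return word + "\t" + " ".join(phones)


def parse_entry(line: str) -> Tuple[str, List[str]]:
    """Split a line into the word and its phone list (any whitespace)."""
    word, *phones = line.split()
    return word, phones


def dump_lexicon(lexicon: Dict[str, List[str]]) -> str:
    """Serialize the lexicon with words in sorted order, one per line."""
    return "\n".join(format_entry(w, lexicon[w]) for w in sorted(lexicon)) + "\n"


def load_lexicon(text: str) -> Dict[str, List[str]]:
    """Parse dictionary text, skipping blank lines.

    If a word occurs more than once, the last pronunciation is kept.
    """
    lexicon: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if line.strip():
            word, phones = parse_entry(line)
            lexicon[word] = phones
    return lexicon
```

A lexicon written with `dump_lexicon` and read back with `load_lexicon` round-trips unchanged, which makes the pair convenient for checking downloaded or regenerated dictionary files.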

  12. Acknowledgment • This paper was supported by the project "Develop and support multidisciplinary postdoctoral programs in primordial technical areas of national strategy of the research - development - innovation" 4D-POSTDOC, contract nr. POSDRU/89/1.5/S/52603, project co-funded from the European Social Fund through the Sectorial Operational Program Human Resources 2007-2013. • Thanks to Marelie Davel and Etienne Barnard for the Dictionary Maker application. • Thanks to the developers of the Audacity application.

  13. Selected references • [1] M. Bisani, H. Ney, “Joint-Sequence Models for Grapheme-to-Phoneme Conversion”, Speech Communication, Vol. 50, Elsevier, pp. 434-451, 2008. • [2] M. Davel, E. Barnard, “Pronunciation Prediction with Default&Refine”, Computer Speech and Language, Vol. 22, Elsevier, pp. 374-393, 2008. • [3] M. Davel, E. Barnard, “Bootstrapping in Language Resource Generation”, Proceedings of the 13th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), pp. 97-100, Langebaan, South Africa, 2003. • [4] A. Stan, J. Yamagishi, S. King and M. Aylett, “The Romanian Speech Synthesis (RSS) corpus: building a high quality HMM-based speech synthesis system using a high sampling rate”, Speech Communication, Vol. 53, Issue 3, Elsevier, pp. 442-450, 2010. • [5] J. Domokos, O. Buza, G. Toderean, “Automated Grapheme-to-Phoneme Conversion System for Romanian”, Proceedings of the 6th Speech Technology and Human-Computer Dialogue Conference SpeD, Braşov, Romania, 2011. • [6] Institutul de Lingvistică „Iorgu Iordan - Alexandru Rosetti” al Academiei Române, “DOOM - Dicţionarul Ortografic, Ortoepic şi Morfologic al Limbii Române (Editia a II-a, revizuita şi adăugită)”, Editura Univers Enciclopedic, Bucureşti, 2005. • [7] Dictionary Maker application homepage on SourceForge: http://dictionarymaker.sourceforge.net/ • [8] DexOnline - Transpunerea pe Internet a Unor Dicționare de Prestigiu ale Limbii Române, http://dexonline.ro/
