Thai Linguistic Resources
Virach Sornlertlamvanich
Information R&D Division (iTech), National Electronics and Computer Technology Center (NECTEC), Thailand
19 January 2001, Symposium on Language Resources in Asia
How Important!
[Diagram: top-down and bottom-up design paths for language processing. Surviving labels: Linguistic Knowledge, Training Resources, Defining Rules, Statistical Modeling, Models, Adjust, Evaluation, Resources.]
• Linguistic resources are necessary in both top-down and bottom-up design
• Exploitable in modeling and evaluation
What do we need?
• Lexicon / Dictionary (30k)
• Tagged Text (2MB) / Speech Corpora
• Language Model
• Word Extraction (ML; p=85%; r=56%)
• Word Segmentation / POS tagger (ML; 96-97%)
• Sentence Segmentation (ML; 85-89%)
• Grapheme-to-Phoneme Conversion (PGLR; 73-90%)
• Word Sense Disambiguation
• Corpus / UNL / UW (concept) Editor
• MT (ParSit; http://come.to/parsit) / UNL
• Text Summarization
• Speech Recognition / Synthesis
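The word-extraction figures above (p=85%, r=56%) can be combined into a single score. The sketch below assumes that p and r stand for precision and recall, and computes the standard F-measure from them. The resulting number is derived here for illustration only; it is not reported on the slide.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1.0 gives the usual balanced F1 score.
    """
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Assumption: p=85% and r=56% on the slide are precision and recall.
word_extraction_f1 = f_measure(0.85, 0.56)
print(f"{word_extraction_f1:.3f}")  # prints 0.675
```

The gap between the two inputs is why the combined score sits well below the precision figure: the harmonic mean is pulled toward the weaker of the two values.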
Open Linguistic Resources
• LEXiTRON v 1.1 (a corpus-based Thai-English dictionary, 1994)
  - About 11,000 Thai entries; 9,000 English entries
  - http://www.links.nectec.or.th/lexit
• Thai Royal Institute Dictionary (Thai-Thai dictionary)
  - Basic terms: 32,000 entries
  - Technical terms: 15,339 entries
  - http://www.royin.go.th/
• ORCHID POS-Tagged Corpus (supported by CRL, 1997)
  - 160 documents; 2MB text; 400K words
  - XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
  - http://www.links.nectec.or.th/orchid
• ParSit (http://come.to/parsit, 2000)
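As a rough illustration of how a corpus tagged for paragraph, sentence, word, and part-of-speech might be read, here is a minimal sketch using Python's standard XML parser. The element names (`corpus`, `paragraph`, `sentence`, `word`), the `pos` attribute, and the sample sentence are all hypothetical; the slide does not give ORCHID's actual schema, so the real files may be laid out differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical layout: this is NOT the documented ORCHID schema.
# The tag labels on each word are illustrative only.
SAMPLE = """<corpus>
  <paragraph>
    <sentence>
      <word pos="PPRS">ฉัน</word>
      <word pos="VACT">กิน</word>
      <word pos="NCMN">ข้าว</word>
    </sentence>
  </paragraph>
</corpus>"""


def read_tagged_sentences(xml_text):
    """Return one list of (word, tag) pairs per <sentence> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sent in root.iter("sentence"):
        sentences.append([(w.text, w.get("pos")) for w in sent.iter("word")])
    return sentences


sentences = read_tagged_sentences(SAMPLE)

# Tag frequencies are a typical first statistic for training a POS tagger.
tag_counts = Counter(tag for sent in sentences for _, tag in sent)
print(sentences)
print(tag_counts)
```

Because Thai text has no spaces between words, keeping each word in its own element preserves the segmentation decisions along with the tags.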
Ongoing: Thai Speech Corpus #1
Scope (2001)
• Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
  - Phonetically-balanced sentences
  - 5K vocabulary coverage sentences
• Corpus for Text-to-Speech Synthesis
  - 400 phonetically and prosodically balanced sentences
  - For probabilistic prosody generation
• Dialog speech corpus (collaboration with ATR)
  - 50 conversations, 2,099 sentences
  - 5,000 words, 866 phonetically-balanced sentences
  - 40 speakers (males and females)
Ongoing: Thai Speech Corpus #2
Procedure
Ongoing: Thai Speech Corpus #3
Tools
[Diagram labels: Corpus Editor, XML Corpus, Plain Text]
Ongoing: Thai Speech Corpus #4
Text Sources
• Technology Promotion Association (Thailand-Japan)
• Amarin Printing Co., Ltd.
• Matichon Public Co., Ltd.
Project Collaboration
• Kasetsart University
• Thammasat University
• King Mongkut's University of Technology Thonburi
• Prince of Songkla University
Ongoing: LEXiTRON v 2.0 #1
Scope (2001) / Procedure
• Entries
  - 25,000 Thai-English
  - 25,000 English-Thai
• Fields
  - Translation
  - Phonetics
  - Root of vocabulary
  - Part-of-speech
  - Synonym
  - Antonym
  - Sentence sample
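The fields listed for LEXiTRON v 2.0 map naturally onto a record type. The sketch below is one possible shape, assuming one record per headword; the class name, attribute names, helper function, and sample entry are all hypothetical and are not taken from the LEXiTRON database itself.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class LexiconEntry:
    """Hypothetical record mirroring the fields named on the slide."""
    headword: str
    translations: List[str]
    phonetics: Optional[str] = None
    root: Optional[str] = None
    part_of_speech: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)


def missing_fields(entry: LexiconEntry) -> List[str]:
    """Names of fields that are still empty (None, empty string, or empty list)."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


# Illustrative entry only; not copied from LEXiTRON.
entry = LexiconEntry(headword="แมว", translations=["cat"], part_of_speech="noun")
print(missing_fields(entry))
```

A completeness check of this kind is one way to track how much of each entry remains to be filled in while a dictionary is still under construction.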
Ongoing: LEXiTRON v 2.0 #2
Tools
[Diagram labels: WordNet, Dictionary DB, Phonetic Symbols, Corpus-based Sample Sentences]
Discussion
• Language difficulties; 13 Tai-family languages
• Text sources
• Common tagset
• Resource center
• Institutional collaboration