eWika: Towards the Digitalization of Philippine Languages • Charibeth K. Cheng (koc@dlsu.edu.ph) • DLSU, College of Computer Studies • Natural Language Processing Research Lab • Isalin / Translate
MT Research in RP • started in 1993 at UP-Los Baños • Dr. Rachel Roxas and Allan Borra • grammar-based • started at DLSU in 2004 • hybrid approach
ENG-FIL MT System Project • 3-year project • started 2005 • funded by DOST-PCASTRD • composition: • 6 faculty members of the College of Computer Studies • 15 computer science majors • assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M
Architectural Design of the Program • User Interface (Source Text, Target Text) • Translator Engine • MT: Example-based • MT: Rule-based • Output Modeller • Language Resources: • Lexicon (electronic dictionary) • Morphological Analyzer & Generator • Part-of-Speech tagger • Grammar • Corpus (Tagged)
Where do we get the translation rules? • Rule-Based approach • The boy ate apples. • Apply translation rules • Kumain ng mga mansanas ang batang lalaki.
Example-Based • Learn the rules from examples • The (A) boy (B) ate (C) apples (D). • Kumain (C) ng mga mansanas (D) ang (A) batang lalaki (B). • Rule Learned: A B C D → C ng D A B
Using the rule • A B C D → C ng D A B • The (A) mother (B) cooked (C) fish (D). • Nagluto (C) ng isda (D) ang (A) nanay (B).
Using the rule • A B C D → C ng D A B • The (A) mother (B) went (C) home (D). • Umuwi (C) ng bahay (D) ang (A) nanay (B).
Limitation of a Rule • A B C D → C ng D A B • The boy ate the fish.
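The learned rule A B C D → C ng D A B can be sketched in a few lines of Python. This is only an illustration, not the project's engine: the `LEXICON` table and the `translate` function are hypothetical names, the word pairs are taken from the example sentences on the slides, and the mapping of "the" to "ang" is assumed from the slot alignment. The sketch returns `None` when a sentence does not fit the four-slot pattern, as happens with "The boy ate the fish."

```python
# Hypothetical toy lexicon built only from the word pairs on the slides.
LEXICON = {
    "the": "ang",
    "boy": "batang lalaki",
    "mother": "nanay",
    "ate": "kumain",
    "cooked": "nagluto",
    "went": "umuwi",
    "apples": "mga mansanas",
    "fish": "isda",
    "home": "bahay",
}


def translate(sentence):
    """Apply the single learned rule A B C D -> C ng D A B.

    Returns None when the sentence does not have exactly four known words,
    i.e. when the rule does not apply.
    """
    words = sentence.rstrip(".").lower().split()
    if len(words) != 4 or any(w not in LEXICON for w in words):
        return None
    a, b, c, d = (LEXICON[w] for w in words)
    output = f"{c} ng {d} {a} {b}"
    return output[0].upper() + output[1:] + "."


if __name__ == "__main__":
    for s in ["The boy ate apples.", "The mother cooked fish.",
              "The mother went home.", "The boy ate the fish."]:
        print(s, "->", translate(s))
```

A real example-based system would learn many such templates from the corpus and match on parts of speech rather than on word count; the fixed four-word check here is the simplest stand-in for that matching step.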
Results of the MT Engine • Qualities of a Good Translation • Clarity – 3.3 • Accuracy – 3.2 • Naturalness – 2.8 • highest possible score: 5 • 100 respondents (including 5 linguists)
Challenge! • Language resources: dictionary, grammar, sample translations • Quality of translation depends on them. • Built from almost non-existent digital forms • manual vs. automatic construction
Lexicon • Diksyunaryo ng Wikang Filipino • automatic construction (AeFLEX): • accuracy rate – 57% • Currently contains 30,000+ entries • Challenge: Lexical resources • translation documents • part-of-speech tagger
Morphological Analyzer and Generator • Dictionary is incomplete • Create software that: • analyzes – determines the root word • generates – produces the inflected word • Given: eating → eat → kain → kumakain • Challenge: Lexical resources • lexicon • part-of-speech tagger
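The chain eating → eat → kain → kumakain can be sketched as an analysis step followed by a generation step. The code below is a minimal illustration under stated assumptions, not the project's analyzer: `ROOTS`, `analyze_english`, and `generate_um_progressive` are hypothetical names, the English side only strips "-ing", and the Filipino side handles only one pattern, the -um- infix with first-syllable reduplication.

```python
VOWELS = "aeiou"

# Hypothetical one-entry root lexicon taken from the slide's example.
ROOTS = {"eat": "kain"}


def analyze_english(word):
    """Very rough analysis: strip '-ing' to recover the root (eating -> eat)."""
    return word[:-3] if word.endswith("ing") else word


def first_vowel_index(word):
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return i
    raise ValueError(f"no vowel in {word!r}")


def generate_um_progressive(root):
    """Reduplicate the first syllable, then insert 'um' before the first vowel.

    kain -> kakain -> kumakain
    """
    i = first_vowel_index(root)
    reduplicated = root[: i + 1] + root  # first syllable = onset + first vowel
    return reduplicated[:i] + "um" + reduplicated[i:]


def translate_inflected(english_word):
    root = analyze_english(english_word)
    return generate_um_progressive(ROOTS[root])


if __name__ == "__main__":
    print(translate_inflected("eating"))
```

Filipino verbs take several other affix patterns (mag-, -in, i-, and so on), so a usable generator would need the lexicon to record which pattern each root takes; that is one reason the slide lists the lexicon as a dependency.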
Part-Of-Speech Tagger • automatic association of parts-of-speech to words in a document • Can? – kaya vs. lata • Baba? – chin or go down • Challenge : Lexical resource • corpora • lexicon • morphological analyzer • grammar
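The "can" example shows why the tag matters for translation: the same English word maps to "kaya" or "lata" depending on its part of speech. A minimal sketch of that idea follows. It uses a single hand-written context heuristic (a preceding determiner means noun), which is an assumption for illustration and not the tagging method used in the project; `DETERMINERS`, `tag_can`, and `translate_can` are hypothetical names.

```python
# Assumed heuristic: "can" right after a determiner is a noun, otherwise a modal.
DETERMINERS = {"a", "the", "this", "that"}

# Translations from the slide: modal "can" -> kaya, noun "can" -> lata.
SENSE = {"MODAL": "kaya", "NOUN": "lata"}


def tag_can(tokens, i):
    """Return 'NOUN' or 'MODAL' for the token 'can' at position i."""
    previous = tokens[i - 1].lower() if i > 0 else ""
    return "NOUN" if previous in DETERMINERS else "MODAL"


def translate_can(sentence):
    """Pick the Filipino equivalent of 'can' in the sentence using its tag."""
    tokens = sentence.rstrip(".?!").split()
    i = [t.lower() for t in tokens].index("can")
    return SENSE[tag_can(tokens, i)]


if __name__ == "__main__":
    print(translate_can("I can swim."))
    print(translate_can("Open the can."))
```

A tagger trained on a tagged corpus would replace the hand-written rule with statistics over many contexts, which is why the slide lists corpora among the resources the tagger depends on.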
Corpora • collection of translation-pair documents • used by the lexicon extractor and part-of-speech tagger, example-based MT • came from translation works of DLSU English majors, verified by linguists • consists of 207,000 words
Lexicon Resource Dependency • Lexicon • Corpus • POS Tagger • Morph AG
Bringing it home … • 171 Philippine Languages (SIL) • No Philippine Corpora • Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc) • “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)
eWika: Digitalization of Philippine Languages • Build the Philippine Corpus • Build software tools to study or use the corpus • Across Regions • Across Forms and Genres • Across Languages
Across Regions • Web-based application: GLOBALIZATION • upload, download, tools • Contributors (Main players) • Verifiers • Server: DLSU-M commits to host the server for the next three years. • Terms of Use: Research purposes.
Across Languages • 171 Philippine Languages (SIL List) • start with 8 major languages • Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano • Filipino Sign Language
Across Forms and Genres • In various forms: • Text • Speech • Video: Filipino sign language • In various Genres: • Text – literary & creative, essays, news articles, religious, etc • Speech – scripted, conversations, etc • Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)
The dream of building electronic, online Philippine language resources and tools • Many many many major hurdles to overcome • NEEDED : Language Resources, Tools, & Peopleware