The eWika project aims to digitalize Philippine languages and create language resources to support translation and language processing. Started in 1993 at UP-Los Baños, the project transitioned to DLSU in 2004 with a hybrid ENG-FIL MT System. Language resources include lexicons, morphological analyzers, and rule-based translation engines. Challenges remain in creating accurate and natural translations. The project emphasizes the importance of language resources in maintaining cultural identity. Help build the Philippine Corpus and develop software tools for studying and utilizing it across regions, languages, and forms. Collaboration with contributors and verifiers is essential. Let's work together to preserve and promote Philippine languages!
eWika: Towards the Digitalization of Philippine Languages Charibeth K. Cheng (koc@dlsu.edu.ph) DLSU, College of Computer Studies Natural Language Processing Research Lab Isalin Translate
MT Research in RP • started in 1993 at UP-Los Baños • Dr. Rachel Roxas and Allan Borra • grammar-based • started at DLSU in 2004 • hybrid approach
ENG-FIL MT System Project • 3-year project • started 2005 • funded by DOST-PCASTRD • composition: • 6 faculty members of the College of Computer Studies • 15 computer science majors • assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M
Architectural Design of the Program • Source Text • User Interface • Target Text • MT: Example-based • Output Modeller • MT: Rule-based • Translator Engine • Language Resources: • Lexicon (electronic dictionary) • Morphological Analyzer & Generator • Part-of-Speech tagger • Grammar • Corpus (Tagged)
Where do we get the translation rules? Rule-Based approach • The boy ate apples. • Apply translation rules • Kumain ng mga mansanas ang batang lalaki.
Example-Based • Learn the rules from examples • The (A) boy (B) ate (C) apples (D). • Kumain (C) ng mga mansanas (D) ang (A) batang lalaki (B). • Rule Learned: A B C D → C ng D A B
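The rule learning on this slide can be sketched as a toy example. It assumes the word alignment between the English and Filipino sentences is already known and supplied by hand; the function name `learn_rule` and the chunking of the Filipino sentence are illustrative choices, not part of the project's actual system.

```python
# Toy sketch: derive a reordering rule from one word-aligned example pair.
# The alignment is supplied by hand (an assumption); a real example-based
# system would have to infer it from the corpus.

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def learn_rule(source_tokens, target_chunks, alignment):
    """Return (source_pattern, target_pattern) written with slot letters.

    alignment maps a target chunk index to the index of the source token
    it translates. Unaligned target chunks (such as the marker "ng") are
    kept as literal words in the target pattern.
    """
    source_pattern = [LETTERS[i] for i in range(len(source_tokens))]
    target_pattern = []
    for j, chunk in enumerate(target_chunks):
        if j in alignment:
            target_pattern.append(LETTERS[alignment[j]])
        else:
            target_pattern.append(chunk)
    return source_pattern, target_pattern


source = ["The", "boy", "ate", "apples"]
target = ["Kumain", "ng", "mga mansanas", "ang", "batang lalaki"]
# Kumain <- ate, mga mansanas <- apples, ang <- The, batang lalaki <- boy
alignment = {0: 2, 2: 3, 3: 0, 4: 1}

src_pat, tgt_pat = learn_rule(source, target, alignment)
print(" ".join(src_pat), "->", " ".join(tgt_pat))
```

The printed pattern matches the rule on the slide: the four English slots are reordered and the literal marker "ng" is carried over unchanged.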
Using the rule • Rule: A B C D → C ng D A B • The (A) mother (B) cooked (C) fish (D). • Nagluto (C) ng isda (D) ang (A) nanay (B).
Using the rule • Rule: A B C D → C ng D A B • The (A) mother (B) went (C) home (D). • Umuwi (C) ng bahay (D) ang (A) nanay (B).
Limitation of a Rule • Rule: A B C D → C ng D A B • The boy ate the fish.
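Applying the learned rule, and where it stops working, can be sketched as follows. This is a minimal illustration, not the project's engine: the word-for-word `LEXICON` is a hypothetical hand-made table, and the reading of the limitation (a five-word sentence does not fit the four-slot pattern) is an assumption about what the slide intends.

```python
# Toy sketch: apply the rule "A B C D -> C ng D A B" word for word.
RULE_SOURCE = ["A", "B", "C", "D"]
RULE_TARGET = ["C", "ng", "D", "A", "B"]

# Hypothetical mini-lexicon, just large enough for the slide examples.
LEXICON = {
    "the": "ang",
    "mother": "nanay",
    "boy": "batang lalaki",
    "cooked": "nagluto",
    "went": "umuwi",
    "ate": "kumain",
    "fish": "isda",
    "home": "bahay",
}


def apply_rule(sentence):
    """Translate a sentence with the rule, or return None if it does not fit."""
    words = sentence.rstrip(".").split()
    if len(words) != len(RULE_SOURCE):
        return None  # the pattern has exactly four slots
    slots = {slot: LEXICON[word.lower()] for slot, word in zip(RULE_SOURCE, words)}
    # Slot letters are replaced by translations; "ng" stays a literal word.
    text = " ".join(slots.get(token, token) for token in RULE_TARGET)
    return text[0].upper() + text[1:] + "."


print(apply_rule("The mother cooked fish."))
print(apply_rule("The mother went home."))
print(apply_rule("The boy ate the fish."))
```

The first two calls reproduce the translations from the preceding slides. The third returns `None`, because the second "the" gives the sentence a fifth word that the pattern has no slot for.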
Results of the MT Engine • Qualities of a Good Translation • Clarity – 3.3 • Accuracy – 3.2 • Naturalness – 2.8 • highest score of 5 • 100 respondents (5 linguists)
Challenge! • Language resources • Quality of translation is dependent on it. • Built from almost non-existent digital forms • manual vs. automatic construction Dictionary Grammar Sample Translations
Lexicon • Diksyunaryo ng Wikang Filipino • automatic construction (AeFLEX): • accuracy rate - 57% • Currently contains about 30,000+ entries • Challenge: Lexical resources • translation documents • part-of-speech tagger
Morphological Analyzer and Generator • Dictionary is incomplete • Create software that: • analyzes – determines the root word • generates – generates the inflected word • Given: eating -> eat -> kain -> kumakain • Challenge: Lexical resources • lexicon • part-of-speech tagger
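The chain eating -> eat -> kain -> kumakain can be sketched as analysis, lookup, and generation. This is a deliberately naive illustration under several assumptions: the English analyzer only strips "-ing", the `EN_FIL` table is hypothetical, and the generator covers only -um- verbs whose root starts with a vowel or with a consonant-vowel syllable.

```python
# Toy sketch: analyze an English "-ing" form, look up the Filipino root,
# then generate the imperfective (progressive) form of an -um- verb.
VOWELS = "aeiou"

# Hypothetical English-root to Filipino-root table.
EN_FIL = {"eat": "kain", "buy": "bili"}


def analyze_english(word):
    """Naive analyzer: strip a trailing "-ing" to get the root word."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


def generate_um_imperfective(root):
    """Reduplicate the first syllable, then add "um".

    Vowel-initial roots take "um" as a prefix (alis -> umaalis).
    Consonant-initial roots take it as an infix after the first
    consonant (kain -> kakain -> kumakain).
    """
    if root[0] in VOWELS:
        return "um" + root[0] + root
    reduplicated = root[:2] + root  # assumes a consonant-vowel first syllable
    return reduplicated[0] + "um" + reduplicated[1:]


def translate_progressive(english_word):
    root = analyze_english(english_word)
    return generate_um_imperfective(EN_FIL[root])


print(translate_progressive("eating"))
print(generate_um_imperfective("alis"))
```

Even this small sketch depends on a lexicon (`EN_FIL`) to connect the two roots, which is the dependency the slide lists as a challenge.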
Part-Of-Speech Tagger • automatic association of parts-of-speech to words in a document • Can? – kaya vs. lata • Baba? – chin or go down • Challenge : Lexical resource • corpora • lexicon • morphological analyzer • grammar
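The "can" ambiguity shows why the tag matters for translation. Below is a minimal sketch that assumes a single hand-written context rule (a preceding determiner signals a noun); the rule, the tag names, and the function names are illustrative only, and a real tagger would be trained on a tagged corpus.

```python
# Toy sketch: choose the translation of "can" from its part of speech.
DETERMINERS = {"a", "an", "the", "this", "that"}

# The two senses from the slide: modal "can" -> kaya, noun "can" -> lata.
TRANSLATIONS = {("can", "MODAL"): "kaya", ("can", "NOUN"): "lata"}


def tag_can(tokens, i):
    """Tag the token at position i as NOUN if a determiner precedes it."""
    previous = tokens[i - 1].lower() if i > 0 else ""
    return "NOUN" if previous in DETERMINERS else "MODAL"


def translate_can(sentence):
    """Return (tag, translation) for every occurrence of "can"."""
    tokens = sentence.rstrip(".?!").split()
    results = []
    for i, token in enumerate(tokens):
        if token.lower() == "can":
            tag = tag_can(tokens, i)
            results.append((tag, TRANSLATIONS[("can", tag)]))
    return results


print(translate_can("I can swim."))
print(translate_can("Can you open the can?"))
```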
Corpora • collection of translation-pair documents • used by the lexicon extractor and part-of-speech tagger, example-based MT • came from translation works of DLSU English majors, verified by linguists • consists of 207,000 words
Lexicon Resource Dependency • Lexicon • Corpus • POS Tagger • Morph AG (Morphological Analyzer and Generator)
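The dependencies listed under "Challenge" on the Lexicon, Morphological Analyzer, and Part-Of-Speech Tagger slides form a loop, which a small graph check makes explicit. The dictionary below is one reading of those slides (translation documents are treated as the corpus); the identifier names are assumptions made for this sketch.

```python
# Toy sketch: find which language resources depend on themselves
# through a chain of other resources.
DEPENDS_ON = {
    "lexicon": ["corpus", "pos_tagger"],
    "morph_ag": ["lexicon", "pos_tagger"],
    "pos_tagger": ["corpus", "lexicon", "morph_ag", "grammar"],
    "corpus": [],   # no prerequisites listed on the slides
    "grammar": [],  # no prerequisites listed on the slides
}


def reachable(start):
    """Return every resource that `start` needs, directly or indirectly."""
    seen = set()
    stack = list(DEPENDS_ON[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(DEPENDS_ON[node])
    return seen


# A resource is circular if it appears among its own prerequisites.
circular = sorted(name for name in DEPENDS_ON if name in reachable(name))
print(circular)
```

Under this reading, the lexicon, the morphological analyzer and generator, and the part-of-speech tagger each require the others, so none can be built fully automatically without a manually prepared starting point.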
Bringing it home … • 171 Philippine Languages (SIL) • No Philippine Corpora • Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc) • “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)
eWika: Digitalization of Philippine Languages • Build the Philippine Corpus • Build software tools to study or use the corpus • Across Regions • Across Forms and Genres • Across Languages
Across Regions • Web-based application: GLOBALIZATION • upload, download, tools • Contributors (Main players) • Verifiers • Server: DLSU-M commits to host the server for the next three years. • Terms of Use: Research purposes.
Across Languages • 171 Philippine Languages (SIL List) • start with 8 major languages • Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano • Filipino Sign Language
Across Forms and Genres • In various forms: • Text • Speech • Video: Filipino sign language • In various Genres: • Text – literary & creative, essays, news articles, religious, etc • Speech – scripted, conversations, etc • Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)
The dream of building electronic, online Philippine language resources and tools • Many many many major hurdles to overcome • NEEDED : Language Resources, Tools, & Peopleware