Explore data-driven approaches for developing African language technology, focusing on Swahili, Gikuyu, Luo, and more. Learn about machine translation paradigms and the significance of parallel corpus collections. Overcome challenges in data collection and alignment for successful implementation.
The Sawa Corpus: A Parallel Corpus English - Swahili • Guy De Pauw (guy.depauw@aflat.org) • Peter Waiganjo Wagacha (waiganjo@aflat.org) • Gilles-Maurice de Schryver (gillesmaurice.deschryver@aflat.org)
Resource-scarceness • Language technology vs the digital divide • Digital data increasingly important for African languages (web, mobile phone, …) • But: most research on African languages is rooted in the knowledge-based paradigm (↔ LT for Indo-European languages): • Hand-crafted expert systems • Typically high accuracy within the domain • Limited portability to other languages and subdomains • Costly development phase • Limited resources (linguistic, expertise, financial, …) • Need for a cheaper and faster (language-independent) alternative for developing African language technology
Data-driven approaches • For Indo-European and Asian languages: the data-driven, corpus-based approach has been the dominant paradigm since the 1990s • Basic methodology: automatically extract linguistic knowledge from annotated text material (corpus) and bootstrap the development of language technology components • Advantages: • Language independence: portability (!!!!) • Knowledge-acquisition bottleneck → data-acquisition bottleneck • Robustness • AfLaT-team: explore application of the data-driven paradigm to African languages (Swahili, Gikuyu, Luo, Northern Sotho, …)
Machine Translation • 3 paradigms: • Rule-based MT • Statistical MT (data-driven) • Example-based MT (data-driven) • Data-driven MT learns translation from examples: !! Parallel corpus !!
Parallel Corpus • Collection of translated texts in two different languages, aligned on paragraph, sentence, phrase and/or word level • Sawa Corpus: parallel corpus English - Swahili
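As a rough illustration of what one entry in such a corpus might look like in code, here is a minimal sketch. The class name `AlignedPair`, its fields, and the index-pair encoding of word links are assumptions made for this example; they are not the Sawa Corpus file format.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedPair:
    """One aligned unit of a parallel corpus (hypothetical structure)."""

    english: str
    swahili: str
    # Word links as (english_token_index, swahili_token_index) pairs.
    # The list stays empty until the pair has been word-aligned.
    word_links: list[tuple[int, int]] = field(default_factory=list)

    def linked_words(self) -> list[tuple[str, str]]:
        """Return the linked tokens as (english_word, swahili_word) pairs."""
        en, sw = self.english.split(), self.swahili.split()
        return [(en[i], sw[j]) for i, j in self.word_links]


# A heading pair taken from the Universal Declaration example.
pair = AlignedPair(english="Preamble", swahili="UTANGULIZI", word_links=[(0, 0)])
print(pair.linked_words())
```

Storing links as index pairs rather than word pairs keeps repeated words in a sentence unambiguous.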
Example • Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote." • UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU • UTANGULIZI • Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani, • Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote, • Universal Declaration of Human Rights • Preamble • Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, • Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,
3 phases • Data-collection: finding parallel texts • Data-constitution: aligning the parallel texts on word level • Data-exploitation: Statistical Machine Translation, bootstrapping linguistic annotation
Data Collection • Limited availability of parallel texts English – Kiswahili: • Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights • Bible, Quran, secular literature • New translations • “There is no data like more data”
Data Collection • Even if the source data is digitally available beforehand, we are often faced with tough alignment problems during data constitution. e.g. paragraph alignment
Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote." • UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU • UTANGULIZI • Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani, • Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote, • Universal Declaration of Human Rights • Preamble • Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, • Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,
e.g. sentence alignment • Article 12 • No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. • Everyone has the right to the protection of the law against such interference or attacks. • Kifungu cha 12 • Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa yake, ya nyumbani mwake au ya barua zake. • Wala asivunjiwe heshima na sifa yake. • Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo kama hayo.
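The Article 12 example shows why sentence alignment is not a simple one-to-one pairing: the first English sentence corresponds to two Swahili sentences. The sketch below is a deliberately simplified length-based aligner in the spirit of Gale and Church, using the absolute difference in character counts as the cost and allowing only 1-1, 1-2 and 2-1 groupings. It is an illustrative assumption, not the procedure used for the Sawa Corpus, which the slides describe as manually sentence aligned; the function name `align_sentences` is hypothetical.

```python
def align_sentences(src, tgt):
    """Group sentences into 1-1, 1-2 or 2-1 beads by minimising length mismatch."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    bead_shapes = [(1, 1), (1, 2), (2, 1)]
    # cost[i][j]: cheapest way to align the first i source and j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in bead_shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_src - len_tgt)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed bead shapes")
    # Walk the back-pointers from the end to recover the beads in order.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


english = [
    "No one shall be subjected to arbitrary interference with his privacy, "
    "family, home or correspondence, nor to attacks upon his honour and reputation.",
    "Everyone has the right to the protection of the law against such "
    "interference or attacks.",
]
swahili = [
    "Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa "
    "yake, ya nyumbani mwake au ya barua zake.",
    "Wala asivunjiwe heshima na sifa yake.",
    "Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo "
    "kama hayo.",
]
beads = align_sentences(english, swahili)
print(beads)  # [([0], [0, 1]), ([1], [2])]
```

A production aligner would use a probabilistic length model and also allow insertions and deletions; this version only shows the dynamic-programming idea.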
Available data in Sawa Corpus All manually sentence aligned!
Thanks to Mahmoud Shokrollahi-Far, University College of NabiyeAkram (Iran)
Thanks to Dr. James Omboga Zaja, University of Nairobi
Word alignment • Most difficult task: relate words between languages • No , she , uh , up north ‘s • La , yuko , aa , juu kaskazini
Word alignment • You caught me skiving , I ‘m afraid . . • Samahani , umenidaka nikihepa
Word alignment • Can be done automatically using established tools (GIZA++) • Provide manual reference to evaluate automatic word alignment tools (5000 words)
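One common way to score automatic word alignments against a manual reference is to compare the two sets of links and report precision, recall and F-score. The slides do not state which metric was used, so the sketch below is an assumption; the function name `alignment_scores` and the toy link sets are invented for illustration.

```python
def alignment_scores(predicted, reference):
    """Precision, recall and F-score of predicted links against reference links.

    Each link is a (source_token_index, target_token_index) pair.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented toy data: four manual links, three automatic links, two in common.
reference_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted_links = {(0, 0), (1, 1), (2, 3)}
precision, recall, f_score = alignment_scores(predicted_links, reference_links)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")  # P=0.667 R=0.500 F=0.571
```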
Current results Still a lot of room for improvement
Word alignment Some alignment patterns are easy No , she , uh , up north ‘s La , yuko , aa , juu kaskazini
Alignment problems • I have turned him down • nimemkatalia
Morphological decomposition • I have turned him down • ni+ me+ m+ katalia
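The decomposition on this slide can be mimicked with a toy prefix stripper. The sketch below is only an illustration: the slot names and the one-entry prefix lists are assumptions chosen to reproduce this single example, and real Swahili morphological analysis needs a far richer lexicon and rule set.

```python
# Hypothetical prefix slots, in the order they are tried.
# Each list holds only the prefix needed for the slide's example.
PREFIX_SLOTS = [
    ("subject", ["ni"]),
    ("tense_aspect", ["me"]),
    ("object", ["m"]),
]


def decompose(word):
    """Strip at most one known prefix per slot, left to right; the rest is the stem."""
    morphemes = []
    rest = word
    for _slot_name, prefixes in PREFIX_SLOTS:
        for prefix in prefixes:
            # Require a non-empty remainder so the stem is never consumed.
            if rest.startswith(prefix) and len(rest) > len(prefix):
                morphemes.append(prefix + "+")
                rest = rest[len(prefix):]
                break
    morphemes.append(rest)
    return morphemes


print(" ".join(decompose("nimemkatalia")))  # ni+ me+ m+ katalia
```

Splitting the verb this way gives the aligner several Swahili units to link to the English words, at the cost of having to recombine morphemes when decoding.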
Current results • Morpheme/Word alignment • Better alignment, but more complicated decoding
Future work • Projection of Annotation • Refine GIZA++ alignment • Part-of-speech tagger • No data like more data: web-mining & comparable corpora • Example-based MT (omegaT) • Statistical MT (Moses)
Conclusion • Modest, but workable parallel corpus English – Swahili • Bi-directional Machine Translation is now in the cards • Modest, but encouraging word alignment scores • Data-driven approach is viable for African languages