
Language technology in Africa: Prospects

Arvi Hurskainen, University of Helsinki


Presentation Transcript


  1. Language technology in Africa: Prospects Arvi Hurskainen University of Helsinki

  2. Why LT for African languages? • LT is currently considered a necessary field of development in most languages. • Why should African languages be neglected?

  3. Current state • Compared with other continents, LT in Africa is only taking its first steps.

  4. Current state • The latest issue of MultiLingual, a periodical with 15,000 subscribers, was supposed to concentrate on LT in Africa. • The only article discussing genuine LT was the one describing Swahili Language Manager (SALAMA) • Another article on Africa was written by a freelancer on public domain localization in South Africa. • That was all for Africa.

  5. In LT the gap between well-resourced and poorly resourced languages is bigger than in any other field. • My impression is that even today half of global investment in LT goes to English.

  6. African languages are triply handicapped: • Commercial sector not interested • Local governments poor – little or no public support • African languages have features that need different approaches than those used in main-stream LT

  7. Language technology (LT) • Labour-intensive • Trivial results quickly • Useful results require several man-years of work • Although the development of LT is expensive, the results can be very rewarding

  8. Language technology (LT) • LT built on a modular basis can result in several kinds of applications • An additional application can make use of earlier modules and thus costs can be reduced • Once developed, LT applications can be widely distributed with minimal cost

  9. Language technology (LT) • Experience of LT in other languages available • Wrong tracks can be avoided • Solutions applied in other languages can be tested in African languages

  10. Language technology (LT) • LT for African languages is NOT a mere application of methods from other languages • African languages have special features • Very rich morphology • Noun classes • Complex verb formation • Serial verbs • Non-concatenative processes • Reduplication • Inflecting idioms and other multi-word expressions • Tones (lexical and grammatical)

  11. Feasibility of LT in Africa • Question: If African languages have several special features regarded as ‘problems’, is it feasible to develop language technology for those languages? • Answer: Some ‘problems’ can be turned into advantages

  12. Rich morphology • Requires efficient development environment to succeed, but • Can be very useful in disambiguation (= choice of correct interpretation) and syntactic analysis.

  13. Poor morphology vs. rich morphology • Poor morphology (e.g. English) • Easy to analyze morphologically, but • Difficult to disambiguate and analyze syntactically and semantically • Rich morphology (e.g. Bantu languages) • Difficult to analyze morphologically, but • Less difficult to disambiguate and analyze syntactically

  14. LT applications • Applications for end-users: • spelling correctors • hyphenators • grammar checkers • thesauri • electronic dictionaries • MT applications • multilingual speech applications

  15. LT applications • Applications for developers: • dictionary compilers • dictionary evaluators • MT development environments • information retrieval and data mining

  16. Machine Translation (MT) • Text-to-text MT • Official texts (government, AU, UN, SADC, business, manuals, teaching) • News texts • Communication through email in international organizations • Speech-to-speech MT • Simultaneous interpretation • Multilingual phone calls

  17. Phases of speech-to-speech MT • Speech recognition • Transforming speech signal to text • Tokenization of text • Identifying ‘words’, punctuation marks, diacritics etc. • Morphological analysis • Analyzing each morphological unit and providing it with codes (tags) • Morphological disambiguation • Determining correct interpretation

  18. Phases of speech-to-speech MT • Syntactic mapping • Providing words with syntactic tags • Semantic disambiguation • Choosing the correct semantic meaning • Multi-word units • Isolating multi-word expressions and giving correct interpretation • Idioms • Proverbs • Adjectival expressions • Compound nouns • Serial verb constructions

  19. Phases of speech-to-speech MT • Managing word order • Re-ordering word sequences to meet the rules of the target language • Inclusion and exclusion of pronouns and articles • Producing surface forms of target language • Clean text in target language • Text-to-speech conversion

  20. 1. Tokenization *mtu aliyepata taarifa alipiga simu , kukaa na kungoja
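The tokenization step above can be sketched in a few lines of Python. This is only an illustration, not SALAMA's tokenizer: the function name `tokenize` is hypothetical, and the reading of `*` as a marker for an initial capital letter is an assumption drawn from the `CAP` tag that appears in the following slide.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens.

    Assumption: a leading '*' marks a word that was capitalised in the input.
    """
    tokens = []
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if tok[0].isupper():
            tok = "*" + tok.lower()
        tokens.append(tok)
    return tokens

print(" ".join(tokenize("Mtu aliyepata taarifa alipiga simu, kukaa na kungoja")))
# *mtu aliyepata taarifa alipiga simu , kukaa na kungoja
```

A real tokenizer would also need to keep word-internal apostrophes (as in ng'ombe) together, which this simple pattern does not do.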

  21. 2. Morphological analysis
     *mtu
          "mtu" N CAP 1/2-SG { the } { man }
     aliyepata
          "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO
     taarifa
          "taarifa" N 9/10-SG { the } { report } AR
          "taarifa" N 9/10-PL { the } { report } AR
     alipiga
          "piga" V 1/2-SG3-SP VFIN { he/she } PAST z [piga] { hit } SVO ACT
          "piga" V 1/2-SG3-SP VFIN { he/she } PR:a 5/6-SG-OBJ OBJ { it } z [piga] { hit } SVO ACT
     simu
          "simu" N 9/10-SG { the } { telephone }
          "simu" N 9/10-SG { the } { type of sardine or sprat } AN
          "simu" N 9/10-PL { the } { telephone }
          "simu" N 9/10-PL { the } { type of sardine or sprat } AN
     ,
          "," COMMA { , }
     kukaa
          "kaa" V INF { to } z [kaa] { sit } SV SVO
          "kaa" V INF NO-TO z [kaa] { sit } SV SVO
     na
          "na" CC { and }
          "na" AG-PART { by }
          "na" PREP { with }
          "na" NA-POSS { of }
          "na" ADV NOART { past }
     kungoja
          "ngoja" V INF { to } z [ngoja] { wait } SV
          "ngoja" V INF NO-TO z [ngoja] { wait } SV
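To make the structure of the analyser output concrete, the sketch below parses individual reading lines into a lemma, a tag list and a list of English glosses, and then counts how ambiguous one word is. The `Reading` class and `parse_reading` function are hypothetical helpers written for this illustration; only the reading lines themselves come from the slide.

```python
import re
from dataclasses import dataclass

@dataclass
class Reading:
    lemma: str
    tags: list
    glosses: list

def parse_reading(line):
    """Parse one reading: quoted lemma, tags, and { ... } glosses."""
    match = re.match(r'\s*"([^"]*)"(.*)', line)
    lemma, rest = match.group(1), match.group(2)
    glosses = [g.strip() for g in re.findall(r"\{([^}]*)\}", rest)]
    # Tags are what remains once glosses and [root] annotations are removed.
    tags = re.sub(r"\{[^}]*\}|\[[^\]]*\]", " ", rest).split()
    return Reading(lemma, tags, glosses)

# The four readings of "simu" shown on the slide.
SIMU = [
    '"simu" N 9/10-SG { the } { telephone }',
    '"simu" N 9/10-SG { the } { type of sardine or sprat } AN',
    '"simu" N 9/10-PL { the } { telephone }',
    '"simu" N 9/10-PL { the } { type of sardine or sprat } AN',
]
readings = [parse_reading(r) for r in SIMU]
print(len(readings), "readings;", sorted({r.glosses[-1] for r in readings}))
```

The word is ambiguous both in number (singular or plural of class 9/10) and in meaning, which is what the disambiguation phase has to resolve.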

  22. 3. Disambiguation, isolating MWE
     *mtu
          "mtu" N 1/2-SG { the } { man } @SUBJ
     aliyepata
          "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO @FMAINVtr+OBJ>
     taarifa
          "taarifa" N 9/10-SG { the } { report } AR @OBJ
     alipiga
          "piga" V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ>
     simu
          "simu" <IDIOM { call }
     ,
          "," COMMA { , }
     kukaa
          "kaa" V INF { to } z [kaa] { sit } SV SVO @-FMAINV-n
          "kaa" V INF NO-TO z [kaa] { sit } SV SVO @-FMAINV-n
     na
          "na" CC { and } @CC
     kungoja
          "ngoja" V INF { to } z [ngoja] { wait } SV SVO @-FMAINV-n
          "ngoja" V INF NO-TO z [ngoja] { wait } SV SVO @-FMAINV-n
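One of the choices made in this step, keeping only the conjunction reading of "na" between two infinitives, can be imitated with a context-dependent selection rule. The sketch below is written in the spirit of a constraint-grammar SELECT rule, but it is plain Python and the rule itself is a hypothetical one invented for this example; it is not taken from SALAMA's rule set.

```python
def select_cc_between_infinitives(sentence):
    """sentence: list of (word, readings); each reading is a list of tags.

    Hypothetical rule: if every reading of both neighbours is an infinitive,
    keep only the CC (coordinating conjunction) readings of the current word.
    """
    result = []
    for i, (word, readings) in enumerate(sentence):
        if 0 < i < len(sentence) - 1:
            prev_inf = all("INF" in r for r in sentence[i - 1][1])
            next_inf = all("INF" in r for r in sentence[i + 1][1])
            cc = [r for r in readings if "CC" in r]
            if prev_inf and next_inf and cc:
                readings = cc
        result.append((word, readings))
    return result

sentence = [
    ("kukaa", [["V", "INF"], ["V", "INF", "NO-TO"]]),
    ("na", [["CC"], ["AG-PART"], ["PREP"], ["NA-POSS"], ["ADV", "NOART"]]),
    ("kungoja", [["V", "INF"], ["V", "INF", "NO-TO"]]),
]
for word, readings in select_cc_between_infinitives(sentence):
    print(word, readings)
```

The rule only removes readings when a CC reading is actually present, so a word is never left without any analysis.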

  23. 4. Isolating MWE • ( N 1/2-SG { the } { man } @SUBJ ) ( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) ( N 9/10-SG { the } { report } @OBJ ) (V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call }) ( COMMA { , } ) ( V INF { to } z { sit } SV SVO @-FMAINV-n ) ( CC { and } @CC ) ( V INF { to } z { wait } SV @-FMAINV-n )

  24. 5. Word-per-line format
     ( N 1/2-SG { the } { man } @SUBJ )
     ( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> )
     ( N 9/10-SG { the } { report } @OBJ )
     ( V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } )
     ( COMMA { , } )
     ( V INF { to } z { sit } SV SVO @-FMAINV-n )
     ( CC { and } @CC )
     ( V INF { to } z { wait } SV @-FMAINV-n )

  25. 6. Copying info on serial verbs
     ( N 1/2-SG { the } { man } @SUBJ )
     ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> )
     ( N 9/10-SG { the } { report } @OBJ )
     ( V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } )
     ( COMMA { , } )
     ( V 1/2-SG3-SP VFIN PAST z { sit } SV SVO @FMAINV-n )
     ( CC { and } @CC )
     ( V 1/2-SG3-SP VFIN PAST z { wait } SV SVO @FMAINV-n )
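The effect of this step is that the infinitives following the finite verb receive its subject, finiteness and tense tags. A minimal sketch of that copying follows, assuming each unit is represented as a small dictionary; the representation and the function name `copy_serial_verb_features` are illustrative, not SALAMA's internal format. Removal of the pronoun and `{ to }` glosses, which also happens between the two slides, is not modelled here.

```python
def copy_serial_verb_features(units):
    """Give each infinitive the features of the closest preceding finite verb."""
    last_finite = None
    result = []
    for unit in units:
        unit = dict(unit)  # copy, so the input list is left untouched
        if unit["pos"] == "V":
            if "VFIN" in unit["features"]:
                last_finite = unit["features"]
            elif "INF" in unit["features"] and last_finite is not None:
                unit["features"] = list(last_finite)
        result.append(unit)
    return result

units = [
    {"pos": "N", "features": ["1/2-SG"], "gloss": "man"},
    {"pos": "V", "features": ["1/2-SG3-SP", "VFIN", "PAST", "1/2-SG-REL"], "gloss": "get"},
    {"pos": "N", "features": ["9/10-SG"], "gloss": "report"},
    {"pos": "V", "features": ["1/2-SG3-SP", "VFIN", "PAST"], "gloss": "call"},
    {"pos": "COMMA", "features": [], "gloss": ","},
    {"pos": "V", "features": ["INF"], "gloss": "sit"},
    {"pos": "CC", "features": [], "gloss": "and"},
    {"pos": "V", "features": ["INF"], "gloss": "wait"},
]
for unit in copy_serial_verb_features(units):
    print(unit["gloss"], unit["features"])
```

Taking the closest preceding finite verb means that "sit" and "wait" inherit from "call" rather than from the relative-clause verb "get".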

  26. 7. Construct word order
     ( N 1/2-SG { the } { man } @SUBJ )
     ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> )
     ( N 9/10-SG { the } { report } @OBJ )
     ( V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } )
     ( COMMA { , } )
     ( V 1/2-SG3-SP VFIN PAST z { sit } SV SVO @FMAINV-n )
     ( CC { and } @CC )
     ( V 1/2-SG3-SP VFIN PAST z { wait } SV SVO @FMAINV-n )

  27. 8. Surface form in target language
     ( N 1/2-SG { the } { man } @SUBJ )
     ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { :got } SVO @FMAINVtr+OBJ> )
     ( N 9/10-SG { the } { report } @OBJ )
     ( V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { :called } )
     ( COMMA { , } )
     ( V 1/2-SG3-SP VFIN PAST z { :sat } SV SVO @-FMAINV-n )
     ( CC { and } @CC )
     ( V 1/2-SG3-SP VFIN PAST z { :waited } SV @-FMAINV-n )
     Translation: the man who got the report called, sat and waited
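The last step turns base-form glosses into inflected English words and strings them together. The sketch below reproduces the slide's translation under strong simplifying assumptions: the two-entry irregular-verb table, the dictionary representation of units and the function names are all hypothetical, and a real generator would need a full English morphology component.

```python
IRREGULAR_PAST = {"get": "got", "sit": "sat"}  # tiny illustrative table

def past_tense(verb):
    """Return the English past form, using the table or a regular suffix."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    return verb + "d" if verb.endswith("e") else verb + "ed"

def realise(units):
    """Inflect verb glosses tagged PAST and join all glosses into a sentence."""
    words = []
    for unit in units:
        glosses = list(unit["glosses"])
        if unit["pos"] == "V" and "PAST" in unit["features"]:
            glosses[-1] = past_tense(glosses[-1])
        words.extend(glosses)
    return " ".join(words).replace(" ,", ",")

units = [
    {"pos": "N", "features": ["1/2-SG"], "glosses": ["the", "man"]},
    {"pos": "V", "features": ["VFIN", "PAST"], "glosses": ["who", "get"]},
    {"pos": "N", "features": ["9/10-SG"], "glosses": ["the", "report"]},
    {"pos": "V", "features": ["VFIN", "PAST"], "glosses": ["call"]},
    {"pos": "COMMA", "features": [], "glosses": [","]},
    {"pos": "V", "features": ["VFIN", "PAST"], "glosses": ["sit"]},
    {"pos": "CC", "features": [], "glosses": ["and"]},
    {"pos": "V", "features": ["VFIN", "PAST"], "glosses": ["wait"]},
]
print(realise(units))
# the man who got the report called, sat and waited
```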

  28. Organizing the work • How should the work be organised on the continent of hundreds of languages? • Prioritising languages • ‘Big’ languages first due to their strategic importance • Some minor languages may have special political or scientific importance

  29. Organizing the work • Scientific infrastructure • Such as ELRA (European Language Resources Association) and • ELDA (Evaluations and Language resources Distribution Agency) • Africa needs something similar • An initiative to establish such an infrastructure was launched at the LREC 2006 conference in Genoa

  30. Organizing the work • Networking extremely important • Geographical distances between actors are immense • Ensures efficient communication and distribution of ideas • Ensures that the best and tested approaches will become a standard in LT • Motivates in this tough work

  31. Networking • A Wikipedia type forum as an information and discussion centre for LT in Africa http://forums.csc.fi/kitwiki/pilot/view/KitWiki/Community/AfricanActivities

  32. KitWiki/Community/AfricanActivities Organizations, networks and activities related to LT for African languages Key Areas LT Policy LT Resources • Helsinki Corpus of Swahili Corpus Of Swahili LT Research and Development • SALAMA - Swahili Language Manager SALAMA • Nordic Journal of African Studies NJAS LT Training and Education LT Legislation LT Business Activities Other Activities This topic: KitWiki/Community > WebHome > AfricanActivities History: r3 - 29 Jul 2006 - 09:03 - ArviHurskainen

  33. EDULINK initiative • In 2006 the EU started to support networking between higher education institutions • EDULINK-ACP-EU Cooperation Programme in Higher Education

  34. EDULINK initiative • EDULINK is the first ACP-EU Cooperation Programme in Higher Education • EDULINK is financed by the European Commission under the 9th EDF and is managed by the ACP Secretariat.

  35. EDULINK initiative • EDULINK promotes networking of HEIs in ACP States and the eligible EU Member States through funding of joint projects.

  36. EDULINK initiative: Language technology for African languages • Consortium of five universities • Dar-es-Salaam • Nairobi • Ghana • Hawassa (Ethiopia) • Helsinki • Associates • UNISA, Stellenbosch, SA • Trondheim

  37. EDULINK initiative: Language technology for African languages • Aims • Training in LT • Workshops • Training courses • Summer School in LT • Evaluation • Developing new LT • Language corpora • Morphological parsers • Speech technology • MT (further development of SALAMA)

  38. Development environments • Environments with property rights • Can be obtained through licensing for development purposes • May also be available at a nominal price, e.g. the xfst package of Xerox • Cannot be included in a product without a separate agreement with the property owner

  39. Development environments • Open domain environments • Free for development • Free for inclusion into a product

  40. Availability of development environments • In morphology • The xfst package of Xerox, which uses finite-state methods, is the most popular • Free for development but not free for inclusion in a product

  41. Availability of development environments • In disambiguation and syntactic mapping • CG-2 and Functional Dependency Grammar (FDG) of Connexor • Only through licensing • Not free for inclusion into a product

  42. Availability of development environments • In disambiguation and syntactic mapping • CG-3 is an open source product • Free for developing • Free for inclusion into a product • http://beta.visl.sdu.dk/constraint_grammar.html

  43. Developing open source technology • Efforts to move SALAMA to open domain

  44. Two implementations of SALAMA Comparison of two methods for morphological analysis • Analysis using finite state method (PR) and • Analysis using two-phase method (OS)

  45. Two implementations of SALAMA Finite state method • Good • Very fast, 4,500 words per second in SWATWOL • Facilitates description on more than one level • Two-level description most common

  46. Two implementations of SALAMA Finite state method • Good • The use of two-level rules simplifies the structure of the dictionary • The whole morphology can be described in one phase • Can be used for simulating linguistic processes (good for research purposes)

  47. Two implementations of SALAMA Finite state method • Bad • Handles non-concatenative processes with difficulty (does not ‘see behind’) • Writing a reliable rule system is difficult • In constructing the lexicon, the influence of the rules must be anticipated

  48. Two implementations of SALAMA Finite state method • Bad • Because the lexicon is a tree-structure, the whole language should be described with one single lexicon • Difficulties in compiling very large lexicons • No open source platform available

  49. Two implementations of SALAMA Two-phase method - description • In the first phase, the word is described using pattern matching rules • Produces meta-tags with two parts • Example: • unanifundisha “fundisha” [funda] V uSP naTAM niOBJ ishaVE • uSP • u = string in the word • SP = tag meaning subject prefix

  50. Two implementations of SALAMA Two-phase method • In the second phase, meta-tags are rewritten as final tags • uSP > • 1/2-SG2-SP VFIN { you } • 3/4-SG-SP VFIN { it } • 11-SG-SP VFIN { it }
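The two phases can be illustrated with a toy example built around the slide's word "unanifundisha". The regular expression, the small affix inventories in it and the function names are assumptions made for this sketch, and only the rewriting of `uSP` is taken from the slide; other meta-tags are passed through unchanged because their final tags are not given.

```python
import re

# Phase 1 (toy pattern): subject prefix, tense marker, optional object prefix,
# root and verbal extension. The affix inventories are illustrative only.
VERB_PATTERN = re.compile(
    r"^(?P<SP>u|a|ni)(?P<TAM>na|li|ta)(?P<OBJ>ni|ku|m)?(?P<root>fund)(?P<VE>isha)$"
)

def phase_one(word):
    """Return meta-tags: matched string plus the label of its slot."""
    match = VERB_PATTERN.match(word)
    if match is None:
        return []
    return [match.group(slot) + slot
            for slot in ("SP", "TAM", "OBJ", "VE")
            if match.group(slot)]

# Phase 2: rewrite meta-tags as final tags (only uSP is known from the slide).
REWRITE = {
    "uSP": [
        "1/2-SG2-SP VFIN { you }",
        "3/4-SG-SP VFIN { it }",
        "11-SG-SP VFIN { it }",
    ],
}

def phase_two(meta_tag):
    """Expand a meta-tag into its final readings, or pass it through."""
    return REWRITE.get(meta_tag, [meta_tag])

meta = phase_one("unanifundisha")
print(meta)
for tag in meta:
    print(tag, "->", phase_two(tag))
```

Because the first phase only labels strings, the ambiguity of a prefix such as u- does not have to be encoded in the pattern; it appears only when the meta-tag is expanded in the second phase.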
