Language technology in Africa: Prospects. Arvi Hurskainen University of Helsinki. Why LT for African languages?. LT is currently considered a necessary field of development in most languages. Why should African languages be neglected? . Current state.
Language technology in Africa: Prospects Arvi Hurskainen University of Helsinki
Why LT for African languages? • LT is currently considered a necessary field of development in most languages. • Why should African languages be neglected?
Current state • Compared with other continents, LT in Africa is only taking its first steps.
Current state • The latest issue of MultiLingual, a periodical with 15,000 subscribers, was supposed to concentrate on LT in Africa. • The only article discussing genuine LT was the one describing the Swahili Language Manager (SALAMA). • Another article on Africa was written by a freelancer on public domain localization in South Africa. • That was all for Africa.
In LT the gap between well-resourced and poorly resourced languages is bigger than in any other field. • My impression is that even today half of global investment in LT goes to English.
African languages are triply handicapped: • Commercial sector not interested • Local governments poor: little or no public support • African languages have features that need different approaches from those used in mainstream LT
Language technology (LT) • Labour-intensive • Trivial results come quickly • Useful results require several man-years of work • Although the development of LT is expensive, the results can be very rewarding
Language technology (LT) • LT built on a modular basis can result in several kinds of applications • An additional application can make use of earlier modules and thus costs can be reduced • Once developed, LT applications can be widely distributed with minimal cost
Language technology (LT) • Experience of LT in other languages available • Wrong tracks can be avoided • Solutions applied in other languages can be tested in African languages
Language technology (LT) • LT for African languages is NOT a mere application of methods from other languages • African languages have special features • Very rich morphology • Noun classes • Complex verb formation • Serial verbs • Non-concatenative processes • Reduplication • Inflecting idioms and other multi-word expressions • Tones (lexical and grammatical)
Feasibility of LT in Africa • Question: If African languages have several special features regarded as ‘problems’, is it feasible to develop language technology for those languages? • Answer: Some ‘problems’ can be turned into advantages
Rich morphology • Requires efficient development environment to succeed, but • Can be very useful in disambiguation (= choice of correct interpretation) and syntactic analysis.
Poor morphology vs. rich morphology • Poor morphology (e.g. English): easy to analyze morphologically, but difficult to disambiguate and analyze syntactically and semantically • Rich morphology (e.g. Bantu languages): difficult to analyze morphologically, but less difficult to disambiguate and analyze syntactically
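A small sketch of how rich morphology can help disambiguation: agreement between the class of the subject noun and the subject prefix of the verb prunes readings. The three readings of the subject prefix u- are the ones listed in the two-phase SALAMA example in this presentation; the helper name `agree` and the matching on the class label are assumptions made for illustration.

```python
# Readings of the Swahili subject prefix u- (from the two-phase example).
U_READINGS = ["1/2-SG2-SP", "3/4-SG-SP", "11-SG-SP"]

def agree(noun_tag, verb_readings):
    """Keep subject-prefix readings whose class label matches the noun.

    noun_tag: class and number of the subject noun, e.g. '3/4-SG'.
    A reading agrees when it starts with the same label, e.g. '3/4-SG-SP'.
    If nothing agrees, all readings are kept (no reading is ever lost).
    """
    kept = [r for r in verb_readings if r.startswith(noun_tag + "-")]
    return kept or verb_readings

# A class 3/4 singular subject leaves a single reading of u-.
print(agree("3/4-SG", U_READINGS))
print(agree("11-SG", U_READINGS))
```

A language with poor morphology offers no such agreement marking, so the same pruning would have to come from syntactic or semantic context.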
LT applications • Applications for end-users: • spelling correctors • hyphenators • grammar checkers • thesauri • electronic dictionaries • MT applications • multilingual speech applications
LT applications • Applications for developers: • dictionary compilers • dictionary evaluators • MT development environments • information retrieval and data mining
Machine Translation (MT) • Text-to-text MT • Official texts (government, AU, UN, SADC, business, manuals, teaching) • News texts • Communication through email in international organizations • Speech-to-speech MT • Simultaneous interpretation • Multilingual phone calls
Phases of speech-to-speech MT 1. Speech recognition • Transforming speech signal to text 2. Tokenization of text • Identifying ‘words’, punctuation marks, diacritics etc. 3. Morphological analysis • Analyzing each morphological unit and providing it with codes (tags) 4. Morphological disambiguation • Determining correct interpretation
Phases of speech-to-speech MT 5. Syntactic mapping • Providing words with syntactic tags 6. Semantic disambiguation • Choosing the correct meaning 7. Multi-word units • Isolating multi-word expressions and giving correct interpretation (idioms, proverbs, adjectival expressions, compound nouns, serial verb constructions)
Phases of speech-to-speech MT 8. Managing word order • Re-ordering word sequences to meet the rules of the target language • Inclusion and exclusion of pronouns and articles 9. Producing surface forms of target language • Clean text in target language 10. Text-to-speech conversion
1. Tokenization *mtu aliyepata taarifa alipiga simu , kukaa na kungoja
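The tokenized line above can be reproduced with a minimal sketch. It assumes, as the example suggests, that punctuation marks become separate tokens and that an initial capital is rewritten as `*` plus the lower-case letter. The function name `tokenize` and the regular expression are illustrative and are not SALAMA's actual tokenizer.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens.

    Assumption: a word with an initial capital is written as '*' followed
    by the lower-cased first letter, mirroring '*mtu' in the example.
    """
    tokens = []
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if tok[0].isupper():
            tok = "*" + tok[0].lower() + tok[1:]
        tokens.append(tok)
    return tokens

print(" ".join(tokenize("Mtu aliyepata taarifa alipiga simu, kukaa na kungoja")))
```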
2. Morphological analysis
*mtu
    "mtu" N CAP 1/2-SG { the } { man }
aliyepata
    "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO
taarifa
    "taarifa" N 9/10-SG { the } { report } AR
    "taarifa" N 9/10-PL { the } { report } AR
alipiga
    "piga" V 1/2-SG3-SP VFIN { he/she } PAST z [piga] { hit } SVO ACT
    "piga" V 1/2-SG3-SP VFIN { he/she } PR:a 5/6-SG-OBJ OBJ { it } z [piga] { hit } SVO ACT
simu
    "simu" N 9/10-SG { the } { telephone }
    "simu" N 9/10-SG { the } { type of sardine or sprat } AN
    "simu" N 9/10-PL { the } { telephone }
    "simu" N 9/10-PL { the } { type of sardine or sprat } AN
,
    "," COMMA { , }
kukaa
    "kaa" V INF { to } z [kaa] { sit } SV SVO
    "kaa" V INF NO-TO z [kaa] { sit } SV SVO
na
    "na" CC { and }
    "na" AG-PART { by }
    "na" PREP { with }
    "na" NA-POSS { of }
    "na" ADV NOART { past }
kungoja
    "ngoja" V INF { to } z [ngoja] { wait } SV
    "ngoja" V INF NO-TO z [ngoja] { wait } SV
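For later processing, each reading in the analysis above has to be taken apart into its fields. The sketch below does this under assumptions read off the example only: the lemma is the quoted string at the start, `{ ... }` holds an English gloss, `[...]` holds the verb root, and everything else (including the marker `z`) is kept as a tag. The function name `parse_reading` is hypothetical.

```python
import re

def parse_reading(line):
    """Parse one analyzer reading into lemma, tags, glosses and roots."""
    lemma = re.match(r'"([^"]*)"', line).group(1)
    # Everything after the closing quote of the lemma.
    rest = line[line.index('"', 1) + 1:]
    glosses = [g.strip() for g in re.findall(r"\{([^}]*)\}", rest)]
    roots = re.findall(r"\[([^\]]*)\]", rest)
    # Remove glosses and roots, then treat the remaining fields as tags.
    tags = re.sub(r"\{[^}]*\}|\[[^\]]*\]", " ", rest).split()
    return {"lemma": lemma, "tags": tags, "glosses": glosses, "roots": roots}

reading = parse_reading(
    '"pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO'
)
print(reading["lemma"], reading["tags"], reading["glosses"])
```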
3. Disambiguation, isolating MWE
*mtu
    "mtu" N 1/2-SG { the } { man } @SUBJ
aliyepata
    "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO @FMAINVtr+OBJ>
taarifa
    "taarifa" N 9/10-SG { the } { report } AR @OBJ
alipiga
    "piga" V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ>
simu
    "simu" <IDIOM { call }
,
    "," COMMA { , }
kukaa
    "kaa" V INF { to } z [kaa] { sit } SV SVO @-FMAINV-n
    "kaa" V INF NO-TO z [kaa] { sit } SV SVO @-FMAINV-n
na
    "na" CC { and } @CC
kungoja
    "ngoja" V INF { to } z [ngoja] { wait } SV SVO @-FMAINV-n
    "ngoja" V INF NO-TO z [ngoja] { wait } SV SVO @-FMAINV-n
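The reduction of `na` from five readings to one can be illustrated with a context rule in the spirit of a Constraint Grammar SELECT rule. This is a hypothetical rule written for the example sentence only, not a rule taken from SALAMA: it keeps the CC reading of `na` when every reading of the next word is an infinitive.

```python
def select_cc_before_infinitive(cohorts):
    """Keep only the CC reading of 'na' when the next word is an infinitive.

    cohorts: list of (wordform, readings); each reading is a list of tags.
    """
    result = []
    for i, (word, readings) in enumerate(cohorts):
        nxt = cohorts[i + 1][1] if i + 1 < len(cohorts) else []
        next_is_inf = bool(nxt) and all("INF" in r for r in nxt)
        if word == "na" and next_is_inf:
            kept = [r for r in readings if "CC" in r]
            # Never delete the last reading: fall back to the full list.
            readings = kept or readings
        result.append((word, readings))
    return result

cohorts = [
    ("kukaa", [["V", "INF"], ["V", "INF", "NO-TO"]]),
    ("na", [["CC"], ["AG-PART"], ["PREP"], ["NA-POSS"], ["ADV", "NOART"]]),
    ("kungoja", [["V", "INF"], ["V", "INF", "NO-TO"]]),
]
for word, readings in select_cc_before_infinitive(cohorts):
    print(word, readings)
```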
4. Isolating MWE • ( N 1/2-SG { the } { man } @SUBJ ) ( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) ( N 9/10-SG { the } { report } @OBJ ) (V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call }) ( COMMA { , } ) ( V INF { to } z { sit } SV SVO @-FMAINV-n ) ( CC { and } @CC ) ( V INF { to } z { wait } SV @-FMAINV-n )
5. Word-per-line format
( N 1/2-SG { the } { man } @SUBJ )
( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> )
( N 9/10-SG { the } { report } @OBJ )
( V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } )
( COMMA { , } )
( V INF { to } z { sit } SV SVO @-FMAINV-n )
( CC { and } @CC )
( V INF { to } z { wait } SV @-FMAINV-n )
6. Copying info on serial verbs ( N 1/2-SG { the } { man } @SUBJ ) ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) ( N 9/10-SG { the } { report } @OBJ ) (V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } ) ( COMMA { , } ) (V 1/2-SG3-SP VFIN PAST z { sit } SV SVO @FMAINV-n ) ( CC { and } @CC ) (V 1/2-SG3-SP VFIN PAST z { wait } SV SVO @FMAINV-n )
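In the step above, the infinitives in the verb series lose `INF` and receive the subject, finiteness and tense tags of the preceding finite verb. A minimal sketch of that copying follows. The data layout (a list of dicts) and the function name are assumptions, and only `PAST` is recognised as a tense tag here.

```python
def copy_finite_features(words):
    """Give infinitives in a verb series the features of the last finite verb.

    words: list of dicts with 'tags' (list of str) and 'gloss' (str).
    Assumption: an infinitive carries the tag 'INF', a finite verb 'VFIN'.
    """
    finite = None
    out = []
    for w in words:
        tags = list(w["tags"])
        if "VFIN" in tags:
            # Remember subject prefix, finiteness and tense of the finite verb.
            finite = [t for t in tags if t.endswith("-SP") or t in ("VFIN", "PAST")]
        elif "INF" in tags and finite is not None:
            i = tags.index("INF")
            tags[i:i + 1] = finite  # replace INF with the copied features
        out.append({**w, "tags": tags})
    return out

series = [
    {"gloss": "call", "tags": ["V", "1/2-SG3-SP", "VFIN", "PAST", "SVO"]},
    {"gloss": ",", "tags": ["COMMA"]},
    {"gloss": "sit", "tags": ["V", "INF", "SV", "SVO"]},
    {"gloss": "and", "tags": ["CC"]},
    {"gloss": "wait", "tags": ["V", "INF", "SV"]},
]
for w in copy_finite_features(series):
    print(w["gloss"], w["tags"])
```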
7. Construct word order ( N 1/2-SG { the } { man } @SUBJ ) ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) ( N 9/10-SG { the } { report } @OBJ ) (V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } ) ( COMMA { , } ) (V 1/2-SG3-SP VFIN PAST z { sit } SV SVO @FMAINV-n ) ( CC { and } @CC ) (V 1/2-SG3-SP VFIN PAST z { wait } SV SVO @FMAINV-n )
8. Surface form in target language ( N 1/2-SG { the } { man } @SUBJ ) ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { :got } SVO @FMAINVtr+OBJ> ) ( N 9/10-SG { the } { report } @OBJ ) ( V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { :called } ) ( COMMA { , } ) ( V 1/2-SG3-SP VFIN PAST z { :sat } SV SVO @-FMAINV-n ) ( CC { and } @CC ) ( V 1/2-SG3-SP VFIN PAST z { :waited } SV @-FMAINV-n )
Translation: the man who got the report called, sat and waited
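The last step turns glosses plus tags into English surface forms. The sketch below shows the idea for the example sentence only. The tiny irregular-verb table, the simplified group layout and the function names are illustrative assumptions, not part of SALAMA.

```python
IRREGULAR_PAST = {"get": "got", "sit": "sat"}  # tiny illustrative table

def past_form(verb):
    """Return the English past form of a verb lemma (very rough rules)."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"

def realise(groups):
    """Join the glosses of all groups, inflecting verbs tagged PAST."""
    words = []
    for glosses, tags in groups:
        glosses = list(glosses)
        if "V" in tags and "PAST" in tags:
            # Assumption: the last gloss in a verb group is the verb lemma.
            glosses[-1] = past_form(glosses[-1])
        words.extend(glosses)
    # Attach commas to the preceding word.
    return " ".join(words).replace(" ,", ",")

groups = [
    (["the", "man"], ["N"]),
    (["who", "get"], ["V", "PAST"]),
    (["the", "report"], ["N"]),
    (["call"], ["V", "PAST"]),
    ([","], ["COMMA"]),
    (["sit"], ["V", "PAST"]),
    (["and"], ["CC"]),
    (["wait"], ["V", "PAST"]),
]
print(realise(groups))
```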
Organizing the work • How should the work be organised on a continent with hundreds of languages? • Prioritising languages • ‘Big’ languages first due to their strategic importance • Some minor languages may have special political or scientific importance
Organizing the work • Scientific infrastructure • Such as ELRA (European Language Resources Association) and • ELDA (Evaluations and Language resources Distribution Agency) • Africa needs something similar • An initiative to establish such an infrastructure was launched at the LREC 2006 conference in Genoa
Organizing the work • Networking extremely important • Geographical distances between actors are immense • Ensures efficient communication and distribution of ideas • Ensures that the best and tested approaches will become a standard in LT • Keeps people motivated in this demanding work
Networking • A Wikipedia-type forum as an information and discussion centre for LT in Africa http://forums.csc.fi/kitwiki/pilot/view/KitWiki/Community/AfricanActivities
KitWiki/Community/AfricanActivities: Organizations, networks and activities related to LT for African languages • Key areas • LT Policy • LT Resources: Helsinki Corpus of Swahili • LT Research and Development: SALAMA (Swahili Language Manager), Nordic Journal of African Studies (NJAS) • LT Training and Education • LT Legislation • LT Business Activities • Other Activities (page revision r3, 29 Jul 2006, Arvi Hurskainen)
EDULINK initiative • In 2006 the EU started to support networking between higher education institutions • EDULINK-ACP-EU Cooperation Programme in Higher Education
EDULINK initiative • EDULINK is the first ACP-EU Cooperation Programme in Higher Education • EDULINK is financed by the European Commission under the 9th EDF and is managed by the ACP Secretariat.
EDULINK initiative • EDULINK promotes networking of HEIs in ACP States and the eligible EU Member States through funding of joint projects.
EDULINK initiative: Language technology for African languages • Consortium of five universities • Dar-es-Salaam • Nairobi • Ghana • Hawassa (Ethiopia) • Helsinki • Associates • UNISA, Stellenbosch, SA • Trondheim
EDULINK initiative: Language technology for African languages • Aims • Training in LT • Workshops • Training courses • Summer School in LT • Evaluation • Developing new LT • Language corpora • Morphological parsers • Speech technology • MT (further development of SALAMA)
Development environments • Proprietary environments • Can be obtained through licensing for development purposes • May also be available at a nominal price, e.g. the xfst package of Xerox • Cannot be included in a product without a separate agreement with the property owner
Development environments • Open domain environments • Free for development • Free for inclusion in a product
Availability of development environments • In morphology • The xfst package of Xerox, which uses finite state methods, is the most popular • Free for development but not free for inclusion in a product
Availability of development environments • In disambiguation and syntactic mapping • CG-2 and Functional Dependency Grammar (FDG) of Connexor • Available only through licensing • Not free for inclusion in a product
Availability of development environments • In disambiguation and syntactic mapping • CG-3 is an open source product • Free for development • Free for inclusion in a product • http://beta.visl.sdu.dk/constraint_grammar.html
Developing open source technology • Efforts to move SALAMA to open domain
Two implementations of SALAMA Comparison of two methods for morphological analysis • Analysis using finite state method (PR) and • Analysis using two-phase method (OS)
Two implementations of SALAMA Finite state method • Good • Very fast, 4,500 words per second in SWATWOL • Facilitates description on more than one level • Two-level description most common
Two implementations of SALAMA Finite state method • Good • The use of two-level rules simplifies the structure of the dictionary • The whole morphology can be described in one phase • Can be used for simulating linguistic processes (good for research purposes)
Two implementations of SALAMA Finite state method • Bad • Has difficulty handling non-concatenative processes (does not ‘see behind’) • Writing a reliable rule system is difficult • In constructing the lexicon, the influence of the rules must be anticipated
Two implementations of SALAMA Finite state method • Bad • Because the lexicon is a tree-structure, the whole language should be described with one single lexicon • Difficulties in compiling very large lexicons • No open source platform available
Two implementations of SALAMA Two-phase method - description • In the first phase, the word is described using pattern matching rules • Produces meta-tags with two parts • Example: unanifundisha “fundisha” [funda] V uSP naTAM niOBJ ishaVE • In the meta-tag uSP, u = string in the word and SP = tag meaning subject prefix
Two implementations of SALAMA Two-phase method • In the second phase, meta-tags are rewritten as final tags • uSP > • 1/2-SG2-SP VFIN { you } • 3/4-SG-SP VFIN { it } • 11-SG-SP VFIN { it }
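The two phases can be sketched as follows. Phase 1 matches the word against a pattern and emits meta-tags made of the matched string plus a slot name; phase 2 rewrites each meta-tag into its final readings. The single hard-coded pattern, the names `VERB`, `REWRITE`, `phase1` and `phase2`, and the segmentation of the stem into `fund` + `isha` are assumptions for illustration. Only the rewriting of `uSP` is taken from the slide; the other meta-tags are passed through unchanged because their final tags are not given here.

```python
import re

# Phase 1: one hypothetical pattern covering the example word only
# (subject prefix + tense marker + object prefix + root + verb extension).
VERB = re.compile(r"^(?P<SP>u)(?P<TAM>na)(?P<OBJ>ni)(?P<root>fund)(?P<VE>isha)$")

# Phase 2: rewrite table from meta-tags to final tags.
REWRITE = {
    "uSP": [
        "1/2-SG2-SP VFIN { you }",
        "3/4-SG-SP VFIN { it }",
        "11-SG-SP VFIN { it }",
    ],
}

def phase1(word):
    """Return meta-tags such as 'uSP' (string + slot name), or None."""
    m = VERB.match(word)
    if m is None:
        return None
    return [m.group(slot) + slot for slot in ("SP", "TAM", "OBJ", "VE")]

def phase2(meta_tags):
    """Rewrite each meta-tag as a list of final readings."""
    return [REWRITE.get(tag, [tag]) for tag in meta_tags]

meta = phase1("unanifundisha")
print(meta)
for readings in phase2(meta):
    print(readings)
```

Because the rewriting is a separate table lookup, an ambiguous string such as u- simply yields several readings, which later disambiguation can reduce.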