SALAMA – Swahili Language Manager Arvi Hurskainen University of Helsinki
Short history • Morphological description of Swahili started in 1985 • - Two-level model using finite-state automata • Morphological description ready 1989 • - To market in 1999 through Lingsoft • - Now integrated into MS Office 2007 • Disambiguation ‘ready’ 1996 • - Constraint Grammar Parser CG-2 (Connexor) • Language translator ‘ready’ 2003 • Dictionary Compiler ready 2007
Morphological analysis • *serikali • "serikali" N CAP 9/10-SG { the } { government } PERS • "serikali" N CAP 9/10-PL { the } { government } PERS • ya • "ya" GEN-CON 3/4-PL { of } • "ya" GEN-CON 9/10-SG { of } • "ya" GEN-CON 5/6-PL { of } • "ya" GEN-CON 6-PLSG { of } • *tanzania • "*tanzania" N PROPNAME SG { *tanzania } • imefanya • "fanya" V 3/4-PL-SP VFIN { they } PERF:me z [fanya] { do } SVO • "fanya" V 9/10-SG-SP VFIN { it } PERF:me z [fanya] { do } SVO • uteuzi • "uteuzi" N 11-SG { the } DER:verb DER:zi { appointment } @OBJ
Disambiguation • *serikali • "serikali" N 9/10-SG { the } { government } PERS • ya • "ya" GEN-CON 9/10-SG { of } • *tanzania • "*tanzania" N PROPNAME SG { *tanzania } • imefanya • "fanya" V 9/10-SG-SP VFIN { it } PERF:me z [fanya] { do } SVO • uteuzi • "uteuzi" N 11-SG { the } DER:zi { appointment }
Syntactic mapping • *serikali • "serikali" N 9/10-SG { the } { government } PERS @SUBJ • ya • "ya" GEN-CON 9/10-SG { of } @GCON • *tanzania • "*tanzania" N PROPNAME SG { *tanzania } @<GN • imefanya • "fanya" V 9/10-SG-SP VFIN { it } PERF:me z [fanya] { do } SVO @FMAINVtr+OBJ> • uteuzi • "uteuzi" N 11-SG { the } DER:zi { appointment } @OBJ
How and where to describe MWEs? • Two categories of multiword expressions (MWE): • - frozen clusters of words • kwa_ajili_ya PREP { because of } • - clusters of words, the members of which may inflect • aliyenipigia picha { he/she who photographed for me } • atakayenipigia picha { he/she who will photograph for me } • aliyekuwa amekwishanipigia picha { he/she who already had photographed for me } • atakayekuwa amekwishanipigia picha { he/she who will have had photographed for me }
How and where to describe MWEs? • Frozen clusters of words • - may be described in the tokenizer and analyzed as a single unit • kwa ajili ya > kwa_ajili_ya
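The tokenizer step for frozen clusters can be sketched in Python as follows. This is a minimal illustration, not SALAMA's tokenizer: the function name `join_frozen_mwes`, the `FROZEN_MWES` list and the sample sentence are assumptions made for the example; only the mapping kwa ajili ya > kwa_ajili_ya comes from the slide.

```python
# Hypothetical list of frozen clusters; only kwa_ajili_ya is taken from the slides.
FROZEN_MWES = [("kwa", "ajili", "ya")]


def join_frozen_mwes(tokens, mwes=FROZEN_MWES):
    """Merge each frozen word sequence into one underscore-joined token."""
    # Try longer clusters first so that the longest match wins.
    ordered = sorted(mwes, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            window = tuple(t.lower() for t in tokens[i:i + len(mwe)])
            if window == mwe:
                out.append("_".join(mwe))
                i += len(mwe)
                break
        else:
            # No frozen cluster starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


# Illustrative sentence (an assumption, not from the slides).
print(join_frozen_mwes("alifanya kazi kwa ajili ya watoto".split()))
```

Because the match is purely on word forms, this approach only suits clusters whose members never inflect.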
How and where to describe MWEs? • Inflecting clusters of words • - cannot be described in the tokenizer • - they must be described after analysis • when all necessary word-level linguistic information is available
How and where to describe MWEs? • One possible solution: • - describe frozen MWEs in the tokenizer • - describe inflecting MWEs after morphological analysis • This was the earlier solution in Swahili Language Manager (SALAMA).
How and where to describe MWEs? • Another solution: • - describe all MWEs after morphological analysis • - exceptions are a few fully lexicalized structures that are written as separate words • This solution is applied in the current SALAMA.
How and where to describe MWEs? • In describing inflecting MWEs, the following requirements apply: • - each member of the MWE must be described • - the relative location of each member must be described • - other words and punctuation marks in between members must be allowed • - manipulation of the linguistic information (i.e. tags) must be possible, because the whole cluster will be re-described • - it must be possible to isolate the newly described cluster and treat it as a single lexical unit
CG in describing MWEs • Phase 1. • Analyze and disambiguate text: • ameikubali "kubali" V 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] { accept } SVO AR • shingo "shingo" N 9/10-0-SG { a/the } { neck } • upande "upande" ADV { aside }
CG in describing MWEs • Phase 2. • Identify the MWE and describe its structure: • ameikubali "kubali" V 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] { accept } SVO AR • shingo "shingo" N 9/10-0-SG { a/the } { neck } • upande "upande" <<IDIOM { accept unwillingly } • Note: Only the last member is reanalyzed, and the new lexical gloss is attached to it. The two words before it are part of the idiom (<<).
CG in describing MWEs • Phase 3. • Modify the other members of the MWE: • ameikubali "kubali" V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] IDIOM-V>> SVO AR • shingo "shingo" IDIOM<> • upande "upande" <<IDIOM { accept unwillingly } • Note: In the verb, gloss in English is removed, but necessary linguistic information is retained. In ‘shingo’, the gloss is removed.
CG in describing MWEs • Phase 4. • Isolate the MWE as a single lexical unit: • ("kubali_shingo_upande" V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } SVO AR IDIOM-V>> { accept unwillingly } )
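The isolation step can be approximated with a short Python sketch. It is an assumption about how the merge could be done, not the actual SALAMA procedure: the function name `isolate_mwe` and the tuple layout are hypothetical, and the tag order of the output differs slightly from the slide (the IDIOM-V>> marker stays where it was in the verb reading).

```python
import re


def isolate_mwe(members):
    """Collapse analysed MWE members into one lexical unit.

    members: list of (lemma, tag_string) pairs in text order.
    """
    unit_lemma = "_".join(lemma for lemma, _ in members)
    head_tags = members[0][1]
    # Drop the bracketed stem, e.g. "[kubali]", which no longer describes the unit.
    head_tags = re.sub(r"\[[^\]]*\]\s*", "", head_tags).strip()
    # The new gloss is the final {...} group on the last member.
    gloss = re.findall(r"\{[^{}]*\}", members[-1][1])[-1]
    return f'("{unit_lemma}" {head_tags} {gloss} )'


# Readings as shown after Phase 3.
members = [
    ("kubali", "V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] IDIOM-V>> SVO AR"),
    ("shingo", "IDIOM<>"),
    ("upande", "<<IDIOM { accept unwillingly }"),
]
unit = isolate_mwe(members)
print(unit)
```

The verb contributes all grammatical tags, the last member contributes the gloss, and the middle member contributes only its lemma.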
CG in describing MWEs • Phase 5. • Re-order the constituents: • (V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ SVO AR IDIOM-V>> { accept { it } unwillingly } ) • Note: The order of words, and their inclusion/exclusion is controlled by re-ordering rules.
CG in describing MWEs • Phase 6. • Produce surface form in English: • (V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ SVO AR IDIOM-V>> { has accepted { it } unwillingly } ) • Note: Surface form is constructed using linguistic information inherited from Swahili.
Phase 7. • Final translated form: • he/she has accepted it unwillingly
Types of MWEs • Multiword expressions fall into various part-of-speech categories: • verbs • nouns • adjectives • adverbs • prepositions • multiword names • proverbs
Adverb • After analysis: • kwa • "kwa" PREP { for } • "kwa" PREP { at } • "kwa" PREP { to } • "kwa" PREP { by } • "kwa" PREP { with } • "kwa" PREP { in } • "kwa" GEN-CON-KWA 15-SG { of } • "kwa" GEN-CON-KWA 17-SG { of } • kiasi • "asi" V SBJN VFIN 7/8-SG-OBJ OBJ { it } z [asi] { rebel } SVO AR • "asi" V SBJN 7/8-SG-SP VFIN { it } z [asi] { rebel } SVO AR • "kiasi" N 7/8-SG { the } { quantity } AR • "asi" ADJ A-INFL 7/8-SG { apostate } AR • "kiasi" ADV { reasonably } AR • "kiasi" AD-ADJ AR { amount } • "asi" ADV ADV:ki 9/10-SG { the } { rebel } AR • "asi" ADV ADV:ki 9/10-PL { the } { rebel } AR • kikubwa • "kubwa" ADJ A-INFL 7/8-SG { big }
Adverb • After isolation: • kwa • "kwa" MW>> • kiasi • "kiasi" MW<> • kikubwa • "kubwa" ADV <<MW { to a large extent }
Adverb • Adverbial expressions with genitive structure: • - number of forms limited • kwa bahati mbaya • Rule: • REPLACE ( ADV <<MW { unfortunately } ) TARGET ("baya") • (-2 ("kwa")) (-1 ("bahati")) ; • Modified result: • kwa "kwa_bahati_baya" MW>> bahati mbaya ADV { unfortunately }
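How a REPLACE rule with relative context positions behaves can be sketched in Python. This is a simplified model, not the CG-2 parser: real rules operate on readings and can test tags and sets, whereas this sketch matches lemmas only. The names `context_holds` and `apply_replace` are hypothetical, and the tags given for bahati and mbaya are placeholders rather than SALAMA output.

```python
def context_holds(cohorts, i, context):
    """Check that each relative offset from position i carries the wanted lemma."""
    for offset, wanted in context.items():
        j = i + offset
        if not (0 <= j < len(cohorts)) or cohorts[j][1] != wanted:
            return False
    return True


def apply_replace(cohorts, new_tags, target, context):
    """Replace the tags of each cohort whose lemma is `target` when the context holds."""
    result = []
    for i, (word, lemma, tags) in enumerate(cohorts):
        if lemma == target and context_holds(cohorts, i, context):
            result.append((word, lemma, new_tags))
        else:
            result.append((word, lemma, tags))
    return result


# (word form, lemma, tags); tags for the last two words are placeholders.
cohorts = [
    ("kwa", "kwa", "PREP { by }"),
    ("bahati", "bahati", "N { luck }"),
    ("mbaya", "baya", "ADJ { bad }"),
]

# Mirrors: REPLACE ( ADV <<MW { unfortunately } ) TARGET ("baya") (-2 ("kwa")) (-1 ("bahati")) ;
rewritten = apply_replace(
    cohorts,
    new_tags="ADV <<MW { unfortunately }",
    target="baya",
    context={-2: "kwa", -1: "bahati"},
)
print(rewritten[-1])
```

Matching on the lemma "baya" rather than the word form "mbaya" is what lets one rule cover every inflected form of the target.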
Adjective • mapambano • "pambano" N 5/6-PL { the } DER:verb DER:o { contest } • ya • "ya" GEN-CON 3/4-PL { of } • "ya" GEN-CON 9/10-SG { of } • "ya" GEN-CON 5/6-PL { of } • "ya" GEN-CON 6-PLSG { of } • kweli • "kweli" N 9/10-SG { the } { truth } • "kweli" N 9/10-PL { the } { truth } • "kweli" ADV { indeed }
Adjective • Rule: • REPLACE ( ADJ <MW { genuine , serious , unaffected , undoubted , unfeigned , virtual } ) TARGET ("kweli") (-1 GEN-CON) ; • Result: • mapambano • "pambano" N 5/6-PL { the } DER:o { contest } • ya • "ya" MW> • kweli • "kweli" ADJ <MW { genuine , serious , unaffected , undoubted , unfeigned , virtual }
Adjectives • Adjectival expressions with relative structure: • - number of forms limited by the number of noun classes • mtu mwenye akili • Rule: • REPLACE (ADJ <MW { clever , cute }) TARGET ("akili") • (-1 ("enye")) (NOT 0 MW); • Modified result: • mtu "mtu" N 1/2-SG { the } { man } • mwenye "enye_akili" MW> akili ADJ { clever , cute }
Adjectives • Adjectival expressions with relative structure: • - number of forms limited by the number of noun classes • - is often embedded in the verb structure • tendo lililohitimishwa vibaya • Rule: • REPLACE (ADJ <MW { illegitimate }) TARGET ("vibaya") • (-1 ("hitimishwa") + REL) (NOT 0 MW); • Modified result: • tendo "tendo" N 5/6-SG { the } { act } • lililohitimishwa "hitimishwa_vibaya" MW> vibaya ADJ { illegitimate }
Verb • kupambana • "pambana" V INF { to } z [pamba] { contest } PREFR SVO REC • "pambana" V INF { to } z [pamba] { adorn } SVO EXT: REC { each other } :EXT • "pambana" V INF NO-TO z [pamba] { contest } PREFR SVO REC • "pambana" V INF NO-TO z [pamba] { adorn } SVO EXT: REC { each other } :EXT • na • "na" CC { and } • "na" AG-PART { by } • "na" PREP { with } • "na" NA-POSS { of } • "na" ADV NOART { past }
Verb • kupambana • "pambana" V INF { to } z PREFR SVO REC IDIOM-V> • na • "na" <IDIOM { fight with } • One-line format with multiword lexical form: • kupambana "pambana_na" V INF { to } z PREFR SVO REC IDIOM-V> • na <IDIOM { fight with }
Verb • Rule: • REPLACE (<IDIOM { play piano }) TARGET ("kinanda") • (-1 ([piga])) ; • alipiga • "piga" V 1/2-SG3-SP VFIN { he/she } PAST [piga] { hit } SVO ACT • kinanda • "kinanda" N 7/8-SG { the } { piano } • One-line format with multiword lexical form: • alipiga "piga_kinanda" V 1/2-SG3-SP VFIN { he/she } PAST SVO ACT IDIOM-V kinanda { play piano }
Noun • kisomo • "kisomo" N 7/8-SG { the } DER:o { small lesson } • "somo" ADV ADV:ki 5/6-SG { the } DER:verb DER:o { :teaching subject } AR • "somo" ADV ADV:ki 9/10-SG { the } DER:o { namesake } HUM • "somo" ADV ADV:ki 9/10-PL { the } DER:o { namesake } HUM • cha • "cha" GEN-CON 7/8-SG { of } • watu_wazima • "mtu_mzima" N 1/2-PL { the } { :mature persons } • "mtu_mzima" N HUM 1/2-PL { the } { mature person } • Note that part of the MWE already fixed in tokenizer: (mtu_mzima).
Noun • kisomo • "kisomo" N 7/8-SG { the } MW-N>> • cha • "cha" MW<> • watu_wazima • "mtu_mzima" <<MW { :adult education } • Note that part of the MWE already fixed in tokenizer: (mtu_mzima).
Types of MWEs • Nouns with genitive structure: • - number of forms limited, often sg and pl • suala la jinsia • masuala ya jinsia • Rule: • REPLACE (<<MW { :gender issue }) TARGET ("jinsia") • (-2 ("suala")) (-1 GEN-CON); • Modified result: • suala "suala la jinsia" N 5/6-SG { the } AR MW-N la jinsia { :gender issue } • masuala "suala la jinsia" N 5/6-PL { the } AR MW-N ya jinsia { :gender issue }
Proper names • Proper names with multiple members: • - fixed form • Wizara ya Mawasiliano na Uchukuzi • REPLACE (<<<<MW { *ministry of *communication et *transport }) TARGET ("uchukuzi") • (-4 ("wizara")) (-3 ("ya")) (-2 ("mawasiliano")) (-1 ("na")) ; • *wizara "wizara ya mawasiliano na uchukuzi" N 9/10-SG { the } AR MW-N ya *mawasiliano na *uchukuzi { *ministry of *communication et *transport }
Proverbs • - ‘fixed’ form • - one rule for different variants • Baada ya dhiki faragha. • Baada ya dhiki faraja. • Baada ya dhiki faraji. • REPLACE (<<PROVERB { *after trouble there is relief } ) TARGET ("faragha") OR ("faraja") OR ("faraji") • (-2 ("baada_ya")) (-1 ("dhiki")) ;
Proverbs • - ‘fixed’ form • "*baada_ya_dhiki_faragha" PROVERB>> { *after trouble there is relief } • "*baada_ya_dhiki_faraja" PROVERB>> { *after trouble there is relief } • "*baada_ya_dhiki_faraji" PROVERB>> { *after trouble there is relief }
Serial verbs • Swahili uses serial verb constructions, in which only the first verb inflects and the subsequent verbs are in the infinitive.
Serial verb construction analyzed • *mtu • "mtu" N CAP 1/2-SG { the } { man } • aliyepata • "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO • taarifa • "taarifa" N 9/10-SG { the } { report } AR • "taarifa" N 9/10-PL { the } { report } AR • alipiga • "piga" V 1/2-SG3-SP VFIN { he/she } PAST z [piga] { hit } SVO ACT • "piga" V 1/2-SG3-SP VFIN { he/she } PR:a 5/6-SG-OBJ OBJ { it } z [piga] { hit } SVO ACT • simu • "simu" N 9/10-SG { the } { telephone } • "simu" N 9/10-SG { the } { type of sardine or sprat } AN • "simu" N 9/10-PL { the } { telephone } • "simu" N 9/10-PL { the } { type of sardine or sprat } AN • , • "," COMMA { , } • kukaa • "kaa" V INF { to } z [kaa] { sit } SV SVO • "kaa" V INF NO-TO z [kaa] { sit } SV SVO • na • "na" CC { and } • "na" AG-PART { by } • "na" PREP { with } • "na" NA-POSS { of } • "na" ADV NOART { past } • kungoja • "ngoja" V INF { to } z [ngoja] { wait } SV • "ngoja" V INF NO-TO z [ngoja] { wait } SV
Serial verb construction disambiguated • *mtu • "mtu" N 1/2-SG { the } { man } @SUBJ • aliyepata • "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO @FMAINVtr+OBJ> • taarifa • "taarifa" N 9/10-SG { the } { report } AR @OBJ • alipiga • "piga" V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> • simu • "simu" <IDIOM { call } • , • "," COMMA { , } • kukaa • "kaa" V INF { to } z [kaa] { sit } SV SVO @-FMAINV-n • "kaa" V INF NO-TO z [kaa] { sit } SV SVO @-FMAINV-n • na • "na" CC { and } @CC • kungoja • "ngoja" V INF { to } z [ngoja] { wait } SV SVO @-FMAINV-n • "ngoja" V INF NO-TO z [ngoja] { wait } SV SVO @-FMAINV-n
The sentence contains an idiom. • Idiom isolated: • ( N 1/2-SG { the } { man } @SUBJ ) ( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) ( N 9/10-SG { the } { report } @OBJ ) (V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call }) ( COMMA { , } ) ( V INF { to } z { sit } SV SVO @-FMAINV-n ) ( CC { and } @CC ) ( V INF { to } z { wait } SV @-FMAINV-n )
Idiom isolated, a word-per-line format: • ( N 1/2-SG { the } { man } @SUBJ ) • ( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) • ( N 9/10-SG { the } { report } @OBJ ) • (V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } ) • ( COMMA { , } ) • (V INF { to } z { sit } SV SVO @-FMAINV-n ) • ( CC { and } @CC ) • (V INF { to } z { wait } SV @-FMAINV-n )
Linguistic information copied to other members of the verb series: • ( N 1/2-SG { the } { man } @SUBJ ) • ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) • ( N 9/10-SG { the } { report } @OBJ ) • (V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } ) • ( COMMA { , } ) • (V 1/2-SG3-SP VFIN PAST z { sit } SV SVO @FMAINV-n ) • ( CC { and } @CC ) • (V 1/2-SG3-SP VFIN PAST z { wait } SV SVO @FMAINV-n )
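The copying step can be illustrated with a small Python sketch. It assumes a simplified unit layout (a dict with `pos`, `infl` and `gloss` keys) and a hypothetical function name `copy_inflection`; the real system works on the tag strings shown above, and the rule that picks the source verb may differ from the "nearest preceding finite verb" assumed here.

```python
def copy_inflection(units):
    """Give each infinitive the inflection tags of the nearest preceding finite verb."""
    result = []
    current = None  # inflection of the most recent finite verb, if any
    for unit in units:
        unit = dict(unit)  # copy so the input list is left untouched
        if unit["pos"] == "V":
            if "VFIN" in unit["infl"]:
                current = list(unit["infl"])
            elif "INF" in unit["infl"] and current is not None:
                unit["infl"] = list(current)
        result.append(unit)
    return result


# Simplified version of the verb series "alipiga simu, kukaa na kungoja".
units = [
    {"pos": "V", "infl": ["1/2-SG3-SP", "VFIN", "PAST"], "gloss": "call"},
    {"pos": "COMMA", "infl": [], "gloss": ","},
    {"pos": "V", "infl": ["INF"], "gloss": "sit"},
    {"pos": "CC", "infl": [], "gloss": "and"},
    {"pos": "V", "infl": ["INF"], "gloss": "wait"},
]
for unit in copy_inflection(units):
    print(unit["pos"], unit["infl"], unit["gloss"])
```

Once the infinitives carry the subject and tense tags, the English generator can inflect them like any finite verb, which is how "sat" and "waited" arise in the final translation.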
The surface form in English converted: • ( N 1/2-SG { the } { man } @SUBJ ) • ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { :got } SVO @FMAINVtr+OBJ> ) • ( N 9/10-SG { the } { report } @OBJ ) • (V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { :called } ) • ( COMMA { , } ) • ( V 1/2-SG3-SP VFIN PAST z { :sat } SV SVO @-FMAINV-n ) • ( CC { and } @CC ) • (V 1/2-SG3-SP VFIN PAST z { :waited } SV @-FMAINV-n ) • the man who got the report called, sat and waited
Problems in identifying MWEs • A construction that looks like an MWE may also be an ordinary sequence of words.
Problematic cases • Original analysis: • amechukua "chukua" V 1/2-SG3-SP VFIN { he/she } PERF:me [chukua] { take} SVO • hatua "hatua" N 9/10-0-PL { step } AR • tatu "tatu" NUM 9/10-PL CARD { three } • Marking the idiom (wrong): • amechukua "chukua" V 1/2-SG3-SP VFIN { he/she } PERF:me SVO IDIOM-V> • hatua "hatua" <IDIOM { take action } • tatu "tatu" NUM 9/10-PL CARD { three }
Safe cases • Safe case: • amepiga "piga" V 1/2-SG3-SP VFIN { he/she } PERF:me [piga] { hit } SVO • hatua "hatua" N 9/10-0-SG { a/the } { step } AR • amepiga "piga" V 1/2-SG3-SP VFIN { he/she } PERF:me SVO IDIOM-V> • hatua "hatua" <IDIOM { advance } • he/she has advanced
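One conceivable way to avoid the wrong marking in "amechukua hatua tatu" is a negative context condition on the idiom rule. The Python sketch below is purely an assumption for illustration: the slides do not say how SALAMA blocks this reading, and both the function name `chukua_hatua_is_idiom` and the numeral test are hypothetical.

```python
def chukua_hatua_is_idiom(cohorts, i):
    """Decide whether 'hatua' at position i is part of the idiom 'chukua hatua'.

    cohorts: list of (lemma, part_of_speech) pairs.
    """
    lemma, _ = cohorts[i]
    if lemma != "hatua" or i == 0 or cohorts[i - 1][0] != "chukua":
        return False
    # Assumed guard: a following numeral suggests literal, countable steps.
    if i + 1 < len(cohorts) and cohorts[i + 1][1] == "NUM":
        return False
    return True


literal = [("chukua", "V"), ("hatua", "N"), ("tatu", "NUM")]
idiomatic = [("chukua", "V"), ("hatua", "N")]
print(chukua_hatua_is_idiom(literal, 1), chukua_hatua_is_idiom(idiomatic, 1))
```

A single guard of this kind will not catch every literal use, so such conditions would need to be tuned against corpus data.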
MWEs in dictionary compilation • MWEs as separate dictionary entries: • {tia} V [tia] { put into, pour into, bring about, cause } 296 • {tia_akili} V IDIOM-V { take note of } 1 • [akili] taz. [tia_akili] V IDIOM-V { take note of } 1 • When sorted, the entries are located correctly in dictionary.
MWEs in dictionary compilation • MWEs as separate dictionary entries: • {afya} N 9/10 { health, sound condition } AR 1226 • [afya] taz. [bwana_afya] MW> N 9/6 { health officer } 10 • [afya] taz. [enye_afya] MW> ADJ { bonny } 17 • [afya] taz. [enye_nguvu_na_afya] MW>>> ADJ { hale } 1
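The cross-reference entries and their sorting can be sketched in Python. The helper names `cross_reference` and `headword` are hypothetical, and placing the cross-reference under the last member of the MWE is an assumption that happens to fit the examples shown (akili for tia_akili, afya for bwana_afya); the actual dictionary compiler may choose the member differently.

```python
import re


def cross_reference(mwe_lemma, description):
    """Build a 'taz.' (see) entry filed under the last member of the MWE."""
    last = mwe_lemma.split("_")[-1]
    return f"[{last}] taz. [{mwe_lemma}] {description}"


def headword(entry):
    """Return the text inside the leading {...} or [...] of an entry."""
    return re.match(r"[\[{]([^\]}]*)[\]}]", entry).group(1)


entries = [
    "{tia} V [tia] { put into, pour into, bring about, cause } 296",
    "{tia_akili} V IDIOM-V { take note of } 1",
]
entries.append(cross_reference("tia_akili", "V IDIOM-V { take note of } 1"))

# Sorting by headword files the cross-reference under 'akili'.
for entry in sorted(entries, key=headword):
    print(entry)
```

Sorting on the bare headword rather than on the raw line keeps the bracket type from influencing the order.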
MWEs in dictionary compilation • MWEs with use examples in dictionary: • {piga} V (piga) { hit, beat } 647 • {piga_picha} V IDIOM-V { photograph } 40 • [piga_picha] <ALA> Ikulu kunywa chai na kupiga [piga_picha] picha na Rais Mkapa (the State House to drink tea and to photograph and President Mkapa) • [piga_picha] <ALA> wapige [piga_picha] picha, alionekana kugoma (they should photograph, he/she was seen to boycott) • [piga_picha] <DWE> Au kumpiga [piga_picha] picha au hata kupeana naye (Or to photograph or even to give each other with him/her) • [piga_picha] <DWE> kutoka Ujerumani, walijitahidi kupiga [piga_picha] picha za ukumbusho na kiongozi wao (from Germany, they made an effort to photograph the commemoration and their leader)
MWEs in dictionary compilation • MWEs with use examples in dictionary: • {piga_ramli} V IDIOM-V { divine } 4 • [piga_ramli] <KIO> anakwenda kwa mganga ili kupiga [piga_ramli] ramli na kuongeza imani za ushirikina (he/she goes to the medical person in order to divine and to increase the faith in superstition) • [piga_ramli] <KIO> ikambidi amtume mtaalam wa kupiga [piga_ramli] ramli kuhusu nyota hiyo (he/she was obliged to send to him/her the expert of divining concerning this star) • [piga_ramli] <KIO> kwenda kwa mganga wa kupiga [piga_ramli] ramli, hujui kuwa imani ya (going to the medical person of divining, you do not know that the faith of) • [piga_ramli] <RAI> kuachana na mtindo wa kupiga [piga_ramli] ramli (to leave with the style of divining)
Conclusion • A detailed description of MWEs is necessary in at least two applications • - machine translation • - automatic dictionary compilation