Multi-word Expressions and CG How should MWEs be described?
Questions discussed in a workshop on MWEs – ACL 2007 • Is it sufficient to use purely statistical methods for the extraction of MWEs from corpora, or is it necessary to harness human knowledge and linguistic insights?
Questions discussed in a workshop on MWEs – ACL 2007 • Is fully automatic MWE extraction feasible, or will manual validation always be required?
Questions discussed in a workshop on MWEs – ACL 2007 • What is the nature of MWEs, and how can they be defined formally?
Questions discussed in a workshop on MWEs – ACL 2007 • To what extent can definitions and extraction procedures be generalised to other languages, other text types and other types of MWEs?
Questions discussed in a workshop on MWEs – ACL 2007 • Can and should we distinguish subtypes of MWEs for NLP applications?
Questions discussed in a workshop on MWEs – ACL 2007 • Is it sufficient to use purely statistical methods for the extraction of MWEs from corpora, or is it necessary to harness human knowledge and linguistic insights? • Comment: Underlying the question is a fundamental misunderstanding of what languages are about. And what is wrong with knowledge and linguistic insight?
Questions discussed in a workshop on MWEs – ACL 2007 • Is fully automatic MWE extraction feasible, or will manual validation always be required? • Comment: Hopefully yes for both.
Questions discussed in a workshop on MWEs – ACL 2007 • What is the nature of MWEs, and how can they be defined formally? • Comment: • - At least they are not the same as collocations. • - There is no one-to-one mapping of members in translation. • - They point to a single semantic concept.
Questions discussed in a workshop on MWEs – ACL 2007 • To what extent can definitions and extraction procedures be generalised to other languages, other text types and other types of MWEs? • Comment: I think they are generalizable.
Questions discussed in a workshop on MWEs – ACL 2007 • Can and should we distinguish subtypes of MWEs for NLP applications? • Comment: Definitely yes. They often comprise separate POS categories.
How and where to describe MWEs? • Two categories of MWEs: • - frozen clusters of words • - clusters of words, the members of which may inflect
How and where to describe MWEs? • Frozen clusters of words • - may be described in the tokenizer and analyzed as a single unit
How and where to describe MWEs? • Inflecting clusters of words • - cannot be described in the tokenizer • - they must be described after analysis • when all necessary linguistic information is available
How and where to describe MWEs? • One possible solution: • - describe frozen MWEs in the tokenizer • - describe inflecting MWEs after morphological analysis • This was the earlier solution in Swahili Language Manager (SALAMA)
How and where to describe MWEs? • Another solution: • - describe all MWEs after morphological analysis • - exceptions are a few fully lexicalized structures that are written as separate words • This solution is applied in current SALAMA
How and where to describe MWEs? • In describing inflecting MWEs, the following requirements apply: • - each member must be described • - the relative location of each member must be described • - other words and punctuation marks in between members must be allowed • - manipulation of the linguistic information (i.e. tags) must be possible, because the whole cluster will be described anew • - it must be possible to isolate the newly described cluster and treat it as a single lexical unit
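The first three requirements (describe each member, describe relative location, allow intervening material) can be sketched in Python. This is only an illustration, not the SALAMA implementation: the `Token` class, the `find_mwe` function and the `max_gap` parameter are hypothetical names, and members are matched on their base forms so that inflected surface forms are covered.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str                                   # surface form, e.g. "ameikubali"
    lemma: str                                  # base form, e.g. "kubali"
    tags: list = field(default_factory=list)    # morphological tags


def find_mwe(tokens, member_lemmas, max_gap=2):
    """Return the indices of the tokens that realise the MWE, or None.

    Members must occur in the given order; up to max_gap other tokens
    (words or punctuation) may stand between two consecutive members.
    """
    for start, tok in enumerate(tokens):
        if tok.lemma != member_lemmas[0]:
            continue
        indices = [start]
        pos = start
        for lemma in member_lemmas[1:]:
            found = None
            # look at the next token and at most max_gap tokens beyond it
            for j in range(pos + 1, min(pos + 2 + max_gap, len(tokens))):
                if tokens[j].lemma == lemma:
                    found = j
                    break
            if found is None:
                break
            indices.append(found)
            pos = found
        else:
            return indices
    return None


sentence = [
    Token("ameikubali", "kubali", ["V"]),
    Token(",", ","),                 # hypothetical intervening punctuation
    Token("shingo", "shingo", ["N"]),
    Token("upande", "upande", ["ADV"]),
]
match = find_mwe(sentence, ["kubali", "shingo", "upande"])
```

Matching on lemmas rather than surface forms is what makes the description independent of inflection; the remaining two requirements (rewriting tags and isolating the cluster) operate on the indices returned here.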
CG in describing MWEs • In SALAMA, CG-2 was used for describing MWEs
CG in describing MWEs • Phase 1. • Analyze text: • ameikubali "kubali" V 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] { accept } SVO AR • shingo "shingo" N 9/10-0-SG { a/the } { neck } • upande "upande" ADV { aside }
CG in describing MWEs • Phase 2. • Identify the MWE and describe its structure: • ameikubali "kubali" V 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] { accept } SVO AR • shingo "shingo" N 9/10-0-SG { a/the } { neck } • upande "upande" <<IDIOM { accept unwillingly } • Note: Only the last member is affected, and the new lexical gloss is attached to it
CG in describing MWEs • Phase 3. • Remodify the other members of the MWE: • ameikubali "kubali" V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] IDIOM-V>> SVO AR • shingo "shingo" IDIOM<> • upande "upande" <<IDIOM { accept unwillingly } • Note: Gloss in English is rewritten, but necessary linguistic information in verb is retained
CG in describing MWEs • Phase 4. • Isolate the MWE as a single lexical unit: • ("kubali_shingo_upande" V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } SVO AR IDIOM-V>> { accept unwillingly } )
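Phases 2 to 4 amount to collapsing the analysed members into one unit that keeps the verb's grammatical tags and carries the new gloss. A minimal Python sketch of that end result follows. It does not reproduce the CG-2 rules themselves; the dictionary layout and the function name `isolate_mwe` are assumptions made for illustration.

```python
def isolate_mwe(cohorts, gloss, marker):
    """Collapse the analysed members of one MWE into a single lexical unit.

    The grammatical tags of the first member (the verb in the example) are
    retained, the members' own glosses are dropped, and the new gloss and
    the MWE marker are attached to the merged unit.
    """
    head = cohorts[0]
    return {
        "lemma": "_".join(c["lemma"] for c in cohorts),
        "tags": list(head["tags"]) + [marker],
        "gloss": gloss,
    }


# Readings taken from the Phase 1 analysis, with the English glosses left out.
cohorts = [
    {"lemma": "kubali",
     "tags": ["V", "1/2-SG3-SP", "VFIN", "PERF:me", "9/10-SG-OBJ", "SVO", "AR"]},
    {"lemma": "shingo", "tags": ["N", "9/10-0-SG"]},
    {"lemma": "upande", "tags": ["ADV"]},
]
unit = isolate_mwe(cohorts, "accept unwillingly", "IDIOM-V>>")
```

The merged lemma `kubali_shingo_upande` corresponds to the base form shown in Phase 4; the subject, tense and object tags survive so that the English surface form can still be generated from them.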
CG in describing MWEs • Phase 5. • Surface form in English: • (V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ SVO AR IDIOM-V>> { has accepted { it } unwillingly } ) • Phase 6. • he/she has accepted it unwillingly • Note 1: Surface form is written using lexical and linguistic information • Note 2: The order of words, and their inclusion/exclusion is controlled by re-ordering rules
Problematic cases • Original analysis: • amechukua "chukua" V 1/2-SG3-SP VFIN { he/she } PERF:me [chukua] { take } SVO • hatua "hatua" N 9/10-0-PL { step } AR • tatu "tatu" NUM 9/10-PL CARD { three } • Marking the idiom (wrong): • amechukua "chukua" V 1/2-SG3-SP VFIN { he/she } PERF:me SVO IDIOM-V> • hatua "hatua" <IDIOM { take action } • tatu "tatu" NUM 9/10-PL CARD { three }
Safe cases • Safe case: • amepiga "piga" V 1/2-SG3-SP VFIN { he/she } PERF:me [piga] { hit } SVO • hatua "hatua" N 9/10-0-SG { a/the } { step } AR • amepiga "piga" V 1/2-SG3-SP VFIN { he/she } PERF:me SVO IDIOM-V> • hatua "hatua" <IDIOM { advance } • he/she has advanced
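The slides show the wrong marking but do not say how it is prevented. One conceivable guard, sketched here in Python purely as an assumption, is to block the idiom reading when the noun is followed by a word that modifies it literally, such as the numeral in "hatua tatu". The function name `may_mark_idiom` and the `blocking_pos` parameter are hypothetical.

```python
def may_mark_idiom(cohorts, verb, noun, blocking_pos=("NUM",)):
    """Return True if the verb + noun pair may be marked as an idiom.

    Assumption (not stated in the slides): the idiom reading is blocked
    when the word right after the noun has a part of speech listed in
    blocking_pos, because the noun is then being used literally.
    """
    for i in range(len(cohorts) - 1):
        v, n = cohorts[i], cohorts[i + 1]
        if v["lemma"] == verb and n["lemma"] == noun:
            following = cohorts[i + 2] if i + 2 < len(cohorts) else None
            if following is not None and following["pos"] in blocking_pos:
                return False
            return True
    return False


problematic = [
    {"lemma": "chukua", "pos": "V"},
    {"lemma": "hatua", "pos": "N"},
    {"lemma": "tatu", "pos": "NUM"},
]
safe = [
    {"lemma": "piga", "pos": "V"},
    {"lemma": "hatua", "pos": "N"},
]
blocked = may_mark_idiom(problematic, "chukua", "hatua")
allowed = may_mark_idiom(safe, "piga", "hatua")
```

In CG terms this would correspond to adding a negative context condition to the rule; whether SALAMA handles the case this way is not stated in the source.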
Types of MWEs • Several types of MWEs, and each needs to be treated in a specific way
Types of MWEs • Idiomatic expressions: • - they often include a verb as a member • - a large number of surface forms • Alipiga kinanda. • REPLACE (<IDIOM { play piano }) TARGET ("kinanda") • (-1 ([piga])) ; • "<*alipiga>" "piga_kinanda" V 1/2-SG3-SP VFIN { he/she } PAST SVO ACT IDIOM-V "<kinanda>" { play piano }
Types of MWEs • Nouns with genitive structure: • - number of forms limited, often sg and pl • suala la jinsia • masuala ya jinsia • REPLACE (<<MW { :gender issue }) TARGET ("jinsia") • (-2 ("suala")) (-1 GEN-CON); • "<suala>" "suala_la_jinsia" N 5/6-SG { the } AR MW-N "<la>" "<jinsia>" { :gender issue } • "<masuala>" "suala_la_jinsia" N 5/6-PL { the } AR MW-N "<ya>" "<jinsia>" { :gender issue }
Types of MWEs • Adjectival expressions with relative structure: • - number of forms limited by the number of noun classes • mtu mwenye akili • REPLACE (ADJ <MW { clever , cute }) TARGET ("akili") • (-1 ("enye")) (NOT 0 MW); • "<mtu>" "mtu" N 1/2-SG { the } { man } • "<mwenye>" "enye_akili" MW> "<akili>" ADJ { clever , cute }
Types of MWEs • Adjectival expressions with relative structure: • - number of forms limited by the number of noun classes • - is often embedded in the verb structure • tendo lililohitimishwa vibaya • REPLACE (ADJ <MW { illegitimate }) TARGET ("vibaya") • (-1 ("hitimishwa") + REL) (NOT 0 MW); • "<tendo>" "tendo" N 5/6-SG { the } { act } • "<lililohitimishwa>" "hitimishwa_vibaya" MW> "<vibaya>" ADJ { illegitimate }
Types of MWEs • Adverbial expressions with genitive structure: • - number of forms limited • kwa bahati mbaya • REPLACE ( ADV <<MW { unfortunately } ) TARGET ("baya") • (-2 ("kwa")) (-1 ("bahati")) ; • "<kwa>" "kwa_bahati_baya" MW>> "<bahati>" "<mbaya>" ADV { unfortunately }
Types of MWEs • Proper names with several members: • - fixed form • Wizara ya Mawasiliano na Uchukuzi • REPLACE (<<<<MW { *ministry of *communication et *transport }) TARGET ("uchukuzi") • (-4 ("wizara")) (-3 ("ya")) (-2 ("mawasiliano")) (-1 ("na")) ; • "<*wizara>" "wizara_ya_mawasiliano_na_uchukuzi" N 9/10-SG { the } AR MW-N "<ya>" "<*mawasiliano>" "<na>" "<*uchukuzi>" { *ministry of *communication et *transport }
Types of MWEs • Proverbs: • - ‘fixed’ form • - one rule for different variants • Baada ya dhiki faragha. • Baada ya dhiki faraja. • Baada ya dhiki faraji. • REPLACE (<<PROVERB { *after trouble there is relief } ) TARGET ("faragha") OR ("faraja") OR ("faraji") • (-2 ("baada_ya")) (-1 ("dhiki")) ;
Types of MWEs • Proverbs: • - ‘fixed’ form • "*baada_ya_dhiki_faragha" PROVERB>> { *after trouble there is relief } • "*baada_ya_dhiki_faraja" PROVERB>> { *after trouble there is relief } • "*baada_ya_dhiki_faraji" PROVERB>> { *after trouble there is relief }
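The idea of one rule covering several variants of the final word can be sketched in Python as a membership test on the target plus two fixed context positions, mirroring the TARGET ... OR ... and (-2 ...) (-1 ...) parts of the rule above. The names `VARIANTS` and `tag_proverb` are hypothetical, and the `*` capitalisation marker of the original output is left out for simplicity.

```python
# The three attested spellings of the final word of the proverb.
VARIANTS = {"faragha", "faraja", "faraji"}


def tag_proverb(lemmas):
    """Return the merged proverb lemma if the pattern is found, else None.

    The target may be any of the variants; the two preceding positions
    must be "baada_ya" and "dhiki", as in the rule's context conditions.
    """
    for i in range(2, len(lemmas)):
        if (lemmas[i] in VARIANTS
                and lemmas[i - 2] == "baada_ya"
                and lemmas[i - 1] == "dhiki"):
            return "_".join(lemmas[i - 2:i + 1])
    return None


merged = tag_proverb(["baada_ya", "dhiki", "faraja"])
```

Each variant yields its own merged base form, while all three share the same gloss, which matches the three output readings listed above.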
MWEs in dictionary compilation • MWEs as separate dictionary entries • {tia} V [tia] { put into, pour into, bring about, cause } 296 • {tia_akili} V IDIOM-V { take note of } 1 • [akili] taz. [tia_akili] V IDIOM-V { take note of } 1
MWEs in dictionary compilation • MWEs as separate dictionary entries • {afya} N 9/10 { health, sound condition } AR 1226 • [afya]a taz. [bwana_afya] MW> N 9/6 { health officer } 10 • [afya]a taz. [enye_afya] MW> ADJ { bonny } 17 • [afya]a taz. [enye_nguvu_na_afya] MW>>> ADJ { hale } 1
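The cross-reference lines (taz., short for Swahili "tazama", 'see') can be thought of as an index from a member word to the MWE entries that contain it. The following Python sketch builds such an index from merged lemmas; it is an illustration under assumptions, not the SALAMA dictionary compiler. In particular, it indexes every member of each MWE, whereas the slides only show cross-references under selected headwords, and the name `build_cross_refs` is hypothetical.

```python
from collections import defaultdict


def build_cross_refs(mwe_lemmas):
    """Map each member word to the MWE entries in which it occurs.

    MWE lemmas are assumed to join their members with "_", as in
    "enye_nguvu_na_afya".
    """
    refs = defaultdict(list)
    for mwe in mwe_lemmas:
        for member in mwe.split("_"):
            if mwe not in refs[member]:
                refs[member].append(mwe)
    return dict(refs)


refs = build_cross_refs([
    "tia_akili",
    "bwana_afya",
    "enye_afya",
    "enye_nguvu_na_afya",
])
```

With this index, the entry for "afya" would point to the same three MWEs that appear after taz. in the example above.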
MWEs in dictionary compilation • MWEs with use examples: • {piga} V (piga) { hit, beat } 647 • {piga picha} V IDIOM-V { photograph } 40 • [piga picha] <ALA> Ikulu kunywa chai na kupiga [piga picha] picha na Rais Mkapa (the State House to drink tea and to photograph and President Mkapa) • [piga picha] <ALA> wapige [piga picha] picha, alionekana kugoma (they should photograph, he/she was seen to boycott) • [piga picha] <DWE> Au kumpiga [piga picha] picha au hata kupeana naye (Or to photograph or even to give each other with him/her) • [piga picha] <DWE> kutoka Ujerumani, walijitahidi kupiga [piga picha] picha za ukumbusho na kiongozi wao (from Germany, they made an effort to photograph the commemoration and their leader)
MWEs in dictionary compilation • MWEs with use examples: • {piga ramli} V IDIOM-V { divine } 4 • [piga ramli] <KIO> anakwenda kwa mganga ili kupiga [piga ramli] ramli na kuongeza imani za ushirikina (he/she goes to the medical person in order to divine and to increase the faith in superstition) • [piga ramli] <KIO> ikambidi amtume mtaalam wa kupiga [piga ramli] ramli kuhusu nyota hiyo (he/she was obliged to send to him/her the expert of divining concerning this star) • [piga ramli] <KIO> kwenda kwa mganga wa kupiga [piga ramli] ramli, hujui kuwa imani ya (going to the medical person of divining, you do not know that the faith of) • [piga ramli] <RAI> kuachana na mtindo wa kupiga [piga ramli] ramli (to leave with the style of divining)
Conclusion • A detailed description of MWEs is necessary in at least two applications • - machine translation • - automatic dictionary compilation
Conclusion • Improvements needed for CG parser • - possibility for ordering replace rules • - more possibilities for controlling the deletion and/or replacement of morphemes