6th Intex Workshop & 10 years of (Silberztein, 1993)
Sofia, 28-30 May 2003

Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions
Cvetana Krstev, Duško Vitas, University of Belgrade
Tomaž Erjavec, Jožef Stefan Institute, Ljubljana

Motivation
• general
  • use of different tools
  • use of multilingual resources
  • comparison of results in NLP
• specific
  • inclusion of the Serbian language in the MULTEXT-East specification and production of Slovenian Intex resources
  • production of a tagged Serbian translation of Orwell's 1984

MULTEXT-East morphosyntactic specification
• aim: exhaustive description of the morphological and morphosyntactic features of different languages, and establishment of unique codes for common features
• scope: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian

14 MULTEXT-East types or PoS (new types cannot be introduced)
• Nouns (N)
• Verbs (V)
• Adjectives (A)
• Pronouns (P)
• Determiners (D)
• Adpositions (S)
• Conjunctions (C)
• Numerals (M)
• Interjections (I)
• Abbreviations (Y)
• Particles (Q)
• Adverbs (R)
• Articles (T)
• Residuals (X)

Type attributes
• Each type has a set of attributes that are appropriate to it
• Each type attribute has a fixed position in the MSD description
• It is not recommended to add new attributes to a type

Attribute values
• a set of values is assigned to each attribute
• each value is coded by one alphanumeric character
• new values can be added to an attribute, if necessary

Adjective attribute values/1
Adjective (A): 13 positions
Languages covered: EN RO SL CS BG ET HU HR SR

P  ATT     VAL            C  Languages marked (x)
1  Type    qualificative  f  x x x x x x x
           indefinite     i
           possessive     s  x x x x
           ordinal        o  x x
2  Degree  positive       p  x x x x x x x x
           comparative    c  x x x x x x x x
           superlative    s  x x x x x x x x
           elative        e  x x

Adjective attribute values/2
Languages covered: EN RO SL CS BG ET HU HR SR

P  ATT     VAL         C  Languages marked (x)
3  Gender  masculine   m  x x x x x x
           feminine    f  x x x x x x
           neuter      n  x x x x x x
4  Number  singular    s  x x x x x x x x
           plural      p  x x x x x x x x
           dual        d  x x
           paucal      c  x
5  Case    nominative  n  x x x x x x
           genitive    g  x x x x x x
           dative      d  x x x x x
           accusative  a  x x x x x
           ... (various more values) ...

Adjective attribute values/3
Languages covered: EN RO SL CS BG ET HU HR SR

P  ATT           VAL        C  Languages marked (x)
6  Definiteness  no         n  x x x x x
                 yes        y  x x x x x
                 short_art  s  x
                 full_art   f  x
7  Clitic        no         n  x
                 yes        y  x
8  Animate       no         n  x x x x x
                 yes        y  x x x x x
9  Formation     nominal    n  x
                 compound   c  x
... various Hungarian specific attributes ...

An example from the Slovenian MULTEXT-East dictionary

čistejši čist Afcfda
The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is tagged as a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).

čistejši čist Afcmsa--n
The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is tagged as a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).

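The positional reading of the two tags above can be sketched in a few lines of Python. This is only an illustration, not part of the MULTEXT-East tools: the function name `decode_adjective_msd` is hypothetical, and the tables contain only the adjective values listed on the preceding slides (positions 1 to 8, with the case list incomplete).

```python
# Positional attribute tables for adjectives, copied from the slides.
# Assumption: only positions 1-8 and the values shown there are covered.
ADJECTIVE_ATTRIBUTES = [
    ("Type", {"f": "qualificative", "i": "indefinite",
              "s": "possessive", "o": "ordinal"}),
    ("Degree", {"p": "positive", "c": "comparative",
                "s": "superlative", "e": "elative"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
    ("Definiteness", {"n": "no", "y": "yes",
                      "s": "short_art", "f": "full_art"}),
    ("Clitic", {"n": "no", "y": "yes"}),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_adjective_msd(msd):
    """Map each filled position of an adjective MSD to (attribute, value)."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: " + repr(msd))
    decoded = {}
    # Position 0 is the type letter "A"; attributes start at position 1.
    for (name, values), code in zip(ADJECTIVE_ATTRIBUTES, msd[1:]):
        if code == "-":          # "-" marks an attribute that does not apply
            continue
        decoded[name] = values.get(code, "unlisted value " + repr(code))
    return decoded


print(decode_adjective_msd("Afcfda"))
print(decode_adjective_msd("Afcmsa--n"))
```

For `Afcmsa--n` the two hyphens skip Definiteness and Clitic, so the final `n` is read as the Animate attribute, which matches the gloss on the slide.
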
The first sentence of the Slovene translation of Orwell's 1984, tagged

<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>

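A tagged fragment like the one above can be read with Python's standard XML parser. The sketch below is an assumption about how one might consume the data, not a description of the MULTEXT-East tooling: the function name `read_tokens` and the wrapping `<s>` root element are hypothetical, added only because a bare sequence of `<w>` and `<c>` elements is not a complete XML document.

```python
import xml.etree.ElementTree as ET

# The first four tokens of the tagged sentence shown on the slide.
SAMPLE = (
    '<w lemma="biti" ana="Vcps-sma">Bil</w> '
    '<w lemma="biti" ana="Vcip3s--n">je</w> '
    '<w lemma="jasen" ana="Afpmsnn">jasen</w> '
    '<c>,</c>'
)


def read_tokens(fragment):
    """Return (form, lemma, msd) triples; punctuation has no lemma or MSD."""
    # Assumption: wrap the fragment in a root element so it parses as XML.
    root = ET.fromstring("<s>" + fragment + "</s>")
    tokens = []
    for element in root:
        if element.tag == "w":      # word with lemma and MSD
            tokens.append((element.text, element.get("lemma"),
                           element.get("ana")))
        elif element.tag == "c":    # punctuation character
            tokens.append((element.text, None, None))
    return tokens


for form, lemma, msd in read_tokens(SAMPLE):
    print(form, lemma, msd)
```
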
Intex MSD for Serbian
• one DELAS entry:
  cyist,A17
• one of its corresponding DELAF entries:
  cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g
• produced by the regular expression A17.exp:
  ..............
  ijemu/:bems3g:bems7g:bens3g:bens7g +
  iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g +
  o/:aens1g:aens4g:aens5g +
  ..............

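The structure of a DELAF line (form, lemma, inflectional class code, optional `+` marks, and colon-separated feature bundles) can be made explicit with a small parser. This is a minimal sketch, not Intex code: the function name `parse_delaf_entry` is hypothetical, and it assumes that neither the form nor the lemma contains a comma, full stop, or colon, and that an empty lemma means the lemma equals the form.

```python
def parse_delaf_entry(entry):
    """Split 'form,lemma.CODE+Mark:feat1:feat2' into its components."""
    form, _, rest = entry.partition(",")
    head, *features = rest.split(":")       # feature bundles follow each ":"
    lemma, _, code_and_marks = head.partition(".")
    code, *marks = code_and_marks.split("+")  # e.g. V575 + [Imperf, It, ...]
    return {
        "form": form,
        "lemma": lemma or form,   # assumption: empty lemma = same as form
        "code": code,
        "marks": marks,
        "features": features,
    }


entry = parse_delaf_entry(
    "cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g")
print(entry["lemma"], entry["code"], entry["features"])
```

Each element of `features` is one possible analysis of the word form, so the entry above carries five readings of cyistiji under a single lemma and class code.
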
Attributes and their values for Serbian adjectives in DELAS/DELAF

Syntactic and semantic marks in Serbian DELAS

Problems of correspondence between MULTEXT-East MSD and Intex/1
• The necessity of imposing the existing coding schema on a particular language
Example: how should the present and past active gerunds be encoded? In Serbian, for the verb ići (Engl. to go), those gerunds are idući and išavši. The verb tables of the MULTEXT-East specification contain attribute values that describe them; however, no Slavic language except Bulgarian uses them.

Problems/2
• A common encoding schema does not guarantee that true standardization will be achieved
Example: only Bulgarian uses the value 'adjectival' of the attribute Type for adverbs (with the examples 'umno, veselo, studeno'); other Slavic languages, at least, could make use of that value as well.

Problems/3
• Encoding of verb tenses
Languages covered: EN RO SL CS BG ET HU HR SR

P  ATT    VAL            C  Languages marked (x)
2  VForm  indicative     i  x x x x x x x x x
          subjunctive    s  x
          imperative     m  x x x x x x x x
          conditional    c  x x x x x x x
          infinitive     n  x x x x x x x x
          participle     p  x x x x x x x x
          gerund         g  x x x
          supine         u  x x
          transgressive  t  x
          quotative      q  x
3  Tense  present        p  x x x x x x x x x
          imperfect      i  x x x x x
          future         f  x x x x
          past           s  x x x x x x x x x
          pluperfect     l  x x x
          aorist         a  x x x

Problems/3
• The second attribute specifies the verb form, and the third the tense. However, because of composite tenses, some verb forms are used in the construction of several different tenses.
In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). It yields the perfect tense when used with the present indicative of the copula verb biti (Engl. to be), and the conditional when used with the conditional form of the same copula verb.

Problems/3

<w lemma="Winston" ana="Npmsn">Winston</w>
<w lemma="Smith" ana="Npmsn">Smith</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="imeti" ana="Vmps-sma">imel</w>
..........................................
<w lemma="da" ana="Css">da</w>
<w lemma="biti" ana="Vcc">bi</w>
<w lemma="on" ana="Pp3msa--y-n">ga</w>
<w lemma="imeti" ana="Vmps-sma">imel</w>

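The point of the two fragments above is that the participle imel carries the same tag (Vmps-sma) in both, so the composite tense is only visible once the auxiliary is taken into account. A minimal sketch of that reasoning follows; the function name `composite_tense` is hypothetical, and the rule covers only the two constructions named on the slides (it ignores word order, intervening clitics, and every other composite tense).

```python
def composite_tense(aux_msd, participle_msd):
    """Label copula + past participle as 'perfect' or 'conditional'.

    Returns None when the pair is not one of the two constructions
    described on the slides.
    """
    def attr(msd, position):
        # Missing trailing positions count as unspecified ("-").
        return msd[position] if len(msd) > position else "-"

    is_past_participle = (
        participle_msd.startswith("V")
        and attr(participle_msd, 2) == "p"    # VForm = participle
        and attr(participle_msd, 3) == "s"    # Tense = past
    )
    is_copula = aux_msd.startswith("V") and attr(aux_msd, 1) == "c"
    if not (is_past_participle and is_copula):
        return None
    if attr(aux_msd, 2) == "i" and attr(aux_msd, 3) == "p":
        return "perfect"        # present indicative of biti + participle
    if attr(aux_msd, 2) == "c":
        return "conditional"    # conditional form of biti + participle
    return None


print(composite_tense("Vcip3s--n", "Vmps-sma"))  # je ... imel
print(composite_tense("Vcc", "Vmps-sma"))        # bi ... imel
```
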
Problems/4
• differing interpretations of grammatical categories across languages, and the lack of a clear cross-linguistic correspondence, are discussed in Przepiórkowski (EACL 2003); for example, dual number in Slovene and paucal in Serbian.
• certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).

Application of the MSD-Intex mapping to the Serbian 1984

{S}{Bio,biti.V77:Gsm}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{vedar,.A18:akms1g:akms4q}
({i,.CONJ} + {i,.PAR})
{hladan,.A18:akms1g:akms4q}
{aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g}
({dan,.A1+PP:akms1g:aems4q} + {dan,dati.V103+Perf+Tr+Iref+Ref:Tms})
;
{S}
({na,.PREP+p4} + {na,.PREP+p7})
{cyasovnicima,.?}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.?}
.

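In the output above, a token with several competing lexical entries is written as `({...} + {...})`, while an unambiguous one is a single `{...}`. The sketch below collects the readings per token so that the ambiguous ones can be listed for disambiguation. It is an illustration only: the function name `readings_per_token` is hypothetical, the `{S}` sentence delimiter is skipped, and bare punctuation outside braces is ignored.

```python
import re

# Either "({a} + {b} + ...)" (ambiguous token) or a single "{a}".
TOKEN = re.compile(
    r"\((\{[^{}]*\}(?:\s*\+\s*\{[^{}]*\})+)\)|(\{[^{}]*\})")
READING = re.compile(r"\{([^{}]*)\}")

SAMPLE = (
    "{S}{Bio,biti.V77:Gsm} "
    "({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi}) "
    "{vedar,.A18:akms1g:akms4q} "
    "({i,.CONJ} + {i,.PAR})"
)


def readings_per_token(tagged):
    """Return one list of lexical entries per token, skipping {S}."""
    result = []
    for match in TOKEN.finditer(tagged):
        group = match.group(1) or match.group(2)
        readings = READING.findall(group)
        if readings == ["S"]:      # sentence delimiter, not a word
            continue
        result.append(readings)
    return result


for readings in readings_per_token(SAMPLE):
    print(len(readings), readings)
```

Note that this counts only competing lexical entries. A single entry such as `vedar,.A18:akms1g:akms4q` still carries two feature bundles, so it is also morphologically ambiguous even though it is reported with one reading.
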
A tool that facilitates lemmatization and disambiguation

Tagged Serbian translation of 1984 after manual disambiguation and resolution of unknown words

{S}{Bio,biti.V77:Gsm}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{vedar,.A18:akms1g}
{i,.CONJ}
{hladan,.A18:akms1g}
{aprilski,.A2+PosQ:adms1g}
{dan,.N1:ms1q}
;
{S}
{na,.PREP+p7}
{cyasovnicima,cyasovnik.N5:mp7q}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.Num+Car}
.

A simple Perl script maps Serbian Intex codes to MULTEXT-East MSDs

if (($POS eq "V") && ($kategorije !~ /[XS]/)) {    # it is a verb
    $glagol = "V" . "---------------";
    if ($semkat =~ /Aux/) {                        # type, attribute 1
        substr($glagol,1,1) = "a";
    } else {
        substr($glagol,1,1) = "m";
    }
    if ($kategorije =~ /([WYGTIFA])/ ) {           # form, attribute 2
        substr($glagol,2,1) = $1;
    }
    $glagol =~ tr/WYGTIFA/nmppiii/;
    if ( ($lema eq "biti") && ($kategorije =~ /A/) ) {
        substr($glagol,2,1) = "c";
    }
    if ($kategorije =~ /([PIFAGY])/) {             # tense, attribute 3
        substr($glagol,3,1) = $1;
    }
    $glagol =~ tr/PIFAGY/pofasp/;
    if ($kategorije =~ /([xyz])/) {                # person, attribute 4 (x, y, z map to 1, 2, 3)
        substr($glagol,4,1) = $1;
    }
    $glagol =~ tr/xyz/123/;
    ........

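For readers who want to try the mapping on sample codes, the visible part of the Perl fragment can be restated as a self-contained Python function. This is a hedged port, not the authors' script: the function name `map_verb` is hypothetical, the variable names are kept from the Perl, and only the four attributes shown on the slide are filled in, because the rest of the script is elided.

```python
import re


def map_verb(pos, kategorije, semkat, lema):
    """Fill MSD positions 1-4 for a verb; return None for other input."""
    if pos != "V" or re.search(r"[XS]", kategorije):
        return None
    glagol = ["V"] + ["-"] * 15          # same 16-character template
    # Attribute 1, type: auxiliary or main verb.
    glagol[1] = "a" if "Aux" in semkat else "m"
    # Attribute 2, verb form: first matching Intex code, then recoded.
    form = re.search(r"[WYGTIFA]", kategorije)
    if form:
        glagol[2] = form.group().translate(
            str.maketrans("WYGTIFA", "nmppiii"))
    if lema == "biti" and "A" in kategorije:
        glagol[2] = "c"                  # special case kept from the Perl
    # Attribute 3, tense.
    tense = re.search(r"[PIFAGY]", kategorije)
    if tense:
        glagol[3] = tense.group().translate(
            str.maketrans("PIFAGY", "pofasp"))
    # Attribute 4, person.
    person = re.search(r"[xyz]", kategorije)
    if person:
        glagol[4] = person.group().translate(str.maketrans("xyz", "123"))
    return "".join(glagol)


print(map_verb("V", "Gsm", "", "biti"))
print(map_verb("V", "Pzsi", "Imperf+It+Iref+Aux", "jesam"))
```

With the codes from the tagged text, `Gsm` yields a tag beginning `Vmps` and `Pzsi` with the `Aux` mark yields one beginning `Va-p3`. The remaining positions stay as hyphens here because the part of the script that fills them is not shown.
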
Tagged Serbian 1984 using MULTEXT-East MSD

<w lemma="biti" ana="Vmps-sman-n---p">Bio</w>
<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
<w lemma="vedar" ana="Afpms1n">vedar</w>
<w lemma="i" ana="Ccs">i</w>
<w lemma="hladan" ana="Afpms1n">hladan</w>
<w lemma="aprilski" ana="Aopms1y">aprilski</w>
<w lemma="dan" ana="Ncmsn--n">dan</w>
<w lemma="na" ana="Sps-">na</w>
<w lemma="cyasovnik" ana="Ncmpl--n">cyasovnicima</w>
<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
<w lemma="izbijati" ana="Vmps-snan-n---e">izbijalo</w>
<w lemma="trinaest" ana="Mc---l">trinaest</w>

Conclusion
• It is possible to convert from Intex to MULTEXT-East
• It is possible to convert from MULTEXT-East to Intex only to a certain extent: some information cannot be recovered, such as the inflectional class code

Noun attributes
• Type
• Gender
• Number
• Case
• Definiteness
• Clitic
• Animate
• Owner_Number
• Owner_Person
• Owned_Number

Verb attributes
• Type
• VForm
• Tense
• Person
• Number
• Gender
• Voice
• Negative
• Definiteness
• Clitic
• Case
• Animate
• Clitic_s
• Aspect

Adjective attributes
• Type
• Degree
• Gender
• Number
• Case
• Definiteness
• Clitic
• Animate
• Formation
• Owner_Number
• Owner_Person
• Owned_Number

Adverb attributes
• Type
• Degree
• Clitic
• Number
• Person
• Wh_Type

Values of the attribute VForm of the type Verb
• indicative (i)
• subjunctive (s)
• imperative (m)
• conditional (c)
• infinitive (n)
• participle (p)
• gerund (g)
• supine (u)
• transgressive (t)
• quotative (q)

Values of the attribute Tense of the type Verb
• present (p)
• imperfect (i)
• future (f)
• past (s)
• pluperfect (l)
• aorist (a)