
6th Intex Workshop & 10 years of (Silberztein, 1993)

6th Intex Workshop & 10 years of (Silberztein, 1993). Sofia, 28-30 May 2003. Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions. Cvetana Krstev, Duško Vitas (University of Belgrade); Tomaž Erjavec (Jožef Stefan Institute, Ljubljana).



  1. 6th Intex Workshop & 10 years of (Silberztein, 1993). Sofia, 28-30 May 2003. 6th Intex Workshop, Sofia 28-30 May 2003

  2. Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions
     Cvetana Krstev, Duško Vitas, University of Belgrade
     Tomaž Erjavec, Jožef Stefan Institute, Ljubljana

  3. Motivation
     • general
       • use of different tools
       • use of multilingual resources
       • comparison of results in NLP
     • specific
       • inclusion of the Serbian language in the MULTEXT-East specification and production of Slovenian Intex resources
       • production of a tagged Serbian translation of Orwell's 1984

  4. MULTEXT-East morphosyntactic specification
     • aim: an exhaustive description of the morphological and morphosyntactic features of different languages, and the establishment of unique codes for common features
     • scope: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian

  5. 14 MULTEXT-East types (parts of speech); new types cannot be introduced
     Nouns (N), Verbs (V), Adjectives (A), Pronouns (P), Determiners (D), Adpositions (S), Conjunctions (C), Numerals (M), Interjections (I), Abbreviations (Y), Particles (Q), Adverbs (R), Articles (T), Residuals (X)

  6. Type attributes
     • Each type has a set of attributes that are appropriate to it
     • Each attribute of a type has its own position in the MSD
     • Adding new attributes to a type is not recommended

  7. Attribute values
     • a set of values is assigned to each attribute
     • each value is coded by one alphanumeric character
     • new values can be added to an attribute, if necessary

  8. Adjective attribute values/1
     Adjective (A), 13 positions
     Languages: EN RO SL CS BG ET HU HR SR
     P  ATT     VAL            C  x x x x x x x x x
     1  Type    qualificative  f  x x x x x x x
                indefinite     i
                possessive     s  x x x x
                ordinal        o  x x
     2  Degree  positive       p  x x x x x x x x
                comparative    c  x x x x x x x x
                superlative    s  x x x x x x x x
                elative        e  x x

  9. Adjective attribute values/2
     Languages: EN RO SL CS BG ET HU HR SR
     P  ATT     VAL         C  x x x x x x x x x
     3  Gender  masculine   m  x x x x x x
                feminine    f  x x x x x x
                neuter      n  x x x x x x
     4  Number  singular    s  x x x x x x x x
                plural      p  x x x x x x x x
                dual        d  x x
                paucal      c  x
     5  Case    nominative  n  x x x x x x
                genitive    g  x x x x x x
                dative      d  x x x x x
                accusative  a  x x x x x
                ... (various more values) ...

  10. Adjective attribute values/3
      Languages: EN RO SL CS BG ET HU HR SR
      6  Definiteness  no         n  x x x x x
                       yes        y  x x x x x
                       short_art  s  x
                       full_art   f  x
      7  Clitic        no         n  x
                       yes        y  x
      8  Animate       no         n  x x x x x
                       yes        y  x x x x x
      9  Formation     nominal    n  x
                       compound   c  x
      ... various Hungarian-specific attributes ...

  11. An example from the Slovenian MULTEXT-East dictionary
      čistejši čist Afcfda
      The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is tagged as a qualificative (f) adjective (A) in comparative form (c), feminine gender (f), dual number (d), and accusative case (a).
      čistejši čist Afcmsa--n
      The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is tagged as a qualificative (f) adjective (A) in comparative form (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).
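
The positional reading of the two MSDs above can be sketched in Python. The attribute order and value codes below are copied from the adjective tables on slides 8-10 (only the four cases listed there are included); the function name `decode_adjective_msd` and the choice to pass unknown codes through unchanged are assumptions made for illustration, not part of the MULTEXT-East tools.

```python
# Positions 1-9 of an adjective MSD, as listed on slides 8-10.
# The Hungarian-specific positions are omitted because the slides elide them.
ADJECTIVE_ATTRIBUTES = [
    ("Type", {"f": "qualificative", "i": "indefinite",
              "s": "possessive", "o": "ordinal"}),
    ("Degree", {"p": "positive", "c": "comparative",
                "s": "superlative", "e": "elative"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
    ("Definiteness", {"n": "no", "y": "yes",
                      "s": "short_art", "f": "full_art"}),
    ("Clitic", {"n": "no", "y": "yes"}),
    ("Animate", {"n": "no", "y": "yes"}),
    ("Formation", {"n": "nominal", "c": "compound"}),
]


def decode_adjective_msd(msd):
    """Map an adjective MSD such as 'Afcmsa--n' to attribute/value pairs."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: " + msd)
    decoded = {}
    for (name, values), code in zip(ADJECTIVE_ATTRIBUTES, msd[1:]):
        if code == "-":          # '-' marks an attribute that is not applicable
            continue
        # Codes missing from the slide tables are kept as raw characters.
        decoded[name] = values.get(code, code)
    return decoded


print(decode_adjective_msd("Afcfda"))
print(decode_adjective_msd("Afcmsa--n"))
```

Trailing attributes that are not set are simply left off the MSD, which is why `Afcfda` stops after the case position while `Afcmsa--n` pads positions 6 and 7 with `-` to reach Animate.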

  12. The first sentence of the Slovene translation of Orwell's 1984, tagged
      <w lemma="biti" ana="Vcps-sma">Bil</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="jasen" ana="Afpmsnn">jasen</w>
      <c>,</c>
      <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
      <w lemma="aprilski" ana="Aopmsn">aprilski</w>
      <w lemma="dan" ana="Ncmsn">dan</w>
      <w lemma="in" ana="Ccs">in</w>
      <w lemma="ura" ana="Ncfpn">ure</w>
      <w lemma="biti" ana="Vcip3p--n">so</w>
      <w lemma="biti" ana="Vmps-pfa">bile</w>
      <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
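
A tagged sentence in this form can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the wrapping `<s>` element and the function name `read_tagged_tokens` are assumptions added so that the fragment parses as one well-formed document, and are not taken from the MULTEXT-East corpus encoding.

```python
import xml.etree.ElementTree as ET


def read_tagged_tokens(fragment):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    # Hypothetical <s> wrapper: the slide shows only the tokens themselves.
    root = ET.fromstring("<s>" + fragment + "</s>")
    # Punctuation elements (<c>) carry no lemma or MSD and are skipped.
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


fragment = (
    '<w lemma="biti" ana="Vcps-sma">Bil</w> '
    '<w lemma="biti" ana="Vcip3s--n">je</w> '
    '<w lemma="jasen" ana="Afpmsnn">jasen</w> <c>,</c>'
)
for form, lemma, msd in read_tagged_tokens(fragment):
    print(form, lemma, msd)
```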

  13. Intex MSD for Serbian
      • one DELAS entry: cyist,A17
      • one of its corresponding DELAF entries: cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g
      • produced by the regular expression A17.exp:
        ..............
        ijemu/:bems3g:bems7g:bens3g:bens7g +
        iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g +
        o/:aens1g:aens4g:aens5g +
        ..............
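
The relation between the DELAS entry, the inflection expression, and the DELAF entry can be illustrated with a short Python sketch. It assumes that each alternative of the expression is a plain suffix followed by `/` and its codes, and that inflection is simple concatenation of the suffix to the DELAS form; Intex expressions may also contain operators that this sketch does not handle. The function name `inflect` is hypothetical, and only the three alternatives visible on the slide are used.

```python
# The three alternatives of A17.exp that are visible on the slide;
# the elided ones ("..............") are not reconstructed.
A17_VISIBLE = (
    "ijemu/:bems3g:bems7g:bens3g:bens7g + "
    "iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g + "
    "o/:aens1g:aens4g:aens5g"
)


def inflect(delas_entry, expression):
    """Expand a DELAS entry into DELAF lines, assuming suffix concatenation."""
    lemma, inflection_class = delas_entry.split(",", 1)
    delaf = []
    for alternative in expression.split("+"):
        suffix, codes = alternative.strip().split("/", 1)
        # DELAF line: inflected form, lemma, class code, inflectional codes
        delaf.append(f"{lemma}{suffix},{lemma}.{inflection_class}{codes}")
    return delaf


for line in inflect("cyist,A17", A17_VISIBLE):
    print(line)
```

The second line printed is the DELAF entry quoted on the slide for cyistiji.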

  14. Attributes and their values for Serbian adjectives in DELAS/DELAF

  15. Syntactic and semantic marks in Serbian DELAS

  16. Problems of correspondence between MULTEXT-East MSD and Intex/1
      • The need to impose the existing coding scheme on a particular language
      Example: how should the present and past active gerunds be encoded? In Serbian, for the verb ići (Engl. to go), these gerunds are idući and išavši. The verb tables of the MULTEXT-East specification contain attributes that describe them. However, no Slavic language except Bulgarian uses them.

  17. Problems/2
      • A common encoding scheme does not guarantee that true standardization is achieved
      Example: the value 'adjectival' of the attribute Type for adverbs (with the examples 'umno, veselo, studeno') is found only in Bulgarian, although other Slavic languages, at least, could make use of it.

  18. Problems/3
      • Encoding of verb tenses
      Languages: EN RO SL CS BG ET HU HR SR
      P  ATT    VAL            C  x x x x x x x x x
      2  VForm  indicative     i  x x x x x x x x x
                subjunctive    s  x
                imperative     m  x x x x x x x x
                conditional    c  x x x x x x x
                infinitive     n  x x x x x x x x
                participle     p  x x x x x x x x
                gerund         g  x x x
                supine         u  x x
                transgressive  t  x
                quotative      q  x
      3  Tense  present        p  x x x x x x x x x
                imperfect      i  x x x x x
                future         f  x x x x
                past           s  x x x x x x x x x
                pluperfect     l  x x x
                aorist         a  x x x

  19. Problems/3
      • The second attribute specifies the verb form, and the third the tense. However, because of compound tenses, some verb forms are used in the construction of different tenses. In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). It yields the perfect tense when used with the present indicative of the copula verb biti (Engl. to be), and the conditional when used with the conditional form of the same copula verb.

  20. Problems/3
      <w lemma="Winston" ana="Npmsn">Winston</w>
      <w lemma="Smith" ana="Npmsn">Smith</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="imeti" ana="Vmps-sma">imel</w>
      ..........................................
      <w lemma="da" ana="Css">da</w>
      <w lemma="biti" ana="Vcc">bi</w>
      <w lemma="on" ana="Pp3msa--y-n">ga</w>
      <w lemma="imeti" ana="Vmps-sma">imel</w>
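
The point made on slide 19 is that the participle's own MSD (`Vmps-sma` in both examples above) does not identify the compound tense; the auxiliary's MSD does. A minimal Python sketch of that reading follows. The function name `compound_tense` and the decision to return `None` for any other combination are assumptions; the positions used (VForm at position 2, Tense at position 3) follow the table on slide 18.

```python
def compound_tense(auxiliary_msd, participle_msd):
    """Classify a copula + past participle pair from their verb MSDs."""
    # Pad so that short MSDs such as 'Vcc' can be indexed safely.
    aux = auxiliary_msd.ljust(4, "-")
    part = participle_msd.ljust(4, "-")
    # Position 2 is VForm (p = participle), position 3 is Tense (s = past).
    if aux[0] != "V" or part[0] != "V" or part[2:4] != "ps":
        return None
    if aux[2] == "i" and aux[3] == "p":   # present indicative of the copula
        return "perfect"
    if aux[2] == "c":                     # conditional form of the copula
        return "conditional"
    return None


print(compound_tense("Vcip3s--n", "Vmps-sma"))  # je ... imel
print(compound_tense("Vcc", "Vmps-sma"))        # bi ... imel
```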

  21. Problems/4
      • Different interpretations of various grammatical categories across languages, and the lack of a clear cross-linguistic correspondence, are discussed in Przepiórkowski (EACL 2003); examples are the dual number in Slovene and the paucal in Serbian.
      • Certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).

  22. Application of MSDIntex mapping to Serbian 1984 {S}{Bio,biti.V77:Gsm} ({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi}) {vedar,.A18:akms1g:akms4q} ({i,.CONJ} + {i,.PAR}) {hladan,.A18:akms1g:akms4q} {aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g} ({dan,.A1+PP:akms1g:aems4q} + {dan,dati.V103+Perf+Tr+Iref+Ref:Tms}) ; {S} ({na,.PREP+p4} + {na,.PREP+p7}) {cyasovnicima,.?} ({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi}) {izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn} {trinaest,.?} . 6th Intex Workshop, Sofia 28-30 May 2003

  23. A tool that facilitates lemmatization and disambiguation

  24. Tagged Serbian translation of 1984 after hand disambiguation and resolution of unknown words
      {S}{Bio,biti.V77:Gsm}
      {je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
      {vedar,.A18:akms1g}
      {i,.CONJ}
      {hladan,.A18:akms1g}
      {aprilski,.A2+PosQ:adms1g}
      {dan,.N1:ms1q} ;
      {S} {na,.PREP+p7}
      {cyasovnicima,cyasovnik.N5:mp7q}
      {je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
      {izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
      {trinaest,.Num+Car} .
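
Before any conversion, the disambiguated text has to be split into its lexical units. The Python sketch below does this with a regular expression inferred from the examples on this slide only: `{form,lemma.CODE+mark+...:features}`, with an empty lemma meaning that the lemma equals the form. The names `UNIT` and `parse_units` are hypothetical, and the pattern is an assumption that does not cover the full Intex tagged-text syntax (for instance, the ambiguous `( ... + ... )` groups of the undisambiguated text are not interpreted).

```python
import re

# {form,lemma.code[:features...]}; the sentence delimiter {S} has no comma
# and is therefore skipped by this pattern.
UNIT = re.compile(r"\{([^,{}]+),([^.{}]*)\.([^:{}]+)((?::[^:{}]+)*)\}")


def parse_units(text):
    """Extract the lexical units of an Intex-style tagged text."""
    units = []
    for form, lemma, code, features in UNIT.findall(text):
        inflection_class, *marks = code.split("+")
        units.append({
            "form": form,
            "lemma": lemma or form,              # empty lemma: same as form
            "class": inflection_class,           # e.g. V77, A18, PREP
            "marks": marks,                      # e.g. Imperf, Aux, p7
            "features": features.split(":")[1:], # e.g. ['Gsm']
        })
    return units


sample = "{S}{Bio,biti.V77:Gsm} {vedar,.A18:akms1g} {trinaest,.Num+Car} ."
for unit in parse_units(sample):
    print(unit)
```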

  25. A simple Perl script maps Serbian Intex codes to MULTEXT-East MSDs
      if (($POS eq "V") && ($kategorije !~ /[XS]/)) {   # it is a verb
          $glagol = "V" . "---------------";
          if ($semkat =~ /Aux/) {                        # type, attribute 1
              substr($glagol,1,1) = "a";
          } else {
              substr($glagol,1,1) = "m";
          }
          if ($kategorije =~ /([WYGTIFA])/ ) {           # form, attribute 2
              substr($glagol,2,1) = $1;
          }
          $glagol =~ tr/WYGTIFA/nmppiii/;
          if ( ($lema eq "biti") && ($kategorije =~ /A/) ) {
              substr($glagol,2,1) = "c";
          }
          if ($kategorije =~ /([PIFAGY])/) {             # tense, attribute 3
              substr($glagol,3,1) = $1;
          }
          $glagol =~ tr/PIFAGY/pofasp/;
          if ($kategorije =~ /([xyz])/) {                # person, attribute 4
              substr($glagol,4,1) = $1;
          }
          $glagol =~ tr/xyz/123/;
          ........
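
For readers who want to experiment with the mapping, the visible part of the Perl fragment can be restated in Python. This is a port of only the four attributes shown on the slide (type, form, tense, person); the rest of the script is elided there and is not reconstructed. The function names `first_code` and `verb_msd_prefix` are hypothetical, and the variable names follow the Perl original (`lema` = lemma, `semkat` = semantic marks, `kategorije` = inflectional categories).

```python
# Translation tables copied from the tr/// calls in the Perl fragment.
FORM = dict(zip("WYGTIFA", "nmppiii"))    # Intex code -> VForm
TENSE = dict(zip("PIFAGY", "pofasp"))     # Intex code -> Tense
PERSON = dict(zip("xyz", "123"))          # Intex code -> Person


def first_code(kategorije, table):
    """Translate the first character of kategorije found in table, else '-'."""
    for ch in kategorije:
        if ch in table:
            return table[ch]
    return "-"


def verb_msd_prefix(lema, semkat, kategorije):
    """Return the first five characters of a verb MSD, or None if skipped."""
    if "X" in kategorije or "S" in kategorije:   # excluded by the Perl guard
        return None
    typ = "a" if "Aux" in semkat else "m"        # attribute 1: type
    form = first_code(kategorije, FORM)          # attribute 2: form
    if lema == "biti" and "A" in kategorije:
        form = "c"                               # treated as conditional
    tense = first_code(kategorije, TENSE)        # attribute 3: tense
    person = first_code(kategorije, PERSON)      # attribute 4: person
    return "V" + typ + form + tense + person


print(verb_msd_prefix("biti", "", "Gsm"))
print(verb_msd_prefix("jesam", "Imperf+It+Iref+Aux", "Pzsi"))
```

For the units `{Bio,biti.V77:Gsm}` and `{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}` this gives `Vmps-` and `Va-p3`, which agree with the first five characters of the corresponding MSDs on slide 26.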

  26. Tagged Serbian 1984 using MULTEXT-East MSD
      <w lemma="biti" ana="Vmps-sman-n---p">Bio</w>
      <w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
      <w lemma="vedar" ana="Afpms1n">vedar</w>
      <w lemma="i" ana="Ccs">i</w>
      <w lemma="hladan" ana="Afpms1n">hladan</w>
      <w lemma="aprilski" ana="Aopms1y">aprilski</w>
      <w lemma="dan" ana="Ncmsn--n">dan</w>
      <w lemma="na" ana="Sps-">na</w>
      <w lemma="cyasovnik" ana="Ncmpl--n">cyasovnicima</w>
      <w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
      <w lemma="izbijati" ana="Vmps-snan-n---e">izbijalo</w>
      <w lemma="trinaest" ana="Mc---l">trinaest</w>

  27. Conclusion
      • It is possible to convert from Intex to MULTEXT-East
      • It is possible to convert from MULTEXT-East to Intex to a certain extent. Some information cannot be recovered, such as the inflectional class code

  28. Noun attributes
      Type, Gender, Number, Case, Definiteness, Clitic, Animate, Owner_Number, Owner_Person, Owned_Number

  29. Verb attributes
      Type, VForm, Tense, Person, Number, Gender, Voice, Negative, Definiteness, Clitic, Case, Animate, Clitic_s, Aspect

  30. Adjective attributes
      Type, Degree, Gender, Number, Case, Definiteness, Clitic, Animate, Formation, Owner_Number, Owner_Person, Owned_Number

  31. Adverb attributes
      Type, Degree, Clitic, Number, Person, Wh_Type

  32. Values of the attribute VForm of the type Verb
      indicative (i), subjunctive (s), imperative (m), conditional (c), infinitive (n), participle (p), gerund (g), supine (u), transgressive (t), quotative (q)

  33. Values of the attribute Tense of the type Verb
      present (p), imperfect (i), future (f), past (s), pluperfect (l), aorist (a)
