
Lexical acquisition through particular adjectival endings for Croatian

Božo Bekavac, Krešimir Šojat. Institute of Linguistics, Faculty of Philosophy, Zagreb.


Presentation Transcript


  1. Lexical acquisition through particular adjectival endings for Croatian • Božo Bekavac, Krešimir Šojat • Institute of Linguistics, Faculty of Philosophy, Zagreb

  2. Motivation & Goals • Recognition of unknown words necessary for many NLP applications • No attempt for Croatian so far • Focus on recognition of adjectives based on characteristic endings • Addition of recognized adjectives into general lexicon • Creation of dynamic rule-based resource

  3. Approach • Assumption: adjectives not recognized by the common lexicon tend to follow regular derivational patterns • e.g. cyberski (cyber-), imunobioloških (immuno-biological), eurooptimističnog (eurooptimistic) • Focus on adjectives, but applicable to other parts of speech

  4. Resources used • Croatian Morphological Lexicon (CML) - 621.000 types generated from ca 33.000 lemmas • 30 M newspaper corpus consisting of 195.534 types • (There will always be words not covered by general lexicons)

  5. Adjectives in Croatian (1) • MULTEXT-East specification: 1) type (qualificative, possessive) 2) degree (positive, comparative, superlative) 3) gender (masculine, feminine, neuter) 4) number (singular, plural) 5) case (nominative, genitive, dative, accusative, vocative, locative, instrumental) 6) definiteness 7) animate (relevant only for masculine-singular-accusative)

  6. Adjectives in Croatian (2) • Adjectives: an open and productive class of words • Morphological features: derivation + inflection • Derivation: • suffix (e.g. Tomislav → Tomislavov) • prefix + suffix (e.g. nad morem → nadmorski) • compound + suffix (e.g. primorsko-goranski, srednjoškolski)

  7. Adjectives in Croatian (3) • Inflection (e.g. singular): dvojb-en, dvojb-ena, dvojb-enu, dvojb-en, dvojb-enu, dvojb-enim
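The stem-plus-ending split on this slide can be sketched in a few lines of Python. This is only an illustration: the endings are copied from the slide in the order shown, the slide does not say which case each ending marks, and the names `SINGULAR_ENDINGS` and `inflect` are hypothetical.

```python
# Endings copied from the slide (singular forms of "dvojben"). The slide does
# not label them with cases, so no case labels are assigned here.
SINGULAR_ENDINGS = ["en", "ena", "enu", "en", "enu", "enim"]


def inflect(stem, endings):
    """Attach each ending to the stem, keeping slide order and duplicates."""
    return [stem + ending for ending in endings]


forms = inflect("dvojb", SINGULAR_ENDINGS)

# Several cases share one surface form, so fewer distinct strings remain.
distinct_forms = sorted(set(forms))
```

Six paradigm slots collapse to four distinct strings here, which is the kind of overlap the following slides call homography.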

  8. Consequence • Potential number of adjectival MSD interpretations is 256 • A great number of suffixes → overlapping of suffixes (endings and ends) of different POS → especially between adjectives and nouns

  9. Internal homography (1) • where the same token represents different word-forms of the same lemma • EXAMPLE: the word-form modalnom of the lemma modalan has five possible MSDs: Amsd, Amsl, Afsi, Ansd, Ansl • All different MSDs with internal homography are grouped under the same ending –alnom
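Grouping all MSDs of an internally homographic form under a single ending could look roughly like the sketch below. The five MSDs for the ending "alnom" come from the slide; the table layout, the longest-match lookup, and the names `group_by_ending` and `analyse` are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# (ending, MSD) pairs taken from the slide's example for the ending -alnom.
READINGS = [
    ("alnom", "Amsd"),
    ("alnom", "Amsl"),
    ("alnom", "Afsi"),
    ("alnom", "Ansd"),
    ("alnom", "Ansl"),
]


def group_by_ending(readings):
    """Collect every MSD that shares the same ending into one list."""
    grouped = defaultdict(list)
    for ending, msd in readings:
        grouped[ending].append(msd)
    return dict(grouped)


def analyse(token, ending_table):
    """Return the MSDs of the longest matching ending, or [] if none matches.

    ASSUMPTION: longest-match lookup is an illustrative choice.
    """
    for ending in sorted(ending_table, key=len, reverse=True):
        if token.endswith(ending):
            return ending_table[ending]
    return []


ENDING_TABLE = group_by_ending(READINGS)
```

With this table, `analyse("modalnom", ENDING_TABLE)` returns all five readings at once, so one ending entry stands for the whole ambiguity class.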

  10. Internal homography (2) • modalan Afpmsan-n, Afpmsnn • modalna Afpfsnn, Afpfsny, Afpfsvy, Afpmsan-y, Afpmsgn • modalni Afpnpan, Afpnpay, Afpnpnn, Afpnpny, Afpnpvy, Afpnsgn, Afpfpan • modalne Afpfpay, Afpfpnn, Afpfpny, Afpfpvy, Afpfsgn, Afpfsgy, Afpmpan, ...

  11. External homography (1) • where the same token represents different word-forms (i.e. MSD interpretations) of two or more lemmas • EXAMPLE: kos • noun kos (Nmsn) of the lemma kos (blackbird) • adjective kos (Amsa; Amsn) of the lemma kos (slant)
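The difference between the two kinds of homography can be made concrete with a small lookup over lexicon rows. The rows below are the "kos" example from the slide; because both lemmas are spelled "kos", this sketch assumes a lemma is identified by the pair (lemma, POS). The function names are hypothetical.

```python
from collections import defaultdict

# Toy lexicon rows (form, lemma, POS, MSD) copied from the slide's "kos" example.
ENTRIES = [
    ("kos", "kos", "N", "Nmsn"),
    ("kos", "kos", "A", "Amsa"),
    ("kos", "kos", "A", "Amsn"),
]


def index_by_form(entries):
    """Map form -> (lemma, POS) -> list of MSDs."""
    index = defaultdict(lambda: defaultdict(list))
    for form, lemma, pos, msd in entries:
        index[form][(lemma, pos)].append(msd)
    return index


def is_externally_homographic(form, index):
    """True if the form belongs to two or more lemmas."""
    return len(index.get(form, {})) > 1


def internally_homographic_lexemes(form, index):
    """Lemmas for which this single form carries more than one MSD."""
    return [lexeme for lexeme, msds in index.get(form, {}).items() if len(msds) > 1]


INDEX = index_by_form(ENTRIES)
```

For "kos" the form is externally homographic (noun and adjective) and, within the adjective alone, also internally homographic (Amsa and Amsn).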

  12. External homography (2) - endings • Adjectival endings regularly homographic with those of other parts of speech were not taken into consideration at all • Adjectival paradigms that are partially homographic → only unambiguous endings used

  13. External homography: endings/ends

  14. Order of processing • [Diagram labels: CML (common lexicon) • RECOGNIZER (lexical transducer) • Temporary lexicon of unknown adjectives • Generation of all word-forms]

  15. Lexical transducer –alan.grf • 24 transducers, i.e. different paradigms, used • [Diagram labels: Variables: mod, alne • Output: modalne,modalan.A+453,452/0/442 af bk]

  16. Lexical transducer –alan.grf applied on running text • ambijentalni,ambijentalan.A+453,452/0/442 af bk • bijenalna,bijenalan.A+453,452/0/442 af bk • cerebrospinalne,cerebrospinalan.A+453,452/0/442 af bk • doktrinalnom,doktrinalan.A+453,452/0/442 af bk • dvodimenzionalnima,dvodimenzionalan.A+453,452/0/442 af bk • dvokanalnom,dvokanalan.A+453,452/0/442 af bk • ... • [Callout: inflectional pattern code]
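The recognizer output lines above follow a visible `form,lemma.POS+code` shape, which a short parser can split apart. This is a sketch based only on the lines shown on the slide: treating everything after the "+" as one opaque inflectional pattern code is an assumption, and `parse_recognizer_line` is a hypothetical helper name.

```python
def parse_recognizer_line(line):
    """Split 'form,lemma.POS+code' into its parts.

    ASSUMPTION: the whole string after '+' is kept as one opaque
    pattern code; the slide does not document its internal structure.
    """
    form, rest = line.strip().split(",", 1)    # first comma ends the word-form
    lemma, tail = rest.split(".", 1)           # first dot ends the lemma
    pos, pattern_code = tail.split("+", 1)     # '+' separates POS from the code
    return {
        "form": form,
        "lemma": lemma,
        "pos": pos,
        "pattern_code": pattern_code.strip(),
    }


entry = parse_recognizer_line("ambijentalni,ambijentalan.A+453,452/0/442 af bk")
```

Splitting on the first comma only matters because the pattern code itself contains a comma.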

  17. Temporary → final lexicon • Results of lexical transducers are stored in a temporary lexicon • Inflectional pattern code and lemma are used to generate all word-forms (wfs) of each recognized adjective • This order of processing correctly recognizes the wf dvojben as an adjective and does not misclassify wfs with the same ends (e.g. bazen) • Results of generation are stored in the final lexicon
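One possible reading of this order of processing is sketched below: only an unambiguous ending triggers recognition, and ambiguous forms such as dvojben enter the lexicon later, through generation from the recognized lemma. The paradigm endings are those shown for dvojben earlier in the presentation; the choice of "enim" as the single trigger ending, the empty known-forms set, and all function names are assumptions for illustration only.

```python
# Paradigm endings copied from the "dvojben" inflection example.
PARADIGM = ["en", "ena", "enu", "enim"]

# ASSUMPTION: only "enim" is treated as an unambiguous trigger ending here.
TRIGGER_ENDINGS = ["enim"]


def recognize(tokens, known_forms):
    """Temporary lexicon: stems of unknown tokens that carry a trigger ending."""
    stems = set()
    for token in tokens:
        if token in known_forms:
            continue  # already covered by the common lexicon
        for ending in TRIGGER_ENDINGS:
            if token.endswith(ending) and len(token) > len(ending):
                stems.add(token[: -len(ending)])
    return stems


def generate(stems):
    """Final lexicon: every paradigm form of every recognized stem."""
    return {stem + ending for stem in stems for ending in PARADIGM}


def build_final_lexicon(tokens, known_forms):
    return generate(recognize(tokens, known_forms))


final_lexicon = build_final_lexicon(["dvojbenim", "bazen", "dvojben"], known_forms=set())
```

In this toy run "dvojben" ends up in the final lexicon because it is generated from the stem found via "dvojbenim", while "bazen" is never proposed as an adjective because its end "-en" is not a trigger.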

  18. Final lexicon • aboridžinska,aboridžinski.A:qtfsn-:qtfsv-:qtrpa-:qtrpn-:qtrpv- • aboridžinske,aboridžinski.A:qtfpa-:qtfpn-:qtfpv-:qtfsg-:qtmpa- • aboridžinski,aboridžinski.A:qtmpn-:qtmpv-:qtmsay--:qtmsn-:qtmsv- • aboridžinskih,aboridžinski.A:qtfpg-:qtmpg-:qtrpg- • aboridžinskim,aboridžinski.A:qtfpd-:qtfpi-:qtfpl-:qtmpd-:qtmpi-:qtmpl-:qtmsi-:qtrpd-:qtrpi-:qtrpl-:qtrsi-...
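The final-lexicon entries shown here have the shape `form,lemma.POS:tag:tag:...`, so each line can be read back into a word-form, a lemma, and a list of tag strings. The sketch below relies only on that visible shape; it does not interpret the individual characters of a tag, and `parse_final_entry` is a hypothetical name.

```python
def parse_final_entry(line):
    """Split 'form,lemma.POS:tag1:tag2...' into form, lemma, POS and tags.

    The tags are kept as opaque strings; decoding their characters
    would require the lexicon's own tag documentation.
    """
    form, rest = line.strip().split(",", 1)
    lemma, tail = rest.split(".", 1)
    pos, *tags = tail.split(":")
    return {"form": form, "lemma": lemma, "pos": pos, "tags": tags}


parsed = parse_final_entry("aboridžinskih,aboridžinski.A:qtfpg-:qtmpg-:qtrpg-")
```

One word-form with several tags, as in this entry, is exactly the internal homography described earlier, now stored explicitly.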

  19. Results (1)

  20. Results (2) • 13.933 new adjectival word-forms found by the recognizer • 5.035 word-forms belong to different lemmas • 4.511 new lemmas added into the CML (after manual inspection) → 393 type errors • Precision: 97.01 %

  21. Problems • Besides inevitable type errors, 131 wfs were misclassified due to: • 1) NE endings/ends homographic with adjectival endings (Joško, Aljaska) • 2) A small number of words of other POS still not present in the CML (ekosustav) • 3) Foreign words and words of foreign origin (certificate)

  22. Solution • Ad 1) preprocess the corpus with the NERC system developed for Croatian (Bekavac, 2005) • Ad 2) the problem will be solved after the automatic disambiguation of word-forms when they are added into the CML • Ad 3) foreign words used in their original spelling (e.g. certificate) are not added into the CML by default; their number is not large

  23. Frequency of particular adjectival endings found in corpus

  24. Conclusion and future work • Dynamic resource highly efficient for specific domains • Applied order of processing → overgeneration of word-forms is avoided • Future work → apply the same methodology to other open word classes (nouns and verbs)
