
Morphological Normalization and Collocation Extraction

Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić. University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences. jan.snajder@fer.hr, bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr



  1. Morphological Normalization and Collocation Extraction. Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić. University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences. jan.snajder@fer.hr, bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr. Seminar at the K.U. Leuven, Department of Computing Science, Leuven, 2008-05-08

  2. Morphological Normalization. Jan Šnajder, Marko Tadić. University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences

  3. Talk overview • who are we? • what are we doing? • morphological processing: normalization • lemmatization vs. stemming • Mollex: a system for normalization of Croatian • usage in document indexing and text classification • collocations as features • collocation extraction by co-occurrence measures • usage of genetic programming

  4. Who are we? • University of Zagreb, Croatia • founded 1669, 52,500 undergraduate students • two faculties with the same mission • build the systems that will develop and enable the use of language resources and tools for Croatian

  5. Who are we? (2) • Faculty of Humanities and Social Sciences • Institute/Department of Linguistics • dealing with basic computational linguistic tasks for Croatian • compiling and processing large-scale language resources • Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank • tagger, lemmatizer • chunker, parser • NERC system

  6. Who are we? (3) • Faculty of Electrical Engineering and Computing • Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab • the Knowledge Technologies Laboratory group deals with • text preprocessing techniques for Croatian for machine learning procedures • dimensionality reduction and document clustering in the vector space model + visualisation • automatic indexing of documents • intelligent, language-specific information retrieval and extraction

  7. What are we doing? • working jointly on several research projects • AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA) • RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects) • Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić • Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić • CADIAL: Computer Aided Document Indexing for Accessing Legislation • joint Flemish-Croatian project • 2007-2009 • prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

  8. Morphological processing • computational linguistic / NLP task • important for inflectionally rich languages, e.g. • Croatian noun in 14 word-forms (7 cases, 2 numbers):
       Case  Singular   Plural
       N     student    studenti
       G     studenta   studenata
       D     studentu   studentima
       A     studenta   studente
       V     studente   studenti
       L     studentu   studentima
       I     studentom  studentima
     • unlike English noun in 2 (3?) word-forms (2 numbers + possessive?):
       Sg    student    Poss: (student's)
       Pl    students
     • present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...

  9. Morphological processing 2 • three basic subtasks in inflection processing • generation of (all) word-forms (WFs) of a lexeme • analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text • recognizing which lexeme(s) a WF belongs to • the last one helps us avoid the problem of data sparseness in many text processing tasks, e.g. • information retrieval, text mining, document indexing • normalization: conflating the morphological variants of a word to a single representative form • two main ways to do that • linguistically motivated: lemmatization • computationally motivated: stemming

  10. Morphological processing 3 • lemmatization • replacing the WF with its proper base WF, usually called the lemma • e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma • lexicon-based • large lexicons of all (generated) WFs needed • preparation expensive in time and manpower • mostly realized by databases • algorithm-based • mostly finite-state transducers (FST): compact, efficient, fast • lexicon of lemmas and their inflectional patterns needed anyway

  11. Morphological processing 4 • stemming • reducing the WF from the end by truncating the possible endings • does not have to respect the linguistic boundaries: vuk+Ø > *vu+kØ, vuk+a > *vu+ka, vuč+e > *vu+če • reducing all the WFs to a common beginning • problems where there are many morphonological adaptations: sla+ti > *?+slati, šalj+em > *?+šaljem
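The two stemming problems on this slide can be reproduced with a small sketch. It is only an illustration under stated assumptions: the ending list `ENDINGS` and both helper functions are hypothetical and are not taken from the talk or from any existing stemmer.

```python
from typing import List

# Hypothetical ending list, chosen only to cover the slide's examples.
ENDINGS: List[str] = ["ima", "om", "em", "ti", "a", "e", "u", "i"]


def naive_stem(word_form: str, min_stem: int = 2) -> str:
    """Truncate the longest matching ending, keeping at least `min_stem` characters."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word_form.endswith(ending) and len(word_form) - len(ending) >= min_stem:
            return word_form[: -len(ending)]
    return word_form


def common_prefix(forms: List[str]) -> str:
    """Longest beginning shared by all word-forms (may be empty)."""
    prefix = forms[0]
    for form in forms[1:]:
        while not form.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


# Truncation alone leaves two different stems for one lexeme (vuk vs. vuč).
print([naive_stem(wf) for wf in ["vuk", "vuka", "vuče"]])
# Forcing a common beginning cuts across the linguistic stem boundary.
print(common_prefix(["vuk", "vuka", "vuče"]))
# With stem alternations there may be no common beginning at all.
print(repr(common_prefix(["slati", "šaljem"])))
```

Truncating endings keeps the alternating stems apart, while reducing everything to the shared beginning conflates them only by discarding part of the stem.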

  12. Morphological normalization • Croatian (like most Slavic languages) is morphologically complex • elaborate inflectional and derivational morphology • problematic for most NLP applications • requires the use of substantial linguistic knowledge • our lexicon-based approach to normalization is somewhere in between lemmatization and stemming • suitable for other inflectionally complex languages

  13. Croatian Morphology • high degree of affixation • word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension • inflection • nouns: declension (7 cases, 2 numbers) • verbs: conjugation (tenses, persons, numbers, genders) • adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness • derivation • a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...

  14. Croatian Morphology 2 • inflection examples • adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ... • noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini • adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, ... • adverb: brzo, brže, najbrže, brzinski • derivation examples • brz > brzina > brzinski > ...

  15. Croatian Morphology 3 • high degree of homography • vode = voda (water) | voditi (to lead) | vod (a platoon) • requires disambiguation (POS/MSD tagging) • affix ambiguity • many ambiguous suffixation rules • e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i • e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a • possible mismatches at inflectional level • narančast / narančast-om vs. ruž / ruž-om (not ruža) • possible mismatches at derivational level • e.g. kralj / kralj-ica vs. stan / stan-ica

  16. Lexicon-based normalization • lexicon-based morphological normalisation • a morphological lexicon associates each WF with its morphological norm (lemma, stem, ...) and, optionally, an MSD • incorporates linguistic knowledge and thus avoids the aforementioned pitfalls • drawbacks • made by linguists, expensive and time-consuming • problems with coverage (neologisms, jargon, ...) • our approach • rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora
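A lexicon of this kind can be pictured as a mapping from word-forms to (norm, MSD) pairs. The sketch below is a minimal illustration, assuming a hand-written toy lexicon; the entries, the simplified "N"/"V" tags standing in for full MSDs, and the function name `normalize` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Toy lexicon: word-form -> list of (lemma, simplified tag).
# The tags are placeholders for real MSDs, used here only for filtering.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "vode": [("voda", "N"), ("voditi", "V"), ("vod", "N")],  # homograph
    "vojnika": [("vojnik", "N")],
    "vojniče": [("vojnik", "N")],
}


def normalize(word_form: str, pos: Optional[str] = None) -> List[str]:
    """Return candidate lemmas; unknown word-forms are passed through unchanged."""
    key = word_form.lower()
    entries = LEXICON.get(key)
    if entries is None:
        return [key]  # coverage gap, e.g. a neologism
    return [lemma for lemma, tag in entries if pos is None or tag == pos]


print(normalize("vode"))         # all readings of the homograph
print(normalize("vode", "V"))    # reading kept after POS disambiguation
print(normalize("Vojnika"))
```

The homograph case shows why a tagger is still needed on top of the lexicon: the lookup alone returns every lemma the word-form may belong to.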

  17. Our approach • acquisition of inflectional lexicon • input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism • normalisation of word-forms • inflectional (lemmatization) • inflectional + derivational • comparable to stemming (but more precise) • advantages • can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation) • provides good lexicon coverage • requires only limited linguistic expertise

  18. Morphology representation • e.g. noun inflectional paradigm • vojnik (soldier)
        Case  Singular    Plural
        N     vojnik-Ø    vojnic-i
        G     vojnik-a    vojnik-a
        D     vojnik-u    vojnic-ima
        A     vojnik-a    vojnik-e
        V     vojnič-e    vojnic-i
        L     vojnik-u    vojnic-ima
        I     vojnik-om   vojnic-ima

  19. Morphology representation 2 • defines inflectional and derivational rules • uses functions as building blocks: • A) condition functions • B) string transformation functions • each defined using a higher-order function • e.g. • sfx • sfx('a') • sfx('a')('vojnik') = 'vojnika' • sfx('e') ∘ alt(pal) • (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
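The building blocks on this slide can be mimicked in a few lines. This is a sketch in Python rather than the authors' Haskell code: the names `sfx`, `alt` and `pal` follow the slide, while `sib`, `compose` and the contents of the alternation tables are assumptions made for illustration.

```python
from typing import Callable, Dict

Transform = Callable[[str], str]

# Assumed stem-final consonant alternations (illustrative tables).
pal: Dict[str, str] = {"k": "č", "g": "ž", "h": "š"}  # palatalization
sib: Dict[str, str] = {"k": "c", "g": "z", "h": "s"}  # sibilarization


def sfx(suffix: str) -> Transform:
    """Higher-order function: returns a transformation that appends `suffix`."""
    return lambda s: s + suffix


def alt(mapping: Dict[str, str]) -> Transform:
    """Returns a transformation that alternates the final consonant, if mapped."""
    def apply(s: str) -> str:
        if s and s[-1] in mapping:
            return s[:-1] + mapping[s[-1]]
        return s
    return apply


def compose(*fs: Transform) -> Transform:
    """Right-to-left composition: compose(f, g)(s) == f(g(s))."""
    def composed(s: str) -> str:
        for f in reversed(fs):
            s = f(s)
        return s
    return composed


print(sfx("a")("vojnik"))                     # vojnika
print(compose(sfx("e"), alt(pal))("vojnik"))  # vojniče
print(compose(sfx("i"), alt(sib))("vojnik"))  # vojnici
```

Because every block is an ordinary function from string to string, new alternations or affix types can be added without changing the ones already defined.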

  20. Morphology representation 3
        Case  Singular    Plural
        N     vojnik-Ø    vojnic-i
        G     vojnik-a    vojnik-a
        D     vojnik-u    vojnic-ima
        A     vojnik-a    vojnik-e
        V     vojnič-e    vojnic-i
        L     vojnik-u    vojnic-ima
        I     vojnik-om   vojnic-ima
      • (s.ends('k','g','h')(s) consGroup(s), {null, sfx('a'), sfx('u'), sfx('om'), sfx('e') ∘ alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘ alt(sib), sfx('e')})
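A paradigm given as a (condition, transformations) pair can be applied to a stem to generate its word-forms. The sketch below assumes Python stand-ins for the slide's functions; `compose`, `word_forms`, `PARADIGM` and the alternation tables are hypothetical, and the condition is reduced to the `ends` test only, because the role of `consGroup` cannot be recovered from the transcript.

```python
from typing import Callable, Dict, List, Set, Tuple

Transform = Callable[[str], str]
Condition = Callable[[str], bool]

pal: Dict[str, str] = {"k": "č", "g": "ž", "h": "š"}  # assumed palatalization table
sib: Dict[str, str] = {"k": "c", "g": "z", "h": "s"}  # assumed sibilarization table


def null(s: str) -> str:
    """Identity transformation (zero ending)."""
    return s


def sfx(suffix: str) -> Transform:
    return lambda s: s + suffix


def alt(mapping: Dict[str, str]) -> Transform:
    return lambda s: s[:-1] + mapping[s[-1]] if s and s[-1] in mapping else s


def compose(f: Transform, g: Transform) -> Transform:
    """compose(f, g)(s) == f(g(s)): alternate the stem first, then add the suffix."""
    return lambda s: f(g(s))


def ends(*chars: str) -> Condition:
    return lambda s: s.endswith(chars)


# Simplified condition: only the `ends` part of the slide's condition is modelled.
PARADIGM: Tuple[Condition, List[Transform]] = (
    ends("k", "g", "h"),
    [null, sfx("a"), sfx("u"), sfx("om"),
     compose(sfx("e"), alt(pal)),
     compose(sfx("i"), alt(sib)),
     compose(sfx("ima"), alt(sib)),
     sfx("e")],
)


def word_forms(stem: str, paradigm: Tuple[Condition, List[Transform]]) -> Set[str]:
    """Generate the distinct word-forms of `stem`, or nothing if the condition fails."""
    condition, transforms = paradigm
    return {t(stem) for t in transforms} if condition(stem) else set()


print(sorted(word_forms("vojnik", PARADIGM)))
print(word_forms("student", PARADIGM))  # condition fails: empty set
```

The eight transformations yield the eight distinct forms that fill the fourteen cells of the vojnik table.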

  21. Morphology representation 4 • suitable also for more complex paradigms (c, {null, sfx('a'), sfx('u'), ..., sfx('ima')} ∪ {sfx('og'), sfx('om'), ..., sfx('ima')} ∪ {sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot), ..., sfx('ima') ∘ alt(jot)} ∪ {sfx('i') ∘ alt(jot) ∘ pfx('naj'), ..., sfx('ima') ∘ alt(jot) ∘ pfx('naj')})

  22. Morphology representation 5 • advantages • resembles the morphology descriptions found in traditional grammar books • requires a minimum amount of linguistic knowledge • highly expressive: arbitrary higher-order functions (HOFs) can be defined • can be applied to other morphologically similar languages • implemented in Haskell • purely functional programming language • requires minimum programming skills

  23. Lexicon acquisition • uses inflectional rules + raw corpora to extract lemmas and their paradigms • uses frequency counts of WFs attested in the corpus • much of the ambiguity is resolved by language-dependent heuristics • plausibility, priority • linguistic quality is not vital • word-form conflation rather than generation • human intervention is not required
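The general idea of acquiring a lexicon from attested word-forms can be sketched as follows. This is not the authors' algorithm: the suffix-only paradigm, the `min_attested` threshold and the ranking by number of attested forms and summed frequency are simplifying assumptions standing in for the plausibility and priority heuristics mentioned on the slide.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Toy paradigm with plain suffixes only (real rules also need stem alternations).
SUFFIXES: List[str] = ["", "a", "u", "om", "e", "i", "ima"]


def acquire(tokens: Iterable[str], min_attested: int = 3) -> Dict[str, str]:
    """Map attested word-forms to a candidate lemma supported by the corpus."""
    freq = Counter(tokens)
    candidates = []
    for stem in freq:
        # Word-forms of this candidate lemma that actually occur in the corpus.
        attested = [stem + s for s in SUFFIXES if stem + s in freq]
        if len(attested) >= min_attested:  # crude plausibility check
            support = sum(freq[f] for f in attested)
            candidates.append((len(attested), support, stem, attested))
    lexicon: Dict[str, str] = {}
    # Better supported candidates claim their word-forms first (crude priority).
    for _, _, lemma, attested in sorted(candidates, reverse=True):
        for form in attested:
            lexicon.setdefault(form, lemma)
    return lexicon


corpus = "student studenta studenti studentu studentima studenta vode brzo".split()
print(acquire(corpus))
```

Word-forms with too little corpus support, such as "vode" and "brzo" in the toy corpus, are left out of the lexicon rather than guessed.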

  24. Results • example lexicon • acquired from a 20 Mw (million-word) newspaper corpus • based on 90 inflectional and >300 derivational rules • contains ca. 42,000 lemmas associated with over 500,000 WFs • performance • linguistic quality F1 = 88% per type • coverage 96% per type and 98% per token • understemming = 7% • overstemming < 4% • can be improved further by manual editing

  25. Derivational normalization • inflectional lexicon is partitioned into equivalence classes based on derivational rules • degree of normalisation depends on the number of derivational rules used • problem with semantics • context, degrees • derivation is not as semantically regular as inflection
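Partitioning lemmas into equivalence classes can be illustrated with a union-find structure. The sketch below is an assumption-laden toy: the two derivational rules, the lemma list and the function names are hypothetical, and two lemmas are linked only when a rule maps one onto another lemma already in the lexicon.

```python
from typing import Callable, Dict, List, Optional

Rule = Callable[[str], Optional[str]]


def add_ina(lemma: str) -> Optional[str]:
    """Toy rule: adjective -> noun, e.g. brz > brzina."""
    return lemma + "ina"


def a_to_ski(lemma: str) -> Optional[str]:
    """Toy rule: noun in -a -> adjective in -ski, e.g. brzina > brzinski."""
    return lemma[:-1] + "ski" if lemma.endswith("a") else None


def derivational_classes(lemmas: List[str], rules: List[Rule]) -> List[List[str]]:
    """Group lemmas connected by any chain of derivational rules."""
    parent: Dict[str, str] = {lemma: lemma for lemma in lemmas}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived is not None and derived in parent:
                parent[find(derived)] = find(lemma)

    classes: Dict[str, List[str]] = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), []).append(lemma)
    return sorted(sorted(members) for members in classes.values())


LEMMAS = ["brz", "brzina", "brzinski", "voda", "vojnik"]
print(derivational_classes(LEMMAS, [add_ina]))            # coarser rule set
print(derivational_classes(LEMMAS, [add_ina, a_to_ski]))  # more rules, more conflation
```

Running it with one rule and then with both shows how the degree of normalisation grows with the number of derivational rules applied.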

  26. References and applications • Reference • Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press) • Applied in document indexing • projects AIDE & CADIAL, www.cadial.org • Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove & P. Popelier (eds). Brugge : Die Keure, 2008. pp. 107-117. • Applied in text classification • Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.

  27. Thank you for your attention!
