Morphological Normalization and Collocation Extraction

Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
jan.snajder@fer.hr, bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr
Seminar at K.U. Leuven, Department of Computing Science, Leuven, 2008-05-08

Morphological Normalization
Jan Šnajder, Marko Tadić

Talk overview
• who are we?
• what are we doing?
• morphological processing: normalization
  • lemmatization vs. stemming
  • Mollex: a system for normalization of Croatian
  • usage in document indexing and text classification
• collocations as features
  • collocation extraction by co-occurrence measures
  • usage of genetic programming

Who are we?
• University of Zagreb, Croatia
  • founded 1669, 52,500 undergraduate students
• two faculties with a shared mission
  • building systems that develop and enable the use of language resources and tools for Croatian

Who are we? (2)
• Faculty of Humanities and Social Sciences
  • Institute/Department of Linguistics
• dealing with basic computational linguistic tasks for Croatian
• compiling and processing large-scale language resources
  • Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  • tagger, lemmatizer
  • chunker, parser
  • NERC system

Who are we? (3)
• Faculty of Electrical Engineering and Computing
  • Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
• the Knowledge Technologies Laboratory group deals with
  • text preprocessing techniques for Croatian for machine learning procedures
  • dimensionality reduction and document clustering in the vector space model + visualisation
  • automatic indexing of documents
  • intelligent, language-specific information retrieval and extraction

What are we doing?
• working jointly on several research projects
  • AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
  • RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
    • Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
    • Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
  • CADIAL: Computer Aided Document Indexing for Accessing Legislation
    • joint Flemish-Croatian project
    • 2007-2009
    • prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

Morphological processing
• computational linguistic / NLP task
• important for inflectionally rich languages, e.g.
  • a Croatian noun has 14 word-forms (7 cases, 2 numbers):

    Case  Singular    Plural
    N     student     studenti
    G     studenta    studenata
    D     studentu    studentima
    A     studenta    studente
    V     studentu    studenti
    L     studentu    studentima
    I     studentom   studentima

  • unlike an English noun, which has 2 (3?) word-forms (2 numbers + possessive?):

    Sg:   student
    Poss: (student's)
    Pl:   students

• present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...

Morphological processing 2
• three basic subtasks in inflection processing
  • generation of (all) word-forms (WFs) of a lexeme
  • analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text
  • recognizing which lexeme(s) a WF belongs to
• the last one helps avoid the problem of data sparseness in many text processing tasks, e.g.
  • information retrieval, text mining, document indexing
• normalization: conflating the morphological variants of a word to a single representative form
• two main ways to do that
  • linguistically motivated: lemmatization
  • computationally motivated: stemming

Morphological processing 3
• lemmatization
  • replacing the WF with its proper base WF, usually called the lemma
  • e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma
• lexicon based
  • large lexicons of all (generated) WFs needed
  • preparation expensive in time and manpower
  • mostly realized by databases
• algorithm based
  • mostly FST (finite-state transducers): compact, efficient, fast
  • lexicon of lemmas and their inflectional patterns needed anyway
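As a rough illustration of the lexicon-based variant, the lookup can be sketched as a plain dictionary from word-forms to candidate lemmas. The entries are the forms of "student" from the paradigm shown earlier (its 14 case/number slots collapse to 8 distinct strings); the names `LEXICON` and `lemmatize` are hypothetical, and a real lexicon would hold hundreds of thousands of entries.

```python
# Toy lexicon-based lemmatizer: each word-form (WF) maps to its candidate lemmas.
# Illustrative only; the entries come from the "student" paradigm on the slides.
LEXICON = {wf: {"student"} for wf in [
    "student", "studenta", "studentu", "studentom",
    "studenti", "studenata", "studentima", "studente",
]}


def lemmatize(word_form):
    """Return the set of candidate lemmas, or the WF itself if it is unknown."""
    return LEXICON.get(word_form.lower(), {word_form})


print(lemmatize("studentima"))  # {'student'}
print(lemmatize("Leuven"))      # {'Leuven'}  (out-of-lexicon: returned unchanged)
```

Returning the unknown word-form unchanged is one possible fallback; it keeps the pipeline running but leaves out-of-lexicon words unnormalized.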

Morphological processing 4
• stemming
  • reducing the WF from the end by truncating the possible endings
  • does not have to respect the linguistic boundaries
      vuk+Ø  > *vu+kØ
      vuk+a  > *vu+ka
      vuč+e  > *vu+če
  • reducing all the WFs to a common beginning
  • problems where there are many morphonological adaptations
      sla+ti  > *?+slati
      šalj+em > *?+šaljem
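The "common beginning" idea can be made concrete with a few lines of Python. This is only a sketch of the principle, not any particular stemmer: it takes the character-wise longest common prefix of the word-forms from the examples above (the helper name `common_beginning` is made up for illustration).

```python
from os.path import commonprefix  # character-wise longest common prefix of strings


def common_beginning(word_forms):
    """Longest string that all given word-forms start with."""
    return commonprefix(list(word_forms))


# The stem alternation k/č pushes the common beginning below the linguistic stem.
print(common_beginning(["vuk", "vuka", "vuče"]))  # vu
# With suppletive-looking stems there is no common beginning at all.
print(repr(common_beginning(["slati", "šaljem"])))  # ''
```

The first call yields "vu" rather than the linguistic stem "vuk", and the second yields the empty string, which is the situation marked "*?" on the slide.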

Morphological normalization
• Croatian (like most Slavic languages) is morphologically complex
  • elaborate inflectional and derivational morphology
• problematic for most NLP applications
  • requires the use of substantial linguistic knowledge
• our lexicon-based approach to normalization is somewhere in between lemmatization and stemming
• suitable for other inflectionally complex languages

Croatian Morphology
• high degree of affixation
  • word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension
• inflection
  • nouns: declension (7 cases, 2 numbers)
  • verbs: conjugation (tenses, persons, numbers, genders)
  • adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
• derivation
  • a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...

Croatian Morphology 2
• inflection examples
  • adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
  • noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
  • adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, ...
  • adverb: brzo, brže, najbrže, brzinski
• derivation examples
  • brz > brzina > brzinski > ...

Croatian Morphology 3
• high degree of homography
  • vode = voda (water) | voditi (to lead) | vod (a platoon)
  • requires disambiguation (POS/MSD tagging)
• affix ambiguity
  • many ambiguous suffixation rules
    • e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
    • e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
  • possible mismatches at inflectional level
    • narančast / narančast-om vs. ruž / ruž-om (not ruža)
  • possible mismatches at derivational level
    • e.g. kralj / kralj-ica vs. stan / stan-ica
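The homography example can be sketched as a lookup that returns several analyses for one word-form. The table `ANALYSES`, the coarse tags "noun"/"verb" and the function `candidates` are assumptions made for illustration; a real tagger would supply much finer morphosyntactic descriptions (MSDs).

```python
# One homographic word-form with its candidate (lemma, coarse POS) analyses.
# Hypothetical toy data based on the "vode" example.
ANALYSES = {
    "vode": [("voda", "noun"), ("voditi", "verb"), ("vod", "noun")],
}


def candidates(word_form, pos=None):
    """Candidate lemmas of a word-form, optionally filtered by a POS tag."""
    analyses = ANALYSES.get(word_form, [])
    return [lemma for lemma, tag in analyses if pos is None or tag == pos]


print(candidates("vode"))          # ['voda', 'voditi', 'vod']
print(candidates("vode", "verb"))  # ['voditi']
print(candidates("vode", "noun"))  # ['voda', 'vod']
```

In this toy setting a coarse POS tag resolves the verb reading but still leaves two noun lemmas, which suggests why the slide mentions MSD tagging alongside POS tagging.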

Lexicon based normalization
• lexicon-based morphological normalisation
  • a morphological lexicon associates each WF with its morphological norm (lemma, stem, ...) and, optionally, an MSD
  • incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
• drawbacks
  • made by linguists, expensive and time-consuming
  • problems with coverage (neologisms, jargon, ...)
• our approach
  • rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora

Our approach
• acquisition of inflectional lexicon
  • input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
• normalisation of word-forms
  • inflectional (lemmatization)
  • inflectional + derivational
    • comparable to stemming (but more precise)
• advantages
  • can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation)
  • provides good lexicon coverage
  • requires only limited linguistic expertise

Morphology representation
• e.g. noun inflectional paradigm
  • vojnik (soldier)

    Case  Singular    Plural
    N     vojnik-Ø    vojnic-i
    G     vojnik-a    vojnik-a
    D     vojnik-u    vojnic-ima
    A     vojnik-a    vojnik-e
    V     vojnič-e    vojnic-i
    L     vojnik-u    vojnic-ima
    I     vojnik-om   vojnic-ima

Morphology representation 2
• defines inflectional and derivational rules
• uses functions as building blocks:
  • A) condition functions
  • B) string transformation functions
• each defined using a higher-order function
• e.g.
  • sfx
  • sfx('a')
  • sfx('a')('vojnik') = 'vojnika'
  • sfx('e') ∘ alt(pal)
  • (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
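The two worked examples on this slide can be reproduced with ordinary closures. The following is a minimal Python sketch, not the Haskell implementation mentioned later in the talk; `compose` and the contents of the `PAL` table (palatalization k > č, g > ž, h > š) are assumptions introduced here for illustration.

```python
def sfx(suffix):
    """Higher-order function: returns a transformation that appends `suffix`."""
    return lambda stem: stem + suffix


def alt(table):
    """Returns a transformation that alternates the stem-final character."""
    def apply(stem):
        last = stem[-1:]
        return stem[:-1] + table.get(last, last)
    return apply


def compose(*functions):
    """compose(f, g)(x) == f(g(x)), i.e. the rightmost function applies first."""
    def composed(value):
        for function in reversed(functions):
            value = function(value)
        return value
    return composed


# Assumed content of the `pal` alternation: k > č, g > ž, h > š.
PAL = {"k": "č", "g": "ž", "h": "š"}

print(sfx("a")("vojnik"))                     # vojnika
print(compose(sfx("e"), alt(PAL))("vojnik"))  # vojniče
```

Because every building block is a function from string to string, rules stay composable: any new alternation or affix operation slots in without changing the others.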

Morphology representation 3

    Case  Singular    Plural
    N     vojnik-Ø    vojnic-i
    G     vojnik-a    vojnik-a
    D     vojnik-u    vojnic-ima
    A     vojnik-a    vojnik-e
    V     vojnič-e    vojnic-i
    L     vojnik-u    vojnic-ima
    I     vojnik-om   vojnic-ima

• (s.ends('k','g','h')(s) consGroup(s),
   {null, sfx('a'), sfx('u'), sfx('om'), sfx('e') ∘ alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘ alt(sib), sfx('e')})
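A rule of this shape, a condition paired with a set of transformations, can be sketched in Python as follows. The sketch models only the `ends('k','g','h')` part of the condition and leaves out `consGroup`; the contents of `PAL` and `SIB` (sibilarization k > c, g > z, h > s) and the helper names `compose`, `generate` and `RULE` are assumptions made for illustration.

```python
def sfx(suffix):
    return lambda stem: stem + suffix


def alt(table):
    def apply(stem):
        last = stem[-1:]
        return stem[:-1] + table.get(last, last)
    return apply


def compose(f, g):
    return lambda stem: f(g(stem))  # g applies first, as in f ∘ g


def ends(*letters):
    """Condition function: does the lemma end in one of the given letters?"""
    return lambda lemma: lemma.endswith(letters)


def null(stem):
    return stem  # zero ending


PAL = {"k": "č", "g": "ž", "h": "š"}  # assumed palatalization table
SIB = {"k": "c", "g": "z", "h": "s"}  # assumed sibilarization table

# (condition, transformations); the consGroup part of the condition is omitted.
RULE = (
    ends("k", "g", "h"),
    [null, sfx("a"), sfx("u"), sfx("om"), compose(sfx("e"), alt(PAL)),
     compose(sfx("i"), alt(SIB)), compose(sfx("ima"), alt(SIB)), sfx("e")],
)


def generate(rule, lemma):
    """All word-forms the rule yields for a lemma; empty if the condition fails."""
    condition, transformations = rule
    if not condition(lemma):
        return set()
    return {transform(lemma) for transform in transformations}


print(sorted(generate(RULE, "vojnik")))
print(generate(RULE, "brod"))  # set(): the rule does not apply
```

For "vojnik" this yields the eight distinct strings of the table: vojnik, vojnika, vojniku, vojnikom, vojniče, vojnici, vojnicima, vojnike.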

Morphology representation 4
• suitable also for more complex paradigms

  (c, {null, sfx('a'), sfx('u'), ..., sfx('ima')}
      ∪ {sfx('og'), sfx('om'), ..., sfx('ima')}
      ∪ {sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot), ..., sfx('ima') ∘ alt(jot)}
      ∪ {sfx('i') ∘ alt(jot) ∘ pfx('naj'), ..., sfx('ima') ∘ alt(jot) ∘ pfx('naj')})
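A cut-down version of such an adjective paradigm can be sketched for "brz". Everything elided with "..." on the slide stays elided here: the sketch uses only a handful of endings, ignores the condition `c`, and assumes that the `jot` alternation maps z to ž (the only case "brz" needs). All helper names are hypothetical.

```python
def sfx(suffix):
    return lambda stem: stem + suffix


def pfx(prefix):
    return lambda stem: prefix + stem


def alt(table):
    def apply(stem):
        last = stem[-1:]
        return stem[:-1] + table.get(last, last)
    return apply


def compose(*functions):
    def composed(value):
        for function in reversed(functions):  # rightmost applies first
            value = function(value)
        return value
    return composed


def null(stem):
    return stem


JOT = {"z": "ž"}  # assumed fragment of the jotation table

POSITIVE = [null, sfx("a"), sfx("og"), sfx("om"), sfx("ima")]
COMPARATIVE = [compose(sfx(e), alt(JOT)) for e in ("i", "eg", "ima")]
SUPERLATIVE = [compose(t, pfx("naj")) for t in COMPARATIVE]

# The paradigm is the union of the three groups of transformations.
PARADIGM = POSITIVE + COMPARATIVE + SUPERLATIVE

forms = {transform("brz") for transform in PARADIGM}
print(sorted(forms))
```

Building the superlative group by composing the comparative group with `pfx('naj')` mirrors how the slide's last set repeats the third one with an extra prefix.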

Morphology representation 5
• advantages
  • resembles the morphology descriptions found in traditional grammar books
  • requires a minimum amount of linguistic knowledge
  • highly expressive: arbitrary higher-order functions can be defined
  • can be applied to other morphologically similar languages
• implemented in Haskell
  • purely functional programming language
  • requires minimum programming skills

Lexicon acquisition
• uses inflectional rules + raw corpora to extract lemmas and their paradigms
• uses frequency counts of WFs attested in the corpus
• much of the ambiguity is resolved by language-dependent heuristics
  • plausibility, priority
• linguistic quality is not vital
  • word-form conflation rather than generation
• human intervention is not required
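The slides do not spell out the acquisition procedure, so the following is only a guess at its general flavour, not the authors' algorithm: every attested word-form is tried as a candidate lemma, each applicable rule generates its paradigm, and the candidate is accepted if enough of the generated forms are attested. The two rules, the threshold `min_attested` and the toy counts are invented for illustration, and the sketch checks attestation only, ignoring the actual frequencies and the plausibility/priority heuristics.

```python
from collections import Counter


def sfx(suffix):
    return lambda word: word + suffix


def swap_final(ending):
    """Replace the final character of the word by `ending`."""
    return lambda word: word[:-1] + ending


# Two hypothetical, heavily simplified rules: name -> (condition, transformations).
RULES = {
    "masc-noun": (lambda w: w[-1] not in "aeiou",
                  [sfx(e) for e in ("", "a", "u", "om", "e", "i", "ima")]),
    "fem-noun": (lambda w: w.endswith("a"),
                 [sfx("")] + [swap_final(e) for e in ("e", "i", "u", "om", "ama")]),
}


def acquire(corpus_counts, rules, min_attested=3):
    """Map each attested word-form to the candidate lemmas that explain it."""
    vocabulary = set(corpus_counts)
    lexicon = {}
    for candidate in vocabulary:
        for condition, transformations in rules.values():
            if not condition(candidate):
                continue
            attested = {t(candidate) for t in transformations} & vocabulary
            if len(attested) >= min_attested:
                for word_form in attested:
                    lexicon.setdefault(word_form, set()).add(candidate)
    return lexicon


corpus = Counter({"student": 5, "studenta": 3, "studentu": 2, "studentima": 1,
                  "brzina": 4, "brzine": 2, "brzinu": 1})
lexicon = acquire(corpus, RULES)
print(lexicon["studentima"])  # {'student'}
print(lexicon["brzinu"])      # {'brzina'}
```

In this toy run "studenta" is also tried as a feminine lemma, but only two of its hypothetical forms are attested, so the reading is rejected; this is a simple stand-in for the kind of ambiguity the heuristics are meant to resolve.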

Results
• example lexicon
  • acquired from a 20 Mw (million-word) newspaper corpus
  • based on 90 inflectional and >300 derivational rules
  • contains ca. 42,000 lemmas associated with over 500,000 WFs
• performance
  • linguistic quality F1 = 88% per type
  • coverage 96% per type and 98% per token
  • understemming = 7%
  • overstemming < 4%
  • can be improved further by manual editing
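The difference between coverage per type and coverage per token can be shown with a short computation. The sketch assumes the usual reading of the two terms (distinct word-forms vs. running words); the function name and the toy data are hypothetical and unrelated to the reported figures.

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (per-type, per-token) coverage of a non-empty token list."""
    counts = Counter(tokens)
    covered_types = sum(1 for word_form in counts if word_form in lexicon)
    covered_tokens = sum(n for word_form, n in counts.items() if word_form in lexicon)
    return covered_types / len(counts), covered_tokens / len(tokens)


tokens = ["vode", "vode", "vode", "brzina", "xyz"]
per_type, per_token = coverage(tokens, {"vode", "brzina"})
print(per_type, per_token)  # 2 of 3 types, 4 of 5 tokens
```

Per-token coverage is typically the higher of the two because frequent word-forms are more likely to be in the lexicon, which is consistent with the 96% vs. 98% reported above.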

Derivational normalization
• inflectional lexicon is partitioned into equivalence classes based on derivational rules
• degree of normalisation depends on the number of derivational rules used
• problem with semantics
  • context, degrees
  • derivation is not as semantically regular as inflection
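Partitioning lemmas into equivalence classes can be sketched with a small union-find structure: whenever a derivational rule maps one lemma onto another lemma in the lexicon, their classes are merged. The three rules below are invented simplifications based on the examples from the earlier slides (brz > brzina > brzinski, kralj > kraljica), and the function name is hypothetical.

```python
def derivational_classes(lemmas, rules):
    """Partition lemmas into classes linked by derivational rules (union-find)."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(lemma):
        while parent[lemma] != lemma:
            parent[lemma] = parent[parent[lemma]]  # path halving
            lemma = parent[lemma]
        return lemma

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived in parent:  # only link lemmas that are in the lexicon
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return list(classes.values())


# Hypothetical derivational rules.
RULES = [
    lambda w: w + "ina",                                  # brz > brzina
    lambda w: w[:-1] + "ski" if w.endswith("a") else None,  # brzina > brzinski
    lambda w: w + "ica",                                  # kralj > kraljica
]

LEMMAS = ["brz", "brzina", "brzinski", "kralj", "kraljica", "stan", "stanica"]
for cls in derivational_classes(LEMMAS, RULES):
    print(sorted(cls))
```

The last rule also merges "stan" and "stanica", the derivational mismatch noted earlier, which illustrates the semantic problem: adding rules increases conflation but also the risk of joining unrelated words.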

References and applications
• Reference
  • Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press)
• Applied in document indexing
  • projects AIDE & CADIAL, www.cadial.org
  • Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove & P. Popelier (eds). Brugge : Die Keure, 2008. pp. 107-117.
• Applied in text classification
  • Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.

Thank you for your attention!
