UTA Stemming and Lemmatization Experiments in the Bengali ad hoc Track at FIRE 2010. A. Loponen, J. Paik* & K. Jarvelin kalervo.jarvelin@uta.fi *ISI, Kolkata, India. Outline. Introduction Some Remarks on Bengali Experimental Systems YASS GRALE STALE Experiments Results Discussion.
UTA Stemming and Lemmatization Experiments in the Bengali ad hoc Track at FIRE 2010 A. Loponen, J. Paik* & K. Jarvelin kalervo.jarvelin@uta.fi *ISI, Kolkata, India
Outline
• Introduction
• Some Remarks on Bengali
• Experimental Systems
  • YASS
  • GRALE
  • StaLe
• Experiments
• Results
• Discussion
Introduction
• UTA participated in the monolingual Bengali ad hoc Track. Bengali's rich morphology makes it challenging for IR.
• UTA experimented with three language normalizers: one stemmer (YASS) and two lemmatizers (GRALE and StaLe).
• UTA submitted 6 runs to the Track:
  • title (T) runs (GRALE and StaLe)
  • title-and-description (TD) runs (all systems)
  • a title-description-and-narrative (TDN) run (StaLe)
Some Remarks on Bengali
• Bengali is a highly inflectional language
  • one root can produce 20+ morphological forms
  • forms are generated by adding suffixes
  • nouns and pronouns inflect for the nominative, objective, genitive and locative cases
• Bengali is productive in compound words
  • combinations of nouns, pronouns, adjectives and verbs
  • new words are also formed by derivation
• Word form normalization is likely to increase the term weights and affect retrieval effectiveness
• Proper names are often either abstract nouns or adjectives
  • no capitalization of proper names
Experimental Systems: YASS
• YASS is a corpus-based, purely unsupervised statistical stemmer
• Handles languages that are based on suffixing
• Uses a string distance measure to cluster the lexicon so that each cluster is expected to contain all the morphological variants of a root word appearing in the corpus
• Delivers stems based on the training collection
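The clustering idea can be sketched as follows. This is only an illustration, not YASS itself: `prefix_distance`, the greedy clustering loop, the `threshold` value and the toy word list are all assumptions made for this example, standing in for whatever distance measures and clustering procedure YASS actually uses.

```python
def prefix_distance(a: str, b: str) -> float:
    """Hypothetical stand-in distance: 0.0 for identical words, approaching 1.0
    as the shared prefix shrinks relative to the longer word."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 1.0 - common / max(len(a), len(b))


def cluster_lexicon(lexicon, threshold=0.5):
    """Greedy clustering: a word joins the first cluster in which every member
    is within `threshold`; otherwise it starts a new cluster."""
    clusters = []
    for word in sorted(lexicon):
        for cluster in clusters:
            if all(prefix_distance(word, m) <= threshold for m in cluster):
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def stem_map(clusters):
    """Map each word to the shortest member of its cluster, used here as the stem."""
    return {w: min(c, key=len) for c in clusters for w in c}


# Toy transliterated strings, for illustration only.
lexicon = ["balai", "balaike", "balaier", "nadi", "nadir"]
stems = stem_map(cluster_lexicon(lexicon))
```

Because the clusters are built from the lexicon at hand, the stems depend on the collection used for training, which matches the last bullet of the slide.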
Experimental Systems: GRALE
• GRALE is a graph-based lemmatizer for agglutinative languages
• Two-step algorithm
  • extracts a set of frequent suffixes by measuring their n-gram frequency in the given corpus, and picks up the case suffixes manually identified by a native speaker
  • words are then treated as nodes of a graph
    • a directed edge from node u to node v exists if v can be generated from u by adding a suffix taken from the selected suffix set
• The graph built over the lexicon is directed and acyclic
GRALE Principle (figure)
• Nodes shown: bala | tell, bala+te, bala+te+i, balai | name, balai+ke, balai+ke+i, balai+er
• One node is labeled "Pivot"
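The graph construction described for GRALE can be sketched on the word forms from the GRALE Principle slide. This is a minimal sketch under stated assumptions: the function names `build_suffix_graph` and `roots` are hypothetical, the suffix set is simply read off the slide's word forms, and the role of the "Pivot" node is not described in the slides and is not modeled.

```python
from collections import defaultdict


def build_suffix_graph(lexicon, suffixes):
    """Return adjacency lists u -> [v, ...] where v == u + s for some s in `suffixes`.
    Every edge lengthens the word, so the graph is directed and acyclic."""
    words = set(lexicon)
    graph = defaultdict(list)
    for u in sorted(words):
        for s in suffixes:
            v = u + s
            if v in words:
                graph[u].append(v)
    return dict(graph)


def roots(lexicon, graph):
    """Words with no incoming edge (a simplification made for this sketch)."""
    targets = {v for vs in graph.values() for v in vs}
    return sorted(w for w in set(lexicon) if w not in targets)


# Word forms from the slide, written without the "+" separators.
lexicon = ["bala", "balate", "balatei", "balai", "balaike", "balaikei", "balaier"]
suffixes = ["te", "i", "ke", "er"]  # assumed suffix set, read off the slide

graph = build_suffix_graph(lexicon, suffixes)
```

With this suffix set the sketch also produces an edge from bala to balai, because bala plus the suffix i yields balai, although the slide glosses the two as different words ("tell" and "name"). Separating such cases needs more than the plain suffix graph shown here.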
Experimental Systems: StaLe
• StaLe is a statistical, rule-based lemmatizer that also handles out-of-vocabulary (OOV) words
• Two phases:
  • one-time creation of the transformation rules for a given language
  • repeated lemma generation for input words
• The training data set for Bengali was obtained by randomly selecting 11000 unique inflected tokens from the FIRE2008 test corpus
• The training data set consisted of nouns only
StaLe Principle
• Learning corpus (nouns only)
  • Häuser -> Haus
  • Lehrerinnen -> Lehrer
  • Menschens -> Mensch
  • Säulen -> Säule
• Rules learned
  • häuser -> haus # cf
  • rinnen -> r # cf
  • hens -> h # cf
  • en -> e # cf
• Legend: # = count, cf = confidence factor
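The lemma generation phase can be illustrated with the four example rules from the StaLe Principle slide. This is a sketch, not StaLe's actual implementation: the `Rule` type, the `candidate_lemmas` function, the confidence values and the fallback of returning the input word unchanged are all assumptions, since the slide gives only the placeholders "#" and "cf".

```python
from typing import NamedTuple


class Rule(NamedTuple):
    suffix: str        # word ending to match
    replacement: str   # string that replaces the matched ending
    confidence: float  # hypothetical value; the slide shows only "cf"


RULES = [
    Rule("häuser", "haus", 0.9),
    Rule("rinnen", "r", 0.8),
    Rule("hens", "h", 0.7),
    Rule("en", "e", 0.6),
]


def candidate_lemmas(word, rules=RULES, min_confidence=0.0):
    """Apply every matching suffix rule and return (lemma, confidence) pairs,
    most confident first. If no rule matches, return the word itself
    (an assumed fallback for this sketch)."""
    word = word.lower()
    candidates = []
    for rule in rules:
        if word.endswith(rule.suffix) and rule.confidence >= min_confidence:
            lemma = word[: len(word) - len(rule.suffix)] + rule.replacement
            candidates.append((lemma, rule.confidence))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates or [(word, 0.0)]


print(candidate_lemmas("Lehrerinnen"))
print(candidate_lemmas("Säulen", min_confidence=0.5))
```

Because the rules operate on word endings only, they can be applied to words never seen in training, which is how a rule-based approach of this kind can process OOV words.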
Experiments
• The Bengali test collection
  • 123047 documents with an average length of 362 words
  • 50 topics; average query lengths:
    • title (T) queries: 6 words
    • TD queries: 17 words
    • TDN queries: 44 words
  • The recall base has 510 relevant documents
• Lemur v4.7 Indri search engine
UTA Runs
System     T    TD   TDN
YASS       -    x    -
GRALE      x    x    -
StaLe      x    x    x
Baseline   x    x    x
Results
System     Metric   T       TD      TDN
StaLe      MAP      33.74   44.88   50.58
           P@10     30.80   37.00   41.40
GRALE      MAP      34.58   44.51   -
           P@10     30.20   37.40   -
YASS       MAP      -       45.11   -
           P@10     -       38.00   -
Baseline   MAP      27.37   38.92   44.55
           P@10     25.20   32.80   38.00
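For readers unfamiliar with the two measures, the sketch below shows the standard definitions of P@10 and mean average precision (MAP). The function names and the toy ranking are illustrative assumptions; the slides do not say which evaluation tool produced the figures, and the reported values appear to be expressed as percentages.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Sum of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """`runs` is a list of (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: one topic, two relevant documents (hypothetical identifiers).
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)        # (1/1 + 2/3) / 2
p_at_2 = precision_at_k(ranked, relevant, k=2)  # 1 relevant in the top 2
```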
Discussion
• All normalizers outperformed the baselines
• They were also quite competitive in the campaign
• The MAP range of others was …
• Differences between normalizers are minor
• Their distinguishing benefits lie elsewhere:
  • Generality
  • Training cost
  • Form of output