UTA Stemming and Lemmatization Experiments in the Bengali ad hoc Track at FIRE 2010

A. Loponen, J. Paik* & K. Jarvelin, kalervo.jarvelin@uta.fi (*ISI, Kolkata, India)

Outline • Introduction • Some Remarks on Bengali • Experimental Systems • YASS • GRALE • StaLe • Experiments • Results • Discussion

Introduction • UTA participated in the monolingual Bengali ad hoc Track. Bengali is morphologically complex, which complicates IR. • UTA experimented with three language normalizers: one stemmer, YASS, and two lemmatizers, GRALE and StaLe. • UTA submitted 6 runs to the Track: • title (T) runs (GRALE and StaLe), • title-and-description (TD) runs (all systems), and • a title-description-and-narrative (TDN) run (StaLe)

Some Remarks on Bengali • Bengali is a highly inflectional language • one root can produce 20+ morphological forms • forms are generated by adding suffixes • nouns and pronouns inflect in the nominative, objective, genitive and locative cases • Bengali is productive in compound words • combinations of nouns, pronouns, adjectives and verbs • new words are also formed by derivation • Word form normalization is likely to increase the term weights and affect retrieval effectiveness • Proper names are often either abstract nouns or adjectives • no capitalization of proper names

Experimental Systems: YASS • YASS is a corpus-based, purely unsupervised statistical stemmer • Handles languages that are based on suffixing • Uses a string distance measure to cluster the lexicon such that each cluster is expected to contain all the morphological variations of a root word appearing in the corpus • Delivers stems based on the training collection
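The clustering idea on this slide can be sketched as follows. This is a minimal illustration, not the YASS implementation: the `prefix_distance` function, the greedy complete-linkage loop, the threshold value, and the choice of the shortest cluster member as the stem are all assumptions made for the example, and the English word list is only a stand-in for a real lexicon.

```python
def prefix_distance(a: str, b: str) -> float:
    """Illustrative distance (an assumption, not YASS's own measure):
    0.0 for identical words, growing towards 1.0 as the shared prefix shrinks."""
    if not a and not b:
        return 0.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 1.0 - common / max(len(a), len(b))


def cluster(words, threshold):
    """Greedy complete-linkage clustering: a word joins the first cluster
    in which every member lies within `threshold` of it."""
    clusters = []
    for w in sorted(set(words)):
        for c in clusters:
            if all(prefix_distance(w, m) <= threshold for m in c):
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters


def stems(clusters):
    """Map every word to a cluster representative (here: the shortest member)."""
    return {w: min(c, key=len) for c in clusters for w in c}


lexicon = ["walk", "walks", "walked", "walking", "talk", "talks"]
groups = cluster(lexicon, threshold=0.5)
word_to_stem = stems(groups)
```

Because the clusters are derived from the lexicon alone, the resulting stems depend on the collection used for training, as the slide notes.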

Experimental Systems: GRALE • GRALE is a graph-based lemmatizer for agglutinative languages • Two-step algorithm • extracts a set of frequent suffixes by measuring their n-gram frequency from the given corpus and • picks up case suffixes manually identified by a native speaker • words are then considered as nodes of a graph • a directed edge from node u to v exists if v can be generated from u by addition of a suffix taken from the selected suffix set. • The graph built over the lexicon is directed and acyclic.

GRALE Principle (figure): a suffix graph over the word forms bala (tell), bala+te, bala+te+i, balai (name), balai+ke, balai+ke+i and balai+er, with a pivot node marked.
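The edge rule described for GRALE (an edge from u to v when v equals u plus a suffix from the selected set) can be sketched as below. This covers only the graph construction; how lemmas or pivots are then chosen from the graph is not described on the slides and is not modeled. The function name `build_suffix_graph`, the romanized lexicon, and the suffix set are illustrative assumptions taken from the figure's example forms.

```python
from collections import defaultdict


def build_suffix_graph(lexicon, suffixes):
    """Return {u: [(v, suffix), ...]} with an edge u -> v whenever
    v is in the lexicon and v == u + suffix for some selected suffix."""
    words = set(lexicon)
    graph = defaultdict(list)
    for u in words:
        for s in suffixes:
            v = u + s
            if v in words:
                graph[u].append((v, s))
    # Sort edge lists so the output is deterministic.
    return {u: sorted(edges) for u, edges in graph.items()}


# Word forms from the figure, written without the "+" separators.
lexicon = ["bala", "balate", "balatei", "balai", "balaike", "balaikei", "balaier"]
suffixes = ["te", "i", "ke", "er"]  # assumed suffix set for this example
graph = build_suffix_graph(lexicon, suffixes)
```

Since every suffix is non-empty, each edge leads to a strictly longer word, which is why a graph built this way cannot contain cycles.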

Experimental Systems: StaLe • StaLe is a statistical, rule-based lemmatizer that also handles out-of-vocabulary (OOV) words • Two phases: • one-time creation of the transformation rules for a given language, • repeated lemma generation for input words • The training data set for Bengali was obtained by randomly selecting 11,000 unique inflected tokens from the FIRE2008 test corpus • The training data set consisted of nouns only

StaLe Principle (figure) • Learning corpus (nouns only): Häuser -> Haus • Lehrerinnen -> Lehrer • Menschens -> Mensch • Säulen -> Säule • Rules learned: häuser -> haus # cf • rinnen -> r # cf • hens -> h # cf • en -> e # cf • Legend: # = count, cf = confidence factor
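The lemma generation phase can be illustrated with the four rules shown in the figure. This is a sketch under stated assumptions rather than StaLe's actual procedure: the confidence values are invented placeholders (the slide gives none), and the function `candidate_lemmas`, its `min_confidence` parameter, and the fallback to the unchanged word are hypothetical.

```python
# (source suffix, target suffix, confidence factor); the confidence
# values are made-up placeholders, not figures from the presentation.
RULES = [
    ("häuser", "haus", 0.9),
    ("rinnen", "r", 0.8),
    ("hens", "h", 0.7),
    ("en", "e", 0.6),
]


def candidate_lemmas(word, rules=RULES, min_confidence=0.0):
    """Apply every rule whose source suffix matches the end of the word and
    return (candidate, confidence) pairs, most confident first."""
    w = word.lower()
    candidates = []
    for source, target, cf in rules:
        if w.endswith(source) and cf >= min_confidence:
            candidates.append((w[: len(w) - len(source)] + target, cf))
    candidates.sort(key=lambda c: c[1], reverse=True)
    # Assumed fallback: keep the word itself when no rule applies.
    return candidates or [(w, 0.0)]


teachers = candidate_lemmas("Lehrerinnen")
columns = candidate_lemmas("Säulen")
unknown = candidate_lemmas("Tisch")
```

A word can match several rules, so the sketch returns a ranked list instead of a single lemma; raising `min_confidence` would trade candidate coverage for precision.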

Experiments • The Bengali test collection • 123047 documents with an average length of 362 words • 50 topics • average query lengths: • title (T) queries 6 words, • TD queries 17 words, • TDN queries 44 words • The recall base has 510 relevant documents • Search engine: Lemur v4.7 Indri

UTA Runs

System     T    TD   TDN
YASS       -    x    -
GRALE      x    x    -
StaLe      x    x    x
Baseline   x    x    x

Results

System     Measure   T       TD      TDN
StaLe      MAP       33.74   44.88   50.58
StaLe      P@10      30.80   37.00   41.40
GRALE      MAP       34.58   44.51   -
GRALE      P@10      30.20   37.40   -
YASS       MAP       -       45.11   -
YASS       P@10      -       38.00   -
Baseline   MAP       27.37   38.92   44.55
Baseline   P@10      25.20   32.80   38.00
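For readers unfamiliar with the two measures in the results, the sketch below shows the standard definitions of P@10 and mean average precision (MAP) on a toy ranking. The function names and the document identifiers are illustrative assumptions; the reported figures appear to be these measures expressed as percentages.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document
    retrieved, divided by the total number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(topics):
    """topics: list of (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Toy example with hypothetical document ids.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)       # (1/1 + 2/3) / 2
p_at_2 = precision_at_k(ranked, relevant, k=2)
```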

Discussion • All normalizers performed better than the baselines • They were also quite competitive in the campaign • The MAP range of others was … • Differences between normalizers are minor • Their benefits lie elsewhere: • Generality • Training cost • Form of output
