UTA Stemming and Lemmatization Experiments in the Bengali ad hoc Track at FIRE 2010

UTA Stemming and Lemmatization Experiments in the Bengali ad hoc Track at FIRE 2010. A. Loponen, J. Paik* & K. Jarvelin kalervo.jarvelin@uta.fi *ISI, Kolkata, India. Outline. Introduction Some Remarks on Bengali Experimental Systems YASS GRALE STALE Experiments Results Discussion.

Presentation Transcript


  1. UTA Stemming and Lemmatization Experiments in the Bengali ad hoc Track at FIRE 2010 A. Loponen, J. Paik* & K. Jarvelin kalervo.jarvelin@uta.fi *ISI, Kolkata, India

  2. Outline • Introduction • Some Remarks on Bengali • Experimental Systems • YASS • GRALE • STALE • Experiments • Results • Discussion

  3. Introduction • UTA participated in the monolingual Bengali ad hoc Track. Bengali morphology makes the language challenging for IR. • UTA experimented with three language normalizers: one stemmer, YASS, and two lemmatizers, GRALE and StaLe. • UTA submitted 6 runs to the Track: • title (T) runs (GRALE and StaLe), • title-and-description (TD) runs (all systems), and • a title-description-and-narrative (TDN) run (StaLe)

  4. Some Remarks on Bengali • Bengali is a highly inflectional language • one root can produce 20+ morphological forms • forms are generated by adding suffixes • nouns and pronouns inflect in the nominative, objective, genitive and locative cases • Bengali is productive in compound words • combinations of nouns, pronouns, adjectives and verbs • new words are also formed by derivation • Word form normalization is likely to increase term weights and affect retrieval effectiveness • Proper names are often either abstract nouns or adjectives • no capitalization of proper names

  5. Experimental Systems: YASS • YASS is a corpus-based, purely unsupervised statistical stemmer • Handles languages that are based on suffixing • Uses a string distance measure to cluster the lexicon such that each cluster is expected to contain all the morphological variants of a root word appearing in the corpus • Delivers stems based on the training collection
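The clustering idea on the YASS slide can be sketched in a few lines. This is only a toy illustration: the slide does not name the distance measure or the clustering algorithm, so `prefix_distance`, the greedy `cluster` routine, the `threshold` value and the English word list are all assumptions made for the example, not YASS itself.

```python
import os


def prefix_distance(a, b):
    """Stand-in distance (an assumption, not the YASS measure).

    0.0 for identical strings, 1.0 when the first characters already differ;
    a long shared prefix gives a small distance, which suits suffixing languages.
    """
    lcp = len(os.path.commonprefix([a, b]))
    return 1.0 - 2.0 * lcp / (len(a) + len(b))


def cluster(lexicon, threshold):
    """Greedy clustering: a word joins the first cluster in which it is
    within `threshold` of every member, otherwise it starts a new cluster."""
    clusters = []
    for word in sorted(lexicon):
        for members in clusters:
            if all(prefix_distance(word, m) <= threshold for m in members):
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def stems(clusters):
    """Map every word to the longest common prefix of its cluster."""
    return {w: os.path.commonprefix(c) for c in clusters for w in c}


# Toy lexicon (hypothetical data, chosen only to make the clusters obvious).
lexicon = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]
clusters = cluster(lexicon, threshold=0.4)
stem_of = stems(clusters)
```

With this toy lexicon the words fall into one cluster per root, and each word's stem is the shared prefix of its cluster. Because the stems come from whatever clusters the corpus lexicon produces, the output depends on the training collection, as the slide notes.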

  6. Experimental Systems: GRALE • GRALE is a graph-based lemmatizer for agglutinative languages • Two-step algorithm • extracts a set of frequent suffixes by measuring their n-gram frequency from the given corpus and • picks up case suffixes manually identified by a native speaker • words are then considered as nodes of a graph • a directed edge from node u to v exists if v can be generated from u by addition of a suffix taken from the selected suffix set. • The graph built over the lexicon is directed and acyclic.

  7. GRALE Principle • [Figure: suffix graph over word forms] • Nodes: bala ("tell"), bala+te, bala+te+i, balai ("name"), balai+ke, balai+ke+i, balai+er • Label in figure: Pivot
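The edge rule from the GRALE slides (an edge from u to v exists if v is u plus a suffix from the selected set) can be sketched as follows. The word forms are the ones shown on the GRALE Principle slide, written without the "+" separators; the suffix set `{"te", "i", "ke", "er"}` is read off those forms and is an assumption, as are the function and variable names. How GRALE selects pivots or lemmas from the graph is not described in the slides and is not attempted here.

```python
from collections import defaultdict


def build_suffix_graph(lexicon, suffixes):
    """Return adjacency lists u -> [v, ...] where v == u + s for some suffix s.

    Every edge points to a strictly longer word (suffixes are non-empty),
    so the resulting directed graph cannot contain a cycle.
    """
    words = set(lexicon)
    graph = defaultdict(list)
    for u in sorted(words):
        for s in sorted(suffixes):
            v = u + s
            if v in words:
                graph[u].append(v)
    return dict(graph)


# Word forms from the slide; the suffix set is inferred from them (assumption).
lexicon = ["bala", "balate", "balatei", "balai", "balaike", "balaikei", "balaier"]
suffixes = {"te", "i", "ke", "er"}

graph = build_suffix_graph(lexicon, suffixes)
for source, targets in sorted(graph.items()):
    print(source, "->", ", ".join(targets))
```

Note that with "i" in the suffix set the sketch also links bala to balai, although the slide glosses them as different words ("tell" and "name"). Edge construction alone therefore does not decide which nodes are lemmas.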

  8. Experimental Systems: StaLe • StaLe is a statistical, rule-based lemmatizer that can also process out-of-vocabulary (OOV) words • Two phases: • one-time creation of the transformation rules for a given language, • repeated lemma generation for input words • The training data set for Bengali was obtained by randomly selecting 11000 unique inflected tokens from the FIRE2008 test corpus • The training data set consisted of nouns only

  9. StaLe Principle • Learning corpus (nouns only): Häuser -> Haus, Lehrerinnen -> Lehrer, Menschens -> Mensch, Säulen -> Säule • Rules learned, each stored with # and cf: häuser -> haus, rinnen -> r, hens -> h, en -> e • Legend: # = count, cf = confidence factor
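The lemma generation phase can be illustrated with the four German rules from the StaLe Principle slide. This is a minimal sketch under stated assumptions: the slide gives no numeric confidence factors, so the `cf` values below are invented placeholders, and the matching and ranking logic (`candidate_lemmas`, `min_cf`) is a plausible reading of "suffix rule with confidence factor", not the StaLe implementation.

```python
# (suffix to strip, replacement, confidence factor).
# The rules are from the slide; the cf numbers are hypothetical placeholders.
RULES = [
    ("häuser", "haus", 0.9),
    ("rinnen", "r", 0.8),
    ("hens", "h", 0.7),
    ("en", "e", 0.6),
]


def candidate_lemmas(word, rules, min_cf=0.0):
    """Apply every rule whose suffix matches the end of the lowercased word.

    Returns (lemma, cf) pairs, highest confidence first. Returning all
    matches rather than a single lemma is a design choice of this sketch.
    """
    word = word.lower()
    candidates = []
    for suffix, replacement, cf in rules:
        if cf >= min_cf and word.endswith(suffix):
            lemma = word[: len(word) - len(suffix)] + replacement
            candidates.append((lemma, cf))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates


for token in ["Häuser", "Lehrerinnen", "Säulen"]:
    print(token, "->", candidate_lemmas(token, RULES))
```

Because the rules operate on word endings only, they can be applied to words never seen in training, which is how a rule-based approach of this kind can handle OOV input.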

  10. Experiments • The Bengali test collection • 123047 documents with an average length of 362 words • 50 topics • average query length: • title (T) queries 6 words, • TD queries 17 words, • TDN queries 44 words • The recall base has 510 relevant documents • Search engine: Lemur v4.7 Indri

  11. UTA Runs

      System     T    TD   TDN
      YASS       -    x    -
      GRALE      x    x    -
      StaLe      x    x    x
      Baseline   x    x    x

  12. Results

      System     Metric   T       TD      TDN
      StaLe      MAP      33.74   44.88   50.58
                 P@10     30.80   37.00   41.40
      GRALE      MAP      34.58   44.51   -
                 P@10     30.20   37.40   -
      YASS       MAP      -       45.11   -
                 P@10     -       38.00   -
      Baseline   MAP      27.37   38.92   44.55
                 P@10     25.20   32.80   38.00
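The gap between each normalizer and the baseline can be computed directly from the MAP figures on the Results slide. The sketch below copies those figures and reports the absolute and relative gain per query type; the function and variable names are illustrative, and P@10 is left out only to keep the example short.

```python
# MAP values as given on the Results slide (runs not submitted are omitted).
MAP = {
    "StaLe": {"T": 33.74, "TD": 44.88, "TDN": 50.58},
    "GRALE": {"T": 34.58, "TD": 44.51},
    "YASS": {"TD": 45.11},
    "Baseline": {"T": 27.37, "TD": 38.92, "TDN": 44.55},
}


def gains_over_baseline(scores, baseline="Baseline"):
    """Return {system: {query_type: (absolute_gain, relative_gain_percent)}}."""
    base = scores[baseline]
    gains = {}
    for system, runs in scores.items():
        if system == baseline:
            continue
        gains[system] = {
            query_type: (
                round(value - base[query_type], 2),
                round(100.0 * (value - base[query_type]) / base[query_type], 1),
            )
            for query_type, value in runs.items()
        }
    return gains


gains = gains_over_baseline(MAP)
for system, runs in gains.items():
    for query_type, (absolute, relative) in runs.items():
        print(f"{system:6s} {query_type:3s} +{absolute:.2f} MAP ({relative:+.1f}%)")

# Spread between the best and worst normalizer on the TD runs.
td_scores = [runs["TD"] for name, runs in MAP.items() if name != "Baseline"]
td_spread = round(max(td_scores) - min(td_scores), 2)
```

Every normalizer gains between 5.59 and 7.21 MAP points over the baseline, while the three TD runs differ from each other by only 0.60 points.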

  13. Discussion • All normalizers better than the baselines • Also quite competitive in the campaign • The MAP range of others was … • Differences between normalizers are minor • Their benefits lie elsewhere: • Generality • Training cost • Form of output
