ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task

Avinash Yadav, Robins Yadav, Sukomal Pal • Department of Computer Science & Engineering, Indian School of Mines, Dhanbad, India

Contents • Introduction • Adhoc retrieval task participation • Morpheme Extraction Task participation • Conclusion

Introduction • Stemmer • ISMstemmer • Evaluation

Stemmer • Attempts to reduce word variants to their stem or root form • Example: education, educating, educative all reduce to educat • Approaches for stemming • Language-based approach • Statistical approach

ISMstemmer • statistical stemmer • based on suffix extraction • suffix frequency • algorithm

Data Preprocessing
• Convert the corpus (File 1, File 2, …, File n) into a single file
  Example: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”.
• Cleaning of data
  Example: John asked a girl with an apple of Kashmir do you have the time she said yes
• Removing stop words
  Example: John asked girl with apple Kashmir you time she said yes
• Convert file into single column
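The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is hypothetical (chosen only so that the slide's example sentence comes out as shown), and the function names are made up for this sketch. Unlike the slide, the sketch does not change letter case.

```python
import string

# Hypothetical stop-word list, chosen only to reproduce the slide's example;
# the list actually used by the authors is not given in the slides.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}

# ASCII punctuation plus the curly quotes that appear in the example sentence.
PUNCTUATION = string.punctuation + "“”‘’"


def clean(text: str) -> str:
    """Replace punctuation with spaces and collapse runs of whitespace."""
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    return " ".join(text.translate(table).split())


def remove_stop_words(text: str) -> list[str]:
    """Drop stop words (case-insensitive match), keeping the word order."""
    return [w for w in text.split() if w.lower() not in STOP_WORDS]


def preprocess(documents: list[str]) -> list[str]:
    """Merge all documents, clean them, remove stop words.

    The returned list is the 'single column': one word per entry.
    """
    merged = " ".join(documents)
    return remove_stop_words(clean(merged))
```

Writing the result of `preprocess` to disk with one entry per line would give the single-column file used in the following steps.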

Data Preprocessing (contd.) • Unique words extracted • Hindi: 4,90,391 • English: 7,95,144

Find valid suffixes
• Single-column word list: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
• Reverse the words of the single-column file: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna
• Sort the reversed list: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
• Find suffixes according to a frequency threshold (reversed suffixes shown)
  • Threshold 17%: de, gni, noit
  • Threshold 40%: gni
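A possible Python sketch of the reverse, sort and threshold steps is given below. The slides do not say how candidate suffix lengths are chosen, so this sketch makes two assumptions that are not from the source: it counts every reversed-word prefix up to a hypothetical `max_len`, and it keeps a frequent suffix only if no one-letter-longer suffix is exactly as frequent (otherwise "d" and "g" would pass alongside "ed" and "ing"). The threshold is expressed as a fraction of the word list.

```python
from collections import Counter


def find_suffixes(words: list[str], threshold: float, max_len: int = 6) -> list[str]:
    """Return suffixes whose relative frequency is at least `threshold`.

    Assumption (not from the slides): a suffix is dropped when a suffix
    one letter longer occurs in exactly the same number of words.
    """
    # Reverse and sort, so words sharing a suffix become neighbours.
    reversed_words = sorted(w[::-1] for w in words)

    # Count each prefix of the reversed words (= suffix of the original),
    # always leaving at least one character as the stem.
    counts = Counter()
    for rw in reversed_words:
        for k in range(1, min(max_len, len(rw) - 1) + 1):
            counts[rw[:k]] += 1

    # For each prefix, record the highest count among its one-letter extensions.
    best_child = Counter()
    for prefix, count in counts.items():
        if len(prefix) > 1:
            parent = prefix[:-1]
            best_child[parent] = max(best_child[parent], count)

    min_count = threshold * len(reversed_words)
    kept = [p for p, c in counts.items() if c >= min_count and best_child[p] < c]

    # Reverse back so the caller sees ordinary suffixes.
    return sorted(p[::-1] for p in kept)


WORDS = ("aborning absolution absorption abuilding acquisition activation added "
         "addition admiration admitted admitting agreed agreeing allotted "
         "allotting ambling angling").split()
```

On the 17 example words, "ing" occurs in 7 words, "tion" in 6 and "ed" in 4, so `find_suffixes(WORDS, 0.17)` keeps all three, while `find_suffixes(WORDS, 0.40)` keeps only "ing".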

Threshold used • English: 0.01 - 0.1% • Hindi: 0.1 – 1.0%

Stemming of corpus
• Sorted reversed words: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
• Stem the reversed words with reversed valid suffixes: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
• Reverse stemmed words to get the original words: add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu

Note: If stemming would leave fewer than 3 letters, the word is not stemmed. Example: aging and king are left unchanged, because stemming would give ag and k.
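The stemming step and the length rule in the note can be sketched in Python as follows. This is an illustration under assumptions: the slides do not say which suffix wins when several match, so the sketch tries the longest suffix first, and the function name and `min_stem_len` parameter are hypothetical.

```python
def stem(word: str, suffixes: set[str], min_stem_len: int = 3) -> str:
    """Strip one valid suffix from `word`, working on the reversed form.

    The word is returned unchanged when no suffix matches or when the
    remaining stem would be shorter than `min_stem_len` letters.
    """
    reversed_word = word[::-1]
    # Assumption: prefer the longest matching suffix.
    for suffix in sorted(suffixes, key=len, reverse=True):
        reversed_suffix = suffix[::-1]
        stem_len = len(reversed_word) - len(reversed_suffix)
        if reversed_word.startswith(reversed_suffix) and stem_len >= min_stem_len:
            # Cut the reversed suffix off the front, then reverse back.
            return reversed_word[len(reversed_suffix):][::-1]
    return word


VALID_SUFFIXES = {"ed", "ing", "tion"}
```

With these three suffixes, `stem("added", VALID_SUFFIXES)` gives "add" and `stem("admiration", VALID_SUFFIXES)` gives "admira", while "aging" and "king" come back unchanged because their stems would have fewer than 3 letters.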

Evaluation of ISMstemmer • To evaluate ISMstemmer, we participated in: • Monolingual Adhoc Retrieval Task in English and Hindi • Morpheme Extraction Task (MET) of FIRE 2012

Adhoc Retrieval Task (ART) Participation • Monolingual task • Languages chosen: • English • Approach • Results • Hindi • Approach • Results

ART: English Approach: • Indexing: • Search engine used: Indri (IndriBuildIndex) • Retrieval: • Search engine used: Lemur (RetEval) • Data provided: • Corpus from The Telegraph and BD News • 50 query set

ART: English (contd….) • Results:

ART: Hindi Approach: • Indexing: • Search engine used: Indri (IndriBuildIndex) • Retrieval: • Search engine used: Indri (IndriRunQuery) • Data provided: • Corpus from Navbharat Times and Amar Ujala • 50 query set

ART: Hindi (contd….) • Results:

Morpheme Extraction Task Participation • Tool submitted • Results

MET Tool Submission • ISMstemmer submitted • Evaluated at IR Labs, DAIICT, Gujarat • Tested on 6 languages of South Asian origin • Improved over the baseline MAP for 3 languages (Bengali, Gujarati, Marathi)

MET Results
1. BENGALI
Institute      Language   MAP Obtained
Baseline       Bengali    0.2740
JU             Bengali    0.3307
DCU            Bengali    0.3300
IIT-KGP        Bengali    0.3225
CVPR-Team1     Bengali    0.3159
ISM            Bengali    0.3103
CVPR-Team2+    Bengali    NA

MET Results (contd.)
2. GUJARATI
Institute      Language   MAP Obtained
Baseline       Gujarati   0.2677
ISM            Gujarati   0.2824
3. MARATHI
Institute      Language   MAP Obtained
Baseline       Marathi    0.2320
ISM            Marathi    0.2797
IIT-B          Marathi    0.2684

MET Results (contd.)
4. ODIA
Institute      Language   MAP Obtained
Baseline       Odia       0.1537
IIIT-Bh        Odia       0.1537
ISM            Odia       0.1537
5. HINDI
Institute      Language   MAP Obtained
Baseline       Hindi      0.2821
DCU            Hindi      0.2963
ISM            Hindi      0.2793

MET Results (contd.)
6. TAMIL
Institute      Language   MAP Obtained
Baseline       Tamil      NA
AUCEG          Tamil      NA
ISM            Tamil      NA
NA: results are not available due to non-availability of qrels
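To make the comparison against the baseline easier to read, the ISM and baseline MAP values from the tables above can be turned into relative changes. The sketch below only does this arithmetic on the reported numbers; the dictionary and function names are made up for illustration, and Tamil is left out because no MAP values are available.

```python
# (baseline MAP, ISM MAP) per language, copied from the MET result tables.
MET_MAP = {
    "Bengali": (0.2740, 0.3103),
    "Gujarati": (0.2677, 0.2824),
    "Marathi": (0.2320, 0.2797),
    "Odia": (0.1537, 0.1537),
    "Hindi": (0.2821, 0.2793),
}


def relative_change(baseline: float, system: float) -> float:
    """Relative change of `system` over `baseline`, in percent."""
    return (system - baseline) / baseline * 100.0


def summarize(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map each language to ISM's relative MAP change over the baseline."""
    return {lang: relative_change(base, ism) for lang, (base, ism) in results.items()}


if __name__ == "__main__":
    for lang, change in summarize(MET_MAP).items():
        print(f"{lang}: {change:+.1f}%")
```

The relative change is about +13.2% for Bengali, +5.5% for Gujarati and +20.6% for Marathi, zero for Odia, and about -1.0% for Hindi.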

Reasons for Underperformance with Hindi • overstemming • undesired stemming of proper nouns

Overstemming
• Words that should not be grouped together are reduced to the same stem
• Example:
  • accent, accentual, accentuate: stem word accent
  • accept, acceptant, acceptor: stem word accept
  • access, accessible, accession: stem word access
• With overstemming, all of these words may be grouped under the wrong stem acce

Undesired stemming of proper nouns • Proper nouns should not be stemmed, as they are not inflected • Example: Beijing gets stemmed to Beij
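Both failure modes can be reproduced with a small Python sketch. The suffix lists here are hypothetical and chosen only to trigger the behaviour described on the slides; they are not the suffixes ISMstemmer actually extracted. The stemming function is the same kind of illustrative suffix stripper as before (longest matching suffix first, minimum stem length of 3), which is itself an assumption.

```python
from collections import defaultdict


def strip_suffix(word: str, suffixes: set[str], min_stem_len: int = 3) -> str:
    """Remove the longest matching suffix if at least `min_stem_len` letters remain."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    return word


def group_by_stem(words: list[str], suffixes: set[str]) -> dict[str, list[str]]:
    """Group words under the stem they are reduced to."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_suffix(word, suffixes)].append(word)
    return dict(groups)


# Hypothetical over-aggressive suffixes: short, spurious endings that cut
# into the true stems accent, accept and access.
SPURIOUS_SUFFIXES = {"nt", "ntual", "ntuate", "pt", "ptant", "ptor",
                     "ss", "ssible", "ssion"}

WORDS = ["accent", "accentual", "accentuate",
         "accept", "acceptant", "acceptor",
         "access", "accessible", "accession"]
```

With `SPURIOUS_SUFFIXES`, `group_by_stem(WORDS, SPURIOUS_SUFFIXES)` puts all nine words under the single stem "acce", so three unrelated word families are conflated. With the ordinary suffix "ing", `strip_suffix("Beijing", {"ing"})` gives "Beij", because the stripper has no way to tell a proper noun from an inflected form.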

Conclusion
ART:
• English: not satisfactory
• Hindi: poor
• Reasons:
  • overstemming
  • undesired stemming of proper nouns
MET:
• improved over the baseline for Bengali, Gujarati and Marathi
• matched the baseline for Odia
• underperformed with Hindi


THANK YOU!!
