
Marathi – Marathi Monolingual Information Retrieval




  1. Marathi – Marathi Monolingual Information Retrieval • Mr. Ashish Almeida • Prof. Pushpak Bhattacharyya

  2. Overview • Morphological analyzer • Suffix processing • Stop-words • Future work

  3. Present work • Search “भारत” – bhaarat – Bharat • Will not match pages that contain inflected forms such as भारताचा – bhaarataachaa – of Bharat, or भारतात – bhaarataat – in Bharat • Lack of a large corpus • Unavailability of tools

  4. Corpus statistics: Marathi • 99,275 documents (510 MB) • Maharashtra Times • Sakal News • April 2004 to September 2007 • UTF-8 encoding • XML tags: DOC (document), DOCNO (document identifier), TEXT (article)

  5. Document example <DOC> <DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO> <TEXT> मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला (Leopard attacks a young man who had gone to collect Moha flowers) इस्लापूर, ता. २२ - चारोळी आणि मोहफूल वेचण्यासाठी जंगलात गेलेल्या एका आदिवासी तरुणावर बिबट्याने अचानक हल्ला केल्याने तो तरुण गंभीर जखमी झाला आहे. ही घटना शुक्रवारी (ता. २०) मुळझरा (ता. किनवट) या गावाच्या जंगलात घडली. ....... इस्लापूर वन परिक्षेत्र कार्यालयाअंतर्गत येणाऱ्या मुळझरा येथील आदिवासी तरुण मनोहर . . . . . . </TEXT> </DOC>
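
The DOC / DOCNO / TEXT layout above can be read with a few lines of Python. This is a minimal sketch, assuming each corpus file is UTF-8 text holding one or more `<DOC>` records shaped like the slide's example. The function name `parse_docs` and the regex-based approach are illustrative assumptions, not the authors' tooling.

```python
import re

# Matches one <DOC> ... </DOC> record with its DOCNO and TEXT fields.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(?P<docno>.*?)</DOCNO>\s*<TEXT>(?P<text>.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)

def parse_docs(raw: str) -> list[dict[str, str]]:
    """Return a list of {'docno', 'text'} dicts found in an already-decoded string."""
    return [
        {"docno": m.group("docno").strip(), "text": m.group("text").strip()}
        for m in DOC_RE.finditer(raw)
    ]

# Shortened version of the slide's example record.
sample = (
    "<DOC> <DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO> "
    "<TEXT> मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला </TEXT> </DOC>"
)
docs = parse_docs(sample)
```

A regex is used here rather than a strict XML parser because the sketch does not assume the corpus files have a single root element.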

  6. Topics • 100 topics • Aligned with English topics • XML tags: num (query identifier), title (title of the query), desc (description), narr (additional information about the query) • Cover both local and international issues

  7. Topic example <top><num>1<title>ट्वेंटी-२० विश्वचषकातील भारताचे क्रीडापटुत्व (India’s championship in the Twenty-20 World Cup)<desc>पहिल्या आयसीसी विश्व ट्वेंटी-२० सर्वोत्कृष्ट-विजेता-स्पर्धेतील भारताच्या विजयाचे वृत्त देणारा लेख शोधा.</desc><narr>ट्वेंटी-२० विश्चचषक स्पर्धेमधील पाकिस्तान विरूद्ध भारताचा विजय, ह्या ऐतिहासिक विजया निमित्त खेळाडूंनी केलेले विक्रम त्यांनी मिळविलेली बक्षिसे व पुरस्काराची रक्कम सामनावीराचे तसेच मालिकावीराचे नाव, माजी खेळाडूंनी आणि जगभरातील लोकांनी केलेली प्रशंसा यासंदर्भात आम्ही उचित माहिती मिळवत आहोत.</top>

  8. Tools • Terrier: open-source IR system • Models: TF-IDF (vector space model), DFR-BM25 (probabilistic) • Both models are available in Terrier • Evaluation against relevance-judged documents for 25 queries
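
To make the two model families concrete, the sketch below computes a textbook TF-IDF weight and a textbook Okapi BM25 weight for a single term in a single document. These are the standard classroom formulas, not Terrier's exact implementations (Terrier's DFR-BM25 in particular is a different derivation), and the parameter defaults `k1=1.2` and `b=0.75` are common conventions assumed here.

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Textbook TF-IDF: raw term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)

def bm25(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook Okapi BM25 weight for one term in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization: longer-than-average documents are penalized.
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / norm
```

The practical difference is that TF-IDF grows linearly with term frequency, while BM25 saturates: each additional occurrence of a term adds less to the score than the previous one.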

  9. Lemmatizer vs. stemmer • भारताला – bhaarataalaa – for Bharat • भारताचा – bhaarataachaa – of Bharat • भारतात – bhaarataat – in Bharat • भारतावर – bhaarataavar – on Bharat • Lemmatizer finds the lemma: भारत • Stemmer finds the stem, the longest unchanging word prefix: भारता
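
The stem on this slide can be reproduced by taking the longest prefix shared by the four inflected forms. The sketch below is one way to operationalize "longest unchanging word prefix", assuming comparison by Unicode code point. The helper name `common_prefix` is illustrative, and a real stemmer would not have all inflected forms available up front.

```python
def common_prefix(words: list[str]) -> str:
    """Longest prefix (in Unicode code points) shared by every word."""
    if not words:
        return ""
    shortest = min(words, key=len)
    for i, ch in enumerate(shortest):
        if any(w[i] != ch for w in words):
            return shortest[:i]
    return shortest

forms = ["भारताला", "भारताचा", "भारतात", "भारतावर"]
stem = common_prefix(forms)  # the shared prefix keeps the oblique vowel sign
```

The result is the oblique form भारता rather than the dictionary form भारत, which is why a query for भारत would still miss these documents under stemming alone.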

  10. Marathi suffixes • Suffixes include case markers, postposition markers, etc. • Suffixes can stack: one suffix may attach after another • Example: घरासमोरचादेखिल • घरा-समोर-चा-देखिल • gharaa-samor-chaa-dekhil • house-front-of-also • Root word: घर (ghar) (house)
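
Stacked suffixes can be peeled off from the end of a word one at a time. The sketch below does this with a toy suffix list containing only the three suffixes from the slide's example. The function name `split_suffixes` and the longest-match-first strategy are assumptions for illustration, not the analyzer used in the work.

```python
# Toy suffix list taken from the slide's example; a real list would be far larger.
SUFFIXES = ["देखिल", "समोर", "चा"]

def split_suffixes(word: str, suffixes: list[str] = SUFFIXES) -> list[str]:
    """Peel known suffixes off the end of `word`, trying the longest suffix first."""
    ordered = sorted(suffixes, key=len, reverse=True)
    parts: list[str] = []
    rest = word
    stripped = True
    while stripped:
        stripped = False
        for suf in ordered:
            # Always leave at least one character as the stem.
            if rest.endswith(suf) and len(rest) > len(suf):
                parts.insert(0, suf)
                rest = rest[: -len(suf)]
                stripped = True
                break
    return [rest] + parts

segments = split_suffixes("घरासमोरचादेखिल")
```

Note that stripping leaves the oblique stem घरा as the first segment. Mapping it to the root घर still requires morphological knowledge.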

  11. Morphological analyzer • Uses a Marathi morphological analyzer • Better word matching, e.g. राम versus रामा • Gives all possible roots • Selects the first root, which is the most frequent • Used at both indexing and query-processing time
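
The "first root" policy can be sketched as a lookup applied identically to documents and queries. The table `TOY_ANALYSES` below is a hypothetical stand-in for the analyzer's output, with candidate roots assumed to be ordered most frequent first. The real analyzer's interface is not described in the slides.

```python
# Hypothetical analyzer output: surface form -> candidate roots, most frequent first.
TOY_ANALYSES = {
    "रामा": ["राम"],
    "भारताचा": ["भारत"],
    "भारतात": ["भारत"],
}

def lemma(token: str) -> str:
    """Return the first candidate root, or the token itself if the form is unknown."""
    candidates = TOY_ANALYSES.get(token, [])
    return candidates[0] if candidates else token

def normalize(text: str) -> list[str]:
    """Lemmatize whitespace-separated tokens; used for both documents and queries."""
    return [lemma(tok) for tok in text.split()]

doc_terms = normalize("भारतात रामा")
query_terms = normalize("भारत राम")
```

Because the same normalization runs on both sides, the inflected document terms and the bare query terms end up as identical index terms.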

  12. Lemmatizer Results

  13. Suffixes • Usually ignored • Indexing suffixes: not studied • Index selected suffixes • Suffixes of space and time • वर – var – on • समोर – samor – in front of • मध्ये – madhye – in • नंतर – nantar – after • Created manually: a list of 66 words

  14. Stop-words • Most frequently occurring words • Little discriminatory value • Occur in 80% or more of the documents • Selected stop-words: ती, ते, या, ून, अस, आह, ये, हो, कर, त
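
The 80% criterion above is a document-frequency threshold. The sketch below shows one way to compute it, assuming documents are already tokenized. The function name `frequent_terms` and the five-document toy collection are illustrative only.

```python
from collections import Counter

def frequent_terms(docs: list[list[str]], threshold: float = 0.8) -> set[str]:
    """Terms that occur in at least `threshold` of the documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term for term, n in df.items() if n / len(docs) >= threshold}

# Toy collection: "ते" appears in every document, "घर" in three of five.
docs = [
    ["ते", "घर", "आहे"],
    ["ते", "भारत"],
    ["ते", "घर"],
    ["ते", "राम"],
    ["ते", "घर", "भारत"],
]
stop_words = frequent_terms(docs)
```

Counting each term once per document matters here: a word repeated many times in a few articles should not qualify as a stop-word.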

  15. Results: suffix indexing and stop-words

  16. P-R graph • Precision-recall graph for all four cases
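
The points on a precision-recall curve come from walking down a ranked result list and recording precision each time recall increases. The sketch below assumes a single query with binary relevance judgments. The document identifiers and the function name `precision_recall_points` are hypothetical, and no figures from the presentation are reproduced.

```python
def precision_recall_points(ranked: list[str], relevant: set[str]) -> list[tuple[float, float]]:
    """(recall, precision) at each rank where a relevant document is retrieved."""
    points = []
    hits = 0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

ranked = ["d3", "d7", "d1", "d9", "d4"]  # hypothetical system output, best first
relevant = {"d3", "d1", "d8"}            # hypothetical relevance judgments
points = precision_recall_points(ranked, relevant)
```

Averaging such curves over all judged queries, typically after interpolating to fixed recall levels, gives one curve per system configuration.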

  17. Future work • Morphological analyzer: accuracy is 94.5% and needs to be improved • Heuristic suffix stripping for unknown words • Handle derivational morphology • Handle spelling variations and common spelling mistakes

  18. Acknowledgement • “Cross Lingual Information Access” Project • Maharashtra Times: Times Media Group, http://in.indiatimes.com/aboutus.cms • Sakal: Sakal Media Group, http://www.sakaal.in/

  19. References • Terrier, http://ir.dcs.gla.ac.uk/terrier/ • Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval • Jacques Savoy, Searching strategies for the Bulgarian language • Morphological Analyzer, CFILT

  20. Thank you
