ISM@FIRE MET-2013 Amit Jain Nitish Gupta Sukomal Pal Indian School of Mines, Dhanbad
Contents
• Introduction to Morpheme
• ISMStemmer
• Result of MET at FIRE-2013
• Problems in ISMStemmer
• Conclusion
Morpheme
• In linguistics, a morpheme is the smallest grammatical unit in a language.
• Every word comprises one or more morphemes.
• Morphological analysis is the process of segmenting a word into its component morphemes.
• Example: "unbreakable" comprises three morphemes: un- (a morpheme signifying "not"), -break- (the stem, a free morpheme), and -able (a morpheme signifying "can be done").
Stemmer
• Attempts to reduce the variants of a word to their common stem or root form
• Example: education, educating, and educative all reduce to educat
Reasons:
• Search engines are based on string matching
• The similarity of a document with respect to a query is mostly determined by exact term overlap
• Vocabulary mismatch arises because natural language documents use different forms of a word for the same content
Why stemming? (contd…)
Example: suppose we have to search for some information about "education".
Query: education
doc 1: For children education is very important
doc 2: What is the reason we educate children
doc 3: Government aims to make people educated
doc 4: Educating young minds is the job of a teacher
Why stemming? (contd…)
By stemming:
Original words: education, educate
Stemmed word: educat
Query: education
doc 1: For children education is very important
doc 2: What is the reason we educate children
doc 3: Government aims to make people educated
doc 4: Educating young minds is the job of a teacher
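The effect of stemming on the example above can be sketched in a few lines of Python. This is only an illustration: the suffix list `TOY_SUFFIXES`, the helper `toy_stem`, and the assignment of sentences to doc 1 through doc 4 are assumptions made for the sketch, not the suffixes or rules that ISMStemmer actually learns.

```python
# Hypothetical suffix list, chosen only so the slide's example works.
TOY_SUFFIXES = ("ing", "ion", "ed", "e")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching toy suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

DOCS = {
    "doc 1": "For children education is very important",
    "doc 2": "What is the reason we educate children",
    "doc 3": "Government aims to make people educated",
    "doc 4": "Educating young minds is the job of a teacher",
}

def exact_match(query, docs):
    """Documents that contain the query term exactly (case-insensitive)."""
    q = query.lower()
    return sorted(name for name, text in docs.items() if q in text.lower().split())

def stemmed_match(query, docs):
    """Documents that contain a term sharing the query's stem."""
    q = toy_stem(query)
    return sorted(
        name for name, text in docs.items()
        if q in {toy_stem(w) for w in text.split()}
    )

print(exact_match("education", DOCS))    # only the document with the exact term
print(stemmed_match("education", DOCS))  # all documents sharing the stem "educat"
```

With exact matching only one document is retrieved for the query; after stemming both the query and the documents, all four share the stem educat and are retrieved.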
ISMStemmer
• Approaches for stemming
 - Language-based approach
 - Statistical approach
• ISMStemmer is statistical
 - Based on suffix extraction
 - Suffixes are identified by applying the Apriori algorithm (Agrawal and Srikant, 1994)
ISMStemmer algorithm
Single-column refined file → Generate valid suffixes (Apriori algorithm) → Strip off valid suffixes to get stems

Word           Stem
aborning       aborn
absolution     absolu
absorption     absorp
abuilding      abuild
acquisition    acquisi
activation     activa
added          add
addition       add
admiration     admira
admitted       admitt
admitting      admitt
agreed         agre
agreeing       agree
allotted       allott
allotting      allott
ambling        ambl
angling        angl
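The final step, stripping valid suffixes to obtain stems, might look like the sketch below. The slides do not state the exact stripping rule, so the longest-match policy, the `min_stem_len` guard, and the function name `strip_suffix` are assumptions; the output can differ from the slide for some words (for example, addition).

```python
def strip_suffix(word, valid_suffixes, min_stem_len=2):
    """Remove the longest valid suffix, keeping at least min_stem_len characters.

    Longest-match stripping is an assumption; the slides do not specify the rule.
    """
    for suffix in sorted(valid_suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# A few of the valid suffixes named on the slides.
valid_suffixes = {"ing", "ed", "tion"}
words = ["aborning", "absolution", "added", "agreed", "agreeing", "ambling"]
stems = {w: strip_suffix(w, valid_suffixes) for w in words}

for word, stem in stems.items():
    print(f"{word:12s} {stem}")
```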
Suffix Generation
Input: single-column sorted refined file
aborning, absolution, absorption, abuilding, acquisition, activation, added, addition, admiration, admitted, admitting, agreed, agreeing, allotted, allotting, ambling, angling
• Reverse the unique sorted word file: dedda, dettolla, …, noitidda, noitulosba, …, gnidliuba, gnieerga, gnilgna, …
• Generate frequent suffixes (of length 1 character, 2 characters, and so on).
• Find valid suffixes whose frequency is above a pre-decided threshold value α.
Valid suffixes: ing, ed, tion, …, er, ment
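A level-wise, Apriori-style count of suffixes can be sketched as follows. It relies on the Apriori property that a suffix of length k+1 can only be frequent if its last k characters are frequent. The function name `frequent_suffixes`, the `max_len` cap, and the use of "at least α" as the cut-off are assumptions for illustration; the slides do not give these details.

```python
from collections import Counter

def frequent_suffixes(words, alpha, max_len=5):
    """Return {suffix: frequency} for suffixes occurring in at least alpha words."""
    # Reverse the unique words so that suffixes become prefixes, as on the slide.
    reversed_words = sorted({w[::-1] for w in words})
    frequent = {}
    survivors = None  # reversed suffixes that were frequent at the previous length
    for k in range(1, max_len + 1):
        counts = Counter(
            rw[:k]
            for rw in reversed_words
            # keep a non-empty stem, and extend only frequent shorter suffixes
            if len(rw) > k and (survivors is None or rw[: k - 1] in survivors)
        )
        survivors = {p for p, c in counts.items() if c >= alpha}
        if not survivors:
            break
        for p in survivors:
            frequent[p[::-1]] = counts[p]
    return frequent

words = [
    "aborning", "absolution", "absorption", "abuilding", "acquisition",
    "activation", "added", "addition", "admiration", "admitted", "admitting",
    "agreed", "agreeing", "allotted", "allotting", "ambling", "angling",
]
result = frequent_suffixes(words, alpha=3)
for suffix in sorted(result, key=len):
    print(suffix, result[suffix])
```

On this small word list the sketch also reports short suffixes such as single characters, so a further filter would be needed to arrive at a list like ing, ed, tion; how ISMStemmer makes that selection is not described on the slides.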
Evaluation of ISMStemmer
• For evaluation, ISMStemmer participated in the Morpheme Extraction Task (MET) of FIRE-2013
• ISMStemmer was submitted to the task
• Evaluated at IR Labs, DAIICT, Gujarat
• Tested on 5 languages of South Asian origin
• Gave good results for 3 of these languages
Results (Linguistic Evaluation)
• Tamil
 - Precision: 80.22%; non-affixes: 80.22%
 - Recall: 18.86%; non-affixes: 18.86%
 - F-measure: 30.54%; non-affixes: 30.54%
• Bengali
 - Precision: 60.64%; non-affixes: 60.64%
 - Recall: 32.15%; non-affixes: 32.15%
 - F-measure: 42.02%; non-affixes: 42.02%
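The reported F-measure values are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition, which the slides do not state explicitly; the function name `f_measure` is chosen for this sketch.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean); assumed to be the metric on the slide."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall values (in percent) taken from the results slide.
tamil_f = f_measure(80.22, 18.86)
bengali_f = f_measure(60.64, 32.15)

print(f"Tamil F-measure:   {tamil_f:.2f}%")
print(f"Bengali F-measure: {bengali_f:.2f}%")
```

Because the harmonic mean is dominated by the smaller value, the low recall pulls the F-measure well below the precision in both languages.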
Post-hoc Analysis
• Overstemming: the groups below should reduce to three distinct stems, but due to overstemming all of them are reduced to acce
 - accent, accentual, accentuate → accent
 - accept, acceptant, acceptor → accept
 - access, accessible, accession → access
• Stemming of named entities
 - Beijing → Beij
Future plan
• Need to consider the prefix as well
 - Clustering based on prefix
• Identification of NEs (use of NERs)
• ….
THANK YOU!
Questions?