Words and Terms: Stopwords and Stemmers
Last Week's Key Points • The web is huge • The web has a high rate of change • Interactive searching must be sped up by preparing an index in advance • Lexical analysis and spam filtering are needed
Worth Storing Every Word? • Is it worth recording every page containing “the”?
Stop Words • Words which are too frequent to differentiate between pages, and so are not worth storing in the index, are called “stop words” • Good list at http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
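One way to see which words are "too frequent to differentiate between pages" is to count how many pages each word appears on. The sketch below is in Python rather than the Perl used on the slides; the function name `frequent_terms`, the 0.8 threshold and the three sample pages are all illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def frequent_terms(pages, threshold=0.8):
    """Return words that appear on more than `threshold` of the pages."""
    doc_freq = Counter()
    for text in pages:
        # Count each word once per page (document frequency, not term frequency).
        doc_freq.update(set(text.lower().split()))
    return {w for w, n in doc_freq.items() if n / len(pages) > threshold}

# Hypothetical miniature "web" of three pages.
pages = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]
print(sorted(frequent_terms(pages)))  # ['the']
```

Here "the" occurs on every page, so it cannot help rank one page above another, while "cat" (two pages out of three) stays below the threshold.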
PERL Example for Stopwords ….. foreach my $stopword (@stopwords) { $line =~ s/\b\Q$stopword\E\b//gi; }
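A plain substring substitution would also strip "the" out of "other" or "theatre", so stopword removal is usually anchored on word boundaries. The following is a minimal Python sketch of that idea (Python rather than the slide's Perl); the four-word `STOPWORDS` list and the function name `remove_stopwords` are hypothetical.

```python
import re

STOPWORDS = ["the", "a", "of", "and"]  # hypothetical tiny list for illustration

def remove_stopwords(line, stopwords=STOPWORDS):
    # \b anchors each alternative to whole words; re.escape guards against
    # stopwords that happen to contain regex metacharacters.
    pattern = r"\b(?:" + "|".join(map(re.escape, stopwords)) + r")\b"
    stripped = re.sub(pattern, "", line, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed words.
    return " ".join(stripped.split())

print(remove_stopwords("The other end of the theatre"))  # other end theatre
```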
Stemming • Reducing morphological variants of words to a standard underlying form • e.g. calculate, calculates, calculations to calculat- • improves recall at the expense of precision
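The recall/precision trade-off can be made concrete with the standard definitions: precision is the fraction of retrieved pages that are relevant, and recall is the fraction of relevant pages that are retrieved. The Python sketch below uses invented document identifiers and invented result sets purely to illustrate the arithmetic; it does not report any measured result.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four relevant pages exist for the query.
relevant = {"d1", "d2", "d3", "d4"}
exact = {"d1", "d2"}                      # assumed result of exact-word matching
stemmed = {"d1", "d2", "d3", "d5", "d6"}  # assumed result after stemming

print(precision_recall(exact, relevant))    # (1.0, 0.5)
print(precision_recall(stemmed, relevant))  # (0.6, 0.75)
```

In this made-up case stemming finds one more relevant page (recall rises from 0.5 to 0.75) but also pulls in two irrelevant ones (precision falls from 1.0 to 0.6).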
Simple PERL Stemmer $line =~ s/s / /g; $line =~ s/ed / /g;
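The same two-rule suffix stripper can be sketched in Python (used here instead of the slide's Perl). This version assumes `\b` in place of the trailing space so that the last word on a line is also handled; the function name `simple_stem` is illustrative.

```python
import re

def simple_stem(line):
    # Same order as the slide: strip a trailing "s" first, then a trailing "ed".
    line = re.sub(r"s\b", "", line)
    return re.sub(r"ed\b", "", line)

print(simple_stem("the dogs walked across the grass"))
# the dog walk acros the gras
print(simple_stem("bed"))  # b
```

The output shows why such a stemmer is only a starting point: "across", "grass" and "bed" are damaged because the rules look at the letters alone and know nothing about the rest of the word.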
Porter Stemming Algorithm • Well-known, effective stemmer which does not use a dictionary • uses measure m • every word has the form [C](VC)^m[V] • where • C is a sequence of consonants • V is a sequence of vowels • the parts in square brackets are optional
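The measure m counts how many vowel-sequence/consonant-sequence (VC) pairs a stem contains. A minimal Python sketch follows (Python rather than the slides' Perl); it assumes lowercase input and follows Porter's convention that "y" counts as a vowel when it comes after a consonant. The function names are illustrative.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant (as in "ivy").
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC pairs in a lowercase stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m

for w in ["tree", "trouble", "private"]:
    print(w, measure(w))
# tree 0
# trouble 1
# private 2
```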
Porter Algorithm Step 1 • -sses to -ss • -ing to - (removed, only if the stem contains a vowel) • -at to -ate • -y to -i (only if the stem contains a vowel)
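A few Step 1 rules of the Porter algorithm (-sses to -ss, -ing removed when the stem contains a vowel, -at restored to -ate afterwards, -y to -i when the stem contains a vowel) can be sketched in Python as below. This is a deliberately partial illustration, not the full Step 1: it covers only these four rules, and as a simplifying assumption it treats only a, e, i, o, u as vowels.

```python
VOWELS = set("aeiou")  # simplification: "y" is never treated as a vowel here

def has_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def step1(word):
    if word.endswith("sses"):
        return word[:-2]                      # caresses -> caress
    if word.endswith("ing") and has_vowel(word[:-3]):
        word = word[:-3]                      # motoring -> motor
        if word.endswith("at"):
            word += "e"                       # conflat -> conflate
        return word
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"                # happy -> happi
    return word

for w in ["caresses", "conflating", "sing", "happy", "sky"]:
    print(w, "->", step1(w))
```

"sing" and "sky" are left alone because what would remain ("s", "sk") contains no vowel, which is exactly what the vowel condition is there to prevent.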
Porter Algorithm Step 2-4 • -aliti to -al (measure > 0) • -icate to -ic (measure > 0) • -able to - (removed, measure > 1)
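In Steps 2 to 4 a suffix is only rewritten when the measure of the remaining stem is large enough. The Python sketch below shows that gating for three example rules (-aliti, -icate, -able). It is an assumption-laden simplification: only a, e, i, o, u count as vowels, the rule table holds just these three entries, and only the first matching rule is tried, whereas the full algorithm runs the steps in sequence with many more suffixes.

```python
import re

def measure(stem):
    # Simplification: only a, e, i, o, u count as vowels in this sketch.
    pattern = "".join("V" if ch in "aeiou" else "C" for ch in stem)
    return len(re.findall(r"V+C+", pattern))

# (suffix, replacement, value the stem's measure must exceed)
RULES = [
    ("aliti", "al", 0),  # step 2
    ("icate", "ic", 0),  # step 3
    ("able", "", 1),     # step 4
]

def apply_rule(word):
    for suffix, repl, min_m in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Rewrite only when enough of the word is left over.
            return stem + repl if measure(stem) > min_m else word
    return word

for w in ["formaliti", "triplicate", "adjustable", "table"]:
    print(w, "->", apply_rule(w))
# formaliti -> formal
# triplicate -> triplic
# adjustable -> adjust
# table -> table
```

The measure condition is what keeps short words such as "table" intact: stripping -able would leave only "t", whose measure is 0.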
Official PERL Implementation http://www.tartarus.org/~martin/PorterStemmer/index.html
Dictionary-Based Stemmers • Dictionary of stems • cf. vector-based methods • Dictionary of words • effective handling of irregular forms • Proper Name/Controlled Vocabulary Lists • Equivalent Term/Thesauri
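A dictionary of words handles irregular forms by looking the word up instead of stripping suffixes. A minimal Python sketch follows; the three dictionary entries and the function name `dictionary_stem` are hypothetical, and a real system would use a far larger word list.

```python
# Hypothetical entries mapping irregular forms to a base form.
IRREGULAR = {"bled": "bleed", "mice": "mouse", "went": "go"}

def dictionary_stem(word, dictionary=IRREGULAR):
    """Return the dictionary's base form, or the word itself if it is not listed."""
    word = word.lower()
    return dictionary.get(word, word)

print(dictionary_stem("Bled"))   # bleed
print(dictionary_stem("plate"))  # plate
```

Words missing from the dictionary pass through unchanged, so in practice such a lookup is often combined with a rule-based stemmer as a fallback.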
Problems with Stemming • Always worsens precision in the hope of improving recall • Causes (sometimes odd) misretrieval • “bled” vs “bleeding” • incorrect term conflation: “plastered” to “plaster” • Do we really want to improve recall on the web?
Key Points • Stop words • Why we need them • Porter Stemmer • Easy way to improve matches • Other Forms of Stemmer • Dictionaries and Thesauri