Words and Terms

lloyd

Presentation Transcript


  1. Words and Terms: Stopwords and Stemmers

  2. Last Week's Key Points • The web is huge • The web has a high rate of change • The need to speed interactive searching by preparing an index in advance • The need for lexical analysis and spam filtering

  3. Worth Storing Every Word? • Is it worth recording every page containing “the”?

  4. Stop Words • Words that are too frequent to differentiate between pages, and so are not worth storing in the index, are called “stop words” • Good list at http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

  5. PERL Example for Stopwords
     …..
     # Remove each stop word wherever it occurs as a whole word, ignoring case
     foreach $stopword (@stopwords) {
         $line =~ s/\b\Q$stopword\E\b//gi;
     }
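
The loop above runs one substitution per stop word. A common alternative is to split the line into words and test each one against a hash. The sketch below assumes a tiny, made-up stop list and a hypothetical helper name, remove_stopwords; neither comes from the slides.

```perl
use strict;
use warnings;

# Hypothetical stop list for illustration; a real list would be much longer.
my @stopwords = qw(the a an of and to in is);

# A hash gives a fast membership test for each word.
my %is_stopword = map { $_ => 1 } @stopwords;

# Return the lower-cased words of $line that are not stop words.
sub remove_stopwords {
    my ($line) = @_;
    my @words = grep { length $_ } split /\W+/, lc $line;
    return grep { !$is_stopword{$_} } @words;
}

# Whole-word matching keeps "other" and "theatre" even though both contain "the".
print join(' ', remove_stopwords('The history of the other theatre')), "\n";
# prints: history other theatre
```

Because whole words are compared, a stop word is never cut out of the middle of a longer word.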

  6. Stemming • Reducing morphological variants of words to a standard underlying form • e.g. calculate, calculates, calculations to calculat- • improves recall at the expense of precision

  7. Simple PERL Stemmer
     $line =~ s/s / /g;    # strip a final "s" from any word followed by a space
     $line =~ s/ed / /g;   # strip a final "ed" in the same way
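
To see what such crude suffix stripping does in practice, here is a small runnable variant. It is a sketch, not the slide's exact code: it anchors on word boundaries instead of a trailing space so that the last word on a line is also handled, and the name simple_stem is hypothetical.

```perl
use strict;
use warnings;

# Strip a final "s" and then a final "ed" from every word in the line.
# The lookbehind (?<=\w) requires at least one letter before the suffix.
sub simple_stem {
    my ($line) = @_;
    $line =~ s/(?<=\w)s\b//g;
    $line =~ s/(?<=\w)ed\b//g;
    return $line;
}

print simple_stem('pages indexed'), "\n";   # prints: page index
print simple_stem('glass bus red'), "\n";   # prints: glas bu r
```

The second line shows the weakness of the approach: words that merely end in "s" or "ed" are damaged as well.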

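  8. Porter Stemming Algorithm • Well-known, effective stemmer that does not use a dictionary • Uses the measure m of a word written in the form [C](VC)^m[V], where • C is a sequence of one or more consonants • V is a sequence of one or more vowels • square brackets mark optional parts, and m counts the repetitions of VC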
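
The measure can be computed by mapping each letter to C or V, collapsing runs, and counting the VC pairs. The following is a minimal sketch under that reading; the function name measure is an assumption, and "y" is treated as a vowel only when it follows a consonant, as in Porter's definition.

```perl
use strict;
use warnings;

# Porter's measure m: the number of VC pairs in the form [C](VC)^m[V].
sub measure {
    my ($stem) = @_;
    my $pattern = '';
    my $prev_is_consonant = 0;
    for my $ch (split //, lc $stem) {
        # "y" is a vowel only when the previous letter is a consonant.
        my $is_vowel = $ch =~ /[aeiou]/ || ($ch eq 'y' && $prev_is_consonant);
        $pattern .= $is_vowel ? 'V' : 'C';
        $prev_is_consonant = !$is_vowel;
    }
    $pattern =~ tr/VC//s;               # squeeze runs, e.g. "CCVVC" becomes "CVC"
    my $m = () = $pattern =~ /VC/g;     # count the VC pairs
    return $m;
}

printf "%-8s m=%d\n", $_, measure($_) for qw(tree trouble private);
# prints m=0 for tree, m=1 for trouble, m=2 for private
```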

  9. Porter Algorithm Step 1
     -sses → -ss
     -ing  → -       (only if the stem contains a vowel)
     -at   → -ate    (applied after -ing or -ed has been removed)
     -y    → -i      (only if the stem contains a vowel)
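
The four Step 1 rules on this slide can be sketched as follows. This is only an illustration of those four rules, not the full Step 1 of the algorithm; the names has_vowel and step1_sketch are hypothetical, and the vowel test is simplified.

```perl
use strict;
use warnings;

# Simplified test: the stem has a, e, i, o, u, or a "y" after a consonant.
sub has_vowel {
    my ($stem) = @_;
    return ($stem =~ /[aeiou]/ || $stem =~ /[^aeiouy]y/) ? 1 : 0;
}

# Apply only the rules shown on the slide, in order.
sub step1_sketch {
    my ($word) = @_;
    if ($word =~ /^(.*)sses$/) {           # -sses becomes -ss
        $word = $1 . 'ss';
    }
    if ($word =~ /^(.*)ing$/) {            # -ing is removed if the stem has a vowel
        my $stem = $1;
        if (has_vowel($stem)) {
            $word = $stem;
            $word .= 'e' if $word =~ /at$/;   # -at becomes -ate after removal
        }
    }
    if ($word =~ /^(.*)y$/) {              # -y becomes -i if the stem has a vowel
        my $stem = $1;
        $word = $stem . 'i' if has_vowel($stem);
    }
    return $word;
}

print join(' ', map { step1_sketch($_) } qw(caresses motoring sing conflating happy sky)), "\n";
# prints: caress motor sing conflate happi sky
```

"sing" and "sky" are left alone because their stems ("s" and "sk") contain no vowel.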

  10. Porter Algorithm Step 2-4
     -aliti → -al    (measure > 0)
     -icate → -ic    (measure > 0)
     -able  → -      (measure > 1)
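
The measure conditions decide whether a suffix rule fires. The sketch below applies only the three rules listed on this slide; the real steps contain many more. The names measure and apply_rules are assumptions, and this compact measure treats "y" as a consonant throughout, which is a simplification.

```perl
use strict;
use warnings;

# Compact measure: number of VC pairs in [C](VC)^m[V].
# Simplification (assumption): "y" always counts as a consonant here.
sub measure {
    my ($stem) = @_;
    (my $pattern = lc $stem) =~ tr/aeiou/V/;   # vowels become V
    $pattern =~ s/[^V]/C/g;                    # everything else becomes C
    $pattern =~ tr/VC//s;                      # squeeze runs
    my $m = () = $pattern =~ /VC/g;
    return $m;
}

# [ suffix, replacement, measure that the stem must exceed ]
my @rules = (
    [ 'aliti', 'al', 0 ],   # from Step 2
    [ 'icate', 'ic', 0 ],   # from Step 3
    [ 'able',  '',   1 ],   # from Step 4
);

sub apply_rules {
    my ($word) = @_;
    for my $rule (@rules) {
        my ($suffix, $replacement, $min) = @$rule;
        if ($word =~ /^(.+)\Q$suffix\E$/) {
            my $stem = $1;
            $word = $stem . $replacement if measure($stem) > $min;
        }
    }
    return $word;
}

print join(' ', map { apply_rules($_) } qw(formaliti triplicate adjustable table)), "\n";
# prints: formal triplic adjust table
```

"table" is unchanged because its stem "t" has measure 0. The input "formaliti" is what Step 1 produces from "formality".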

  11. Official PERL Implementation http://www.tartarus.org/~martin/PorterStemmer/index.html

  12. Dictionary-Based Stemmers • Dictionary of stems (cf. vector-based methods) • Dictionary of words (effective handling of irregular forms) • Proper name/controlled vocabulary lists • Equivalent term lists/thesauri
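
A dictionary of words can be as simple as a lookup table from each known form to its stem. The sketch below uses a handful of made-up entries and a hypothetical function name, dictionary_stem; unknown words are returned unchanged.

```perl
use strict;
use warnings;

# Hypothetical dictionary mapping word forms to a stem.
my %stem_of = (
    bled     => 'bleed',
    bleeding => 'bleed',
    mice     => 'mouse',
    went     => 'go',
);

# Look the word up; fall back to the lower-cased word itself.
sub dictionary_stem {
    my ($word) = @_;
    my $key = lc $word;
    return exists $stem_of{$key} ? $stem_of{$key} : $key;
}

print join(' ', map { dictionary_stem($_) } qw(Bled bleeding mice index)), "\n";
# prints: bleed bleed mouse index
```

Irregular pairs such as "bled" and "bleeding" map to the same stem, which suffix rules alone cannot achieve.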

  13. Problems with Stemming • Always worsens precision in the hope of improving recall • Causes (sometimes odd) misretrieval • “bled” vs “bleeding” • incorrect term conflation, e.g. “plastered” to “plaster” • Do we really want to improve recall on the web?
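
The trade-off can be made concrete with the standard definitions: precision is the fraction of retrieved pages that are relevant, and recall is the fraction of relevant pages that are retrieved. The counts below are invented purely for illustration and are not measurements; precision_recall is a hypothetical helper.

```perl
use strict;
use warnings;

# precision = relevant retrieved / retrieved
# recall    = relevant retrieved / all relevant
sub precision_recall {
    my ($relevant_retrieved, $retrieved, $relevant_total) = @_;
    return ($relevant_retrieved / $retrieved, $relevant_retrieved / $relevant_total);
}

# Hypothetical counts: 8 relevant pages exist in the collection.
my ($p_exact, $r_exact) = precision_recall(4, 4, 8);    # exact-word matching
my ($p_stem,  $r_stem)  = precision_recall(6, 10, 8);   # after stemming

printf "exact:   precision %.2f, recall %.2f\n", $p_exact, $r_exact;
printf "stemmed: precision %.2f, recall %.2f\n", $p_stem,  $r_stem;
# prints precision 1.00, recall 0.50 and then precision 0.60, recall 0.75
```

In this made-up case stemming finds two more relevant pages but also pulls in four irrelevant ones.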

  14. Key Points • Stop words: why we need them • Porter Stemmer: an easy way to improve matches • Other forms of stemmer: dictionaries and thesauri
