Learn about the use of stop lists, stemming algorithms, and manual vs automatic indexing in information retrieval. Discover the advantages and disadvantages of each approach.
CS533 Information Retrieval Dr. Michal Cutler Lecture #4 February 3, 1999
This Class • Automatic indexing • Stop lists • How stemming is used in IR • Stemming algorithms • Frakes: Chapter 8
Disadvantages of Manual Indexing • Human effort • Controlled vocabulary per collection • Subjective (the index terms chosen by different indexers intersect by only about 40%)
Advantages of Manual Indexing • Human experts who use indexing aids such as “scope notes” describing allowable vocabulary and usage achieve good indexing uniformity
Which is better? • Salton - claims result of automatic comparable to manual • Blair - claims manual better • Often, manual indexing not a practical option
Automatic indexing • At this stage single words • Consider: • the usage of stop lists • which tokens to include • stemming algorithms
Stop lists • A stop list is a list of terms which are not included in an index • Inverted lists may be saved and used for identifying phrases
Why use stop words? • Luhn (1957) observed that many of the most frequently occurring words are worthless as index terms • The 10 most frequently occurring terms account for 20-30% of the word occurrences • Eliminating stop words saves index space and computation time
Stop lists • Traditionally most frequently occurring English words. • Among the top 200 are words such as “time” “war” “home” etc.
Stop list for collection • “computer, machine, program, source, language” in a computer science collection
Stop lists • Commercial systems use only a few stop words • ORBIT uses only 8, “and, an, by, from, of, the, with” • Lists of stop words appear in the literature (Frakes)
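A minimal sketch of stop-word removal at indexing time. The stop list holds only the ORBIT words quoted on the slide; the names `ORBIT_STOP_WORDS` and `remove_stop_words` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical stop list: only the ORBIT words quoted on the slide.
ORBIT_STOP_WORDS = frozenset({"and", "an", "by", "from", "of", "the", "with"})


def remove_stop_words(tokens, stop_words=ORBIT_STOP_WORDS):
    """Return the tokens that are not on the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Whitespace splitting is an assumption made to keep the example short.
tokens = "The design of an index with stop lists".split()
kept = remove_stop_words(tokens)
print(kept)
```

A frozen set gives constant-time membership tests, so the filter costs one lookup per token.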
Which tokens to include? • English words (not stop words) • Include numbers? • How to deal with hyphens? • Case sensitive?
Include numbers? • Numbers - not good discriminators • Important in some contexts • Usually systems allow tokens to include digits but not to begin with one • So B6 (vitamin) but not 6
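The "digits allowed, but not as the first character" convention can be sketched with a regular expression. This is one possible reading of the rule, restricted to ASCII letters; `TOKEN_RE` and `tokenize` are hypothetical names.

```python
import re

# A token starts with a letter and may continue with letters or digits.
# The word boundaries keep the pattern from matching the tail of "6pack".
TOKEN_RE = re.compile(r"\b[A-Za-z][A-Za-z0-9]*\b")


def tokenize(text):
    """Return tokens that begin with a letter; bare numbers are dropped."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Vitamin B6 and 6 apples")
print(tokens)
```

Under this pattern “B6” survives as a token while the standalone “6” does not.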
Include hyphens? • Break into distinct terms or • Single term with hyphen • Chemical Abstracts Service - hyphenated words kept as a single term • LEXIS/NEXIS - break apart into two terms if they occur in a title or abstract
Punctuation and case • Punctuation is sometimes important for example “command.com” “OS/2” • Case - convert to lower case or not
Commercial systems • Commercial systems prefer to enhance recall • Usually case insensitive • Index numbers • Very few stop words
Recognizing names • People’s names - “Bill Clinton” • Company names - IBM & big blue • Places • New York City, NYC, the big apple
Stemming algorithms • Affix removing stemmers • Dictionary lookup stemmers • n-gram stemmers • Successor variety stemmers
Stemming algorithms • Conflation methods: manual or automatic • Automatic methods: affix removal, successor variety, dictionary lookup, n-grams • Affix removal: longest match, simple removal
Stemming • Conflation - combining non-identical words which refer to the same principal concept • Done manually or automatically • Automatic algorithms are called stemmers
Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce size of index files by combining term variants into single index term
Stemming during indexing • Index terms are stemmed words • Saves dictionary space • One inverted index list for all variants • Saves inverted index file space when position information in the document is not included • Query terms are also stemmed
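A small sketch of stemming at indexing time: every variant lands in one inverted list, and the query term is stemmed the same way before lookup. `toy_stem` is a hypothetical stand-in that only drops a trailing "s"; it is not one of the stemmers covered in this lecture, and no position information is stored.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercase, drop one trailing 's'."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_index(docs, stem=toy_stem):
    """Map each stem to the set of document ids containing any variant."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, term, stem=toy_stem):
    """Stem the query term, then return the matching document ids."""
    return sorted(index.get(stem(term), set()))


docs = {1: "system users", 2: "one user", 3: "systems"}
index = build_index(docs)
print(search(index, "users"), search(index, "system"))
```

The four distinct words "system", "systems", "user" and "users" collapse to two dictionary entries, which is the dictionary saving the slide refers to.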
Index is not stemmed • In this case the index contains words • No compression is achieved • No information is lost • Enables wild card searches • Enables long phrase searches when position information is included
Providing term variants during search • A stemming algorithm generates term variants • Term variants added to query automatically (query expansion) or • The user is provided with term variants and decides which ones to include
Example • A user searching for “system users” is provided in the CATALOG system with term variants for “users” and “system”
Example (cont.) Search term: users
    Term      Occurrences
 1. user      15
 2. users     1
 3. used      3
 4. using     2
• User selects variants to include in query
Stemmer correctness • A stemmer can be incorrect by either • Under-stemming or by • Over-stemming • Over-stemming can reduce precision • Under-stemming can affect recall
Over-stemming • Terms with different meanings are conflated • “considerate”, and “consider” and “consideration” should not be stemmed to “con”, with “contra”, “contact”, etc.
Under-Stemming • Prevents related terms from being conflated • Under-stemming “consideration” to “considerat” prevents conflating it with “consider”
Evaluating stemmers • In information retrieval stemmers are evaluated by their: • effect on retrieval and • compression rate, and • not linguistic correctness
Evaluating stemmers • Studies have shown that stemming has a positive effect on retrieval. • Performance of algorithms comparable • Results vary between test collections
Affix removal stemmers • Remove • suffixes and/or • prefixes from terms • leaving a stem
Affix removal stemmers • In English, stemmers are suffix removers • In other languages, for example Hebrew, both prefixes and suffixes are removed • Keshehalachnu --> halach • Nelechna --> halach
Affix removal stemmers • Most affix removal stemmers in use are: • iterative - for example, “consideration” stemmed first to “considerat” then to “consider” • longest match stemmers using a set of stemming rules arranged on a ‘longest match’ principle (Lovins)
A simple stemmer • Harman experimented • concluded minimal stemming helpful • Her simple stemmer changes: • Plural to singular • Third person to first person
A simple stemmer (Harman)
if word ends in “ies” but not “eies” or “aies” then “ies” -> “y”;
else if word ends in “es” but not “aes”, “ees” or “oes” then “es” -> “e”;
else if word ends in “s” but not “us” or “ss” then “s” -> NULL
endif
A simple stemmer • Algorithm changes: • “skies” to “sky”, • “retrieves” to “retrieve”, and • “doors” to “door” but not “corpus” or “wellness” • “dies” to “dy”?
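The three rules can be sketched in Python as follows. This follows the if/else chain literally as the slide gives it: when a rule's exception blocks it, the next rule is still tried. The function name `s_stem` is hypothetical.

```python
def s_stem(word):
    """Apply the first matching plural-removal rule from the slide."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]            # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # "s" -> NULL
    return word


for w in ["skies", "retrieves", "doors", "corpus", "wellness", "dies"]:
    print(w, "->", s_stem(w))
```

Running it on the slide's examples also reproduces the questionable case: “dies” becomes “dy”.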
The Paice/Husk stemmer • Uses a table of rules grouped into sections • Section for each last letter of a suffix (rules for forms ending in a, then b, etc.) • A form is any word or part of a word considered for stemming
The Paice/Husk stemmer • Each rule specifies a deletion or a replacement of an ending • The order of the rules in each section is important. • Rules tried until one can be applied, and the current form is updated
Rule structure • Each rule contains 5 parts (2 are optional): • An ending (one or more characters in reverse order) • An optional “intact” flag “*” denoting form not yet stemmed
Rule structure • A digit (>=0) specifying the number of characters to remove • An optional string to append (after removal) • A continuation symbol: “>” denotes that stemming should continue, “.” terminates the stemming process
Examples of rules • “sei3y>“ • if form ends in “ies” then replace the last 3 letters by y and continue stemming ( “tries” becomes “try”)
Examples of rules • “mu*2.” • if form ends with “um” and word is intact remove 2 last letters and terminate stemming. • “maximum” is stemmed to “maxim”, but “presum” from “presumably” remains unchanged
Examples of rules • “ylp0.” - if word terminates in “ply” terminate. Next rule “yl2>“ does not remove “ly” from “multiply” • “nois4j>“ causes “sion” to be replaced by “j”. • “j” acts as dummy ending • “provision” converted to “provij” and then to “provid”
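To make the five-part rule format concrete, here is a sketch of a parser for rule strings such as “sei3y>”. It assumes lowercase ASCII endings and a single removal digit, as in the examples above; `Rule`, `RULE_RE` and `parse_rule` are hypothetical names, not part of any published implementation.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    ending: str     # ending in normal reading order ("ies", not "sei")
    intact: bool    # True if the rule applies only to unstemmed forms
    remove: int     # number of characters to delete from the end
    append: str     # string to append after deletion (may be empty)
    cont: bool      # True for ">", False for "."


# reversed ending, optional "*", one digit, optional append string, ">" or "."
RULE_RE = re.compile(r"([a-z]+)(\*?)(\d)([a-z]*)([>.])")


def parse_rule(text):
    """Split a rule string into its five parts."""
    m = RULE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid rule: {text!r}")
    rev, star, digit, append, symbol = m.groups()
    return Rule(rev[::-1], star == "*", int(digit), append, symbol == ">")


for text in ["sei3y>", "mu*2.", "ylp0.", "nois4j>"]:
    print(text, parse_rule(text))
```

Reversing the stored ending once at parse time lets later matching use an ordinary suffix test.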
The Algorithm
terminate := false;
while not terminate and there is a section for the last letter of form do
    use the last letter of form to select a section in the table of rules;
    applied := false;
    while more rules in section and not applied do
        {Check the applicability of the current rule}
        if form's ending does not match the current rule's ending then try next rule;
        if the rule's intact flag is on and the form is not intact then try next rule;
        if acceptability conditions are not satisfied then try next rule;
        apply rule to form (delete and append as rule specifies);
        applied := true;
        if rule ends in “.” then terminate := true;
    endwhile
    if not applied then terminate := true;
endwhile
Acceptability conditions • Attempt to prevent over-stemming • Without them “rent”, “rant”, “rice”, “rate”, “ration” “river” reduce to “r”. • There are 2 rules:
Acceptability conditions • If a form starts with a vowel then at least 2 letters must remain after stemming (owed/owing->ow but not ear->e) • If a form starts with a consonant then at least 3 letters must remain after stemming, and at least one of them must be a vowel or “y” (saying->say, crying->cry, but not string->str, meant->me, or cement->ce)
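The two conditions translate directly into a predicate on the stem that would remain. A minimal sketch, assuming lowercase input and treating only a, e, i, o, u as vowels; the name `acceptable` is hypothetical.

```python
VOWELS = set("aeiou")


def acceptable(stem):
    """Return True if the remaining stem passes both acceptability rules."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Vowel-initial: at least 2 letters must remain.
        return len(stem) >= 2
    # Consonant-initial: at least 3 letters, one of them a vowel or "y".
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)


for stem in ["ow", "e", "say", "cry", "str", "me", "ce"]:
    print(stem, acceptable(stem))
```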
Acceptability conditions • These rules cause errors in the stemming of some short-rooted words • (doing, dying, being). • These could be dealt with separately with a table lookup
Stemming “separately” • Use “y” section. Mismatch “ylb1>”, “yli3y>”, “ylp0.”. Match “yl2>”. Form becomes “separate” • Use “e1>” in “e” section. Form changes to “separat” • Use “t” section. Mismatch with “tacilp4y.”. Match with “ta2>”. Form changes to “separ” • Use “r” section. Match with “ra2.”. Form changes to “sep” and stemming terminates
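The walk-through can be reproduced with a compact sketch of the rule loop. The table holds only the rules quoted in this trace, not the full Paice/Husk rule set, and `RULES`, `acceptable` and `stem_trace` are hypothetical names. Acceptability is checked on the candidate stem before a rule is applied.

```python
import re

RULE_RE = re.compile(r"([a-z]+)(\*?)(\d)([a-z]*)([>.])")
VOWELS = set("aeiou")

# Only the rules quoted in the trace, keyed by the last letter of the form.
RULES = {
    "y": ["ylb1>", "yli3y>", "ylp0.", "yl2>"],
    "e": ["e1>"],
    "t": ["tacilp4y.", "ta2>"],
    "r": ["ra2."],
}


def acceptable(stem):
    """Acceptability conditions from the slides, applied to the new stem."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)


def stem_trace(word):
    """Return every form the word passes through, starting with the word."""
    form, intact, trace = word, True, [word]
    terminate = False
    while not terminate and form and form[-1] in RULES:
        applied = False
        for text in RULES[form[-1]]:
            rev, star, digit, append, symbol = RULE_RE.fullmatch(text).groups()
            if not form.endswith(rev[::-1]):
                continue                      # ending mismatch
            if star and not intact:
                continue                      # rule needs an intact form
            candidate = form[:len(form) - int(digit)] + append
            if not acceptable(candidate):
                continue                      # would over-stem
            if candidate != form:
                trace.append(candidate)
            form, intact, applied = candidate, False, True
            terminate = symbol == "."
            break
        if not applied:
            terminate = True
    return trace


print(stem_trace("separately"))
print(stem_trace("multiply"))
```

With this reduced table, “multiply” is protected by “ylp0.” and is returned unchanged, while “separately” passes through the same four steps as the trace.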