Automatic Indexing and Stemming in Information Retrieval

CS533 Information Retrieval Dr. Michal Cutler Lecture #4 February 3, 1999

This Class • Automatic indexing • Stop lists • How stemming is used in IR • Stemming algorithms • Frakes: Chapter 8

Disadvantages of Manual Indexing • Human effort • Controlled vocabulary per collection • Subjective (terms chosen by different indexers overlap by only about 40%)

Advantages of Manual Indexing • Human experts who use indexing aids such as “scope notes” describing allowable vocabulary and usage achieve good indexing uniformity

Which is better? • Salton - claims results of automatic indexing are comparable to manual • Blair - claims manual is better • Often, manual indexing is not a practical option

Automatic indexing • At this stage single words • Consider: • the usage of stop lists • which tokens to include • stemming algorithms

Stop lists • A stop list is a list of terms which are not included in an index • Inverted lists for stop words may still be saved and used for identifying phrases

Why use stop words? • Luhn (1957) observed that many of the most frequently occurring words are worthless as index terms • The 10 most frequently occurring terms account for 20-30% of the word occurrences • Eliminating stop words saves index space and computation time
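
A minimal sketch of how the share of word occurrences covered by the most frequent terms could be measured on a collection. The function name `top_k_share` and the whitespace tokenization in the usage lines are assumptions made for illustration, not part of the lecture.

```python
from collections import Counter


def top_k_share(tokens, k=10):
    """Return the fraction of all token occurrences covered by the k most frequent terms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return top / total


# Tiny illustrative input; a real measurement needs a full collection.
tokens = "the cat and the dog and the bird".split()
share = top_k_share(tokens, k=2)  # "the" (3) + "and" (2) out of 8 tokens
```

On a real collection, a high value for k=10 suggests that a short stop list already removes a large part of the index.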

Stop lists • Traditionally most frequently occurring English words. • Among the top 200 are words such as “time” “war” “home” etc.

Stop list for collection • “computer, machine, program, source, language” in a computer science collection

Stop lists • Commercial systems use only a few stop words • ORBIT uses only 8, “and, an, by, from, of, the, with” • Lists of stop words appear in the literature (Frakes)
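
A small sketch of stop word removal at indexing time. It uses only the seven ORBIT words quoted above; the names `ORBIT_STOP_WORDS` and `remove_stop_words` are hypothetical, and lower-casing before the lookup is an assumption.

```python
# Stop words quoted in the slide for ORBIT (only the words listed there).
ORBIT_STOP_WORDS = frozenset({"and", "an", "by", "from", "of", "the", "with"})


def remove_stop_words(tokens, stop_words=ORBIT_STOP_WORDS):
    """Keep only the tokens that are not on the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


terms = remove_stop_words("The usage of stop lists".split())
# terms == ["usage", "stop", "lists"]
```

A frozenset gives constant-time membership tests, so the cost per token stays the same even with a longer list such as the one in Frakes.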

Which tokens to include? • English words (not stop words) • Include numbers? • How to deal with hyphens? • Case sensitive?

Include numbers? • Numbers - not good discriminators • Important in some contexts • Usually systems allow tokens to include digits but not to begin with one • So B6 (vitamin) but not 6
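
The rule "digits allowed, but not as the first character" can be sketched with a regular expression. The pattern and the name `tokenize` are assumptions for illustration; real systems differ in how they treat punctuation and non-ASCII letters.

```python
import re

# A token starts with a letter and may continue with letters or digits.
TOKEN_RE = re.compile(r"\b[A-Za-z][A-Za-z0-9]*\b")


def tokenize(text):
    """Return tokens that may contain digits but do not begin with one."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Vitamin B6 and 6 apples")
# tokens == ["Vitamin", "B6", "and", "apples"]
```

The word boundaries keep a string such as "6pack" from contributing the partial token "pack".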

Include hyphens? • Break into distinct terms or • Single term with hyphen • Chemical Abstracts Service - hyphenated words treated as a single term • LEXIS/NEXIS - break apart into two terms if they occur in a title or abstract

Punctuation and case • Punctuation is sometimes important, for example “command.com” and “OS/2” • Case - convert to lower case or not

Commercial systems • Commercial systems prefer to enhance recall • Usually case insensitive • Index numbers • Very few stop words

Recognizing names • People’s names - “Bill Clinton” • Company names - IBM and “Big Blue” • Places • New York City, NYC, the Big Apple

Stemming algorithms • Affix removing stemmers • Dictionary lookup stemmers • n-gram stemmers • Successor variety stemmers

Stemming algorithms • Conflation methods: manual or automatic • Automatic methods (stemmers): affix removal, successor variety, dictionary lookup, n-grams • Affix removal: longest match, simple removal

Stemming • Conflation - combining non-identical words which refer to the same principal concept • Done manually or automatically • Automatic algorithms are called stemmers

Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce size of index files by combining term variants into single index term

Stemming during indexing • Index terms are stemmed words • Saves dictionary space • One inverted index list for all variants • Saves inverted index file space when position information is not included • Query terms are also stemmed
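
A minimal sketch of stemming at indexing time: every variant is stored under its stem, and query terms pass through the same stemmer. `toy_stem` is a hypothetical plural stripper used only to keep the example self-contained; `build_index` and `search` are likewise illustrative names.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stemmer: lower-case and strip a plural "s" (not after "us" or "ss")."""
    word = word.lower()
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word


def build_index(docs, stem=toy_stem):
    """Map each stem to the set of document ids that contain any variant of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem=toy_stem):
    """Return ids of documents containing every (stemmed) query term."""
    postings = [index.get(stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {1: "system users", 2: "user manual", 3: "operating systems"}
index = build_index(docs)
hits = search(index, "users")  # matches documents 1 and 2
```

Because "user" and "users" share one posting list, the dictionary holds a single entry for both, which is the space saving the slide refers to.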

Index is not stemmed • In this case the index contains words • No compression is achieved • No information is lost • Enables wild card searches • Enables long phrase searches when position information is included

Providing term variants during search • A stemming algorithm generates term variants • Term variants added to query automatically (query expansion) or • The user is provided with term variants and decides which ones to include

Example • A user searching for “system users” is provided in the CATALOG system with term variants for “users” and “system”

Example (cont.) Search term: users
    Term        Occurrences
 1. user        15
 2. users       1
 3. used        3
 4. using       2
• User selects variants to include in query
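
A sketch of how such a list of term variants could be produced at search time: group the vocabulary of an unstemmed index by stem, then show the group for the query term. `crude_stem` is a deliberately aggressive, hypothetical stemmer chosen so that the four terms above fall into one group; it is not the method used by CATALOG.

```python
from collections import defaultdict


def crude_stem(word):
    """Hypothetical stemmer: strip the first matching suffix from a short fixed list."""
    for suffix in ("ers", "ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def variant_table(term_counts, stem=crude_stem):
    """Group index terms (with their occurrence counts) by stem."""
    table = defaultdict(dict)
    for term, count in term_counts.items():
        table[stem(term)][term] = count
    return table


def variants_for(term, table, stem=crude_stem):
    """Return (variant, occurrences) pairs for a query term, most frequent first."""
    group = table.get(stem(term), {})
    return sorted(group.items(), key=lambda pair: -pair[1])


# Occurrence counts taken from the example above.
counts = {"user": 15, "users": 1, "used": 3, "using": 2}
choices = variants_for("users", variant_table(counts))
```

The user would then pick from `choices`; grouping "used" and "using" with "users" also shows how an aggressive stemmer offers variants that may hurt precision.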

Stemmer correctness • A stemmer can be incorrect by either • Under-stemming or by • Over-stemming • Over-stemming can reduce precision • Under-stemming can affect recall

Over-stemming • Terms with different meanings are conflated • “considerate”, “consider” and “consideration” should not be stemmed to “con”, which would conflate them with “contra”, “contact”, etc.

Under-Stemming • Prevents related terms from being conflated • Under-stemming “consideration” to “considerat” prevents conflating it with “consider”

Evaluating stemmers • In information retrieval stemmers are evaluated by their: • effect on retrieval and • compression rate, and • not linguistic correctness

Evaluating stemmers • Studies have shown that stemming has a positive effect on retrieval. • Performance of algorithms comparable • Results vary between test collections

Affix removal stemmers • Remove • suffixes and/or • prefixes from terms • leaving a stem

Affix removal stemmers • In English, stemmers are suffix removers • In other languages, for example Hebrew, both prefixes and suffixes are removed • Keshehalachnu --> halach • Nelechna --> halach

Affix removal stemmers • Most affix removal stemmers in use are: • iterative - for example, “consideration” stemmed first to “considerat” then to “consider” • longest match stemmers using a set of stemming rules arranged on a ‘longest match’ principle (Lovins)

A simple stemmer • Harman experimented • concluded minimal stemming helpful • Her simple stemmer changes: • Plural to singular • Third person to first person

A simple stemmer (Harman)
if word ends in “ies” but not “eies” or “aies” then “ies” -> “y”;
else if word ends in “es” but not “aes”, “ees” or “oes” then “es” -> “e”;
else if word ends in “s” but not “us” or “ss” then “s” -> NULL
endif

A simple stemmer • Algorithm changes: • “skies” to “sky”, • “retrieves” to “retrieve”, and • “doors” to “door” but not “corpus” or “wellness” • “dies” to “dy”?
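
The three rules translate almost directly into Python. This is a sketch that follows the if / else-if order of the pseudocode as written; the function name `s_stem` is an assumption, and the input is assumed to be already lower-cased.

```python
def s_stem(word):
    """Apply the first matching plural rule of the simple stemmer; otherwise return word unchanged."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> NULL
    return word


examples = {w: s_stem(w) for w in ["skies", "retrieves", "doors", "corpus", "wellness", "dies"]}
# "dies" becomes "dy": the rules have no exception for short words.
```

`str.endswith` accepts a tuple of suffixes, which keeps each exception list on one line.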

The Paice/Husk stemmer • Uses a table of rules grouped into sections • Section for each last letter of a suffix (rules for forms ending in a, then b, etc.) • A form is any word or part of a word considered for stemming

The Paice/Husk stemmer • Each rule specifies a deletion or a replacement of an ending • The order of the rules in each section is important. • Rules tried until one can be applied, and the current form is updated

Rule structure • Each rule contains 5 parts (2 are optional): • An ending (one or more characters in reverse order) • An optional “intact” flag “*” denoting form not yet stemmed

Rule structure • A digit (>=0) specifying the number of characters to remove • An optional string to append (after removal) • A continuation symbol: “>” denotes that stemming should continue, “.” terminates the stemming process

Examples of rules • “sei3y>” • if form ends in “ies” then replace the last 3 letters by “y” and continue stemming (“tries” becomes “try”)

Examples of rules • “mu*2.” • if form ends with “um” and the word is intact, remove the last 2 letters and terminate stemming. • “maximum” is stemmed to “maxim”, but “presum” from “presumably” remains unchanged

Examples of rules • “ylp0.” - if form ends in “ply”, remove nothing and terminate. This stops the next rule, “yl2>”, from removing “ly” from “multiply” • “nois4j>” causes “sion” to be replaced by “j” • “j” acts as a dummy ending • “provision” is converted to “provij” and then to “provid”
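
A sketch of how a rule string could be decoded into its five parts. The `Rule` type, the regular expression and the name `parse_rule` are assumptions for illustration; the sketch also assumes a single-digit removal count, as in the examples above.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    ending: str        # ending to match, in normal (not reversed) order
    intact_only: bool  # True if the rule carries the "*" intact flag
    remove: int        # number of characters to remove
    append: str        # string to append after removal (may be empty)
    continues: bool    # True for ">", False for "."


# reversed ending, optional "*", one digit, optional append string, ">" or "."
RULE_RE = re.compile(r"([a-z]+)(\*?)(\d)([a-z]*)([>.])")


def parse_rule(text):
    """Decode a rule string such as "sei3y>" into a Rule."""
    match = RULE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid rule: {text!r}")
    reversed_ending, star, digit, append, symbol = match.groups()
    return Rule(reversed_ending[::-1], star == "*", int(digit), append, symbol == ">")


rule = parse_rule("sei3y>")
# rule.ending == "ies", rule.remove == 3, rule.append == "y", rule.continues is True
```

Storing the ending in normal order lets the matching step use a plain suffix test on the form.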

The Algorithm
terminate := false;
while not terminate and there is a section for last letter of form do
    Use last letter of form to select a section in the table of rules
    applied := false;
    while more rules in section and not applied do
        {Check the applicability of the current rule}
        if form ending does not match the current rule’s ending then skip to the next rule
        if form matches ending but intact flag is on and the form is not intact then skip to the next rule
        if acceptability conditions are not satisfied then skip to the next rule
        apply rule to form (delete and append as rule specifies)
        applied := true;
        if rule ends in “.” then terminate := true;
    endwhile
    if not applied then terminate := true;
endwhile

Acceptability conditions • Attempt to prevent over-stemming • Without them “rent”, “rant”, “rice”, “rate”, “ration” and “river” reduce to “r”. • There are 2 rules:

Acceptability conditions • If a form starts with a vowel then at least 2 letters must remain after stemming (owed/owing->ow but not ear->e) • If a form starts with a consonant then at least 3 letters must remain after stemming, and at least one of them must be a vowel or “y” (saying->say, crying->cry, but not string->str, meant->me, or cement->ce)

Acceptability conditions • These rules cause errors in the stemming of some short-rooted words • (doing, dying, being). • These could be dealt with separately with a table lookup
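
The two acceptability conditions can be sketched as one predicate on the candidate stem. The name `is_acceptable` is hypothetical, and treating only a, e, i, o, u as vowels (with “y” counted separately in the second condition) is an assumption.

```python
VOWELS = frozenset("aeiou")


def is_acceptable(stem):
    """Check the two acceptability conditions on a candidate stem."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Starts with a vowel: at least 2 letters must remain.
        return len(stem) >= 2
    # Starts with a consonant: at least 3 letters, one of them a vowel or "y".
    return len(stem) >= 3 and any(ch in VOWELS or ch == "y" for ch in stem)


accepted = [s for s in ["ow", "say", "cry", "e", "str", "me", "ce", "do"] if is_acceptable(s)]
# accepted == ["ow", "say", "cry"]
```

The rejection of "do" shows why short-rooted words such as "doing" are stemmed incorrectly and would need a lookup table.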

Stemming “separately” • Use “y” section. Mismatch with “ylb1>”, “yli3y>”, “ylp0.”. Match with “yl2>”. Form becomes “separate” • Use “e1>” in “e” section. Form changes to “separat” • Use “t” section. Mismatch with “tacilp4y.”. Match with “ta2>”. Form changes to “separ” • Use “r” section. Match with “ra2.”, which terminates stemming. Final stem is “sep”
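
The trace above can be reproduced with a compact sketch of the stemming loop. The rule table holds only rules quoted in these slides, so it is far from a complete Paice/Husk table. Two points are assumptions: the acceptability check is applied to the form after removal and append (consistent with “tries” becoming “try”), and a form counts as intact until any rule has been applied to it.

```python
import re

VOWELS = frozenset("aeiou")
RULE_RE = re.compile(r"([a-z]+)(\*?)(\d)([a-z]*)([>.])")

# Only rules quoted in the slides, in section order; a real table is much larger.
RULE_TEXTS = ["ylb1>", "yli3y>", "ylp0.", "yl2>", "e1>",
              "tacilp4y.", "ta2>", "ra2.", "sei3y>", "mu*2."]


def build_table(rule_texts):
    """Group parsed rules into sections keyed by the last letter of the ending."""
    table = {}
    for text in rule_texts:
        match = RULE_RE.fullmatch(text)
        if match is None:
            raise ValueError(f"not a valid rule: {text!r}")
        rev, star, digit, append, symbol = match.groups()
        rule = (rev[::-1], star == "*", int(digit), append, symbol == ">")
        table.setdefault(rev[0], []).append(rule)
    return table


def acceptable(stem):
    """Acceptability conditions: vowel start needs 2 letters; consonant start needs 3 with a vowel or y."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2
    return len(stem) >= 3 and any(ch in VOWELS or ch == "y" for ch in stem)


def paice_husk(word, table):
    """Apply rules section by section until a "." rule fires or no rule applies."""
    form, intact = word, True
    while form and form[-1] in table:
        applied = False
        for ending, intact_only, remove, append, continues in table[form[-1]]:
            if not form.endswith(ending):
                continue                      # ending does not match
            if intact_only and not intact:
                continue                      # rule only valid for unstemmed forms
            candidate = form[:len(form) - remove] + append
            if not acceptable(candidate):
                continue                      # would over-stem
            form, intact, applied = candidate, False, True
            if not continues:
                return form                   # rule ended in "."
            break                             # rule ended in ">": select a new section
        if not applied:
            break
    return form


TABLE = build_table(RULE_TEXTS)
stem = paice_husk("separately", TABLE)  # "separate" -> "separat" -> "separ" -> "sep"
```

With this reduced table, "multiply" is left unchanged by “ylp0.” and "maximum" becomes "maxim"; words that need rules not listed in the slides stop early.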

