This Class • How stemming is used in IR • Stemming algorithms • Frakes: Chapter 8 • Kowalski: pages 67-76
Stemming algorithms • Affix removing stemmers • Dictionary lookup stemmers • n-gram stemmers • Successor variety stemmers
Stemming • Conflation - combining morphological term variants • Done manually or automatically • Automatic algorithms called stemmers
Stemming algorithms • Conflation methods: manual or automatic • Automatic methods (stemmers): affix removal, successor variety, dictionary lookup, n-grams • Affix removal: longest match or simple removal
Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce the size of index files by combining term variants into a single index term
Stemming during indexing • Index terms are stemmed words • Saves dictionary space • One inverted index list for all variants • Saves inverted index file space when position information in document not included • Query terms are also stemmed
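The effect of stemming at indexing time can be sketched in a few lines. This is only an illustration: `toy_stem`, `build_index`, `search` and the three sample documents are hypothetical, and `toy_stem` is a stand-in that strips one trailing "s", not a real stemmer.

```python
from collections import defaultdict

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip one trailing "s".
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def build_index(docs, stem=toy_stem):
    # Map each stemmed term to the set of documents containing any variant.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index

def search(index, query, stem=toy_stem):
    # Query terms are stemmed the same way as index terms.
    postings = [index.get(stem(t), set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "user manual", 2: "users guide", 3: "system design"}
index = build_index(docs)
print(search(index, "users"))
```

Both "user" and "users" share one inverted list here, which is where the space saving comes from; the price is that the index can no longer tell the variants apart.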
Index is not stemmed • In this case the index contains words • No compression is achieved • No information is lost • Enables wild card searches • Enables long phrase searches when position information included
Providing term variants during search • A stemming algorithm generates term variants • Term variants are added to the query automatically (query expansion), or • The user is provided with term variants and decides which ones to include
Example • A user searching for "system users" is provided in the CATALOG system with term variants for "users" and "system"
Example (cont.) • Search term: users
  Term        Occurrences
  1. user     15
  2. users    1
  3. used     3
  4. using    2
• User selects variants to include in query
Stemmer correctness • A stemmer can be incorrect by either • Under-stemming or by • Over-stemming • Over-stemming can reduce precision • Under-stemming can affect recall
Over-stemming • Terms with different meanings are conflated • "considerate", "consider" and "consideration" should not be stemmed to "con", with "contra", "contact", etc.
Under-Stemming • Prevents related terms from being conflated • Under-stemming "consideration" to "considerat" prevents conflating it with "consider"
Evaluating stemmers • In information retrieval stemmers are evaluated by their: • effect on retrieval and • compression rate, and • not linguistic correctness
Evaluating stemmers • Studies have shown that stemming has a positive effect on retrieval. • Performance of algorithms comparable • Results vary between test collections
Affix removal stemmers • Remove • suffixes and/or • prefixes from terms • leaving a stem
Affix removal stemmers • In English stemmers are suffix removers • In other languages, for example Hebrew, both prefix and suffix are removed
Affix removal stemmers • Most affix removal stemmers in use are: • iterative - for example, "consideration" stemmed first to "considerat" then to "consider" • longest match stemmers using a set of stemming rules.
A simple stemmer • Harman experimented • concluded minimal stemming helpful • Her simple stemmer changes: • Plural to singular • Third person to first person
A simple stemmer • Algorithm changes: • "skies" to "sky" (ies -> y) • "retrieves" to "retrieve" (es -> e), and • "doors" to "door" (s -> NULL) • (leaves "corpus" or "wellness" unchanged)
A simple stemmer 1. If a word ends in "ies" but not "eies" or "aies", change the ending to "y" 2. If a word ends in "es" but not "aes", "ees" or "oes", change the ending to "e" 3. If a word ends in "s" but not "us" or "ss", remove the "s"
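The three rules can be written down almost literally. The sketch below assumes the exception lists of Harman's published S stemmer ("eies"/"aies", "aes"/"ees"/"oes", "us"/"ss") and assumes that only the first applicable rule fires; the function name `s_stem` is illustrative.

```python
def s_stem(word):
    """Apply the first applicable rule of the three-rule 'S' stemmer."""
    w = word.lower()
    # Rule 1: ies -> y, unless the word ends in eies or aies.
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: es -> e, unless the word ends in aes, ees or oes.
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final s, unless the word ends in us or ss.
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

for term in ["skies", "retrieves", "doors", "corpus", "wellness"]:
    print(term, "->", s_stem(term))
```

The exception lists are what keep "corpus" and "wellness" intact while "doors" still loses its plural "s".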
The Paice/Husk stemmer • Uses a table of rules grouped into sections • Section for each last letter of a suffix (rules for forms ending in a, then b, etc.) • A form is any word or part of a word considered for stemming
The Paice/Husk stemmer • Each rule specifies a deletion or a replacement of an ending • The order of the rules in each section is important. • Rules tried until one can be applied, and the current form is updated
Rule structure • Each rule contains 5 parts (2 are optional): • An ending (one or more characters in reverse order) • An optional "intact" flag "*" denoting form not yet stemmed
Rule structure • A digit (>=0) specifying the number of characters to remove • An optional string to append (after removal) • A continuation symbol: a rule ending with ">" denotes that stemming should continue, "." that the stemming process terminates
Examples of rules • "sei3y>" • if form ends in "ies" then replace the last 3 letters by "y" and continue stemming ("-ries" becomes "-ry")
Examples of rules • "mu*2." • if form ends with "um" and word is intact remove 2 last letters and terminate stemming. • "maximum" is stemmed to "maxim", but "presum" from "presumably" remains unchanged
Examples of rules • "ylp0." - if word terminates in "ply", terminate. The next rule, "yl2>", therefore does not remove "ly" from "multiply" • "nois4j>" causes "sion" to be replaced by "j" • "j" acts as a dummy ending • "provision" is converted to "provij" and then to "provid"
Acceptability conditions • Rule not applied unless conditions satisfied • Attempt to prevent over-stemming • Without them "rent", "rant", "rice", "rate", "ration", "river" reduce to "r" • There are 2 rules:
Acceptability conditions • If a form starts with a vowel then at least 2 letters must remain (owed/owing->ow but not ear->e) • If a form starts with a consonant then at least 3 letters must remain, and at least one must be a vowel or "y" (saying->say, crying->cry, but not string->str, meant->me, or cement->ce)
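The two conditions amount to a small predicate on the candidate stem. A minimal sketch, assuming that "vowel" means a, e, i, o, u and that "y" counts only for the second condition; the name `acceptable` is illustrative.

```python
VOWELS = set("aeiou")

def acceptable(stem):
    """Return True if a candidate stem satisfies the two acceptability rules."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Vowel-initial forms: at least 2 letters must remain.
        return len(stem) >= 2
    # Consonant-initial forms: at least 3 letters, one of them a vowel or y.
    return len(stem) >= 3 and any(ch in VOWELS or ch == "y" for ch in stem)

for candidate in ["ow", "e", "say", "cry", "str", "me", "ce"]:
    print(candidate, acceptable(candidate))
```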
Acceptability conditions • These rules cause error in the stemming of some short-rooted words • (doing, dying, being). • These could be dealt with separately with a table lookup
Example with Paice stemming • "separately" - use "y" section • mismatch "ylb1>", "yli3y>", "ylp0." • match "yl2>". Form becomes "separate" • use rule "e1>" in "e" section • form changes to "separat" - use "t" section • mismatch with "tacilp4y.", match with "ta2>", change form to "separ" • use "r" section, match with "ra2.". So "sep"
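The walk-through can be reproduced with a small rule interpreter. This is a sketch under stated assumptions, not the Paice/Husk implementation: the rule table contains only rules quoted in these slides (the real table is much larger), the acceptability check is applied to the form after removal and appending, and the names `parse_rule`, `acceptable` and `paice_stem` are illustrative.

```python
import re

# ending (reversed), optional "*" intact flag, digit, optional append string, ">" or "."
RULE_RE = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")

# Only the rules mentioned in the slides, in the order given there.
RULES = ["sei3y>", "mu*2.", "ylb1>", "yli3y>", "ylp0.", "yl2>",
         "e1>", "tacilp4y.", "ta2>", "ra2."]

def parse_rule(text):
    ending, intact, digit, append, symbol = RULE_RE.match(text).groups()
    return ending[::-1], intact == "*", int(digit), append, symbol == ">"

# Group rules into sections keyed by the last letter of the ending.
SECTIONS = {}
for rule in RULES:
    SECTIONS.setdefault(rule[0], []).append(parse_rule(rule))

def acceptable(stem):
    if stem[0] in "aeiou":
        return len(stem) >= 2
    return len(stem) >= 3 and any(ch in "aeiouy" for ch in stem)

def paice_stem(word):
    form, intact = word.lower(), True
    while True:
        for ending, needs_intact, remove, append, go_on in SECTIONS.get(form[-1], []):
            if not form.endswith(ending) or (needs_intact and not intact):
                continue
            candidate = form[:len(form) - remove] + append
            if not acceptable(candidate):
                continue
            form, intact = candidate, False
            break
        else:
            return form      # no rule in this section applies
        if not go_on:
            return form      # the matched rule ended with "."

print(paice_stem("separately"))
```

With this reduced table, "separately" passes through "separate", "separat" and "separ" before the terminating rule "ra2." leaves "sep", and the protective rule "ylp0." stops "multiply" before "yl2>" is tried.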
n-grams • Fixed length consecutive series of "n" characters • Bigrams: • Sea colony -> (se ea co ol lo on ny) • Trigrams: • Sea colony -> (sea col olo lon ony), or -> (#se sea ea# #co col olo lon ony ny#)
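A short sketch of n-gram extraction that reproduces the "Sea colony" examples. It assumes n-grams are taken within each word, never across the space, and that "#" is the word-boundary marker; the function name `ngrams` and the `pad` parameter are illustrative.

```python
def ngrams(text, n, pad=False):
    """Return the n-grams of each word in text, in order of occurrence."""
    grams = []
    for word in text.lower().split():
        if pad:
            word = "#" + word + "#"   # mark word boundaries
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(ngrams("Sea colony", 2))
print(ngrams("Sea colony", 3))
print(ngrams("Sea colony", 3, pad=True))
```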
Usage of n-grams • Used in World War II by cryptographers • Spell checking • Text compression • Signature files • Stemming
n-gram stemmers • Adamson and Boreham (1974) • Method for grouping term variants • Language independent
n-gram stemmers • Each term is transformed into its n-grams • A similarity value is computed for every pair of terms in the database, resulting in a similarity matrix
n-gram stemmers • A clustering method (single link) groups highly similar terms into clusters • Most matrix elements had value 0 • A cutoff value of 0.6 was used for the clustering algorithm
Dice Coefficient • Many formulas for computing set similarity • Dice coefficient: $S = \frac{2|A \cap B|}{|A| + |B|}$ • $0 \le S \le 1$ • $S = 1$ if $A = B$, $S = 0$ if $A \cap B = \emptyset$
Sets of Unique Bigrams • Let A and B denote the sets of unique bigrams associated with two terms, and let $C = A \cap B$ • statistics -> (st ta at ti is st ti ic cs) • Set of unique bigrams for statistics: A={at cs ic is st ta ti}, |A|=7
n-gram stemmers • statistical -> (st ta at ti is st ti ic ca al) • Set of unique bigrams for statistical: B={al at ca ic is st ta ti}, |B|=8 • C={at ic is st ta ti}, |C|=6 • $S = \frac{2|C|}{|A| + |B|} = \frac{2 \times 6}{7 + 8} = 0.8$
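The worked example and the clustering step can be checked with a short sketch. Assumptions: terms are compared on unique bigrams without boundary markers, single-link clustering is implemented as connected components over pairs whose similarity is at least the cutoff (whether the original comparison was strict is not stated), and the names `unique_bigrams`, `dice` and `single_link_clusters` are illustrative.

```python
def unique_bigrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    """Dice coefficient between the unique-bigram sets of two terms."""
    A, B = unique_bigrams(a), unique_bigrams(b)
    if not A and not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))

def single_link_clusters(terms, cutoff=0.6):
    """Merge any two terms whose similarity reaches the cutoff (single link)."""
    parent = list(range(len(terms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if dice(terms[i], terms[j]) >= cutoff:
                parent[find(j)] = find(i)

    clusters = {}
    for i, term in enumerate(terms):
        clusters.setdefault(find(i), []).append(term)
    return list(clusters.values())

print(dice("statistics", "statistical"))
print(single_link_clusters(["statistics", "statistical", "colony"]))
```

Note that the method never produces a stem; it only groups terms, which is why it works without any language-specific rules.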
Table lookup method • Ideally, a table is constructed with stem for every word • Stemming - look up word find stem • There is no such data for English • Systems use a combination of dictionary lookup and conflation rules
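A combination of table lookup and conflation rules might look like the sketch below. Everything in it is hypothetical: `STEM_TABLE` holds a handful of invented entries, and `fallback_rule` is a single plural-stripping rule standing in for a full rule set.

```python
# Hypothetical word -> stem table; a real one would need to cover the vocabulary.
STEM_TABLE = {"users": "user", "using": "use", "used": "use"}

def fallback_rule(word):
    # Stand-in conflation rule: drop a final "s" unless the word ends in "us" or "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

def lookup_stem(word, table=STEM_TABLE):
    """Prefer the table entry; fall back to the rule for words not in the table."""
    w = word.lower()
    return table.get(w, fallback_rule(w))

for term in ["using", "doors", "corpus"]:
    print(term, "->", lookup_stem(term))
```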
Dictionary lookup method • INQUERY uses Kstem • Kstem is a morphological analyzer that conflates word variants to root form
Dictionary lookup method • Tries to avoid collapsing words with different meaning to same root • The original word or a stemmed version is looked up in a dictionary and replaced by the best stem
Successor variety stemmer • Based on work in structural linguistics (Hafer and Weiss) • Performed less well than affix removing stemmers • Given a set of words, the successor variety (SV) of a string is the number of different characters that follow it in words in the set
Successor variety stemmers • Terms: {able, axle, accident, ape, about, apply, application, applies} • The SV of "ap" is 2: "ap" is followed by "e" in "ape" and by "p" in "apply", "application" and "applies" • The SV of "a" is 4: "a" is followed in the set by "b", "x", "c" and "p"
SVs for "apply" and "applies" (table not reproduced) • * denotes a break point at a peak
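Successor varieties for the example term set can be computed directly from the definition. The sketch assumes that the end of a word does not count as a successor (some formulations count it as a blank); the names `successor_variety` and `sv_profile` are illustrative.

```python
TERMS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]

def successor_variety(prefix, words):
    """Number of distinct characters that follow prefix in the word set."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

def sv_profile(word, words):
    """SV of every prefix of word, shortest prefix first."""
    return [(word[:i], successor_variety(word[:i], words))
            for i in range(1, len(word) + 1)]

for prefix, sv in sv_profile("apply", TERMS):
    print(prefix, sv)
```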
Segmenting words • 4 ways: • Cut-off SV is reached • SV "peaks" • A substring of a word is equal to another word in the set: "readable" breaks into "read" and "able" • Entropy based method
Selecting a stem • The first segment is selected if it occurs in at most 12 words • Otherwise the second segment is selected (3 segments are unlikely)
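Peak segmentation and the stem-selection rule can be sketched together. Assumptions: a break is placed after a prefix whose SV is strictly greater than that of both neighbouring prefixes, end of word is not counted as a successor, and "occurs in" is read as "is a prefix of"; the function names and the `max_words` parameter are illustrative.

```python
TERMS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]

def successor_variety(prefix, words):
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

def peak_segments(word, words):
    """Break the word after every prefix whose SV is a strict local peak."""
    svs = [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]
    segments, start = [], 0
    for i in range(1, len(svs) - 1):
        if svs[i] > svs[i - 1] and svs[i] > svs[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

def select_stem(segments, words, max_words=12):
    """Pick the first segment unless it begins more than max_words words."""
    first = segments[0]
    occurrences = sum(w.startswith(first) for w in words)
    if occurrences <= max_words or len(segments) == 1:
        return first
    return segments[1]

segments = peak_segments("apply", TERMS)
print(segments, select_stem(segments, TERMS))
```

The idea behind the threshold is that a first segment shared by many words is probably a prefix rather than a stem, in which case the second segment is the better choice.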