Stemming Algorithms
Information Retrieval and Recommendation Techniques: Midterm Report
Advisor: Prof. 黃三益
Students: 9142608 黃哲修, 9142609 張家豪
From www.mis.nsysu.edu.tw/~syhwang/Courses/IR/StemmingAlgorithms.ppt, modified by Sumanta
The Porter Algorithm
• Word = Stem + Affix(es)
• E.g., generalizations = general + ization + s
• Stemming is the determination of the stem of a given word
• Porter’s stemmer is a rule-based algorithm
• E.g., ational → ate (apply: relational → relate)
• Porter’s stemmer is heuristic, in that it is a practical method not guaranteed to be optimal
The Porter Stemmer: Definitions
• CONSONANT: a letter other than A, E, I, O, U, and other than Y preceded by a consonant (in TOY the consonants are T, Y; in SYZYGY they are S, Z, G)
• VOWEL: any other letter
• With this definition, all words and parts of words are of the form [C](VC)^m[V]
  C = a string of one or more consonants (con+), and [C] indicates arbitrary presence of the contents, i.e., possibly the empty string as well
  V = a string of one or more vowels, and [V] indicates arbitrary …
• E.g., TROUBLES = C VC VC = C(VC)^2 (TR, OUBL, ES)
• m is the measure of the word
  m = 0: TR, EE, TREE, Y, BY
  m = 1: TROUBLE, OATS, TREES, IVY
  m = 2: TROUBLES, PRIVATE, OATEN, ORRERY
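The definitions above can be checked mechanically. The following is a minimal sketch, not code from the slides: the function names `is_consonant` and `measure` are chosen here for illustration. It classifies each letter, then counts how often a vowel run is followed by a consonant run, which is the exponent m in [C](VC)^m[V].

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition (word in lower case)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant at the start of a word or after a vowel,
        # and a vowel when it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m, the number of VC groups in the form [C](VC)^m[V]."""
    word = word.lower()
    pattern = "".join(
        "C" if is_consonant(word, i) else "V" for i in range(len(word))
    )
    # Each vowel-to-consonant transition closes exactly one VC group.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")


if __name__ == "__main__":
    for w in ("TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "ORRERY"):
        print(w, measure(w))
```

Run on the slide's examples, this reproduces the three groups listed above (m = 0, 1 and 2).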
Rule Format
• Rules are of the form (condition) S1 → S2, where S1 and S2 are suffixes. Given a set of rules, only the one with the longest matching suffix S1 is applied.
• Conditions (tested on the stem, i.e., the part of the word before S1):
  1. m --- the measure of the stem: m = k or m > k, where k is an integer
  2. *X --- the stem ends with a given letter X
  3. *v* --- the stem contains a vowel
  4. *d --- the stem ends in a double consonant
  5. *o --- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y (e.g., wil, hop)
• Rules are divided into sets, and in each successive step one set of rules is applied.
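The rule format can be illustrated with a small sketch. Everything named here (`cv_pattern`, `apply_rules`, `SAMPLE_RULES`, the predicate names) is hypothetical and chosen for this example. The two sample rules, (m > 0) ATIONAL → ATE and (m > 0) TIONAL → TION, are taken from Porter's published rule set. The sketch picks the rule with the longest matching S1 and applies it only if its condition holds on the stem.

```python
def cv_pattern(s):
    """Map each letter to 'C' or 'V'; y counts as a vowel only after a consonant."""
    out = []
    for ch in s.lower():
        if ch in "aeiou" or (ch == "y" and out and out[-1] == "C"):
            out.append("V")
        else:
            out.append("C")
    return "".join(out)


def measure(stem):
    """m: the number of vowel-to-consonant transitions."""
    return cv_pattern(stem).count("VC")


def contains_vowel(stem):            # condition *v*
    return "V" in cv_pattern(stem)


def ends_double_consonant(stem):     # condition *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and cv_pattern(stem)[-1] == "C"


def ends_cvc(stem):                  # condition *o
    return cv_pattern(stem).endswith("CVC") and stem[-1] not in "wxy"


def apply_rules(word, rules):
    """rules: list of (condition, S1, S2). Only the longest matching S1 is tried."""
    matching = [r for r in rules if word.endswith(r[1])]
    if not matching:
        return word
    condition, s1, s2 = max(matching, key=lambda r: len(r[1]))
    stem = word[: len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


# Two sample rules; a real step contains many more.
SAMPLE_RULES = [
    (lambda stem: measure(stem) > 0, "ational", "ate"),
    (lambda stem: measure(stem) > 0, "tional", "tion"),
]

if __name__ == "__main__":
    for w in ("relational", "conditional", "rational"):
        print(w, "->", apply_rules(w, SAMPLE_RULES))
```

Note the behaviour on "rational": the longest matching suffix is "ational", its stem "r" has m = 0, so the condition fails, and because only the longest match is considered the shorter "tional" rule is not tried and the word is left unchanged.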
Porter Steps
• Each step corresponds to a set of rules. The rules in a step are examined in sequence, and only one rule from a step can apply.
{
  step1a(word);
  step1b(stem);
  if (the second or third rule of step 1b was used)
    step1b1(stem);
  step1c(stem);
  step2(stem);
  step3(stem);
  step4(stem);
  step5a(stem);
  step5b(stem);
}
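The conditional call to step1b1 is the only branching in the sequence above, so it is worth a sketch. The rules below follow Porter's published Step 1b ((m > 0) EED → EE, (*v*) ED → null, (*v*) ING → null) and its clean-up rules. The Python names and the decision to return a flag from `step1b` are assumptions made for this illustration, not part of the slides.

```python
def cv_pattern(s):
    """Map each letter to 'C' or 'V'; y counts as a vowel only after a consonant."""
    out = []
    for ch in s.lower():
        if ch in "aeiou" or (ch == "y" and out and out[-1] == "C"):
            out.append("V")
        else:
            out.append("C")
    return "".join(out)


def measure(stem):
    return cv_pattern(stem).count("VC")


def step1b(word):
    """Return (result, cleanup_needed). The flag is set when ED or ING was removed."""
    if word.endswith("eed"):
        # Longest match: if this rule's condition fails, no other rule is tried.
        stem = word[:-3]
        return (stem + "ee" if measure(stem) > 0 else word), False
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if "V" in cv_pattern(stem):        # condition *v*
                return stem, True
            return word, False
    return word, False


def step1b1(stem):
    """Clean-up applied only after the second or third rule of step 1b fired."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    pat = cv_pattern(stem)
    if len(stem) >= 2 and stem[-1] == stem[-2] and pat[-1] == "C":
        # Double consonant: keep it for l, s, z; otherwise drop one letter.
        return stem if stem[-1] in "lsz" else stem[:-1]
    if measure(stem) == 1 and pat.endswith("CVC") and stem[-1] not in "wxy":
        return stem + "e"                      # m = 1 and *o
    return stem


def step1b_with_cleanup(word):
    stem, cleanup_needed = step1b(word)
    return step1b1(stem) if cleanup_needed else stem


if __name__ == "__main__":
    for w in ("agreed", "feed", "hopping", "filing", "sized", "singing"):
        print(w, "->", step1b_with_cleanup(w))
```

Returning the flag alongside the stem keeps the driver free of global state; a full stemmer would chain the remaining steps after this one in the order listed on the slide.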
Examples/Problems
• computers → computer (Step 1a) → comput (Step 4)
• singing → sing (Step 1b)
• generalizations →
• information →
• instructor →
• Try words of your own …
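The first arrow in the computers example comes from Step 1a, which handles plurals. As a minimal sketch (the names `STEP_1A` and `step1a` are chosen here for illustration), Porter's Step 1a consists of four unconditional rules: SSES → SS, IES → I, SS → SS and S → null. Listing them from longest to shortest S1 makes the first match the longest match.

```python
# Porter's Step 1a rules as (S1, S2) pairs, ordered from longest to shortest S1.
STEP_1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1a(word):
    """Apply the single Step 1a rule with the longest matching suffix."""
    for s1, s2 in STEP_1A:
        if word.endswith(s1):
            return word[: -len(s1)] + s2
    return word


if __name__ == "__main__":
    for w in ("computers", "caresses", "ponies", "caress", "cats"):
        print(w, "->", step1a(w))
```

The SS → SS rule looks redundant, but it is what stops "caress" from losing its final s: it is a longer match than the S rule and so takes precedence.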
Porter’s Mishaps
• The on-line Porter stemmer at http://textanalysisonline.com/nltk-porter-stemmer gives:
  gas (noun) → ga
  gases (plural) → gase
  gasses (verb, present tense) → gass
  gassing (verb, present continuous) → gass
  gaseous (adjective) → gaseou
• This is not good: all these words should ideally reduce to the same stem.
• Trade-off: more rules (accurate but slow) vs. fewer rules (efficient but sometimes wrong).
• Google does give different results for gas and gases, so maybe they use these Porter rules :-)