Learn about rules-based classifiers in data mining, including rule sets vs. rule lists, constructing classifiers like 1R and PRISM, and evaluating rule accuracy.
COMP527: Data Mining. Classification: Rules. M. Sulaiman Khan (mskhan@liv.ac.uk), Dept. of Computer Science, University of Liverpool, February 10, 2009.
COMP527 course outline: Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam.
Today's Topics

• Introduction
• Rule Sets vs Rule Lists
• Constructing Rules-Based Classifiers: 1R, PRISM, Reduced Error Pruning, RIPPER
• Rules with Exceptions
Rules-Based Classifiers

Idea: learn a set of rules from the data, then apply those rules to determine the class of a new instance. For example:

R1. If blood-type=Warm and lays-eggs=True then Bird
R2. If blood-type=Cold and flies=False then Reptile
R3. If blood-type=Warm and lays-eggs=False then Mammal

Hawk: flies=True, blood-type=Warm, lays-eggs=True, class=???
R1 is true, so the classifier predicts that Hawk = Bird. Yay!
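As a minimal sketch of how such a classifier applies its rules (in Python, with a representation and names of my own choosing, not from the lecture):

    # Each rule is (conditions, class); conditions map attribute -> required value.
    rules = [
        ({"blood-type": "Warm", "lays-eggs": True}, "Bird"),
        ({"blood-type": "Cold", "flies": False}, "Reptile"),
        ({"blood-type": "Warm", "lays-eggs": False}, "Mammal"),
    ]

    def predict(instance, rules, default=None):
        # Return the class of the first rule whose conditions all hold.
        for conditions, label in rules:
            if all(instance.get(a) == v for a, v in conditions.items()):
                return label
        return default

    hawk = {"flies": True, "blood-type": "Warm", "lays-eggs": True}
    print(predict(hawk, rules))  # -> Bird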
Rules-Based Classifiers

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule. The coverage of a rule is the percentage of records that satisfy its condition. The accuracy of a rule is the percentage of covered records whose class matches the rule's conclusion. For example, a rule might cover 10/50 records (coverage 20%), of which 8 have the predicted class (accuracy 80%).
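Those two definitions are easy to compute directly; a minimal sketch (function and variable names are mine):

    def coverage_and_accuracy(conditions, label, records):
        # records: list of (attributes, cls) pairs; conditions: attribute -> value.
        covered = [(x, c) for x, c in records
                   if all(x.get(a) == v for a, v in conditions.items())]
        correct = sum(1 for _, c in covered if c == label)
        coverage = len(covered) / len(records)
        accuracy = correct / len(covered) if covered else 0.0
        return coverage, accuracy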
Rule Set vs Rule List

Rules can be grouped either as a set or as an ordered list.

Set: the rules make independent predictions. Every record is covered by 0..1 rules (hopefully exactly 1!)

R1. If flies=True and lays-eggs=True and lives-in-water=False then Bird
R2. If flies=False and lives-in-water=True and lays-eggs=True then Fish
R3. If blood-type=Warm and lays-eggs=False then Mammal
R4. If blood-type=Cold and lays-eggs=True and lives-in-water=False then Reptile

It doesn't matter in which order we evaluate these rules.
Rule Set vs Rule List

List: the rules make dependent predictions. Every record is covered by 0..* rules (hopefully 1..*!)

R1. If flies=True and lays-eggs=True then Bird
R2. If blood-type=Warm and lays-eggs=False then Mammal
R3. If lives-in-water=True then Fish
R4. If lays-eggs=True then Reptile

It does matter in which order we evaluate these rules: a record takes the class of the first rule that covers it.

If all records are covered by at least one rule, the rule set or list is considered exhaustive.
Constructing Rules-Based Classifiers

Covering approach: at each stage, find a rule that covers some of the instances. "Separate and conquer": choose a rule that identifies many instances, separate them out, repeat.

But first, a very, very simple classifier called 1R: "1R" because its rules all test one particular attribute.
1R Classifier

Idea: construct one rule for each attribute/value combination, predicting the most common class for that combination.

Example data (the weather dataset from Witten & Frank, which the counts on the following slides refer to):

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
1R Classifier

Rules generated (errors in brackets):

Outlook:      sunny -> no (2/5), overcast -> yes (0/4), rainy -> yes (2/5); total errors 4/14
Temperature:  hot -> no (2/4), mild -> yes (2/6), cool -> yes (1/4); total errors 5/14
Humidity:     high -> no (3/7), normal -> yes (1/7); total errors 4/14
Windy:        false -> yes (2/8), true -> no (3/6); total errors 5/14

Now choose the attribute with the fewest errors: Outlook and Humidity tie on 4/14, so we randomly decide on Outlook. 1R will simply use the outlook attribute to predict the class for new instances.
1R Algorithm

for each attribute:
    for each value of that attribute:
        find the class distribution for that attribute/value pair
        conc = most frequent class
        make rule: attribute=value -> conc
    calculate the error rate of the ruleset
select the ruleset with the lowest error rate

Almost not worth wasting a slide on, really!
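In Python, the whole algorithm fits in a few lines (a sketch; instances are dicts, and the names are mine):

    from collections import Counter, defaultdict

    def one_r(records, class_attr="play"):
        # Returns (attribute, {value: predicted class}) with the fewest training errors.
        best = None  # (errors, attribute, rules)
        for attr in (a for a in records[0] if a != class_attr):
            dist = defaultdict(Counter)  # value -> class distribution
            for r in records:
                dist[r[attr]][r[class_attr]] += 1
            rules = {v: c.most_common(1)[0][0] for v, c in dist.items()}
            errors = sum(n for v, c in dist.items()
                         for cls, n in c.items() if cls != rules[v])
            if best is None or errors < best[0]:
                best = (errors, attr, rules)
        return best[1], best[2]

On the weather data this picks outlook (or humidity, depending on tie-breaking) with 4/14 errors.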
PRISM Classifier

Instead of always looking at the full data set, after constructing each rule we can remove the instances that the rule covers before looking for a new rule.

Start with a high-coverage rule, then increase its accuracy by adding more conditions to it. We want to maximise the accuracy of each rule, i.e. the ratio p/t of positive instances to total instances covered. We finish adding conditions when p/t = 1, or when there are no more instances to look at.
PRISM Classifier

Following Witten (pp. 6, 108+), we grow the rule: If X then recommendation=hard. Find the condition X with the highest coverage ratio:

Age = Young                   2/8
Age = Pre-presbyopic          1/8
Age = Presbyopic              1/8
Prescription = Myope          3/12
Prescription = Hypermetrope   1/12
Astigmatism = no              0/12
Astigmatism = yes             4/12
Tear-Production = Reduced     0/12
Tear-Production = Normal      4/12

Select astigmatism = yes (arbitrarily, over Tear-Production = Normal).
PRISM Classifier

This covers:

Age             Prescription   Astigmatism  Tear production  Recommended lenses
Young           Myope          Yes          Reduced          None
Young           Myope          Yes          Normal           Hard
Young           Hypermetrope   Yes          Reduced          None
Young           Hypermetrope   Yes          Normal           Hard
Pre-presbyopic  Myope          Yes          Reduced          None
Pre-presbyopic  Myope          Yes          Normal           Hard
Pre-presbyopic  Hypermetrope   Yes          Reduced          None
Pre-presbyopic  Hypermetrope   Yes          Normal           None
Presbyopic      Myope          Yes          Reduced          None
Presbyopic      Myope          Yes          Normal           Hard
Presbyopic      Hypermetrope   Yes          Reduced          None
Presbyopic      Hypermetrope   Yes          Normal           None

Now we need to add another condition to make the rule more accurate:
If astigmatism = yes and X then recommendation = hard
PRISM Classifier

The best condition is Tear-Production = Normal (4/6). New rule: If astigmatism=yes and tear-production=normal then recommendation=hard. But there is still some inaccuracy, so we add a third condition. Age=Young (2/2) and Prescription=Myope (3/3) both have a 100% ratio on the remaining instances; we choose the one with greater coverage:

If astigmatism = yes and tear-production = normal and prescription = myope then recommendation = hard

Repeat the process, removing the instances covered by this rule. Then repeat for all classes.
PRISM Classifier

Try it with the other example data set (the weather data): If X then play=yes. Outlook=overcast is 4/4: already perfect, so remove the covered instances and look again.
PRISM Classifier

With the reduced dataset, grow: If X then play=yes

outlook=sunny      2/5
outlook=rainy      3/5
temperature=hot    0/2
temperature=mild   3/5
temperature=cool   2/3
humidity=high      1/5
humidity=normal    4/5
windy=false        4/6
windy=true         1/4

Select humidity=normal (4/5), and since the rule is not yet perfect, look for another condition.
PRISM Classifier

If humidity=normal and X then play=yes

If we could use 'and-not' we could finish immediately with:
    and-not (temperature=cool and windy=true)

But instead the candidate conditions are:
    rainy (2/3), sunny (2/2), cool (2/3), mild (2/2), false (3/3), true (1/2)

So we select windy=false: among the 100%-accurate candidates, it maximises t. Add it to the rule.
PRISM Algorithm

for each class C:
    initialise E to the complete instance set
    while E contains instances with class C:
        create an empty rule R: if X then C
        until R is perfect (or there are no more attributes):
            for each attribute A not in R, and each value v:
                consider adding A=v to R
            select the A and v that maximise the accuracy p/t
            add A=v to R
        remove the instances covered by R from E
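A runnable sketch of that loop for a single class (in Python; ties on p/t are broken by greater coverage, as the slides suggest; assumes at least one non-class attribute, and the names are mine):

    def prism_for_class(records, class_attr, target):
        rules, E = [], list(records)
        while any(r[class_attr] == target for r in E):
            conds, covered = {}, E
            # grow the rule until it is perfect or no attribute is left
            while any(r[class_attr] != target for r in covered):
                candidates = []
                for a in covered[0]:
                    if a == class_attr or a in conds:
                        continue
                    for v in set(r[a] for r in covered):
                        sub = [r for r in covered if r[a] == v]
                        p = sum(1 for r in sub if r[class_attr] == target)
                        candidates.append((p / len(sub), p, a, v))
                if not candidates:
                    break  # no more attributes to add
                _, _, a, v = max(candidates)  # best p/t, then greatest coverage
                conds[a] = v
                covered = [r for r in covered if r[a] == v]
            rules.append(conds)
            E = [r for r in E if not all(r.get(a) == v for a, v in conds.items())]
        return rules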
Issues with PRISM

Overfitting. As we saw, we had 4/5 and then dropped one positive instance in order to lose the one negative instance. But with more examples it might have been 199/200, and we might need to lose 40 positive instances to remove that one negative... that's crazy.

Measure 2: information gain

    p * (log(p/t) - log(P/T))

with p/t as before (positives over total covered); P and T are the positives and total covered before the new condition. This emphasises keeping a large number of positive examples. Use this in PRISM in place of maximising p/t.
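A tiny numeric illustration of the difference (the numbers are mine, not from the slides):

    import math

    def info_gain(p, t, P, T):
        # p * (log(p/t) - log(P/T)): gain of specialising a rule that covered
        # P positives out of T instances into one covering p out of t.
        return p * (math.log(p / t) - math.log(P / T))

    # Current rule: P=100 positives out of T=200 covered instances.
    # Condition A keeps 2 instances, both positive; B keeps 120, of which 80 positive.
    print(2 / 2, 80 / 120)                # 1.0 vs 0.667: raw accuracy p/t picks A
    print(info_gain(2, 2, 100, 200))      # ~1.39
    print(info_gain(80, 120, 100, 200))   # ~23.0: information gain picks B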
Over-Fitting Avoidance

We can split the training set into a growing set and a pruning set: grow rules out using the first, then try to cut the rules back with the second. Two strategies:

• Reduced-error pruning: build the full rule set, then prune the rules.
• Incremental reduced-error pruning: simplify each rule as it is built; the data can be re-split after each rule.

Let's look at the second one.
Incremental Reduced Error Pruning

initialise E to the instance set
until E is empty:
    split E into Grow and Prune (ratio 2:1)
    for each class C:
        generate the best rule for C from Grow
        then, using Prune:
            calculate worth(R) and worth(R - finalCondition)
            while worth(R-) > worth(R), prune the rule
    from the rules for the different classes, select the one with the largest worth(R)
    remove the instances covered by that rule from E
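A minimal sketch of the pruning step, with the (p + (N - n)) / T worth measure from the next slide plugged in (the helper names and dict representation are mine):

    def success_worth(conds, records, target, class_attr="class"):
        # (p + (N - n)) / T: covered positives plus uncovered negatives,
        # over the total number of instances in the pruning set.
        covered = [r for r in records if all(r.get(a) == v for a, v in conds)]
        p = sum(1 for r in covered if r[class_attr] == target)
        n = len(covered) - p
        N = sum(1 for r in records if r[class_attr] != target)
        return (p + (N - n)) / len(records)

    def prune_rule(conds, prune_set, target, worth=success_worth):
        # conds: ordered list of (attribute, value) conditions, as grown.
        # Drop the final condition while doing so increases worth on the prune set.
        while len(conds) > 1 and worth(conds[:-1], prune_set, target) > worth(conds, prune_set, target):
            conds = conds[:-1]
        return conds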
Rule Worth?

How can we generate the worth of the rules? (Witten p. 203)

• (p + (N - n)) / T: (true positives + true negatives) / total number of instances, i.e. covered positives plus uncovered negatives over all instances. But it can prefer the wrong rule: p=2000, t=3000 scores (1000 + N) / T, while p=1000, t=1001 scores only (999 + N) / T, so the far less accurate first rule wins.

• p / t: the same problem the other way round: p=1, t=1 beats p=1000, t=1001.

Is there a simple but intuitive measure of worth?
Issue with Grow/Prune Splitting

Say we have 1000 examples, and we split 2:1 for train/test (666/334), then 2:1 again for grow/prune (444/222)... we're building our rules on less than half of our data! Depending on the dataset, classes may be absent from the growing set, the distributions may be badly skewed, or any number of other statistical problems can arise from random sampling to this degree.

This is ameliorated in the incremental version, as it re-splits often. But we might still want to perform the algorithm several times and pick the best result.
RIPPER

A rules-based classifier from industry.

• If there are 2 classes, learn rules for one and make the other the default.
• If there are more than 2 classes, start with the smallest class and repeat until 2 remain.
• Uses information gain to grow rules.
• Measure for pruning: (p - n) / (p + n), where p and n are the positive and negative examples covered in the pruning set.
• Uses a 'description length' (DL) metric: Ockham's Razor says that the simplest solution is the best, so here the simplest rule set is the best. (We're not going into how to calculate this.)
RIPPER Algorithm

Repeated Incremental Pruning to Produce Error Reduction

split E into Grow/Prune
BUILD:
    repeat until there are no examples left, or DL of ruleset > minDL(rulesets) + 64, or error > 50%:
        GROW: add conditions, chosen by information gain, until the rule is 100% accurate
        PRUNE: prune conditions last-to-first while the worth metric W increases
for each rule R, for each class C:
    split E into Grow/Prune
    remove all instances from Prune covered by the other rules
    GROW and PRUNE two competing rules:
        R1 is a new rule built from scratch
        R2 is generated by adding conditions to R
        prune using worth metric A on the reduced dataset
    replace R by whichever of R, R1, R2 has the smallest DL
if there are uncovered instances of C, return to BUILD to make more rules
calculate DL for the ruleset and for the ruleset with each rule omitted; delete any rule that increases the DL
remove the instances covered by the generated rules

DL = description length; metric W = (p + 1) / (t + 2); metric A = (p + N - n) / T
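The three measures are one-liners; a sketch with the parenthesisation made explicit (consistent with Witten's definitions):

    def ripper_prune_metric(p, n):
        # (p - n) / (p + n) on the pruning set, used while pruning grown rules.
        return (p - n) / (p + n)

    def metric_W(p, t):
        # (p + 1) / (t + 2): Laplace-style worth used in the BUILD phase.
        return (p + 1) / (t + 2)

    def metric_A(p, n, N, T):
        # (p + (N - n)) / T: covered positives plus uncovered negatives over all instances.
        return (p + (N - n)) / T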
Rules with Exceptions

If we get more data after a ruleset has been generated, it can be useful to add exceptions to rules:

If X then class1, unless Y then class2

Consider our humidity rule:
If humidity=normal then play=yes, unless temperature=cool and windy=true then play=no

Exceptions were developed in the Induct system, under the name 'ripple-down rules'.
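A minimal sketch of how such nested exceptions might be evaluated (the recursive representation is my own illustration, not Induct's actual data structure):

    def predict_with_exceptions(instance, rule):
        # rule = (conditions, cls, exceptions); each exception is itself a rule.
        conditions, cls, exceptions = rule
        if not all(instance.get(a) == v for a, v in conditions.items()):
            return None  # rule does not apply
        for exc in exceptions:
            result = predict_with_exceptions(instance, exc)
            if result is not None:
                return result  # the deepest matching exception wins
        return cls

    humidity_rule = ({"humidity": "normal"}, "yes",
                     [({"temperature": "cool", "windy": "true"}, "no", [])])
    print(predict_with_exceptions(
        {"humidity": "normal", "temperature": "cool", "windy": "true"},
        humidity_rule))  # -> no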
Further Reading

• Witten, Sections 3.3, 3.5, 3.6, 4.1, 4.4
• Dunham, Section 4.6
• Han, Section 6.5
• Berry and Browne, Chapter 8