Decision Tree Learning
Kelby Lee
Overview
• What is a Decision Tree
• ID3
• REP
• IREP
• RIPPER
• Application
What is a Decision Tree
• Built top-down, starting from a concept that covers all examples
• At each node, selects the attribute that best classifies the examples
• Greedy algorithm: picks the attribute that classifies the most examples and never backtracks
• ID3 is the classic instance
ID3 Algorithm
ID3(Examples, Target_attribute, Attributes)
• Create a Root node for the tree
• If Examples are all positive, return the single-node tree Root with label = +
• If Examples are all negative, return the single-node tree Root with label = -
• If Attributes is empty, return the single-node tree Root with label = the most common value of Target_attribute in Examples
ID3 Algorithm (continued)
• Otherwise:
• A ← Best_Attribute(Attributes, Examples)
• Root ← A
• For each value vi of A:
• Add a new tree branch for A = vi
• Let Examples_vi be the subset of Examples with value vi for A
• If Examples_vi is empty, add a leaf node with label = the most common value of Target_attribute in Examples
• Otherwise, add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
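To make the recursion concrete, here is a minimal Python sketch of the procedure above. The data layout (a list of dicts plus the name of the label column) and all helper names are illustrative assumptions, not part of the original slides.

import math
from collections import Counter

def entropy(examples, target):
    # Shannon entropy of the target labels over the examples.
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr, target):
    # Entropy reduction obtained by splitting the examples on attr.
    total = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:     # all examples positive, or all negative
        return labels[0]
    if not attributes:            # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with maximum information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree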
Selecting the Best Attribute
• ID3 scores each candidate attribute by its information gain
• Information gain measures how well a given attribute separates the training examples according to their target classification
Information Gain: Worked Example
• Examples: {E1+, E2+, E3-, E4-}
• att1 splits them into {E1+, E2+} and {E3-, E4-} (both subsets pure)
• att2 splits them into {E1+, E3-} and {E2+, E4-} (both subsets evenly mixed)
• Information gain: att1 = 1, att2 = 0
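Working the example through the standard definitions of entropy and information gain:

\[
\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-,
\qquad
\mathrm{Gain}(S,A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v).
\]

Here $\mathrm{Entropy}(S) = 1$, since $S$ holds two positive and two negative examples. att1 produces two pure subsets, so $\mathrm{Gain}(S,\mathrm{att1}) = 1 - \frac{2}{4}\cdot 0 - \frac{2}{4}\cdot 0 = 1$, while att2 produces two evenly mixed subsets, each with entropy 1, so $\mathrm{Gain}(S,\mathrm{att2}) = 1 - \frac{2}{4}\cdot 1 - \frac{2}{4}\cdot 1 = 0$.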
Tree Pruning
• A fully grown tree tends to overfit the training data
• Pruning simplifies the tree, and in most cases improves accuracy on unseen data
REP
• Reduced Error Pruning
• Deletes single conditions or single rules as long as error on a separate pruning set does not increase (see the sketch below)
• Improves accuracy on noisy data
• Drawback: O(n^4) time on large data sets
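A minimal sketch of that pruning loop, assuming an ordered rule set of (class, conditions) pairs, a held-out pruning set of dict records, and a default class; the data layout and all names here are illustrative assumptions:

def error_rate(ruleset, prune_set, target, default):
    # Fraction of pruning examples the ordered rule set misclassifies.
    wrong = 0
    for e in prune_set:
        predicted = default
        for label, conditions in ruleset:
            if all(e[a] == v for a, v in conditions):
                predicted = label
                break
        wrong += predicted != e[target]
    return wrong / len(prune_set)

def reduced_error_prune(ruleset, prune_set, target, default):
    # Greedily delete single conditions or whole rules as long as
    # the error on the pruning set does not increase.
    improved = True
    while improved:
        improved = False
        best = error_rate(ruleset, prune_set, target, default)
        candidates = []
        for i, (label, conds) in enumerate(ruleset):
            candidates.append(ruleset[:i] + ruleset[i + 1:])   # drop rule i
            for j in range(len(conds)):                        # drop one condition
                pruned = (label, conds[:j] + conds[j + 1:])
                candidates.append(ruleset[:i] + [pruned] + ruleset[i + 1:])
        for cand in candidates:
            if error_rate(cand, prune_set, target, default) <= best:
                ruleset, improved = cand, True
                break
    return ruleset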
IREP
• Incremental Reduced Error Pruning
• Grows and prunes one rule at a time, then eliminates all examples covered by that rule
• Stops when no positive examples remain or pruning produces an unacceptable error rate
IREP Algorithm

PROCEDURE IREP(Pos, Neg)
BEGIN
  Ruleset := {}
  WHILE Pos != {} DO
    /* Grow and prune a new rule */
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    Rule := GrowRule(GrowPos, GrowNeg)
    Rule := PruneRule(Rule, PrunePos, PruneNeg)
    IF error rate of Rule on (PrunePos, PruneNeg) exceeds 50% THEN
      RETURN Ruleset
    ELSE
      Add Rule to Ruleset
      Remove examples covered by Rule from (Pos, Neg)
    ENDIF
  ENDWHILE
  RETURN Ruleset
END
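The slides leave GrowRule and PruneRule abstract. The sketch below fills them in along the lines of Cohen's description: grow by repeatedly adding the single condition with the best FOIL information gain until the rule covers no negative examples, then delete final sequences of conditions while the value (p - n)/(p + n) on the pruning set does not get worse. The dict-based data layout and all helper names are illustrative assumptions.

import math

def covers(conditions, example):
    return all(example[a] == v for a, v in conditions)

def foil_gain(conditions, cond, pos, neg):
    # FOIL information gain of adding cond to the rule.
    p0 = sum(covers(conditions, e) for e in pos)
    n0 = sum(covers(conditions, e) for e in neg)
    new = conditions + [cond]
    p1 = sum(covers(new, e) for e in pos)
    n1 = sum(covers(new, e) for e in neg)
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def grow_rule(grow_pos, grow_neg, attributes):
    # Assumes grow_pos is non-empty; candidate conditions are
    # attribute=value tests taken from the positive examples.
    conditions = []
    while any(covers(conditions, e) for e in grow_neg):
        candidates = [(a, e[a]) for e in grow_pos for a in attributes]
        best = max(candidates, key=lambda c: foil_gain(conditions, c, grow_pos, grow_neg))
        if foil_gain(conditions, best, grow_pos, grow_neg) <= 0:
            break
        conditions.append(best)
    return conditions

def prune_rule(conditions, prune_pos, prune_neg):
    # Delete a final sequence of conditions to maximize (p - n)/(p + n).
    def value(conds):
        p = sum(covers(conds, e) for e in prune_pos)
        n = sum(covers(conds, e) for e in prune_neg)
        return (p - n) / (p + n) if p + n else float("-inf")
    best = conditions
    for k in range(len(conditions) - 1, -1, -1):
        if value(conditions[:k]) >= value(best):
            best = conditions[:k]
    return best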
RIPPER
• Repeated grow-and-simplify produces quite different results than REP
• Repeatedly prunes and rebuilds the rule set to minimize its error
• Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
RIPPER Algorithm

PROCEDURE RIPPERk(Pos, Neg)
BEGIN
  Ruleset := IREP(Pos, Neg)
  REPEAT k TIMES
    Ruleset := Optimize(Ruleset, Pos, Neg)
    UncovPos := Pos \ {data covered by Ruleset}
    UncovNeg := Neg \ {data covered by Ruleset}
    Ruleset := Ruleset ∪ IREP(UncovPos, UncovNeg)
  ENDREPEAT
  RETURN Ruleset
END
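In Python, the outer loop alone might look like the sketch below; irep and optimize stand for the procedures on the surrounding slides, and every name here is an illustrative assumption:

def covered(ruleset, example):
    # True if the conditions of some rule all hold for the example.
    return any(all(example[a] == v for a, v in conds) for _cls, conds in ruleset)

def ripper_k(pos, neg, k=2):
    # RIPPERk: build an initial rule set with IREP, then alternate
    # k times between optimizing it and re-covering the residue.
    ruleset = irep(pos, neg)
    for _ in range(k):
        ruleset = optimize(ruleset, pos, neg)
        # Examples not covered by any rule in the current set.
        uncov_pos = [e for e in pos if not covered(ruleset, e)]
        uncov_neg = [e for e in neg if not covered(ruleset, e)]
        ruleset += irep(uncov_pos, uncov_neg)
    return ruleset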
Optimization Function

FUNCTION Optimize(Ruleset, Pos, Neg)
BEGIN
  FOR each rule r ∈ Ruleset DO
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    /* Compute a replacement for r, grown from scratch */
    r' := GrowRule(GrowPos, GrowNeg)
    r' := PruneRule(r', PrunePos, PruneNeg)
          guided by error of Ruleset \ {r} ∪ {r'}
    /* Compute a revision of r, grown from r itself */
    r'' := GrowRule(GrowPos, GrowNeg) starting from r
    r'' := PruneRule(r'', PrunePos, PruneNeg)
          guided by error of Ruleset \ {r} ∪ {r''}
    Replace r in Ruleset with the best of r, r', r''
          guided by description length of Compress(Ruleset \ {r} ∪ {x})
  ENDFOR
  RETURN Ruleset
END
RIPPER Data

3,6.0E+00,6.0E+00,4.0E+00,none,35,empl_contr,7.444444444444445E+00,14,false,9,gnr,true,full,true,full,good.
2,4.5E+00,4.0E+00,3.913333333333334E+00,none,40,empl_contr,7.444444444444445E+00,4,false,10,gnr,true,half,true,full,good.
3,5.0E+00,5.0E+00,5.0E+00,none,40,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,12,avg,true,half,true,half,good.
2,4.6E+00,4.6E+00,3.913333333333334E+00,tcf,38,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,1.109433962264151E+01,ba,true,half,true,half,good.
RIPPER Names File

good,bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: none, tcf, tc.
hours: continuous.
pension: none, ret_allw, empl_contr.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: false, true.
holidays: continuous.
vacation: ba, avg, gnr.
lngtrm_disabil: false, true.
dntl_ins: none, half, full.
bereavement: false, true.
empl_hplan: none, half, full.
RIPPER Output

Final hypothesis is:
bad :- wage1<=2.8 (14/3).
bad :- lngtrm_disabil=false (5/0).
default good (34/1).

===================== summary ==================
Train error rate: 7.02% +/- 3.41% (57 datapoints) <<
Hypothesis size: 2 rules, 4 conditions
Learning time: 0.01 sec

The counts after each rule are its coverage: the first rule covers 14 examples of class bad and 3 of class good.
RIPPER Hypothesis

bad 14 3 IF wage1 <= 2.8 .
bad 5 0 IF lngtrm_disabil = false .
good 34 1 IF . .
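To show how such a hypothesis is applied, here is a minimal sketch that evaluates the rules in order and falls through to the default; the rule encoding is an illustrative assumption:

# Ordered rule list: (predicted class, test); the empty test is the default.
rules = [
    ("bad",  lambda r: r["wage1"] <= 2.8),
    ("bad",  lambda r: r["lngtrm_disabil"] == "false"),
    ("good", lambda r: True),            # default rule
]

def classify(record):
    # Return the class of the first rule whose test the record satisfies.
    for label, test in rules:
        if test(record):
            return label

record = {"wage1": 4.5, "lngtrm_disabil": "true"}
print(classify(record))  # -> good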
IDS
• Intrusion Detection System
• Uses data mining to detect anomalous traffic
• Potentially better than pattern matching, since it may detect previously undiscovered attacks
RIPPER IDS Data

86,543520084,192168000120,2698,192168000190,22,6,17,40,2096,158723779,14054,normal.
87,543520084,192168000190,22,192168000120,2698,6,16,40,58387,39130843,46725,normal.
...........................
11,543520084,192168000190,80,192168000120,2703,6,16,40,58400,39162494,46738,anomaly.
12,543520084,192168000190,80,192168000120,2703,6,16,1500,58400,39162494,45277,anomaly.
RIPPER IDS Names

normal,anomaly.
recID: ignore.
timestamp: symbolic.
sourceIP: set.
sourcePORT: symbolic.
destIP: set.
destPORT: symbolic.
protocol: symbolic.
flags: symbolic.
length: symbolic.
winsize: symbolic.
ack: symbolic.
checksum: symbolic.
RIPPER Output

Final hypothesis is:
anomaly :- sourcePORT='80' (33/0).
anomaly :- destPORT='80' (35/0).
anomaly :- ack='7.01238e+07' (3/0).
anomaly :- ack='7.03859e+07' (2/0).
default normal (87/0).

================= summary =====================
Train error rate: 0.00% +/- 0.00% (160 datapoints) <<
Hypothesis size: 4 rules, 8 conditions
Learning time: 0.01 sec
RIPPER Hypothesis

anomaly 33 0 IF sourcePORT = 80 .
anomaly 35 0 IF destPORT = 80 .
anomaly 3 0 IF ack = 7.01238e+07 .
anomaly 2 0 IF ack = 7.03859e+07 .
normal 87 0 IF . .
Conclusion
• What is a Decision Tree
• ID3
• REP
• IREP
• RIPPER
• Application