CS60050 Machine Learning: Decision Tree Classifier
(Slides taken from course materials of Tan, Steinbach, Kumar)
Illustrating Classification Task
[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction).]
Intuition behind a decision tree
• Ask a series of questions about a given record
• Each question is about one of the attributes
• The answer to one question decides what question to ask next (or if a next question is needed)
• Continue asking questions until we can infer the class of the given record
Example of a Decision Tree
[Figure: the training data (10 records) and the learned model. Splitting attributes: the root tests Refund (Yes -> NO; No -> MarSt); MarSt: Married -> NO; Single, Divorced -> TaxInc; TaxInc: < 80K -> NO; >= 80K -> YES.]
Structure of a decision tree
• Decision tree: hierarchical structure
• One root node: no incoming edge, zero or more outgoing edges
• Internal nodes: exactly one incoming edge, two or more outgoing edges
• Leaf or terminal nodes: exactly one incoming edge, no outgoing edge
• Each leaf node is assigned a class label
• Each non-leaf node contains a test condition on one of the attributes
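As a concrete illustration of this structure, here is a minimal sketch in Python (the class and field names are illustrative, not from the slides):

```python
class Node:
    """One node of a decision tree.

    A leaf node carries a class label and no children; an internal node
    carries an attribute test condition and one child per test outcome.
    """
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label (meaningful at leaves)
        self.attribute = attribute  # attribute tested (at internal nodes)
        self.children = {}          # maps test outcome -> child Node

    def is_leaf(self):
        return not self.children    # no outgoing edges
```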
Applying a Decision Tree Classifier
[Figure: the induction/deduction diagram again; a tree induction algorithm learns a decision tree from the training set, and the tree is then applied to the test set.]
Apply Model to Test Data
Start from the root of the tree. Once a decision tree has been constructed (learned), it is easy to apply it to test data.
[Figure: a test record is routed down the Refund / MarSt / TaxInc tree, one test condition at a time.]
Apply Model to Test Data (contd.)
[Figure: the test record takes the Refund = No branch and reaches the leaf for MarSt = Married. Assign Cheat to "No".]
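The routing shown above is just a walk from the root to a leaf. A hedged sketch, with the example tree encoded as nested (attribute, branches) tuples; the encoding is an assumption made for illustration:

```python
# Leaf = class label string; internal node = (attribute, branches) pair.
TREE = ("Refund", {
    "Yes": "No",                                   # leaf: Cheat = No
    "No": ("MarSt", {
        "Married": "No",                           # leaf: Cheat = No
        "Single,Divorced": ("TaxInc", {
            "<80K": "No",
            ">=80K": "Yes",
        }),
    }),
})

def predict(tree, record):
    """Walk from the root to a leaf, answering one attribute test per level."""
    while isinstance(tree, tuple):                 # internal node?
        attribute, branches = tree
        tree = branches[record[attribute]]         # follow the matching edge
    return tree                                    # leaf: class label

record = {"Refund": "No", "MarSt": "Married", "TaxInc": "<80K"}
print(predict(TREE, record))                       # -> "No" (assign Cheat = No)
```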
Learning a Decision Tree Classifier
How to learn a decision tree?
[Figure: the induction/deduction diagram, now focusing on the tree induction algorithm that learns the model from the training set.]
A Decision Tree (seen earlier)
[Figure: the training data and the earlier model: split on Refund, then MarSt, then TaxInc.]
Another Decision Tree on the same dataset
There could be more than one tree that fits the same data!
[Figure: an alternative tree that splits on MarSt first (Married -> NO), then Refund, then TaxInc.]
Challenge in learning a decision tree
• Exponentially many decision trees can be constructed from a given set of attributes
• Some of the trees are more 'accurate' or better classifiers than the others
• Finding the optimal tree is computationally infeasible
• Efficient algorithms are available to learn a reasonably accurate (although potentially suboptimal) decision tree in reasonable time
• They employ a greedy strategy
  – Locally optimal choices about which attribute to use next to partition the data
Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT
General Structure of Hunt's Algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
  – Recursively apply the procedure to each subset
[Figure: the 10-record training set feeding a node t with record set Dt, split pending.]
Hunt's Algorithm
[Figure: step 1 on the 10-record training set, a single leaf labeled "Don't Cheat".] The default class is "Don't Cheat" since it is the majority class in the dataset.
Hunt's Algorithm (contd.)
[Figure: the root now splits on Refund.] For now, assume that "Refund" has been decided to be the best attribute for splitting in some way (to be discussed soon).
Hunt's Algorithm (contd.)
[Figure: Refund = Yes -> Don't Cheat; Refund = No -> split on Marital Status (Single, Divorced -> Cheat; Married -> Don't Cheat).]
Hunt's Algorithm (contd.)
[Figure: the sequence of trees built step by step, ending with: Refund = Yes -> Don't Cheat; Refund = No -> Marital Status; Married -> Don't Cheat; Single, Divorced -> Taxable Income; < 80K -> Don't Cheat; >= 80K -> Cheat.]
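A minimal sketch of this recursive procedure in Python; choose_attribute is a placeholder for the attribute-selection criteria discussed in the following slides, and records with no attributes left get the majority label (a common refinement of the basic procedure):

```python
from collections import Counter

def choose_attribute(records, attributes):
    # Placeholder: a real learner picks the attribute minimizing a
    # weighted impurity measure (Gini / entropy, see the next slides).
    return next(iter(attributes))

def hunt(records, attributes, default_class):
    """records: list of (attribute_dict, class_label) pairs."""
    if not records:                     # empty Dt -> leaf with default class yd
        return default_class
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    # All records in one class (or no attributes left) -> leaf node.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Mixed classes -> split on an attribute and recurse on each subset.
    attr = choose_attribute(records, attributes)
    branches = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        branches[value] = hunt(subset, attributes - {attr}, majority)
    return (attr, branches)   # same (attribute, branches) encoding as above
```

Called as hunt(training_records, {"Refund", "MarSt", "TaxInc"}, "No") with an impurity-based choose_attribute, this can reproduce the step-by-step construction sketched above.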
Tree Induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
  – Determine when to stop splitting
How to Specify the Test Condition?
• Depends on attribute types
  – Nominal: two or more distinct values (special case: binary). E.g., marital status: {single, divorced, married}
  – Ordinal: two or more distinct values that have an ordering. E.g., shirt size: {S, M, L, XL}
  – Continuous: continuous range of values
• Depends on the number of ways to split
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  [Figure: CarType -> Family / Sports / Luxury]
• Binary split: divides values into two subsets; need to find the optimal partitioning (see the sketch below).
  [Figure: CarType -> {Sports, Luxury} vs {Family}, OR CarType -> {Family, Luxury} vs {Sports}]
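Since a nominal attribute with k distinct values admits $2^{k-1} - 1$ candidate two-subset partitions, finding the optimal partitioning means enumerating them. A brute-force sketch (the function name is illustrative; fine for small value sets):

```python
from itertools import combinations

def binary_partitions(values):
    """Yield all 2^(k-1) - 1 ways to split values into two non-empty subsets."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]   # fix one value to avoid mirror pairs
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(values) - left
            if right:                      # skip the trivial (all, empty) split
                yield left, right

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# -> {'Family'} vs {'Luxury', 'Sports'}
#    {'Family', 'Luxury'} vs {'Sports'}
#    {'Family', 'Sports'} vs {'Luxury'}
```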
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  [Figure: Size -> Small / Medium / Large]
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  [Figure: Size -> {Small, Medium} vs {Large}, OR Size -> {Small} vs {Medium, Large}]
• What about the split Size -> {Small, Large} vs {Medium}? (It violates the ordering of the values.)
Splitting Based on Continuous Attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A >= v)
    Consider all possible splits and find the best cut
    Can be more compute intensive
Splitting Based on Continuous Attributes (contd.)
[Figure: (i) binary split: Taxable Income > 80K? Yes / No; (ii) multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), >= 80K.]
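A small sketch of the two static bucketing strategies named above, equal-interval and equal-frequency, using the running example's Taxable Income values (in K); the function names and k = 4 are illustrative choices:

```python
def equal_interval_edges(values, k):
    """Cut points for k buckets of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Cut points for k buckets holding (roughly) equal record counts."""
    v = sorted(values)
    return [v[len(v) * i // k] for i in range(1, k)]

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
print(equal_interval_edges(incomes, 4))   # [100.0, 140.0, 180.0]
print(equal_frequency_edges(incomes, 4))  # [75, 95, 120] (rough quartiles)
```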
Tree Induction (recap)
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records: how to specify the attribute test condition? How to determine the best split? (taken up next)
  – Determine when to stop splitting
What is meant by "determine the best split"?
Before splitting: 10 records of class C0, 10 records of class C1.
[Figure: three candidate test conditions:
  – Own Car? Yes -> C0: 6, C1: 4; No -> C0: 4, C1: 6
  – Car Type? Family -> C0: 1, C1: 3; Sports -> C0: 8, C1: 0; Luxury -> C0: 1, C1: 7
  – Student ID? c1 ... c20 -> one record each (C0: 1, C1: 0 or C0: 0, C1: 1)]
Which test condition is the best?
How to determine the Best Split
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 (non-homogeneous, high degree of impurity)
  – C0: 9, C1: 1 (homogeneous, low degree of impurity)
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
[Figure: before splitting the node has impurity M0. Candidate test A? yields nodes N1, N2 with impurities M1, M2, combining (weighted) into M12; candidate test B? yields nodes N3, N4 with impurities M3, M4, combining into M34.]
Gain = M0 - M12 vs. M0 - M34: choose the test with the higher gain.
Measures of Node Impurity (contd.): first, the Gini Index.
Measure of Impurity: GINI Index
• Gini index for a given node t:
  $GINI(t) = 1 - \sum_j [p(j|t)]^2$
  where p(j|t) is the relative frequency of class j at node t
Examples for computing GINI
$GINI(t) = 1 - \sum_j [p(j|t)]^2$
• P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
• P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
• P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
Measure of Impurity: GINI Index (contd.)
• Gini index for a given node t:
  $GINI(t) = 1 - \sum_j [p(j|t)]^2$
  where p(j|t) is the relative frequency of class j at node t
• Maximum $(1 - 1/n_c)$ when records are equally distributed among all classes, implying least interesting information ($n_c$: number of classes)
• Minimum (0.0) when all records belong to one class, implying most interesting information
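These values are easy to check in code; a small sketch reproducing the worked examples above (printed values are approximate):

```python
def gini(counts):
    """Gini index of a node; counts: per-class record counts at the node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0     (pure node: minimum)
print(gini([1, 5]))   # ~0.278  = 1 - (1/6)^2 - (5/6)^2
print(gini([2, 4]))   # ~0.444
print(gini([3, 3]))   # 0.5     = 1 - 1/n_c for n_c = 2 classes (maximum)
```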
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)$
  where $n_i$ = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought for.
[Figure: B? splits into Node N1 (C1: 5, C2: 2) and Node N2 (C1: 1, C2: 4).]
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 x 0.408 + 5/12 x 0.320 = 0.371
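A sketch of the weighted computation, reproducing the numbers above (printed values are approximate):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split; partitions: one per-class count list per child."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(gini([5, 2]))                  # ~0.408  (node N1)
print(gini([1, 4]))                  # 0.320   (node N2)
print(gini_split([[5, 2], [1, 4]]))  # 7/12 * 0.408 + 5/12 * 0.32 = ~0.371
```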
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions
[Figure: count matrices for a multi-way split and for two-way splits (finding the best partition of values).]
Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value v (e.g., Taxable Income > 80K? Yes / No)
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A >= v
• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index...
• For efficient computation: for each attribute,
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index
[Figure: the sorted Taxable Income values with candidate split positions and the count matrix updated at each step.]
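A sketch of this sort-and-scan procedure for a single continuous attribute, using the 10 Taxable Income values and Cheat labels of the running example; the function name and midpoint cut points are implementation choices:

```python
def best_split(values, labels, classes=("Yes", "No")):
    """Return (best cut point v, Gini of the split A < v vs A >= v)."""
    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    pairs = sorted(zip(values, labels))
    left = {c: 0 for c in classes}              # counts for A < v
    right = {c: 0 for c in classes}             # counts for A >= v
    for _, lab in pairs:
        right[lab] += 1
    n, best = len(pairs), (None, float("inf"))
    for i in range(1, n):                       # candidate cuts between values
        left[pairs[i - 1][1]] += 1              # move one record leftwards
        right[pairs[i - 1][1]] -= 1
        if pairs[i - 1][0] == pairs[i][0]:
            continue                            # no cut inside equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2 # midpoint cut point
        g = i / n * gini(left) + (n - i) / n * gini(right)
        if g < best[1]:
            best = (v, g)
    return best

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(incomes, cheat))               # (97.5, 0.3) on this data
```

Each record is moved across the boundary exactly once, so the scan is linear after the initial sort, avoiding the repeated database scans of the naive method.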
Measures of Node Impurity (contd.): next, Entropy.
Alternative Splitting Criteria based on INFO
• Entropy at a given node t:
  $Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$
  where p(j|t) is the relative frequency of class j at node t
• Measures homogeneity of a node
Examples for computing Entropy
$Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$
• P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0
• P(C1) = 1/6, P(C2) = 5/6: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
• P(C1) = 2/6, P(C2) = 4/6: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Alternative Splitting Criteria based on INFO (contd.)
• Entropy at a given node t:
  $Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$
  where p(j|t) is the relative frequency of class j at node t
• Measures homogeneity of a node
• Maximum ($\log n_c$) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
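As with Gini, the worked examples are easy to reproduce; a small sketch (the 0 log 0 term is taken as 0 by convention, and printed values are approximate):

```python
from math import log2

def entropy(counts):
    """Entropy of a node; counts: per-class record counts (0 log 0 -> 0)."""
    n = sum(counts)
    return 0.0 - sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))   # 0.0    (pure node: most information)
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92
print(entropy([3, 3]))   # 1.0    = log2(n_c) for n_c = 2 (least information)
```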
Splitting Based on INFO...
• Information Gain:
  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$
  Parent node p is split into k partitions; $n_i$ is the number of records in partition i
• Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
• Used in ID3 and C4.5
• Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...
• Gain Ratio:
  $GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}$
  Parent node p is split into k partitions; $n_i$ is the number of records in partition i
• Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (large number of small partitions) is penalized!
• Used in C4.5
• Designed to overcome the disadvantage of Information Gain
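A sketch computing both quantities for one candidate split. The two count configurations below are illustrative, chosen so that a 2-way and a 4-way split have equal GAIN but different SplitINFO:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return 0.0 - sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_and_ratio(parent, partitions):
    """parent: per-class counts at node p; partitions: one count list per child."""
    n = sum(parent)
    children = sum(sum(p) / n * entropy(p) for p in partitions)
    gain = entropy(parent) - children
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return gain, gain / split_info

print(gain_and_ratio([10, 10], [[8, 2], [2, 8]]))              # gain ~0.278, ratio ~0.278
print(gain_and_ratio([10, 10], [[4, 1], [4, 1], [1, 4], [1, 4]]))  # gain ~0.278, ratio ~0.139
```

Both splits achieve the same GAIN, but SplitINFO (1 bit for the 2-way split vs. 2 bits for the 4-way split) halves the 4-way split's gain ratio, penalizing the many-partition split exactly as described above.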
Measures of Node Impurity (contd.): next, Misclassification error.