CS60050 Machine Learning: Decision Tree Classifier
(Slides taken from course materials of Tan, Steinbach, Kumar)
Illustrating Classification Task
[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction).]
Intuition behind a decision tree
• Ask a series of questions about a given record
• Each question is about one of the attributes
• The answer to one question decides what question to ask next (or if a next question is needed)
• Continue asking questions until we can infer the class of the given record
Example of a Decision Tree
[Figure: the training data (10 records) and the learned model. Splitting attributes: the root tests Refund (Yes -> NO; No -> MarSt); MarSt: Married -> NO; Single, Divorced -> TaxInc; TaxInc: < 80K -> NO; >= 80K -> YES.]
Structure of a decision tree
• Decision tree: hierarchical structure
• One root node: no incoming edge, zero or more outgoing edges
• Internal nodes: exactly one incoming edge, two or more outgoing edges
• Leaf or terminal nodes: exactly one incoming edge, no outgoing edge
• Each leaf node is assigned a class label
• Each non-leaf node contains a test condition on one of the attributes
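As a concrete illustration of this structure, here is a minimal sketch in Python (the class and field names are illustrative, not from the slides):

```python
class Node:
    """One node of a decision tree.

    A leaf node carries a class label and no children; an internal node
    carries an attribute test condition and one child per test outcome.
    """
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label (meaningful at leaves)
        self.attribute = attribute  # attribute tested (at internal nodes)
        self.children = {}          # maps test outcome -> child Node

    def is_leaf(self):
        return not self.children    # no outgoing edges
```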
Applying a Decision Tree Classifier
[Figure: the induction/deduction diagram again; a tree induction algorithm learns a decision tree from the training set, and the tree is then applied to the test set.]
Apply Model to Test Data
Start from the root of the tree. Once a decision tree has been constructed (learned), it is easy to apply it to test data.
[Figure: a test record is routed down the Refund / MarSt / TaxInc tree, one test condition at a time.]
Apply Model to Test Data (contd.)
[Figure: the test record takes the Refund = No branch and reaches the leaf for MarSt = Married. Assign Cheat to "No".]
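The routing shown above is just a walk from the root to a leaf. A hedged sketch, with the example tree encoded as nested (attribute, branches) tuples; the encoding is an assumption made for illustration:

```python
# Leaf = class label string; internal node = (attribute, branches) pair.
TREE = ("Refund", {
    "Yes": "No",                                   # leaf: Cheat = No
    "No": ("MarSt", {
        "Married": "No",                           # leaf: Cheat = No
        "Single,Divorced": ("TaxInc", {
            "<80K": "No",
            ">=80K": "Yes",
        }),
    }),
})

def predict(tree, record):
    """Walk from the root to a leaf, answering one attribute test per level."""
    while isinstance(tree, tuple):                 # internal node?
        attribute, branches = tree
        tree = branches[record[attribute]]         # follow the matching edge
    return tree                                    # leaf: class label

record = {"Refund": "No", "MarSt": "Married", "TaxInc": "<80K"}
print(predict(TREE, record))                       # -> "No" (assign Cheat = No)
```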
Learning a Decision Tree Classifier
How to learn a decision tree?
[Figure: the induction/deduction diagram, now focusing on the tree induction algorithm that learns the model from the training set.]
A Decision Tree (seen earlier)
[Figure: the training data and the earlier model: split on Refund, then MarSt, then TaxInc.]
Another Decision Tree on the same dataset
There could be more than one tree that fits the same data!
[Figure: an alternative tree that splits on MarSt first (Married -> NO), then Refund, then TaxInc.]
Challenge in learning a decision tree
• Exponentially many decision trees can be constructed from a given set of attributes
• Some of the trees are more 'accurate' or better classifiers than the others
• Finding the optimal tree is computationally infeasible
• Efficient algorithms are available to learn a reasonably accurate (although potentially suboptimal) decision tree in reasonable time
• They employ a greedy strategy
  – Locally optimal choices about which attribute to use next to partition the data
Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT
General Structure of Hunt's Algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
  – Recursively apply the procedure to each subset
[Figure: the 10-record training set feeding a node t with record set Dt, split pending.]
Hunt's Algorithm
[Figure: step 1 on the 10-record training set, a single leaf labeled "Don't Cheat".] The default class is "Don't Cheat" since it is the majority class in the dataset.
Hunt's Algorithm (contd.)
[Figure: the root now splits on Refund.] For now, assume that "Refund" has been decided to be the best attribute for splitting in some way (to be discussed soon).
Hunt's Algorithm (contd.)
[Figure: Refund = Yes -> Don't Cheat; Refund = No -> split on Marital Status (Single, Divorced -> Cheat; Married -> Don't Cheat).]
Hunt's Algorithm (contd.)
[Figure: the sequence of trees built step by step, ending with: Refund = Yes -> Don't Cheat; Refund = No -> Marital Status; Married -> Don't Cheat; Single, Divorced -> Taxable Income; < 80K -> Don't Cheat; >= 80K -> Cheat.]
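A minimal sketch of this recursive procedure in Python; choose_attribute is a placeholder for the attribute-selection criteria discussed in the following slides, and records with no attributes left get the majority label (a common refinement of the basic procedure):

```python
from collections import Counter

def choose_attribute(records, attributes):
    # Placeholder: a real learner picks the attribute minimizing a
    # weighted impurity measure (Gini / entropy, see the next slides).
    return next(iter(attributes))

def hunt(records, attributes, default_class):
    """records: list of (attribute_dict, class_label) pairs."""
    if not records:                     # empty Dt -> leaf with default class yd
        return default_class
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    # All records in one class (or no attributes left) -> leaf node.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Mixed classes -> split on an attribute and recurse on each subset.
    attr = choose_attribute(records, attributes)
    branches = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        branches[value] = hunt(subset, attributes - {attr}, majority)
    return (attr, branches)   # same (attribute, branches) encoding as above
```

Called as hunt(training_records, {"Refund", "MarSt", "TaxInc"}, "No") with an impurity-based choose_attribute, this can reproduce the step-by-step construction sketched above.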
Tree Induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
  – Determine when to stop splitting
How to Specify the Test Condition?
• Depends on attribute types
  – Nominal: two or more distinct values (special case: binary). E.g., marital status: {single, divorced, married}
  – Ordinal: two or more distinct values that have an ordering. E.g., shirt size: {S, M, L, XL}
  – Continuous: continuous range of values
• Depends on the number of ways to split
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  [Figure: CarType -> Family / Sports / Luxury]
• Binary split: divides values into two subsets; need to find the optimal partitioning (see the sketch below).
  [Figure: CarType -> {Sports, Luxury} vs {Family}, OR CarType -> {Family, Luxury} vs {Sports}]
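Since a nominal attribute with k distinct values admits $2^{k-1} - 1$ candidate two-subset partitions, finding the optimal partitioning means enumerating them. A brute-force sketch (the function name is illustrative; fine for small value sets):

```python
from itertools import combinations

def binary_partitions(values):
    """Yield all 2^(k-1) - 1 ways to split values into two non-empty subsets."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]   # fix one value to avoid mirror pairs
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(values) - left
            if right:                      # skip the trivial (all, empty) split
                yield left, right

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# -> {'Family'} vs {'Luxury', 'Sports'}
#    {'Family', 'Luxury'} vs {'Sports'}
#    {'Family', 'Sports'} vs {'Luxury'}
```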
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  [Figure: Size -> Small / Medium / Large]
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  [Figure: Size -> {Small, Medium} vs {Large}, OR Size -> {Small} vs {Medium, Large}]
• What about the split Size -> {Small, Large} vs {Medium}? (It violates the ordering of the values.)
Splitting Based on Continuous Attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A >= v)
    Consider all possible splits and find the best cut
    Can be more compute intensive
Splitting Based on Continuous Attributes (contd.)
[Figure: (i) binary split: Taxable Income > 80K? Yes / No; (ii) multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), >= 80K.]
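A small sketch of the two static bucketing strategies named above, equal-interval and equal-frequency, using the running example's Taxable Income values (in K); the function names and k = 4 are illustrative choices:

```python
def equal_interval_edges(values, k):
    """Cut points for k buckets of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Cut points for k buckets holding (roughly) equal record counts."""
    v = sorted(values)
    return [v[len(v) * i // k] for i in range(1, k)]

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
print(equal_interval_edges(incomes, 4))   # [100.0, 140.0, 180.0]
print(equal_frequency_edges(incomes, 4))  # [75, 95, 120] (rough quartiles)
```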
Tree Induction (recap)
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records: how to specify the attribute test condition? How to determine the best split? (taken up next)
  – Determine when to stop splitting
What is meant by "determine the best split"?
Before splitting: 10 records of class C0, 10 records of class C1.
[Figure: three candidate test conditions:
  – Own Car? Yes -> C0: 6, C1: 4; No -> C0: 4, C1: 6
  – Car Type? Family -> C0: 1, C1: 3; Sports -> C0: 8, C1: 0; Luxury -> C0: 1, C1: 7
  – Student ID? c1 ... c20 -> one record each (C0: 1, C1: 0 or C0: 0, C1: 1)]
Which test condition is the best?
How to determine the Best Split
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 (non-homogeneous, high degree of impurity)
  – C0: 9, C1: 1 (homogeneous, low degree of impurity)
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
[Figure: before splitting the node has impurity M0. Candidate test A? yields nodes N1, N2 with impurities M1, M2, combining (weighted) into M12; candidate test B? yields nodes N3, N4 with impurities M3, M4, combining into M34.]
Gain = M0 - M12 vs. M0 - M34: choose the test with the higher gain.
Measures of Node Impurity (contd.): first, the Gini Index.
Measure of Impurity: GINI Index
• Gini index for a given node t:
  $GINI(t) = 1 - \sum_j [p(j|t)]^2$
  where p(j|t) is the relative frequency of class j at node t
Examples for computing GINI
$GINI(t) = 1 - \sum_j [p(j|t)]^2$
• P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
• P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
• P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
Measure of Impurity: GINI Index (contd.)
• Gini index for a given node t:
  $GINI(t) = 1 - \sum_j [p(j|t)]^2$
  where p(j|t) is the relative frequency of class j at node t
• Maximum $(1 - 1/n_c)$ when records are equally distributed among all classes, implying least interesting information ($n_c$: number of classes)
• Minimum (0.0) when all records belong to one class, implying most interesting information
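These values are easy to check in code; a small sketch reproducing the worked examples above (printed values are approximate):

```python
def gini(counts):
    """Gini index of a node; counts: per-class record counts at the node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0     (pure node: minimum)
print(gini([1, 5]))   # ~0.278  = 1 - (1/6)^2 - (5/6)^2
print(gini([2, 4]))   # ~0.444
print(gini([3, 3]))   # 0.5     = 1 - 1/n_c for n_c = 2 classes (maximum)
```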
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)$
  where $n_i$ = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought for.
[Figure: B? splits into Node N1 (C1: 5, C2: 2) and Node N2 (C1: 1, C2: 4).]
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 x 0.408 + 5/12 x 0.320 = 0.371
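A sketch of the weighted computation, reproducing the numbers above (printed values are approximate):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split; partitions: one per-class count list per child."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(gini([5, 2]))                  # ~0.408  (node N1)
print(gini([1, 4]))                  # 0.320   (node N2)
print(gini_split([[5, 2], [1, 4]]))  # 7/12 * 0.408 + 5/12 * 0.32 = ~0.371
```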
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions
[Figure: count matrices for a multi-way split and for two-way splits (finding the best partition of values).]
Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value v (e.g., Taxable Income > 80K? Yes / No)
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A >= v
• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index...
• For efficient computation: for each attribute,
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index
[Figure: the sorted Taxable Income values with candidate split positions and the count matrix updated at each step.]
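A sketch of this sort-and-scan procedure for a single continuous attribute, using the 10 Taxable Income values and Cheat labels of the running example; the function name and midpoint cut points are implementation choices:

```python
def best_split(values, labels, classes=("Yes", "No")):
    """Return (best cut point v, Gini of the split A < v vs A >= v)."""
    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    pairs = sorted(zip(values, labels))
    left = {c: 0 for c in classes}              # counts for A < v
    right = {c: 0 for c in classes}             # counts for A >= v
    for _, lab in pairs:
        right[lab] += 1
    n, best = len(pairs), (None, float("inf"))
    for i in range(1, n):                       # candidate cuts between values
        left[pairs[i - 1][1]] += 1              # move one record leftwards
        right[pairs[i - 1][1]] -= 1
        if pairs[i - 1][0] == pairs[i][0]:
            continue                            # no cut inside equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2 # midpoint cut point
        g = i / n * gini(left) + (n - i) / n * gini(right)
        if g < best[1]:
            best = (v, g)
    return best

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(incomes, cheat))               # (97.5, 0.3) on this data
```

Each record is moved across the boundary exactly once, so the scan is linear after the initial sort, avoiding the repeated database scans of the naive method.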
Measures of Node Impurity (contd.): next, Entropy.
Alternative Splitting Criteria based on INFO
• Entropy at a given node t:
  $Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$
  where p(j|t) is the relative frequency of class j at node t
• Measures homogeneity of a node
Examples for computing Entropy
$Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$
• P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0
• P(C1) = 1/6, P(C2) = 5/6: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
• P(C1) = 2/6, P(C2) = 4/6: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Alternative Splitting Criteria based on INFO (contd.)
• Entropy at a given node t:
  $Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$
  where p(j|t) is the relative frequency of class j at node t
• Measures homogeneity of a node
• Maximum ($\log n_c$) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
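As with Gini, the worked examples are easy to reproduce; a small sketch (the 0 log 0 term is taken as 0 by convention, and printed values are approximate):

```python
from math import log2

def entropy(counts):
    """Entropy of a node; counts: per-class record counts (0 log 0 -> 0)."""
    n = sum(counts)
    return 0.0 - sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))   # 0.0    (pure node: most information)
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92
print(entropy([3, 3]))   # 1.0    = log2(n_c) for n_c = 2 (least information)
```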
Splitting Based on INFO...
• Information Gain:
  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$
  Parent node p is split into k partitions; $n_i$ is the number of records in partition i
• Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
• Used in ID3 and C4.5
• Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...
• Gain Ratio:
  $GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}$
  Parent node p is split into k partitions; $n_i$ is the number of records in partition i
• Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (large number of small partitions) is penalized!
• Used in C4.5
• Designed to overcome the disadvantage of Information Gain
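A sketch computing both quantities for one candidate split. The two count configurations below are illustrative, chosen so that a 2-way and a 4-way split have equal GAIN but different SplitINFO:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return 0.0 - sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_and_ratio(parent, partitions):
    """parent: per-class counts at node p; partitions: one count list per child."""
    n = sum(parent)
    children = sum(sum(p) / n * entropy(p) for p in partitions)
    gain = entropy(parent) - children
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return gain, gain / split_info

print(gain_and_ratio([10, 10], [[8, 2], [2, 8]]))              # gain ~0.278, ratio ~0.278
print(gain_and_ratio([10, 10], [[4, 1], [4, 1], [1, 4], [1, 4]]))  # gain ~0.278, ratio ~0.139
```

Both splits achieve the same GAIN, but SplitINFO (1 bit for the 2-way split vs. 2 bits for the 4-way split) halves the 4-way split's gain ratio, penalizing the many-partition split exactly as described above.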
Measures of Node Impurity (contd.): next, Misclassification error.