Data Mining • Decision Trees • dr Iwona Schab
Decision Trees • Method of classification • Recursive procedure which (progressively) divides a set of n units into groups according to a division rule • Designed for supervised prediction problems (i.e. a set of input variables is used to predict the value of a target variable) • The primary goal is prediction • The fitted tree model is used to predict the target variable for new cases (i.e. to score new cases/data) • Result: a final partition of the observations and the Boolean rules needed to score new data
Decision Tree • A predictive model represented in a tree-like structure • Root node • A split based on the values of the input • Internal node • Terminal node – the leaf
Decision Tree • Nonparametric method • Allows for modelling nonlinear relationships • Sound concept • Easy to interpret • Robustness against outliers • Detection and handling of potential interactions between input variables • Additional implementations: categorisation of continuous variables, grouping of nominal values
Decision Trees • Types: • Classification trees (categorical response variable) • the leaves give the predicted class and the probability of class membership • Regression trees (continuous response variable) • the leaves give the predicted value of the target • Exemplary applications: • Handwriting recognition • Medical research • Financial and capital markets
Decision Tree • The path to each leaf expresses a Boolean rule: if … then … • The "regions" of the input space are determined by the split values • Intersections of subspaces defined by a single splitting variable • A regression tree model is a multivariate step function • Leaves represent the predicted target • All cases in a particular leaf are given the same predicted target • Splits: • Binary • Multiway splits (inputs partitioned into disjoint ranges)
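To make the if … then … scoring concrete, here is a minimal sketch in Python. The tree, its variables (age, income), its cut-offs and its leaf probabilities are entirely hypothetical, chosen only to show how each root-to-leaf path becomes one Boolean rule used to score a new case.

```python
def score(case):
    """Score a new case with the Boolean rules read off a fitted tree.
    The variables and cut-off values are hypothetical illustrations."""
    if case["age"] < 35:                               # split at the root node
        if case["income"] < 40_000:                    # split at an internal node
            return {"class": "bad", "p_bad": 0.62}     # leaf 1
        return {"class": "good", "p_bad": 0.18}        # leaf 2
    return {"class": "good", "p_bad": 0.09}            # leaf 3

print(score({"age": 29, "income": 55_000}))   # -> {'class': 'good', 'p_bad': 0.18}
```

Every case that lands in the same leaf receives the same predicted class and the same class-membership probability.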
Analytical decisions • Recursive partitioning rule / splitting criterion • Pruning criterion / stopping criterion • Assignment of the predicted target variable
Recursive partitioning rule • Method used to fit the tree • Top-down, greedy algorithm • Starts at the root node • Splits involving each single input are examined • Disjoint subsets of nominal inputs • Disjoint ranges of ordinal / interval inputs • The splitting criterion • measures the reduction in variability of the target distribution in the child nodes • is used to choose the split • The split chosen determines the partitioning of the observations • The partition is repeated in each child node as if it were the root node of a new tree • The partitioning continues deeper in the tree – the process is repeated recursively until it is stopped by the stopping rule (a sketch of the recursion follows below)
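A minimal sketch of the top-down greedy recursion. It assumes numeric inputs, a 0/1 target and the basic misclassification impurity; the function and parameter names (grow, best_split, min_leaf) are made up for illustration, not taken from any particular implementation.

```python
import numpy as np

def misclassification(y):
    """Basic impurity: share of cases not in the majority class."""
    if len(y) == 0:
        return 0.0
    return 1.0 - np.bincount(y).max() / len(y)

def best_split(X, y):
    """Greedy search over all inputs and all cut-offs for the split
    that maximises the drop in impurity (binary splits on numeric inputs)."""
    parent = misclassification(y)
    best = None
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            child = (len(left) * misclassification(left)
                     + len(right) * misclassification(right)) / len(y)
            if best is None or parent - child > best[0]:
                best = (parent - child, j, cut)
    return best

def grow(X, y, depth=0, max_depth=3, min_leaf=50):
    """Recursive partitioning: split, then treat each child as a new root."""
    split = best_split(X, y)
    if depth >= max_depth or len(y) < 2 * min_leaf or split is None or split[0] <= 0:
        return {"leaf": True, "class": int(np.bincount(y).argmax())}
    _, j, cut = split
    mask = X[:, j] <= cut
    return {"leaf": False, "var": j, "cut": float(cut),
            "left": grow(X[mask], y[mask], depth + 1, max_depth, min_leaf),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_leaf)}
```

Real implementations differ in how candidate splits are searched and in the impurity measure, but the recursion itself has this shape.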
Splits on (at least) ordinal input • Restrictions in order to preserve the ordering • Only adjacent values are grouped • Problem: • to partition an input with L distinct values (levels) into B groups • C(L−1, B−1) partitions • 2^(L−1) − 1 possible splits on a single ordinal input • Any monotonic transformation of the levels of the input (with at least an ordinal measurement scale) gives the same split
Splits on nominal input • No restrictions regarding ordering • Problem: • to partition an input with L distinct values (levels) into B groups • Number of partitions: S(L,B) = (1/B!) Σ_{j=0..B} (−1)^j C(B,j) (B−j)^L – the Stirling number of the second kind • counts the number of ways to partition a set of L labelled objects into B nonempty unlabelled subsets • The total number of partitions: the Bell number B_L = Σ_{B=1..L} S(L,B)
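A quick way to verify these counts; the functions below simply implement the formulas above (contiguous partitions of ordered levels, Stirling and Bell numbers for nominal levels).

```python
from math import comb, factorial

def ordinal_partitions(L, B):
    """Contiguous partitions of L ordered levels into B groups."""
    return comb(L - 1, B - 1)

def stirling2(L, B):
    """Stirling number of the second kind (inclusion-exclusion formula)."""
    return sum((-1) ** j * comb(B, j) * (B - j) ** L for j in range(B + 1)) // factorial(B)

def bell(L):
    """Total number of partitions of L labelled objects (Bell number)."""
    return sum(stirling2(L, B) for B in range(1, L + 1))

L = 5
print([ordinal_partitions(L, B) for B in range(2, L + 1)])  # [4, 6, 4, 1]
print(2 ** (L - 1) - 1)                                     # 15 = all ordinal splits
print(stirling2(L, 2), bell(L))                             # 15, 52
```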
Binary splits • Ordinal input: L − 1 possible splits (only adjacent levels may be grouped) • Nominal input: 2^(L−1) − 1 possible splits
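A small sketch that enumerates the binary split candidates for a three-level input; the level names are hypothetical. An ordinal binary split is a cut point (the left group is a prefix), while a nominal binary split is any nonempty proper subset counted once per {subset, complement} pair.

```python
from itertools import combinations

levels = ["young", "medium", "older"]          # hypothetical levels, L = 3

# Ordinal input: a binary split is a cut point, so the left group is a prefix.
ordinal_splits = [(levels[:i], levels[i:]) for i in range(1, len(levels))]
print(len(ordinal_splits))      # 2 = L - 1

# Nominal input: any nonempty proper subset, counted once per {subset, complement} pair.
nominal_splits, seen = [], set()
for r in range(1, len(levels)):
    for subset in combinations(levels, r):
        rest = tuple(l for l in levels if l not in subset)
        if rest not in seen:
            seen.add(subset)
            nominal_splits.append((list(subset), list(rest)))
print(len(nominal_splits))      # 3 = 2**(L-1) - 1
```

The nominal enumeration contains the non-adjacent grouping (young + older) vs medium, which a strict ordinal restriction would exclude.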
Partitioning rule – possible variations • Incorporating some type of look-ahead or backup • often produces inferior trees • has not been shown to be an improvement (Murthy and Salzberg, 1995) • Oblique splits • splits on a linear combination of inputs (as opposed to the standard coordinate-axis splits, i.e. boundaries parallel to the input coordinates)
Recursive partitioning algorithm • Start with the L-way split • Collapse the two levels that are closest (based on the splitting criterion) • Repeat the process on the set of L−1 consolidated levels • … • until a candidate split of each size has been produced • Choose the best split for the given input • Repeat the process for each input and choose the best input • CHAID algorithm (sketched below) • Additional backward elimination step • Number of splits to consider is greatly reduced: • For ordinal input: • For nominal input:
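A minimal sketch of the greedy level-consolidation idea only, not the full CHAID algorithm (no significance testing, no backward elimination). The impurity measure, function names and class counts per level are hypothetical, chosen just to show how the L-way split is collapsed one merge at a time.

```python
import numpy as np

def grouping_impurity(groups, counts):
    """Weighted misclassification impurity of a grouping of levels.
    counts[level] is an array [n_class0, n_class1]; groups is a list of
    lists of levels, one list per candidate child node."""
    node_counts = [np.sum([counts[l] for l in g], axis=0) for g in groups]
    total = sum(n.sum() for n in node_counts)
    return sum(n.sum() / total * (1 - n.max() / n.sum()) for n in node_counts)

def consolidate(levels, counts, ordinal=True):
    """Greedy consolidation: start from the L-way split and repeatedly merge the
    pair of groups whose merge gives the lowest impurity, keeping one candidate
    grouping of each size L-1, ..., 2."""
    groups = [[l] for l in levels]
    candidates = []
    while len(groups) > 2:
        if ordinal:                      # only adjacent groups may be merged
            pairs = [(i, i + 1) for i in range(len(groups) - 1)]
        else:                            # nominal: any pair may be merged
            pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        def after_merge(p):
            i, j = p
            merged = groups[i] + groups[j]
            rest = [g for k, g in enumerate(groups) if k not in p]
            return rest[:i] + [merged] + rest[i:]   # insert in place to keep level order
        groups = min((after_merge(p) for p in pairs),
                     key=lambda g: grouping_impurity(g, counts))
        candidates.append(groups)
    return candidates

# Hypothetical class counts (class0, class1) per level of an ordinal input.
counts = {"young": np.array([800, 200]), "medium": np.array([500, 100]),
          "older": np.array([300, 100])}
print(consolidate(["young", "medium", "older"], counts, ordinal=True))
```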
Stopping criterion • Governs the depth and complexity of the tree • The right balance between depth and complexity • When the tree is too complex: • perfect discrimination in the training sample • lost stability • lost ability to generalise discovered patterns and relations • overfitting to the training sample • difficulties with interpretation of predictive rules • Trade-off between the adjustment to the training sample and the ability to generalise
Splitting criterion • Impurity reduction • Chi-square test • An exhaustive tree algorithm considers: • all possible partitions • of all inputs • at every node • combinatorial explosion
Splitting criterion • Minimise impurity within child nodes / maximise differences between newly split child nodes • choose the split into child nodes which: • maximises the drop in impurity resulting from the parent node's partition • maximises the difference between nodes • Measures of impurity (sketched below): • Basic ratio • Gini impurity index • Entropy • Measures of difference: • Based on relative frequencies (classification tree) • Based on target variance (regression tree)
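A short sketch of the three impurity measures in their binary-target form; the "400 bad out of 2000 cases" figure is taken from the worked example at the end of the deck.

```python
import numpy as np

def basic_ratio(p):
    """Basic impurity: share of the minority class in the node (binary target)."""
    return min(p, 1 - p)

def gini(p):
    """Gini impurity index for a binary target: 1 - p**2 - (1-p)**2."""
    return 2 * p * (1 - p)

def entropy(p):
    """Entropy (in bits) for a binary target."""
    if p in (0, 1):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p_bad = 400 / 2000    # impurity of the root node in the worked example
print(basic_ratio(p_bad), round(gini(p_bad), 2), round(entropy(p_bad), 3))   # 0.2 0.32 0.722
```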
Binary Decision Trees • Nonparametric model – no assumptions regarding distribution needed • Classifies observations into pre-defined groups – the target variable is predicted for the whole leaf • Supervised segmentation • In the basic case: recursive partition into two separate categories in order to maximise similarities of observations within the leaf and maximise differences between leaves • Tree model = rules of segmentation • No previous selection of input variables
Trees vs hierarchical segmentation • Trees • Predictive approach • Supervised classification • Segmentation based on the target variable • Each partitioning based on one variable at a time (usually) • Hierarchical segmentation • Descriptive approach • Unsupervised classification • Segmentation based on all variables • Each partitioning based on all variables at a time – based on a distance measure
Requirements • Large data sample • In case of classification trees: a sufficient number of cases falling into each class of the target (suggested: min. 500 cases per class)
Stopping criterion • The node reaches a pre-defined size (e.g. 10 or fewer cases) • The algorithm has run the predefined number of generations • The split results in a (too) small drop of impurity • Expected losses in the testing sample • Stability of results in the testing sample • Probabilistic assumptions regarding the variables (e.g. the CHAID algorithm) • (see the hyperparameter sketch below)
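Stopping rules like these usually surface as hyperparameters in off-the-shelf implementations; for instance, scikit-learn's DecisionTreeClassifier exposes them directly. The numeric values below are arbitrary examples, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Stopping rules expressed as hyperparameters (arbitrary example values):
clf = DecisionTreeClassifier(
    criterion="gini",             # splitting criterion (impurity measure)
    min_samples_leaf=10,          # the node reaches a pre-defined size
    max_depth=5,                  # maximum number of "generations" (tree depth)
    min_impurity_decrease=0.001,  # reject splits with too small a drop of impurity
)
# clf.fit(X_train, y_train)       # X_train, y_train are assumed to exist
```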
Target assignment to the leaf • Frequency based • a threshold is needed • Cost of misclassification based (a sketch follows below) • α – cost of the type I error – e.g. the average cost incurred due to acceptance of a "bad" credit • β – cost of the type II error – e.g. the average income lost due to rejection of a "good" credit
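A sketch of the cost-based assignment described above; the leaf counts and the values of α and β are hypothetical. The rule compares the expected cost of accepting versus rejecting the whole leaf, which is equivalent to a frequency rule with threshold β / (α + β).

```python
def assign_leaf_class(n_good, n_bad, alpha, beta):
    """Cost-based class assignment for a leaf.
    alpha: cost of a type I error (accepting a 'bad' case),
    beta:  cost of a type II error (rejecting a 'good' case)."""
    p_bad = n_bad / (n_good + n_bad)
    cost_if_accept = alpha * p_bad          # expected per-case cost of labelling 'good'
    cost_if_reject = beta * (1 - p_bad)     # expected per-case cost of labelling 'bad'
    return "bad" if cost_if_accept > cost_if_reject else "good"

# Hypothetical leaf: 100 bads out of 600 cases; a missed 'bad' costs 10x a rejected 'good'.
print(assign_leaf_class(n_good=500, n_bad=100, alpha=10, beta=1))
# -> 'bad'  (threshold beta/(alpha+beta) = 1/11 is below p_bad = 1/6)
```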
Disadvantages • Lack of stability (often) • Stability assessment on the basis of a testing sample, without formal statistical inference • In case of a classification tree: the target value is calculated in a separate step with a "simplistic" method (dominating-frequency assignment) • The target value is calculated on the leaf level, not on the individual-observation level
Splitting Example • Drop of impurity: ΔI = i(v) − [p(l)·i(l) + p(r)·i(r)] • Basic impurity index: i(v) = min{p₀(v), p₁(v)} • p(l)·i(l) + p(r)·i(r) – the average impurity of the child nodes
Splitting Example • Gini impurity index: i(v) = 1 − Σₖ pₖ(v)² • Entropy: i(v) = −Σₖ pₖ(v)·log₂ pₖ(v) • Pearson's χ² test for relative frequencies
Splitting Example • How to split the ordinal (in this case) variable "age"? • (young + older) vs. medium? • (young + medium) vs. older?
Splitting Example • 1. Young + Older = r versus Medium = l • I(v) = min{400/2000; 1600/2000} = 0.2 • p(r) = 1400/2000 = 0.7 • p(l) = 600/2000 = 0.3 • I(r) = 300/1400 • I(l) = 100/600
Splitting Example • 2. Young + Medium = r versus Older = l • I(v) = min{400/2000; 1600/2000} = 0.2 • p(r) = 1600/2000 = 0.8 • p(l) = 400/2000 = 0.2 • I(r) = 300/1600 • I(l) = 100/400
Splitting Example • 1. Young + Older = r versus Medium = l • p(r) = 1400/2000 = 0.7 • p(l) = 600/2000 = 0.3
Splitting Example • 2. Young + Medium = r versus Older = l • p(r) = 1600/2000 = 0.8 • p(l) = 400/2000 = 0.2
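To check the two candidate splits end to end, the sketch below reuses the counts from these slides (2000 cases with 400 "bad" overall; 300 of 1400 "bad" in young + older and 100 of 600 in medium; 300 of 1600 in young + medium and 100 of 400 in older) and applies the standard binary-target formulas for the basic index, Gini, entropy and Pearson's χ².

```python
import numpy as np

def basic(p):  return min(p, 1 - p)
def gini(p):   return 2 * p * (1 - p)
def entro(p):  return 0.0 if p in (0, 1) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def drop(measure, parent, children):
    """Drop of impurity: i(parent) minus the weighted average impurity of the
    child nodes; parent and each child are (n_cases, n_bad) pairs."""
    n, bad = parent
    avg = sum(m / n * measure(b / m) for m, b in children)
    return measure(bad / n) - avg

parent = (2000, 400)
split1 = [(1400, 300), (600, 100)]    # (young + older) vs medium
split2 = [(1600, 300), (400, 100)]    # (young + medium) vs older

for name, m in [("basic", basic), ("gini", gini), ("entropy", entro)]:
    print(name, round(drop(m, parent, split1), 5), round(drop(m, parent, split2), 5))
# basic   0.0      0.0      -> the basic index cannot tell the two splits apart
# gini    0.00095  0.00125  -> split 2 gives the larger drop
# entropy 0.0022   0.0027   -> split 2 gives the larger drop

def pearson_chi2(split):
    """Pearson chi-square statistic for the 2x2 good/bad x child-node table."""
    table = np.array([[m - b, b] for m, b in split], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

print(round(pearson_chi2(split1), 2), round(pearson_chi2(split2), 2))   # 5.95 7.81
```

Under the basic index both splits leave the average impurity unchanged, while Gini, entropy and the χ² statistic all slightly prefer split 2.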