Data Mining • Decision Trees • dr Iwona Schab
Decision Trees • Method of classification • Recursive procedure which (progressively) divides a set of n units into groups according to a division rule • Designed for supervised prediction problems (i.e. a set of input variables is used to predict the value of a target variable) • The primary goal is prediction • The fitted tree model is used to predict the target variable for new cases (i.e. to score new cases/data) • Result: a final partition of the observations and the Boolean rules needed to score new data
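A minimal sketch of this fit-then-score workflow, using scikit-learn; the library choice, data values and parameters are illustrative assumptions, not part of the original slides:

```python
# Minimal sketch: fit a classification tree on labelled training data,
# then use the fitted model to score (predict the target for) new cases.
# All data values below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 1500], [40, 3000], [35, 1200], [52, 4000]]  # inputs, e.g. age and income
y_train = ["bad", "good", "bad", "good"]                    # target classes

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

X_new = [[30, 2000]]              # a new case to score
print(tree.predict(X_new))        # predicted class of the leaf the case falls into
print(tree.predict_proba(X_new))  # estimated probability of class membership
```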
Decision Tree • A predictive model represented in a tree-like structure • Root node • A split based on the values of the input • Internal node • Terminal node – the leaf
Decision tree • Nonparametric method • Allows for modelling nonlinear relationships • Sound concept, easy to interpret • Robustness against outliers • Detection and taking into account of potential interactions between input variables • Additional uses: categorisation of continuous variables, grouping of nominal values
Decision Trees • Types: • Classification trees (categorical response variable) – the leaves give the predicted class and the probability of class membership • Regression trees (continuous response variable) – the leaves give the predicted value of the target • Exemplary applications: • Handwriting recognition • Medical research • Financial and capital markets
Decision Tree • The path to each leaf is expressed as a Boolean rule: if … then … • The "regions" of the input space are determined by the split values • Intersections of subspaces defined by a single splitting variable • A regression tree model is a multivariate step function • Leaves represent the predicted target • All cases in a particular leaf are given the same predicted target • Splits: • Binary • Multiway splits (inputs partitioned into disjoint ranges)
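As an illustration of the one-rule-per-leaf view, the sketch below prints a fitted tree as if-then rules; it assumes scikit-learn and uses its bundled iris data purely for demonstration:

```python
# Sketch: write the fitted tree out as Boolean if-then rules, one path per leaf.
# Every case that satisfies a path's conditions receives that leaf's prediction.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))
```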
Analytical decisions • Recursive partitioning rule / splitting criterion • Pruning criterion / stopping criterion • Assignment of the predicted target variable
Recursive partitioning rule • Method used to fit the tree • Top-down, greedy algorithm • Starts at the root node • Splits involving each single input are examined • Disjoint subsets of nominal inputs • Disjoint ranges of ordinal / interval inputs • The splitting criterion • Measures the reduction in variability of the target distribution in the child nodes • Is used to choose the split • The split chosen determines the partitioning of the observations • The partition is repeated in each child node as if it were the root node of a new tree • The partition continues deeper in the tree – the process is repeated recursively until it is stopped by the stopping rule (see the sketch below)
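The code below is a deliberately simplified sketch of this top-down, greedy procedure for a single interval input and a binary (0/1) target; the function names, the Gini criterion and the stopping constants are illustrative choices, not the algorithm of any particular package:

```python
# Illustrative sketch of top-down, greedy recursive partitioning.
def gini(y):
    """Gini impurity of a node holding 0/1 target values y."""
    if not y:
        return 0.0
    p = sum(y) / len(y)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Examine every candidate cut point and return (cut, drop) with the largest
    reduction in impurity: parent impurity minus the weighted child impurity."""
    best = None
    parent = gini(y)
    for cut in sorted(set(x)):
        left = [t for v, t in zip(x, y) if v <= cut]
        right = [t for v, t in zip(x, y) if v > cut]
        if not left or not right:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        drop = parent - child
        if best is None or drop > best[1]:
            best = (cut, drop)
    return best

def grow(x, y, depth=0, max_depth=3, min_size=5):
    """Recursive partitioning: split, then treat each child node as the root
    of a new tree until a stopping rule fires."""
    split = best_split(x, y)
    if split is None or len(y) < min_size or depth >= max_depth or split[1] <= 0:
        return {"leaf": sum(y) / len(y)}   # same predicted target for the whole leaf
    cut, _ = split
    lx, ly = zip(*[(v, t) for v, t in zip(x, y) if v <= cut])
    rx, ry = zip(*[(v, t) for v, t in zip(x, y) if v > cut])
    return {"cut": cut,
            "left": grow(list(lx), list(ly), depth + 1, max_depth, min_size),
            "right": grow(list(rx), list(ry), depth + 1, max_depth, min_size)}
```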
Splits on (at least) ordinal input • Restrictions in order to preserve the ordering • Only adjacent values are grouped • Problem: to partition into B groups an input with L distinct values (levels) • C(L−1, B−1) partitions into B groups • 2^(L−1) − 1 possible splits on a single ordinal input • Any monotonic transformation of the level of the input (with at least an ordinal measurement scale) gives the same split
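A short check of these counts (the combinatorial formulas above are the standard ones, restored here because the original slide's formulas were lost in conversion):

```python
# Counting splits of an ordinal input with L ordered levels:
# groups must consist of adjacent levels, so a partition into B groups
# corresponds to choosing B-1 of the L-1 gaps between levels.
from math import comb

L = 5
for B in range(2, L + 1):
    print(B, comb(L - 1, B - 1))                          # partitions into B adjacent groups
print(sum(comb(L - 1, B - 1) for B in range(2, L + 1)))   # total splits = 2**(L-1) - 1
```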
Splits on nominal input • No restrictions regarding ordering • Problem: to partition into B groups an input with L distinct values (levels) • Number of partitions: S(L, B) – the Stirling number of the second kind • Counts the number of ways to partition a set of L labelled objects into B nonempty unlabelled subsets • The total number of partitions: the sum of S(L, B) over B (the Bell number)
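A small sketch computing these counts; the recurrence used is the standard one for Stirling numbers of the second kind, not quoted from the slides:

```python
# S(L, B): ways to partition L labelled objects into B nonempty unlabelled subsets,
# via the recurrence S(L, B) = B * S(L-1, B) + S(L-1, B-1).
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(L, B):
    if B == 0:
        return 1 if L == 0 else 0
    if L == 0:
        return 0
    return B * stirling2(L - 1, B) + stirling2(L - 1, B - 1)

L = 5
for B in range(2, L + 1):
    print(B, stirling2(L, B))                          # partitions into B groups
print(sum(stirling2(L, B) for B in range(1, L + 1)))   # total number of partitions (Bell number)
```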
Binary splits (B = 2) • Ordinal input: L − 1 possible splits • Nominal input: 2^(L−1) − 1 possible splits
Partitioning rule – possible variations • Incorporating some type of look-ahead or backup • Often produces inferior trees • Has not been shown to be an improvement (Murthy and Salzberg, 1995) • Oblique splits • Splits on linear combinations of inputs (as opposed to the standard coordinate-axis splits, i.e. boundaries parallel to the input coordinates)
Recursive partitioning algorithm • Start with the L-way split • Collapse the two levels that are closest (based on a splitting criterion) • Repeat the process on the set of L − 1 consolidated levels • … • This yields one candidate split of each size • Choose the best split for the given input • Repeat the process for each input and choose the best input • CHAID algorithm • Additional backward elimination step • Number of splits to consider greatly reduced: • For ordinal input: • For nominal input:
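The fragment below sketches just the level-collapsing step; it uses the difference in observed target rates as the "closeness" measure, which is a stand-in assumption (CHAID itself merges levels on the basis of chi-square tests), and it allows any pair of levels to merge, as for a nominal input:

```python
# Simplified sketch of collapsing the two closest levels, repeated until two groups remain.
def collapse_levels(rates):
    """rates: dict mapping each level to its observed target rate.
    Returns the sequence of consolidated groupings, from the L-way split down to 2 groups."""
    groups = [[level] for level in rates]        # start with the L-way split
    history = [list(groups)]
    while len(groups) > 2:
        def rate(g):                             # unweighted average rate of a group
            return sum(rates[l] for l in g) / len(g)
        # find the pair of groups whose target rates are closest
        i, j = min(((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
                   key=lambda ab: abs(rate(groups[ab[0]]) - rate(groups[ab[1]])))
        groups[i] = groups[i] + groups[j]        # merge the closest pair
        del groups[j]
        history.append(list(groups))
    return history

print(collapse_levels({"A": 0.10, "B": 0.12, "C": 0.30, "D": 0.55}))
```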
Stopping criterion • Governs the depth and complexity of the tree • Right balance between depth and complexity • When the tree is too complex: • Perfect discrimination in the training sample • Lost stability • Lost ability to generalise discovered patterns and relations • Overfitted to the training sample • Difficulties with the interpretation of predictive rules • Trade-off between the adjustment to the training sample and the ability to generalise (illustrated in the sketch below)
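A quick illustration of that trade-off on synthetic data; scikit-learn and all parameter values are assumptions of this example:

```python
# Sketch: an unrestricted tree typically discriminates the training sample
# perfectly but generalises worse than a shallower tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (None, 3):        # None = grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```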
Splitting criterion • Impurity reduction • Chi-square test • An exhaustive tree algorithm considers: • all possible partitions • of all inputs • at every node • Consequence: combinatorial explosion
Splitting criterion • Minimise impurity within child nodes / maximise differences between the newly split child nodes • Choose the split into child nodes which: • maximises the drop in impurity resulting from the parent node's partition • maximises the difference between nodes • Measures of impurity: • Basic ratio • Gini impurity index • Entropy • Measures of difference: • Based on relative frequencies (classification tree) • Based on target variance (regression tree)
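For concreteness, a brief sketch of the three impurity measures for a binary target, written as functions of the positive-class proportion p in a node; the functional forms are the standard definitions, not copied from the slides:

```python
# Sketch of the three impurity measures for a binary node.
from math import log2

def basic_impurity(p):
    """Basic ratio: share of the minority class (misclassification rate)."""
    return min(p, 1 - p)

def gini_impurity(p):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy(p):
    """Entropy in bits; 0 for a pure node, 1 for a 50/50 node."""
    if p in (0, 1):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

for p in (0.1, 0.3, 0.5):
    print(p, basic_impurity(p), gini_impurity(p), entropy(p))
```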
Binary decision trees • Nonparametric model – no assumptions regarding the distribution needed • Classifies observations into pre-defined groups – the target variable is predicted for the whole leaf • Supervised segmentation • In the basic case: recursive partition into two separate categories in order to maximise the similarity of observations within a leaf and maximise the differences between leaves • Tree model = rules of segmentation • No previous selection of input variables
Trees vs hierarchical segmentation • Trees • Predictive approach • Supervised classification • Segmentation based on the target variable • Each partitioning (usually) based on one variable at a time • Hierarchical segmentation • Descriptive approach • Unsupervised classification • Segmentation based on all variables • Each partitioning based on all variables at a time – based on a distance measure
Requirements • Large data sample • In the case of classification trees: a sufficient number of cases falling into each class of the target (suggested: min. 500 cases per class)
Stopping criterion • The node reaches a pre-defined size (e.g. 10 or fewer cases) • The algorithm has run the predefined number of generations • The split results in a (too) small drop of impurity • Expected losses in the testing sample • Stability of results in the testing sample • Probabilistic assumptions regarding the variables (e.g. the CHAID algorithm)
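Several of these rules have rough counterparts among the pre-pruning parameters of scikit-learn trees; the mapping and the values below are illustrative assumptions, not part of the slides:

```python
# Sketch: common stopping rules expressed as scikit-learn pre-pruning parameters.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=10,           # a node may not shrink below a pre-defined size
    max_depth=5,                   # bound on the number of "generations" of splits
    min_impurity_decrease=0.001,   # reject splits that give too small a drop in impurity
    random_state=0,
)
```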
Target assignment to the leaf • Frequency based • A threshold is needed • Based on the cost of misclassification • α – cost of the type I error, e.g. the average cost incurred due to acceptance of a "bad" credit • β – cost of the type II error, e.g. the average income lost due to rejection of a "good" credit
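A sketch of the cost-based assignment for a credit-scoring leaf; the function name, the counts and the cost values are hypothetical, only the roles of α and β follow the slide:

```python
# Sketch of cost-based class assignment for one leaf.
def assign_leaf(n_bad, n_good, alpha, beta):
    """Assign the class that minimises the expected misclassification cost.
    alpha: cost of accepting a bad credit (type I error)
    beta:  income lost by rejecting a good credit (type II error)"""
    p_bad = n_bad / (n_bad + n_good)
    expected_cost_accept = alpha * p_bad        # cost if the leaf is labelled "good"
    expected_cost_reject = beta * (1 - p_bad)   # cost if the leaf is labelled "bad"
    return "bad" if expected_cost_reject < expected_cost_accept else "good"

# Equivalent frequency rule: label "bad" when p_bad exceeds beta / (alpha + beta).
print(assign_leaf(n_bad=100, n_good=300, alpha=1000, beta=200))
```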
Disadvantages • Lack of stability (often) • Stability assessment on the basis of a testing sample, without formal statistical inference • In the case of a classification tree: the target value is calculated in a separate step with a "simplistic" method (assignment by the dominating frequency) • The target value is calculated at the leaf level, not at the individual observation level
Splitting example • Drop of impurity ΔI(v) = I(v) − [p(l)·I(l) + p(r)·I(r)], i.e. the impurity of the parent node minus the average (weighted) impurity of the child nodes • Basic impurity index I(v) = min{p(v), 1 − p(v)}, the share of the minority class in node v
Splitting example • Gini impurity index: I(v) = 1 − Σ_k p_k(v)² • Entropy: I(v) = −Σ_k p_k(v) log p_k(v) • Pearson's χ² test for relative frequencies
Splitting example • How to split the ordinal (in this case) variable "age"? • (young + older) vs. medium? • (young + medium) vs. older?
Splitting example • 1. Young + Older = r versus Medium = l • I(v) = min{400/2000; 1600/2000} = 0.2 • p(r) = 1400/2000 = 0.7, p(l) = 600/2000 = 0.3 • I(r) = 300/1400 ≈ 0.214, I(l) = 100/600 ≈ 0.167
Splitting example • 2. Young + Medium = r versus Older = l • I(v) = min{400/2000; 1600/2000} = 0.2 • p(r) = 1600/2000 = 0.8, p(l) = 400/2000 = 0.2 • I(r) = 300/1600 ≈ 0.188, I(l) = 100/400 = 0.25
Splitting example • 1. Young + Older = r versus Medium = l • p(r) = 1400/2000 = 0.7, p(l) = 600/2000 = 0.3
Splitting example • 2. Young + Medium = r versus Older = l • p(r) = 1600/2000 = 0.8, p(l) = 400/2000 = 0.2
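To close the example, a small sketch recomputing the drop in impurity for both candidate groupings with the counts from the slides; under the basic (min) index both splits give a drop of zero, while the Gini index slightly favours the second grouping, which is the usual argument for preferring Gini or entropy over the simple misclassification ratio:

```python
# Sketch: drop in impurity for the two candidate splits of "age".
# Counts come from the slides: 400 minority-class cases out of 2000 in the parent node.
def basic(n_min, n):
    p = n_min / n
    return min(p, 1 - p)

def gini(n_min, n):
    p = n_min / n
    return 2 * p * (1 - p)      # equals 1 - p**2 - (1-p)**2 for two classes

def drop(impurity, parent, left, right):
    (pm, pn), (lm, ln), (rm, rn) = parent, left, right
    avg_child = (ln / pn) * impurity(lm, ln) + (rn / pn) * impurity(rm, rn)
    return impurity(pm, pn) - avg_child

parent = (400, 2000)
split1 = ((100, 600), (300, 1400))   # l = Medium, r = Young + Older
split2 = ((100, 400), (300, 1600))   # l = Older,  r = Young + Medium

for name, (l, r) in (("split 1", split1), ("split 2", split2)):
    print(name,
          "basic:", round(drop(basic, parent, l, r), 4),
          "gini:", round(drop(gini, parent, l, r), 4))
```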