
Module-II Decision Tree Learning

Explore decision tree learning as a method for modeling target functions. Learn how decision trees represent data and the basic algorithm behind it.

cashford


Presentation Transcript


  1. Module-II Decision Tree Learning

  2. Introduction

  3. Introduction • Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. • Learned trees can also be re-represented as sets of if-then rules to improve human readability. • It is among the most popular inductive inference algorithms. • It has been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants.

  4. Decision tree representation

  5. In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. • Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. For example, the decision tree shown in the figure corresponds to the expression (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
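The expression on this slide can be read directly as a boolean function. The following is a minimal sketch; the function name `tree_says_yes` and the string encoding of attribute values are assumptions made for illustration, not part of the slides.

```python
def tree_says_yes(outlook: str, humidity: str, wind: str) -> bool:
    """Evaluate the disjunction of conjunctions quoted on the slide.

    Each parenthesised term is one root-to-leaf path that ends in a
    positive leaf; the tree as a whole is the OR of those paths.
    """
    return (
        (outlook == "Sunny" and humidity == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and wind == "Weak")
    )


# Hypothetical instances, one per case of interest.
print(tree_says_yes("Sunny", "Normal", "Strong"))
print(tree_says_yes("Rain", "High", "Strong"))
```

Any instance that satisfies none of the three conjunctions falls through to a negative classification, which mirrors the paths of the tree that end in a negative leaf.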

  6. APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING Decision tree learning is generally best suited to problems with the following characteristics: • Instances are represented by attribute-value pairs • The target function has discrete output values • Disjunctive descriptions may be required • The training data may contain errors • The training data may contain missing attribute values Examples: Learning to classify medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on payments. • Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems.

  7. THE BASIC DECISION TREE LEARNING ALGORITHM • The core algorithm employs a top-down, greedy search through the space of possible decision trees. • This approach is demonstrated by the ID3 algorithm (Iterative Dichotomiser 3). • In practice, variations of ID3 are used. • The basic ID3 algorithm learns decision trees by constructing them top-down, beginning with the question "which attribute should be tested at the root of the tree?" • Each attribute is evaluated with a statistical test, and the best attribute is selected as the test at the root node of the tree.

  8. Inventor: John Ross Quinlan • He is a computer science researcher in data mining and decision theory. • He has contributed extensively to the development of decision tree algorithms, including inventing the ID3 and canonical C4.5 algorithms.

  9. Which Attribute Is the Best Classifier? • The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. • We would like to select the attribute that is most useful for classifying examples. • What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.

  10. ENTROPY MEASURES HOMOGENEITY OF EXAMPLES • In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples. • Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is $Entropy(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$ • where p⨁ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.

  11. To illustrate, suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of data). Then the entropy of S relative to this boolean classification is $Entropy([9+, 5-]) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940$ • Notice that the entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p⨁ = 1), then p⊖ is 0, and $Entropy(S) = -1 \cdot \log_2(1) - 0 \cdot \log_2(0) = -1 \cdot 0 - 0 \cdot \log_2(0) = 0$. • Note that the entropy is 1 when the collection contains an equal number of positive and negative examples.
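The [9+, 5-] calculation can be checked numerically. This is a minimal sketch assuming the standard two-class entropy form, with the 0 log 0 = 0 convention handled explicitly; the function name `boolean_entropy` is an illustrative choice, not something taken from the slides.

```python
import math


def boolean_entropy(pos: int, neg: int) -> float:
    """Entropy (in bits) of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    if total == 0:
        return 0.0
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:  # by convention, 0 * log2(0) contributes 0
            result -= p * math.log2(p)
    return result


print(f"{boolean_entropy(9, 5):.3f}")  # the [9+, 5-] sample from the slide
print(boolean_entropy(7, 7))           # evenly split collection
print(boolean_entropy(14, 0))          # all members in one class
```

The three calls correspond to the three cases discussed on the slide: a mixed sample, an evenly split sample, and a pure sample.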

  12. If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1. • The figure shows the form of the entropy function relative to a boolean classification, as the proportion p⨁ of positive examples varies between 0 and 1.

  13. One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S. • For example, if p⨁ is 1, the receiver knows the drawn example will be positive, so no message need be sent, and the entropy is zero. • On the other hand, if p⨁ is 0.5, one bit is required to indicate whether the drawn example is positive or negative. If p⨁ is 0.8, then a collection of messages can be encoded using on average less than 1 bit per message by assigning shorter codes to collections of positive examples and longer codes to less likely negative examples.
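As a worked check of the 0.8 case, assuming the standard two-class entropy form $-p_{\oplus}\log_2 p_{\oplus} - p_{\ominus}\log_2 p_{\ominus}$, the average cost does come out below one bit per message:

```latex
\begin{aligned}
Entropy(S) &= -0.8 \log_2 0.8 - 0.2 \log_2 0.2 \\
           &\approx 0.8\,(0.322) + 0.2\,(2.322) \\
           &\approx 0.258 + 0.464 = 0.722 \text{ bits}
\end{aligned}
```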

  14. Thus far we have discussed entropy in the special case where the target classification is boolean. More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as $Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$ • where $p_i$ is the proportion of S belonging to class i.
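The c-class definition can be sketched by counting class labels. This is an illustrative helper, not code from the slides: the name `entropy` and the choice to pass a flat list of labels are assumptions.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Entropy (in bits) of a list of class labels with any number of classes."""
    if not labels:
        return 0.0
    total = len(labels)
    # Counter only holds classes that actually occur, so every proportion
    # is positive and the 0 * log2(0) case never arises.
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


# Four equally likely classes (hypothetical labels).
print(entropy(["a", "b", "c", "d"]))
```

With c classes the entropy can be as large as log2(c), so for more than two classes it is no longer bounded by 1.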

  15. INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY • Information gain is the expected reduction in entropy caused by partitioning the examples according to some attribute A: $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$ • Split the node on the attribute with the highest Gain • S: a collection of examples • A: an attribute • Values(A): the possible values of attribute A • $S_v$: the subset of S for which attribute A has value v.
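Information gain can be sketched as the entropy of S minus the size-weighted entropy of each subset Sv. The code below is a minimal illustration under assumptions: examples are plain dicts, and the attribute name `Wind`, the target name `Play`, and the toy counts are hypothetical data chosen only to exercise the function.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target) -> float:
    """Entropy(S) minus the weighted entropy of the subsets S_v, one per value v of `attribute`."""
    subsets = defaultdict(list)
    for ex in examples:
        subsets[ex[attribute]].append(ex[target])
    remainder = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy([ex[target] for ex in examples]) - remainder


# Hypothetical toy data: 14 examples, 9 positive and 5 negative overall.
examples = (
    [{"Wind": "Weak", "Play": "Yes"}] * 6
    + [{"Wind": "Weak", "Play": "No"}] * 2
    + [{"Wind": "Strong", "Play": "Yes"}] * 3
    + [{"Wind": "Strong", "Play": "No"}] * 3
)
print(f"{information_gain(examples, 'Wind', 'Play'):.3f}")
```

An attribute whose values split the examples into purer subsets leaves a smaller weighted remainder and therefore earns a larger gain.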

  16. ID3 using Information Gain
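The top-down, greedy procedure can be sketched as a short recursive function that picks the highest-gain attribute at each node and never revisits that choice. This is an illustrative sketch, not the slide's own pseudocode: the nested-dict tree format, the function names, and the five-row dataset are all assumptions.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(examples, attribute, target) -> float:
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([ex[target] for ex in examples]) - remainder


def id3(examples, attributes, target):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or nothing left to test: use the majority label.
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({ex[best] for ex in examples}):
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree


# Hypothetical toy data, not the dataset used in the slides' illustration.
data = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]
tree = id3(data, ["Outlook", "Wind"], "Play")
print(tree)
```

Branches are created only for attribute values that occur in the examples reaching a node, which keeps the sketch short; a fuller version would also handle values unseen during training.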

  17. ID3: Illustration

  18. 1. Level 0: To identify the root node • Entropy(S) =

  19. Illustration

  20. 2. Level 1: (1st branch) Outlook = Sunny • Entropy(Ssunny) =

  21. 3. Level 1: (2nd branch) Outlook = Overcast • All are yes. • No splitting required.

  22. Illustration 4. Level 1: (3rd branch) Outlook = Rain • Entropy(Srain)

  23. Illustration: Final Decision Tree

  24. HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING • The hypothesis space searched by ID3 is the set of possible decision trees. • ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. • The evaluation function that guides this hill-climbing search is the information gain measure.

  25. Capabilities and Limitations of ID3 • ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes. • ID3 maintains only a single current hypothesis as it searches through the space of decision trees. • ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. • ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., FIND-S or CANDIDATE-ELIMINA

  26. INDUCTIVE BIAS IN DECISION TREE LEARNING • Inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances. • Typically there are many decision trees consistent with the training examples. • ID3 chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search through the space of possible trees. • Approximate inductive bias of ID3: Shorter trees are preferred over larger trees. • A closer approximation to the inductive bias of ID3: Shorter trees are preferred over longer trees, and trees that place high information gain attributes close to the root are preferred over those that do not.

  27. Restriction Biases and Preference Biases • The inductive bias of ID3 is thus a preference for certain hypotheses over others (e.g., for shorter hypotheses). • This form of bias is typically called a preference bias (or, alternatively, a search bias). • In contrast, the bias of the Candidate-Elimination algorithm (CEA) is in the form of a categorical restriction on the set of hypotheses considered. • This form of bias is typically called a restriction bias (or, alternatively, a language bias). • A preference bias is more desirable than a restriction bias. • ID3 exhibits a pure preference bias and CEA a pure restriction bias, whereas some learning systems combine both.

  28. Why prefer short hypotheses? • Occam's razor: (Problem-Solving Principle) • Prefer the simplest hypothesis that fits the data. • (The term "razor" refers to the frequency and effectiveness with which William of Occam used it.)

  29. Issues in Decision Tree Learning • Practical issues in learning decision trees include • determining how deeply to grow the decision tree, • handling continuous attributes, • choosing an appropriate attribute selection measure, • handling training data with missing attribute values, • handling attributes with differing costs, and • improving computational efficiency. • We discuss each of these issues and the extensions to the basic ID3 algorithm that address them. • ID3 has itself been extended to address most of these issues, with the resulting system renamed C4.5.

  30. Overfitting • Consider 2D data. Positive examples are plotted in blue, negative examples in red. • The green line represents an overfitted model and the black line represents a regularized model. • While the green line best follows the training data, it is too dependent on that data and is likely to have a higher error rate on new, unseen data than the black line.

  31. Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances. • Underfitting: Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.
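The overfitting definition can be written compactly as a pair of inequalities. The symbols $error_{train}$ (error over the training examples) and $error_{\mathcal{D}}$ (error over the entire distribution of instances) are notation assumed here for illustration; the slide states the condition only in words.

```latex
h \in H \text{ overfits the training data if } \exists\, h' \in H \text{ such that}
\quad
error_{train}(h) < error_{train}(h')
\quad \text{and} \quad
error_{\mathcal{D}}(h') < error_{\mathcal{D}}(h)
```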

  32. Validation Set • How exactly might we use a validation set to prevent overfitting? • Reduced Error Pruning • Rule Post-Pruning Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

  33. Reduced Error Pruning
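One way to picture reduced-error pruning is the bottom-up variant sketched below: each decision node is replaced by a leaf carrying its majority training label whenever that leaf classifies the validation examples reaching the node at least as well as the subtree does. This is a simplified sketch under assumptions, not necessarily the exact procedure the slides present: the node format (`attr`, `branches`, `majority`), the function names, and the toy tree and validation data are all hypothetical.

```python
def classify(tree, example):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        # Unseen attribute values fall back to the node's majority label.
        tree = tree["branches"].get(example[tree["attr"]], tree["majority"])
    return tree


def prune(tree, validation, target):
    """Return the tree with subtrees replaced by leaves where that does not hurt validation accuracy."""
    if not isinstance(tree, dict):
        return tree
    # Prune children first, each with the validation examples routed to it.
    for value, child in list(tree["branches"].items()):
        routed = [ex for ex in validation if ex[tree["attr"]] == value]
        tree["branches"][value] = prune(child, routed, target)
    subtree_correct = sum(classify(tree, ex) == ex[target] for ex in validation)
    leaf_correct = sum(ex[target] == tree["majority"] for ex in validation)
    # Ties favour the smaller tree; nodes that no validation example reaches are also collapsed.
    return tree["majority"] if leaf_correct >= subtree_correct else tree


# Hypothetical tree in which the Wind test under Sunny fits noise in the training data.
tree = {
    "attr": "Outlook",
    "majority": "Yes",
    "branches": {
        "Sunny": {"attr": "Wind", "majority": "No",
                  "branches": {"Weak": "Yes", "Strong": "No"}},
        "Overcast": "Yes",
        "Rain": "Yes",
    },
}
validation = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "Yes"},
]
pruned = prune(tree, validation, "Play")
print(pruned)
```

In this toy run the Wind subtree under Sunny collapses to a single leaf, while the root test on Outlook is kept because removing it would lower validation accuracy.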
