Cost-Sensitive Learning and Profit Optimal Decision Trees: Using Sequential Binary Programming to Make the Most Profitable Predictions. UCT Department of Information Systems Seminar, 23 August 2006. Alan Abrahams, BBusSc(UCT) PhD(Cambridge). Jointly with: Adrian Becker, Dan Fleder, Ian MacMillan.
Outline • What are DTs? • Decision Trees: • In Management Science • In Engineering • In Profit Optimal Prediction • Why use DTs? • Industrial Applications of Profit Optimal Decision Trees • How do I construct and evaluate a DT? • Construction: • Traditional • Traditional with cost-sensitive input or output modification • Profit Optimal, using Sequential Binary Programming [SBP] • Evaluation
Decision Trees in Management Science • Plot alternative futures: decision points and possible outcomes • Select forward path with highest expected profit • Problems: • Trees are constructed manually, using chronological events • Managers aren't that good at estimating prior probabilities. Rather: use a data-based approach. • The manual approach tries only one partition. Rather: try all possible partitions of the candidate set automatically, using software.
Decision Trees in Management Science • Starting from the rightmost decision nodes, compute the best decision at each decision node. • Best decision = alternative which maximizes Expected Profit from that node. • See Hillier & Hillier, Moore & Weatherford, Stevenson & Ozgur.
Decision Trees in Engineering • Partition data set into most homogeneous sub-sets • Use the criteria which define each sub-set to predict the likely class of a new, unseen example
Terminology: A Simple Decision Tree • Example: Predicting Responses to a Credit Card Marketing Campaign • [Figure: decision tree with its parts labelled: root node, nodes, branches / arcs, leaf nodes. The root node splits on Income. Income = Low leads to a Debts node (Low: Non-Responder; High: Responder). Income = High leads to a Gender node (Female: Non-Responder; Male leads to a Children node, where Many: Responder and Few: Non-Responder).] • This decision tree says that people with low income and high debts, and high income males with many children are likely responders.
Using a Decision Tree: Classifying a New Example for our Credit Card Marketing Campaign • Feed the new example into the root of the tree and follow the relevant path towards a leaf, based on the attributes of the example. • Assume Sarah has high income. • [Figure: the credit card campaign tree; with high income, Sarah follows the Gender = Female branch to a Non-Responder leaf.] • The tree predicts that Sarah will not respond to our campaign. • Already we see a problem: actually I don't really want to know whether Sarah will respond or not. What I really want to know is, on average, will I make a profit from people like Sarah!
Using a Decision Tree: Reading Rules off the Decision Tree for our Credit Card Marketing Campaign • For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules: • IF Income=Low AND Debts=Low THEN Non-Responder • IF Income=Low AND Debts=High THEN Responder • IF Income=High AND Gender=Male AND Children=Many THEN Responder • IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder • IF Income=High AND Gender=Female THEN Non-Responder
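The five rules read off the tree translate directly into nested conditionals. The sketch below is illustrative only; the function name, the argument names, and the string encoding of attribute values are assumptions, not part of the original slides.

```python
def predict_response(income, debts=None, gender=None, children=None):
    """Follow the path from the root (Income) to a leaf and return its label."""
    if income == "Low":
        # Low-income branch: the leaf depends on Debts only.
        return "Responder" if debts == "High" else "Non-Responder"
    # High-income branch: Gender next, then Children for males.
    if gender == "Male" and children == "Many":
        return "Responder"
    return "Non-Responder"


# Sarah: high income, female. Other attributes are not needed on her path.
print(predict_response("High", gender="Female"))
```

Only the attributes on the path actually taken are consulted, which is why Sarah can be classified without knowing her debts or number of children.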
Algorithms • ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT: Use information gain or gain ratio as splitting criterion (C4.5 is dealt with in this lecture). • CART (Classification And Regression Trees): Uses Gini diversity index or Twoing criterion as measure of impurity when deciding splitting. • CHAID: A statistical approach that uses the Chi-squared test (of correlation/association, dealt with in an earlier lecture) when deciding on the best split. Other statistical approaches, such as those by Goodman and Kruskal, and Zhou and Dillon, use the Asymmetrical Tau or Symmetrical Tau to choose the best discriminator. • Hunt’s Concept Learning System (CLS), and MINIMAX: Minimizes the cost of classifying examples correctly or incorrectly. • (See Sestito & Dillon, Chapter 3, Witten & Frank Section 4.3, or Dunham Section 4.4, if you’re interested in learning more. See also: http://www.kdnuggets.com/software/classification-tree-rules.html)
Prediction: Examples from Industry • CommerceBank (ABSA Home Loans) • Visa • WingsOver (Nando’s) • Cingular (Vodacom) • Wal-Mart (Pick 'n Pay)
Prediction: Examples from Industry • Fedex (DHL) • Crate & Barrel (Verimark) • British Airways (Kulula) • AND1 (Nike)
Guessing right David Kevin Michael x 2 Richard 14
“Accuracy is king” • “Only 15% of mergers and acquisitions succeed” (Stephen Denning, The Leader's Guide to Storytelling, p. xiv)
“Profit is King” (or “It pays to be wrong sometimes…”) • Failure rate of new ventures invested in: 8 out of 10 • Profit on Google investment: $4 billion (on $25 million) • Source: http://www.financialnews-us.com/?contentid=534017
Sometimes it pays to be wrong almost all the time… • Customer Lifetime Value: $2,700 (= 270,000 cents) • Cost per flyer: 7 cents • Required hit rate = 7 / 270,000 ≈ 1 in 38,571
Traditional Trees: Recursive Steps in Building a Tree • STEP 1: • Try different partitions using different attributes and splits to break the training examples into different subsets. • STEP 2: • Rank the splits (by purity). Choose the best split. • STEP 3: • For each node obtained by splitting, repeat from STEP 1, until no more good splits are possible. • Note: Usually it is not practical to create leaves that are completely pure (i.e. contain one class only), as that would result in a very bushy tree which is not sufficiently general. However, it is possible to create leaves that are purer (that is, contain predominantly one class) and we can settle for that.
Traditional Trees: Building a Tree: Choosing a Split: Example
Loan applicants and whether they defaulted:
ApplicantID   City     Children   Income   Status
1             Philly   Many       Medium   DEFAULTS
2             Philly   Many       Low      DEFAULTS
3             Philly   Few        Medium   PAYS
4             Philly   Few        High     PAYS
• Try split on Children attribute: Many gives applicants 1 and 2 (both DEFAULTS); Few gives applicants 3 and 4 (both PAYS). • Try split on Income attribute: Low gives applicant 2 (DEFAULTS); Medium gives applicants 1 and 3 (one DEFAULTS, one PAYS); High gives applicant 4 (PAYS). • Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first (and in this case only) split.
Traditional Trees: Building a Tree: Example • The simple tree obtained in the previous slide splits the data like this: • [Figure: the four applicants plotted by Children (Many, Few) against Income (Low, Medium, High). The best split separates Applicants 1 and 2 (Many children, DEFAULTS) from Applicants 3 and 4 (Few children, PAYS).] • Notice how the split is parallel to an axis - this is a feature of decision tree approaches.
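The three recursive steps (try splits, rank them by purity, recurse) can be sketched on the four loan applicants as follows. This is a simplified illustration rather than any particular published algorithm: it uses the share of non-majority examples as the impurity measure, handles categorical attributes only, and all function names are assumptions.

```python
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def impurity(labels):
    """Share of examples that are not in the majority class (0 = pure)."""
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)


def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if impurity(labels) == 0 or not attributes:
        return majority(labels)  # leaf

    def weighted_impurity(attr):
        # STEP 1: partition the examples by this attribute's values.
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r[target])
        return sum(len(g) * impurity(g) for g in groups.values()) / len(rows)

    # STEP 2: rank the candidate splits by purity and keep the best.
    best = min(attributes, key=weighted_impurity)
    if weighted_impurity(best) >= impurity(labels):
        return majority(labels)  # no good split left
    # STEP 3: repeat on each sub-node.
    remaining = [a for a in attributes if a != best]
    return (best, {
        value: build_tree([r for r in rows if r[best] == value], remaining, target)
        for value in {r[best] for r in rows}
    })


applicants = [
    {"Children": "Many", "Income": "Medium", "Status": "DEFAULTS"},
    {"Children": "Many", "Income": "Low", "Status": "DEFAULTS"},
    {"Children": "Few", "Income": "Medium", "Status": "PAYS"},
    {"Children": "Few", "Income": "High", "Status": "PAYS"},
]
tree = build_tree(applicants, ["Children", "Income"], "Status")
print(tree)
```

On this data the sketch picks Children as the first and only split, matching the slides. A greedy stopping rule like this one can stop too early on data where no single attribute helps on its own.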
Candidates for Fosamax (osteoporosis drug): Stylized Example • Buyers by age band: Age < 20: 0% buyers; Age 20 to 40: 40% buyers; Age 40 to 60: 60% buyers; Age > 60: 100% buyers • Traditional engineering goal: • predict who are buyers, and who are not • New profit-optimal goal: • predict who are profitable, and who are not
Candidates for Fosamax • Assume: • cost of contacting candidate = R1 • profit from sale = R10 • Buyers by age band: Age < 20: 0% buyers; Age 20 to 40: 40% buyers; Age 40 to 60: 60% buyers; Age > 60: 100% buyers • Contact all 40 candidates: • 20 purchasers out of 40 candidates contacted • Total profit = (20 x 10) – (40 x 1) = 200 – 40 = R160
Traditional Split (Lowest Entropy / Highest Purity) • Buyers by age band: Age < 20: 0% buyers; Age 20 to 40: 40% buyers; Age 40 to 60: 60% buyers; Age > 60: 100% buyers • Split: Age < 60 (mostly non-buyers) versus Age > 60 (mostly buyers) • Contact only the segment aged greater than 60 (mostly buyers): • 10 purchasers out of 10 candidates contacted • Total profit = (10 x 10) – (10 x 1) = 100 – 10 = R90
Traditional Split with Re-Labeling • Buyers by age band: Age < 20: 0% buyers; Age 20 to 40: 40% buyers; Age 40 to 60: 60% buyers; Age > 60: 100% buyers • Split: Age < 60 (mostly non-buyers, but still some profit) versus Age > 60 (mostly buyers) • Even though the segment younger than 60 is mostly non-buyers (10 buyers out of 30 candidates), our profit-cost ratio is high enough that we should contact them anyway: • Additional profit from "mostly non-buyers" = (10 x 10) – (30 x 1) = R70 • Total profit = R90 + R70 = R160
Profit-Optimal Split with Sequential Binary Programming • Buyers by age band: Age < 20: 0% buyers; Age 20 to 40: 40% buyers; Age 40 to 60: 60% buyers; Age > 60: 100% buyers • Split: Age < 20 (unprofitable segment) versus Age > 20 (profitable segment) • Profit-optimal partition is a different tree structure entirely. Contact only people aged greater than 20. • Profit from profit-optimal segmentation = (20 x 10) – (30 x 1) = 200 – 30 = R170
Sequential Binary Programming • For each level of the tree, starting at the root, solve the following binary integer program: • Decision Variables: $X_i$ = Use partition $i$ or not (binary). For each attribute, try different cut-off values (partitions). For example: • $X_1$ = Partition is "Age > 0" • $X_2$ = Partition is "Age > 20" • $X_3$ = Partition is "Age > 40" • $X_4$ = Partition is "Age > 60" • Constraints: • $X_i$ are binary: $X_i \in \{0, 1\}$ • Exactly one partition is chosen at a time: $\sum_i X_i = 1$ • (continued on next slide)
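Sequential Binary Programming (continued from previous slide) • Profit per buyer = $P$ • Cost per candidate = $C$ • Objective function: Maximize Profit
$$\text{Profit} = (\text{Buyers} \times P) - (\text{Candidates} \times C)$$
$$= X_1\,[(\text{Buyers in } X_1 \times P) - (\text{Candidates in } X_1 \times C)] + X_2\,[(\text{Buyers in } X_2 \times P) - (\text{Candidates in } X_2 \times C)] + X_3\,[(\text{Buyers in } X_3 \times P) - (\text{Candidates in } X_3 \times C)] + X_4\,[(\text{Buyers in } X_4 \times P) - (\text{Candidates in } X_4 \times C)]$$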
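For a single level of the tree, the binary program above is small enough to solve by brute force. The sketch below enumerates every binary vector, keeps those that satisfy the "exactly one partition" constraint, and maximizes the objective. It stands in for a real integer-programming solver and covers one level only; the buyer and candidate counts are derived from the stylized Fosamax example (10 candidates per age band), and the names are illustrative assumptions.

```python
from itertools import product

P, C = 10, 1  # profit per buyer and cost per candidate, in rand

# Candidate partitions X1..X4: (label, buyers in partition, candidates in partition)
PARTITIONS = [
    ("Age > 0", 20, 40),
    ("Age > 20", 20, 30),
    ("Age > 40", 16, 20),
    ("Age > 60", 10, 10),
]


def objective(x):
    """Profit for a binary decision vector x = (X1, X2, X3, X4)."""
    return sum(xi * (buyers * P - candidates * C)
               for xi, (_, buyers, candidates) in zip(x, PARTITIONS))


def solve_level():
    """Enumerate binary vectors, keep the feasible ones, return the best."""
    feasible = (x for x in product((0, 1), repeat=len(PARTITIONS))
                if sum(x) == 1)  # exactly one partition is chosen
    best = max(feasible, key=objective)
    return PARTITIONS[best.index(1)][0], objective(best)


print(solve_level())
```

Because exactly one variable is 1, the objective collapses to the profit of the chosen partition, so the program picks "Age > 20" with profit R170, as on the profit-optimal split slide.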
Making Trees Cost-Sensitive • Cost-sensitive learning: • “Change tree output”: Determine lowest expected cost class (Threshold Reclassification) [Elkan 01] [Margineantu 02] • “Change tree input”: Modify the distribution of the training set (MetaCost, Stratification, C-SVMs) [Domingos 99, Zadrozny 03] • “Change tree structure”: Make the classifier algorithm cost sensitive [Granger 69; Ling 04; Abrahams, Becker, Fleder, MacMillan]
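As a rough illustration of the "change tree output" approach, a leaf can be re-labelled by comparing its buyer probability with a break-even threshold instead of with 50%. The sketch assumes the simple cost structure of the Fosamax example (a fixed cost per contact and a fixed profit per sale); the function names and defaults are hypothetical.

```python
def break_even_probability(profit_per_sale, cost_per_contact):
    """Smallest buyer probability at which contacting pays for itself."""
    return cost_per_contact / profit_per_sale


def relabel(p_buy, profit_per_sale=10.0, cost_per_contact=1.0):
    """Label a leaf by expected profit rather than by majority class."""
    expected_profit = p_buy * profit_per_sale - cost_per_contact
    return "contact" if expected_profit > 0 else "skip"


# Leaves with 0%, 40%, 60% and 100% buyers, as in the Fosamax example.
print([relabel(p) for p in (0.0, 0.4, 0.6, 1.0)])
```

With R10 profit and R1 cost the break-even probability is 10%, so a "mostly non-buyers" leaf with 40% buyers is still worth contacting. The tree structure itself is unchanged; only the labels on the leaves move.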
Evaluating Decision Trees • Profit on unseen test set: gain charts • Robustness: volatility of profit • Ability to express misclassification cost functions • Note: there is seldom one ‘best’ tree but usually many good trees to choose from.
Evaluating Robustness • Cross-validation: try model on different test sets • [Chart: profit ($) for each test set] • Determine: • Mean • Min and Max • Standard deviation
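Once a model has been scored on several test sets, the robustness summary listed above takes only a few lines. The profit figures below are made-up placeholders, and the function name is an assumption.

```python
from statistics import mean, stdev


def profit_robustness(test_set_profits):
    """Summarize how volatile profit is across cross-validation test sets."""
    return {
        "mean": mean(test_set_profits),
        "min": min(test_set_profits),
        "max": max(test_set_profits),
        "std": stdev(test_set_profits),  # sample standard deviation
    }


# Hypothetical profit (in $) obtained on five different test sets.
summary = profit_robustness([150, 170, 160, 140, 180])
print(summary)
```

Two trees with the same mean profit can differ sharply in minimum and standard deviation, which is the volatility the slide asks us to check.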
Expressing Profit / Cost Functions • Traditional approaches typically permit only constant costs, as seen in this standard misclassification/confusion matrix: • [Table: Actual class versus Predicted class]
Expressing Profit / Cost Functions • Our Profit-optimal decision trees (SBP) allow you to specify an arbitrary profit function, e.g. dependent on: • Number of buyers in partition (e.g. fixed batch costs, giving economies of scale for large partition sizes) • Physical distance between buyers in partition (e.g. Fedex) • Legislative regulations specific to partition (e.g. government subsidies for ARVs to the poor) • etc…
Profit-Optimal Split: Advantages of Expressive Profit Functions • Buyers by age band: Age < 20: 0% buyers; Age 20 to 40: 40% buyers; Age 40 to 60: 60% buyers; Age > 60: 100% buyers • Split: Age > 0 is now the profitable segment (the previously unprofitable segment becomes viable through scale economies) • Assume the post office gives us a 50% discount for mailings of 40 or more. • The profit-optimal segmentation would now be "Age > 0" since this gives profit = (20 x 10) – (40 x 0.5) = R180 (compared to the original profit of R170)
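The volume discount is an example of a profit function that depends on partition size rather than on a constant per-candidate cost. A minimal sketch, assuming the discount applies to every item once the mailing reaches 40 and using counts derived from the stylized example; all names are illustrative.

```python
def mailing_cost(n_candidates, unit_cost=1.0, bulk_threshold=40, bulk_discount=0.5):
    """Cost of a mailing; the per-item rate drops once the threshold is reached."""
    rate = unit_cost * (1 - bulk_discount) if n_candidates >= bulk_threshold else unit_cost
    return n_candidates * rate


def partition_profit(buyers, candidates, profit_per_sale=10.0):
    return buyers * profit_per_sale - mailing_cost(candidates)


# (buyers, candidates) for each candidate partition in the Fosamax example.
PARTITIONS = {
    "Age > 0": (20, 40),
    "Age > 20": (20, 30),
    "Age > 40": (16, 20),
    "Age > 60": (10, 10),
}

profits = {label: partition_profit(b, n) for label, (b, n) in PARTITIONS.items()}
best = max(profits, key=profits.get)
print(best, profits[best])
```

Only the cost function changed, yet the best partition moves from "Age > 20" (R170) to "Age > 0" (R180). A constant cost matrix cannot express this kind of size-dependent cost.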
Conclusion / Summary • What are DTs? • Decision trees are helpful for prediction • Traditional trees in engineering maximize purity • Profit-optimal trees maximize profit • Why use DTs? • Wide range of industries can benefit: retail, credit, banking, telecoms, entertainment, … • How do I construct and evaluate a DT? • Profit optimal trees can be constructed easily using Sequential Binary Programming (SBP) • SBP trees can express arbitrary cost functions.
References
Abrahams, A., A. Becker, D. Fleder, and I. MacMillan. 2005. Handling generalized cost functions in the partitioning optimization problem through sequential binary programming. Fifth IEEE International Conference on Data Mining.
Elkan, C. 2001. The foundations of cost-sensitive learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence.
Provost, F. and P. Domingos. 2003. Tree induction for probability-based rankings. Machine Learning, 52(3):199-215.
Zadrozny, B. 2003. Policy mining: Learning decision policies from fixed sets of data. Ph.D. Thesis, University of California, San Diego.
Zadrozny, B. and C. Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. Proceedings of the Eighteenth International Conference on Machine Learning.
Zadrozny, B., J. Langford, and N. Abe. 2003. Cost-sensitive learning by cost-proportionate example weighting. Proceedings of the 2003 IEEE International Conference on Data Mining.
Questions?
Further Example Applications • Given a supermarket database of purchase transactions, marked with customers who did and did not use coupons, we can build a decision tree to determine which variables influence coupon usage and how much. Our dependent variable here is COUPON_USED. Our independent variables could be the time of day, the type of customer, the number of television / newspaper / magazine / in-store advertisements, or other factors.
Revision: Components of Automated Learning Methods (Classification Algorithms) • Training Set: we aim to proceed from individual cases to general principles - learning from examples • Learning / Search Algorithm (e.g. Rule inducer, Decision Tree generator, Neural Net weight adjuster): These search for the best descriptions of each concept (class). • Knowledge Representation (e.g. set of rules, Decision Trees) • Assessment / Evaluation Scheme. We can measure: • accuracy of the model: are descriptions consistent and complete? • profitability (cost of misclassifications): usually more important than simple accuracy! • simplicity / understandability (comprehensibility) • handling of noise • robustness (confidence of predictions; flexibility on new data) • ability to handle continuous data (e.g. binning or normalizing may introduce bias) • computational cost of building and using the model • adequacy of the representation language (e.g. as we’ll see decision trees often don’t support complex functions on attributes)
Discovering and Using Rules • [Diagram: Training Data feeds an Automated Knowledge Acquisition Program (e.g. rule inducer, decision tree algorithm), which produces IF-THEN rules. The rules go into a Rule-Based System (Knowledge Base + Inference Engine), which takes New Data and outputs Predictions / Classifications.] • Aim is to avoid the manual knowledge acquisition bottleneck!
Outline – Decision Trees • Terminology & Goals • Algorithms • Using a Decision Tree • Classifying a new example • Reading the rules off a tree • Building a Decision Tree (Training the Model) • Choosing the best split: Measuring impurity (heterogeneity) • Stopping criteria • Dealing with Missing Values in Using & in Building • Weaknesses • Axis parallel splits • Overfitting & Underfitting • Pruning • Variations • Regression Trees • Model Trees
Goals • Building Decision Trees • Using Decision Trees
Goals: Using Decision Trees • Users of Decision Trees aim to classify or predict the values of new examples by feeding them into the root of the tree, and determining which leaf the example flows to. • For categorical outputs (classification): The leaves of the tree assign a class label or a probability of being in a class • For numeric outputs (prediction): The leaves of the tree assign an average value (‘regression trees’) or specify a function that can be used to compute a value (‘model trees’) for examples that reach that node. • Users of Decision Trees may also want to derive a set of rules (descriptions) that describe the general characteristics of each class.
Goals: Building Decision Trees • Builders of Decision Trees aim to maximize the purity (homogeneity) of outputs at each node. That is, minimize the impurity (heterogeneity) of outputs at each node. • For categorical outputs (classification trees): Achieve nodes such that one class is predominant at each node. • For numeric outputs (prediction trees): Achieve nodes such that means between nodes vary as much as possible and standard-deviation or variance (i.e. dispersion) within each node is as low as possible. • Decision tree methods are often referred to as Recursive Partitioning methods as they repeatedly partition the data into smaller and smaller - and purer and purer - subsets.
Recursive Steps in Building a Tree: Example • STEP 1: Split Option A: Not good, as sub-nodes are still very heterogeneous! • STEP 1: Split Option B: Better, as purity of sub-nodes is improving. • STEP 2: Choose Split Option B as it is the better split. • STEP 3: Try out splits on each of the sub-nodes of Split Option B. Eventually, we arrive at the final tree. • Notice how examples in a parent node are split between sub-nodes - i.e. notice how the training examples are partitioned into smaller and smaller subsets. Also, notice that sub-nodes are purer than parent nodes.
Building a Tree: Choosing a Split: Information Gain • We will deal only with C4.5’s Information Gain metric of purity in this lecture. • A node is highly pure if a particular class predominates amongst the examples at the node. • ‘Information Gain’ measures the gain in weighted node purity (i.e. reduction in impurity) as a result of choosing a particular split. By weighted node purity, we mean that, when adding node purity metrics, a node with few examples in it has a lower weight than a node with many examples in it. • A highly impure sub-node is said to have high entropy (be highly chaotic), because the examples at the sub-node are very heterogeneous. • In contrast, our goal is to obtain pure sub-nodes with low entropy, so that the examples at each node of the split are very homogeneous.
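Entropy and information gain can be computed directly from class counts. The sketch below uses the four loan applicants from the earlier split example; it implements plain information gain (parent entropy minus the size-weighted entropy of the sub-nodes), without any further normalization, and the function names are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for an even two-class mix."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the sub-nodes."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Sub-nodes with few examples get a lower weight than large ones.
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


applicants = [
    {"Children": "Many", "Income": "Medium", "Status": "DEFAULTS"},
    {"Children": "Many", "Income": "Low", "Status": "DEFAULTS"},
    {"Children": "Few", "Income": "Medium", "Status": "PAYS"},
    {"Children": "Few", "Income": "High", "Status": "PAYS"},
]
for attribute in ("Children", "Income"):
    print(attribute, information_gain(applicants, attribute, "Status"))
```

The Children split removes all of the parent's 1 bit of entropy, while the Income split leaves the mixed Medium node and gains only 0.5 bits, which is why Children is ranked as the better split.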