

  1. Chapter 10. Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes
  Jean-Hugues Chauchat and Ricco Rakotomalala, Laboratory ERIC – University Lumière Lyon
  Summarized by Seong-Bae Park

  2. Introduction
  • A fast and efficient sampling strategy to build decision trees from a very large database.
  • Proposes a strategy using successive samples, one on each tree node.

  3. Framework • Play Tennis Table

  4. Handling Continuous Attributes in Decision Trees
  • Discretization
  • Global discretization: each continuous attribute is converted to a discrete one before tree construction.
    1. Each continuous variable is sorted.
    2-1. Several cutting points are tested to find the subdivision that is best according to the class attribute, using a splitting measure (entropy gain, chi-square, purity measure).
    2-2. The number of intervals and their boundaries are searched for.
  • Local discretization: a binary cut point is searched for on each node (a sketch follows this slide).
    • It is not necessary to determine how many intervals should be created, since each split creates exactly two intervals.
    • Interaction among attributes is accounted for.
    • It initially requires sorting the values, O(n log n); sampling is needed to reduce n.
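
To make the local (binary) discretization concrete, here is a minimal sketch assuming a simple entropy-gain criterion; the function names are illustrative and not taken from the chapter:

```python
# Local binary discretization of one continuous attribute on one node:
# sort the values, test midpoints between consecutive distinct values,
# keep the cut point with the highest entropy (information) gain.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Best threshold and entropy gain for one continuous attribute on one node."""
    pairs = sorted(zip(values, labels))              # O(n log n) sort of the attribute
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    n, h_parent = len(ys), entropy(ys)
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                                 # no cut point between equal values
        left, right = ys[:i], ys[i:]
        gain = h_parent - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, (xs[i - 1] + xs[i]) / 2.0
    return best_cut, best_gain

# Example: temperature-like values against a binary class
print(best_binary_cut([64, 65, 68, 69, 70, 71, 72, 75, 80, 85],
                      ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "no", "no"]))
```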

  5. Local Sampling Strategy
  • During construction, on each leaf, a sample is drawn from the part of the database that corresponds to the path associated with that leaf.
  • Process (a sketch follows this list):
    1. A complete list of the individuals in the base is drawn up.
    2. The first sample is selected while the base is being read.
    3. This sample is used to identify the best segmentation attribute, if it exists; otherwise the stopping rule has played its role and the node becomes a terminal leaf.
    4. If a segmentation is possible, the list from step 1 is broken up into sub-lists corresponding to the leaves just obtained.
    5. Step 4 requires a pass through the database to update each example's leaf; this pass is an opportunity to select the samples that will be used in later computations.
  • Steps 3 to 5 are iterated until all nodes have become terminal leaves.
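
A schematic rendering of this loop, with `find_split` and `stop_rule` left as placeholders; the names and the plain-Python tree structure are mine, not the authors':

```python
import random

def grow_tree(db, sample_size, find_split, stop_rule):
    """Grow a tree keeping, for each open leaf, the list of record indices routed to it."""
    root = {"indices": list(range(len(db))), "split": None, "children": None}
    open_leaves = [root]
    while open_leaves:                                   # iterate steps 3-5
        leaf = open_leaves.pop()
        k = min(sample_size, len(leaf["indices"]))
        sample = [db[i] for i in random.sample(leaf["indices"], k)]   # per-leaf sample (steps 2/5)
        split = None if stop_rule(sample) else find_split(sample)     # step 3
        if split is None:
            continue                                     # stopping rule: terminal leaf
        leaf["split"], leaf["children"] = split, {}
        for i in leaf["indices"]:                        # steps 4-5: one pass to re-dispatch records
            branch = split(db[i])
            child = leaf["children"].setdefault(
                branch, {"indices": [], "split": None, "children": None})
            child["indices"].append(i)
        open_leaves.extend(leaf["children"].values())
    return root
```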

  6. Local Sampling Strategy

  7. Determining the Sample Size
  • The sample must be large enough that:
    1) the split is recognized as such, i.e. the power of the test is sufficient;
    2) the discretization point is estimated as precisely as possible;
    3) if several splitting attributes are possible on the given node of the base, the criterion for the optimal attribute remains maximal in the sample.

  8. Testing Statistical Significance for a Link
  • For each node, we use statistical testing concepts: the probabilities of type I and type II errors (α and β).
  • We look for the attribute that provides the best split according to the criterion T.
  • The split is done if two conditions are met: 1) this split is the best; 2) this split is possible, i.e. T(sample data) is unlikely when H0 is true.
  • Null hypothesis H0: "There is no link between the class attribute and the predictive attribute we are testing."
  • p-value: the probability, under H0, of T being greater than or equal to T(sample data).
  • H0 is rejected, so the split is possible, if the p-value is less than a predetermined significance level α.
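
As a small illustration (not from the chapter), the decision rule for one candidate attribute can be written with SciPy's chi-square test of independence; the table counts are made up:

```python
import numpy as np
from scipy.stats import chi2_contingency

alpha = 0.001                          # a deliberately small significance level
table = np.array([[30, 10],            # rows: class values, columns: attribute intervals
                  [12, 28]])
stat, p_value, dof, _ = chi2_contingency(table, correction=False)
split_is_possible = p_value < alpha    # reject H0 => the split is possible
print(f"T = {stat:.2f}, dof = {dof}, p-value = {p_value:.4g}, split possible: {split_is_possible}")
```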

  9. Testing Statistical Significance for a Link
  • Because several attributes are tested on each node (multiple hypotheses), the true significance level α' is larger than α.
  • With K candidate attributes tested independently at level α, the probability α' of observing at least one p-value smaller than α is α' = 1 − (1 − α)^K.
  • One must therefore use a very small value for α.
  • The significance level α limits the type I error probability.
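
A quick numeric check of this effect, with an assumed K = 20 candidate attributes (my own illustration); the second part shows the Šidák-style per-attribute level needed to keep the overall level at 0.05:

```python
K, alpha = 20, 0.05                    # assumed number of candidate attributes and per-test level
alpha_prime = 1 - (1 - alpha) ** K
print(f"true level alpha' = {alpha_prime:.3f}")      # about 0.64: far above 0.05

target = 0.05                          # desired overall (family-wise) level
per_test = 1 - (1 - target) ** (1 / K)
print(f"per-attribute alpha = {per_test:.5f}")       # about 0.0026: very small, as the slide says
```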

  10. Notations
  • Y : class attribute
  • X : predictor attributes
  • π_ij : the proportion of (Y = Y_i and X = X_j) in the sub-population corresponding to the working node
  • π_i+ and π_+j : the marginal proportions
  • π⁰_ij = π_i+ · π_+j : the products of the marginal proportions
  • n_ij : the count in cell (i, j) of the sample cross-tabulation
  • E(n_ij) = n · π_ij : the expected value of n_ij

  11. Probability Distribution of the Criterion
  • The link is measured by the χ² statistic or by the information gain.
  • When H0 is true and the sample size is large, both have an approximate chi-square distribution with ν = (p − 1)(q − 1) degrees of freedom.
  • When H0 is false, the distribution is approximately a non-central chi-square with non-centrality parameter λ.
  • Central chi-square distribution: when H0 is true, λ = 0.
  • The further the truth is from H0, the larger λ.
  • Non-central chi-square distribution: no closed analytic formulation; asymptotically normal for large values of λ.
  • λ is a function of the sample size n and of the proportions π_ij in the whole database.

  12. Probability Distribution of the Criterion
  • The value of λ (with π⁰_ij = π_i+ · π_+j):
    • For the information gain: λ = 2n Σ_ij π_ij ln(π_ij / π⁰_ij)
    • For the χ² statistic: λ = n Σ_ij (π_ij − π⁰_ij)² / π⁰_ij
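
Assuming the standard non-centrality expressions above, a small helper (names are mine) computes λ for both criteria from the matrix of cell proportions π_ij:

```python
import numpy as np

def noncentrality(pi, n):
    """lambda for the chi-square and information-gain criteria, from the pi_ij table."""
    pi = np.asarray(pi, dtype=float)
    pi0 = np.outer(pi.sum(axis=1), pi.sum(axis=0))   # pi0_ij = pi_i+ * pi_+j
    lam_chi2 = n * np.sum((pi - pi0) ** 2 / pi0)
    mask = pi > 0                                    # empty cells contribute nothing
    lam_gain = 2 * n * np.sum(pi[mask] * np.log(pi[mask] / pi0[mask]))
    return lam_chi2, lam_gain

# Example: a mild 2x2 link on a node holding 500 individuals
print(noncentrality([[0.30, 0.20], [0.20, 0.30]], n=500))   # both values close to 20
```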

  13. Equalizing of Normal Risk Probabilities
  • Find the minimal sample size n that gives a power of (1 − β) for the test at level α.
  • T_{1−α} : the critical value (of the central chi-square distribution with ν degrees of freedom).
  • If p = q = 2, then ν = 1 and λ = nR².
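
As a worked illustration (mine, using SciPy's exact noncentral chi-square rather than a normal approximation), the minimal n is the smallest sample size whose power P(χ'²_ν(nR²) ≥ T_{1−α}) reaches 1 − β:

```python
from scipy.stats import chi2, ncx2

def minimal_sample_size(R2, alpha=0.001, beta=0.10, df=1):
    """Smallest n whose power P(ncx2(df, n*R2) >= T_{1-alpha}) reaches 1 - beta."""
    t_crit = chi2.ppf(1 - alpha, df)                 # critical value T_{1-alpha}
    n = df + 1
    while ncx2.sf(t_crit, df, n * R2) < 1 - beta:    # survival function = power at this n
        n += 1
    return n

for R2 in (0.05, 0.01, 0.005):                       # weaker links need larger samples
    print(f"R2 = {R2}: n = {minimal_sample_size(R2)}")
```

Running this shows that smaller R² (a weaker link) and smaller α both push n up, which is the behaviour summarized on the next slide.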

  14. Equalizing of Normal Risk Probabilities
  • The weaker the link (R²) is in the database, the larger the sample size must be to provide evidence for it.
  • n increases as the significance level α decreases: if one wants to reduce the risk probabilities, a larger sample is needed.

  15. Sampling Methods
  • Algorithm S
    • Sequentially processes the DB records and determines whether each record is selected.
    • The first record is selected with probability n / N.
    • If m records have been selected among the first t records, the (t+1)-st record is selected with probability (n − m) / (N − t).
    • Stops when n records have been selected (a sketch follows this slide).
  • Algorithm D
    • Jumps a random number of records between two selected records.
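
A direct transcription of Algorithm S as described above (illustrative code, not the authors' implementation):

```python
import random

def algorithm_s(records, n):
    """One sequential pass over N records, returning a simple random sample of size n."""
    N = len(records)
    sample, m = [], 0                       # m = number of records already selected
    for t, record in enumerate(records):    # t = number of records already scanned
        if random.random() < (n - m) / (N - t):
            sample.append(record)
            m += 1
            if m == n:
                break                       # stop as soon as n records are selected
    return sample

print(algorithm_s(range(1_000_000), 5))
```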

  16. Experiments
  • Objectives of the experiments
    • To show that a tree built with local sampling has a generalization error rate comparable to that of a tree built with the complete database.
    • To show that sampling reduces computing time.
  • Artificial database
    • Artificial problem: Breiman et al.'s "waves" (waveform) data.
    • Two files are generated 100 times: one of 500,000 records for training, the other of 50,000 records for validation.
    • Binary discretization; CHAID decision tree algorithm.
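
For reference, here is a sketch of the "waves" generator written from the standard description of Breiman et al.'s waveform benchmark (21 continuous attributes, 3 equiprobable classes); it is an assumption that the chapter used exactly this formulation:

```python
import numpy as np

def waveform(n_samples, rng):
    """21-attribute, 3-class 'waves' data: a random convex mix of two base waves plus noise."""
    idx = np.arange(1, 22)
    H = np.array([np.maximum(6 - np.abs(idx - c), 0) for c in (7, 15, 11)])  # 3 triangular base waves
    combos = np.array([(0, 1), (0, 2), (1, 2)])      # pair of base waves mixed by each class
    y = rng.integers(0, 3, n_samples)
    u = rng.uniform(0, 1, (n_samples, 1))            # random convex weight per record
    X = u * H[combos[y, 0]] + (1 - u) * H[combos[y, 1]] + rng.normal(size=(n_samples, 21))
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = waveform(500_000, rng)            # training file, as in the experiment
X_valid, y_valid = waveform(50_000, rng)             # validation file
```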

  17. Experiments
  • As the sample size grows, the marginal gain becomes weak.

  18. With Real Benchmark DBs
  • 5 DBs from the UCI repository, containing more than 12,900 individuals.
  • The following operations are repeated 10 times (a sketch follows this list):
    • Randomly subdivide the DB into a training set and a test set.
    • Test the trees.
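
A generic sketch of this repeated random-subdivision protocol; scikit-learn's tree and a synthetic dataset are stand-ins for the chapter's algorithm and the UCI databases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def repeated_holdout(X, y, repeats=10, test_size=0.3):
    """Error rate (mean, std) over `repeats` random train/test subdivisions."""
    errors = []
    for seed in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        errors.append(1.0 - tree.score(X_te, y_te))
    return float(np.mean(errors)), float(np.std(errors))

X, y = make_classification(n_samples=20_000, n_features=21, n_informative=10,
                           n_classes=3, random_state=0)
print(repeated_holdout(X, y))
```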

  19. With Real Benchmark DBs

  20. With Real Benchmark DBs
  • The influence of n: the sample size must not be too small.
  • Sampling drastically reduces computing time.
  • "Letter" DB: data fragmentation.

  21. Conclusions
  • Working on samples is useful.
  • The "step by step" character of decision tree induction allows a strategy using successive samples, one on each node.
  • Theoretical and empirical evidence supports this.
  • Open problems:
    • Optimal sampling methods
    • Learning with imbalanced classes
    • Local equal-size sampling
