Chapter 10. Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes. Jean-Hugues Chauchat and Ricco Rakotomalala, Laboratory ERIC, University Lumière Lyon. Summarized by Seong-Bae Park
Introduction • A fast and efficient sampling strategy for building decision trees from a very large database • Proposal: a strategy using successive samples, one on each tree node
Framework • Illustrative example: the classic "Play Tennis" table
Handling Continuous Attributes in DT • Discretization • Global discretization: each continuous attribute is converted to a discrete one before the tree is built. 1. Each continuous variable is sorted. 2-1. Several cutting points are tested to find the subdivision that is best with respect to the class attribute, using a splitting measure (entropy gain, chi-square, purity measure). 2-2. Both the number of intervals and their boundaries must be determined. • Local discretization: the best cut point is searched for at each node. • It is not necessary to decide how many intervals to create, since each split creates exactly two intervals. • Interactions among attributes are taken into account. • It initially requires sorting the values, which is O(n log n), hence the need for sampling to reduce n. (A small sketch of a local binary split search is given below.)
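A minimal sketch (not the authors' code) of local binary discretization at a node: the continuous attribute is sorted once, then every boundary between distinct values is tried and the cut point maximizing entropy gain with respect to the class attribute is kept.

```python
# Minimal sketch: local binary discretization of one continuous attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_binary_cut(values, labels):
    """Return (best_cut, best_gain) for a single continuous attribute."""
    pairs = sorted(zip(values, labels))          # the O(n log n) sort mentioned above
    parent_entropy = entropy(labels)
    best_cut, best_gain = None, 0.0
    n = len(pairs)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                             # only test boundaries between distinct values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lbl for _, lbl in pairs[:i]]
        right = [lbl for _, lbl in pairs[i:]]
        gain = parent_entropy - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain
```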
Local Sampling Strategy • During construction, on each leaf a sample is drawn from the part of the database corresponding to the path associated with that leaf. • Process: 1. A complete list of the individuals in the base is drawn up; 2. The first sample is selected while the base is being read; 3. This sample is used to identify the best segmentation attribute; if none exists, the stopping rule applies and the node becomes a terminal leaf; 4. If a segmentation is possible, the list from step 1 is broken up into sub-lists corresponding to the leaves just obtained; 5. Step 4 requires a pass through the DB to update each example's leaf; this pass is also used to select the samples needed for later computations. • Steps 3 to 5 are iterated until all nodes have become terminal leaves. (A rough sketch of this loop is shown below.)
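A rough sketch of the node-by-node loop above, with hypothetical helpers (`find_best_split`, `stopping_rule`, `split.branch_of`) standing in for the actual splitting and stopping machinery; this is not the authors' implementation.

```python
# Rough sketch of the local sampling loop: each node draws its own sample
# from the records that reach it, then dispatches records to its children.
import random

def grow_tree(db, sample_size, find_best_split, stopping_rule):
    root = {"indices": list(range(len(db))), "children": None}
    open_nodes = [root]
    while open_nodes:
        node = open_nodes.pop()
        idx = node["indices"]
        # Steps 2/5: sample only from the records corresponding to this node
        sample = random.sample(idx, min(sample_size, len(idx)))
        # Step 3: search for the best segmentation attribute on the sample
        split = find_best_split(db, sample)
        if split is None or stopping_rule(db, sample):
            continue                              # terminal leaf
        # Step 4: one pass over this node's records to build the sub-lists
        node["split"] = split
        buckets = {}
        for i in idx:
            buckets.setdefault(split.branch_of(db[i]), []).append(i)
        node["children"] = [{"indices": b, "children": None} for b in buckets.values()]
        open_nodes.extend(node["children"])
    return root
```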
Determining the Sample Size • The size of the sample must be such that: 1) the split is recognized as such, i.e. the power of the test is sufficient; 2) the discretization point is estimated as precisely as possible; 3) if several splitting attributes are possible on the given node of the base, the attribute that is optimal on the base remains optimal (the criterion remains maximal) in the sample.
Testing Statistical Significance of a Link • For each node, we use statistical testing concepts: the probabilities of type I and type II errors (α and β). • We look for the attribute that provides the best split according to the criterion T. • The split is done if two conditions are met: 1) this split is the best, and 2) this split is admissible, i.e. T(sample data) is unlikely when H0 is true. • Null hypothesis H0: "There is no link between the class attribute and the predictive attribute being tested." • p-value: the probability of T being greater than or equal to T(sample data) under H0. • H0 is rejected, and the split is allowed, if the p-value is less than a predetermined significance level α. (A small worked example is sketched below.)
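A minimal sketch (assuming scipy is available) of this test: the link between a candidate split and the class attribute is assessed with a chi-square test on the node-level contingency table; the counts and the value of α are illustrative only.

```python
# Minimal sketch: chi-square significance test for one candidate split.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = split branches, columns = class values.
table = np.array([[30, 10],
                  [12, 28]])

stat, p_value, dof, expected = chi2_contingency(table, correction=False)
alpha = 0.001                       # a very small alpha, as the slides recommend
print(f"chi2 = {stat:.2f}, dof = {dof}, p-value = {p_value:.4g}")
print("reject H0 (split allowed)" if p_value < alpha else "keep H0 (no split)")
```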
Testing Statistical Significance of a Link (continued) • Because several attributes are tested at each node (multiple hypotheses), the true significance level α' is larger than the nominal level α. • If p candidate attributes are tested independently, the probability α' of observing at least one p-value smaller than α is α' = 1 - (1 - α)^p. • One must therefore use a very small value for α. • The significance level limits the type I error probability.
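A tiny numerical sketch of the formula above, showing how fast the effective type I error grows with the number of candidate attributes (independence of the tests is assumed for illustration).

```python
# Minimal sketch: effective type I error alpha' for p independent tests.
alpha = 0.05
for p in (1, 10, 100, 1000):
    alpha_prime = 1 - (1 - alpha) ** p
    print(f"p = {p:4d}  alpha' = {alpha_prime:.4f}")
# With p = 100 attributes, alpha' is already about 0.994, which is why the
# slides recommend using a very small nominal alpha.
```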
Notations • Y: class attribute • X: a predictor attribute • π_ij: the proportion of (Y = Y_i and X = X_j) in the sub-population corresponding to the working node • π_i+ and π_+j: the marginal proportions • π⁰_ij = π_i+ π_+j: the products of the marginal proportions (the proportions expected under H0) • n_ij: the count of cell (i, j) in the sample cross-tabulation • E(n_ij) = n π_ij: the expected value of n_ij
Probability Distribution of the Criterion • The link is measured by the χ² statistic or by the information gain. • When H0 is true and the sample size is large, both have an approximate chi-square distribution with (p - 1)(q - 1) degrees of freedom. • When H0 is false, the distribution is approximately a non-central chi-square with non-centrality parameter λ. • Central chi-square distribution: when H0 is true, λ = 0. • The further the truth is from H0, the larger λ. • Non-central chi-square distribution: no closed analytic formulation, but asymptotically normal for large values of λ. • λ is a function of the sample size n and of the frequencies π_ij in the whole database.
Probability Distribution of the Criterion (continued) • The value of λ (the standard non-centrality forms): • For the information gain: λ = 2n Σ_ij π_ij ln( π_ij / (π_i+ π_+j) ) • For the χ² statistic: λ = n Σ_ij (π_ij - π_i+ π_+j)² / (π_i+ π_+j)
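A minimal sketch computing both non-centrality values from a table of joint proportions π_ij; the proportions used here are hypothetical.

```python
# Minimal sketch: non-centrality parameter lambda for both criteria.
import numpy as np

def noncentrality(pi, n):
    """pi: 2-D array of joint proportions pi_ij (sums to 1); n: sample size."""
    pi_i = pi.sum(axis=1, keepdims=True)        # row marginals pi_i+
    pi_j = pi.sum(axis=0, keepdims=True)        # column marginals pi_+j
    pi0 = pi_i * pi_j                           # expected proportions under H0
    lam_chi2 = n * ((pi - pi0) ** 2 / pi0).sum()
    lam_gain = 2 * n * (pi * np.log(pi / pi0)).sum()
    return lam_chi2, lam_gain

pi = np.array([[0.30, 0.20],
               [0.15, 0.35]])                   # hypothetical joint proportions
print(noncentrality(pi, n=200))
```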
Equalizing the Normal Risk Probabilities • Find the minimal sample size that achieves a power of (1 - β) at significance level α. • T_{1-α}: the critical value. • If p = q = 2, then ν = 1 and λ = nR².
Equalizing the Normal Risk Probabilities (continued) • The weaker the link (R²) is in the database, the larger the sample must be to provide evidence for it. • n also increases as the significance level α decreases: reducing the risk probabilities requires a larger sample. (A numerical sketch of this sample-size computation follows.)
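A minimal sketch (assuming scipy) of the sample-size computation for the p = q = 2 case above: the smallest n such that a chi-square test with one degree of freedom and non-centrality λ = nR² reaches power 1 - β at level α; the brute-force search stands in for whatever closed-form approximation the chapter uses.

```python
# Minimal sketch: smallest n giving power 1 - beta for a df = 1 chi-square test.
from scipy.stats import chi2, ncx2

def minimal_n(r2, alpha=0.001, beta=0.05, df=1, n_max=100_000):
    crit = chi2.ppf(1 - alpha, df)              # critical value T_{1 - alpha}
    for n in range(2, n_max):
        power = ncx2.sf(crit, df, n * r2)       # P(T > crit | lambda = n * R^2)
        if power >= 1 - beta:
            return n
    return None

print(minimal_n(r2=0.05))   # weak link: a fairly large sample is needed
print(minimal_n(r2=0.30))   # stronger link: a much smaller sample suffices
```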
Sampling Methods • Algorithm S (sequential selection sampling) • Processes the DB records sequentially and decides, for each record, whether it is selected. • The first record is selected with probability n / N. • If m records have been selected among the first t records, the (t+1)-th record is selected with probability (n - m) / (N - t). • The pass stops as soon as n records have been selected. • Algorithm D • Draws a random jump (skip length) between selected records instead of deciding for every record. (A short sketch of Algorithm S is given below.)
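A short sketch of Algorithm S as described above: one pass over the N records, selecting exactly n of them so that every subset of size n is equally likely.

```python
# Minimal sketch of Algorithm S (sequential selection sampling).
import random

def algorithm_s(records, n):
    N = len(records)
    sample, m = [], 0                        # m = number of records already selected
    for t, record in enumerate(records):     # t records examined before this one
        # select the (t+1)-th record with probability (n - m) / (N - t)
        if random.random() < (n - m) / (N - t):
            sample.append(record)
            m += 1
            if m == n:                       # stop once n records are selected
                break
    return sample

print(algorithm_s(list(range(1000)), n=10))
```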
Experiments • Objectives • Show that a tree built with local sampling has a generalization error rate comparable to that of a tree built on the complete database. • Show that sampling reduces computing time. • Artificial database • Artificial problem: Breiman et al.'s "waves" (waveform) data. • Generate, 100 times, two files: one of 500,000 records for training and one of 50,000 records for validation. • Binary discretization. • CHAID decision tree algorithm.
Experiments (results) • Beyond a moderate sample size, the marginal gain from larger samples becomes weak.
With Real Benchmark DBs • 5 UCI databases, each containing more than 12,900 individuals. • Repeat the following operations 10 times: • Randomly subdivide the DB into a training set and a test set. • Build and test the trees.
With Real Benchmark DBs (results) • The influence of n: the sample size must not be too small. • Sampling drastically reduces computing time. • On the "Letter" DB, performance suffers from data fragmentation.
Conclusions • Working on samples is useful. • The step-by-step nature of decision tree construction allows a strategy using successive samples, one per node. • The approach is supported by both theoretical and empirical evidence. • Open problems • Optimal sampling methods • Learning with imbalanced classes • Local equal-size sampling