BOAT: Bootstrapped Optimistic Algorithm for Tree Construction
CIS 595, Fall 2000. Presentation by Prashanth Saka.
BOAT is a new algorithm for decision tree construction that improves on previous algorithms in both functionality and performance, resulting in a performance gain of around 300%. • The reason: it requires only two scans over the entire training dataset. • It is also the first scalable algorithm able to incrementally update the tree with respect to both insertions and deletions over the dataset.
Take a sample D' ⊆ D from the training database and construct a sample tree with a coarse splitting criterion at each node, using bootstrapping. • Make one scan over the database D and process each tuple t by 'streaming' it down the tree, as sketched below. • Starting at the root node n, update the counts of the buckets for each numerical predictor attribute. • If t falls inside the confidence interval at n, t is written into a temporary file S_n at node n; otherwise it is sent further down the tree.
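The streaming step can be pictured with a short sketch. Everything here is illustrative: SampleNode, its fields, and the use of an in-memory list in place of the temporary file S_n are assumptions made for the sketch, not the data structures of the BOAT paper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SampleNode:
    split_attribute: Optional[str] = None              # coarse splitting attribute from bootstrapping
    confidence_interval: tuple = (None, None)          # (lo, hi) interval for the split point
    left: Optional["SampleNode"] = None
    right: Optional["SampleNode"] = None
    label: Optional[str] = None                        # set only for leaves
    bucket_counts: dict = field(default_factory=dict)  # per-value class histograms
    S_n: list = field(default_factory=list)            # stand-in for the temporary file S_n

    def is_leaf(self) -> bool:
        return self.label is not None

def stream_tuple(node: SampleNode, t: dict) -> None:
    """Route tuple t (a dict with a 'label' key) downward from node."""
    if node.is_leaf():
        return
    value = t[node.split_attribute]
    counts = node.bucket_counts.setdefault(value, {})
    counts[t["label"]] = counts.get(t["label"], 0) + 1
    lo, hi = node.confidence_interval
    if lo <= value <= hi:
        node.S_n.append(t)              # t may fall on either side of the final split point
    elif value < lo:
        stream_tuple(node.left, t)      # t definitely belongs to the left subtree
    else:
        stream_tuple(node.right, t)     # t definitely belongs to the right subtree

def scan_database(root: SampleNode, D: list) -> None:
    """The single scan over D: every tuple is streamed down the sample tree."""
    for t in D:
        stream_tuple(root, t)
```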
Then the tree is processed top-down. • At each node, a lower-bounding technique is used to check whether the global minimum value of the impurity function could be lower than i', the minimum impurity value found inside the confidence interval. • If the check is successful, we are done with node n; otherwise we discard n and its subtree during the current construction.
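A rough sketch of that check, assuming hypothetical helpers best_impurity_inside_interval(), lower_bound_outside(), and discard_subtree() that stand in for the paper's lower-bounding machinery:

```python
def validate_top_down(node) -> None:
    """Top-down validation pass over the sample tree."""
    if node is None or node.is_leaf():
        return
    i_prime = node.best_impurity_inside_interval()   # i': best impurity inside the interval
    if node.lower_bound_outside() >= i_prime:
        # No split point outside the interval can beat i': node n is confirmed.
        validate_top_down(node.left)
        validate_top_down(node.right)
    else:
        # The global minimum could be lower than i': discard n and its subtree
        # during the current construction.
        node.discard_subtree()
```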
Each node of the decision tree has exactly one incoming edge (except the root, which has none) and either zero or two outgoing edges. • Each leaf is labeled with one class label. • Each internal node n is labeled with one predictor attribute X_n, called the splitting attribute. • Each internal node also has a splitting predicate q_n associated with it. • If X_n is numerical, then q_n is of the form X_n ≤ x_n, where x_n ∈ dom(X_n); x_n is called the split point at node n.
The combined information of splitting attribute and splitting predicates at a node n is called the splitting criterion at n.
We associate with each node n ∈ T a predicate f_n : dom(X1) × … × dom(Xm) → {true, false}, called its node predicate, as follows: for the root node n, f_n = true. Let n be a non-root node with parent p, whose splitting predicate is q_p. If n is the left child of p, then f_n = f_p ∧ q_p; if n is the right child of p, then f_n = f_p ∧ ¬q_p.
Since each leaf node n ∈ T is labeled with a class label, it encodes a classification rule f_n → c, where c is the label of n. The tree thus defines a function T : dom(X1) × … × dom(Xm) → dom(C) and is therefore a classifier, called a decision tree classifier. For a node n ∈ T with parent p, F_n is the set of records in D that follow the path from the root to node n when being processed by the tree. Formally, F_n = { t ∈ D : f_n(t) = true }.
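These definitions translate almost directly into code. The sketch below is illustrative (field and function names are not taken from the BOAT paper): it defines a binary tree node, the classifier T, the node predicate f_n, and the family F_n.

```python
from dataclasses import dataclass
from typing import Optional, Callable, Any

@dataclass
class TreeNode:
    split_attribute: Optional[str] = None                      # X_n (internal nodes only)
    split_predicate: Optional[Callable[[dict], bool]] = None   # q_n, e.g. X_n <= x_n
    left: Optional["TreeNode"] = None                          # child taken when q_n is true
    right: Optional["TreeNode"] = None                         # child taken when q_n is false
    label: Optional[Any] = None                                # class label (leaves only)

    def is_leaf(self) -> bool:
        return self.label is not None

def classify(root: TreeNode, t: dict) -> Any:
    """The classifier T: route t down the tree and return the leaf's class label."""
    node = root
    while not node.is_leaf():
        node = node.left if node.split_predicate(t) else node.right
    return node.label

def node_predicate(root: TreeNode, node: TreeNode, t: dict) -> bool:
    """f_n(t): true iff t follows the path from the root to node."""
    current = root
    while current is not node:
        if current.is_leaf():
            return False
        current = current.left if current.split_predicate(t) else current.right
    return True

def family(root: TreeNode, node: TreeNode, D: list) -> list:
    """F_n = { t in D : f_n(t) = true }."""
    return [t for t in D if node_predicate(root, node, t)]

# Example: a two-leaf tree splitting on the numerical attribute "age" at split point 30.
root = TreeNode(split_attribute="age", split_predicate=lambda t: t["age"] <= 30,
                left=TreeNode(label="yes"), right=TreeNode(label="no"))
```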
Here, impurity-based split selection methods are considered, which produce binary splits. • An impurity-based split selection method calculates the splitting criterion by minimizing a concave impurity function imp. • At each node, every predictor attribute X is examined, the impurity of the best split on X is calculated, and the final split is chosen such that the value of imp is minimized, as in the sketch below.
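For concreteness, here is a minimal sketch of such a split search using the Gini index as the impurity function imp; any concave impurity function (e.g. entropy) fits the same pattern, and the paper itself is not tied to this particular choice.

```python
from collections import Counter

def gini(labels: list) -> float:
    """Gini index of a multiset of class labels (a common concave impurity function)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(D: list, attribute: str):
    """Scan candidate split points X <= x for a numerical attribute and return the
    (impurity, split point) pair minimizing the weighted impurity of the two partitions."""
    best_imp, best_x = float("inf"), None
    for x in sorted({t[attribute] for t in D})[:-1]:   # splitting at the maximum is degenerate
        left = [t["label"] for t in D if t[attribute] <= x]
        right = [t["label"] for t in D if t[attribute] > x]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(D)
        if imp < best_imp:
            best_imp, best_x = imp, x
    return best_imp, best_x
```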
Let T be the final tree constructed using the split selection method CL on the training database D. • As D does not fit into memory, consider a sample D' ⊆ D such that D' fits into memory. • Compute a sample tree T' from D'. • Each node n ∈ T' has a sample splitting criterion consisting of a sample splitting attribute and a sample split point. • We can use this knowledge of T' to guide us in the construction of T, our final goal.
Consider a node n in the sample tree T' with numerical sample splitting attribute X_n and sample splitting predicate X_n ≤ x. • By T' being close to T we mean that the final splitting attribute at node n is X_n and that the final split point lies inside a confidence interval around x. • For categorical attributes, both the splitting attribute and the splitting subset have to match exactly.
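For a numerical splitting attribute, this notion of closeness amounts to a simple check; a hypothetical helper makes it explicit:

```python
def coarse_criterion_holds(final_attribute, final_split_point,
                           sample_attribute, confidence_interval) -> bool:
    """True iff the final criterion matches the coarse one: same splitting attribute,
    and a final split point that lies inside the bootstrapped confidence interval."""
    lo, hi = confidence_interval
    return final_attribute == sample_attribute and lo <= final_split_point <= hi
```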
Bootstrapping: the bootstrapping method can be applied to the in-memory sample D' to obtain a tree T' that is close to T with high probability. • In addition to T', we also obtain the confidence intervals that contain the final split points for nodes with numerical splitting attributes. • We call the information at node n obtained through bootstrapping the coarse splitting criterion at node n.
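One way such confidence intervals can be obtained is sketched below: resample D' with replacement, recompute the best split point on each resample, and take a percentile interval. The number of resamples and the percentile interval are illustrative choices, not the paper's exact procedure; best_split() is the helper from the earlier sketch.

```python
import random

def bootstrap_split_points(D_prime: list, attribute: str, num_resamples: int = 50) -> list:
    """Best split points of `attribute` over bootstrap resamples of the in-memory sample D'."""
    points = []
    for _ in range(num_resamples):
        resample = [random.choice(D_prime) for _ in range(len(D_prime))]
        _, x = best_split(resample, attribute)
        if x is not None:
            points.append(x)
    return points

def confidence_interval(points: list, alpha: float = 0.05) -> tuple:
    """Percentile interval expected to contain the final split point with high probability."""
    points = sorted(points)
    lo = points[int((alpha / 2) * (len(points) - 1))]
    hi = points[int((1 - alpha / 2) * (len(points) - 1))]
    return lo, hi
```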
After the bootstrapping phase, we know the final splitting attribute at each node n and also a confidence interval of attribute values that contains the final split point. • To decide on the final split point, we need to examine the value of the impurity function only at the attribute values inside the confidence interval. • If we had all the tuples that fall inside the confidence interval of n in memory, then we could calculate the final split point exactly by evaluating the impurity function at these points only.
To bring these tuples into memory, we make one scan over D and keep in memory all tuples that fall inside the confidence interval at any node. • Then we post-process each node with a numerical splitting attribute to find the exact value of the split point, using the tuples collected during the database scan. • This phase is called the clean-up phase; a sketch follows below. • Note that the coarse splitting criterion at node n obtained from the sample D' through bootstrapping is only correct with high probability.
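A sketch of the clean-up step at a single node, reusing the illustrative SampleNode and best_split() from the earlier sketches; for brevity it evaluates the impurity only on the in-memory tuples in S_n, whereas the full algorithm also folds in the bucket counts of the tuples outside the interval gathered during the scan.

```python
def clean_up_node(node) -> None:
    """Compute the exact split point at node n from the tuples collected in S_n:
    the impurity only needs to be evaluated at values inside the confidence interval."""
    lo, hi = node.confidence_interval
    in_interval = [t for t in node.S_n if lo <= t[node.split_attribute] <= hi]
    _, exact_split_point = best_split(in_interval, node.split_attribute)
    node.split_point = exact_split_point     # final, exact splitting criterion at n
```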
Whenever the coarse splitting criterion at n is not correct, we detect it during the clean-up phase and can take corrective action. • Hence, the method is guaranteed to find exactly the same tree as a traditional main-memory algorithm run on the complete training set.