Mining Decision Trees from Data Streams Tong Suk Man Ivy CSIS DB Seminar February 12, 2003
Contents • Introduction: problems in mining data streams • Classification of stream data • VFDT algorithm • Window approach • CVFDT algorithm • Experimental results • Conclusions • Future work
Data Streams • Characteristics • Large volume of ordered data points, possibly infinite • Arrive continuously • Fast changing • Appropriate model for many applications: • Phone call records • Network and security monitoring • Financial applications (stock exchange) • Sensor networks
Problems in Mining Data Streams • Traditional data mining techniques usually require • Entire data set to be present • Random access (or multiple passes) to the data • Much time per data item • Challenges of stream mining • Impractical to store the whole data • Random access is expensive • Only simple per-item calculations are affordable under time and space constraints
Classification of Stream Data • VFDT algorithm • “Mining High-Speed Data Streams”, KDD 2000. Pedro Domingos, Geoff Hulten • CVFDT algorithm (window approach) • “Mining Time-Changing Data Streams”, KDD 2001. Geoff Hulten, Laurie Spencer, Pedro Domingos
Definitions • A classification problem is defined as: • N is a set of training examples of the form (x, y) • x is a vector of d attributes • y is a discrete class label • Goal: produce from the examples a model y = f(x) that predicts the class y for future examples x with high accuracy
Decision Tree Learning • One of the most effective and widely used classification methods • Induces models in the form of decision trees • Each internal node contains a test on an attribute • Each branch from a node corresponds to a possible outcome of the test • Each leaf contains a class prediction • A decision tree is learned by recursively replacing leaves with test nodes, starting at the root [Figure: example tree testing Age<30?, then Car Type = Sports Car?]
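To make the slide concrete, here is a minimal Python sketch of such a tree; the Node fields, the helper, and the tiny example tree mirroring the figure are illustrative, not code from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A decision tree node: an attribute test at internal nodes, a class at leaves."""
    attribute: object = None                       # attribute tested; None => leaf
    children: dict = field(default_factory=dict)   # test outcome -> child Node
    prediction: object = None                      # class label stored at a leaf

def classify(node, x):
    """Follow the attribute tests from the root down to a leaf."""
    while node.attribute is not None:
        node = node.children[x[node.attribute]]
    return node.prediction

# Hypothetical tree echoing the figure: Age<30?, then Car Type = Sports Car?
leaf_yes, leaf_no = Node(prediction="yes"), Node(prediction="no")
sports = Node(attribute="sports_car", children={True: leaf_yes, False: leaf_no})
root = Node(attribute="age_lt_30", children={True: leaf_yes, False: sports})
print(classify(root, {"age_lt_30": False, "sports_car": True}))   # -> "yes"
```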
Challenges • Classic decision tree learners assume all training data can be simultaneously stored in main memory • Disk-based decision tree learners repeatedly read training data from disk sequentially • Prohibitively expensive when learning complex trees • Goal: design decision tree learners that read each example at most once, and use a small constant time to process it
Key Observation • In order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. • Given a stream of examples, use the first ones to choose the root attribute. • Once the root attribute is chosen, the successive examples are passed down to the corresponding leaves, and used to choose the attribute there, and so on recursively. • Use Hoeffding bound to decide how many examples are enough at each node
Hoeffding Bound • Consider a random variable a whose range is R • Suppose we have n independent observations of a, with sample mean ā • Hoeffding bound states: with probability 1 − δ, the true mean of a is at least ā − ε, where ε = √(R² ln(1/δ) / 2n)
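As a quick sanity check, a small sketch of the bound (the standard formula, not the paper's code); the example parameter values echo the VFDT experiment settings used later in the talk.

```python
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 ln(1/delta) / 2n): with probability 1 - delta,
    the sample mean of n observations of a variable with range R is
    within epsilon of the true mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain with 2 classes has range R = log2(2) = 1.
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=200))   # ~0.2007
```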
How many examples are enough? • Let G(Xi) be the heuristic measure used to choose test attributes (e.g. information gain, Gini index) • Xa: the attribute with the highest observed G after seeing n examples • Xb: the attribute with the second-highest observed G after seeing n examples • Given a desired δ, if after seeing n examples at a node ΔG = G(Xa) − G(Xb) > ε, the Hoeffding bound guarantees that the true ΔG > 0 with probability 1 − δ • The node can then be split using Xa, and succeeding examples are passed to the new leaves
Algorithm • Calculate the information gain for the attributes and determine the best two • Pre-pruning: also consider a "null" attribute that corresponds to not splitting the node • At each node, check the condition G(Xa) − G(Xb) > ε • If the condition is satisfied, create child nodes based on the test at the node • If not, stream in more examples and repeat the calculation until the condition is satisfied
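A minimal sketch of this split test at a leaf, assuming the G values have already been computed from the leaf's statistics; the dictionary-based interface is an illustration, not the paper's implementation.

```python
import math

def check_split(gains, n, R=1.0, delta=1e-7):
    """Hoeffding-tree split test at a leaf. `gains` maps each candidate
    attribute -- plus a "null" entry for the no-split option used in
    pre-pruning -- to its observed G after n examples."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    x_a, x_b = ranked[0], ranked[1]
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    if gains[x_a] - gains[x_b] > eps and x_a != "null":
        return x_a          # split: create a child for each value of x_a
    return None             # not confident yet: stream in more examples

# Example: after 200 examples, a gain gap of 0.25 exceeds eps ~ 0.2007.
print(check_split({"age": 0.30, "car_type": 0.05, "null": 0.0}, n=200))
```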
[Figure: growing the tree from the stream: the first examples choose the root test Age<30?; later examples choose Car Type = Sports Car? at one of its leaves]
Performance Analysis • p: probability that an example passed through the tree to level i falls into a leaf at that point • The expected disagreement between the tree produced by the Hoeffding tree algorithm and the tree produced using infinite examples at each node is no greater than δ/p • Required memory: O(leaves × attributes × values × classes); e.g. 1,000 active leaves over 100 binary attributes with 2 classes need about 1,000 × 100 × 2 × 2 = 400,000 counters
VFDT (Very Fast Decision Tree) • A decision-tree learning system based on the Hoeffding tree algorithm • Ties: split on the current best attribute when ε falls below a user-specified threshold τ; it is wasteful to keep deciding between nearly identical attributes • Compute G and check for a split only every nmin examples • Memory management • Memory dominated by sufficient statistics • Deactivate or drop less promising leaves when needed • Bootstrap with a traditional learner • Rescan old data when time is available
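A sketch of how the nmin periodic check and the tie threshold τ might be folded into the split test from the earlier slide; the function signature and interface are assumptions.

```python
import math

def vfdt_check(gains, n, n_min=200, tau=0.05, R=1.0, delta=1e-7):
    """Hoeffding test with VFDT's refinements: recompute G only every
    n_min examples, and break near-ties once eps drops below tau."""
    if n % n_min != 0:              # amortise the cost of computing G
        return None
    ranked = sorted(gains, key=gains.get, reverse=True)
    x_a, x_b = ranked[0], ranked[1]
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    if x_a != "null" and (gains[x_a] - gains[x_b] > eps or eps < tau):
        return x_a                  # confident winner, or a broken tie
    return None
```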
VFDT (2) • Scales better than pure memory-based or pure disk-based learners • Accesses data sequentially • Uses subsampling to potentially require much less than one scan • VFDT is incremental and anytime • New examples can be quickly incorporated as they arrive • A usable model is available after the first few examples and is then progressively refined
Experiment Results (VFDT vs. C4.5) • Compared VFDT and C4.5 (Quinlan, 1993) • Same memory limit for both (40 MB) • 100k examples for C4.5 • VFDT settings: δ = 10⁻⁷, τ = 5%, nmin = 200 • Domains: 2 classes, 100 binary attributes • Fifteen synthetic trees with 2.2k – 500k leaves • Noise from 0% to 30%
Experiment Results Accuracy as a function of the number of training examples
Experiment Results Tree size as a function of the number of training examples
Mining Time-Changing Data Streams • Most KDD systems, including VFDT, assume the training data is a sample drawn from a stationary distribution • Most large databases and data streams violate this assumption • Concept drift: data is generated by a time-changing concept function, e.g. • Seasonal effects • Economic cycles • Goal: • Mine continuously changing data streams • Scale well
Window Approach • Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples • Sensitive to window size • If w is small relative to the concept drift rate, the model is assured to reflect the current concept • But too small a w may leave too few examples to learn the concept • If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying the learner may be prohibitively high
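A few lines illustrating why the window approach is costly: every arrival triggers a full relearn over the w stored examples. `learn` is a stand-in for any traditional batch learner, not an API from the papers.

```python
from collections import deque

# Sliding window of the w most recent examples; a traditional learner is
# re-applied to the whole window whenever a new example arrives.
w = 100_000
window = deque(maxlen=w)

def receive(example, learn):
    window.append(example)          # the oldest example falls off automatically
    return learn(list(window))      # O(w) relearning cost per arrival
```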
CVFDT • CVFDT (Concept-adapting Very Fast Decision Tree learner) • Extend VFDT • Maintain VFDT’s speed and accuracy • Detect and respond to changes in the example-generating process
Observations • With a time-changing concept, the current splitting attribute of some nodes may no longer be the best • An outdated subtree may still be better than the best single leaf, particularly if it is near the root • Grow an alternate subtree with the new best attribute at its root when the old attribute seems out-of-date • Periodically use a batch of examples to evaluate the quality of the subtrees • Replace the old subtree when the alternate becomes more accurate
CVFDT algorithm • Alternate trees for each node in HT start as empty • Process examples from the stream indefinitely; for each example (x, y): • Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through • Add (x, y) to the sliding window of examples • If the sliding window overflows, remove the oldest example and forget its effect • CVFDTGrow • CheckSplitValidity if f examples have been seen since the last check of alternate trees • Return HT
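The per-example loop above, as an illustrative Python skeleton; `grow`, `forget`, and `check_split_validity` are placeholders for the routines sketched on the following slides, and the default parameters echo the experiment settings.

```python
from collections import deque

def cvfdt(stream, ht, grow, forget, check_split_validity,
          window_size=100_000, f=10_000):
    """Skeleton of CVFDT's per-example loop. `ht` is the Hoeffding tree
    being maintained; the three callables stand in for CVFDTGrow,
    forgetExample, and CheckSplitValidity."""
    window = deque()
    for i, (x, y) in enumerate(stream, start=1):
        window.append((x, y))
        if len(window) > window_size:           # window overflow
            forget(ht, window.popleft())        # undo the oldest example
        grow(ht, x, y)                          # CVFDTGrow
        if i % f == 0:
            check_split_validity(ht)            # revisit split decisions
    return ht
```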
[Flowchart, CVFDT per-example loop: read new example → pass it down to the leaves and add it to the sliding window → if the window overflows, forget the oldest example → CVFDTGrow → every f examples, CheckSplitValidity]
CVFDTGrow • For each node reached by the example in HT: • Increment the corresponding statistics at the node • For each alternate tree Talt of the node, CVFDTGrow • If enough examples have been seen at the leaf in HT that the example reaches: • Choose the attribute with the highest average value of the attribute evaluation measure (information gain or Gini index) • If the best attribute is not the "null" attribute, create a child node for each possible value of this attribute
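An illustrative sketch of CVFDTGrow; the Node class (with per-node statistics and alternate trees) is a simplified assumption shared by the remaining sketches, not the paper's data structure.

```python
from collections import Counter

class Node:
    """Minimal node for the sketches: `attribute` is None at leaves."""
    def __init__(self, attribute=None, node_id=0):
        self.attribute = attribute
        self.node_id = node_id          # monotonically increasing creation ID
        self.children = {}              # attribute value -> child Node
        self.alternates = []            # candidate replacement subtrees
        self.stats = Counter()          # (attribute, value, class) counts

def cvfdt_grow(node, x, y, try_split):
    """Update the sufficient statistics at every node the example reaches,
    recurse into the node's alternate trees, and attempt a split at the
    leaf (try_split applies the Hoeffding test to the leaf's stats)."""
    for a, v in x.items():
        node.stats[(a, v, y)] += 1      # counts from which G is computed
    for alt in node.alternates:
        cvfdt_grow(alt, x, y, try_split)
    if node.attribute is None:          # reached a leaf: maybe split it
        try_split(node)
    else:
        child = node.children.get(x[node.attribute])
        if child is not None:
            cvfdt_grow(child, x, y, try_split)
```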
Forget old example • Maintain the sufficient statistics at every node in HT to monitor the validity of its previous decisions • VFDT only maintains such statistics at leaves • HT might have grown or changed since the example was initially incorporated • Assign each node a unique, monotonically increasing ID as it is created • forgetExample(HT, example, maxID): • For each node reached by the old example with node ID no larger than the maximum leaf ID the example reached, decrement the corresponding statistics at the node • For each alternate tree Talt of the node, forgetExample(Talt, example, maxID)
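A sketch of forgetExample under the same assumed Node structure; note the node_id test, which skips nodes created after the example arrived and so never counted it.

```python
def forget_example(node, x, y, max_id):
    """Walk the old example back down HT, decrementing the statistics it
    once incremented, but only at nodes with ID <= max_id. Reuses the
    Node fields from the CVFDTGrow sketch."""
    if node is None or node.node_id > max_id:
        return                          # node is newer than the example
    for a, v in x.items():
        node.stats[(a, v, y)] -= 1
    for alt in node.alternates:
        forget_example(alt, x, y, max_id)
    if node.attribute is not None:
        forget_example(node.children.get(x[node.attribute]), x, y, max_id)
```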
CheckSplitValidity • Periodically scans the internal nodes of HT • Starts a new alternate tree when a new winning attribute is found • Uses tighter criteria to avoid excessive alternate tree creation • Limits the total number of alternate trees
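A sketch of CheckSplitValidity reusing the Node class above; `best_attribute` stands in for the (tighter) Hoeffding test over the node's statistics, and `max_alternates` is an assumed cap on alternate-tree creation.

```python
def check_split_validity(node, best_attribute, max_alternates=5):
    """Rescan each internal node; when a different attribute now wins,
    start a new alternate subtree rooted at that attribute."""
    if node is None or node.attribute is None:
        return
    winner = best_attribute(node)       # tighter Hoeffding test on node.stats
    growing = {alt.attribute for alt in node.alternates}
    if (winner is not None and winner != node.attribute
            and winner not in growing
            and len(node.alternates) < max_alternates):
        node.alternates.append(Node(attribute=winner))   # new alternate root
    for child in node.children.values():
        check_split_validity(child, best_attribute, max_alternates)
```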
Smoothly adjust to concept drift • Alternate trees are grown the same way HT is • Periodically, each node with non-empty alternate trees enters a testing mode • Use m training examples to compare the accuracy of the node's subtree with that of its alternate trees • Prune alternate trees whose accuracy does not increase over time • Replace the subtree if an alternate tree is more accurate [Figure: a subtree testing Age<30? with alternate trees rooted at Married? and Experience < 1 year?]
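A sketch of this testing mode; `accuracy(tree, examples)` is an assumed scoring helper, and promoting an alternate by copying its fields over the node is a simplification of the paper's subtree replacement.

```python
def test_and_replace(node, accuracy, test_examples):
    """Score the node's current subtree and each alternate on the next m
    examples; prune alternates whose accuracy stops increasing, and
    promote an alternate that overtakes the original subtree."""
    base = accuracy(node, test_examples)
    survivors = []
    for alt in node.alternates:
        score = accuracy(alt, test_examples)
        if score > base:                      # alternate wins: swap it in
            node.__dict__.update(alt.__dict__)
            return
        if score > getattr(alt, "best_score", -1.0):
            alt.best_score = score            # still improving: keep it
            survivors.append(alt)
    node.alternates = survivors               # prune the rest
```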
Adjust to concept drift (2) • Dynamically change the window size • Shrink the window when many nodes become questionable or when the data rate changes rapidly • Increase the window size when few nodes are questionable
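The slide does not give exact resizing rules, so the following is an invented heuristic purely to illustrate the idea; the thresholds and bounds are made up for the sketch.

```python
def new_window_size(current, questionable, total_nodes,
                    w_min=10_000, w_max=200_000):
    """Shrink the window when many nodes look questionable (fast drift);
    widen it again when the tree looks stable. Illustrative only."""
    frac = questionable / max(total_nodes, 1)
    if frac > 0.20:
        return max(w_min, current // 2)     # drift: forget old data faster
    if frac < 0.05:
        return min(w_max, current * 2)      # stable: learn from more data
    return current
```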
Performance • Required memory: O(nodes × attributes × attribute values × classes), independent of the total number of examples • Running time per example: O(Lc × attributes × attribute values × classes) • Lc: the length of the longest path an example can take through HT, times the number of alternate trees • Model learned by CVFDT vs. the one learned by VFDT-Window: • Similar in accuracy • O(1) vs. O(window size) work per new example
Experiment Results • Compared CVFDT, VFDT, and VFDT-Window • 5 million training examples • Concept changed every 50k examples • Drift level: average percentage of test points that change label at each concept change • About 8% of test points change label at each drift • 100,000 examples in the window • 5% noise • Tested the model every 10k examples throughout the run and averaged the results
Experiment Results (CVFDT vs. VFDT) Error rate as a function of the number of attributes (curves grouped by drift level)
Experiment Results (CVFDT vs. VFDT) Tree size as a function of the number of attributes
Experiment Results (CVFDT vs. VFDT) Error rates of the learners as a function of the number of examples seen (the plot also shows the portion of the data set labelled negative)
Experiment Results (CVFDT vs. VFDT) Error rates as a function of the amount of concept drift
Experiment Results CVFDT’s drift characteristics
Experiment Results (CVFDT vs. VFDT vs. VFDT-Window) • VFDT-Window simulated by rerunning VFDT on the window W every 100k examples instead of after every example • Error rate: VFDT 19.4%, CVFDT 16.3%, VFDT-Window 15.3% • Running time: VFDT 10 minutes, CVFDT 46 minutes, VFDT-Window an expected 548 days • Error rates over time of CVFDT, VFDT, and VFDT-Window
Experiment Results • CVFDT does not use much RAM • With D = 50, CVFDT never uses more than 70 MB • It can use as little as half the RAM of VFDT • VFDT often had twice as many leaves as the number of nodes in CVFDT's HT and alternate subtrees combined • Reason: VFDT considers many more outdated examples and is forced to grow larger trees to make up for earlier wrong decisions caused by concept drift
Conclusions • CVFDT: a decision-tree induction system capable of learning accurate models from high-speed, concept-drifting data streams • Grows an alternate subtree whenever an old one becomes questionable • Replaces the old subtree when the new one is more accurate • Similar in accuracy to applying VFDT to a moving window of examples
Future Work • Concepts may change periodically, so removed subtrees may become useful again • Comparisons with related systems • Handling continuous attributes • Weighting examples
Reference List • P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000. • G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001. • V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the Sixteenth International Conference on Data Engineering, 2000. • J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: optimistic decision tree construction. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.
The End: Q & A