
  1. Mining High-Speed Data Streams Pedro Domingos, Geoff Hulten Sixth ACM SIGKDD International Conference - 2000 Presented by: Afsoon Yousefi

  2. Outline Introduction Hoeffding Trees The VFDT System Performance Study Conclusion Qs & As

  3. Introduction Hoeffding Trees The VFDT System Performance Study Conclusion Qs & As

  4. Introduction • In today’s information society, extracting knowledge is becoming a very important task for many people; we live in an age of knowledge revolution. • Many organizations have very large databases that grow at a rate of several million records per day. • Opportunities • Challenges • Main limited resources in knowledge discovery systems: • Time • Memory • Sample size

  5. Introduction—cont. • Traditional systems: • A small amount of data is available. • Use a fraction of the available computational power. • Current systems: • The bottleneck is time and memory. • Use a fraction of the available samples of data. • Try to mine databases that don’t fit in main memory. • Available algorithms are either: • Efficient, but with no guarantee of a learned model similar to the batch-mode one: • Never recover from an unfavorable set of early examples. • Sensitive to example ordering. • Or produce the same model as the batch version, but not efficiently: • Slower than the batch algorithm.

  6. Introduction—cont. • Requirements of algorithms to overcome these problems: • Operate continuously and indefinitely. • Incorporate examples as they arrive, never losing potentially valuable information. • Build a model using at most one scan of the data. • Use only a fixed amount of main memory. • Require small constant time per record. • Make a usable model available at any point in time. • Produce a model equivalent to the one obtained by an ordinary database mining algorithm. • When the data-generating process changes over time, the model at any time should be up-to-date.

  7. Introduction—cont. • Such requirements are fulfilled by: • Incremental learning methods • Online methods • Successive methods • Sequential methods

  8. Introduction Hoeffding Trees The VFDT System Performance Study Conclusion Qs & As

  9. Hoeffding Trees • Classic decision tree learners: • CART, ID3, C4.5 • Keep all examples simultaneously in main memory. • Disk-based decision tree learners: • SLIQ, SPRINT • Examples are stored on disk. • Expensive to learn complex trees or to mine very large datasets. • Considering a subset of the training examples to find the best attribute: • Works for extremely large datasets. • Reads each example at most once. • Can directly mine online data sources. • Builds complex trees with acceptable computational cost.

  10. Hoeffding Trees—cont. • Given a set of N examples of the form (x, y) • N : number of examples • y : discrete class label • x : a vector of d attributes (symbolic or numeric) • Goal : produce a model y = f(x) • A model that will predict the classes of future examples with high accuracy.

  11. Hoeffding Trees—cont. • Given a stream of examples: • Use the first ones to choose the root test. • Pass succeeding ones down to the corresponding leaves. • Pick the best attributes there. • … and so on recursively. • How many examples are necessary at each node? • Hoeffding bound • Additive Chernoff bound • A statistical result

  12. Hoeffding Trees—cont. • Hoeffding bound: • G(X_i) : heuristic measure used to choose test attributes • C4.5: information gain • CART: Gini index • Assume G is to be maximized • Ḡ(X_i) : heuristic measure after seeing n examples • X_a : attribute with highest observed Ḡ • X_b : second-best attribute • ΔḠ = Ḡ(X_a) − Ḡ(X_b) ≥ 0 : difference between X_a and X_b • δ : probability of choosing the wrong attribute • With R the range of G, the bound gives ε = sqrt( R² ln(1/δ) / (2n) ) • The Hoeffding bound guarantees that X_a is the correct choice with probability 1 − δ if: • ΔḠ > ε after n examples have been seen at this node

  13. Hoeffding Trees—cont. • Hoeffding bound: • If ΔḠ > ε • X_a is the best attribute with probability 1 − δ • A node needs to accumulate examples from the stream until ε becomes smaller than ΔḠ (a minimal coded version of this rule follows below). • It is independent of the probability distribution generating the observations. • More conservative than distribution-dependent bounds.
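
The split rule above is easy to state in code. The following is a minimal sketch, not the authors' implementation; the function and parameter names are mine. R is the range of the heuristic G, e.g. log2(c) for information gain with c classes.

import math

def hoeffding_bound(R, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    # the true mean of a variable with range R lies within epsilon of the
    # average of n independent observations.
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def safe_to_split(g_best, g_second, R, delta, n):
    # Split once the observed gap between the two best attributes exceeds
    # epsilon: the sample-best attribute is then the true best with
    # probability at least 1 - delta.
    return (g_best - g_second) > hoeffding_bound(R, delta, n)

For example, with two classes and information gain (R = 1), safe_to_split(0.30, 0.20, 1.0, 1e-6, 700) returns True, since epsilon has shrunk to about 0.099 after 700 examples.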

  14. Hoeffding Tree algorithm • Inputs: • S : a sequence of examples. • X : a set of discrete attributes. • G(·) : a split evaluation function. • δ : desired probability of choosing the wrong attribute at any given node. • Output: • HT : a decision tree.

  15. Hoeffding Tree algorithm—cont. • Procedure HoeffdingTree(S, X, G, δ) • Let HT be a tree with a single leaf l1 (the root). • Let X1 = X ∪ {X∅}. • Let Ḡ1(X∅) be the Ḡ obtained by predicting the most frequent class in S. • For each class y_k • For each value x_ij of each attribute X_i ∈ X • Let n_ijk(l1) = 0.

  16. Hoeffding Tree algorithm—cont. • For each example (x, y_k) in S • Sort (x, y_k) into a leaf l using HT. • For each x_ij in x and each X_i such that X_i ∈ X_l • Increment n_ijk(l). • Label l with the majority class among the examples seen so far at l. • Compute Ḡ_l(X_i) for each attribute X_i ∈ X_l. • Let X_a be the attribute with the highest Ḡ_l. • Let X_b be the attribute with the second-highest Ḡ_l. • Compute ε. • If Ḡ_l(X_a) − Ḡ_l(X_b) > ε, then • Replace l by an internal node that splits on X_a. • For each branch of the split • Add a new leaf l_m, let X_m = X_l − {X_a}. • Let l_m predict the most frequent class. • For each class y_k and each x_ij such that X_i ∈ X_m • Let n_ijk(l_m) = 0. • Return HT. (A runnable sketch of this loop follows below.)
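
Below is a self-contained Python sketch of this loop for discrete attributes with information gain. It is illustrative only: the class names, the lazy child creation, and the omission of the X∅ (no-split) option, tie handling, and the n_min refinement are my simplifications, not the paper's code.

import math
from collections import defaultdict

def entropy(counts):
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

class Leaf:
    def __init__(self, attrs):
        self.attrs = set(attrs)                  # attributes still usable here
        self.class_counts = defaultdict(int)
        # n[attr][value][cls] plays the role of the counts n_ijk(l)
        self.n = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def info_gain(self, attr):
        total = sum(self.class_counts.values())
        remainder = sum(sum(c.values()) / total * entropy(c)
                        for c in self.n[attr].values())
        return entropy(self.class_counts) - remainder

class Split:
    def __init__(self, attr, child_attrs):
        self.attr = attr
        self.child_attrs = child_attrs           # attributes left below the split
        self.children = {}                       # value -> subtree, grown lazily

class HoeffdingTree:
    def __init__(self, attrs, n_classes=2, delta=1e-6):
        self.delta = delta
        self.R = math.log2(n_classes)            # range of information gain
        self.root = Leaf(attrs)

    def _sort(self, x):
        # Walk x down to its leaf, remembering where we came from.
        node, parent, key = self.root, None, None
        while isinstance(node, Split):
            parent, key = node, x[node.attr]
            if key not in node.children:         # unseen value: fresh leaf
                node.children[key] = Leaf(node.child_attrs)
            node = node.children[key]
        return node, parent, key

    def learn_one(self, x, y):
        leaf, parent, key = self._sort(x)
        leaf.class_counts[y] += 1
        for a in leaf.attrs:                     # update the n_ijk(l) counts
            leaf.n[a][x[a]][y] += 1
        if len(leaf.class_counts) < 2 or len(leaf.attrs) < 2:
            return                               # pure leaf / nothing to compare
        gains = {a: leaf.info_gain(a) for a in leaf.attrs}
        ranked = sorted(gains.values())
        n_seen = sum(leaf.class_counts.values())
        eps = math.sqrt(self.R ** 2 * math.log(1 / self.delta) / (2 * n_seen))
        if ranked[-1] - ranked[-2] > eps:        # Hoeffding test passed: split
            best = max(gains, key=gains.get)
            replacement = Split(best, leaf.attrs - {best})
            if parent is None:
                self.root = replacement
            else:
                parent.children[key] = replacement

# Hypothetical usage: examples are (dict of attribute values, class label).
# tree = HoeffdingTree(attrs={"color", "shape"}, n_classes=2)
# for x, y in stream:
#     tree.learn_one(x, y)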

  17. Hoeffding Trees—cont. • p : leaf probability (assume this is constant). • HT_δ : tree produced by the Hoeffding tree algorithm with desired δ, given an infinite sequence of examples S. • DT_* : decision tree induced by choosing at each node the attribute with the true greatest G. • Δ_i(HT_δ, DT_*) : intensional disagreement between the two decision trees: • Δ_i(DT_1, DT_2) = Σ_x P(x) I[Path_1(x) ≠ Path_2(x)] • P(x) : probability that the attribute vector x will be observed. • I(·) : indicator function (1: true argument, 0: otherwise). • THEOREM : E[Δ_i(HT_δ, DT_*)] ≤ δ/p

  18. Hoeffding Trees—cont. • Suppose that the best and second-best attributes differ by 10% (ε/R = 0.1). • According to ε = sqrt( R² ln(1/δ) / (2n) ): • δ = 0.1% requires 380 examples. • δ = 0.0001% requires only 345 more examples. • An exponential improvement in δ can be obtained with a linear increase in the number of examples (the snippet below checks this behaviour).
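
Inverting the bound gives the required sample size, n = R² ln(1/δ) / (2ε²). The sketch below (with R = 1 assumed; exact counts depend on the range R used) illustrates the last bullet: each thousandfold reduction in δ costs only the same additive number of examples.

import math

def examples_needed(delta, gap=0.1, R=1.0):
    # n = R^2 * ln(1/delta) / (2 * gap^2), from inverting the Hoeffding bound.
    return math.ceil(R * R * math.log(1.0 / delta) / (2.0 * gap ** 2))

for d in (1e-3, 1e-6, 1e-9):
    print(d, examples_needed(d))
# Prints roughly 346, 691, 1037: each 1000x reduction in delta adds the
# same ~345 examples, i.e. an exponential improvement in delta for a
# linear increase in n.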

  19. Introduction Hoeffding Trees The VFDT System Performance Study Conclusion Qs & As

  20. The VFDT System • Very Fast Decision Tree learner (VFDT). • A decision tree learning system • based on the Hoeffding tree algorithm. • Uses either information gain or the Gini index as the attribute evaluation measure. • Includes a number of refinements to the Hoeffding tree algorithm: • Ties. • Ḡ computation. • Memory. • Poor attributes. • Initialization. • Rescans.

  21. The VFDT System—cont. • Ties • Two or more attributes have very similar Ḡ’s. • Potentially many examples will be required to decide between them with high confidence. • It makes little difference which attribute is chosen. • If ΔḠ < ε < τ (a user-specified tie threshold): split on the current best attribute. (A sketch of this rule follows below.)
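
A sketch of the tie refinement, reusing the hoeffding_bound helper from the earlier sketch (τ is VFDT's user-supplied tie threshold; the function name is mine):

def should_split_with_ties(g_best, g_second, R, delta, n, tau):
    eps = hoeffding_bound(R, delta, n)
    # Normal case: the observed gap beats the bound. Tie case: the bound
    # itself has shrunk below tau, so the top candidates are effectively
    # indistinguishable and we split on the current best rather than wait
    # indefinitely for a winner.
    return (g_best - g_second > eps) or (eps < tau)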

  22. The VFDT System—cont. • Ḡ computation • The most significant part of the time cost per example is recomputing Ḡ. • Computing Ḡ for every new example is inefficient. • n_min new examples must be accumulated at a leaf before recomputing Ḡ (sketched below).
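
One way this refinement might be wired into the learning loop (a sketch; the attribute name and the default value are mine, not the paper's setting):

def maybe_check_split(leaf, n_min=200):
    # Recompute G-bar and the Hoeffding test only every n_min examples:
    # a single example rarely flips the split decision, and this amortizes
    # the dominant per-example cost. n_min=200 is illustrative only.
    leaf.since_eval = getattr(leaf, "since_eval", 0) + 1
    if leaf.since_eval < n_min:
        return False
    leaf.since_eval = 0
    return True      # caller now recomputes G-bar and applies the bound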

  23. The VFDT System—cont. • Memory • VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves. • If the maximum available memory is reached, VFDT deactivates the least promising leaves. • The least promising leaves are considered to be the ones with the lowest values of p_l · e_l, where p_l is the probability that an arbitrary example falls into leaf l and e_l is the observed error rate at l. (A sketch follows below.)
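
A sketch of leaf deactivation under a memory cap, building on the Leaf class from the earlier sketch (helper names are mine):

def promise(leaf, total_examples):
    seen = sum(leaf.class_counts.values())
    if seen == 0 or total_examples == 0:
        return 0.0
    errors = seen - max(leaf.class_counts.values())
    p_l = seen / total_examples          # prob. an example reaches this leaf
    e_l = errors / seen                  # observed error rate at the leaf
    return p_l * e_l

def deactivate(leaves, max_active):
    total = sum(sum(l.class_counts.values()) for l in leaves)
    excess = max(0, len(leaves) - max_active)
    # Drop the attribute-value-class counts of the least promising leaves;
    # a leaf can be reactivated later if it starts to look more promising.
    for l in sorted(leaves, key=lambda l: promise(l, total))[:excess]:
        l.n.clear()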

  24. The VFDT System—cont. • Poor attributes • VFDT’s memory usage is also minimized by dropping, early on, attributes that do not look promising. • As soon as the difference between an attribute’s Ḡ and the best one’s becomes greater than ε, the attribute can be dropped. • The memory used to store the corresponding counts can then be freed. (See the sketch below.)
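
A sketch of early attribute dropping, again on the Leaf from the earlier sketch (eps is the Hoeffding bound currently in force at that leaf):

def drop_poor_attributes(leaf, eps):
    gains = {a: leaf.info_gain(a) for a in leaf.attrs}
    best = max(gains.values())
    for a, g in list(gains.items()):
        if best - g > eps:               # with high probability a worse attribute
            leaf.attrs.discard(a)
            leaf.n.pop(a, None)          # free its counts immediately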

  25. The VFDT System—cont. • Initialization • VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data. • The tree can either be input as it is, or over-pruned. • Gives VFDT a “head start”.

  26. The VFDT System—cont. • Rescans • VFDT can rescan previously-seen examples. • Rescanning can be activated if: • The data arrives slowly enough that there is time for it. • The dataset is finite and small enough that rescanning is feasible.

  27. Introduction Hoeffding Trees The VFDT System Performance Study Conclusion Qs & As

  28. Synthetic Data Study • Comparing VFDT with C4.5 release 8. • Restricted the two systems to the same amount of RAM. • VFDT used information gain as the G function. • 14 concepts were used, all with 2 classes and 100 attributes. • For each level after the first 3: • A fraction of the nodes was replaced by leaves; • the rest became splits on a random attribute. • At a depth of 18, all the remaining nodes were replaced with leaves. • Each leaf was randomly assigned a class. • A stream of training examples was then generated by • sampling uniformly from the instance space and • assigning classes according to the target tree. • Various levels of class and attribute noise were added. (A rough sketch of this generator follows below.)
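
A rough sketch of such a concept generator (binary attributes assumed; the leaf fraction f varied per concept in the study, so 0.15 below is only a placeholder):

import random

def random_concept(depth=0, n_attrs=100, f=0.15, max_depth=18):
    # After the first 3 levels each node becomes a leaf with probability f;
    # at depth 18 everything remaining becomes a leaf with a random class.
    if depth >= max_depth or (depth >= 3 and random.random() < f):
        return random.randint(0, 1)
    attr = random.randrange(n_attrs)                     # random split attribute
    return (attr,
            random_concept(depth + 1, n_attrs, f, max_depth),
            random_concept(depth + 1, n_attrs, f, max_depth))

def make_example(tree, n_attrs=100, class_noise=0.0):
    x = [random.random() < 0.5 for _ in range(n_attrs)]  # uniform instance space
    node = tree
    while isinstance(node, tuple):                       # label via the target tree
        attr, left, right = node
        node = right if x[attr] else left
    y = node
    if random.random() < class_noise:
        y = 1 - y                                        # flip label: class noise
    return x, y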

  29. Synthetic Data Study—cont. • Accuracy as a function of the number of training examples.

  30. Synthetic Data Study—cont. • Tree size as a function of the number of training examples.

  31. Synthetic Data Study—cont. • Accuracy as a function of the noise level. • 4 runs on the same concept (C4.5: 100k examples; VFDT: 20 million examples).

  32. Lesion Study • Effect of initializing VFDT with C4.5, with and without over-pruning.

  33. Web Data • Applying VFDT to mining the stream of Web page requests • from the whole University of Washington main campus. • To mine 1.6 million examples: • VFDT took 1540 seconds to do one pass over the training data; • 983 of those seconds were spent reading data from disk. • C4.5 took 24 hours to mine the same 1.6 million examples.

  34. Web Data—cont. • Performance on Web data

  35. Introduction Hoeffding Trees The VFDT System Performance Study Conclusion Qs & As

  36. Conclusion • Hoeffding trees: • A method for learning online from high-volume data streams. • Allows learning in very small constant time per example. • Guarantees high similarity to the corresponding batch trees. • VFDT system: • A high-performance data mining system • based on Hoeffding trees. • Effective in taking advantage of massive numbers of examples.

  37. Introduction Hoeffding Trees The VFDT System Performance Study Conclusion Qs & As

  38. Qs & As • Name four requirements of algorithms that overcome the limitations of current disk-based algorithms. • Operate continuously and indefinitely. • Incorporate examples as they arrive, never losing potentially valuable information. • Build a model using at most one scan of the data. • Use only a fixed amount of main memory. • Require small constant time per record. • Make a usable model available at any point in time. • Produce a model equivalent to the one obtained by an ordinary database mining algorithm. • When the data-generating process changes over time, the model at any time should be up-to-date.

  39. Qs & As • What are the benefits of considering a subset of the training examples to find the best attribute? • Works for extremely large datasets. • Reads each example at most once. • Can directly mine online data sources. • Builds complex trees with acceptable computational cost.

  40. Qs & As • How does VFDT’s tie refinement to the Hoeffding tree algorithm work? • Two or more attributes have very similar Ḡ’s. • Potentially many examples would be required to decide between them with high confidence. • It makes little difference which attribute is chosen. • If ΔḠ < ε < τ: split on the current best attribute.
