Mining High-Speed Data Streams Presented by: Yumou Wang Dongyun Zhang Hao Zhou
Introduction • The world’s information is doubling every two years. • From 2006 to 2011, the amount of information grew by a factor of 9.
Introduction • By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers". • However, the IT staff available to manage it will grow by less than a factor of 1.5. • Current algorithms can only deal with a small amount of data, less than a single day's worth for many applications. • For example: banks and telecommunication companies.
Introduction • Problem: when new examples arrive at a higher rate than they can be mined, the amount of unused data grows without bound as time progresses. • Dealing responsibly with these huge amounts of data is therefore very important today. • Mining these continuous data streams brings unique opportunities, but also new challenges.
Background • Design Criteria for mining High Speed Data Streams • It must be able to build a model using at most one scan of the data. • It must use only a fixed amount of main memory. • It must require small constant time per record.
Background • The usual approach is to apply an incremental (KDD) learning system to the examples as they arrive. • Shortcoming: the models learned are highly sensitive to example ordering compared to the batch model. • Other incremental methods can produce the same model as the batch version, but much more slowly.
Classification Method • Input: examples of the form (x, y), where y is the class label and x is the vector of attributes. • Output: a model y = f(x) that predicts the classes y of future examples x with high accuracy.
Decision Tree • One of the most effective and widely-used classification methods. • A decision tree is a decision support tool that uses a tree-like graph or model. • Decision trees are commonly used in machine learning.
Building a Decision Tree • 1. Start at the root. • 2. Test all the attributes and choose the best one according to some heuristic measure. • 3. Split the node into branches and leaves. • 4. Recursively replace leaves by test nodes (as sketched below).
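A minimal Python sketch of this recursive procedure, assuming a small in-memory dataset and an information-gain heuristic (all helper names and the tiny dataset are illustrative, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain from splitting on attribute index `attr`."""
    total, n = entropy(labels), len(rows)
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

def build_tree(rows, labels, attrs):
    """Start at the root, pick the best attribute by the heuristic,
    split into branches, and recursively replace leaves by test nodes."""
    if len(set(labels)) == 1 or not attrs:            # pure node or nothing left to test
        return Counter(labels).most_common(1)[0][0]   # leaf: predict the majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {"attr": best, "branches": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return tree

# Tiny illustrative dataset: two attributes (indices 0 and 1), binary labels.
rows = [("sunny", "hot"), ("rainy", "mild"), ("sunny", "mild"), ("rainy", "hot")]
labels = ["no", "yes", "no", "yes"]
print(build_tree(rows, labels, attrs=[0, 1]))
```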
Problems • There are several problems with traditional decision tree learners. • Some assume that all training examples can be stored simultaneously in main memory. • Disadvantage: this limits the number of examples they can learn from. • Disk-based decision tree learners keep the examples on disk and repeatedly read them. • Disadvantage: this is expensive when learning complex trees.
Hoeffding Trees • Designed for extremely large datasets. • Main idea: find the best attribute at a given node by considering only a small subset of the training examples that pass through that node. • The key question is how many examples are sufficient.
Hoeffding Bound • Definition: the statistical result that decides how many examples n each node needs is called the Hoeffding bound. • Assume: R is the range of the variable r, and we have n independent observations with mean r’. • With probability 1-δ, the true mean of r is at least r’-ε, where ε = √(R² ln(1/δ) / 2n).
Hoeffding Bound • ε is a decreasing function of n: the larger n is, the smaller ε becomes. • ε bounds the difference between the true mean and the observed mean of r (see the sketch below).
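As a quick illustration, the bound can be computed directly (a sketch; R, δ, and n are as defined on the previous slide):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the true mean of r lies within epsilon of the mean of n observations."""
    return math.sqrt(R ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain over two classes has range R = log2(2) = 1.
print(hoeffding_bound(R=1.0, delta=1e-7, n=1000))   # roughly 0.09
```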
Hoeffding Tree Algorithm • Inputs: S, a sequence of examples; X, a set of discrete attributes; G(.), a split evaluation function; δ, one minus the desired probability of choosing the correct attribute at any given node. • Output: HT, a decision tree.
Hoeffding Tree Algorithm • Goal: • Ensure that, with high probability, the attribute chosen using n examples is the same as the one that would be chosen using infinitely many examples. • Let Xa be the attribute with the highest observed G’ and Xb the attribute with the second-highest, after seeing n examples. • Let ΔG’ = G’(Xa) – G’(Xb). • If ΔG’ > ε, then Xa really is the best attribute with probability 1-δ. • Thus a node needs to accumulate examples from the stream until ε becomes smaller than ΔG’.
Hoeffding Tree Algorithm • The algorithm constructs the tree using the same procedure as ID3. It calculates the information gain for the attributes and determines the best attribute. • At each node it checks the condition ΔG’ > ε. If the condition is satisfied, it creates child nodes based on the test at the node. • If not, it streams in more training examples and repeats the calculation until the condition is satisfied (a sketch of this check follows).
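A sketch of the per-leaf split check described above, assuming the observed gains G for the candidate attributes have already been tallied (the gain values and attribute names below are purely illustrative):

```python
import math

def should_split(gains, n, delta, R=1.0):
    """Hoeffding test at a leaf: split once the gap between the best and
    second-best observed gains exceeds epsilon.  `gains` maps each candidate
    attribute to its observed G over the n examples seen at this leaf."""
    ranked = sorted(gains.values(), reverse=True)
    delta_g = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)    # G(Xa) - G(Xb)
    epsilon = math.sqrt(R ** 2 * math.log(1.0 / delta) / (2.0 * n))  # Hoeffding bound
    return delta_g > epsilon   # with probability 1 - delta, Xa is the true best

# After 5000 examples the gap 0.12 - 0.05 = 0.07 exceeds epsilon (about 0.04),
# so this leaf would be turned into a test node on "age".
print(should_split({"age": 0.12, "zip": 0.05}, n=5000, delta=1e-7))
```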
Hoeffding Tree Algorithm • Memory cost • d—number of attributes • c—number of classes • v—number of values per attribute • l—number of leaves in the tree • The memory cost for each leaf is O(dvc) • The memory cost for the whole tree is O(ldvc)
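A quick worked example of these costs, with illustrative numbers (not from the paper):

```python
# One count is kept per (attribute, value, class) triple at each leaf.
d, v, c = 100, 10, 2                 # attributes, values per attribute, classes (illustrative)
per_leaf = d * v * c                 # O(dvc) counters per leaf -> 2,000
leaves = 10_000                      # l active leaves (illustrative)
print(per_leaf, leaves * per_leaf)   # whole tree is O(ldvc) -> 20,000,000 counters
```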
Advantages of Hoeffding Tree • 1. Can deal with extremely large datasets. • 2. Each example is read at most once, in small constant time, which makes it possible to mine online data sources. • 3. Builds very complex trees with acceptable computational cost.
VFDT—Very Fast Decision Tree • Breaking ties • Reduces wasted examples when two attributes have nearly identical gain • Split on the current best attribute once ε drops below a tie threshold τ, since waiting longer will not change the choice • Periodic recomputation (n_min) • The chosen split is unlikely to change with a single example • Recomputing G only after every n_min new examples significantly reduces recomputation time • Memory cleanup • The promise of each leaf is measured (the probability of reaching it times its observed error) • The least promising leaves are deactivated when memory is full • Option of enabling reactivation later (see the code sketch below)
VFDT—Very Fast Decision Tree • Filtering out poor attributes • Attributes whose gain is clearly worse than the best are dropped early • Reduces memory consumption • Initialization • VFDT can be initialized with a tree built by an existing conventional learner • Gives it a head start • Rescans • Previously seen examples can be rescanned when data arrives slowly or the dataset is small enough
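A sketch of how the tie-breaking and periodic-recomputation refinements change the basic split check (τ and n_min are the VFDT parameters named above; the function shape and example numbers are illustrative, not the paper's code):

```python
import math

def vfdt_should_split(gains, n, delta, tau, n_min, seen_since_last_check, R=1.0):
    """The basic Hoeffding test plus two VFDT refinements:
    - G is recomputed only after n_min new examples reach the leaf;
    - ties are broken: if epsilon < tau, the top attributes are so close that
      waiting for more examples is wasted effort, so split on the current best."""
    if seen_since_last_check < n_min:
        return False
    ranked = sorted(gains.values(), reverse=True)
    delta_g = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    epsilon = math.sqrt(R ** 2 * math.log(1.0 / delta) / (2.0 * n))
    return delta_g > epsilon or epsilon < tau

# The two gains are nearly tied, but epsilon (~0.02) is below tau, so split anyway.
print(vfdt_should_split({"a": 0.081, "b": 0.080}, n=20000, delta=1e-7,
                        tau=0.05, n_min=300, seen_since_last_check=300))
```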
Tests—Configuration • 14 concepts • Generated by random decision trees • Number of leaves: 2.2k to 61k • Noise level: 0 to 30% • 50k examples for testing • Available memory: 40MB • Legacy processors
Tests—Synthetic data • Time consumption • On 20m examples, VFDT takes 5752s to read the data and 625s to process it • On 100k examples, C4.5 takes 36s and VFDT takes 47s
Tests—parameters • With and without over-pruning
Tests—parameters • Without tie-breaking vs. with tie-breaking • 65 nodes vs. 8k nodes for VFDT • 805 nodes vs. 8k nodes for VFDT-boot • 72.9% vs. 86.9% accuracy for VFDT • 83.3% vs. 88.5% for VFDT-boot • Recomputing G after every example vs. only every n_min examples • VFDT: +1.1% accuracy, 3.8x the time • VFDT-boot: -0.9% accuracy, 3.7x the time • 5% more nodes
Tests—parameters • 40MB vs. 80MB of memory • 7.8k more nodes • VFDT: +3.0% accuracy • VFDT-boot: +3.2% accuracy • vs. • 30% fewer nodes • VFDT: +2.3% accuracy • VFDT-boot: +1.0% accuracy
Tests—web data • Task: predicting web page accesses • 1.89m training examples • 61.1% accuracy from always predicting the most common class • 276,230 examples for testing
Tests—web data • Decision stump • 64.2% accuracy • 1277s to learn • C4.5 with 40MB of memory • 74.5k examples • 2975s to learn • 73.3% accuracy • VFDT bootstrapped with C4.5 • 1.61m examples • 1450s to learn after initialization (983s to read the data)
Why is VFDT not Enough? • VFDT assumes the training data is a sample drawn from a stationary distribution. • Most large databases or data streams violate this assumption. • Concept drift: data is generated by a time-changing concept function, e.g. • Seasonal effects • Economic cycles • Goal: • Mining continuously changing data streams • Scaling well
Why is VFDT not Enough? • Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples • Sensitive to window size • If w is small relative to the rate of concept shift, the model is assured to reflect the current concept • But too small a w may leave too few examples to learn the concept • If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high
CVFDT • CVFDT (Concept-adapting Very Fast Decision Tree learner) • Extends VFDT • Maintains VFDT's speed and accuracy • Detects and responds to changes in the example-generating process
CVFDT (contd.) • With a time-changing concept, the current splitting attribute of some nodes may no longer be the best. • An outdated subtree may still be better than the best single leaf, particularly if it is near the root. • Grow an alternate subtree with the new best attribute at its root when the old attribute seems out-of-date. • Periodically use a sample of recent examples to evaluate the quality of the current subtree and its alternates. • Replace the old subtree when the alternate becomes more accurate (a high-level sketch follows).
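A high-level sketch of the alternate-subtree bookkeeping described above (purely illustrative structure and function names, not the paper's actual data layout):

```python
class CVFDTNode:
    """A test node that may carry alternate subtrees grown on newer data."""
    def __init__(self, split_attr):
        self.split_attr = split_attr
        self.children = {}
        self.alternates = []

def maybe_grow_alternate(node, gains, epsilon):
    """If another attribute now looks clearly better than the installed split,
    start growing an alternate subtree rooted at that attribute."""
    best = max(gains, key=gains.get)
    if best != node.split_attr and gains[best] - gains.get(node.split_attr, 0.0) > epsilon:
        node.alternates.append(CVFDTNode(best))

def maybe_replace(node, accuracy_of):
    """Periodically score each alternate on recent examples and swap it in
    once it is more accurate than the current subtree."""
    for alt in node.alternates:
        if accuracy_of(alt) > accuracy_of(node):
            return alt
    return node
```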
Conclusion and Future Work • CVFDT is able to keep a decision tree up-to-date with a window of examples, using a small constant amount of time for each new example that arrives. • Empirical studies show that CVFDT is effectively able to keep its model up-to-date with a massive data stream even in the face of large and frequent concept shifts. • Future work: CVFDT currently discards subtrees that are out-of-date, but some concepts change periodically and these subtrees may become useful again; identifying these situations and taking advantage of them is an area for further study.