Mining Data Streams: Challenges, Techniques, and Future Work. Ruoming Jin. Joint work with Prof. Gagan Agrawal. August 10-17, 2003. A major power outage simultaneously hits a dozen big cities in the eastern United States and Canada. Suddenly, millions of people have to live without electricity.
Mining Data Streams Challenges, Techniques, and Future Work • Ruoming Jin • Joint work with Prof. Gagan Agrawal
August 10-17, 2003 • A major power outage simultaneously hits a dozen big cities in the eastern United States and Canada • Suddenly, millions of people have to live without electricity • Internet worm • Millions of computers were attacked • In a single day, I received almost 100 emails generated by the worm • We are able to collect and monitor data from the power grid and email servers, but unable to extract knowledge fast enough from such a huge, dynamic volume of data!
Data Explosion • The Challenge: • Our ability to access, collect, generate, and store data has been outpacing our ability to understand it • Real Applications • Walmart: 20M transactions per day • AT&T: 300M calls per day • Earth Observing System from NASA: 50 GB per hour • Amazon.com: 4-5M sessions per day • Power Grid/Sensor Network • Internet/Intranet
Data Streams • What are Data Streams? • Continuous streams • Huge, Fast and Changing • Why Data Streams? • The arrival rate of streams and the volume of data are beyond our capability to store them • Real-time processing • Evolution of Data (Static/Dynamic) • You can only have one look at the data!
Data Mining • Extracting useful information or knowledge from large amounts of data • Interesting patterns • Regularity or Anomaly • Typical Data Mining Tasks • Association Rule Mining • Classification • Clustering • Disk-resident or in-core datasets • Traditional data mining needs multiple passes over the data!
Stream Data Mining • How to run traditional mining tasks over data streams • Single pass/multi-pass • How to discover new information over data streams • Changing • How to perform data mining over dynamic data streams • Concept drifting
Roadmap • Thesis Statement • Current Work • Decision Tree Construction • Frequent Itemsets Mining • Future Work • Mining Maximal/Closed/Approximate Frequent Itemsets • Mining New Knowledge from Data Streams • Mining Dynamic Data Streams • Applications • Conclusion
Motivation • The need for computationally efficient, low-memory mining algorithms • Real-time constraint • Memory requirements • The need to mine new information from data streams • The need for results with high accuracy and confidence • Approximate results with high accuracy
Thesis Statement “Designing computation and memory efficient algorithms to provide approximate results with high accuracy and confidence helps mine useful information from data streams”
Roadmap • Thesis Statement • Current Work • Decision Tree Construction • Frequent Itemsets Mining • Future Work • Mining Maximal/Closed/Approximate Frequent Itemsets • Mining New Knowledge from Data Streams • Mining Dynamic Data Streams • Applications • Conclusion
Decision Tree Construction • Three predictor attributes: numerical (salary, age), categorical (employment) • Class label attribute: group
Training records:
Salary  Age  Employment  Group
30K     30   Self        C
40K     35   Industry    C
70K     50   Academia    C
50K     45   Self        B
70K     30   Academia    B
60K     35   Industry    A
60K     35   Self        A
70K     30   Self        A
40K     45   Industry    C
Decision tree:
Salary <= 50K: Group C
Salary > 50K:
  Age > 40: Group C
  Age <= 40:
    Employment in {Academia, Industry}: Group B
    Employment = Self: Group A
The problem • Basic algorithm (a greedy algorithm) • The tree is built in a top-down, recursive way • At the start, all the training records are at the root; the records are partitioned recursively based on split criteria • Split criteria are selected based on a heuristic or statistical measure (e.g., information gain or the gini function) • Analysis • To find the split criteria, all the records falling into the node need to be scanned • The entire dataset is scanned multiple times • Numerical attributes are difficult to handle • Streaming Data • You can only have one look at the data • Real-time constraint and memory requirement
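To make the cost of choosing a split concrete, here is a minimal sketch that scores two candidate splits with the gini function over the nine training records from the Decision Tree Construction slide. The helper names (`gini`, `gini_gain`) and the tuple layout are illustrative choices, not part of the original slides; note that every candidate requires a scan of all records at the node.

```python
from collections import Counter

# (salary in K, age, employment, group), copied from the training table.
RECORDS = [
    (30, 30, "Self", "C"),
    (40, 35, "Industry", "C"),
    (70, 50, "Academia", "C"),
    (50, 45, "Self", "B"),
    (70, 30, "Academia", "B"),
    (60, 35, "Industry", "A"),
    (60, 35, "Self", "A"),
    (70, 30, "Self", "A"),
    (40, 45, "Industry", "C"),
]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(records, predicate):
    """Impurity reduction obtained by splitting `records` on `predicate`."""
    left = [r[-1] for r in records if predicate(r)]
    right = [r[-1] for r in records if not predicate(r)]
    n = len(records)
    parent = gini([r[-1] for r in records])
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return parent - weighted


# Each candidate split needs a full scan of the records at the node.
gain_salary = gini_gain(RECORDS, lambda r: r[0] <= 50)
gain_age = gini_gain(RECORDS, lambda r: r[1] <= 40)
```

On this data the salary split scores higher than the age split, which is consistent with salary appearing at the root of the example tree.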
Outline of Our Solution • Motivation • Three New Techniques • Experimental results • Conclusion
Very Fast Decision Tree (VFDT)─Domingos and Hulten (SIGKDD’00) • Sampling based approach • Given a desired confidence level (α), apply the Hoeffding inequality to test whether enough samples have been collected to find the best split criteria • Accuracy • Probabilistic bound on the number of nodes that differ between the tree built on samples and the one built on the complete data • Limitation • Focuses on processing categorical attributes • Ideal environment
Our Contributions • Efficient processing of numerical attributes • High memory and computational overheads • Numerical Interval Pruning (NIP) • Determining exact split points in one pass • Confidence interval • ExactSplit algorithm • Using a smaller sample size for the same probabilistic bound • Normal Test • Efficient Decision Tree Construction on Streaming Data (R. Jin and G. Agrawal, SIGKDD’03) • Accurate One Pass Mining of Streaming Data (R. Jin, A. Goswami and G. Agrawal, submitted to SDM’04)
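The Hoeffding test mentioned above can be sketched as follows. This is a hedged illustration of the standard Hoeffding bound, not code from VFDT itself: `delta` is assumed to be one minus the desired confidence level, `value_range` is the range of the split measure (an assumption the caller must supply), and the function names are hypothetical.

```python
from math import log, sqrt


def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the mean of n samples of a variable with
    the given range lies within this distance of its true mean."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))


def enough_samples(best_gain, second_gain, value_range, delta, n):
    """True when the observed lead of the best split over the runner-up
    exceeds the Hoeffding bound, so the choice is unlikely to change."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


eps_1k = hoeffding_bound(1.0, 0.05, 1_000)
eps_10k = hoeffding_bound(1.0, 0.05, 10_000)
```

The bound shrinks as 1/sqrt(n), so a small gap between the two best splits needs many more samples before a decision can be made.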
Numerical Interval Pruning (NIP) –Efficiently Handling Numerical Attributes • Existing methods • Preprocessing • Online sorting • Full Class Histogram • Basic Ideas of NIP • Hierarchical Information • Concise class Histogram and Detailed Information • Divide the range of numerical attributes into intervals • Summarize class histogram for intervals • Only visit intervals likely to have best split point • Drop the detailed information for pruned intervals (Approximate)
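The "concise class histogram" part of this idea can be sketched as below. This is only an assumption-laden illustration: it uses equal-width intervals and the gini function, evaluates splits at interval boundaries only, and omits the upper-bound pruning and the detailed per-interval information that NIP relies on. The class and method names are hypothetical.

```python
from collections import Counter


def gini(counts):
    """Gini impurity from a Counter of class counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


class IntervalClassHistogram:
    """One class histogram per equal-width interval of a numerical attribute."""

    def __init__(self, lo, hi, num_intervals):
        self.lo = lo
        self.k = num_intervals
        self.width = (hi - lo) / num_intervals
        self.hist = [Counter() for _ in range(num_intervals)]

    def add(self, value, label):
        """Single-pass update: only the interval's class count is stored."""
        idx = int((value - self.lo) / self.width)
        idx = min(max(idx, 0), self.k - 1)  # clamp the upper end point
        self.hist[idx][label] += 1

    def boundary_gains(self):
        """(boundary, gini gain) for a split 'value < boundary' at each
        interior interval boundary."""
        total = Counter()
        for h in self.hist:
            total.update(h)
        n = sum(total.values())
        if n == 0:
            return []
        parent = gini(total)
        left = Counter()
        gains = []
        for i in range(self.k - 1):
            left.update(self.hist[i])
            right = total - left
            n_left = sum(left.values())
            weighted = (n_left * gini(left) + (n - n_left) * gini(right)) / n
            gains.append((self.lo + (i + 1) * self.width, parent - weighted))
        return gains


# Salary (in K) and group from the decision tree example.
salary_group = [(30, "C"), (40, "C"), (70, "C"), (50, "B"), (70, "B"),
                (60, "A"), (60, "A"), (70, "A"), (40, "C")]
hist = IntervalClassHistogram(lo=30, hi=70, num_intervals=4)
for salary, group in salary_group:
    hist.add(salary, group)
best_boundary, best_gain = max(hist.boundary_gains(), key=lambda g: g[1])
```

Memory use depends on the number of intervals and classes rather than on the number of distinct attribute values, which is the motivation for summarizing by interval.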
Finding Best Split Point • The data comes from an IBM Quest synthetic dataset for function 0 • Figure label: Best Split Point
Summarizing and Pruning Intervals • Upper bound of gains for intervals
Visiting Detailed Information • Figure label: Best Split Point
Re-pruning and Verification • Gain of Best Split Point • False Pruning • Additional intervals need to be visited if false pruning happens
Least Upper Bound of Gain for an Interval • Interval [50, 54]: Possible Best Configuration-1 • Interval [50, 54]: Possible Best Configuration-2
Finding Exact Split Points in a Single Pass • Confidence Interval (CI) • Build a CI around the approximate split points • If the exact split points, after processing all data, fall into the CI, we will be able to determine them and also correct the descendant nodes • ExactSplit algorithm • Recursively find exact split points and correct the descendant nodes, starting from the root • Dynamic shrinking • Reduce the length of the CI as more data instances are processed
Sample Size Problem • Let n be the sample size of S, N be the normal distribution. Then, for the entropy function g, we have where, • Normal Test • The Normal Test is better than the Hoeffding bound because the latter does not utilize the normal distribution property.
Performance Results • 700MHz Intel Pentium III, with 1GB SDRAM and an 18GB disk with an Ultra160 SCSI drive • Stop condition: >=95% accuracy, depth of nodes<=12, >=1% fraction of instances • Start processing a node once it holds >=10,000 data instances, and re-evaluate each node every 5,000 data instances
Instances Utilization • ClassHist-H: Hoeffding bound and full class histograms • Sample-H: Hoeffding bound and samples to evaluate candidate split conditions • NIP-H: Hoeffding bound and Numerical Interval Pruning • NIP-N: Normal test and Numerical Interval Pruning
Adult Dataset • Predicting whether income exceeds $50K/yr based on census data • 48842 instances, 14 attributes (6 continuous and 8 nominal) Running Time in seconds, TIR and IAP in millions
Summary • The three new techniques enable • an average 39% reduction in execution time • a 37% reduction in the number of data instances required • an average accuracy of 79% in determining the exact split condition for the non-leaf nodes on the top 5 levels • The techniques can be applied to other applications, such as • k-means clustering (ongoing work)
Roadmap • Thesis Statement • Current Work • Decision Tree Construction • Frequent Itemsets Mining • Future Work • Mining Maximal/Closed/Approximate Frequent Itemsets • Mining New Knowledge from Data Streams • Mining Dynamic Data Streams • Applications • Conclusion
Frequent Itemsets Mining • Desired frequency 50% • {A},{B},{C},{A,B}, {A,C} • Downward-closure property • If an itemset is frequent, all of its subsets must also be frequent • Multi-pass algorithms or in-core datasets • Apriori, Eclat, FP-tree
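A small level-wise sketch shows how the downward-closure property prunes candidates. The four transactions are hypothetical, chosen only so that a 50% threshold yields the itemsets listed on the slide; the function is a simplified Apriori-style illustration, not any of the named algorithms as published.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: one pass over the data per itemset size."""
    tx = [frozenset(t) for t in transactions]
    min_count = min_support * len(tx)
    items = sorted({i for t in tx for i in t})
    current = [frozenset([i]) for i in items]
    result = {}
    size = 1
    while current:
        # One full pass over the transactions for this level.
        counts = {c: sum(1 for t in tx if c <= t) for c in current}
        frequent = {c: k for c, k in counts.items() if k >= min_count}
        result.update(frequent)
        size += 1
        # Downward closure: keep a candidate only if all of its
        # (size - 1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(s) in frequent for s in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result


# Hypothetical transactions (not from the slides).
TRANSACTIONS = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["D"]]
found = frequent_itemsets(TRANSACTIONS, 0.5)
```

Here {A,B,C} is never counted, because its subset {B,C} is not frequent. The loop also makes the multi-pass nature visible: each itemset size costs another scan of the data.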
The Problem • Streaming data • You can only have one look at the data • Impossible to find all the frequent itemsets in one pass • Proposed solutions (with θ, ε) • A one-pass algorithm finds a superset of the frequent (θ) itemsets, and each itemset in the superset has to appear more often than a desired frequency (θ(1-ε)) • A two-pass algorithm finds the exact frequent itemsets (eliminating the false positives)
Outline of Our Solution • A simplified problem and its solution • StreamMining • Implementation Issues • Experimental Results • Conclusion
A Simplified Problem • Finding frequent items • Given a sequence (x1,…,xN) where xi∈[1,n], and a real number θ between zero and one • Looking for the xi whose frequency > θ • N>>n>>1/θ • The number of frequent items P ≤ 1/θ, since P*(Nθ) ≤ N
KRP algorithm ─ R. Karp et al. (TODS’03) • n=12 • N=30 • θ=0.35 • ⌈1/θ⌉=3 • N/⌈1/θ⌉ ≤ Nθ
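The counter-based idea behind this slide can be sketched as follows. This is one common formulation, written under the assumption that a decrement round is triggered when ⌈1/θ⌉ distinct items are being tracked; the function name and the example stream are hypothetical (the stream only mirrors the slide's n=12, N=30, θ=0.35).

```python
from math import ceil


def krp_candidates(stream, theta):
    """One pass, fewer than ceil(1/theta) counters kept between items.
    Returns a superset of the items whose frequency exceeds theta * N."""
    k = ceil(1 / theta)
    counts = {}
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        if len(counts) >= k:
            # Decrement round: removes k occurrences in total, so at most
            # N / k <= N * theta rounds can happen over the whole stream.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return set(counts)


# Hypothetical stream: N = 30 items drawn from 1..12; item 1 occurs 12 times.
STREAM = [1, 2, 3, 1, 4, 1, 5, 1, 6, 1,
          7, 1, 8, 1, 9, 1, 10, 1, 11, 1,
          12, 1, 2, 1, 3, 2, 4, 2, 5, 2]
candidates = krp_candidates(STREAM, 0.35)
```

An item occurring more than Nθ times loses at most one count per round, so it always survives. The returned set may also contain infrequent items, which is why a second pass is needed for exact answers.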
Frequent Itemsets • n=10K, θ=0.1%, average length=10 • n*n=100M, |frequent 2-itemsets| ≤ 50K • The 2-itemset is the key!
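The gap between the two numbers on this slide can be checked with a short calculation. The sketch assumes, for simplicity, that every transaction has exactly 10 items (the slide only gives an average), so the result is an illustration of the counting argument rather than the slide's own derivation.

```python
from fractions import Fraction
from math import comb

n_items = 10_000
theta = Fraction(1, 1000)  # 0.1% support
length = 10                # assumed fixed transaction length

# Size of a full n-by-n table of item pairs.
pair_table_size = n_items * n_items

# Each transaction contributes C(10, 2) pair occurrences, and a frequent
# pair needs more than theta * N of them, so at most C(10, 2) / theta
# pairs can be frequent, independent of N.
pairs_per_transaction = comb(length, 2)
max_frequent_pairs = pairs_per_transaction / theta
```

Under this assumption the bound works out to 45,000, which is below the slide's 50K and tiny compared with the 100M entries of the full pair table.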
Enhance the Accuracy • n=12 • N=30 • θ=0.35, ⌈1/θ⌉=3 • ε=0.5, θ(1-ε)=0.175, ⌈1/(θε)⌉=6
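One way to use the larger counter budget ⌈1/(θε)⌉ is sketched below. It is an assumption about how the two thresholds fit together, not a transcription of the authors' algorithm: with more counters there are at most Nθε decrement rounds, so each surviving counter undercounts by at most Nθε, and reporting counters of at least θ(1-ε)N keeps every item with frequency ≥ θN while guaranteeing a true frequency of at least θ(1-ε)N for every reported item. The names and the stream are hypothetical.

```python
from math import ceil


def approx_frequent_items(stream, theta, eps):
    """One-pass approximate frequent items using ceil(1/(theta*eps)) counters."""
    k = ceil(1 / (theta * eps))
    counts = {}
    n_seen = 0
    for x in stream:
        n_seen += 1
        counts[x] = counts.get(x, 0) + 1
        if len(counts) >= k:
            # At most n_seen / k <= n_seen * theta * eps such rounds occur.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    threshold = theta * (1 - eps) * n_seen
    return {x: c for x, c in counts.items() if c >= threshold}


# Hypothetical stream: N = 30; item 1 occurs 12 times, item 2 occurs 5 times.
STREAM = [1, 2, 3, 1, 4, 1, 5, 1, 6, 1,
          7, 1, 8, 1, 9, 1, 10, 1, 11, 1,
          12, 1, 2, 1, 3, 2, 4, 2, 5, 2]
reported = approx_frequent_items(STREAM, theta=0.35, eps=0.5)
```

With θ=0.35 and ε=0.5 this uses 6 counters and a reporting threshold of 0.175 · 30 = 5.25 occurrences, matching the numbers on the slide.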
StreamMining Sketch • Put a transaction into the buffer • Update 1-itemset counts • Update/insert 2-itemsets • If the number of 2-itemsets exceeds a threshold • Crossover • Apply the transactions in the buffer to update 3-itemsets, 4-itemsets … • Clear buffer • Perform additional Crossover
Implementation Issues • Data Structure • TreeHash, a prefix tree encoded into a hash table • Frequently insert/delete/increment the potentially frequent itemsets • Optimizations • Online dataset trimming • Reducing subset checking • Online checking
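The slides do not show how TreeHash is laid out, so the following is only one plausible way to encode a prefix tree in a hash table: each tree edge becomes a dictionary entry keyed by (parent node id, item). The class, its methods, and the encoding are all assumptions made for illustration, and deletion is left out to keep the sketch short.

```python
class PrefixTreeHash:
    """Itemset counts stored as a prefix tree flattened into dictionaries."""

    ROOT = 0

    def __init__(self):
        self.child = {}              # (parent node id, item) -> node id
        self.count = {self.ROOT: 0}  # node id -> itemset count
        self.next_id = 1

    def _walk(self, itemset, create):
        """Follow the sorted items from the root; optionally create nodes."""
        node = self.ROOT
        for item in sorted(itemset):
            key = (node, item)
            if key not in self.child:
                if not create:
                    return None
                self.child[key] = self.next_id
                self.count[self.next_id] = 0
                self.next_id += 1
            node = self.child[key]
        return node

    def increment(self, itemset, by=1):
        """Insert the itemset if needed, then add to its count."""
        node = self._walk(itemset, create=True)
        self.count[node] += by

    def get(self, itemset):
        """Count stored for the itemset, or 0 if it was never inserted."""
        node = self._walk(itemset, create=False)
        return 0 if node is None else self.count[node]


tree = PrefixTreeHash()
tree.increment({"A", "B"})
tree.increment({"B", "A"})  # same itemset, different order
tree.increment({"A", "C"})
```

Because items are sorted before the walk, itemsets that share a prefix share tree nodes, while each lookup step stays a single hash probe.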
Experimental Results T10.I4.N10K Dataset, 12M transactions
Experimental Results (Cont’d) T10.I4.N10K Dataset, 0.1% support level
Results for a very large number of distinct items T25.I4.N100K Dataset, 12M transactions
Real Dataset BMS-WebView-1 Dataset
Related Work • One-pass algorithm: Manku and Motwani • Two-pass algorithm • Partition • Sampling based • CARMA • Oracle • FP-tree and FP-stream • Multi-pass algorithm
Discussion • The new algorithm StreamMining • High accuracy (even when ε=1, the accuracy is 94% or higher) • Memory efficient • Handles a very large number of distinct itemsets and a low threshold using reasonable amounts of memory • Observations • Reducing the number of passes does not directly improve performance • Computationally intensive rather than I/O intensive • The in-core algorithm is the key
Roadmap • Thesis Statement • Current Work • Decision Tree Construction • Frequent Itemsets Mining • Future Work • Mining Maximal/Closed/Approximate Frequent Itemsets • Mining New Knowledge from Data Streams • Mining Dynamic Data Streams • Applications • Conclusion
Mining Frequent Itemsets • The problem • Computationally intensive • Different Solutions • Maximal Frequent Itemsets • Closed Frequent Itemsets • Approximate Frequent Itemsets • StreamMax • Contour sets
Stream* • Common characteristics of one-pass and two-pass algorithms for streaming data and very large datasets • Maintaining a superset of frequent itemsets • Different methods to update the supersets • Applying slightly different in-core algorithms • A framework to efficiently incorporate different in-core algorithms for mining streams • Apriori, Eclat and FP-tree • TreeHash and StreamMining/MM
Frequent Itemsets Mining over Dynamic Data Streams • Sliding Window Model • Recent data • New queries raised by the sliding window • Frequent itemsets for the current window • The intersection and union of frequent itemsets over windows • Itemsets with large frequency changes • Two key issues • How to forget/delete obsolete information • Computing the new queries systematically
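The "forget obsolete information" issue can be illustrated with a minimal exact sketch for single items. It keeps the whole window in memory, which a real streaming algorithm may not be able to afford, so it should be read as a baseline that shows the expiry step rather than as the proposed future work. The class name and the sample transactions are hypothetical.

```python
from collections import Counter, deque


class SlidingWindowCounts:
    """Exact item counts over the most recent `window_size` transactions."""

    def __init__(self, window_size):
        self.size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, transaction):
        items = frozenset(transaction)
        self.window.append(items)
        self.counts.update(items)
        if len(self.window) > self.size:
            # Forget the oldest transaction and undo its contribution.
            old = self.window.popleft()
            self.counts.subtract(old)
            for item in old:
                if self.counts[item] == 0:
                    del self.counts[item]

    def frequent(self, theta):
        """Items appearing in at least a theta fraction of the window."""
        min_count = theta * len(self.window)
        return {i for i, c in self.counts.items() if c >= min_count}


window = SlidingWindowCounts(window_size=3)
for t in (["A", "B"], ["A"], ["A", "C"], ["B", "C"]):
    window.add(t)
current = window.frequent(0.5)
```

After the fourth transaction arrives, the first one expires, so B drops below the 50% threshold while A and C remain frequent in the current window.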
Learning over Dynamic Streaming Data • Concept Drifting • CVFDT • Ensemble classifiers • Clustering • Mining changes • Demon • Burst detection • Cluster Changes