IncSpan: Incremental Mining of Sequential Patterns in Large Databases Hong Cheng,Xifeng Yan,Jiawei Han University of Illinois at Urbana-Champaign
Sequence Database Is Growing! • Sequential pattern mining is an important problem with broad applications • Customer shopping sequences • Medical treatment sequences • Web log mining • Many real life sequence databases grow incrementally • Customer continues shopping • Patient has new treatment records • Web log grows with subsequent visits
Incremental Mining Is Challenging • Undesirable to mine from scratch each time a small fraction of the sequences grows • Nontrivial to mine sequential patterns incrementally because • Database growth brings in new patterns • Growing subsequences interact with original ones • IncSpan: Major new techniques • Buffering semi-frequent patterns • Reverse pattern matching
Major Challenge: Appending to Existing Sequences • Two kinds of sequence database growth • Insert new sequences • Append new transactions to existing sequences (More challenging—our focus) • Example: Minimum Support=10%
Semi-Frequent: A Buffer In Between • Given min_sup and a factor μ ≤ 1, a sequence a is • frequent if sup(a) ≥ min_sup • semi-frequent if μ·min_sup ≤ sup(a) < min_sup • infrequent if sup(a) < μ·min_sup • Incremental sequential pattern mining • Given a sequence database D, a min_sup threshold, the set of frequent subsequences FS in D, and an appended sequence database D’ of D • Mine the set of frequent subsequences FS’ in D’ based on FS instead of mining D’ from scratch
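The three-way split above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each sequence is a list of single items (real sequence databases hold itemsets), treats min_sup as an absolute count rather than a percentage, and the names `is_subsequence`, `support`, and `classify` are hypothetical.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], seq: Sequence[str]) -> bool:
    """True if the items of `pattern` occur in `seq` in order (gaps allowed)."""
    it = iter(seq)
    # `item in it` advances the iterator until it finds `item`,
    # so each pattern item must be found after the previous one.
    return all(item in it for item in pattern)


def support(pattern: Sequence[str], db: List[Sequence[str]]) -> int:
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in db)


def classify(sup: float, min_sup: float, mu: float) -> str:
    """Apply the slide's definition (assumes mu <= 1)."""
    if sup >= min_sup:
        return "frequent"
    if sup >= mu * min_sup:
        return "semi-frequent"
    return "infrequent"


if __name__ == "__main__":
    db = [list("abc"), list("ac"), list("ab"), list("bc")]
    for pattern in (["a"], ["a", "b"], ["a", "b", "c"]):
        print(pattern, classify(support(pattern, db), min_sup=3, mu=0.5))
```

A smaller μ widens the semi-frequent band, so more patterns are buffered at the cost of more storage.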
Semi-Frequent Sequence Buffering and Maintenance • Keep some additional information about the original database for incremental mining • Buffer the semi-frequent subsequences SFS of the original database • Sequences in SFS are “almost frequent”, so they are likely to become frequent as the database grows • SFS is a boundary between frequent and infrequent sequences • Keep FS and SFS of the original database
Buffering Technique (I) • Handle the “infrequent-to-frequent” case • If a pattern p’ that is infrequent in D becomes frequent in D’, then at least one of its prefix subsequences p is in FS • Solution: Start from its frequent prefix p and construct the p-projected database to discover p’ • Theorem (used for search space pruning): For a frequent pattern p, if its support in LDB satisfies sup_LDB(p) < (1 − μ)·min_sup, where LDB is the set of sequences in D’ that have items appended, then there is no sequence p’ having p as a prefix that changes from infrequent in D to frequent in D’
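The pruning test can be sketched as follows, taking the condition to be sup_LDB(p) < (1 − μ)·min_sup with LDB the set of sequences that received appended items (an assumption of this sketch). The reasoning: a pattern p’ that is in neither FS nor SFS has support below μ·min_sup in D, and its support can grow by at most sup_LDB(p), so it cannot reach min_sup. The sketch assumes single-item sequences and an absolute min_sup count; `can_skip_prefix` is a hypothetical name.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], seq: Sequence[str]) -> bool:
    """True if the items of `pattern` occur in `seq` in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern: Sequence[str], db: List[Sequence[str]]) -> int:
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in db)


def can_skip_prefix(p: Sequence[str], ldb: List[Sequence[str]],
                    min_sup: float, mu: float) -> bool:
    """True if no extension of frequent prefix `p` can go from infrequent to frequent.

    `ldb` holds only the sequences that were appended to, in their updated
    form (assumption: this is what LDB denotes).
    """
    return support(p, ldb) < (1 - mu) * min_sup


if __name__ == "__main__":
    ldb = [list("abd"), list("acd")]  # the two sequences that grew
    print(can_skip_prefix(["a"], ldb, min_sup=4, mu=0.5))  # a occurs in both
    print(can_skip_prefix(["b"], ldb, min_sup=4, mu=0.5))  # b occurs in one
```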
Buffering Technique (II) • Handle the “infrequent-to-semi-frequent” case • If a pattern p’ that is infrequent in D becomes semi-frequent in D’, then at least one of its prefix subsequences p is in either FS or SFS • Solution: Start from its frequent or semi-frequent prefix p and construct the p-projected database to discover p’
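Constructing a p-projected database, the step both buffering slides rely on, might look like the sketch below. It follows the usual PrefixSpan idea of keeping, for every sequence that contains p, the suffix after the leftmost match of p. Sequences are simplified to lists of single items, and the names `project` and `extension_counts` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def project(prefix: Sequence[str], db: List[List[str]]) -> List[List[str]]:
    """Return the suffixes that follow the leftmost match of `prefix`."""
    projected = []
    for seq in db:
        pos = 0
        for item in prefix:
            try:
                # Find the next occurrence of `item` at or after `pos`.
                pos = seq.index(item, pos) + 1
            except ValueError:
                break  # `seq` does not contain the prefix
        else:
            # Empty suffixes are kept so len(projected) equals sup(prefix).
            projected.append(seq[pos:])
    return projected


def extension_counts(projected: List[List[str]]) -> Counter:
    """Count, per item, how many suffixes contain it (support of prefix + item)."""
    counts: Counter = Counter()
    for suffix in projected:
        counts.update(set(suffix))
    return counts


if __name__ == "__main__":
    db = [list("abcb"), list("acb"), list("bca")]
    pdb = project(["a", "b"], db)
    print(pdb)
    print(extension_counts(pdb))
```

Each item whose count meets μ·min_sup or min_sup yields a longer semi-frequent or frequent pattern, and the search recurses on that extended prefix.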
Reverse Pattern Matching • An optimization technique: Match a pattern against a sequence from the end towards the front • Since the item sets are appended at the end, reverse matching can save some computation • If the last item of pattern p does not appear in Sa, then appending Sa to S will not increase sup(p) • So, just scan Sa for the last item of p and prune the search if that item is absent
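A minimal sketch of this check, under the assumptions that sequences are lists of single items, that Sa is appended strictly after the end of S, and that patterns are non-empty. The function names are hypothetical and the code is an illustration rather than the authors' implementation.

```python
from typing import List, Sequence


def contains_reverse(pattern: Sequence[str], seq: Sequence[str]) -> bool:
    """Subsequence test that matches `pattern` from its last item backwards."""
    i = len(pattern) - 1
    for item in reversed(seq):
        if i < 0:
            break  # every pattern item has been matched
        if item == pattern[i]:
            i -= 1
    return i < 0


def contains_after_append(pattern: Sequence[str], old: List[str],
                          appended: List[str], was_contained: bool) -> bool:
    """Decide whether old + appended contains `pattern`.

    `was_contained` says whether `old` alone already contained the pattern.
    """
    if was_contained:
        return True  # appending never removes an occurrence
    if pattern[-1] not in appended:
        # Prune: any new occurrence must end inside the appended part.
        return False
    return contains_reverse(pattern, old + appended)


if __name__ == "__main__":
    old, appended = list("abc"), list("db")
    print(contains_after_append(["c", "b"], old, appended, was_contained=False))
    print(contains_after_append(["a", "e"], old, appended, was_contained=False))
```

Scanning from the back means the appended items are examined first, so a mismatch on the last pattern item is detected without touching the original sequence.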
Performance Study • Compare with • ISM algorithm [Parthasarathy, Zaki, Ogihara and Dwarkadas, CIKM’99] • PrefixSpan, the mining-from-scratch approach, to see how much IncSpan can save • Compare CPU time and memory usage • Figure 1. Memory usage under varied min_sup
Performance Study (II) • Figure 2. Varying min_sup • Figure 3. Varying percentage of updated sequences
Discussion and Conclusion • Buffering semi-frequent patterns is effective • User can control the size of SFS through μ • Every sequence in SFS is within (1 − μ)·min_sup of being frequent, so it is likely to become frequent with database growth • When only a small portion (5%) of the database is appended, IncSpan is more efficient than mining from scratch • IncSpan can be easily extended to handle inserting sequences into or deleting sequences from the database • Can IncSpan handle incremental mining of stream data? • No: it still needs more than one scan of the database