
IncSpan: Incremental Mining of Sequential Patterns in Large Databases



Presentation Transcript


  1. IncSpan: Incremental Mining of Sequential Patterns in Large Databases • Hong Cheng, Xifeng Yan, Jiawei Han • University of Illinois at Urbana-Champaign

  2. Sequence Database Is Growing! • Sequential pattern mining is an important problem with broad applications • Customer shopping sequences • Medical treatment sequences • Web log mining • Many real-life sequence databases grow incrementally • A customer continues shopping • A patient has new treatment records • A web log grows with subsequent visits

  3. Incremental Mining Is Challenging • Undesirable to mine from scratch each time a small fraction of the sequences grows • Nontrivial to mine sequential patterns incrementally because • Database growth brings in new patterns • Growing subsequences interact with the original ones • IncSpan: Major new techniques • Buffering semi-frequent patterns • Reverse pattern matching

  4. Major Challenge: Appending to Existing Sequences • Two kinds of sequence database growth • Insert new sequences • Append new transactions to existing sequences (more challenging; our focus) • Example: Minimum Support = 10%

  5. Semi-Frequent: A Buffer In Between • Given min_sup and μ ≤ 1, a sequence a is • frequent if sup(a) ≥ min_sup • semi-frequent if μ·min_sup ≤ sup(a) < min_sup • infrequent if sup(a) < μ·min_sup • Incremental sequential pattern mining • Given a sequence database D, a min_sup threshold, the set of frequent subsequences FS in D, and an appended sequence database D’ of D • Mine the set of frequent subsequences FS’ in D’ based on FS instead of mining on D’ from scratch
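The three-way split above can be sketched in a few lines of Python. This is only an illustration of the definitions on this slide: the function names `classify` and `partition`, and the use of absolute support counts, are assumptions and do not come from the presentation.

```python
def classify(support, min_sup, mu):
    """Label a pattern by its support, following the slide's definitions.

    Assumes `support` and `min_sup` are absolute counts and 0 < mu <= 1.
    """
    if not 0 < mu <= 1:
        raise ValueError("mu must be in (0, 1]")
    if support >= min_sup:
        return "frequent"
    if support >= mu * min_sup:
        return "semi-frequent"
    return "infrequent"


def partition(supports, min_sup, mu):
    """Split a {pattern: support} mapping into the FS and SFS sets.

    Infrequent patterns are dropped, since only FS and SFS are kept.
    """
    fs, sfs = {}, {}
    for pattern, support in supports.items():
        label = classify(support, min_sup, mu)
        if label == "frequent":
            fs[pattern] = support
        elif label == "semi-frequent":
            sfs[pattern] = support
    return fs, sfs
```

With `min_sup = 10` and `mu = 0.5`, a pattern with support 5 lands in SFS, while one with support 4 is discarded.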

  6. Semi-Frequent Sequence Buffering and Maintenance • Keep some additional information about the original database for incremental mining • Buffer the semi-frequent subsequences SFS of the original database • SFS are “almost frequent”, so they are likely to become frequent in the growing database • SFS is a boundary between frequent and infrequent sequences • Keep both FS and SFS of the original database

  7. Possible State Transitions After Appending

  8. Buffering Technique (I) • Handle the “infrequent-to-frequent” case • If an infrequent pattern p’ in D becomes frequent in D’, then at least one of its prefix subsequences p is in FS • Solution: Start from its frequent prefix p and construct the p-projected database to discover p’ • Theorem (used for search space pruning): For a frequent pattern p, if its support in LDB (the sequences of D’ that received appended items) satisfies sup_LDB(p) < (1 − μ)·min_sup, then there is no sequence p’ having p as prefix that changes from infrequent in D to frequent in D’
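The pruning condition can be justified from the slide 5 definitions alone. The short derivation below is a sketch under stated assumptions: LDB is a name assumed here for the set of sequences in D’ that received appended items, min_sup is treated as an absolute count that does not change between D and D’, and p is a prefix of p’.

```latex
% Assumed notation (not in the transcript): LDB = sequences of D' with appended items.
\begin{align*}
\mathrm{sup}_{D'}(p') - \mathrm{sup}_{D}(p')
  &\le \mathrm{sup}_{LDB}(p')
  && \text{(only appended sequences can gain a new match)} \\
\mathrm{sup}_{LDB}(p')
  &\le \mathrm{sup}_{LDB}(p)
  && \text{(every sequence containing $p'$ contains its prefix $p$)} \\
\mathrm{sup}_{D}(p')
  &< \mu \cdot \mathit{min\_sup}
  && \text{($p'$ is infrequent in $D$)} \\
\intertext{so if $\mathrm{sup}_{LDB}(p) < (1-\mu)\cdot \mathit{min\_sup}$, then}
\mathrm{sup}_{D'}(p')
  &< \mu \cdot \mathit{min\_sup} + (1-\mu)\cdot \mathit{min\_sup}
   = \mathit{min\_sup}.
\end{align*}
```

In words: a pattern that starts below μ·min_sup needs a gain of more than (1 − μ)·min_sup to become frequent, and a prefix that is rare among the appended sequences cannot supply that gain.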

  9. Buffering Technique (II) • Handle the “infrequent-to-semi-frequent” case • If an infrequent pattern p’ in D becomes semi-frequent in D’, then at least one of its prefix subsequences p is either in FS or in SFS • Solution: Start from its frequent or semi-frequent prefix p and construct the p-projected database to discover p’
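Both buffering slides rely on building a p-projected database and growing patterns from the prefix p. The sketch below shows that idea in PrefixSpan style under a simplifying assumption: each sequence is a list of single items rather than of itemsets. The names `project` and `extend` are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter


def project(db, prefix):
    """Return the prefix-projected database.

    For every sequence that contains `prefix` as a subsequence, keep the
    suffix that follows the earliest match of the prefix.
    """
    projected = []
    for seq in db:
        pos = 0
        for item in prefix:
            try:
                pos = seq.index(item, pos) + 1
            except ValueError:
                break  # prefix is not contained in this sequence
        else:
            projected.append(seq[pos:])
    return projected


def extend(db, prefix, threshold):
    """Find all patterns that start with `prefix` and have support >= threshold."""
    found = {}

    def grow(pattern, pdb):
        counts = Counter()
        for suffix in pdb:
            counts.update(set(suffix))  # count each item once per sequence
        for item, sup in counts.items():
            if sup >= threshold:
                new_pattern = pattern + (item,)
                found[new_pattern] = sup
                # Project again on the newly added item.
                grow(new_pattern,
                     [s[s.index(item) + 1:] for s in pdb if item in s])

    grow(tuple(prefix), project(db, prefix))
    return found
```

Calling `extend` with `threshold = min_sup` looks for newly frequent extensions, while `threshold = mu * min_sup` also picks up newly semi-frequent ones.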

  10. Reverse Pattern Matching • An optimization technique: match a pattern against a sequence from the end towards the front • Since new itemsets are appended at the end, reverse matching can save some computation • If the last item of pattern p does not appear in the appended part Sa, then appending Sa to S will not increase sup(p) • So, just scan Sa for the last item of p and prune the search if it is absent
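A minimal sketch of the reverse-matching check follows, again assuming sequences of single items and a non-empty pattern. The helper names `may_gain_support`, `matches_reverse`, and `support_gain` are illustrative and are not taken from the presentation.

```python
def may_gain_support(pattern, appended):
    """Cheap filter: a new match must end inside the appended part,
    so the pattern's last item has to occur there."""
    return pattern[-1] in appended


def matches_reverse(pattern, sequence):
    """Subsequence test that walks the sequence from the end to the front."""
    i = len(pattern) - 1
    for item in reversed(sequence):
        if i < 0:
            break  # whole pattern already matched
        if item == pattern[i]:
            i -= 1
    return i < 0


def support_gain(pattern, old_seq, appended):
    """Return 1 if appending turns a non-match into a match, otherwise 0."""
    if not may_gain_support(pattern, appended):
        return 0  # pruned without touching the old sequence
    if matches_reverse(pattern, old_seq):
        return 0  # already counted before the append
    return 1 if matches_reverse(pattern, old_seq + appended) else 0
```

The filter in `may_gain_support` only scans the appended part, which is usually much shorter than the full sequence, so most patterns are dismissed before any full match is attempted.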

  11. Performance Study • Compare with • ISM algorithm [Parthasarathy, Zaki, Ogihara and Dwarkadas, CIKM’99] • PrefixSpan (the mining-from-scratch approach), to see how much IncSpan can save • Compare CPU time and memory usage • Figure 1. Memory usage under varied min_sup

  12. Performance Study (II) • Figure 2. Varying min_sup • Figure 3. Varying percentage of updated sequences

  13. Discussion and Conclusion • Buffering semi-frequent patterns is effective • The user can control the size of SFS through μ • Every pattern in SFS falls short of min_sup by at most (1 − μ)·min_sup, so it is likely to become frequent as the database grows • When only a small portion (5%) of the database is appended, IncSpan is more efficient than mining from scratch • IncSpan can be easily extended to handle inserting sequences into or deleting sequences from the database • Can it handle incremental mining in stream data? • No: it still needs more than one scan of the database
