Constraint Mining of Frequent Patterns in Long Sequences Presented by Yaron Gonen
Outline • Introduction • Problem definition and motivation • Previous work • The CAMLS Algorithm • Overview • Main contributions • Results • Future Work
Frequent Item-sets: The Market-Basket Model • A set of items, e.g., stuff sold in a supermarket • A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.
Support • Support for item-set I = the number of baskets containing all items in I (usually given as a percentage) • Given a support threshold minSup, sets of items that appear in > minSup baskets are called frequent item-sets • Simplest question: find all frequent item-sets
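As a minimal sketch of the definition above (the baskets and threshold here are illustrative, not from the talk):

```python
# Illustrative baskets; each basket is a set of items.
baskets = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"bread", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

print(support({"beer", "diapers"}, baskets))  # 2 of 3 baskets
```

With minSup = 0.6, {beer, diapers} (support 2/3) would count as frequent here.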
Example • Items: • Minimum Support = 0.6 (2 baskets)
Application (1) • Items: products at a supermarket • Baskets: the set of products a customer bought at one time • Example: many people buy beer and diapers together • Place beer next to diapers to increase both sales • Run a sale on diapers and raise the price of beer
Application (2) (Counter-Intuitive) • Items: species of plants • Baskets: each basket represents an attribute. A basket contains the items (plants) that have that attribute • Frequent sets may indicate similarity between plants
Scale of Problem • Costco sells more than 120k different items, and has 57m members (from Wikipedia) • Botany has identified about 350k extant species of plants
The Naïve Algorithm • Generate all possible item-sets. • Check their support.
The Apriori Property • All nonempty subsets of a frequent itemset must also be frequent.
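The blow-up is easy to see in code; a sketch of the naive enumeration (the items are illustrative):

```python
from itertools import combinations

def all_itemsets(items):
    """Enumerate every non-empty subset of `items` -- 2^m - 1 candidates."""
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            yield frozenset(combo)

candidates = list(all_itemsets({"a", "b", "c"}))
print(len(candidates))  # 2**3 - 1 = 7
```

Already at Costco scale (over 120k items) this is hopeless, which is what motivates Apriori-style pruning.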
The Apriori Algorithm • Find frequent 1-itemsets with one pass over the DB • Merge frequent itemsets and prune (here is where the Apriori property is used) to generate candidates of the next size • Go through the whole DB to count each candidate's support; candidates with support > minSup become frequent itemsets • Repeat while candidates remain • In total, the DB is scanned as many times as the length of the largest frequent itemset
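A compact sketch of the level-wise loop described above, assuming baskets are Python sets (illustrative, not the talk's implementation):

```python
from itertools import combinations

def apriori(baskets, min_sup):
    """Level-wise Apriori: one full DB scan per candidate size."""
    n = len(baskets)
    # Frequent 1-itemsets (first DB scan).
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c / n >= min_sup}
    frequent = set(L)
    k = 2
    while L:
        # Merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune by the Apriori property: every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # One full DB scan to count support of the surviving candidates.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        L = {c for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent |= L
        k += 1
    return frequent
```

For example, over the baskets {a,b,c}, {a,c}, {a,d}, {b,c} with minSup 0.5, the pairs (a,c) and (b,c) come out frequent while (a,b) does not.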
Vertical Format • Index on items: for each item, keep the list of transaction ids that contain it • Calculating support is fast: intersect the id lists
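A sketch of the idea, with transaction ids playing the role of the index entries (the data is illustrative):

```python
# Illustrative baskets, identified by their position (tid).
baskets = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

# Vertical format: item -> set of tids containing it.
tidsets = {}
for tid, basket in enumerate(baskets):
    for item in basket:
        tidsets.setdefault(item, set()).add(tid)

def support_count(itemset):
    """Support via set intersection -- no scan over the baskets."""
    tids = set.intersection(*(tidsets[i] for i in itemset))
    return len(tids)

print(support_count({"a", "b"}))  # baskets 0 and 3 -> 2
```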
Frequent Sequences: Taking it to the Next Level • A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time
Support • Subsequence: a sequence all of whose events are subsets of another sequence's events, in the same order (but not necessarily consecutive) • Support for subsequence s = the number of sequences containing s (usually given as a percentage) • Given a support threshold minSup, subsequences that appear in > minSup sequences are called frequent subsequences • Simplest question: find all frequent subsequences
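A sketch of the subsequence test underlying this definition (the event sets are illustrative):

```python
def is_subsequence(sub, seq):
    """True if every event of `sub` is a subset of some event of `seq`,
    in order, but not necessarily at consecutive positions."""
    i = 0
    for event in seq:
        if i < len(sub) and sub[i] <= event:  # subset test
            i += 1
    return i == len(sub)

# <a (bc)> is a subsequence of <(ab) d (bcd)>:
print(is_subsequence([{"a"}, {"b", "c"}],
                     [{"a", "b"}, {"d"}, {"b", "c", "d"}]))  # True
```

Support of a subsequence is then just the number of database sequences for which this test returns True.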
Notations • Items are letters: a, b, … • Events are parenthesized: (ab), (bdf), … • Except for single-item events, which are written without parentheses • Sequences are surrounded by <…> • Every sequence has an identifier, sid
Example • minSup = 0.5
Motivation • Customer shopping patterns • Stock market fluctuation • Weblog click-stream analysis • Symptoms of a disease • DNA sequence analysis • Weather forecasting • Machine anti-aging • Many more…
Much Harder than Frequent Item-sets! • There are 2^(m·n) possible candidates, where m is the number of items and n is the number of transactions (events) in the longest sequence
The Apriori Property • If a sequence is not frequent, then any sequence that contains it cannot be frequent
Constraints • Problems: • Too many frequent sequences • Most frequent sequences are not useful • Solution: remove them • Constraints are a way to define usefulness • The trick: do so while mining
Previous Work • GSP (Srikant and Agrawal, 1996) • Apriori-based generate-and-test approach • SPADE (Zaki, 2001) • Apriori-based generate-and-test approach • Uses equivalence classes for memory optimization • Uses a vertical-format DB • PrefixSpan (Pei et al., 2004) • No candidate generation • Uses a DB-projection method
Why a New Algorithm? • A huge set of candidate sequences / projected DBs is generated • Multiple scans of the database are needed • Inefficient for mining long sequential patterns • No exploitation of domain-specific properties • Weak constraint support
The CAMLS Algorithm • Constraint-based Apriori algorithm for Mining Long Sequences • Designed especially for efficient mining of long sequences • Outperforms SPADE and PrefixSpan on both synthetic and real data
The CAMLS Algorithm Makes a logical distinction between two types of constraints: • Intra-event: not time-related (e.g., mutually exclusive items) • Inter-event: addresses the temporal aspect of the data (e.g., values that can or cannot appear one after the other)
Event-wise Constraints • An event must / must not contain a specific item • Two items cannot occur at the same time • max_event_length: an event cannot contain more than a fixed number of items
Sequence-wise Constraints • max_sequence_length: a sequence cannot contain more than a fixed number of events • max_gap: a long time between events disqualifies the pattern
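The two constraint families can be sketched as simple predicates; the limit values below are made up for illustration:

```python
# Illustrative limits -- the constraint names come from the slides,
# the values here are invented.
MAX_EVENT_LENGTH = 3      # event-wise: at most 3 items per event
MAX_SEQUENCE_LENGTH = 5   # sequence-wise: at most 5 events per sequence
MAX_GAP = 7               # sequence-wise: max time between consecutive events

def event_ok(event):
    """Event-wise check: applied to a single event, ignoring time."""
    return len(event) <= MAX_EVENT_LENGTH

def sequence_ok(timed_events):
    """Sequence-wise check on a list of (timestamp, event) pairs."""
    if len(timed_events) > MAX_SEQUENCE_LENGTH:
        return False
    times = [t for t, _ in timed_events]
    return all(b - a <= MAX_GAP for a, b in zip(times, times[1:]))

print(event_ok({"a", "b"}))                      # True
print(sequence_ok([(0, {"a"}), (10, {"b"})]))    # False: gap 10 > 7
```

The point of the distinction is that event-wise checks can run entirely inside the first phase, while sequence-wise checks belong to the second.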
CAMLS Overview • Input: sequence database + constraints (minSup, maxGap, …) • Event-wise phase: produces frequent events + occurrence index • Sequence-wise phase: consumes them and produces the output, all frequent sequences
What Do We Get? • The best of both worlds: • Far fewer candidates are generated • Support checking is fast • Worst case: works like SPADE • Tradeoff: uses a bit more memory (for storing the frequent item-sets)
Event-wise Phase • Input: sequence database and constraints • Output: frequent events + occurrence index • Use Apriori or FP-Growth to find frequent itemsets (both with minor modifications)
Event-wise • L1 = all frequent items • for k = 2; Lk−1 ≠ Ø; k++ do •  generateCandidates(Lk−1): if two frequent (k−1)-events have the same prefix, merge them to form a new candidate •  Lk = pruneCandidates(): prune, calculate support counts and create the occurrence index •  L = L ∪ Lk • end for • Example soon!
Occurrence Index • A compact representation of all occurrences of a sequence • Structure: a list of sids, each associated with a list of eids (e.g., sid1 → eid1, eid2, eid3; sid2 → eid4, eid5; sid3 → eid6, …, eid9) • Example on next slide!
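A sketch of such an index as a plain dict, with a hypothetical `extend` helper showing how a prefix's index combines with an item's index (the sids/eids are illustrative):

```python
# Occurrence index: pattern -> {sid: sorted list of eids where it occurs}.
index_a = {1: [0, 2], 2: [1], 3: [0]}   # occurrences of <a>
index_b = {1: [3],    3: [1, 4]}        # occurrences of <b>

def support(index):
    """Support = number of sequences (sids) containing the pattern."""
    return len(index)

def extend(index_prefix, index_item):
    """Hypothetical helper: occurrences of <prefix item> -- for each shared
    sid, keep the item's eids that come after the prefix's earliest eid."""
    out = {}
    for sid in index_prefix.keys() & index_item.keys():
        first = index_prefix[sid][0]
        eids = [e for e in index_item[sid] if e > first]
        if eids:
            out[sid] = eids
    return out

index_ab = extend(index_a, index_b)
print(support(index_ab))  # sids 1 and 3 -> 2
```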
Event-wise Example (Using Apriori) • minSup = 2 • All frequent items: a:3, b:2, c:3, d:3 • Candidates: (ab), (ac), (ad), (bc), … • Support counts: (ac):2, (ad):2, (bd):2, (cd):2 • Candidates: (abc), (abd), (acd), … • Support count: (acd):2 • No more candidates!
Sequence-wise Phase • Input: frequent events + occurrence index, constraints • Output: all frequent sequences • Similar to GSP's and SPADE's candidate generation phase, except that it uses the frequent itemsets as seeds
Sequence-wise • L1 = all frequent 1-sequences • for k = 2; Lk−1 ≠ Ø; k++ do •  generateCandidates(Lk−1) •  Lk = pruneAndSupCalc() •  L = L ∪ Lk • end for • Elaboration on the next two slides
Sequence-wise Candidate Generation • If two frequent k-sequences s′ = <s′1 s′2 … s′k> and s′′ = <s′′1 s′′2 … s′′k> share a common (k−1)-prefix, i.e., <s′1 s′2 … s′k−1> = <s′′1 s′′2 … s′′k−1>, and s′ is a generator, we form a new candidate <s′1 s′2 … s′k s′′k>
Sequence-wise Pruning • Keep a radix-ordered list of the sequences pruned in the current iteration • Within the same iteration, a k-sequence may contain another pruned k-sequence • For each new candidate: • Check the pruned list for a contained subsequence: very fast! • Otherwise, test for frequency • Add the candidate to the pruned list if needed
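A sketch of this join (the sequence representation is illustrative; events are tuples):

```python
def join(s1, s2):
    """If s1 and s2 share their first k-1 events, return s1 extended by
    the last event of s2; otherwise None. Sketch of the prefix join."""
    if s1[:-1] == s2[:-1] and s1 != s2:
        return s1 + (s2[-1],)
    return None

s1 = (("a",), ("b",))   # the sequence <ab>
s2 = (("a",), ("c",))   # the sequence <ac>
print(join(s1, s2))     # <abc>: (('a',), ('b',), ('c',))
```

The generator flag from the slide would be an extra precondition on s1 before calling `join`.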
Support Calculation • A simple intersection operation between the occurrence indices of the forming sequences • The intersection also forms the new sequence's occurrence index, so the calculation is trivial
The maxGap Constraint • maxGap is a special kind of constraint: • Data-dependent • The Apriori property is not applicable • The occurrence index enables a fast maxGap check • A frequent sequence that does not satisfy maxGap is flagged as a non-generator. Example: • Assume <ab> is frequent, but the gap between a and b exceeds maxGap • Yet <ac> and <cb> are frequent, and in <acb> all maxGap constraints are satisfied! • So <ab> is a non-generator, but it is kept in order not to prune <acb>
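A sketch of the gap check, treating eids as timestamps (a hypothetical simplification of the index-based check):

```python
def satisfies_max_gap(a_eids, b_eids, max_gap):
    """True if some occurrence of b follows some occurrence of a
    within max_gap time units."""
    return any(0 < b - a <= max_gap for a in a_eids for b in b_eids)

# Illustrative occurrences: a at time 0, c at time 4, b at time 9; maxGap = 5.
print(satisfies_max_gap([0], [9], 5))   # <ab> violates maxGap -> False
print(satisfies_max_gap([0], [4], 5))   # <ac> is fine -> True
print(satisfies_max_gap([4], [9], 5))   # <cb> is fine -> True
```

This is exactly the slide's situation: <ab> fails the check, yet <acb> satisfies maxGap between every pair of consecutive events, so <ab> must be retained as a non-generator.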
Sequence-wise Example • minSup = 2, maxGap = 5 • <aa> is added to the pruned list • <a(ac)> is a super-sequence of <aa>, therefore it is pruned • <ab> does not pass maxGap, therefore it is not a generator • No more candidates!
Evaluation (1): Machine Anti-Aging • How can sequence mining help? • Data collected from a machine forms a sequence • Discover typical behavior leading to failure • Monitor the machine and alert before failure occurs • Domain: light intensity per wavelength (continuous) • Pre-processing: discretization; meta-features (maxDisc, maxWL, isBurned) • Synm stands for a synthetic database simulating the machine's behavior with m meta-features
Evaluation (2) • Real stock data values • Rn stands for stock data (10 different stocks) over n days
So, What's CAMLS's Contribution? • The distinction between constraint types: easy implementation • Two phases • Handling of the maxGap constraint • The occurrence-index data structure • A fast new pruning method
Future Research • Main issue: closed sequences • More constraints (aspiring to regular-expression constraints)