Constraint Mining of Frequent Patterns in Long Sequences Presented by Yaron Gonen
Outline • Introduction • Problem definition and motivation • Previous work • The CAMLS Algorithm • Overview • Main contributions • Results • Future Work
Frequent Item-sets: The Market-Basket Model • A set of items, e.g., stuff sold in a supermarket • A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.
Support • Support for item-set I = the number of baskets containing all items in I (usually given as a percentage) • Given a support threshold minSup, sets of items that appear in > minSup baskets are called frequent item-sets • Simplest question: find all frequent item-sets
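As a minimal sketch of the definition above (the baskets and threshold here are illustrative, not from the talk):

```python
# Illustrative baskets; each basket is a set of items.
baskets = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"bread", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

print(support({"beer", "diapers"}, baskets))  # 2 of 3 baskets
```

With minSup = 0.6, {beer, diapers} (support 2/3) would count as frequent here.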
Example • Items: • Minimum Support = 0.6 (2 baskets)
Application (1) • Items: products at a supermarket • Baskets: the set of products a customer bought at one time • Example: many people buy beer and diapers together • Place beer next to diapers to increase both sales • Run a sale on diapers and raise the price of beer
Application (2) (Counter-Intuitive) • Items: species of plants • Baskets: each basket represents an attribute. A basket contains the items (plants) that have that attribute • Frequent sets may indicate similarity between plants
Scale of Problem • Costco sells more than 120k different items, and has 57m members (from Wikipedia) • Botany has identified about 350k extant species of plants
The Naïve Algorithm • Generate all possible item-sets. • Check their support.
The Apriori Property • All nonempty subsets of a frequent itemset must also be frequent.
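The blow-up is easy to see in code; a sketch of the naive enumeration (the items are illustrative):

```python
from itertools import combinations

def all_itemsets(items):
    """Enumerate every non-empty subset of `items` -- 2^m - 1 candidates."""
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            yield frozenset(combo)

candidates = list(all_itemsets({"a", "b", "c"}))
print(len(candidates))  # 2**3 - 1 = 7
```

Already at Costco scale (over 120k items) this is hopeless, which is what motivates Apriori-style pruning.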
The Apriori Algorithm • Find frequent 1-itemsets with one pass over the DB • Merge frequent itemsets and prune (here is where the Apriori property is used) to generate candidates of the next size • Go through the whole DB to count each candidate's support; candidates with support > minSup become frequent itemsets • Repeat while candidates remain • In total, the DB is scanned as many times as the length of the largest frequent itemset
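A compact sketch of the level-wise loop described above, assuming baskets are Python sets (illustrative, not the talk's implementation):

```python
from itertools import combinations

def apriori(baskets, min_sup):
    """Level-wise Apriori: one full DB scan per candidate size."""
    n = len(baskets)
    # Frequent 1-itemsets (first DB scan).
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c / n >= min_sup}
    frequent = set(L)
    k = 2
    while L:
        # Merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune by the Apriori property: every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # One full DB scan to count support of the surviving candidates.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        L = {c for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent |= L
        k += 1
    return frequent
```

For example, over the baskets {a,b,c}, {a,c}, {a,d}, {b,c} with minSup 0.5, the pairs (a,c) and (b,c) come out frequent while (a,b) does not.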
Vertical Format • Index on items: for each item, keep the list of transaction ids that contain it • Calculating support is fast: intersect the id lists
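A sketch of the idea, with transaction ids playing the role of the index entries (the data is illustrative):

```python
# Illustrative baskets, identified by their position (tid).
baskets = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

# Vertical format: item -> set of tids containing it.
tidsets = {}
for tid, basket in enumerate(baskets):
    for item in basket:
        tidsets.setdefault(item, set()).add(tid)

def support_count(itemset):
    """Support via set intersection -- no scan over the baskets."""
    tids = set.intersection(*(tidsets[i] for i in itemset))
    return len(tids)

print(support_count({"a", "b"}))  # baskets 0 and 3 -> 2
```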
Frequent Sequences: Taking it to the Next Level • A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time
Support • Subsequence: a sequence all of whose events are subsets of another sequence's events, in the same order (but not necessarily consecutive) • Support for subsequence s = the number of sequences containing s (usually given as a percentage) • Given a support threshold minSup, subsequences that appear in > minSup sequences are called frequent subsequences • Simplest question: find all frequent subsequences
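A sketch of the subsequence test underlying this definition (the event sets are illustrative):

```python
def is_subsequence(sub, seq):
    """True if every event of `sub` is a subset of some event of `seq`,
    in order, but not necessarily at consecutive positions."""
    i = 0
    for event in seq:
        if i < len(sub) and sub[i] <= event:  # subset test
            i += 1
    return i == len(sub)

# <a (bc)> is a subsequence of <(ab) d (bcd)>:
print(is_subsequence([{"a"}, {"b", "c"}],
                     [{"a", "b"}, {"d"}, {"b", "c", "d"}]))  # True
```

Support of a subsequence is then just the number of database sequences for which this test returns True.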
Notations • Items are letters: a, b, … • Events are parenthesized: (ab), (bdf), … • Except for single-item events, which are written without parentheses • Sequences are surrounded by <…> • Every sequence has an identifier, sid
Example • minSup = 0.5
Motivation • Customer shopping patterns • Stock market fluctuation • Weblog click-stream analysis • Symptoms of a disease • DNA sequence analysis • Weather forecasting • Machine anti-aging • Many more…
Much Harder than Frequent Item-sets! • There are 2^(m·n) possible candidates, where m is the number of items and n is the number of transactions (events) in the longest sequence
The Apriori Property • If a sequence is not frequent, then any sequence that contains it cannot be frequent
Constraints • Problems: • Too many frequent sequences • Most frequent sequences are not useful • Solution: remove them • Constraints are a way to define usefulness • The trick: do so while mining
Previous Work • GSP (Srikant and Agrawal, 1996) • Apriori-based generate-and-test approach • SPADE (Zaki, 2001) • Apriori-based generate-and-test approach • Uses equivalence classes for memory optimization • Uses a vertical-format DB • PrefixSpan (Pei et al., 2004) • No candidate generation • Uses a DB-projection method
Why a New Algorithm? • A huge set of candidate sequences / projected DBs is generated • Multiple scans of the database are needed • Inefficient for mining long sequential patterns • No exploitation of domain-specific properties • Weak constraint support
The CAMLS Algorithm • Constraint-based Apriori algorithm for Mining Long Sequences • Designed especially for efficient mining of long sequences • Outperforms SPADE and PrefixSpan on both synthetic and real data
The CAMLS Algorithm Makes a logical distinction between two types of constraints: • Intra-event: not time-related (e.g., mutually exclusive items) • Inter-event: addresses the temporal aspect of the data (e.g., values that can or cannot appear one after the other)
Event-wise Constraints • An event must / must not contain a specific item • Two items cannot occur at the same time • max_event_length: an event cannot contain more than a fixed number of items
Sequence-wise Constraints • max_sequence_length: a sequence cannot contain more than a fixed number of events • max_gap: a long time between events disqualifies the pattern
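The two constraint families can be sketched as simple predicates; the limit values below are made up for illustration:

```python
# Illustrative limits -- the constraint names come from the slides,
# the values here are invented.
MAX_EVENT_LENGTH = 3      # event-wise: at most 3 items per event
MAX_SEQUENCE_LENGTH = 5   # sequence-wise: at most 5 events per sequence
MAX_GAP = 7               # sequence-wise: max time between consecutive events

def event_ok(event):
    """Event-wise check: applied to a single event, ignoring time."""
    return len(event) <= MAX_EVENT_LENGTH

def sequence_ok(timed_events):
    """Sequence-wise check on a list of (timestamp, event) pairs."""
    if len(timed_events) > MAX_SEQUENCE_LENGTH:
        return False
    times = [t for t, _ in timed_events]
    return all(b - a <= MAX_GAP for a, b in zip(times, times[1:]))

print(event_ok({"a", "b"}))                      # True
print(sequence_ok([(0, {"a"}), (10, {"b"})]))    # False: gap 10 > 7
```

The point of the distinction is that event-wise checks can run entirely inside the first phase, while sequence-wise checks belong to the second.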
CAMLS Overview • Input: sequence database + constraints (minSup, maxGap, …) • Event-wise phase: produces frequent events + occurrence index • Sequence-wise phase: consumes them and produces the output, all frequent sequences
What Do We Get? • The best of both worlds: • Far fewer candidates are generated • Support checking is fast • Worst case: works like SPADE • Tradeoff: uses a bit more memory (for storing the frequent item-sets)
Event-wise Phase • Input: sequence database and constraints • Output: frequent events + occurrence index • Use Apriori or FP-Growth to find frequent itemsets (both with minor modifications)
Event-wise • L1 = all frequent items • for k = 2; Lk−1 ≠ Ø; k++ do •  generateCandidates(Lk−1): if two frequent (k−1)-events have the same prefix, merge them to form a new candidate •  Lk = pruneCandidates(): prune, calculate support counts and create the occurrence index •  L = L ∪ Lk • end for • Example soon!
Occurrence Index • A compact representation of all occurrences of a sequence • Structure: a list of sids, each associated with a list of eids (e.g., sid1 → eid1, eid2, eid3; sid2 → eid4, eid5; sid3 → eid6, …, eid9) • Example on next slide!
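A sketch of such an index as a plain dict, with a hypothetical `extend` helper showing how a prefix's index combines with an item's index (the sids/eids are illustrative):

```python
# Occurrence index: pattern -> {sid: sorted list of eids where it occurs}.
index_a = {1: [0, 2], 2: [1], 3: [0]}   # occurrences of <a>
index_b = {1: [3],    3: [1, 4]}        # occurrences of <b>

def support(index):
    """Support = number of sequences (sids) containing the pattern."""
    return len(index)

def extend(index_prefix, index_item):
    """Hypothetical helper: occurrences of <prefix item> -- for each shared
    sid, keep the item's eids that come after the prefix's earliest eid."""
    out = {}
    for sid in index_prefix.keys() & index_item.keys():
        first = index_prefix[sid][0]
        eids = [e for e in index_item[sid] if e > first]
        if eids:
            out[sid] = eids
    return out

index_ab = extend(index_a, index_b)
print(support(index_ab))  # sids 1 and 3 -> 2
```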
Event-wise Example (Using Apriori) • minSup = 2 • All frequent items: a:3, b:2, c:3, d:3 • Candidates: (ab), (ac), (ad), (bc), … • Support counts: (ac):2, (ad):2, (bd):2, (cd):2 • Candidates: (abc), (abd), (acd), … • Support count: (acd):2 • No more candidates!
Sequence-wise Phase • Input: frequent events + occurrence index, constraints • Output: all frequent sequences • Similar to GSP's and SPADE's candidate generation phase, except that it uses the frequent itemsets as seeds
Sequence-wise • L1 = all frequent 1-sequences • for k = 2; Lk−1 ≠ Ø; k++ do •  generateCandidates(Lk−1) •  Lk = pruneAndSupCalc() •  L = L ∪ Lk • end for • Elaboration on the next two slides
Sequence-wise Candidate Generation • If two frequent k-sequences s′ = <s′1 s′2 … s′k> and s′′ = <s′′1 s′′2 … s′′k> share a common (k−1)-prefix, i.e., <s′1 s′2 … s′k−1> = <s′′1 s′′2 … s′′k−1>, and s′ is a generator, we form a new candidate <s′1 s′2 … s′k s′′k>
Sequence-wise Pruning • Keep a radix-ordered list of the sequences pruned in the current iteration • Within the same iteration, a k-sequence may contain another pruned k-sequence • For each new candidate: • Check the pruned list for a contained subsequence: very fast! • Otherwise, test for frequency • Add the candidate to the pruned list if needed
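A sketch of this join (the sequence representation is illustrative; events are tuples):

```python
def join(s1, s2):
    """If s1 and s2 share their first k-1 events, return s1 extended by
    the last event of s2; otherwise None. Sketch of the prefix join."""
    if s1[:-1] == s2[:-1] and s1 != s2:
        return s1 + (s2[-1],)
    return None

s1 = (("a",), ("b",))   # the sequence <ab>
s2 = (("a",), ("c",))   # the sequence <ac>
print(join(s1, s2))     # <abc>: (('a',), ('b',), ('c',))
```

The generator flag from the slide would be an extra precondition on s1 before calling `join`.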
Support Calculation • A simple intersection operation between the occurrence indices of the forming sequences • The intersection also forms the new sequence's occurrence index, so the calculation is trivial
The maxGap Constraint • maxGap is a special kind of constraint: • Data-dependent • The Apriori property is not applicable • The occurrence index enables a fast maxGap check • A frequent sequence that does not satisfy maxGap is flagged as a non-generator. Example: • Assume <ab> is frequent, but the gap between a and b exceeds maxGap • Yet <ac> and <cb> are frequent, and in <acb> all maxGap constraints are satisfied! • So <ab> is a non-generator, but it is kept in order not to prune <acb>
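A sketch of the gap check, treating eids as timestamps (a hypothetical simplification of the index-based check):

```python
def satisfies_max_gap(a_eids, b_eids, max_gap):
    """True if some occurrence of b follows some occurrence of a
    within max_gap time units."""
    return any(0 < b - a <= max_gap for a in a_eids for b in b_eids)

# Illustrative occurrences: a at time 0, c at time 4, b at time 9; maxGap = 5.
print(satisfies_max_gap([0], [9], 5))   # <ab> violates maxGap -> False
print(satisfies_max_gap([0], [4], 5))   # <ac> is fine -> True
print(satisfies_max_gap([4], [9], 5))   # <cb> is fine -> True
```

This is exactly the slide's situation: <ab> fails the check, yet <acb> satisfies maxGap between every pair of consecutive events, so <ab> must be retained as a non-generator.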
Sequence-wise Example • minSup = 2, maxGap = 5 • <aa> is added to the pruned list • <a(ac)> is a super-sequence of <aa>, therefore it is pruned • <ab> does not pass maxGap, therefore it is not a generator • No more candidates!
Evaluation (1): Machine Anti-Aging • How can sequence mining help? • Data collected from a machine forms a sequence • Discover typical behavior leading to failure • Monitor the machine and alert before failure occurs • Domain: light intensity per wavelength (continuous) • Pre-processing: discretization; meta-features (maxDisc, maxWL, isBurned) • Synm stands for a synthetic database simulating the machine's behavior with m meta-features
Evaluation (2) • Real stock data values • Rn stands for stock data (10 different stocks) over n days
So, What's CAMLS's Contribution? • The distinction between constraint types: easy implementation • Two phases • Handling of the maxGap constraint • The occurrence-index data structure • A fast new pruning method
Future Research • Main issue: closed sequences • More constraints (aspiring to regular-expression constraints)