
Efficient Mining of Recurrent Rules from a Sequence Database
David Lo†*
Joint work with: Siau-Cheng Khoo† and Chao Liu‡
† Prog. Lang. & Sys. Lab, Dept of Comp. Science, National Uni. of Singapore. Current: (Sch. of Info. Systems, Singapore Management Uni.)
‡ Data Mining Group, Department of Computer Science, Uni. of Illinois at Urbana-Champaign. Current: (Microsoft Research, Redmond)

Motivation
• Huge amounts of data exist, and we want to mine knowledge from them.
• Recurrent rules: “Whenever a series of precedent events (pre) occurs, eventually another series of consequent events (post) occurs.” Denoted as: pre -> post
• We want to mine recurrent rules from a sequence database.

Recurrent Rules – Intuitive Examples
• Locking Protocol: “Whenever a lock is acquired, eventually it is released.”
• Internet Banking: “Whenever a connection to a bank server is made and authentication is completed, money transfer command is issued and verified, eventually money is transferred and notification is displayed.”

Soft. Specifications & Recurrent Rule
• Recurrent rule
  • Corresponds to a family of program properties useful for software verification
  • Formalized in Linear Temporal Logic
• Software specs are often incomplete or outdated, which motivates mining them [ABL02,DSB04,LKL07]
• Mining specifications helps in:
  • Understanding existing/legacy systems
  • Helping verification tools ensure correctness of systems and detect bugs

Problem Statements
Problem 1
• “Given a set of sequences, find rules that recur (are satisfied) a significant number of times within a sequence and across multiple sequences.”
• A rule is significant if it satisfies minimum thresholds of supports and confidence.
Problem 2
• “Mine a set of non-redundant significant recurrent rules.”

Extending Sequential Rules [S99]
• Sequential rule pre -> post:
  • Rules formed by composing sequential patterns [AS95,YHA03,WH04]: series of events supported by (i.e., a sub-sequence of) a significant number of sequences.
  • Whenever a sequence is a super-sequence of pre, it will also be a super-sequence of pre++post.
• Recurrent rule:
  • Multiple occurrences of the rule’s premise and consequent, both within a sequence and across multiple sequences, are considered.

Extending Episode Rules [MTV97]
• Episode rule pre -> post:
  • Episode: series of events occurring close together (e.g., in a window).
  • Whenever a window is a super-sequence of pre, it will also be a super-sequence of pre++post.
• Recurrent rule:
  • Handles multiple sequences
  • We want to break the window barrier
  • It is hard to tell the right window size
  • A lock can be separated from its unlock by an arbitrary number of events
  • We mine a non-redundant set of rules

Preliminaries

Linear Temporal Logic (LTL)
• Formalism to precisely specify temporal requirements.
• It works on paths [HR03]
• There are a number of operators:
  • G p – Globally at every point in time p holds
  • F p – At that point in time or eventually (Finally) p holds
  • X p – p holds at the neXt point in time
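As a worked illustration of how these operators combine, the locking example ("whenever a lock is acquired, eventually it is released") can be written in LTL as below. The second line is our reading of how a multi-event rule unrolls into nested operators; the event names a, b, c, d are placeholders, and the use of X before F (so that the consequent occurs strictly after the premise) is an assumption rather than something stated on the slides.

```latex
\begin{align*}
\langle lock \rangle \rightarrow \langle unlock \rangle
  &:\quad G\bigl(lock \rightarrow XF\, unlock\bigr) \\
\langle a, b \rangle \rightarrow \langle c, d \rangle
  &:\quad G\bigl(a \rightarrow XG\bigl(b \rightarrow XF\,(c \wedge XF\, d)\bigr)\bigr)
\end{align*}
```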

Checking or Verifying Temporal Logics
[Figure: a program is transformed into an automata model (states: main, lock, use, unlock, end), which is checked against an LTL property.]
Program:
main(x){
  if (lock=0) lock;use;unlock;lock;
  else for i: 1 to 10 lock;use;unlock
}
LTL property to check: <main,lock> -> <unlock,end>
Possible traces or sequences:
• main lock use unlock lock end
• main lock use unlock lock use unlock end
• main lock use unlock end
• …
[The figure marks a violation among these traces.]
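The check on the three listed traces can be sketched in a few lines of Python. This is a minimal illustration on finite traces, not the verification tool from the slide: the function names `is_subsequence` and `violations` are ours, and we assume that a rule is violated at a point where the premise has just been completed but the consequent never follows in the remainder of the trace.

```python
def is_subsequence(pattern, seq):
    """True if the events of pattern occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def violations(pre, post, trace):
    """1-based points where pre has just completed but post never follows."""
    bad = []
    for j in range(1, len(trace) + 1):
        # pre is completed at j if it fits in the j-prefix and ends exactly at j
        completes_pre = trace[j - 1] == pre[-1] and is_subsequence(pre, trace[:j])
        if completes_pre and not is_subsequence(post, trace[j:]):
            bad.append(j)
    return bad


traces = [
    "main lock use unlock lock end".split(),
    "main lock use unlock lock use unlock end".split(),
    "main lock use unlock end".split(),
]
pre, post = ["main", "lock"], ["unlock", "end"]
results = [violations(pre, post, t) for t in traces]

for t, bad in zip(traces, results):
    print(" ".join(t), "->", "violation at " + str(bad) if bad else "ok")
```

Under these assumptions only the first trace fails: its second lock (point 5) is followed by end without an intervening unlock.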

Concepts, Definitions And Rules Semantics

Temporal Points
“Whenever a series of precedent events occurs at a point in time or temporal point, eventually another series of consequent events occurs.”
• Peek at interesting temporal points and see what series of events are likely to happen next
• Temporal points in a sequence S
  • The indices in S, starting from 1.
  • Consider a sequence <a,b,a,b,a,c>. There are 6 temporal points in the sequence.
  • For a temporal point j in S = <a1,…,an>, the prefix <a1,…,aj> of S is called the j-prefix of S.

Occurrences & Instances
• Consider a pattern P and a sequence S
  • The set of all occurrences of P in S, Occ(P,S), is the set:
    { j | P is a sub-sequence of the j-prefix of S && last(P) = S[j] }
  • The set of all instances of P in S, Inst(P,S), is the set:
    { j-prefix of S | j is in Occ(P,S) }
• Consider the sequence <A,B,A,B,A,B>
  • The set of occurrences of <A,B> is {2,4,6}
  • The set of instances of <A,B> is {<A,B>, <A,B,A,B>, <A,B,A,B,A,B>}
  • These correspond to the temporal points to be checked for rules with <A,B> as premise
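The two definitions translate almost directly into code. The sketch below is illustrative only; the function names are ours, and sequences are represented as Python lists of event labels.

```python
def is_subsequence(pattern, seq):
    """True if the events of pattern occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def occurrences(pattern, seq):
    """Occ(P,S): 1-based indices j where P fits in the j-prefix and ends at j."""
    return [
        j
        for j in range(1, len(seq) + 1)
        if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])
    ]


def instances(pattern, seq):
    """Inst(P,S): the j-prefixes of S for every j in Occ(P,S)."""
    return [seq[:j] for j in occurrences(pattern, seq)]


S = list("ABABAB")
P = list("AB")
occ = occurrences(P, S)
inst = instances(P, S)
print(occ)
print(["".join(prefix) for prefix in inst])
```

On the slide's example sequence this yields the occurrences 2, 4 and 6 and the three prefixes listed above.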

Projected and Projected-all DB
• A database SeqDB projected on pattern P is defined as:
  SeqDB_P = { (j,sx) | s = SeqDB[j], s = px++sx, where px is the minimal prefix of s containing P }
[Example table with columns SeqDB and SeqDB_<a,b>; entries that survive extraction: <e,a,b,c>, <e,a,e,b,c>]

Projected and Projected-all DB
• A database SeqDB projected-all on pattern P is defined as:
  SeqDB_P^all = { (j,sx) | s = SeqDB[j], s = px++sx, where px is an instance of P }
• Returns the temporal points to check
[Example table with columns SeqDB and SeqDB_<a,b>^all; entries that survive extraction: <e,a,b,c>, <c>, <e,a,e,b,c>, <c>]
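The difference between the two projections is easiest to see side by side. The sketch below is a minimal illustration with our own function names; the two input sequences are hypothetical (the original table's input column did not survive), chosen so that the outputs contain the suffixes that are still legible on the slides.

```python
def is_subsequence(pattern, seq):
    """True if the events of pattern occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def project(db, pattern):
    """Projected DB: suffix after the minimal prefix containing the pattern."""
    out = []
    for idx, s in enumerate(db, start=1):
        k = 0  # number of pattern events matched so far (greedy, leftmost)
        for j, event in enumerate(s, start=1):
            if event == pattern[k]:
                k += 1
                if k == len(pattern):
                    out.append((idx, s[j:]))
                    break
    return out


def project_all(db, pattern):
    """Projected-all DB: one suffix per instance of the pattern."""
    out = []
    for idx, s in enumerate(db, start=1):
        for j in range(1, len(s) + 1):
            if s[j - 1] == pattern[-1] and is_subsequence(pattern, s[:j]):
                out.append((idx, s[j:]))
    return out


# Hypothetical database, not taken from the slides.
db = [list("abeabc"), list("abeaebc")]
P = list("ab")
proj = project(db, P)
proj_all = project_all(db, P)
print(proj)
print(proj_all)
```

The plain projection returns one suffix per sequence, while the projected-all version returns one suffix per temporal point at which the pattern has just been completed.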

Counting Supports and Confidence
• Consider the rule pre -> post
• Sequence support (s-sup): the number of sequences where the premise pre appears.
• Instance support (i-sup): the number of instances of pre++post.
• Confidence (conf): the likelihood that post appears after pre. This can be found by computing the ratio:
  conf = |instances of pre where post eventually occurs afterwards| / |instances of pre|
       = |(SeqDB_pre^all)_post| / |SeqDB_pre^all|

Counting Supports and Confidence
[Example database figure not recoverable.]
• s-sup(<a,b> -> <c>) = 2
• i-sup(<a,b> -> <c>) = 3
• conf(<a,b> -> <c>) = 1.0
• conf(<a,b> -> <e>) = 0.5
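The three measures can be computed directly from their definitions. The sketch below uses our own function names and a small hypothetical database (not the one pictured on the slide, so the numbers are not expected to match the values above).

```python
def is_subsequence(pattern, seq):
    """True if the events of pattern occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def occurrences(pattern, seq):
    """1-based indices j where pattern fits in the j-prefix and ends at j."""
    return [
        j
        for j in range(1, len(seq) + 1)
        if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])
    ]


def rule_stats(pre, post, db):
    """Return (s-sup, i-sup, conf) of the rule pre -> post over db."""
    # s-sup: sequences in which the premise appears at all
    s_sup = sum(is_subsequence(pre, s) for s in db)
    # i-sup: instances of the concatenation pre++post
    i_sup = sum(len(occurrences(pre + post, s)) for s in db)
    # conf: fraction of premise instances that are followed by post
    pre_points = [(s, j) for s in db for j in occurrences(pre, s)]
    followed = sum(is_subsequence(post, s[j:]) for s, j in pre_points)
    conf = followed / len(pre_points) if pre_points else 0.0
    return s_sup, i_sup, conf


# Hypothetical database, not taken from the slides.
db = [list("abc"), list("abeabc")]
stats_c = rule_stats(list("ab"), list("c"), db)
stats_e = rule_stats(list("ab"), list("e"), db)
print(stats_c)
print(stats_e)
```

Here <a,b> has three instances in total; all three are followed by c, but only one is followed by e, which is why the two rules differ in confidence while sharing the same sequence support.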

Properties, Theorems, and Algorithms

Apriori Properties – Support & Conf.
Theorem 1. Consider two rules Rx = p -> c and Ry = q -> c. If p ⊑ q (p is a sub-sequence of q) and s-sup(Rx) < min-s-sup, then s-sup(Ry) < min-s-sup.
Theorem 2. Consider two rules Rx = p -> c and Ry = p -> d. If c ⊑ d and conf(Rx) < min-conf, then conf(Ry) < min-conf.
• Example for Theorem 1: Rx: a -> z with s-sup(Rx) < min-s-sup. Non-significant Rys: a,b -> z; a,b,c -> z; a,c -> z; a,b,d -> z; ….
• Example for Theorem 2: Rx: a -> z with conf(Rx) < min-conf. Non-significant Rys: a -> b,z; a -> b,c,z; a -> c,z; a -> b,d,z; ….

Rule Redundancy
• Consider two rules Rx = p -> c and Ry = q -> d. Rx is redundant if the following conditions hold:
  • Rx is a sub-sequence of Ry (i.e., p++c ⊑ q++d)
  • Rx and Ry have the same sup. and conf. values.
• Example: given the rule a -> b,c,d, the rules a -> b; a -> c; a -> b,c; a -> b,d; …. are redundant iff their sup. and conf. values are the same as those of a -> b,c,d.
• Redundant rules are identified and removed early during the mining process.
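The redundancy test itself is a sub-sequence check plus a comparison of statistics. The sketch below is a naive post-hoc filter, not the early pruning used during mining; the `Rule` record, the function names and the statistics in the example are hypothetical. Two points are our assumptions: "same sup." is read as equality of both s-sup and i-sup, and rules whose concatenations are identical are left untouched rather than tie-broken.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    pre: Tuple[str, ...]
    post: Tuple[str, ...]
    s_sup: int
    i_sup: int
    conf: float


def is_subsequence(pattern, seq):
    """True if the events of pattern occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def is_redundant(rx: Rule, ry: Rule) -> bool:
    """True if rx is rendered redundant by ry."""
    x, y = rx.pre + rx.post, ry.pre + ry.post
    if x == y:
        return False  # assumption: identical concatenations are not tie-broken
    same_stats = (rx.s_sup, rx.i_sup, rx.conf) == (ry.s_sup, ry.i_sup, ry.conf)
    return same_stats and is_subsequence(x, y)


def remove_redundant(rules: List[Rule]) -> List[Rule]:
    """Keep only rules that no other rule in the list renders redundant."""
    return [r for r in rules if not any(is_redundant(r, o) for o in rules)]


# Hypothetical rules and statistics, for illustration only.
rules = [
    Rule(("a",), ("b",), 2, 3, 1.0),
    Rule(("a",), ("b", "c", "d"), 2, 3, 1.0),
    Rule(("a",), ("c",), 2, 5, 1.0),
]
kept = remove_redundant(rules)
print(kept)
```

In this example a -> b is dropped because a -> b,c,d subsumes it with identical statistics, while a -> c survives because its instance support differs.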

• Given <a,b,c,d> -> post, redundant rules: <a,d> -> post; <a,c,d> -> post; ….
• Given pre -> <b,c,d,e>, redundant rules: pre -> <c,d,e>; pre -> <d,e>; ….
Theorem 3. Given two pre-conditions PX and PY where PX ⊑ PY (PX is a sub-sequence of PY), if SeqDB_PX = SeqDB_PY, then for all sequences of events post, the rule PX -> post is rendered redundant by PY -> post.
Theorem 4. Given two rules RX (pre -> CX) and RY (pre -> CY), if CX ⊑ CY and (SeqDB_pre^all)_CX = (SeqDB_pre^all)_CY, then RX is rendered redundant by RY and can be pruned.

Algorithm
• Step 1: Mine a pruned set of pre-conditions
  • Satisfy the min-s-sup threshold
  • Use Theorems 1 & 3
• Step 2: For each pre-condition pre, create SeqDB_pre^all.
• Step 3: Mine a pruned set of post-conditions
  • Corresponding rules satisfy min-conf
  • Use Theorems 2 & 4
• Step 4: Remove rules that don’t satisfy min-i-sup.
• Step 5: Filter any remaining redundant rules.

Equiv. Proj DB & LS-Set Patterns
• From Theorem 3 (& 4), a pre- (post-) condition is not pruned iff there does not exist any super-sequence pattern having the same projected database.
  • Also referred to as projected-database closed or LS-Set (Yan and Han, 2003)
• We generate this set by modifying BIDE (Wang and Han, 2004)
  • Keep the search space pruning strategy
  • Remove the closure checks
  • Proof of completeness in technical report

[Figure: algorithm pipeline] Mine Pruned Pre-Conds; Mine Pruned Post-Conds; Check Instance Support & Remove Remaining Red. Rules

Performance & Case Study

Synthetic Dataset D5C20N10S20: 147x faster, 8500x more compact

Gazelle Dataset (KDD Cup 2000): the full set of significant rules is not minable

JBoss Security
Whenever login configuration information is checked, eventually invocations of authentication events, binding of principal to subject, and utilization of subject & principal information occur.

Conclusion
• We propose a novel framework to mine a non-redundant set of significant recurrent rules:
  • “Whenever a series of precedent events occurs, eventually a series of consequent events occurs”
• Employ 2 apriori properties and 2 redundancy theorems
• Major speedup and reduction in the number of rules through the non-redundant rule mining strategy
• We show the utility in mining the behavior of JBoss Security
Future Work
• Improve mining speed
• More case studies and applications to DM/SE problems
