

Efficient Mining of Recurrent Rules from a Sequence Database. David Lo †* Joint work with: Siau-Cheng Khoo † and Chao Liu ‡. † Prog. Lang. & Sys. Lab Dept of Comp. Science National Uni. of Singapore Current: (Sch. of Info. Systems, Singapore Management Uni.). ‡ Data Mining Group


Presentation Transcript


  1. Efficient Mining of Recurrent Rules from a Sequence Database David Lo†* Joint work with: Siau-Cheng Khoo† and Chao Liu‡ †Prog. Lang. & Sys. Lab Dept of Comp. Science National Uni. of Singapore Current: (Sch. of Info. Systems, Singapore Management Uni.) ‡Data Mining Group Department of Computer Science Uni. of Illinois at Urbana-Champaign Current: (Microsoft Research, Redmond)

  2. Motivation • Huge amounts of data exist, and we want to mine knowledge from them. • Recurrent rules: “Whenever a series of precedent events (pre) occurs, eventually another series of consequent events (post) occurs.” Denoted as pre -> post. • We want to mine recurrent rules from a sequence database.

  3. Recurrent Rules – Intuitive Examples • Locking Protocol: “Whenever a lock is acquired, eventually it is released.” • Internet Banking: “Whenever a connection to a bank server is made and authentication is completed, money transfer command is issued and verified, eventually money is transferred and notification is displayed.”

  4. Soft. Specifications & Recurrent Rule • Recurrent rule • Corresponds to a family of program properties useful for software verification • Formalized in Linear Temporal Logic • Software specs are often incomplete or outdated, which motivates mining them [ABL02,DSB04,LKL07] • Mining specifications helps in: • Understanding existing/legacy systems • Helping verification tools ensure the correctness of systems and detect bugs.

  5. Problem Statements • Problem 1: “Given a set of sequences, find rules that recur (are satisfied) a significant number of times within a sequence and across multiple sequences. A rule is significant if it satisfies minimum thresholds of support and confidence.” • Problem 2: “Mine a set of non-redundant significant recurrent rules.”

  6. Extending Sequential Rules [S99] • Sequential rule pre -> post: • Rules formed by composing sequential patterns [AS95,YHA03,WH04]: a sequential pattern is a series of events supported by (i.e., a sub-sequence of) a significant number of sequences. • Whenever a sequence is a super-sequence of pre, it will also be a super-sequence of pre++post. • Recurrent rule: • Multiple occurrences of the rule’s premise and consequent, both within a sequence and across multiple sequences, are considered.

  7. Extending Episode Rules [MTV97] • Episode rule pre -> post: • Episode: a series of events occurring close together (e.g., in a window). • Whenever a window is a super-sequence of pre, it will also be a super-sequence of pre++post. • Recurrent rule: • Handles multiple sequences • We want to break the window barrier • It is hard to tell the right window size • A lock may be separated from its unlock by an arbitrary number of events • We mine a non-redundant set of rules

  8. Preliminaries

  9. Linear Temporal Logic (LTL) • Formalism to precisely specify temporal requirements. • It works on paths [HR03] • There are a number of operators: • G p – Globally at every point in time p holds • F p – At that point in time or eventually (Finally) p holds • X p – p holds at the neXt point in time
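
The slides say that recurrent rules are formalized in LTL but the transcript does not show the formulas. One plausible reading, built only from the G, F and X operators listed above, is the following; the exact encoding is an assumption here, not something stated on the slide. Each event of the premise opens a "from now on" obligation (G), and each event of the consequent must eventually follow (XF).

```latex
\begin{align*}
\langle a\rangle \rightarrow \langle b\rangle
  &\;\equiv\; G\,(a \rightarrow XF\,b)\\
\langle a,b\rangle \rightarrow \langle c\rangle
  &\;\equiv\; G\,(a \rightarrow XG\,(b \rightarrow XF\,c))\\
\langle a\rangle \rightarrow \langle b,c\rangle
  &\;\equiv\; G\,(a \rightarrow XF\,(b \wedge XF\,c))\\
\langle a,b\rangle \rightarrow \langle c,d\rangle
  &\;\equiv\; G\,(a \rightarrow XG\,(b \rightarrow XF\,(c \wedge XF\,d)))
\end{align*}
```

Under this reading the locking rule "whenever a lock is acquired, eventually it is released" becomes $G\,(lock \rightarrow XF\,unlock)$.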

  10. Checking or Verifying Temporal Logics • Program: main(x){ if (lock=0) lock;use;unlock;lock; else for i: 1 to 10 lock;use;unlock } • The program is transformed into an automata model (figure; states: main, lock, use, unlock, end). • LTL property to check: <main,lock> -> <unlock,end> • Possible traces or sequences: • main lock use unlock lock end (violation) • main lock use unlock lock use unlock end • main lock use unlock end • …
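
A minimal sketch of how the property on this slide could be checked against the listed traces, assuming the finite-trace reading "at every temporal point where the premise has just occurred, the consequent must be a sub-sequence of the remainder". The helper names `is_subsequence` and `satisfies` are illustrative and do not come from the slides.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def satisfies(trace, pre, post):
    """Check the rule pre -> post on one finite trace (assumed semantics)."""
    for j in range(1, len(trace) + 1):
        # j is a temporal point of interest if pre is contained in the
        # j-prefix and the j-th event is the last event of pre.
        ends_pre = trace[j - 1] == pre[-1] and is_subsequence(pre, trace[:j])
        if ends_pre and not is_subsequence(post, trace[j:]):
            return False
    return True


# The three traces listed on the slide.
traces = [
    ["main", "lock", "use", "unlock", "lock", "end"],
    ["main", "lock", "use", "unlock", "lock", "use", "unlock", "end"],
    ["main", "lock", "use", "unlock", "end"],
]
results = [satisfies(t, ["main", "lock"], ["unlock", "end"]) for t in traces]
print(results)
```

Only the first trace fails: its second `lock` is followed by `end` with no `unlock` in between.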

  11. Concepts, Definitions And Rules Semantics

  12. Temporal Points “Whenever a series of precedent events occurs at a point in time or temporal point, eventually another series of consequent events occurs.” • Peek at interesting temporal points and see what series of events are likely to happen next. • Temporal points in a sequence S: • The indices in S, starting from 1. • Consider a sequence <a,b,a,b,a,c>. There are 6 temporal points in the sequence. • For a temporal point j in S = <a_1,…,a_n>, the prefix <a_1,…,a_j> of S is called the j-prefix of S.

  13. Occurrences & Instances • Consider a pattern P and a sequence S. • The set of all occurrences of P in S is $Occ(P,S) = \{\, j \mid P \sqsubseteq j\text{-prefix of } S \;\wedge\; last(P) = S[j] \,\}$, where $\sqsubseteq$ denotes the sub-sequence relation. • The set of all instances of P in S is $Inst(P,S) = \{\, j\text{-prefix of } S \mid j \in Occ(P,S) \,\}$. • Consider the sequence <A,B,A,B,A,B>: • The set of occurrences of <A,B> is {2,4,6}. • The instances of <A,B> are {<A,B>, <A,B,A,B>, <A,B,A,B,A,B>}. • These correspond to the temporal points to be checked for rules with <A,B> as premise.
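
The two definitions on this slide translate almost directly into code. The sketch below is illustrative only (the function names are not from the slides) and uses 1-based temporal points, as the slides do.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def occurrences(pattern, seq):
    """Occ(P,S): temporal points j (1-based) such that P is a sub-sequence
    of the j-prefix of S and the last event of P equals S[j]."""
    return [
        j
        for j in range(1, len(seq) + 1)
        if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])
    ]


def instances(pattern, seq):
    """Inst(P,S): the j-prefixes of S for every j in Occ(P,S)."""
    return [seq[:j] for j in occurrences(pattern, seq)]


S = list("ABABAB")
occ = occurrences(["A", "B"], S)
inst = instances(["A", "B"], S)
print(occ)
print(inst)
```

On the slide's sequence <A,B,A,B,A,B> this yields the occurrences {2,4,6} and the three prefixes listed above.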

  14. Projected and Projected-all DB • A database SeqDB projected on pattern P is defined as: $SeqDB_P = \{\, (j, sx) \mid s = SeqDB[j],\; s = px \mathbin{++} sx,\; px \text{ is the minimal prefix of } s \text{ containing } P \,\}$ • Example (table): SeqDB with sequences <e,a,b,c> and <e,a,e,b,c>, projected on <a,b>.

  15. Projected and Projected-all DB • A database SeqDB projected-all on pattern P is defined as: $SeqDB^{all}_P = \{\, (j, sx) \mid s = SeqDB[j],\; s = px \mathbin{++} sx,\; px \text{ is an instance of } P \,\}$ • It returns the temporal points to check. • Example (table): for SeqDB with sequences <e,a,b,c> and <e,a,e,b,c>, projecting-all on <a,b> gives the suffix <c> for each sequence.
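
The difference between the two projections only shows up when a pattern occurs more than once in a sequence. The sketch below follows the two definitions; the function names are illustrative, and the third sequence <a,b,a,b,c> is a made-up addition to the slide's two sequences, included only to make that difference visible.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def occurrences(pattern, seq):
    """1-based temporal points at which an instance of pattern ends."""
    return [
        j
        for j in range(1, len(seq) + 1)
        if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])
    ]


def projected(db, pattern):
    """SeqDB_P: one suffix per sequence, taken after the minimal prefix
    containing the pattern (that prefix ends at the first occurrence)."""
    out = []
    for idx, seq in enumerate(db, start=1):
        occ = occurrences(pattern, seq)
        if occ:
            out.append((idx, seq[occ[0]:]))
    return out


def projected_all(db, pattern):
    """SeqDB^all_P: one suffix per instance of the pattern."""
    return [
        (idx, seq[j:])
        for idx, seq in enumerate(db, start=1)
        for j in occurrences(pattern, seq)
    ]


# First two sequences are from the slide; the third is hypothetical.
db = [list("eabc"), list("eaebc"), list("ababc")]
proj = projected(db, ["a", "b"])
proj_all = projected_all(db, ["a", "b"])
print(proj)
print(proj_all)
```

For the two slide sequences both projections return the suffix <c>; for the hypothetical third sequence the projected-all database has two entries, one per instance of <a,b>.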

  16. Counting Supports and Confidence • Consider the rule pre -> post. • Sequence support (s-sup): the number of sequences in which the premise pre appears. • Instance support (i-sup): the number of instances of pre++post. • Confidence (conf): the likelihood that post appears after pre, computed as the ratio $conf(pre \rightarrow post) = \dfrac{|\{\text{instances of } pre \text{ after which } post \text{ eventually occurs}\}|}{|\{\text{instances of } pre\}|} = \dfrac{|(SeqDB^{all}_{pre})_{post}|}{|SeqDB^{all}_{pre}|}$

  17. Counting Supports and Confidence • Example (figure): • s-sup(<a,b> -> <c>) = 2 • i-sup(<a,b> -> <c>) = 3 • conf(<a,b> -> <c>) = 1.0 • conf(<a,b> -> <e>) = 0.5
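
The three measures can be sketched as follows. The example database that produced the slide's numbers is not in the transcript, so the two sequences below are made up and were chosen only because they give the same values; the function names are also illustrative.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def occurrences(pattern, seq):
    """1-based temporal points at which an instance of pattern ends."""
    return [
        j
        for j in range(1, len(seq) + 1)
        if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])
    ]


def s_sup(db, pre):
    """Sequence support: number of sequences in which pre appears."""
    return sum(1 for seq in db if is_subsequence(pre, seq))


def i_sup(db, pre, post):
    """Instance support: number of instances of pre++post."""
    return sum(len(occurrences(pre + post, seq)) for seq in db)


def conf(db, pre, post):
    """Fraction of instances of pre that are eventually followed by post."""
    total = hits = 0
    for seq in db:
        for j in occurrences(pre, seq):
            total += 1
            hits += is_subsequence(post, seq[j:])
    return hits / total if total else 0.0


# Hypothetical database (not the one shown on the slide).
db = [list("abeabc"), list("abceabc")]
pre = ["a", "b"]
print(s_sup(db, pre), i_sup(db, pre, ["c"]),
      conf(db, pre, ["c"]), conf(db, pre, ["e"]))
```

Here `conf` counts the entries of the projected-all database of `pre` whose suffix contains `post`, which is the right-hand form of the ratio on the previous slide.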

  18. Properties, Theorems, and Algorithms

  19. Apriori Properties – Support & Conf. • Theorem 1. Consider two rules Rx = p -> c and Ry = q -> c. If $p \sqsubseteq q$ and s-sup(Rx) < min-s-sup, then s-sup(Ry) < min-s-sup. • Theorem 2. Consider two rules Rx = p -> c and Ry = p -> d. If $c \sqsubseteq d$ and conf(Rx) < min-conf, then conf(Ry) < min-conf. • Example for Theorem 1 (figure): if Rx: a -> z has s-sup(Rx) < min-s-sup, then the Rys a,b -> z; a,b,c -> z; a,c -> z; a,b,d -> z; … are non-significant. • Example for Theorem 2 (figure): if Rx: a -> z has conf(Rx) < min-conf, then the Rys a -> b,z; a -> b,c,z; a -> c,z; a -> b,d,z; … are non-significant.

  20. Rule Redundancy • Consider two rules Rx = p -> c and Ry = q -> d. Rx is redundant if the following conditions hold: • Rx is a sub-sequence of Ry (i.e., $p \mathbin{++} c \sqsubseteq q \mathbin{++} d$) • Rx and Ry have the same support and confidence values. • Example (figure): with respect to a -> b,c,d, the rules a -> b; a -> c; a -> b,c; a -> b,d; … are redundant iff their support and confidence are the same as those of a -> b,c,d. • Redundant rules are identified and removed early during the mining process.
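
As a rough illustration of the redundancy definition, the sketch below filters a list of already-mined rules after the fact. It is a simplification under stated assumptions: it only handles the case where the concatenation of one rule is a strict sub-sequence of another's, it compares confidence values with exact equality, and the `Rule` type and all numbers are hypothetical. The slides remove redundant rules early during mining rather than by post-filtering.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for a mined rule and its statistics."""
    pre: tuple
    post: tuple
    s_sup: int
    i_sup: int
    conf: float


def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def is_redundant(rx, ry):
    """rx is redundant w.r.t. ry if pre++post of rx is a strict sub-sequence
    of pre++post of ry and both rules have the same support and confidence."""
    x, y = rx.pre + rx.post, ry.pre + ry.post
    same_stats = (rx.s_sup, rx.i_sup, rx.conf) == (ry.s_sup, ry.i_sup, ry.conf)
    return same_stats and x != y and is_subsequence(x, y)


def remove_redundant(rules):
    """Keep only rules that no other rule in the list renders redundant."""
    return [r for r in rules if not any(is_redundant(r, o) for o in rules)]


# Made-up statistics, for illustration only.
rules = [
    Rule(("a",), ("b", "c", "d"), 2, 3, 1.0),
    Rule(("a",), ("b",), 2, 3, 1.0),   # same statistics: redundant
    Rule(("a",), ("c",), 2, 5, 0.8),   # different statistics: kept
]
kept = remove_redundant(rules)
print(kept)
```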

  21. • Example (figure): given <a,b,c,d> -> post, redundant rules include <a,d> -> post; <a,c,d> -> post; … • Given pre -> <b,c,d,e>, redundant rules include pre -> <c,d,e>; pre -> <d,e>; … • Theorem 3. Given two pre-conditions PX and PY where $PX \sqsubseteq PY$, if $SeqDB_{PX} = SeqDB_{PY}$ then for all sequences of events post, the rule PX -> post is rendered redundant by PY -> post. • Theorem 4. Given two rules RX (pre -> CX) and RY (pre -> CY), if $CX \sqsubseteq CY$ and $(SeqDB^{all}_{pre})_{CX} = (SeqDB^{all}_{pre})_{CY}$ then RX is rendered redundant by RY and can be pruned.

  22. Algorithm • Step 1: Mine a pruned set of pre-conditions • Satisfy the min-s-sup threshold • Use Theorems 1 & 3 • Step 2: For each pre-condition pre, create $SeqDB^{all}_{pre}$. • Step 3: Mine a pruned set of post-conditions • Corresponding rules satisfy min-conf. • Use Theorems 2 & 4 • Step 4: Remove rules that don’t satisfy min-i-sup. • Step 5: Filter any remaining redundant rules.
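
To make the order of the steps concrete, here is a deliberately naive baseline that follows Steps 1 to 4 but replaces the pruned search with exhaustive enumeration of candidates up to a small length. It does not use Theorems 1 to 4 or the modified BIDE search described in the slides, it omits Step 5, and the function name, the `max_len` bound and the example database are all assumptions made for illustration.

```python
from itertools import product


def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def occurrences(pattern, seq):
    """1-based temporal points at which an instance of pattern ends."""
    return [
        j
        for j in range(1, len(seq) + 1)
        if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])
    ]


def mine_naive(db, min_s_sup, min_i_sup, min_conf, max_len=2):
    """Exhaustive baseline; returns tuples (pre, post, i_sup, conf)."""
    events = sorted({e for seq in db for e in seq})
    candidates = [
        list(p) for n in range(1, max_len + 1) for p in product(events, repeat=n)
    ]
    rules = []
    for pre in candidates:
        # Step 1: keep pre-conditions that meet the sequence-support threshold.
        if sum(is_subsequence(pre, seq) for seq in db) < min_s_sup:
            continue
        # Step 2: build the projected-all database of pre (suffixes only).
        proj_all = [seq[j:] for seq in db for j in occurrences(pre, seq)]
        if not proj_all:
            continue
        for post in candidates:
            # Step 3: keep post-conditions whose rule meets min_conf.
            c = sum(is_subsequence(post, sx) for sx in proj_all) / len(proj_all)
            if c < min_conf:
                continue
            # Step 4: drop rules below the instance-support threshold.
            isup = sum(len(occurrences(pre + post, seq)) for seq in db)
            if isup < min_i_sup:
                continue
            rules.append((tuple(pre), tuple(post), isup, c))
    return rules


# Hypothetical database, not taken from the slides.
db = [list("abeabc"), list("abceabc")]
rules = mine_naive(db, min_s_sup=2, min_i_sup=3, min_conf=1.0)
print(rules)
```

The enumeration grows exponentially with `max_len`, which is the cost that the pruning theorems in the slides are meant to avoid.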

  23. Equiv. Proj DB & LS-Set Patterns • From Theorem 3 (& 4), a pre- (post-) condition is not pruned iff: • there does not exist any super-sequence pattern having the same projected database. • Also referred to as projected-database closed or LS-Set (Yan and Han, 2003) • We generate this set by modifying BIDE (Wang and Han, 2004) • Keep the search space pruning strategy • Remove the closure checks • Proof of completeness in technical report

  24. • Mine Pruned Pre-Conds • Mine Pruned Post-Conds • Check Instance Support & Remove Remaining Red. Rules

  25. Performance & Case Study

  26. Synthetic Dataset D5C20N10S20 • 147x faster, 8500x more compact

  27. Gazelle Dataset (KDD Cup 2000) • The full set of significant rules is not minable

  28. JBoss Security • “Whenever login configuration information is checked, eventually invocations of authentication events, binding of principal to subject, and utilization of subject & principal information occur.”

  29. Conclusion • We propose a novel framework to mine a non-redundant set of significant recurrent rules: • “Whenever a series of precedent events occurs, eventually a series of consequent events occurs” • Employs 2 apriori properties and 2 redundancy theorems • Major speedup and reduction in the number of rules from the non-redundant rule mining strategy • We show its utility in mining the behavior of JBoss Security • Future Work: • Improve mining speed • More case studies and applications to DM/SE problems
