Sequential Patterns & Process Mining: Current State of Research • Edgar de Graaf, LIACS
Mining Sequential Patterns • Sequential Patterns • Sequence Databases • AprioriAll • PrefixSpan • Gap Constraints
Sequential Patterns • <(a,b)(c)(a,b,d)> < a1, a2, a3 > • <(3)(4,5)(8)> contained in <(7)(3,8)(9)(4,5,6)(8)> • <(3)(4,5)(8)> not contained in <(7)(3,8)(9)(4)(5,6)(8)>
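The containment examples on this slide can be checked with a short sketch. It assumes a sequence is represented as a Python list of sets; the function name `contains` is a hypothetical choice, not something from the slides.

```python
def contains(pattern, sequence):
    """Return True if every item set of `pattern` is a subset of a distinct
    element of `sequence`, with the elements appearing in the same order."""
    pos = 0
    for itemset in pattern:
        # Skip elements until one holds the whole item set.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next item set must match a strictly later element
    return True


pattern = [{3}, {4, 5}, {8}]
print(contains(pattern, [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))     # True
print(contains(pattern, [{7}, {3, 8}, {9}, {4}, {5, 6}, {8}]))   # False
```

Matching each item set at the earliest possible element is enough here, because an earlier match never rules out a later one.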
Sequential databases The Database with sequences
Sequential databases • A generated candidate pattern: <(3)(4,5)(8)> • Support count: 0
Sequential databases • <(3)(4,5)(8)> • Contained → support count: 1
Sequential databases • <(3)(4,5)(8)> • Not contained → not counted • Support count: 1
Sequential databases • <(3)(4,5)(8)> • Every sequence that contains the pattern is counted once • Final support count: 5 • IF minimal support ≤ 50% THEN <(3)(4,5)(8)> is frequent
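The support counting walked through on these slides can be sketched as follows. The database shown on the original slides is not available in this transcript, so `example_db` is a made-up, hypothetical database; `contains`, `support_count` and `is_frequent` are illustrative names.

```python
def contains(pattern, sequence):
    """True if the item sets of `pattern` occur, in order, inside `sequence`."""
    elements = iter(sequence)
    # `any` consumes the iterator up to the first matching element, so the
    # next item set can only match a strictly later element.
    return all(any(itemset <= element for element in elements)
               for itemset in pattern)


def support_count(pattern, db):
    """Number of database sequences that contain the candidate pattern."""
    return sum(contains(pattern, sequence) for sequence in db)


def is_frequent(pattern, db, min_support):
    """`min_support` is a fraction of the database size, e.g. 0.5 for 50%."""
    return support_count(pattern, db) / len(db) >= min_support


# Hypothetical database, NOT the one from the slides.
example_db = [
    [{7}, {3, 8}, {9}, {4, 5, 6}, {8}],   # contains the pattern
    [{7}, {3, 8}, {9}, {4}, {5, 6}, {8}], # does not: 4 and 5 are split up
    [{3}, {4, 5}, {8}],                   # contains the pattern
    [{8}, {4, 5}, {3}],                   # does not: wrong order
]
candidate = [{3}, {4, 5}, {8}]
print(support_count(candidate, example_db))     # 2
print(is_frequent(candidate, example_db, 0.5))  # True
```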
Lifting order (1) • Notation by examples • <A,B,C>, an ordered list of sets ≡ sequence • Each of the sets A, B and C is unordered, e.g. A = (x,y,z) = (y,z,x) = (z,y,x) = … • [x,y,z] is an extension: we ignore the order when counting frequency
Lifting order (2) • <(t1)(t2)(t3)(t4)> and <(t1)(t3)(t2)(t4)> frequent → <(t1)(t3,t2)(t4)> is frequent • Says: t3 and t2 frequently occur between t1 and t4, in either order
Lifting Order (3) • <(t1)(t2)(t3)(t4)> and <(t1)(t3)(t2)(t4)> are both infrequent, but suppose <(t1)[t3,t2](t4)> is frequent • Says: t3 and t2 often occur between t1 and t4
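One possible reading of the [..] notation is sketched below: a sequence supports <(t1)[t3,t2](t4)> if it contains at least one ordering of the bracketed items between t1 and t4. This interpretation, the function names, and the toy database are assumptions made for illustration; each element is simplified to a single item.

```python
from itertools import permutations


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, db):
    return sum(is_subsequence(pattern, sequence) for sequence in db)


def support_lifted(before, block, after, db):
    """Support of <before [block] after>: the order inside `block` is ignored."""
    orderings = [before + list(p) + after for p in permutations(block)]
    return sum(any(is_subsequence(o, sequence) for o in orderings)
               for sequence in db)


# Hypothetical database of six sequences.
db = ([["t1", "t2", "t3", "t4"]] * 2
      + [["t1", "t3", "t2", "t4"]] * 2
      + [["t1", "t4"]] * 2)

print(support(["t1", "t2", "t3", "t4"], db))                # 2
print(support(["t1", "t3", "t2", "t4"], db))                # 2
print(support_lifted(["t1"], ["t2", "t3"], ["t4"], db))     # 4
```

With a minimal support of 50% (3 of 6 sequences), both fixed orderings are infrequent while the order-free pattern is frequent, which is the situation described on the Lifting Order (3) slide.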
Existing Algorithms • AprioriAll: the first algorithm based on the anti-monotone principle • PrefixSpan: currently the fastest algorithm around; it uses projected databases
AprioriAll (1)
AprioriAll(DB, min_sup){
    L1 = {frequent sequences of size 1}
    k = 2
    while(Lk-1 is not empty){
        Ck = candidateGeneration(Lk-1, k)
        Ck = candidatePruning(Ck, k)
        Lk = supportBasedPruning(Ck)
        k++
    }
}
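The level-wise loop above can be sketched in Python. This is a simplified illustration, not the published AprioriAll: every element is assumed to hold a single item, patterns are tuples, `min_sup` is an absolute count, and the join and pruning rules are one plausible Apriori-style choice.

```python
from itertools import product


def is_subsequence(pattern, sequence):
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, db):
    return sum(is_subsequence(pattern, sequence) for sequence in db)


def candidate_generation(prev_level):
    """Join two frequent (k-1)-patterns that agree on their first k-2 items."""
    return {p + (q[-1],)
            for p, q in product(prev_level, repeat=2) if p[:-1] == q[:-1]}


def candidate_pruning(candidates, prev_level):
    """Anti-monotonicity: drop candidates with an infrequent (k-1)-subsequence."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev_level for i in range(len(c)))}


def apriori_all(db, min_sup):
    items = {item for sequence in db for item in sequence}
    level = {(i,) for i in items if support((i,), db) >= min_sup}
    frequent = set(level)
    while level:
        candidates = candidate_pruning(candidate_generation(level), level)
        # Support-based pruning needs one pass over the database per level.
        level = {c for c in candidates if support(c, db) >= min_sup}
        frequent |= level
    return frequent


db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c"), ("b", "c")]
print(sorted(apriori_all(db, min_sup=2)))
```

On this toy database the sketch returns the patterns a, b, c, ab, ac, bc and abc.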
PrefixSpan (1) Assume that the prefix = <(a,b)(c)> • Scan the projected database to find every frequent item x such that • <(a,b)(c,x)> is frequent or • <(a,b)(c)(x)> is frequent • Append x to the prefix and output the pattern • Then recurse, e.g. PrefixSpan(<(a,b)(c,x)>, newProjDB)
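The recursion can be illustrated with a reduced sketch. It is an assumption-laden simplification: every element holds a single item, so only the <prefix (x)> extension is covered and the <(…, x)> item-set extension is left out. Projected databases are stored as (sequence index, start position) pointers; all names are hypothetical.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """Return (pattern, support) pairs for all frequent patterns."""
    results = []

    def mine(prefix, projected):
        # For each item, collect one suffix pointer per sequence: the position
        # right after the item's first occurrence in the projected suffix.
        occurrences = defaultdict(list)
        for sid, start in projected:
            seen = set()
            for pos in range(start, len(db[sid])):
                item = db[sid][pos]
                if item not in seen:
                    seen.add(item)
                    occurrences[item].append((sid, pos + 1))
        for item, new_projected in sorted(occurrences.items()):
            if len(new_projected) >= min_sup:
                pattern = prefix + [item]
                results.append((pattern, len(new_projected)))
                mine(pattern, new_projected)  # recurse on the smaller database

    mine([], [(sid, 0) for sid in range(len(db))])
    return results


db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
for pattern, sup in prefixspan(db, min_sup=2):
    print(pattern, sup)
```

No candidates are generated: each recursive call only looks at the suffixes that follow the current prefix.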
Gap Constraint • Simple idea: impose a maximal distance between consecutive item sets of a pattern • E.g. sequence <(a)(c)(d)(e)>, pattern = <(a)(e)> and gap = 1: this sequence is not counted
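A gap-constrained containment test might look like the sketch below. It assumes the gap is the number of sequence elements skipped between two consecutive matched item sets; the slides do not define it more precisely, so this reading and the function name are assumptions.

```python
def contains_with_gap(pattern, sequence, max_gap):
    """True if `pattern` occurs in `sequence` with at most `max_gap` elements
    skipped between consecutive matched item sets."""

    def match(p_idx, prev_pos):
        if p_idx == len(pattern):
            return True
        if prev_pos is None:
            positions = range(len(sequence))  # first item set may start anywhere
        else:
            last = min(len(sequence), prev_pos + 2 + max_gap)
            positions = range(prev_pos + 1, last)
        # Try every allowed position; a greedy first match is not sufficient
        # once a gap limit applies, so the search backtracks.
        return any(pattern[p_idx] <= sequence[pos] and match(p_idx + 1, pos)
                   for pos in positions)

    return match(0, None)


sequence = [{"a"}, {"c"}, {"d"}, {"e"}]
print(contains_with_gap([{"a"}, {"e"}], sequence, max_gap=1))  # False
print(contains_with_gap([{"a"}, {"e"}], sequence, max_gap=2))  # True
```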
Process Mining • What is process mining? • Using D/F tables and graphs • Genetic Algorithms • Problem areas • Using sequential patterns
What is process mining? (1) • The ordering of events is known, e.g. <(task A)(task B)(task C)> • Process mining constructs a Petri net [Figure: Petri net with nodes labelled claim, register, to_be_evaluated, send_letter, pay, ready] Source: Workflow Management by W. van der Aalst and K. van Hee (1997)
What is process mining? (2) • Usability of process mining: • Given the audit trails, what is the workflow network? • Mined workflow network ≡ original design? (Delta Analysis) • Mined workflow network better than the original design? (Performance Analysis)
Using D/F tables and graphs (1) • For every task a D/F table: • Intuition: if A is often followed by B then the probability of A causing B increases
Using D/F tables and graphs (2) • A D/F graph is constructed: IF ((A→B ≥ N) AND (A > B ≥ σ) AND (B < A ≤ σ)) THEN connection A to B • More complicated rules deal with recursion and short loops
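The rule can be tried out on a toy log with the sketch below. Several points are assumptions, since the D/F table itself is not reproduced in this transcript: A > B is read as the number of times A is directly followed by B, B < A as the number of times B directly precedes A, and the causality score A→B is replaced by a simple stand-in formula, (ab - ba) / (ab + ba + 1), rather than the metric used in the slides.

```python
from collections import Counter


def direct_follows(traces):
    """Count how often task a is directly followed by task b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def causality(counts, a, b):
    """Stand-in causality score in (-1, 1); an assumption, see the lead-in."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


def df_graph(traces, n_threshold, sigma):
    """Apply the rule: connect A to B if all three conditions hold."""
    counts = direct_follows(traces)
    tasks = sorted({task for trace in traces for task in trace})
    return {(a, b)
            for a in tasks for b in tasks
            if a != b
            and causality(counts, a, b) >= n_threshold   # A→B ≥ N
            and counts[(a, b)] >= sigma                  # A > B ≥ σ
            and counts[(b, a)] <= sigma}                 # B < A ≤ σ


# Hypothetical audit trails: B and C occur in either order between A and D.
traces = [list("ABCD")] * 3 + [list("ACBD")] * 3
print(sorted(df_graph(traces, n_threshold=0.5, sigma=2)))
```

On this log the sketch connects A to B, A to C, B to D and C to D, and leaves B and C unconnected because they follow each other equally often.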
Using D/F tables and graphs (3) • D/F Graph example:
Genetic Algorithms (1) • 1. Create an initial population of workflows • 2. Calculate their fitness using audit trails • 3. Create a child • 4. Mutate the child • 5. Repeat steps 3 and 4 to create the new population • 6. Go to step 2
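The loop can be sketched as follows. Everything specific here is a hypothetical choice made only to get a runnable example: the chromosome is a set of directed edges between tasks, fitness is the fraction of audit trails whose consecutive task pairs are all edges (minus a small penalty per edge), crossover keeps each parent edge with probability 0.5, mutation flips one random edge, and the best workflow is copied unchanged into the next generation.

```python
import random


def fitness(edges, traces):
    """Share of traces that can be replayed, with a small size penalty."""
    replayed = sum(all((a, b) in edges for a, b in zip(t, t[1:])) for t in traces)
    return replayed / len(traces) - 0.01 * len(edges)


def make_child(parent1, parent2, all_edges, rng):
    child = {e for e in parent1 | parent2 if rng.random() < 0.5}  # crossover
    child ^= {rng.choice(all_edges)}                              # mutation
    return frozenset(child)


def evolve(traces, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    tasks = sorted({task for trace in traces for task in trace})
    all_edges = [(a, b) for a in tasks for b in tasks if a != b]
    # Step 1: random initial population.
    population = [frozenset(e for e in all_edges if rng.random() < 0.3)
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Step 2: rank by fitness on the audit trails.
        ranked = sorted(population, key=lambda c: fitness(c, traces), reverse=True)
        history.append(fitness(ranked[0], traces))
        parents = ranked[: pop_size // 2]
        population = [ranked[0]]  # keep the current best (elitism)
        # Steps 3 to 5: create and mutate children until the population is full.
        while len(population) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            population.append(make_child(p1, p2, all_edges, rng))
    best = max(population, key=lambda c: fitness(c, traces))
    return best, history


traces = [list("ABCD")] * 5 + [list("ACBD")] * 5
best, history = evolve(traces)
print(sorted(best), round(fitness(best, traces), 2))
```

Because the best workflow always survives, the best fitness per generation never decreases; how quickly it improves depends on the encoding and operators chosen.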
Genetic Algorithms (2) • Advantages: • Can deal with duplicate tasks and non-free choice. • Disadvantages: • The structure of the “chromosome” • How do we measure fitness? • How do we do cross-over and mutation?
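Problem Areas (1) • Hidden tasks • Duplicate tasks: when tasks have the same name [Figure: net fragment with tasks B and C]
Problem Areas (2) • Mining non-free-choice [Figure: net with tasks A, B, C, D and E]
Problem Areas (3) • Mining loops, e.g. the trace ABCDBCD [Figure: net with tasks A, B, C and D]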
Problem Areas (4) • Delta analysis: how do we compare two models? • Other problems: time, dealing with noise and incompleteness.
Using sequential patterns • Mining loops? • Fitness measure in a GA? • Use in delta analysis? • Generate the important frequent subsequences to help the designer
Further research in sequences • How about gaps between items in different item sets? • What type of frequent subsequences to use in fitness? • Lifting order, is it useful in workflow generation? • Further research of lifting order
The End Thank you for your attention Edgar de Graaf edegraaf@liacs.nl