Sequential Patterns & Process Mining

Sequential Patterns & Process Mining • Current State of Research • Edgar de Graaf, LIACS

Mining Sequential Patterns • Sequential Patterns • Sequence Databases • AprioriAll • PrefixSpan • Gap Constraints

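Sequential Patterns • A sequence is an ordered list of item sets, e.g. <(a,b)(c)(a,b,d)>, in general < a1, a2, a3 > • <(3)(4,5)(8)> is contained in <(7)(3,8)(9)(4,5,6)(8)> • <(3)(4,5)(8)> is not contained in <(7)(3,8)(9)(4)(5,6)(8)>, because no single item set holds both 4 and 5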
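The containment relation in the two examples above can be sketched in a few lines. This is a minimal illustration, assuming a sequence is represented as a list of Python sets; the function name `contains` is an illustrative choice, not something taken from the slides.

```python
def contains(sequence, pattern):
    """Return True if every item set of `pattern` is a subset of a distinct
    item set of `sequence`, with the order of the item sets preserved."""
    pos = 0
    for itemset in pattern:
        # Greedily advance to the first remaining item set that covers `itemset`.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern item set must match strictly later
    return True


pattern = [{3}, {4, 5}, {8}]
seq_a = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
seq_b = [{7}, {3, 8}, {9}, {4}, {5, 6}, {8}]
print(contains(seq_a, pattern), contains(seq_b, pattern))
```

Greedy matching is sufficient here because there is no gap constraint: taking the earliest possible match never rules out a later one.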

Sequential databases • [Figure: the database with sequences]

Sequential databases • A generated candidate pattern: <(3)(4,5)(8)>, support count 0 • A sequence that contains the pattern raises the support count: 0 → 1 • A sequence that does not contain the pattern is not counted: the support count stays 1 • Further sequences that contain the pattern raise the count: 1 → 2 → 3 → 4 → 5 • IF minimal support ≤ 50% THEN <(3)(4,5)(8)> is frequent
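The support counting walked through above can be sketched as follows. The database in this example is hypothetical (the slide's own database is a figure that is not available here), and the names `contains`, `support` and `is_frequent` are illustrative assumptions.

```python
def contains(sequence, pattern):
    """True if `pattern` is contained in `sequence` (both are lists of sets)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of database sequences that contain the pattern."""
    return sum(1 for sequence in db if contains(sequence, pattern))


def is_frequent(db, pattern, min_sup):
    """`min_sup` is a fraction of the database size, e.g. 0.5 for 50%."""
    return support(db, pattern) / len(db) >= min_sup


# Hypothetical database, made up for this illustration only.
db = [
    [{7}, {3, 8}, {9}, {4, 5, 6}, {8}],     # contains the pattern
    [{7}, {3, 8}, {9}, {4}, {5, 6}, {8}],   # does not: 4 and 5 are split
    [{3}, {4, 5}, {8}],                     # contains the pattern
    [{1, 3}, {2}, {4, 5}, {8, 9}],          # contains the pattern
]
pattern = [{3}, {4, 5}, {8}]
print(support(db, pattern), is_frequent(db, pattern, 0.5))
```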

Lifting order (1) • Notation by examples • <A,B,C>, an ordered list of sets ≡ a sequence • Every set A, B and C is unordered, e.g. A = (x,y,z) = (y,z,x) = (z,y,x) = … • [x,y,z] is an extension: the order of x, y and z is ignored when counting frequency

Lifting order (2) • If <(t1)(t2)(t3)(t4)> and <(t1)(t3)(t2)(t4)> are both frequent, then <(t1)[t3,t2](t4)> is frequent • Says: t3 and t2 frequently occur between t1 and t4, in either order

Lifting order (3) • Suppose <(t1)(t2)(t3)(t4)> and <(t1)(t3)(t2)(t4)> are each infrequent, but <(t1)[t3,t2](t4)> is frequent • Says: t3 and t2 often occur between t1 and t4, even though neither fixed order is frequent on its own
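One possible reading of counting with the [ ] extension is sketched below: a sequence supports the pattern if at least one ordering of the bracketed item sets is contained in it. This reading, the function names and the small database are assumptions made for illustration; the slides do not fix an exact definition.

```python
from itertools import permutations


def contains(sequence, pattern):
    """True if `pattern` is contained in `sequence` (both are lists of sets)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(db, pattern):
    return sum(1 for sequence in db if contains(sequence, pattern))


def support_ignoring_order(db, before, block, after):
    """Count sequences that contain before + some ordering of block + after.
    Each sequence is counted at most once, even if several orderings match."""
    return sum(
        1
        for sequence in db
        if any(contains(sequence, before + list(p) + after)
               for p in permutations(block))
    )


# Hypothetical database; the numbers 1..5 stand for tasks t1..t5.
db = [
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {2}, {4}],
    [{1}, {5}, {4}],
    [{5}, {1}],
]
print(support(db, [{1}, {2}, {3}, {4}]))                  # one fixed order
print(support(db, [{1}, {3}, {2}, {4}]))                  # the other fixed order
print(support_ignoring_order(db, [{1}], [{3}, {2}], [{4}]))
```

With a minimal support of 50% on this database, each fixed order is infrequent while the order-ignoring pattern is frequent, which is the situation described on the slide.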

Existing Algorithms • AprioriAll: the first algorithm based on the anti-monotone principle • PrefixSpan: currently the fastest algorithm around; it uses projected databases

AprioriAll (1)
AprioriAll(DB, min_sup) {
    L1 = {frequent sequences of size 1}
    k = 2
    while (Lk-1 is not empty) {
        Ck = candidateGeneration(Lk-1, k)
        Ck = candidatePruning(Ck, k)
        Lk = supportBasedPruning(Ck, DB, min_sup)
        k++
    }
    return the union of all Lk
}
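The level-wise loop above can be made concrete with a simplified sketch. It is not the original AprioriAll: it assumes that every element of a sequence is a single item, that `min_sup` is an absolute count, and that candidates are generated by joining two frequent (k-1)-sequences that share their first k-2 items. All function names are illustrative.

```python
from itertools import product


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, db):
    return sum(1 for sequence in db if is_subsequence(pattern, sequence))


def apriori_all_sketch(db, min_sup):
    items = sorted({item for sequence in db for item in sequence})
    level = {(item,) for item in items if support((item,), db) >= min_sup}
    frequent = set(level)
    k = 2
    while level:
        # Candidate generation: join pairs that agree on the first k-2 items.
        candidates = {p + q[-1:] for p, q in product(level, repeat=2)
                      if p[:-1] == q[:-1]}
        # Candidate pruning (anti-monotone): every (k-1)-subsequence
        # obtained by dropping one item must itself be frequent.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in level for i in range(k))}
        # Support-based pruning: one pass over the database.
        level = {c for c in candidates if support(c, db) >= min_sup}
        frequent |= level
        k += 1
    return frequent


db = [("a", "b", "c"), ("a", "c", "b"), ("a", "b"), ("b", "c")]
print(sorted(apriori_all_sketch(db, min_sup=2)))
```

On this toy database the candidate <a c b> is discarded by candidate pruning without a database scan, because its subsequence <c b> is not frequent.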

PrefixSpan (1) • Assume that the prefix = <(a,b)(c)> • Scan the projected database to find every frequent item x such that <(a,b)(c,x)> is frequent or <(a,b)(c)(x)> is frequent • Append x to the prefix and output the pattern • Then recurse, e.g. PrefixSpan(<(a,b)(c,x)>, newProjDB)
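The recursion over projected databases can be sketched as below. This is a deliberately reduced version, assuming every element is a single item, so it only covers extensions of the form <(a,b)(c)(x)> and not <(a,b)(c,x)>; `min_sup` is an absolute count, and the names `prefixspan_sketch` and `mine` are illustrative.

```python
def prefixspan_sketch(db, min_sup):
    """Return (pattern, support) pairs for sequences of single items."""
    results = []

    def mine(prefix, projected):
        # Count in how many projected suffixes each item occurs.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue
            new_prefix = prefix + (item,)
            results.append((new_prefix, count))
            # Project: keep what follows the first occurrence of `item`.
            new_projected = [suffix[suffix.index(item) + 1:]
                             for suffix in projected if item in suffix]
            mine(new_prefix, new_projected)

    mine((), [tuple(sequence) for sequence in db])
    return results


db = [("a", "b", "c"), ("a", "c", "b"), ("a", "b"), ("b", "c")]
for pattern, count in prefixspan_sketch(db, min_sup=2):
    print(pattern, count)
```

No candidates are generated: each recursive call only looks at the suffixes that remain after the current prefix, which is the idea behind projected databases.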

Gap Constraint • Simple idea: impose a maximal distance between consecutive item sets of a pattern • E.g. sequence <(a)(c)(d)(e)>, pattern <(a)(e)> and gap = 1: (a) and (e) are further apart than the gap allows, so this sequence is not counted
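A containment test with a gap constraint can be sketched as follows. The sketch assumes that the gap is the number of item sets skipped between two consecutive matches; the slides do not state the exact definition, so this is one plausible reading, and the function name is illustrative. Backtracking is needed because the earliest match is no longer always the best one.

```python
def contains_with_gap(sequence, pattern, max_gap):
    """True if `pattern` is contained in `sequence` with at most `max_gap`
    item sets skipped between consecutive matched item sets."""

    def match(p_idx, last_pos):
        if p_idx == len(pattern):
            return True
        if last_pos is None:
            positions = range(len(sequence))        # first item set: anywhere
        else:
            stop = min(len(sequence), last_pos + max_gap + 2)
            positions = range(last_pos + 1, stop)   # within the allowed gap
        return any(pattern[p_idx] <= sequence[pos] and match(p_idx + 1, pos)
                   for pos in positions)

    return match(0, None)


sequence = [{"a"}, {"c"}, {"d"}, {"e"}]
pattern = [{"a"}, {"e"}]
print(contains_with_gap(sequence, pattern, max_gap=1))
print(contains_with_gap(sequence, pattern, max_gap=2))
```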

Process Mining • What is process mining? • Using D/F tables and graphs • Genetic Algorithms • Problem areas • Using sequential patterns

What is process mining? (1) • The ordering of events is known, e.g. <(task A)(task B)(task C)> • Process mining constructs a Petri net [Figure: Petri net with nodes labelled pay, ready, claim, register, to_be_evaluated, send_letter. Source: Workflow Management by W. van der Aalst and K. van Hee (1997)]

What is process mining? (2) • Uses of process mining: • Given the audit trails, what is the workflow network? • Is the mined workflow network ≡ the original design? (Delta Analysis) • Is the mined workflow network better than the original design? (Performance Analysis)

Using D/F tables and graphs (1) • For every task a D/F (dependency/frequency) table is built • Intuition: if A is often followed by B, then the probability that A causes B increases
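The intuition above starts from simple counts of how often one task directly follows another in the audit trails. The sketch below computes only those counts; it is an assumption about what feeds a D/F table, not a reconstruction of the full table, and the traces are hypothetical.

```python
from collections import Counter


def direct_succession_counts(traces):
    """Count, over all traces, how often task a is directly followed by b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


# Hypothetical audit trails; every letter is one task.
traces = ["ABCD", "ACBD", "ABCD"]
counts = direct_succession_counts(traces)
for (a, b), n in sorted(counts.items()):
    print(f"{a} > {b}: {n}")
```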

Using D/F tables and graphs (2) • A D/F graph is constructed: IF ((A→B ≥ N) AND (A>B ≥ σ) AND (B<A ≤ σ)) THEN connect A to B • More complicated rules deal with recursion and short loops
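Applying the rule above to a filled-in table can be sketched as below. The table layout, the field names (`causality` for A→B, `forward` for A>B, `backward` for B<A) and all numbers are hypothetical; in particular, how the three values are computed is not specified on the slide and is left outside this sketch.

```python
def df_graph_edges(table, n_threshold, sigma):
    """Return the pairs (A, B) that satisfy
    (A->B >= N) and (A>B >= sigma) and (B<A <= sigma)."""
    edges = []
    for (a, b), row in table.items():
        if (row["causality"] >= n_threshold
                and row["forward"] >= sigma
                and row["backward"] <= sigma):
            edges.append((a, b))
    return sorted(edges)


# Hypothetical table entries, made up for this illustration.
table = {
    ("A", "B"): {"causality": 0.9, "forward": 10, "backward": 0},
    ("B", "A"): {"causality": -0.9, "forward": 0, "backward": 10},
    ("B", "C"): {"causality": 0.5, "forward": 6, "backward": 4},
}
print(df_graph_edges(table, n_threshold=0.8, sigma=5))
```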

Using D/F tables and graphs (3) • D/F Graph example:

Genetic Algorithms (1) • 1. Create an initial population of workflows • 2. Calculate their fitness using the audit trails • 3. Create a child • 4. Mutate the child • 5. Repeat steps 3 and 4 to create the new population • 6. Go to step 2
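The loop above can be sketched with a toy genetic algorithm. Everything specific in it is an assumption made for illustration: a workflow is reduced to a set of directed edges between tasks, the fitness simply rewards edges that match direct successions in the audit trails and penalises edges that never occur, and crossover and mutation are the simplest possible choices. How to encode workflows and measure fitness properly is an open question.

```python
import random


def direct_pairs(traces):
    """All pairs (a, b) where a is directly followed by b in some trace."""
    return {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}


def fitness(edges, observed):
    # Toy fitness: matched successions minus edges that never occur.
    return len(edges & observed) - len(edges - observed)


def evolve(traces, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    tasks = sorted({task for trace in traces for task in trace})
    all_edges = [(a, b) for a in tasks for b in tasks if a != b]
    observed = direct_pairs(traces)

    # Step 1: initial population of random edge sets.
    population = [frozenset(e for e in all_edges if rng.random() < 0.5)
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: rank by fitness and keep the better half as parents.
        population.sort(key=lambda m: fitness(m, observed), reverse=True)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Step 3: uniform crossover, edge by edge.
            child = {e for e in all_edges
                     if e in (mother if rng.random() < 0.5 else father)}
            # Step 4: mutation flips one randomly chosen edge.
            child ^= {rng.choice(all_edges)}
            children.append(frozenset(child))
        population = parents + children  # steps 5 and 6
    return max(population, key=lambda m: fitness(m, observed))


traces = ["ABCD", "ACBD"]
best = evolve(traces)
print(sorted(best), fitness(best, direct_pairs(traces)))
```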

Genetic Algorithms (2) • Advantages: • Can deal with duplicate tasks and non-free choice. • Disadvantages: • The structure of the “chromosome” • How do we measure fitness? • How do we do cross-over and mutation?

Problem Areas (1) • Hidden tasks • Duplicate tasks: when tasks have the same name [Figure: example with tasks B and C]

Problem Areas (2) • Mining non-free-choice constructs [Figure: example net with tasks A, B, C, D and E]

Problem Areas (3) • Mining loops, e.g. the trace ABCDBCD [Figure: example net with tasks A, B, C and D]

Problem Areas (4) • Delta analysis: how do we compare two models? • Other problems: time, dealing with noise and incompleteness.

Using sequential patterns • Mining loops? • Fitness measure in a GA? • Use in delta analysis? • Generate the important frequent subsequences to help the designer

Further research in sequences • How about gaps between items in different item sets? • What type of frequent subsequences to use in fitness? • Lifting order: is it useful in workflow generation? • Further research into lifting order

The End Thank you for your attention Edgar de Graaf edegraaf@liacs.nl
