Distributed Genetic Process Mining using Sampling

Carmen Bratosin, Natalia Sidorova, Wil van der Aalst

Process Mining

Process Mining: Process Model Discovery from Event Logs

A small example
[Figure: process model with activities A, B, C, D, E, F and G]
Input condition: C can be executed if B OR G OR F has already been executed.
Output condition: after D, B AND G will be executed.

Context: Genetic-Based Process Mining Algorithm
• Drawbacks of heuristics-based process mining algorithms:
  • They fail to discover complex process structures
  • They are not robust to noise or infrequent behavior

Genetic Miner
Goal: find the Model that maximizes fitness(Log, Model) over the space of all models.
• Build Initial Population
• Compute Fitness
• Evaluate Stop Condition
  • NO: Create New Population (Elitism, Mutation, Crossover), then compute the fitness again
  • YES: Stop
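The loop on this slide can be sketched in Python as a generic elitist genetic search. This is a minimal illustration, not the Genetic Miner implementation: the function `genetic_search`, its parameters, and the toy bit-string problem are all hypothetical stand-ins, and a real miner would use process models as individuals and a log-based fitness.

```python
import random

def genetic_search(fitness, random_individual, mutate, crossover,
                   population_size=20, elite_size=2, max_generations=50,
                   target_fitness=1.0, seed=0):
    """Generic elitist genetic loop (illustrative only).

    Returns the best individual and the best fitness seen in each generation.
    """
    rng = random.Random(seed)
    # Build Initial Population
    population = [random_individual(rng) for _ in range(population_size)]
    history = []
    for _ in range(max_generations):
        # Compute Fitness and rank the individuals
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        # Evaluate Stop Condition (assumed here: target fitness reached)
        if history[-1] >= target_fitness:
            break
        # Create New Population: elitism, then crossover and mutation
        elite = ranked[:elite_size]
        parents = ranked[:max(2, population_size // 2)]
        children = []
        while len(elite) + len(children) < population_size:
            first, second = rng.sample(parents, 2)
            children.append(mutate(crossover(first, second, rng), rng))
        population = elite + children
    return max(population, key=fitness), history

# Toy problem (an assumption for the demo): maximize the share of 1-bits.
def toy_fitness(bits):
    return sum(bits) / len(bits)

def toy_random(rng):
    return [rng.randint(0, 1) for _ in range(10)]

def toy_mutate(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

def toy_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

best, history = genetic_search(toy_fitness, toy_random, toy_mutate, toy_crossover)
```

Because the elite is copied unchanged into the next generation, the best fitness in `history` never decreases from one generation to the next.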

Fitness Computation – Main Ideas
• Execution time is linearly dependent on the number of traces
• Execution time depends on the quality of the solution
• The more complex the process model to be discovered, the more time is needed
Each individual is assessed against each trace. For each trace, rewards and penalties are given when activities can or cannot be replayed.
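The replay idea can be illustrated with a deliberately simplified fitness function. This is a sketch under strong assumptions, not the Genetic Miner's actual fitness measure: the model is reduced to a hypothetical dictionary `allowed_successors` plus a set `start_activities`, and the reward and `penalty` weights are made up for the example.

```python
def replay_fitness(log, allowed_successors, start_activities, penalty=1.0):
    """Replay every trace; reward replayable events, penalize the others.

    log: list of traces, each a list of activity names.
    allowed_successors: dict mapping an activity to the set of activities
        that the (hypothetical) model allows directly after it.
    """
    rewarded = 0.0
    penalized = 0.0
    total_events = 0
    for trace in log:  # one pass per trace: cost grows linearly with the log
        previous = None
        for activity in trace:
            total_events += 1
            if previous is None:
                replayable = activity in start_activities
            else:
                replayable = activity in allowed_successors.get(previous, set())
            if replayable:
                rewarded += 1.0
            else:
                penalized += penalty
            previous = activity
    if total_events == 0:
        return 0.0
    return (rewarded - penalized) / total_events

log = [["A", "B", "C"], ["A", "C", "B"]]

# A model that only allows the sequence A, B, C.
sequential = {"A": {"B"}, "B": {"C"}}
# A model that allows B and C in either order after A.
interleaved = {"A": {"B", "C"}, "B": {"C"}, "C": {"B"}}

score_sequential = replay_fitness(log, sequential, {"A"})
score_interleaved = replay_fitness(log, interleaved, {"A"})
```

The sequential model cannot replay the second trace past A, so it scores lower than the interleaved model, which replays both traces completely.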

Genetic Miner
Advantages:
• Discovers non-trivial process structures (e.g. non-free-choice routings)
• Robust to noise
Disadvantages:
• Time consumption:
  • the time needed to compute the fitness
  • the large number of fitness evaluations needed
The goal: use distribution techniques to reduce the time consumption.

Distributed Genetic Process Mining: Event Log Distribution
[Figure: Coordinator]

Event log redundancy
• Process structure = composition of multiple control-flow patterns (choice, parallelism, iteration)
• Different instances arise from, e.g., different combinations of choices made or different interleavings of events
• Different execution traces may represent the “same” behavior => event log redundancy
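One rough way to see this redundancy is to compare the directly-follows pairs observed in a small part of a log with those in the full log. The sketch below uses directly-follows pairs only as a proxy for "same behavior"; that choice, the helper `directly_follows`, and the toy log are assumptions made for illustration.

```python
def directly_follows(log):
    """Return the set of (a, b) pairs where b occurs directly after a in some trace."""
    pairs = set()
    for trace in log:
        pairs.update(zip(trace, trace[1:]))
    return pairs

# Toy log: B and C are executed in parallel between A and D.
log = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
]

full_behavior = directly_follows(log)
sample_behavior = directly_follows(log[:2])
covered = sample_behavior == full_behavior
```

Here the first two traces already exhibit every directly-follows pair of the full log, so under this proxy the remaining three traces add no new behavior.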

Basic idea behind the algorithm
[Figure: process model with activities A, B, C, D, E, F and G]

Island Algorithm

Evaluation
Three different logs:

Experiment design
• Vary the sample size from 10 traces to the full log
• Vary the stop condition
• Vary the population size
• Use islands with the same set-up (processor, memory, OS, etc.)
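A sample-size sweep of this kind could be set up as follows. The slides only state the range (10 traces up to the full log), so the doubling schedule, the uniform random sampling without replacement, and the helper names `sample_sizes` and `draw_sample` are assumptions made for this sketch.

```python
import random

def sample_sizes(log_size, start=10, factor=2):
    """Sample sizes from `start` up to the full log (assumed doubling schedule)."""
    sizes = []
    size = start
    while size < log_size:
        sizes.append(size)
        size *= factor
    sizes.append(log_size)  # always finish with the full log
    return sizes

def draw_sample(log, size, rng):
    """Draw `size` traces uniformly at random, without replacement."""
    return rng.sample(log, min(size, len(log)))

rng = random.Random(42)
# Synthetic placeholder log of 100 traces; real experiments would load event logs.
log = [["A", "B", "C", "D"] if i % 2 == 0 else ["A", "C", "B", "D"] for i in range(100)]

sizes = sample_sizes(len(log))
samples = {size: draw_sample(log, size, rng) for size in sizes}
```

Each entry of `samples` could then be given to an island as its working log, with the last entry corresponding to mining on the full log.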

Experimental Results
Same quality achieved

Experimental Results
• PS – population size
• ISS – sample size
• MUNT – mean used number of traces
• MFC – mean number of fitness computations
• MET – mean execution time

Conclusions
• A new distributed genetic algorithm for process mining using sampling
• The evaluation confirmed that our approach reduces the overall computation time
• The sample size is strongly correlated with the log's characteristics and its level of difficulty from the mining point of view
• Future work:
  • Use smart sampling techniques to reduce the execution time
