180 likes | 377 Views
Distributed Genetic Process Mining using Sampling. Carmen Bratosin , Natalia Sidorova, Wil van der Aalst. Process Mining. Process Mining: Process Models Discovery from Event Logs. A small example. A. V. D. V. V. G. F. E. B. V. V. V. V. V. C. V.
E N D
Distributed Genetic Process Mining using Sampling Carmen Bratosin, Natalia Sidorova, Wil van der Aalst
A small example A V D V V G F E B V V V V V C V Input condition: C can be exe-cuted if B OR G OR F has already been executed Output condition: after D will be executed B AND G
Context Genetic based Process Mining Algorithm • Heuristics based process mining algorithms drawbacks: • Fail to discover complex process structures • Not robust to noise or infrequent behavior
Genetic Miner Find a Model such that maxSpaceOfAllModelsfitness(Log, Model) Build Initial Population Compute Fitness Create New Population (Elitism, Mutation, Crossover) Evaluate Stop Condition NO YES Stop
Fitness Computation – Main Ideas • Execution time linearly dependent on the number of traces • Execution time is dependent on the quality of the solution • More complex the process model to be discovered => more time needed Each individual is assessed against each trace For each trace rewards and penaltiesare given when activities may/ may not be replayed
Genetic Miner Disadvantages • time consumption • the time needed to compute the fitness • the large number of fitness evaluations needed The goal To use distribution techniques in order to improve the time consumption. Advantages discover non-trivial process structures (e.g. non free-choice routings) robustness to noise
Distributed Genetic Process MiningEvent Log Distribution Coordinator
Event logs redundancy Process structure = composition of multiple control-flow patterns (choice, parallel, iteration) Different instances formed of e.g. different combinations of choices made, or different interleaving of events Different execution traces may represent the “same” behavior => event log redundancy
Basic idea behind the algorithm V C A B F E V V V V V D G V V V
Evaluation Three different logs:
Experiment design Vary the sample size from 10 traces to the full log Vary the stop condition Vary the population size Use islands with same set-up (processor, memory, OS etc.)
Experimental Results Same quality achieved
Experimental Results PS – population size ISS – sample size MUNT – mean used number of traces MFC – mean number of fitness computation MET – mean execution time
Conclusions • A new distributed genetic algorithm for process mining using sampling • Evaluation confirmed that our approach reduces the overall computation time • The sample size is strongly correlated with the logs characteristics and their level of difficulty from the mining point of view • Future work: • Use smart sampling techniques to reduce the execution time