180 likes | 314 Views
Trace Clustering in Process Mining. M. Song, C.W. Gunther , and W.M.P. van der Aalst. Han- na Yang. Introduction. The major application of process mining Discovery ⇒ extraction of abstract process knowledge from event logs Real-life business processes are flexible Spaghetti model
E N D
Trace Clustering in Process Mining M. Song, C.W. Gunther, and W.M.P. van der Aalst Han-na Yang
Introduction • The major application of process mining • Discovery ⇒ extraction of abstract process knowledge from event logs • Real-life business processes are flexible • Spaghetti model • Single cases differ significantly from one another = ‘Diversity’ • Discovering actual process which is being executed is valuable. • Solution for diversity of cases • Measure the similarity of cases and use the information to divide the set of cases into more homogeneous subsets. • Trace clustering
Running Example • Repair process of products within an electronic company that makes navigation and mobile phones • Case: a specific row • Trace: the sequence of events within a case • Events: represented by the case identifier, activity identifier, and originator Activity identifier Originator Case identifier
Running Example • Trace clustering can support the identification of process variants corresponding to homogenous subsets of cases Navigation system Mobile Phone Repair is Canceled
Trace profiles • In the trace clustering approach, each case is characterized by a defined set of items, i.e., specific features which can be extracted from the corresponding trace. • Items for comparing traces are organized in trace profiles, each addressing a specific perspective of the log
Trace profiles • Information in Event log • WorkflowLog • group any number of process elements • ProcessInstance • a case • AuditTrailEntry • events • WorkflowModelElement • name of event • mandatory event attribute • EventType • identify lifecycle transitions • mandatory event attribute • Timestamp, Originator • optional data fields
Trace profiles • Profile • A set of related items which describe the trace from a specific perspective • Every item is a metric ⇒ we can consider a profile with n items to be a function, which assigns to a trace a vector (i1, i2, … in) • Profiling a log can be described as measuring a set of traces with a number of profiles, resulting in an aggregate vector • Resulting vectors can subsequently be used to calculate the distance between any two traces, using a distance metric
Clustering Methods - Distance Measures • Distance Measures • To calculate the similarity between cases • Three distance measures • n: the number of items extracted from the process log • case cj: corresponds to the vector (ij1, ij2, … ijn) • ijk: the number of appearance of item k in the case j
Clustering Methods – Clustering Algorithm • K-means clustering • A method of cluster analysis • aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. • QT (quality threshold) clustering • A method of partitioning data • invented for gene clustering • requires more computing power than k-means • but does not require specifying the number of clusters a priori • predictable - always returns the same result when run several times. • guided by a quality threshold(determines the maximum diameter of clusters)
Clustering Methods – Clustering Algorithm • Agglomerative hierarchical clustering • Gradually generate clusters by merging nearest traces • Smaller clusters are merged into large ones • Example: we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Clustering Methods – Clustering Algorithm • The Self-Organizing Map (SOM) • Used to map high dimensional data onto low dimensional spaces • grouping similar cases close together in certain areas of the value range • can be used to portray complex correlations in statistical data. • Example: World Bank statistics of countries in 1992. • 39 indicators describing various quality-of-life factors were used • Countries that had similar values of the indicators place near each other on the map • different clusters were automatically encoded with different bright colors • each country was assigned a color describing its poverty type in relation to other countries • The poverty structures of the world: each country on the geographic map has been colored according to its poverty type.
Case study • ProM • Support various process mining algorithm • Implemented the trace clustering plug-in in ProM • Process log from AMC hospital in Amsterdam, Netherlands • 619 gynecological oncology patients (treated in 2005, 2006) = 619 cases • 52 diagnostic activities • 3,574 events, 34 departments are involved
Case study • Process model for all cases obtained using the Heuristic Miner
Case study • The result obtained by applying the trace clustering plug-in in ProM • The cases in the same cell = belong to the same cluster cluster (1,2) 352 cluster (3,1) 113
Case study • Result for cluster (1,2) • 352 cases (more than half of the entire cases) • Only 11 activities ⇒ Simple • Patients who are diagnosed by another hospital and are referred to the AMC hospital for treatment
Case study • Result for cluster (3,1) • 113 cases • Complex as the original process model • Patients who are not diagnosed and need more complex and detailed diagnostic activities
Conclusion • Process mining techniques can deliver valuable, factual insights into how processes are being executed in real life • Important for analyzing flexible environments • Trace clustering operates on the event log level • Improve the results of any process mining algorithm • Both the approach and implementation are straightforward to extend • Ex: By adding domain-specific profiles or further clustering algorithm