Email Analysis for Business Process Discovery
Nassim LAGA (1), Marwa ELLEUCH (1,2), Walid GAALOUL (2), Oumaima ALAOUI ISMAILI (1)
(1) Orange Labs, France
(2) Télécom SudParis, Paris-Saclay University, France
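Introduction
• Process discovery = process model generation from structured event logs (Fuzzy Miner, Heuristic Miner, Alpha Algorithm)
• These techniques rest on two hypotheses about the logs:
  - Hyp1: they have a structured format
  - Hyp2: they contain the trace of all BP tasks
• Both hypotheses are crossed out (X) for BP tasks carried out through informal methods
• Open question (?): process, activity and instance recognition, then conversion into a structured format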
Introduction
• Input: email log; target: structured event logs
• Open question (?): process, activity and instance recognition, followed by conversion into a structured format
Propositions
Email log → process, activity and instance recognition → conversion into structured format → structured event logs
• Automatically identify one activity, one process and one instance related to each email, using supervised learning and clustering techniques
• Minimize human intervention by using:
  - A progressive learning approach that generates two types of predictive models, for predicting process names and activity names
  - A collaborative approach to build the learning dataset for training these predictive models => minimizes the individual human effort
  - A non-parametric clustering algorithm (HDBSCAN) for identifying process instances
Propositions
Overall pipeline: unstructured email logs → structured event logs
• Step 1: Activity and process label generation (progressive learning approach to train process and activity predictive models). Output: per-process and per-activity email lists
• Step 2: Process instance detection (clustering). Output: per-process-instance lists
• Step 3: Event log generation (conversion block into structured format). Output: structured event logs
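Step 3 can be pictured with a minimal Python sketch. It assumes that Steps 1 and 2 have already attached a process label, an activity label and an instance id to every email; the `LabeledEmail` type, its field names and the sample values are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LabeledEmail:
    # Hypothetical record: process/activity from Step 1, instance_id from Step 2.
    timestamp: datetime
    process: str
    activity: str
    instance_id: int


def to_event_log(emails):
    """Group emails into traces: one time-ordered event list per (process, instance)."""
    traces = defaultdict(list)
    for email in sorted(emails, key=lambda e: e.timestamp):
        traces[(email.process, email.instance_id)].append(
            (email.timestamp, email.activity)
        )
    return dict(traces)


# Illustrative data only.
emails = [
    LabeledEmail(datetime(2019, 3, 4), "Hiring", "Interview candidate", 1),
    LabeledEmail(datetime(2019, 3, 1), "Hiring", "Publish offer", 1),
    LabeledEmail(datetime(2019, 3, 2), "Hiring", "Publish offer", 2),
]
log = to_event_log(emails)
```

Each key of `log` plays the role of a case identifier, and each value is the ordered trace that a process discovery algorithm would consume.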
Propositions
Step 1: Activity and Process Label Generation
• Mini-batch learning combined with collaborative annotation: labeled data train the process predictive model and the activity predictive models, which then predict labels for unlabeled data
• Features:
  - For predicting processes: subject and entities of email interlocutors
  - For predicting activities: subject, content, entities of email interlocutors and exchange history (short emails)
Propositions
Mini-batch learning approach
• Emails arrive in batches (Email1, Email2, Email3, ...)
• Preprocessing and feature selection for each new batch:
  - Detect and replace particular expressions by a tag
  - Remove stop words and person names
  - Lemmatize terms
  - Generate the 1-gram and 2-gram vocabulary
  - Update word counters across the whole email dataset => generate TF-IDF values
  - Generate entity interaction values
• Prediction: the process and activity predictive models predict labels for the preprocessed data
• Manual correction of the predicted labels; the number of manual corrections feeds the error-rate calculation
• If error_rate > thresh (Yes): add the new batch to the whole dataset (all existing emails, annotated and preprocessed) and re-train the models
• Otherwise (No): end
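One possible reading of this loop is sketched below in Python. The `KeywordVoteModel` class is a toy stand-in classifier, not one of the models from the slides; `process_batch`, the `annotate` callback (the human correction step) and the 0.2 threshold are likewise assumptions made for illustration.

```python
from collections import Counter, defaultdict


class KeywordVoteModel:
    """Toy stand-in classifier (an assumption): each known token votes for labels."""

    def __init__(self):
        self.token_label_counts = defaultdict(Counter)
        self.default = None

    def fit(self, texts, labels):
        self.token_label_counts.clear()
        for text, label in zip(texts, labels):
            for token in text.lower().split():
                self.token_label_counts[token][label] += 1
        self.default = Counter(labels).most_common(1)[0][0] if labels else None

    def predict(self, text):
        votes = Counter()
        for token in text.lower().split():
            votes.update(self.token_label_counts.get(token, {}))
        return votes.most_common(1)[0][0] if votes else self.default


def process_batch(model, dataset, batch, annotate, thresh=0.2):
    """Predict a batch, collect manual corrections, re-train if error_rate > thresh."""
    predicted = [model.predict(text) for text in batch]
    corrected = [annotate(text, label) for text, label in zip(batch, predicted)]
    n_corrections = sum(p != c for p, c in zip(predicted, corrected))
    error_rate = n_corrections / len(batch)
    retrained = False
    if error_rate > thresh:
        dataset.extend(zip(batch, corrected))  # add the new batch to the whole dataset
        texts, labels = zip(*dataset)
        model.fit(list(texts), list(labels))   # re-training
        retrained = True
    return error_rate, retrained


# Illustrative data only.
dataset = [("hiring interview candidate", "Hiring"), ("patent filing claim", "Patent")]
model = KeywordVoteModel()
model.fit([t for t, _ in dataset], [l for _, l in dataset])
truth = {"candidate interview schedule": "Hiring", "travel refund receipt": "Travel"}
error_rate, retrained = process_batch(
    model, dataset, list(truth), annotate=lambda text, _pred: truth[text]
)
```

In this run one of the two predictions needs a correction, so the batch is added to the dataset and the model is re-trained.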
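Propositions
Similarity function S between two emails E1 and E2 (used for process instance detection)
S(E1, E2) = W0 · J_C(E1, E2) + W1 · (1 − D_T(E1, E2)) + W2 · J_N(E1, E2)
where the three terms (written here as J_C, D_T and J_N; their full expressions are missing from the extracted text) are:
• J_C: Jaccard distance between the entity sets of the emails' interlocutors, C(E1) and C(E2)
• D_T: time distance between E1 and E2 (ts = timestamp)
• J_N: Jaccard distance related to the named entities and references present in the textual data of E1 and E2
• W0, W1 and W2 are tuned by users according to the type of the process (e.g., whether or not it has time constraints, as the accounting closing process does)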
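A weighted similarity of this shape can be sketched in Python as follows. Several points are assumptions rather than facts from the slides: the set terms are computed as a Jaccard index (intersection over union), the time distance is normalised to [0, 1] by a hypothetical `max_gap_seconds` parameter, and the dictionary keys and sample values are invented for illustration.

```python
from datetime import datetime


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| (assumed form of the set terms)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets contribute no similarity
    return len(a & b) / len(a | b)


def similarity(e1, e2, w0, w1, w2, max_gap_seconds):
    """S(E1, E2) = W0 * interlocutor term + W1 * (1 - time distance) + W2 * entity term."""
    gap = abs((e1["ts"] - e2["ts"]).total_seconds())
    time_distance = min(gap / max_gap_seconds, 1.0)  # hypothetical normalisation
    return (
        w0 * jaccard(e1["interlocutors"], e2["interlocutors"])
        + w1 * (1.0 - time_distance)
        + w2 * jaccard(e1["entities"], e2["entities"])
    )


# Illustrative data only.
e1 = {"ts": datetime(2019, 3, 1), "interlocutors": {"hr", "manager"}, "entities": {"REF-12"}}
e2 = {"ts": datetime(2019, 3, 2), "interlocutors": {"hr", "candidate"}, "entities": {"REF-12"}}
s = similarity(e1, e2, w0=0.3, w1=0.3, w2=0.4, max_gap_seconds=10 * 86400)
```

Raising `w1` relative to the other weights would suit processes with strong time constraints, which is the kind of tuning the slide leaves to users.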
Evaluation
Pipeline stages under evaluation: Step 1 (activity and process label generation → per-process and per-activity email lists), Step 2 (process instance detection → per-process-instance lists), Step 3 (event log generation)
Evaluation
Step 1: Activity and Process Label Generation
• Evaluation dataset:
  - Number of emails: 1024
  - Number of activities: 116
  - Number of processes: 13 (hiring, patent application, command, conference participation, travel expense refund, etc.)
• Testing 3 predictive algorithms:
  - Random Forest (RF)
  - Logistic Regression (LR) with Stochastic Gradient Descent (SGD) optimiser
  - Support Vector Machine (SVM)
• Evaluation using the F1-score
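For reference, the F1-score of a multi-class labelling can be computed as in the short Python sketch below. The slides do not say how per-class scores were averaged; the macro average used here is an assumption, and the label values are illustrative.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Illustrative labels only.
y_true = ["Hiring", "Hiring", "Patent", "Travel"]
y_pred = ["Hiring", "Patent", "Patent", "Travel"]
score = macro_f1(y_true, y_pred)
```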
Evaluation
Step 2: Process Instance Detection
• Clustering algorithm: HDBSCAN
• Evaluation metric: Adjusted Mutual Information (AMI)
  - → 1 if the real and HDBSCAN partitions are strongly matched
  - → 0 if the real and HDBSCAN partitions are weakly matched
• Evaluation dataset:
  - Number of emails: 180
  - Process name: Hiring process
  - Number of instance clusters: 11
  - Real partition: instance clusters manually defined
• Evaluation result: AMI = 0.86 → 1 => the real and HDBSCAN partitions are strongly matched
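The AMI between a manual partition and a clustering result can be computed with the standard library alone, as in the sketch below. The normalisation by the arithmetic mean of the two entropies is an assumption (the slides do not state which variant was used), and the label lists are illustrative.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_info(u, v):
    n = len(u)
    a, b = Counter(u), Counter(v)
    return sum(
        nij / n * log(n * nij / (a[i] * b[j]))
        for (i, j), nij in Counter(zip(u, v)).items()
    )


def expected_mutual_info(u, v):
    """Expected MI of two random partitions with the same cluster sizes."""
    n = len(u)
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Hypergeometric probability that the two clusters share nij items.
                p = comb(bj, nij) * comb(n - bj, ai - nij) / comb(n, ai)
                emi += p * nij / n * log(n * nij / (ai * bj))
    return emi


def adjusted_mutual_info(u, v):
    """AMI = (MI - E[MI]) / (mean(H(U), H(V)) - E[MI]); mean choice is an assumption."""
    mi, emi = mutual_info(u, v), expected_mutual_info(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    return (mi - emi) / denom if denom else 1.0


# Illustrative partitions only: same grouping under different cluster names.
real = [0, 0, 1, 1, 2, 2]
found = ["b", "b", "c", "c", "a", "a"]
ami_same = adjusted_mutual_info(real, found)
```

Because the score is corrected for chance, it can fall below 0 when two partitions agree less than random partitions of the same sizes would.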
Evaluation
Step 3: Event Log Generation
• Evaluation dataset:
  - Number of emails: 180
  - Process name: Hiring process
  - Number of instance clusters: 11
• Conversion into structured format, then application of the Heuristic Miner algorithm => discovered hiring model
• Comparison of the discovered hiring model with the real model
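To give an idea of what the Heuristic Miner extracts from the generated event log, the Python sketch below counts directly-follows relations and derives the usual dependency measure between two distinct activities. It is a simplified illustration, not the full algorithm, and the traces and activity names are hypothetical.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def dependency(counts, a, b):
    """Dependency measure for a != b: (|a>b| - |b>a|) / (|a>b| + |b>a| + 1)."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


# Illustrative traces only: one list of activities per process instance.
traces = [
    ["Publish offer", "Receive application", "Interview candidate"],
    ["Publish offer", "Receive application", "Interview candidate"],
    ["Publish offer", "Interview candidate"],
]
counts = directly_follows(traces)
dep = dependency(counts, "Publish offer", "Receive application")
```

Relations whose dependency value stays below a chosen threshold are typically left out of the discovered model, which is one way low-frequency behavior ends up filtered.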
Evaluation
Real hiring process model vs. discovered hiring model
[Figure: the two process models; annotations mark looping, additional behavior, not-allowed behavior (X) and the occurrence number of each behavior over all process instances]
• Almost in conformity
• Two discrepancy types detected at low frequency: (1) unfitting model behavior, (2) additional model behavior
• Discrepancy causes: errors accumulated through the log building systems / log miner technique / real difference between the process as observed in the emails and the related theoretical BP model
Conclusion
Propositions & Advantages
• A solution for mining business processes from emails
• (+) Less human intervention needed at the individual level than in related work, through the use of a collaborative learning approach and a non-parametric clustering algorithm
• (+) Good performance obtained after testing the overall approach on a real dataset
Limitations & Perspectives
• One email can talk about more than one activity and more than one instance
• The current approach still requires human involvement => further automate the BP discovery pipeline
• Semantic similarity is not used => employ semantic similarity measures for constructing learning features based on email contents