Email Analysis for Business Process Discovery

Nassim LAGA (1), Marwa ELLEUCH (1,2), Walid GAALOUL (2), Oumaima ALAOUI ISMAILI (1)
(1) Orange Labs, France
(2) Télécom SudParis, Paris-Saclay University, France

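Introduction
• Process discovery = process model generation from structured event logs, using algorithms such as the Fuzzy Miner, the Heuristic Miner and the Alpha Algorithm
• These algorithms rest on two hypotheses about the logs:
  • Hyp1: the logs have a structured format
  • Hyp2: the logs contain the trace of all BP tasks
• Neither hypothesis holds (X) when the process is carried out through informal methods
• Open question (?): process, activity and instance recognition, followed by conversion into a structured format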

Introduction
• Open question (?): how to obtain structured event logs from an email log
• Two stages are needed: process, activity and instance recognition, then conversion into a structured format

Propositions
• Automatically identify one activity, one process and one instance related to each email, using supervised learning and clustering techniques
• Minimize human intervention by using:
  • A progressive learning approach that generates two types of predictive models, for predicting process names and activity names
  • A collaborative approach to build the learning dataset for training these predictive models => minimize the individual human effort
  • A non-parametric clustering algorithm (HDBSCAN) for identifying process instances

Propositions
• Input: unstructured email logs
• Step 1: Activity and process labels generation. A progressive learning approach trains the process and activity predictive models. Output: per-process and per-activity email lists
• Step 2: Process instance detection by clustering. Output: per-process-instance lists
• Step 3: Event logs generation. A conversion block turns the lists into a structured format. Output: structured event logs
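Step 3 can be illustrated with a minimal sketch, assuming each email already carries a process label, an activity label and an instance cluster id from Steps 1 and 2. The field names (`process`, `instance`, `activity`, `timestamp`), the `case_id` format and the sample activity names are hypothetical; the slides do not specify the layout of the structured log.

```python
import csv
import io
from collections import defaultdict

def build_event_log(emails):
    """Turn labeled emails into event rows sorted by case, then by time."""
    rows = [
        {
            # One case per (process, instance cluster) pair: an assumption.
            "case_id": f'{e["process"]}-{e["instance"]}',
            "activity": e["activity"],
            # ISO 8601 strings sort chronologically as plain text.
            "timestamp": e["timestamp"],
        }
        for e in emails
    ]
    rows.sort(key=lambda r: (r["case_id"], r["timestamp"]))
    return rows

def to_csv(rows):
    """Serialize the event rows as CSV text, one event per line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["case_id", "activity", "timestamp"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def traces(rows):
    """Group the ordered activities by case, as a process miner would read them."""
    out = defaultdict(list)
    for r in rows:
        out[r["case_id"]].append(r["activity"])
    return dict(out)

# Hypothetical labeled emails, deliberately out of chronological order.
emails = [
    {"process": "Hiring", "instance": 1, "activity": "Schedule interview",
     "timestamp": "2020-01-03T10:00:00"},
    {"process": "Hiring", "instance": 1, "activity": "Publish job offer",
     "timestamp": "2020-01-01T09:00:00"},
    {"process": "Hiring", "instance": 2, "activity": "Publish job offer",
     "timestamp": "2020-01-02T09:00:00"},
]
event_log = build_event_log(emails)
csv_text = to_csv(event_log)
case_traces = traces(event_log)
```

A case id, an activity name and a timestamp per row is the minimum that discovery algorithms such as the Heuristic Miner generally expect from an event log.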

Propositions
Step 1: Activity and Process Labels Generation
• Mini-batch learning with collaborative annotation: labeled data train the process and activity predictive models, which then predict labels for the unlabeled data
• Features for predicting processes: subject and entities of the email interlocutors
• Features for predicting activities: subject, content, entities of the email interlocutors and exchange history (short emails)

Propositions
Mini-batch learning approach
• Emails arrive in batches (…, Email3, Email2, Email1). Each new batch of emails is preprocessed and its features are selected:
  • Detect and replace particular expressions by a tag
  • Remove stop words and person names
  • Lemmatize terms
  • Generate the 1-gram and 2-gram vocabulary
  • Update word counters across the whole email dataset => generate TF-IDF values
  • Generate entity interaction values
• Prediction: the process and activity predictive models predict labels for the preprocessed data
• Manual correction of the predicted labels, then error rate calculation from the number of manual corrections
• The new batch is added to the whole dataset (all existing emails, annotated and preprocessed) and the models are re-trained
• If error_rate > thresh: Yes => continue with a new batch of emails; No => End
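The loop on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: `train` and `predict` are a toy keyword classifier standing in for the RF, LR or SVM models, `correct` stands in for the human annotator, and the stopping rule (stop once the error rate is no longer above `thresh`) is one reading of the Yes/No branch.

```python
from collections import Counter

def train(dataset):
    """Toy stand-in model: map the first word of an email to its majority label."""
    votes = {}
    for email, label in dataset:
        votes.setdefault(email.split()[0], Counter())[label] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}

def predict(model, email):
    """Return the predicted label, or None when no model exists or the word is unseen."""
    return model.get(email.split()[0]) if model else None

def progressive_learning(batches, correct, thresh):
    """Predict each batch, let a human correct it, re-train, stop when errors are low."""
    dataset, model, history = [], None, []
    for batch in batches:
        predicted = [predict(model, email) for email in batch]
        # Manual correction: the annotator confirms or fixes every predicted label.
        labels = [correct(email, p) for email, p in zip(batch, predicted)]
        n_corrections = sum(p != y for p, y in zip(predicted, labels))
        error_rate = n_corrections / len(batch)
        history.append(error_rate)
        # Add the new batch to the whole dataset and re-train on everything.
        dataset.extend(zip(batch, labels))
        model = train(dataset)
        if error_rate <= thresh:
            break
    return model, history

# Hypothetical ground truth playing the role of the human annotator.
TRUTH = {
    "hiring interview": "Hiring", "travel refund": "Travel",
    "hiring offer": "Hiring", "travel ticket": "Travel",
    "hiring contract": "Hiring",
}
batches = [
    ["hiring interview", "travel refund"],
    ["hiring offer", "travel ticket"],
    ["hiring contract"],
]
model, history = progressive_learning(
    batches, correct=lambda email, predicted: TRUTH[email], thresh=0.1
)
```

In this toy run the first batch is fully corrected by hand because no model exists yet, and the second batch needs no correction, so the third batch is never sent for annotation.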

Propositions
Similarity function S between two emails E1 and E2:
\[ S(E_1, E_2) = W_0 \cdot J_C(E_1, E_2) + W_1 \cdot \bigl(1 - D_t(E_1, E_2)\bigr) + W_2 \cdot J_N(E_1, E_2) \]
• J_C: Jaccard distance between the entity sets of the emails' interlocutors, C(E1) and C(E2)
• D_t: time distance between E1 and E2 (ts = timestamp)
• J_N: Jaccard distance related to the named entities and references present in the textual data of E1 and E2
• W0, W1 and W2 are tuned by users according to the type of the process (e.g., whether or not it has time constraints, such as the accounting closing process)
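A minimal sketch of such a weighted similarity, under stated assumptions: the two Jaccard terms are computed as set overlap (intersection over union), so that a higher value means more similar emails, and the time distance is the timestamp gap scaled to [0, 1] by a hypothetical `max_gap_seconds` parameter. The slide does not give the exact normalization, and the dictionary keys used for an email are illustrative.

```python
from datetime import datetime

def jaccard(a, b):
    """Set overlap |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def time_distance(ts1, ts2, max_gap_seconds):
    """Timestamp gap scaled to [0, 1]; gaps beyond max_gap_seconds saturate at 1."""
    gap = abs((ts1 - ts2).total_seconds())
    return min(gap / max_gap_seconds, 1.0)

def similarity(e1, e2, w0, w1, w2, max_gap_seconds=30 * 24 * 3600):
    """Weighted sum of interlocutor overlap, time closeness and entity overlap."""
    return (
        w0 * jaccard(e1["interlocutors"], e2["interlocutors"])
        + w1 * (1 - time_distance(e1["ts"], e2["ts"], max_gap_seconds))
        + w2 * jaccard(e1["entities"], e2["entities"])
    )

# Hypothetical emails: e1 and e2 share everything, e3 shares nothing with e1.
e1 = {"interlocutors": {"hr", "manager"}, "entities": {"job-42"},
      "ts": datetime(2020, 1, 1)}
e2 = {"interlocutors": {"hr", "manager"}, "entities": {"job-42"},
      "ts": datetime(2020, 1, 1)}
e3 = {"interlocutors": {"finance"}, "entities": {"invoice-7"},
      "ts": datetime(2020, 3, 1)}

s_same = similarity(e1, e2, w0=0.4, w1=0.3, w2=0.3)
s_far = similarity(e1, e3, w0=0.4, w1=0.3, w2=0.3)
```

For a process with strong time constraints, a user would raise `w1` relative to the other two weights. A clustering algorithm that works on distances would take something like `1 - similarity` as input when the weights sum to 1.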

Evaluation
• Step 1: Activity and process labels generation (output: per-process and per-activity email lists)
• Step 2: Process instance detection (output: per-process-instance lists)
• Step 3: Event logs generation

Evaluation
Step 1: Activity and Process Labels Generation
Evaluation dataset
• Number of emails: 1024
• Number of activities: 116
• Number of processes: 13 (hiring, patent application, command, conference participation, travel expense refund, etc.)
Evaluation using the F1-score
• Testing 3 predictive algorithms:
  • Random Forest (RF)
  • Logistic Regression (LR) with Stochastic Gradient Descent (SGD) optimiser
  • Support Vector Machine (SVM)
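For reference, the F1-score of a multi-class labeling can be computed as in the sketch below. The slide does not say how the per-class scores were averaged, so the macro average (unweighted mean over classes) used here is an assumption, and the sample labels are invented for illustration.

```python
def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Hypothetical process labels for four emails.
y_true = ["Hiring", "Hiring", "Travel", "Travel"]
y_pred = ["Hiring", "Travel", "Travel", "Travel"]

per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

With 116 activity labels over 1024 emails, many classes have few examples, so the choice between macro and support-weighted averaging can change the reported score noticeably.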

Evaluation
Step 2: Process Instance Detection
Evaluation dataset
• Number of emails: 180
• Process name: Hiring process
• Number of instance clusters: 11
• Real partition: instance clusters defined manually
Evaluation
• Clustering algorithm: HDBSCAN
• Evaluation metric: Adjusted Mutual Information (AMI)
• Returned value:
  • → 1 if the real and HDBSCAN partitions are strongly matched
  • → 0 if the real and HDBSCAN partitions are weakly matched
Evaluation result: AMI = 0.86 → 1 => the real and HDBSCAN partitions are strongly matched
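To make the metric concrete, here is a standard-library sketch of Adjusted Mutual Information between a manual partition and a clustering result. It assumes the common definition AMI = (MI - E[MI]) / (mean(H(U), H(V)) - E[MI]) with arithmetic-mean normalization and the hypergeometric model for the expected value; the slides do not state which normalization was used, and the sample partitions are invented.

```python
from collections import Counter
from math import comb, log

def entropy(labels):
    """Shannon entropy (natural log) of a partition given as a label list."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """Mutual information between two labelings of the same items."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (nij / n) * log(n * nij / (a[i] * b[j])) for (i, j), nij in joint.items()
    )

def expected_mutual_information(u, v):
    """Expected MI of random partitions with the same cluster sizes."""
    n = len(u)
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Hypergeometric probability that the two clusters share nij items.
                p = comb(bj, nij) * comb(n - bj, ai - nij) / comb(n, ai)
                emi += p * (nij / n) * log(n * nij / (ai * bj))
    return emi

def adjusted_mutual_information(u, v):
    """AMI with arithmetic-mean normalization; 1.0 for degenerate identical cases."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    return 1.0 if abs(denom) < 1e-12 else (mi - emi) / denom

# Hypothetical partitions: same grouping with different cluster names,
# and a grouping that cuts across the real one.
real = [0, 0, 0, 1, 1, 1, 2, 2]
same_grouping = [5, 5, 5, 7, 7, 7, 9, 9]
crossing = [0, 1, 2, 0, 1, 2, 0, 1]

ami_match = adjusted_mutual_information(real, same_grouping)
ami_cross = adjusted_mutual_information(real, crossing)
```

Because the score is corrected for chance, a clustering that is unrelated to the manual partition lands near 0 or slightly below it, while cluster renaming alone does not lower the score.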

Evaluation
Step 3: Event Logs Generation
Evaluation dataset
• Number of emails: 180
• Process name: Hiring process
• Number of instance clusters: 11
Method
• Conversion into structured format
• Applying the Heuristic Miner algorithm to obtain the discovered hiring model
• Comparison of the discovered hiring model with the real model

Evaluation
Real hiring process model vs. discovered hiring model
• Figure legend: occurrence number of each behavior over all process instances; looping; additional behavior; X = not allowed behavior
• The discovered model is almost in conformity with the real model
• Two discrepancy types detected at low frequency: (1) unfitting model behavior, (2) additional model behavior
• Discrepancy causes: errors accumulated through the log building system, the log mining technique, or a real difference between the process as observed in the emails and the related theoretical BP model

Conclusion
Propositions & Advantages
• A solution for mining business processes from emails
• (+) Less human intervention needed at the individual level compared to related works, through the use of a collaborative learning approach and a non-parametric clustering algorithm
• (+) Good performance obtained after testing the overall approach on a real dataset
Limitations & Perspectives
• One email can talk about more than one activity and more than one instance
• The current approach still requires human involvement => further automate the BP discovery pipeline
• Semantic similarity is not used => employ semantic similarity measures for constructing learning features based on email contents

Thanks for your attention
