The Automatic Explanation of Multivariate Time Series (MTS)

Allan Tucker

The Problem - Data • Datasets which are Characteristically: • High Dimensional MTS • Large Time Lags • Changing Dependencies • Little or No Available Expert Knowledge

The Problem - Requirement • Lack of Algorithms to Assist Users in Explaining Events that: • Can Model Complex MTS Data • Are Learnable from Data with Little or No User Intervention • Transparency Throughout the Learning and Explaining Process is Vital

Contribution to Knowledge • Using a Combination of Evolutionary Programming (EP) and Bayesian Networks (BNs) to Overcome Issues Outlined • Extending Learning Algorithms for BNs to Dynamic Bayesian Networks (DBNs) with Comparison of Efficiency • Introduction of an Algorithm for Decomposing High Dimensional MTS into Several Lower Dimensional MTS

Contribution to Knowledge (Continued) • Introduction of New EP-Seeded GA Algorithm • Incorporating Changing Dependencies • Application to Synthetic and Real-World Chemical Process Data • Transparency Retained Throughout Each Stage

Framework Pre-processing Data Preparation Variable Groupings Model Building Search Methods Synthetic Data Evaluation Real Data Changing Dependencies Explanation

Key Technical Points 1: Comparing Adapted Algorithms • New Representation • K2/K3 [Cooper and Herskovits] • Genetic Algorithm [Larrañaga] • Evolutionary Algorithm [Wong] • Branch and Bound [Bouckaert] • Log Likelihood / Description Length • Publications: • International Journal of Intelligent Systems, 2001

Key Technical Points 2: Grouping • A Number of Correlation Searches • A Number of Grouping Algorithms • Designed Metrics • Comparison of All Combinations • Synthetic and Real Data • Publications: • IDA99 • IEEE Transactions on Systems, Man, and Cybernetics, 2001 • Expert Systems, 2000

Key Technical Points 3: EP-Seeded GA • Approximate Correlation Search Based on the One Used in Grouping Strategy • Results Used to Seed Initial Population of GA • Uniform Crossover • Specific Lag Mutation • Publications: • Genetic Algorithms and Evolutionary Computation Conference 1999 (GECCO99) • International Journal of Intelligent Systems, 2001 • IDA2001
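The two GA operators named above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes a chromosome is a list of (a, b, lag) triples, that crossover swaps aligned genes with probability 0.5, and that the lag mutation redraws only the lag of a triple, uniformly from 1 to a hypothetical `max_lag`. The function names are invented for this sketch.

```python
import random


def uniform_crossover(p1, p2, rng):
    """Swap each aligned gene between two parents with probability 0.5."""
    c1, c2 = list(p1), list(p2)
    for i in range(min(len(c1), len(c2))):
        if rng.random() < 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2


def lag_mutation(chrom, max_lag, rate, rng):
    """With probability `rate` per gene, redraw the lag and keep (a, b)."""
    mutated = []
    for a, b, lag in chrom:
        if rng.random() < rate:
            lag = rng.randint(1, max_lag)  # assumed lag range: 1..max_lag
        mutated.append((a, b, lag))
    return mutated
```

With the rates listed on the parameter slides, a call might look like `lag_mutation(child, max_lag=60, rate=0.1, rng=random.Random())`, where reading LMR as the per-gene lag mutation rate is an assumption.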

Key Technical Points 4: Changing Dependencies • Dynamic Cross Correlation Function for Analysing MTS • Extend Representation • Introduce a Heuristic Search - Hidden Controller Hill Climb (HCHC) • Hidden Variables to Model State of the System • Search for Structure and Hidden States Iteratively

Future Work • Parameter Estimation • Discretisation • Changing Dependencies • Efficiency • New Datasets • Gene Expression Data • Visual Field Data

DBN Representation • Links Encoded as Triples (a, b, lag): (3,1,4) (4,2,3) (2,3,2) (3,0,2) (3,4,2) • [Figure: DBN over Time Slices t-4 to t with Nodes a0(t), a1(t), a2(t-2), a2(t), a3(t-4), a3(t-2), a3(t), a4(t-3), a4(t)]
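The triple encoding on this slide can be unpacked with a short sketch. It assumes each triple (a, b, lag) means that variable a at time t-lag is a parent of variable b at time t, which matches the node labels in the figure; the helper names are hypothetical.

```python
from collections import defaultdict

# Triples from the slide, read as (a, b, lag): a(t - lag) -> b(t)  [assumed reading]
dbn = [(3, 1, 4), (4, 2, 3), (2, 3, 2), (3, 0, 2), (3, 4, 2)]


def parents_of(triples):
    """Map each child variable to its list of (parent variable, lag) pairs."""
    parents = defaultdict(list)
    for a, b, lag in triples:
        parents[b].append((a, lag))
    return dict(parents)


def max_lag(triples):
    """Largest time lag used by any link in the network."""
    return max(lag for _, _, lag in triples)


for child, pars in sorted(parents_of(dbn).items()):
    desc = ", ".join(f"a{a}(t-{lag})" for a, lag in pars)
    print(f"a{child}(t) <- {desc}")
```

Storing the network as a flat list of triples keeps the structure easy to inspect and lets search operators edit a single link at a time.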

Sample DBN Search Results • N = 5, MaxT = 10 • N = 10, MaxT = 60

Grouping • Input: One High Dimensional MTS (A) • 1. Correlation Search (EP): Produces List R of Triples (a, b, lag) • 2. Grouping Algorithm (GGA): Produces Grouping G, e.g. {0,3} {1,4,5} {2} • Output: Several Lower Dimensional MTS
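The first stage, a search for strongly correlated (a, b, lag) triples, can be sketched as below. This is a simplified stand-in: it uses plain random sampling rather than the EP on the slide, and Pearson correlation as the score, both of which are assumptions. The parameters `calls` and `list_size` are hypothetical names; mapping them to c and R on the parameter slides is only a guess.

```python
import math
import random


def lagged_corr(x, y, lag):
    """Pearson correlation between x(t - lag) and y(t)."""
    xs, ys = x[:len(x) - lag], y[lag:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    vx = sum((u - mx) ** 2 for u in xs)
    vy = sum((v - my) ** 2 for v in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def random_correlation_search(series, max_lag, calls, list_size, rng):
    """Sample `calls` random triples and keep the `list_size` strongest."""
    n = len(series)
    scores = {}
    for _ in range(calls):
        a, b = rng.sample(range(n), 2)  # two distinct variables
        lag = rng.randint(1, max_lag)
        scores[(a, b, lag)] = abs(lagged_corr(series[a], series[b], lag))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:list_size]
```

The returned list plays the role of R: a short list of candidate links that a grouping step can then partition into sets of related variables.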

Sample Grouping Results • [Figure: Original Synthetic MTS Groupings, Variables 0 to 60] • [Figure: Groupings Discovered from Synthetic Data, Variables 0 to 60] • [Figure: Sample of Variables from a Discovered Oil Refinery Data Group]

Parameter Estimation • Simulate Random Bag (Vary R, s and c, e) • Calculate Mean and SD for Each Distribution (the Probability of Selecting e from s) • Test for Normality (Lilliefors’ Test) • Symbolic Regression (GP) to Determine the Function for Mean and SD from R, s and c (e will be Unknown) • Place Confidence Limits on the P(Number of Correlations Found e)

EP-Seeded GA • EP Produces the Final EPList: 0: (a,b,l), 1: (a,b,l), 2: (a,b,l), EPListSize: (a,b,l) • EPList Seeds the Initial GAPopulation: 0: ((a,b,l),(a,b,l)…(a,b,l)), 1: ((a,b,l),(a,b,l)…(a,b,l)), 2: ((a,b,l),(a,b,l)…(a,b,l)), GAPopsize: ((a,b,l) … (a,b,l)) • GA Produces the DBN
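The seeding step in this diagram can be sketched as follows. It assumes each initial chromosome is built by sampling distinct triples from the final EPList; the sampling rule, the fixed chromosome length `chrom_len`, and the function name are hypothetical choices for illustration, not details given on the slide.

```python
import random


def seed_population(ep_list, pop_size, chrom_len, rng):
    """Build each initial chromosome by sampling triples from the EP list."""
    k = min(chrom_len, len(ep_list))  # cannot draw more triples than exist
    return [rng.sample(ep_list, k) for _ in range(pop_size)]


# Example EP list, reusing the triples from the DBN Representation slide.
ep_list = [(3, 1, 4), (4, 2, 3), (2, 3, 2), (3, 0, 2), (3, 4, 2)]
population = seed_population(ep_list, pop_size=10, chrom_len=3, rng=random.Random(0))
```

Starting the GA from triples that already score well in the correlation search, rather than from random links, is the point of the seeding.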

EP-Seeded GA Results • N = 10, MaxT = 60 • N = 20, MaxT = 60

Varying the value of c

Time Explanation • Time Points: t, t-1, t-11, t-13, t-16, t-20, t-60 • P(TT in state_0) = 1.0 • P(TGF in state_0) = 1.0 • P(BPF in state_3) = 1.0 • P(TGF in state_3) = 1.0 • P(TT in state_1) = 0.446 • P(SOT in state_0) = 0.314 • P(C2% in state_0) = 0.279 • P(T6T in state_0) = 0.347 • P(RinT in state_0) = 0.565

Changing Dependencies • [Plot: Variable Magnitude of A/M_GB and TGF against Time (Minutes), 1 to 3501]

Dynamic Cross-Correlation Function
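One way to read a dynamic cross-correlation function is as an ordinary cross-correlation recomputed over a sliding window, so that a change in the dominant lag becomes visible over time. The sketch below takes that reading as an assumption; the slide gives no formula. The names `winlen` and `winjump` are borrowed from the HCHC parameter slide, and using them here is also an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    vx = sum((u - mx) ** 2 for u in xs)
    vy = sum((v - my) ** 2 for v in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def dynamic_ccf(x, y, max_lag, winlen, winjump):
    """Return [(window_start, [corr of x(t-lag), y(t) for lag 0..max_lag])]."""
    rows = []
    for start in range(0, len(x) - winlen + 1, winjump):
        wx = x[start:start + winlen]
        wy = y[start:start + winlen]
        corrs = [pearson(wx[:winlen - lag], wy[lag:]) for lag in range(max_lag + 1)]
        rows.append((start, corrs))
    return rows


def best_lags(rows):
    """Lag with the largest absolute correlation in each window."""
    return [max(range(len(c)), key=lambda k: abs(c[k])) for _, c in rows]
```

If the dependency between two variables shifts from one lag to another partway through the series, `best_lags` changes value at the corresponding window.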

Hidden Variable - OpState • [Figure: DBN over Time Slices t-4 to t with Nodes a0(t-4), a2(t-1), a3(t-2), a2(t) and Hidden Variable OpState2]

Hidden Controller Hill Climb • Maintains <DBN_List> and <Segment_Lists> • Update Segment_Lists through Op_State Parameter Estimation • Score • Update DBN_List through DBN Structure Search

HCHC Results - Oil Refinery Data

HCHC Results - Synthetic Data • Generate Data from Several DBNs • Append Each Section of Data Together to Form One MTS with Changing Dependencies • Run HCHC

Time Explanation • Time Points: t, t-1, t-3, t-5, t-6, t-9 • P(OpState1 is 0) = 1.0 • P(a1 is 0) = 1.0 • P(a0 is 0) = 1.0 • P(a2 is 1) = 1.0 • P(OpState1 is 0) = 1.0 • P(a1 is 1) = 1.0 • P(a0 is 0) = 1.0 • P(a2 is 1) = 1.0 • P(a2 is 0) = 0.758 • P(OpState0 is 0) = 0.519 • P(a0 is 0) = 0.968 • P(OpState0 is 0) = 0.720 • P(a0 is 1) = 0.778 • P(a2 is 0) = 0.545 • P(a0 is 1) = 0.517

Time Explanation • Time Points: t, t-1, t-3, t-5, t-6, t-7, t-9 • P(OpState1 is 4) = 1.0 • P(a1 is 0) = 1.0 • P(a0 is 0) = 1.0 • P(a2 is 1) = 1.0 • P(OpState1 is 4) = 1.0 • P(a1 is 1) = 1.0 • P(a0 is 0) = 1.0 • P(a2 is 1) = 1.0 • P(a2 is 1) = 0.570 • P(a0 is 0) = 0.506 • P(OpState2 is 3) = 0.210 • P(a2 is 1) = 0.974 • P(OpState2 is 4) = 0.222 • P(a2 is 0) = 0.882 • P(a0 is 1) = 0.549

TGF %C3 Process Diagram TT T6T PGM PGB SOTT11 SOFT13 RINT C11/3T T36T AFT FF RBT BPF %C2

TGF %C3 Typical Discovered Relationships PGM TT T6T PGB SOTT11 SOFT13 RINT C11/3T T36T AFT FF RBT BPF %C2

Parameters

DBN Search
             GA            EP
PopSize      100           10
MR           0.1           0.8
CR           0.8           ---
Gen          Based on FC   Based on FC

Correlation Search
c            Approx. 20% of s
R            Approx. 2.5% of s

Grouping GA
             Synth. 1   Synth. 2-6            Oil
PopSize      150        100                   150
CR           0.8        0.8                   0.8
MR           0.1        0.1                   0.1
Gen          150        100 (1000 for GPV)    150

Parameters

EP-Seeded GA
c            Approx. 20% of s
EPListSize   Approx. 2.5% of s
GAPopSize    10
MR           0.1
CR           0.8
LMR          0.1
Gen          Based on FC

HCHC
                 Oil       Synthetic
DBN_Iterations   1×10^6    5000
Winlen           1000      200
Winjump          500       50
