Sequence Classification Using Statistical Pattern Recognition
José Antonio Iglesias, Agapito Ledezma, and Araceli Sanchis
Computer Science Department, Universidad Carlos III de Madrid
Avda. de la Universidad, 30. 28911 Leganés, Spain
{jiglesia, ledezma, masm}@inf.uc3m.es
Outline
• Motivation and Introduction
• Sequence classification
• Our approach
  • Library Creation
  • Classification
• Target Environment Description
• Experiments and Results
• Conclusions and Future Works
Motivation
Opponent behavior modelling / classification (environment: soccer simulation domain).
[Figure: opponent modeling with the RoboCup Soccer Server. Labels: Off-Line Analysis, Pattern LogFile, No-Pattern LogFile, Pattern Detection, Base Strategy Pattern, On-Line Detection, Environment Information, On-Line Comparing Method, Pattern Recognition, Recognized Patterns, Advice to Players.]
Introduction
• Behavior classification: a behavior is represented as a sequence of elements, so classifying behaviors becomes a sequence classification problem.
• Sequence: “set of elements ordered so that they can be labelled with the positive integers” (Merriam-Webster Dictionary)
Sequence classification
Given:
• Classes C = {c1, c2, …, cn}
• Sequence E = {e1, e2, …, em}
Determine: which class ci ∈ C the sequence E belongs to.
Our approach
[Figure: two-stage overview. Library Creation: each labelled sequence (Sequence 1 / Class 1, Sequence 2 / Class 2, …, Sequence n / Class n) yields a pattern (Pattern 1, Pattern 2, Pattern 3, …) stored in the Pattern Library. Classification: the on-line sequence to classify is turned into a pattern and compared with every library pattern (Compare_Patterns) to produce the classification result. Example UNIX command sequences shown: “pwd fs fg …”, “finger more ls ...”, “vi man ls …”, “vi more ls …”.]
Library Creation
Trie (from “retrieval”) data structure: a special search tree used for storing elements and their prefixes.
Every node:
• represents an element
• stores useful information (number of times it appeared, …)
Library Creation - An example trie
Sequence to insert initially in the trie: {pwd vi pwd vi pwd ls}
Sub-sequence length: 3
Sub-sequences to insert in the trie: {pwd vi pwd} and {vi pwd ls}
State of the trie below Root after each insertion step (each node shows [times appeared]):
1. Root only (empty trie)
2. pwd [1] → vi [1] → pwd [1]
3. pwd [1] → vi [1] → pwd [1]; vi [1] → pwd [1]
4. pwd [2] → vi [1] → pwd [1]; vi [1] → pwd [1]
5. pwd [2] → vi [1] → pwd [1]; vi [2] → pwd [2] → ls [1]
6. pwd [3] → (vi [1] → pwd [1], ls [1]); vi [2] → pwd [2] → ls [1]
7. pwd [3] → (vi [1] → pwd [1], ls [1]); vi [2] → pwd [2] → ls [1]; ls [1]
Step 7 is the final trie for the sequence {pwd vi pwd vi pwd ls}.
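The construction of the example trie can be sketched in Python. This is a minimal sketch rather than the authors' implementation: the names `TrieNode`, `split_subsequences`, `insert_with_suffixes` and `build_trie` are hypothetical, and it assumes (as the example suggests) that the sequence is cut into consecutive, non-overlapping sub-sequences and that each sub-sequence is inserted together with all of its suffixes.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One trie node: how often it was reached, plus its children by element."""
    count: int = 0
    children: dict = field(default_factory=dict)


def split_subsequences(sequence, length):
    # Assumption: consecutive, non-overlapping chunks of the given length.
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def insert_with_suffixes(root, subsequence):
    # Insert the sub-sequence and every one of its suffixes,
    # incrementing the counter of each node visited on the way down.
    for start in range(len(subsequence)):
        node = root
        for element in subsequence[start:]:
            node = node.children.setdefault(element, TrieNode())
            node.count += 1


def build_trie(sequence, length):
    root = TrieNode()
    for subsequence in split_subsequences(sequence, length):
        insert_with_suffixes(root, subsequence)
    return root


trie = build_trie("pwd vi pwd vi pwd ls".split(), 3)
print(trie.children["pwd"].count)                 # 3
print(trie.children["vi"].children["pwd"].count)  # 2
```

With the slide's sequence and a sub-sequence length of 3, the counters match the final state of the example: pwd [3], vi [2] and ls [1] directly below the root.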
Library Creation - Evaluating Dependences
Evaluate the relation/dependence between an element and its prefix.
Two approaches:
• Frequency-based method.
• Statistical dependence method.
Our approach: statistical. Value used: chi-square value. This value is stored in every node of the trie.
Library Creation - Evaluating Dependences
$$E_{ij} = \frac{\text{Row}_i\ \text{Total} \times \text{Column}_j\ \text{Total}}{\text{Grand Total}}$$
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
2 x 2 Contingency Table:
• O11: How many times the current node/element is followed by its prefix.
• O12: How many times the current node/element is followed by a different prefix.
• O21: How many times a different prefix (of the same length) is followed by the same node.
• O22: How many times a different prefix (of the same length) is followed by a different node.
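The chi-square formula above can be checked with a few lines of Python. This is only a sketch: the function name `chi_square` is hypothetical, the observed counts in the usage line are made up for illustration, and skipping cells whose expected value is zero is an assumption that the slides do not address.

```python
def chi_square(observed):
    """Chi-square statistic of an r x k table of observed counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        return 0.0

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count: (row i total * column j total) / grand total.
            e_ij = row_totals[i] * column_totals[j] / grand_total
            if e_ij > 0:  # assumption: cells with zero expectation are skipped
                chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2


# Made-up 2 x 2 table [[O11, O12], [O21, O22]]
print(round(chi_square([[10, 20], [30, 40]]), 4))  # 0.7937
```

For this table the expected counts are 12, 18, 28 and 42, so the statistic is 4/12 + 4/18 + 4/28 + 4/42 ≈ 0.7937.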
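Library Creation - Evaluating Dependences
[Figure: Sequence Pattern Trie for the example, nodes labelled element [times appeared] [chi-square value]. Node labels below Root, in slide order: pwd [3], vi [1] [5.1], pwd [1] [4.3], vi [2], pwd [2] [3.5], ls [1] [4.3], ls [1] [4.3], ls [2].]
• A Sequence Pattern Trie is created for each class.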
Classification
[Figure: overview of the two stages, Library Creation and Classification. The labelled sequences (Sequence 1 / Class 1, …, Sequence n / Class n) are stored as patterns (Pattern 1, Pattern 2, Pattern 3, …) in the Pattern Library. The sequence to classify is stored in a TestingTrie, and Compare_Patterns compares it with the ClassTrie of each class to obtain the on-line sequence class.]
Classification – Comparing Process
[Figure: a Class Trie and a Testing Trie side by side, nodes labelled element [times appeared] [chi-square value].
Class Trie, below Root: …, ls [2], pwd [3], vi [2], vi [1] [7.1], pwd [2] [1.5], pwd [1] [7.3], ls [1] [0.3].
Testing Trie, below Root: pwd [3], vi [2], vi [1] [5.1], who [2] [3.5], who [1] [4.3].]
• If the node (and its prefix) are in both tries:
  • If ( abs(chi2TestingTrie – chi2ClassTrie) ≤ ThresholdValue ):
    • Similarity between both tries.
    • Result [ElementTestingTrie, PrefixTestingTrie, Chi2TestingTrie]
    • Example: if (abs(5.1 – 7.1) ≤ ThresholdValue): Result [vi, pwd, 5.1]
• If the node (and its prefix) are only in the Testing Trie:
  • Difference between both tries.
  • Result [ElementTestingTrie, PrefixTestingTrie, (Chi2TestingTrie * -1)]
  • Examples: Result [who, pwd vi, (-4.3)] and Result [who, vi, (-3.5)]
Classification – Comparing Process
Each comparison (ClassTrie, TestingTrie) yields a comparison value, obtained by adding the values of all results:
Result: [Element1, Prefix1, Value1] [Element2, Prefix2, Value2] [Element3, Prefix3, Value3] … [Elementn, Prefixn, Valuen] → Comparison Value
Example:
Result: [vi, pwd, +5.1] [who, pwd vi, -4.3] [who, vi, -3.5] → Comparison Value: -2.7
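The comparing process and the resulting comparison value can be sketched as follows. This is a hedged illustration, not the authors' code: `compare_tries` is a hypothetical name, each trie is flattened to a dictionary from a node's full path (prefix plus element) to its chi-square value, the node layout is one reading of the example figure, and the threshold of 2.5 is made up. The slides do not say what happens when a node is in both tries but the difference exceeds the threshold, or when a node is only in the class trie; here such nodes simply contribute nothing, which is an assumption.

```python
def compare_tries(class_chi2, testing_chi2, threshold):
    """Return the list of results and their sum (the comparison value)."""
    results = []
    for path, chi2_testing in testing_chi2.items():
        *prefix, element = path
        if path in class_chi2:
            # Node and prefix are in both tries: similarity if the
            # chi-square values differ by at most the threshold.
            if abs(chi2_testing - class_chi2[path]) <= threshold:
                results.append((element, tuple(prefix), chi2_testing))
            # Assumption: otherwise the node adds nothing to the result.
        else:
            # Node and prefix only in the testing trie: difference.
            results.append((element, tuple(prefix), -chi2_testing))
    comparison_value = sum(value for _, _, value in results)
    return results, comparison_value


# Chi-square values read from the example tries (layout is an assumption).
class_trie = {("pwd", "vi"): 7.1, ("vi", "pwd"): 1.5, ("pwd", "vi", "pwd"): 7.3}
testing_trie = {("pwd", "vi"): 5.1, ("vi", "who"): 3.5, ("pwd", "vi", "who"): 4.3}

results, value = compare_tries(class_trie, testing_trie, threshold=2.5)
print(round(value, 1))  # -2.7
```

The three results are [vi, pwd, 5.1], [who, vi, -3.5] and [who, pwd vi, -4.3], whose sum reproduces the comparison value of -2.7 from the example.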
Classification
[Figure: overview of the two stages, Library Creation and Classification. Compare_Patterns compares the pattern of the on-line sequence to classify with each pattern in the Pattern Library (Pattern 1, Pattern 2, Pattern 3, …) and returns one comparison value per class. The class with the greatest comparison value is the on-line sequence class.]
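Choosing the class from the per-class comparison values is a simple maximum. The sketch below assumes the comparison values have already been computed; the function name `classify`, the class labels and the numbers are hypothetical.

```python
def classify(comparison_values):
    """Return the class whose comparison value is the greatest."""
    if not comparison_values:
        raise ValueError("at least one class is required")
    return max(comparison_values, key=comparison_values.get)


# Made-up comparison values, one per class in the pattern library.
values = {"user1": -2.7, "user2": 4.1, "user3": 0.6}
print(classify(values))  # user2
```

Because each class contributes a single number, adding a class to the library only adds one entry to this dictionary.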
Environment – UNIX command line sequences
• Command histories of 9 UNIX computer users over 2 years.
• Source: UCI Repository of ML Database [Newman C., Hettich S., Merz, C. (1998)]
Example history:
    # Start session 1
    cd ~/private/docs
    ls -laF | more
    cat foo.txt bar.txt zorch.txt > a.txt
    exit
    # End session 1
    # Start session 2
    cd ~/games/
    xquake &
    fg
    …
Corresponding form in the data set:
    **SOF**
    cd <1>            (one "file name" argument)
    ls -laF | more
    cat <3> > <1>     (three "file name" arguments, then one "file name" argument)
    exit
    **EOF**
    …
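To make the data format concrete, the sketch below turns a raw command line into the tokenized form shown in the example. It is a rough heuristic written for illustration only, not the procedure used to produce the data set: the name `tokenize_command`, the operator list, and the rule "any word that is not a command, a flag or a shell operator counts as a file name" are all assumptions.

```python
OPERATORS = {"|", ">", ">>", "<", "&", ";"}
STARTS_COMMAND = {"|", ";", "&"}  # the next word after these is a command name


def tokenize_command(line):
    """Replace each run of file-name arguments with <count>."""
    tokens = []
    run = 0                # consecutive file-name arguments seen so far
    expect_command = True  # the first word of a line is a command name

    def flush():
        nonlocal run
        if run:
            tokens.append(f"<{run}>")
            run = 0

    for word in line.split():
        if word in OPERATORS:
            flush()
            tokens.append(word)
            expect_command = word in STARTS_COMMAND
        elif expect_command:
            flush()
            tokens.append(word)
            expect_command = False
        elif word.startswith("-"):
            flush()
            tokens.append(word)  # flags are kept as they are
        else:
            run += 1             # assumed to be a file-name argument
    flush()
    return tokens


print(tokenize_command("cat foo.txt bar.txt zorch.txt > a.txt"))
# ['cat', '<3>', '>', '<1>']
```

A session would then be the concatenation of these token lists between the `**SOF**` and `**EOF**` markers.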
Experiments – UNIX command line sequences
9 files (users) containing from about 10,000 to 60,000 commands each.
1. Extracting patterns: a trie is created for each user (Pattern Library).
2. Classification algorithm: the sequence to classify (sequences of very different sizes) is classified in the class with the greatest value (result value).
3. Evaluating the result. Calculate:
  • the difference between the greatest value and the second greatest value (+)
  • the difference between the real classification value and the greatest value (-)
  (The greater the difference, the better the classification.)
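The evaluation in step 3 can be sketched as a signed margin. The sketch assumes that the positive difference is used when the class with the greatest value is the real class and the negative difference otherwise, which is how the (+) and (-) marks are read here; the function name `classification_margin` and the example values are hypothetical.

```python
def classification_margin(values, real_class):
    """Signed margin of a classification; at least two classes are required."""
    if len(values) < 2:
        raise ValueError("at least two classes are required")
    ranked = sorted(values, key=values.get, reverse=True)
    best, second = ranked[0], ranked[1]
    if best == real_class:
        # Correct: greatest value minus second greatest value (>= 0).
        return values[best] - values[second]
    # Incorrect: value of the real class minus the greatest value (<= 0).
    return values[real_class] - values[best]


# Made-up result values for three users.
values = {"user1": 5.0, "user2": 3.0, "user3": -1.0}
print(classification_margin(values, "user1"))  # 2.0
print(classification_margin(values, "user3"))  # -6.0
```

A large positive margin means the real class won clearly; a large negative margin means the real class was far behind the winner.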
Results – UNIX command line sequences
[Figure: UNIX commands classification for User 6. Axes: length of the sequence to classify; classification value (average of 25 simulation results).]
Results – UNIX command line sequences
[Figure: minimum length for classifying a UNIX computer user correctly. Axes: UNIX computer user (class); length of the sequence to classify.]
Conclusions
Limitations:
• A threshold must be found.
• Creating the tries takes a long time.
• Results depend on the length of the sub-sequences used to create the trie.
Strengths:
• Effective method to classify UNIX users.
• If a behavior can be represented by sequences, the proposed classification method can be used.
• If a new class is added, only its trie must be created (the others are not modified).
• This method could be used for other tasks: sequence prediction, sequence clustering…
• RoboCup Coach 2006 Competition (successful results).
Future Works
• Pattern Library: one trie for all classes (users).
• Classification method without threshold value.
• Analysis comparing our approach to others (HMMs).
Thank you!
Questions