Design of Hierarchical Classifiers for Efficient and Accurate Pattern Classification
M N S S K Pavan Kumar
Advisor: Dr. C. V. Jawahar
Pattern Classification
• Given a sample x, find the label corresponding to it
• A classifier is an algorithm that takes x and returns a label between 1 and N
• Binary classification: N = 2
• Multiclass classification: N > 2
• Performance is usually evaluated as the probability of correct classification
Multiclass Classification
• Many standard approaches: Neural Networks, Decision Trees
• Direct extensions
• Combinations of component classifiers
Decision Directed Acyclic Graph (DDAG)
• [Figure: a DDAG over classes 1–5; the root is the 1-vs-5 node, the internal nodes are the remaining pairwise classifiers (2,5), (1,4), (3,5), (2,4), (1,3), (4,5), (2,3), (1,2), (3,4), and the leaves are the class labels 1–5]
• Each evaluation of a sample x follows one root-to-leaf path (e.g. a sample from class 3, 4, or 5 ends at the corresponding leaf)
• There are multiple paths to the same leaf
• A DDAG can be improved by improving individual nodes
• The architecture is fixed for a given sequence of classes
• A DDAG can also be improved by changing the class order (an evaluation sketch follows)
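As a minimal sketch of the evaluation procedure described above (not the thesis implementation), the following function walks a DDAG over N classes using N − 1 pairwise decisions; `pairwise_decide` is a hypothetical callable that returns the winning class of the pair it is given.

```python
def ddag_predict(x, classes, pairwise_decide):
    """Evaluate a DDAG: keep a list of candidate classes and let the
    (first, last) pairwise classifier eliminate one end at every node."""
    remaining = list(classes)                  # e.g. [1, 2, 3, 4, 5]
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]
        winner = pairwise_decide(x, i, j)      # hypothetical i-vs-j classifier
        if winner == i:
            remaining.pop()                    # class j is eliminated
        else:
            remaining.pop(0)                   # class i is eliminated
    return remaining[0]                        # leaf reached after N - 1 decisions
```

Exactly N − 1 pairwise classifiers are evaluated per sample, regardless of which leaf is reached.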
Features at Each Node
• Images as features: computer vision problems involve a large number of features
• Principal Component Analysis (PCA): project the data onto the axes that preserve maximum variance
• PCA is good for representation, but not for discrimination
Features at Each Node
• Pairwise Linear Discriminant Analysis (LDA) is more effective: Fisher Linear Discriminant, Optimal Discriminant Vectors
• But it requires a large number of feature extractions and a large number of matrices to be stored
• LDA performs better, but is computationally expensive (a sketch of the pairwise discriminant follows)
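For illustration, a minimal sketch of the pairwise Fisher linear discriminant that could serve as the per-node projection; `Xi` and `Xj` are assumed to be sample matrices (rows are samples) for the two classes, and the small regularizer is an assumption, not something stated on the slides.

```python
import numpy as np

def pairwise_fisher_direction(Xi, Xj, reg=1e-6):
    """Fisher direction for one class pair: w is proportional to
    Sw^{-1} (mean_i - mean_j), maximizing between-class separation
    relative to the within-class scatter Sw."""
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    Sw = (np.cov(Xi, rowvar=False) * (len(Xi) - 1)
          + np.cov(Xj, rowvar=False) * (len(Xj) - 1))   # within-class scatter
    Sw += reg * np.eye(Sw.shape[0])                      # regularization (assumption)
    w = np.linalg.solve(Sw, mi - mj)
    return w / np.linalg.norm(w)
```

With N classes this yields N(N − 1)/2 such projections, which is exactly the storage and feature-extraction cost the next slides set out to reduce.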
Solution
• [Figure: a 4-class DDAG; each pairwise node (1,4), (2,4), (1,3), (3,4), (2,3), (1,2) has its own LDA transformation matrix M14, M24, M13, M34, M23, M12]
• 4 classes → 6 classifiers → 6 dimensionality reductions
• Total number of features extracted per evaluation: (N − 1) × reduced_dimension
• Example: 400 classes and 400 features reduced to 50 results in 399000 projections overall, and 19950 for a single DDAG evaluation
• LDA is effective, but highly complex in space and time
• Idea: stack all the transformations into one matrix M = [M12; M13; M14; M23; M24; M34]
• This stacked matrix is rank deficient and has many similar rows
• Use a reduced representation: clustering, SVD, etc. may be used
Remarks
• Only a one-time feature extraction per sample
• Results in a reduced LDA matrix that retains the discriminative capacity (a sketch of the stacking and reduction follows)
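A hedged sketch of the reduction step: stack every pairwise LDA matrix into a single matrix M and, because M is rank deficient, keep only a reduced basis, obtained here with an SVD (one of the options mentioned above). The matrix names, the `energy` threshold, and the choice of SVD over clustering are illustrative assumptions.

```python
import numpy as np

def stacked_projection(pairwise_mats, rank=None, energy=0.99):
    """Stack all pairwise LDA matrices (each k x d) into M, then replace M by
    the top right-singular vectors so only one projection is needed per sample."""
    M = np.vstack(pairwise_mats)                     # (num_pairs * k) x d, rank deficient
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    if rank is None:                                 # keep enough directions to cover `energy`
        rank = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy)) + 1
    return Vt[:rank]                                 # reduced (rank x d) projection matrix

# usage sketch: P = stacked_projection(mats); y = P @ x   # one projection per sample
```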
Motivating Example
• Priors: {0.3, 0.1, 0.2, 0.4}; all pairwise classifiers are 90% correct
• [Figure: two 4-class DDAGs built from the same classifiers, before and after reordering the class list; the expected accuracy is computed term by term from the priors and the per-node accuracy, with terms such as 0.3·(0.9)³ and 0.4·(0.9)³ for the end classes]
• Accuracy with the original order: 80.28%; after reordering: 88.92%
• A 43.8% reduction in error!
Formulation
• Number of classes = N; class priors = P_i; error at each node = q
• Relevant path length for the class at list position i: max(N − i, i − 1)
• N_rl = number of relevant paths of length l to node r
• Prefer central positions in the list for high-prior classes
• Objective: maximize the expected probability of correct classification over the choice of class order (a sketch follows)
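A sketch of one way to write the objective from the quantities listed above (class priors P_i, per-node error q, and relevant path length max(N − i, i − 1)); this is a reconstruction for illustration, not necessarily the exact formulation in the thesis.

```python
def expected_ddag_accuracy(priors_in_order, q):
    """Approximate accuracy of a DDAG for a given class order: a sample of the
    class at list position i (1-indexed) must survive max(N - i, i - 1)
    relevant pairwise decisions, each correct with probability 1 - q."""
    N = len(priors_in_order)
    return sum(p * (1 - q) ** max(N - i, i - 1)
               for i, p in enumerate(priors_in_order, start=1))

# e.g. compare expected_ddag_accuracy([0.3, 0.1, 0.2, 0.4], q=0.1) against a reordered list
```

Because central list positions have the shortest relevant paths, placing high-prior classes there increases this sum.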
Disadvantage of a DDAG
• A DDAG can provide only a class label
• A new DDAG classification protocol is proposed
• The previous formulation is insufficient for this setting
Maximizing DDAG Accuracy
• [Figure: a 4-class DDAG with two nodes i and j highlighted along an evaluation path]
DDAG Design is NP-Hard
• Constructing an optimal decision tree is known to be NP-hard
• DDAG design is closely related to optimal decision tree design, and inherits its hardness
• Approximate algorithms are therefore the only practical resort
Proposed Algorithms
• Three greedy algorithms:
  • Prefer high-prior classes at the centre of the DDAG class list (a sketch of this heuristic follows)
  • Prefer high-performance classifiers at the root of the DDAG
  • Prefer high-error classes at the centre of the DDAG
• Empirical results show that the approximation error is close to half that of the optimal graph
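As a sketch of the first heuristic (an illustrative instance, not the thesis algorithm verbatim): fill list positions from the centre outwards with classes in decreasing order of prior.

```python
def center_high_priors(priors):
    """Greedy ordering heuristic: high-prior classes get central positions,
    and hence the shortest relevant paths in the DDAG."""
    N = len(priors)
    by_prior = sorted(range(N), key=lambda c: priors[c], reverse=True)
    centre_out = sorted(range(N), key=lambda p: abs(p - (N - 1) / 2))
    order = [None] * N
    for cls, pos in zip(by_prior, centre_out):
        order[pos] = cls
    return order

# e.g. priors [0.3, 0.1, 0.2, 0.4] -> classes 3 and 0 end up at the two centre positions
```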
Binary Hierarchical Classifiers
• [Figure: an example BHC over classes 1–5; the root separates {1,4,5} from {2,3}, the next level separates {4} from {1,5} and {2} from {3}, and a final node separates {1} from {5}]
Graph Partitioning
• [Figure: a data/class similarity graph over classes 1–5 and two candidate root-node partitions, {1,2,4,5} vs {3} and {1,4} vs {2,3,5}]
• None of the partitioning schemes is universally good for all problems (No Free Lunch Theorem)
• We prefer linear cuts, in particular linear cuts with a large margin
• Candidate objectives: maximize the cut vs. obtain compact clusters
Graph Partitioning
• [Figure: the class graph alongside the corresponding data distribution over classes 1–5]
• Simple workaround: use locally best partitions (an illustrative partitioning sketch follows)
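For illustration only, a minimal spectral bipartition of a class graph, one possible way to obtain a locally good two-way split at a BHC node. It assumes the edge weights encode how confusable two classes are, so that a small cut keeps confusable classes on the same side; the slides weigh several objectives (maximizing the cut, compact clusters), so treat this as a sketch, not the proposed method.

```python
import numpy as np

def two_way_partition(class_graph):
    """Split the classes of a weighted, symmetric N x N class graph into two
    groups using the sign of the Fiedler vector of the graph Laplacian."""
    W = np.asarray(class_graph, dtype=float)
    L = np.diag(W.sum(axis=1)) - W                # graph Laplacian
    _, vecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    fiedler = vecs[:, 1]                          # second-smallest eigenvector
    left = [c for c in range(len(W)) if fiedler[c] < 0]
    right = [c for c in range(len(W)) if fiedler[c] >= 0]
    return left, right
```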
Margin Improvement
• [Figure: removing class 2 from one side of the partition improves the margin of the separating boundary]
• Do not insist on mutually exclusive partitions
• Let some classes appear on both sides
Trees with Overlapping Partitions
• [Figure: an example tree over classes 1–6 with overlapping partitions; the root splits 1,2 – 3 – 4,5,6 with class 3 on both sides, the next level splits 1,2 – 3 and 3,4 – 5 – 6, then 3,4 – 5 and 5,6, down to single-class leaves]
Comments
• The classification complexity remains O(log N)
• A different criterion is used for removing "bad" classes
Configurable Hybrid Classifiers
• DDAG: high accuracy, large size
• BHC: moderate accuracy, small size
• Take advantage of both: if the classification is easy, use a BHC; otherwise use a DDAG
Classifiability
• How easy or difficult is it to classify a set of classes?
• Use the expected error to select appropriate classifiers
• Computable from co-occurrence matrices
• We propose a pairwise classifiability measure: L_pairwise = 2/(N(N − 1)) Σ L_ij (a sketch follows)
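A sketch of the pairwise measure L_pairwise = 2/(N(N − 1)) Σ L_ij computed from a class co-occurrence (confusion) matrix; the particular choice of L_ij used here, the confused fraction of each class pair, is an illustrative assumption.

```python
import numpy as np

def pairwise_classifiability(confusion):
    """Average pairwise difficulty L_pairwise = 2 / (N (N - 1)) * sum_{i<j} L_ij,
    where L_ij is taken as the fraction of the pair's samples that are confused."""
    C = np.asarray(confusion, dtype=float)
    N = C.shape[0]
    total = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            pair_mass = C[i, i] + C[i, j] + C[j, i] + C[j, j]
            total += (C[i, j] + C[j, i]) / max(pair_mass, 1e-12)
    return 2.0 * total / (N * (N - 1))
```

In a hybrid design, a node whose classes score as easy under such a measure could be handled by a compact BHC split, while a difficult node could fall back to a DDAG.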
Generalization Capacity of the Proposed Algorithms
• The probability of error that a classifier makes on unseen samples is its generalization error
• Large margin: better features in a DDAG, better partitions in a BHC
• Use a classifier of the required complexity at each step (Occam's Razor): efficient feature representations and simpler BHC partitions require less complex classifiers
• Architecture-level generalization: hybrid classifiers use an architecture of the required complexity at each node, thereby improving generalization
• The generalization of the algorithms is demonstrated empirically
Conclusions
• Formulations, analysis, and algorithms are presented:
  • to design DDAGs using robust feature representations
  • to design DDAGs using node reordering
  • to design hierarchical classifiers with better generalization
  • to design hybrid hierarchical classifiers
Future Work
• Designs based on simple algorithms may improve the current "high-performance" classifiers
• Promising directions:
  • Feature-based partitioning vs. class-based partitioning
  • Trees with overlapping partitions
  • Efficient DDAG design algorithms
  • Configurability in classifier design