Design of Hierarchical Classifiers for Efficient and Accurate Pattern Classification

M N S S K Pavan Kumar • Advisor: Dr. C. V. Jawahar

Pattern Classification • Given a sample x • Find the label corresponding to it • A classifier is an algorithm that takes x and returns a label between 1 and N • Binary classification: N = 2 • Multiclass classification: N > 2 • Performance is usually evaluated as the probability of correct classification

Multiclass Classification • Many standard approaches • Neural Networks, Decision Trees • Direct extensions • Combinations of component classifiers

Decision Directed Acyclic Graph • [Figure: 5-class DDAG evaluated on a sample x from class 3. Root node (1,5); second level (2,5), (1,4); third level (3,5), (2,4), (1,3); fourth level (4,5), (3,4), (2,3), (1,2); leaves 5, 4, 3, 2, 1.]

Decision Directed Acyclic Graph • [Figure: the same DDAG evaluated on a sample x from class 5.]

Decision Directed Acyclic Graph • [Figure: the same DDAG evaluated on a sample x from class 4.]

Decision Directed Acyclic Graph • There are multiple paths • [Figure: the same DDAG.]

Decision Directed Acyclic Graph • A DDAG can be improved by improving individual nodes • Architecture is fixed for a given sequence of classes • [Figure: the same DDAG.]

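Decision Directed Acyclic Graph • A DDAG can be improved by improving individual nodes • A DDAG can be improved by changing the class order • [Figure: DDAG after the class order is changed. Root node (3,5); second level (2,5), (3,4); third level (1,5), (2,4), (3,1); fourth level (4,5), (1,4), (2,1), (3,2); leaves 5, 4, 1, 2, 3.]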
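The DDAG traversal shown in these figures can be sketched as list elimination: each node compares the first and last classes still in the list and drops the loser, so a label is reached after N-1 pairwise decisions. This is a minimal sketch, not the author's implementation; `ddag_classify` and the toy pairwise rule `nearest_of_two` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Sequence


def ddag_classify(x, class_order: Sequence[int],
                  pairwise: Callable[[object, int, int], int]) -> int:
    """Evaluate a DDAG by list elimination.

    Each node tests the first class against the last class still in the
    list and removes the one that the pairwise classifier rejects.
    """
    remaining = list(class_order)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = pairwise(x, first, last)
        if winner == first:
            remaining.pop()       # the last class is rejected
        else:
            remaining.pop(0)      # the first class is rejected
    return remaining[0]


def nearest_of_two(x: float, a: int, b: int) -> int:
    """Toy pairwise classifier (assumption): class k is centred at x = k."""
    return a if abs(x - a) <= abs(x - b) else b


# Same sample, two class orders: (1,5) at the root versus (3,5) at the root.
label_default = ddag_classify(3.2, [1, 2, 3, 4, 5], nearest_of_two)
label_reordered = ddag_classify(3.2, [3, 2, 1, 4, 5], nearest_of_two)
print(label_default, label_reordered)  # 3 3
```

With error-free pairwise classifiers the class order does not change the answer; the order only matters once individual nodes make mistakes, which is what the reordering slides address.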

Features at Each Node • Image as features • Large number of features in computer vision problems • Principal Component Analysis (PCA) • Project the data onto the axes that preserve maximum variance • PCA is good for representation but not for discrimination

Features at Each Node • Pairwise Linear Discriminant Analysis (LDA) is more effective • Fisher Linear Discriminant, Optimal Discriminant Vectors • Large number of feature extractions • Large number of matrices to be stored • LDA performs better, but is computationally expensive
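As an illustration of what one pairwise discriminant involves, the following is a minimal sketch of the standard two-class Fisher direction, w proportional to S_w^{-1}(m_a - m_b). The function name `fisher_direction` and the small regularisation term `reg` are assumptions added for this sketch; the slides do not specify how the pairwise LDA was computed.

```python
import numpy as np


def fisher_direction(X_a: np.ndarray, X_b: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """Unit-length Fisher discriminant direction for two classes.

    X_a, X_b: arrays of shape (num_samples, num_features).
    reg: small ridge term (assumption) so that S_w is always invertible.
    """
    mean_a, mean_b = X_a.mean(axis=0), X_b.mean(axis=0)
    centered_a, centered_b = X_a - mean_a, X_b - mean_b
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_w = centered_a.T @ centered_a + centered_b.T @ centered_b
    S_w += reg * np.eye(S_w.shape[0])
    w = np.linalg.solve(S_w, mean_a - mean_b)
    return w / np.linalg.norm(w)


# Toy data (assumption): two Gaussian blobs separated along the first axis.
rng = np.random.default_rng(0)
X_a = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(200, 2))
X_b = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
w = fisher_direction(X_a, X_b)
print((X_a @ w).mean() > (X_b @ w).mean())  # True
```

A pairwise scheme needs one such computation for each of the N(N-1)/2 class pairs, which is the storage and extraction cost the slide refers to.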

Solution • [Figure: 4-class DDAG with nodes (1,4); (2,4), (1,3); (3,4), (2,3), (1,2); leaves 4, 3, 2, 1. Successive slides attach a transformation Mij to node (i,j): M14, then M23, then M34 and M12.]

Solution • [Figure: 4-class DDAG with every node (i,j) labelled with its transformation Mij: M12, M13, M14, M23, M24, M34.] • 4 classes, 6 classifiers, 6 dimensionality reductions • Total number of features extracted: (N-1) * reduced_dimension

Solution • [Figure: 4-class DDAG with every node (i,j) labelled with its transformation Mij.] • Example: 400 classes and 400 features reduced to 50 • Results in 3,990,000 projections overall (79,800 pairwise nodes × 50), and 19,950 for a single evaluation of the DDAG (399 nodes × 50)
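The counts in this example can be checked with a few lines of arithmetic. The sketch assumes that every one of the N(N-1)/2 pairwise nodes stores its own `reduced_dim` projection vectors and that a single evaluation visits N-1 nodes, following the slide's formula (N-1) * reduced_dimension; `projection_counts` is a hypothetical helper name.

```python
def projection_counts(num_classes: int, reduced_dim: int):
    """Return (pairwise nodes, projections stored, projections per evaluation)."""
    num_nodes = num_classes * (num_classes - 1) // 2   # one node per class pair
    stored = num_nodes * reduced_dim                   # every node keeps its own projections
    per_evaluation = (num_classes - 1) * reduced_dim   # a DDAG path visits N-1 nodes
    return num_nodes, stored, per_evaluation


print(projection_counts(4, 50))    # (6, 300, 150)
print(projection_counts(400, 50))  # (79800, 3990000, 19950)
```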

Solution • [Figure: 4-class DDAG with every node (i,j) labelled with its transformation Mij.] • LDA is effective, but highly complex in space and time

Solution • [Figure: the six pairwise transformations M12, M13, M14, M23, M24, M34 collected from the DDAG nodes.] • Stack all the transformations into a single matrix M = [M12; M13; M14; M23; M24; M34] • This matrix is rank deficient: it has many similar rows • Use a reduced representation • Clustering, SVD, etc. may be used

Remarks • Feature extraction is done only once • Results in a reduced LDA matrix that retains the discriminant capacity
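One way to obtain such a reduced representation is a truncated SVD of the stacked matrix, which is one of the options the slides name. The sketch below is an assumption about how this could be done, not the author's procedure; `shared_projection` and the synthetic pairwise matrices are hypothetical.

```python
import numpy as np


def shared_projection(pairwise_mats, rank: int):
    """Stack pairwise transformations and keep the top right-singular vectors.

    pairwise_mats: list of arrays, each of shape (reduced_dim, num_features).
    Returns (basis of shape (rank, num_features), all singular values).
    """
    stacked = np.vstack(pairwise_mats)
    _, singular_values, vt = np.linalg.svd(stacked, full_matrices=False)
    return vt[:rank], singular_values


# Synthetic example (assumption): six 2 x 10 pairwise matrices whose rows
# all lie in the same 3-dimensional subspace, so the stack has rank 3.
rng = np.random.default_rng(1)
common_rows = rng.normal(size=(3, 10))
mats = [rng.normal(size=(2, 3)) @ common_rows for _ in range(6)]

basis, singular_values = shared_projection(mats, rank=3)

# Features are extracted once with `basis`; each node then only needs the
# small matrix M @ basis.T to recover its own pairwise projection.
x = rng.normal(size=10)
shared_features = basis @ x
node_features = (mats[0] @ basis.T) @ shared_features
print(np.allclose(node_features, mats[0] @ x))  # True
```

In this synthetic case the reduction is exact because the stack has exact rank 3; on real data the trailing singular values would be small rather than zero, and the choice of `rank` trades accuracy for cost.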

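Motivating Example • Priors: {0.3, 0.1, 0.2, 0.4} • All classifiers are 90% correct • [Figure: 4-class DDAG with nodes (1,4); (2,4), (1,3); (3,4), (2,3), (1,2), shown before and after reordering; leaf rows 2, 1, 4, 3 and 1, 4, 3, 2.] • Expression shown with the figure: $0.3(0.9)^3 + 0.1(0.5)(0.9)^2 + 0.2(0.5)(0.9)^2 + 0.4(0.9)^3$ • Accuracy before reordering: 80.28% • Accuracy after reordering: 88.92% • 43.8% reduction in error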
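The quoted error reduction follows directly from the two accuracies on this slide, taking the error as 100% minus the accuracy:

```latex
\[
\frac{(100 - 80.28) - (100 - 88.92)}{100 - 80.28}
  = \frac{19.72 - 11.08}{19.72}
  = \frac{8.64}{19.72}
  \approx 0.438
\]
```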

Formulation • [Figure: 4-class DDAG with nodes (1,4); (2,4), (1,3); (3,4), (2,3), (1,2); leaves 2, 1, 4, 3.] • Number of classes = N • Optimal Priors = Pi • Errors = q (at each node) • Prefer central positions in the list for high-prior classes • Relevant path length = max(N – i, i – 1) • Number of relevant paths of length l to node r = Nrl • Maximize
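The effect of list position on the relevant path length can be tabulated directly from the expression max(N - i, i - 1) on this slide. The sketch only evaluates that expression; `relevant_path_length` is a hypothetical helper name, and i is taken to be the 1-based position of a class in the ordered list.

```python
def relevant_path_length(position: int, num_classes: int) -> int:
    """Evaluate max(N - i, i - 1) for the class at 1-based list position i."""
    return max(num_classes - position, position - 1)


lengths = {i: relevant_path_length(i, 4) for i in range(1, 5)}
print(lengths)  # {1: 3, 2: 2, 3: 2, 4: 3}
```

Central positions give shorter relevant paths, so fewer pairwise decisions can wrongly reject the true class, which is why high-prior classes are preferred there.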

Disadvantage of a DDAG • DDAG can provide only a class label • New DDAG classification protocol proposed • Previous formulation is insufficient

Maximizing DDAG Accuracy • [Figure: DDAG with nodes (1,4); (2,4), (1,3); (3,4), (2,3), (1,2), with general node indices i and j.]

DDAG design is NP-Hard • Optimal Decision Tree is NP-Hard • DAG Design is reducible to Optimal Decision Tree • Approximate algorithms are the only resort

Proposed Algorithms • Three greedy algorithms • Prefer high-prior classifiers to be at the center of the DDAG • Prefer high-performance classifiers to be the root nodes of the DDAG • Prefer high-error classes to be at the center of the DDAG • Empirical results show that the approximation error is close to half that of the optimal graph
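The first heuristic can be illustrated with a simple center-out placement: sort classes by prior and fill the list from both ends inwards, so the highest priors land in the middle. This is only a sketch of the idea under that assumption, not one of the three algorithms from the thesis; `center_out_order` is a hypothetical name.

```python
def center_out_order(priors: dict) -> list:
    """Order classes so that low-prior classes sit at the ends of the list.

    priors: mapping from class label to prior probability.
    """
    ascending = sorted(priors, key=priors.get)   # lowest prior first
    left, right = [], []
    for k, label in enumerate(ascending):
        # Alternate between the left end and the right end, moving inwards.
        (left if k % 2 == 0 else right).append(label)
    return left + right[::-1]


order = center_out_order({1: 0.3, 2: 0.1, 3: 0.2, 4: 0.4})
print(order)  # [2, 1, 4, 3]
```

For the priors {0.3, 0.1, 0.2, 0.4}, the two most probable classes (4 and 1) end up in the central positions.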

Complexities of Classification

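Binary Hierarchical Classifiers • [Figure: binary tree over 5 classes. Root: {1,4,5} vs {2,3}; second level: {4} vs {1,5} and {2} vs {3}; third level: {1} vs {5}; leaves 4, 1, 5, 2, 3.]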
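Evaluating a binary hierarchical classifier amounts to walking the tree from the root to a leaf. The sketch below encodes the tree on this slide ({1,4,5} vs {2,3}, then {4} vs {1,5} and {2} vs {3}, then {1} vs {5}); the `Node` class, the 1-D thresholds and the implied class positions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    decide: Callable[[float], bool]   # True sends the sample to the left child
    left: Union["Node", int]          # a child node or a class label (leaf)
    right: Union["Node", int]


def bhc_classify(x: float, node: Union[Node, int]):
    """Walk the tree; return (label, number of binary classifiers evaluated)."""
    steps = 0
    while isinstance(node, Node):
        node = node.left if node.decide(x) else node.right
        steps += 1
    return node, steps


# Toy 1-D thresholds (assumption): classes 2, 3, 4, 1, 5 lie left to right.
tree = Node(
    decide=lambda x: x > 2.0,                       # {1,4,5} vs {2,3}
    left=Node(lambda x: x < 4.0, 4,                 # {4} vs {1,5}
              Node(lambda x: x < 5.5, 1, 5)),       # {1} vs {5}
    right=Node(lambda x: x < 0.5, 2, 3),            # {2} vs {3}
)

print(bhc_classify(5.1, tree))  # (1, 3)
print(bhc_classify(0.9, tree))  # (3, 2)
```

The number of classifiers evaluated equals the depth of the leaf reached, which is where the size and speed advantage over a DDAG comes from.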

Graph Partitioning • [Figure: data and similarity graph for classes 1 to 5, with candidate root-node partitions {1,2,4,5} vs {3} and {1,4} vs {2,3,5}.] • Objective: maximize the cut • Objective: compact clusters • We prefer linear cuts • We prefer linear cuts with large margin • None of the partitioning schemes is universally good for all problems (No Free Lunch Theorem)

Graph Partitioning • [Figure: data and graph for classes 1 to 5.] • Simple workaround: use locally best partitions

Margin Improvement • [Figure: classes 1 to 5 with the original margin, and the improved margin after a class is removed.] • Don’t insist on mutually exclusive partitions • Let some classes be present on both sides

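Trees with Overlapping Partitions • [Figure: tree over 6 classes with overlapping partitions. Node labels: root 1,2 – 3 – 4,5,6; children 1,2 – 3 and 3,4 – 5 – 6; lower nodes 1,2; 3,4 – 5; 5,6; 3,4; leaves 1 to 6, with classes 3 and 5 appearing under more than one node.]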

Comments • The complexity remains O(log(N)) • Different criterion for removing bad classes

Configurable Hybrid Classifiers • DDAG: high accuracy, large size • BHC: moderate accuracy, small size • Take advantage of both • If “classification” is easy, use a BHC; otherwise use a DDAG

Results on OCR datasets

Classifiability • Use expected error to select appropriate classifiers • How easy or difficult is it to classify a set of classes? • Computable from co-occurrence matrices • We proposed a pairwise classifiability measure • $L_{\text{pairwise}} = \frac{2}{N(N-1)} \sum_{i<j} L_{ij}$
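The pairwise measure is an average of the per-pair values L_ij. The sketch below assumes the sum runs over the N(N-1)/2 unordered pairs i < j, which is what the 2/(N(N-1)) normaliser suggests, and that the L_ij are already available as a symmetric matrix; how each L_ij is derived from co-occurrence matrices is not shown here.

```python
def pairwise_classifiability(L) -> float:
    """Average L[i][j] over all unordered class pairs i < j."""
    n = len(L)
    total = sum(L[i][j] for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))


# Hypothetical per-pair values for three classes (diagonal entries unused).
L = [
    [0.0, 0.9, 0.6],
    [0.9, 0.0, 0.3],
    [0.6, 0.3, 0.0],
]
score = pairwise_classifiability(L)  # (0.9 + 0.6 + 0.3) / 3, about 0.6
print(round(score, 3))
```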

Generalization Capacity of Proposed Algorithms • The probability of error that a classifier makes on unseen samples is called the generalization error • Large margin • Better features in a DDAG • Better partitions in a BHC • Use a classifier of the required complexity at each step (Occam’s Razor) • Efficient feature representations require less complex classifiers • Simpler partitions in a BHC require less complex classifiers • Architecture-level generalization • Hybrid classifiers use architectures of the required complexity at each node, thereby improving generalization • We have demonstrated the generalization of the algorithms empirically

Conclusions • Formulation, Analysis and Algorithms are presented • to design DDAGs using robust feature representations • to design DDAGs using node-reordering • to design Hierarchical classifiers with better generalization • to design Hybrid hierarchical classifiers

Future Work • Design based on simple algorithms may improve the current “high-performance” classifiers • Promising directions • Feature based partitioning vs Class based partitioning • Trees with overlapping partitions • Efficient DDAG design algorithms • Configurability in classifier design

Thank You
