Geometrical Complexity of Classification Problems
Tin Kam Ho, Bell Labs, Lucent Technologies
With contributions from Mitra Basu, Ester Bernado, Martin Law
Automatic Pattern Recognition
A fascinating theme. What will the world be like if machines can do it?
• Robots can see, read, hear, and smell.
• We can get rid of all spam email, and find exactly what we need from the web.
• We can protect everything with our signatures, fingerprints, iris patterns, or just our faces.
• We can track anybody by look, gesture, and gait.
• We can be warned of disease outbreaks and terrorist threats.
• Feed in our vital data and we will have a perfect diagnosis.
• We will know whom we can trust to lend our money.
Automatic Pattern Recognition
And more …
• We can tell the true emotions of everyone we encounter!
• We can predict stock prices!
• We can identify all potential criminals!
• Our weapons will never lose their targets!
• We will be alerted to any extraordinary events going on in the heavens, on or inside the Earth!
• We will have machines discovering all knowledge!
… But how far are we from these?
Automatic Pattern Recognition
[Diagram: samples, features]
Statistical classifiers:
• Bayesian classifiers
• polynomial discriminators
• nearest-neighbor methods
• decision trees & forests
• neural networks
• genetic algorithms
• support vector machines
• ensembles and classifier combination
Why are machines still far from perfect? What is still missing in our techniques?
Large Variations in Accuracies of Different Classifiers
[Figure: accuracy by classifier and by application]
Will I have any luck for my problem?
Many classifiers are in close rivalry with each other. Why?
• Do they represent the limit of our technology?
• What do the new classifiers add to the methodology?
• Is there still value in the older methods?
• Have they used up all the information contained in a data set?
When I face a new recognition task …
• How much can automatic classifiers do?
• How should I choose a classifier?
• Can I make the problem easier for a specific classifier?
Sources of Difficulty in Classification
• Class ambiguity
• Boundary complexity
• Sample size and dimensionality
Class Ambiguity
• Is the concept intrinsically ambiguous?
• Are the classes well defined?
• What information do the features carry?
• Are the features sufficient for discrimination?
Such intrinsic ambiguity determines the Bayes error, the floor below which no classifier's error can fall.
Boundary Complexity
• Kolmogorov complexity of the boundary: what is the shortest description?
• A trivial description lists all points & their class labels
• Its length can be exponential in dimensionality
• Is there a shorter description?
Classification Boundaries As Decided by Different Classifiers
[Figure: training samples for a 2D classification problem, feature 1 vs. feature 2]
Classification Boundaries Inferred by Different Classifiers
• XCS: a genetic-algorithm-based classifier
• Nearest-neighbor classifier
• Linear classifier
Match between Classifiers and Problems
[Figure: Problem A — XCS error = 0.6% (better) vs. NN error = 0.7%; Problem B — NN error = 0.06% (better) vs. XCS error = 1.9%. Each classifier wins on one problem.]
Measures of Geometrical Complexity of Classification Problems
Our approach: develop mathematical language and algorithmic tools for studying
• characteristics of the geometry & topology of high-dimensional data
• how they change with feature transformations, noise conditions, and sampling strategies
• how they interact with classifier geometry
Focus on descriptors computable from real data and relevant to classifier geometry.
Geometry of Data Sets and Classifiers
Data sets:
• length of class boundary
• fragmentation of classes / existence of subclasses
• global or local linear separability
• convexity and smoothness of boundaries
• intrinsic / extrinsic dimensionality
• stability of these characteristics as the sampling rate changes
Classifier models:
• polygons, hyper-spheres, Gaussian kernels, axis-parallel hyper-planes, piece-wise linear surfaces, polynomial surfaces, their unions or intersections, …
Measures of Geometric Complexity
Degree of Linear Separability
• find a separating hyper-plane by linear programming
• error counts and distances to the plane measure separability
Fisher's Discriminant Ratio
• classical measure of class separability
• maximize over all features to find the most discriminating one
Shapes of Class Manifolds
• cover same-class points with maximal balls
• ball counts describe the shape of the class manifold
Length of Class Boundary
• compute a minimum spanning tree
• count class-crossing edges
(A sketch of two of these measures follows.)
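To make two of these concrete, here is a minimal Python sketch (our illustration, not the authors' code) of Fisher's discriminant ratio maximized over features and the MST-based boundary measure; the function names and toy data are assumptions.

```python
# Illustrative sketch of two complexity measures; numpy/scipy assumed.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def fisher_ratio(X, y):
    """Fisher's discriminant ratio for two classes, maximized over features:
    max_j (mu1_j - mu2_j)^2 / (var1_j + var2_j)."""
    c1, c2 = X[y == 0], X[y == 1]
    num = (c1.mean(axis=0) - c2.mean(axis=0)) ** 2
    den = c1.var(axis=0) + c2.var(axis=0)
    return np.max(num / den)

def boundary_fraction(X, y):
    """Length of class boundary: fraction of minimum-spanning-tree edges
    joining points of different classes (works for any number of classes)."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    crossing = np.sum(y[mst.row] != y[mst.col])
    return crossing / (len(X) - 1)        # an n-point MST has n - 1 edges

# A well-separated problem vs. random labels on the same points:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_easy = (X[:, 0] > 0).astype(int)
y_rand = rng.integers(0, 2, size=200)
print(fisher_ratio(X, y_easy), boundary_fraction(X, y_easy))  # high ratio, few crossings
print(fisher_ratio(X, y_rand), boundary_fraction(X, y_rand))  # low ratio, many crossings
```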
Experiments with Controlled Data Sets
Real-world data sets: benchmarking data from the UC Irvine archive
• 844 two-class problems
• 452 are linearly separable, 392 non-separable
Synthetic data sets: random labeling of randomly located points
• 100 problems in 1–100 dimensions
Patterns in Complexity Measure Space
[Figure: problems plotted in complexity-measure space; legend: lin. sep, lin. nonsep, random]
Problem Distribution in 1st & 2nd Principal Components of Complexity Space
Interpretation of the First 4 Principal Components
• 50% of variance: linearity of the boundary and proximity of opposite-class neighbors
• 12% of variance: balance between within-class scatter and between-class distance
• 11% of variance: concentration & orientation of intrusion into the opposite class
• 9% of variance: within-class scatter
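A hedged sketch of how such a principal-component view of the complexity space can be produced; the `measures` matrix below is a random placeholder standing in for the real per-problem measure vectors (one row per problem, one column per measure).

```python
# Sketch: project per-problem complexity measures onto principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
measures = rng.normal(size=(844, 9))    # placeholder for the real measure table

pca = PCA(n_components=4)
coords = pca.fit_transform(measures)    # problem coordinates in PC space
print(pca.explained_variance_ratio_)    # compare with the ~50/12/11/9% above
# coords[:, 0] vs. coords[:, 1] gives the 1st & 2nd principal-component scatter
```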
Problem Distribution in 1st & 2nd Principal Components of Complexity Space
• Continuous distribution
• Known easy & difficult problems occupy opposite ends: linearly separable problems at one extreme, randomly labeled problems at the other
• Few outliers
• Empty regions
Questions for the Theoretician
• Is the distribution necessarily continuous?
• What caused the outliers?
• Will the empty regions ever be filled?
• How are the complexity measures related?
• What is the intrinsic dimensionality of the distribution?
Questions for the Practitioner
• Where does a particular problem fit in this continuum?
• Can I use this to guide feature selection & transformation?
• How do I set expectations on recognition accuracy?
• Can I use this to help choose classifiers?
Domains of Competence of Classifiers
• Given a classification problem, determine which classifier is the best for it
[Figure: regions of competence for LC, NN, XCS, and a decision forest in a plane of two complexity measures; a new problem ("Here is my problem!") lands at an unknown point marked "?"]
Domain of Competence Experiment
• Use a set of 9 complexity measures:
Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim
• Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable
• Evaluate 6 classifiers:
NN (1-nearest neighbor)
LP (linear classifier by linear programming)
Odt (oblique decision tree)
Pdfc (random subspace decision forest)
Bdfc (bagging-based decision forest)
XCS (a genetic-algorithm-based classifier)
(A sketch of the experimental loop follows.)
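The experimental loop might look like the following Python sketch. It is a reconstruction under stated assumptions, not the original code: scikit-learn provides no XCS or oblique decision tree, so stand-ins are substituted, and the use of 10-fold cross-validation is our assumption.

```python
# Sketch of a domain-of-competence experiment with stand-in classifiers.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

classifiers = {
    "nn":   KNeighborsClassifier(n_neighbors=1),
    "lp":   LinearSVC(),               # linear stand-in for the LP classifier
    "odt":  DecisionTreeClassifier(),  # axis-parallel stand-in for the oblique tree
    "pdfc": RandomForestClassifier(),  # stand-in for the subspace decision forest
    "bdfc": BaggingClassifier(DecisionTreeClassifier()),
}

def best_classifier(X, y):
    """Return the name of the classifier with the best cross-validated accuracy."""
    scores = {name: cross_val_score(clf, X, y, cv=10).mean()
              for name, clf in classifiers.items()}
    return max(scores, key=scores.get)

# For each of the 392 problems, pair its 9 complexity measures with the
# winner; the plots on the next slides show these pairs, two measures at a time.
```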
Classifier Domains of Competence
[Figure: best classifier for each benchmarking data set, plotted in complexity-measure space]
Best Classifier: nn, lp, odt vs. an Ensemble Technique
[Figure: winners plotted in three pairs of measures — Boundary vs. NonLinNN, IntraInter vs. Pretop, MaxEff vs. VolumeOverlap; legend: ensemble, nn/lp/odt, ensemble + nn/lp/odt]
Other Studies on Data Complexity
• Global vs. local properties
• Multi-class measures
• Intrinsic ambiguity & mislabeling
• Task trajectory with changing sampling & noise conditions
[Figure: the same problem viewed at k = 1 and k = 99]
Extension to Multiple Classes
• Fisher's discriminant score generalizes to multiple discriminant scores
• Boundary point in an MST: a point is a boundary point as long as it is adjacent to a point from another class in the MST
Global vs. Local Properties
• Boundaries can be simple locally but complex globally
• Such problems are relatively simple, yet the measures characterize them as complex
• Solution: compute the complexity measure at different scales
• This can be combined with different error levels
• Let N_i,k be the set of k neighbors of the i-th point, defined by, say, Euclidean distance. The complexity measure for data set D at error level ε, evaluated at scale k, is then the average local measure over all n points, minimized over relabelings within the error level:

c(D, ε, k) = (1/n) Σ_{i=1..n} min { f(N*) : h(N_i,k, N*) ≤ ε }

with f and h as defined two slides below (a code sketch follows).
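A minimal sketch of a scale-dependent measure, assuming one specific local f (the fraction of a point's k nearest neighbors carrying a different label) and ignoring the error level for simplicity:

```python
# Sketch: complexity at scale k = average disagreement within k-neighborhoods.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def complexity_at_scale(X, y, k):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: first hit is the point itself
    _, idx = nn.kneighbors(X)
    differs = y[idx[:, 1:]] != y[:, None]            # neighbors with a different label
    return differs.mean()

# Small k probes local structure; k near n probes the global picture.
# A boundary that is locally simple but globally complex scores low
# at small k and high at large k.
```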
Intrinsic Ambiguity
• The complexity measures can be severely affected when there is intrinsic class ambiguity (or data mislabeling)
• Example: FeatureOverlap (in 1D only) — see the sketch below
• The measures cannot distinguish between intrinsic ambiguity and a genuinely complex class boundary
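For concreteness, one plausible form of a 1D feature-overlap measure (the exact definition used in the study may differ) is the overlap of the two classes' value ranges, normalized by the total range; a single mislabeled extreme point stretches a class's range and inflates the score, which is exactly the sensitivity noted above.

```python
# Sketch (assumed form): overlap of two classes' ranges on one feature.
import numpy as np

def feature_overlap_1d(x, y):
    a, b = x[y == 0], x[y == 1]
    lo = max(a.min(), b.min())                       # overlap region: [lo, hi]
    hi = min(a.max(), b.max())
    span = max(a.max(), b.max()) - min(a.min(), b.min())
    return max(0.0, hi - lo) / span                  # 0 = disjoint, 1 = identical ranges
```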
Tackling Intrinsic Ambiguity
• Compute the complexity measure at different error levels
• f(D): a complexity measure on the data set D
• D*: a "perturbed" version of D, in which some points are relabeled
• h(D, D*): a distance measure between D and D* (the error level)
• The new complexity measure is defined as a curve:
c(ε) = min { f(D*) : h(D, D*) ≤ ε }
• The curve can be summarized by, say, the area under it
• Minimization by greedy procedures: discard the erroneous points that decrease complexity the most (sketch below)
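A greedy Python sketch of this curve; it relabels one point at a time (matching the definition of D* above, rather than discarding points), and any two-class measure can serve as f, e.g. boundary_fraction from the earlier sketch.

```python
# Sketch: greedily approximate c(eps) = min { f(D*) : h(D, D*) <= eps }.
def complexity_curve(X, y, f, max_frac=0.1):
    y = y.copy()                              # y: integer 0/1 labels
    n = len(y)
    curve = [(0.0, f(X, y))]
    for step in range(int(max_frac * n)):
        best_i, best_val = None, f(X, y)
        for i in range(n):                    # try flipping each label in turn
            y[i] ^= 1
            val = f(X, y)
            if val < best_val:
                best_i, best_val = i, val
            y[i] ^= 1                         # undo the trial flip
        if best_i is None:                    # no single flip lowers f: curve is flat
            break
        y[best_i] ^= 1                        # keep the best flip
        curve.append(((step + 1) / n, best_val))
    return curve                              # summarize e.g. by area under the curve
```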
Sampling Density
A problem may appear deceptively simple or complex with small samples.
[Figure: the same problem sampled with 2, 10, 100, 500, and 1000 points]
Real Problems Have a Mixture of Difficulties
• Sparse samples & complex geometry cause ill-posedness: we can't tell which interpretation is better!
• Resolving the ambiguity requires further hypotheses on the data geometry
To Conclude:
We have some early success in using geometrical measures to characterize classification complexity, pointing to a potentially fruitful research area.
Future Directions:
• More, better measures;
• Detailed studies of their utilities, interactions, and estimation uncertainty;
• Deeper understanding of constraints on these measures from point-set geometry and topology;
• Apply these to understand practical recognition tasks;
• Apply these to find transformations that simplify boundaries;
• Apply these to make better pattern recognition algorithms.
"Data Complexity in Pattern Recognition", M. Basu, T. K. Ho (eds.), Springer-Verlag, in press.