490 likes | 593 Views
Geometrical Complexity of Classification Problems. Tin Kam Ho Computing Sciences Research Center Bell Labs, Lucent Technologies With contributions from Mitra Basu, Ester Bernado, Martin Law. What Is the Story in this Image?. Automatic Pattern Recognition. A fascinating theme.
E N D
Geometrical Complexity of Classification Problems Tin Kam Ho Computing Sciences Research Center Bell Labs, Lucent Technologies With contributions from Mitra Basu, Ester Bernado, Martin Law
Automatic Pattern Recognition A fascinating theme. What will the world be like if machines can do it? • Robots can see, read, hear, and smell. • We can get rid of all spam email, and find exactly what we need from the web. • We can protect everything with our signatures, fingerprints, iris patterns, or just our face. • We can track anybody by look, gesture, and gait. • We can be warned of disease outbreaks, terrorist threats. • Feed in our vital data and we will have perfect diagnosis. • We will know whom we can trust to lend our money.
Automatic Pattern Recognition And more … • We can tell the true emotions of all our encounters! • We can predict stock prices! • We can identify all potential criminals! • Our weapons will never lose their targets! • We will be alerted about any extraordinary events going on in the heavens, on or inside the Earth! • We will have machines discovering all knowledge! … But how far are we from these?
Automatic Pattern Recognition Samples Supervised learning Unsupervised learning Feature vectors Feature vectors Feature Extraction Classifier Training Clustering classifier feature 2 Cluster Validation Classification Cluster Interpretation class membership feature 1
Supervised Learning:Many automatically trainable classifiers • Statistical Classifiers (for numerical vectors) • Bayesian classifiers, • polynomial discriminators, • nearest-neighbors, • decision trees & forests, • neural networks, • support vector machines, • ensemble methods, … Also: • Syntactic Classifiers (for symbol strings) • Structural Classifiers (for graphs) Feature 2 Feature 1
We have had some successes. But … • Why are we still far from perfect? • What is still missing in our techniques?
Large Variations in Accuracies of Different Classifiers classifier application Will I have any luck for my problem?
Many new classifiers are in close rivalry with each other. Why? • Do they represent the limit of our technology? • What have they added to the body of methods? • Is there still value in the older methods? • What is still missing in our techniques? When I face a new problem … • How much can automatic classifiers do? • How should I choose a classifier?
Sources of Difficultyin Classification • Class ambiguity • Boundary complexity • Sample size and dimensionality
Class Ambiguity • Is the concept intrinsically ambiguous? • Are the classes well defined? • What information do the features carry? • Are the features sufficient for discrimination?
Boundary Complexity • Kolmogorov complexity • Length can be exponential in dimensionality • Trivial description: list all points & class labels • Is there a shorter description?
Classification Boundaries As Decided by Different Classfiers Training samples for a 2D classification problem feature 2 feature 1
Classification Boundaries • XCS (a genetic algorithm): hyperrectangles Classification boundaries in feature space feature 2 feature 1
Classification Boundaries • XCS : hyperrectangles Classification boundaries in feature space feature 2 feature 1
Other Classifiers: Other Boundaries • Nearest neighbor : Voronoi cells Classification boundaries in feature space feature 2 feature 1
Other Classifiers: Other Boundaries • Nearest neighbor : Voronoi cells Classification boundaries in feature space feature 2 feature 1
Other Classifiers: Other Boundaries • Linear Classifier: hyperplanes Classification boundaries in feature space feature 2 feature 1
Match between Classifiers and Problems Problem A Problem B Better! Better! error=0.6% error=0.06% error=0.7% error=1.9% XCS NN NN XCS
Sampling Density Problem may appear deceptively simple or complex with small samples 2 points 10 points 100 points 500 points 1000 points
Measures of Geometrical Complexity of Classification Problems Our goal: tools and languages for studying • Characteristics of geometry & topology of high-dim data sets • How they change with feature transformations and sampling • How they interact with classifier geometry We want to know: • What are real-world problems like? • What is my problem like? • What can be expected of a method on a specific problem?
Building up the Language • Identify key features of data geometry that are relevant for classification • Develop algorithms to extract such features from a dataset • Describe patterns of classifier behavior in terms of geometrical features • … pattern recognition in pattern recognition problems • … classification of classification problems
Geometry of Datasets and Classifiers • Data sets: • length of class boundary • fragmentation of classes / existence of subclasses • global or local linear separability • convexity and smoothness of boundaries • intrinsic / extrinsic dimensionality • stability of these characteristics as sampling rate changes • Classifier models: • polygons, hyper-spheres, Gaussian kernels, axis-parallel hyper-planes, piece-wise linear surfaces, polynomial surfaces, their unions or intersections, …
Fisher’s Discriminant Ratio • Defined for one feature: means, variances of classes 1,2 • One good feature makes a problem easy • Take maximum over all features
Linear Separability • Studies from early literature: • Perceptrons, Perceptron Cycling Theorem, 1962 • Fractional Correction Rule, 1954 • Widrow-Hoff Delta Rule, 1960 • Ho-Kashyap algorithm, 1965 • Many algorithms stop only with positive conclusions. • But this one works: • Linear programming, 1968
Length of Class Boundary • Friedman & Rafsky 1979 • Find MST (minimum spanning tree) connecting all points regardless of class • Count edges joining opposite classes • Sensitive to separability and clustering effects
Shapes of Class Manifolds • Lebourgeois & Emptoz 1996 • Packing of same-class points in hyper-spheres • Thick and spherical, or thin and elongated manifolds
Data Sets: UCI • UC-Irvine machine learning archive • 14 datasets (no missing values, > 500 pts) • 844 two-class problems • 452 linearly separable • 392 linearly nonseparable • 2 - 4648 points each • 8 - 480 dimensional feature spaces
Data Sets: Random Noise • Randomly located and labeled points • 100 artificial problems • 1 to 100 dimensional feature spaces • 2 classes, 1000 points per class
Space of Complexity Measures Random noise Linearly separable
Patterns in Complexity Measure Spacelin.seplin.nonseprandom
Observations on Distributions of Problems • Noise sets and linearly separable sets occupy opposite ends in many dimensions • In-between positions tell relative difficulty • Fan-like structure in most plots • At least 2 independent factors, joint effects • Noise sets are far from real data • Ranges of complexity value for the noise sets show effects of apparent complexity under the influence of sampling density
k=99 k=1 Continuum of Difficulty Spanned by Problems of the Same Family • Study specific application domains, alternative problem formulations • Guide feature transformations • Compare classifiers’ domain of competence • Help in choosing methods • Set expectations on recognition accuracy Random labels
? LC Complexity measure 2 XCS Decision Forest NN Complexity measure 1 Domains of Competence of Classifiers • Given a classification problem, determine which classifier is the best for it Here is my problem !
ensemble methods ensemble methods ensemble methods Domain of Competence Experiment • Use a set of 9 complexity measures Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim • Characterize 392 two-class problems from UCI data All shown to be linearly non-separable • Evaluate 6 classifiers NN (1-nearest neighbor) LP (linear classifier by linear programming) Odt (oblique decision tree) Pdfc (random subspace decision forest) Bdfc (bagging based decision forest) XCS (a genetic-algorithm based classifier)
Best Classifier (Seen in Projection to Boundary-NonLinNN) Bdfc + LP X NN O XCS ⃟ Odt . Several
Best Classifier (Seen in Projection to IntraInter-Pretop) Bdfc + LP X NN O XCS ⃟ Odt . Several
Best Classifier (Seen in Projection to MaxEff-VolumeOverlap) Bdfc + LP X NN O XCS ⃟ Odt . Several
Worst Classifier Boundary-NonLinNN IntraInter-Pretop MaxEff-VolumeOverlap Bdfc + LP X NN O XCS ⃟ Odt . Several
Best Classifier Being nn,lp,odt vs an ensemble technique Boundary-NonLinNN IntraInter-Pretop MaxEff-VolumeOverlap ensemble + nn,lp,odt
Summary of Observations • 270 / 392 problems have a significantly best (dominant) classifier • 157 / 392 have a significant worst • 4 classifiers (NN, LP, Bdfc, XCS) are dominant for some problems • Almost all problems with a dominant classifier are best solved by either NN or LP • Simple methods (NN, LP) have extreme behavior: When conditions are favorable, they are optimal for the problem • Single decision trees (by Odt) have no domain of dominant competence: Significantly worst methods are almost always single decision trees (Odt), followed by LP and NN • Ensemble methods are in close rivalry for many problems; they are safe choices when boundary complexity is unknown
Extension to Multiple Classes • Fisher’s discriminant score Mulitple discriminant scores • Boundary point in a MST: a point is a boundary point as long as it is next to a point from other classes in the MST
Intrinsic Ambiguity • The complexity measures can be severely affected when there exists intrinsic class ambiguity (or data mislabeling) • Example: FeatureOverlap (in 1D only) • Cannot distinguish between intrinsic ambiguity or complex class decision boundary
Tackling Intrinsic Ambiguity • Compute the complexity measure at different error levels • f(D): a complexity measure on the data set D • D*: a “perturbed” version of D, so that some points are relabeled • h(D, D*): a distance measure between D and D* (error level) • The new complexity measure is defined as a curve: • The curve can be summarized by, say, area under curve • Minimization by greedy procedures • Discard erroneous points that decrease complexity by the most
Global vs. Local Properties • Boundaries can be simple locally but complex globally • These types of problems are relatively simple, but are characterized as complex by the measures • Solution: complexity measure at different scales • This can be combined with different error levels • Let Ni,k be the k neighbors of the i-th point defined by, say, Euclidean distance. The complexity measure for data set D, error level , evaluated at scale k is
Real Problems Have a Mixture of Difficulties E.g. Sparse Sample & Complex Geometry cause ill-posedness Can’t tell which is better! Requires further hypotheses on data geometry
To conclude: • We have some early success in using geometrical measures to characterize classification complexity, pointing to a potentially fruitful research direction. To continue: • More, better measures; • Detailed studies of their utilities, interactions, and estimation uncertainty; • Deeper understanding of constraints on these measures from point set geometry and topology; • Apply these to understand practical problems; • Apply these to make better pattern recognition algorithms.
Ongoing Applications • Simplifying class boundaries in MR spectra via feature reduction with Richard Baumgartner, Ray Somorjai et al. • Risk assessment of biometrical authentication with Anil Jain et al. “Data Complexity In Pattern Recognition”, M.Basu, T.K. Ho (eds.), Springer-Verlag, to appear, 2005 A collection of 15 papers about this subject