EE 7700 Pattern Classification
Classification Example
• Goal: Automatically classify incoming fish according to species, and send them to the respective packing plants.
• Features: Length, width, color, brightness, etc.
• Model: Sea bass have some typical length, and it is greater than that for salmon.
• Classifier: If the fish is longer than a threshold value l*, classify it as sea bass.
• Training Samples: To choose l*, make length measurements from training samples and inspect the results.
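A minimal sketch of this length-threshold classifier, in Python with NumPy. The training lengths and the midpoint rule for choosing l* are hypothetical illustrations, not values from the slides:

```python
import numpy as np

# Hypothetical training lengths for the two species (label them yourself
# in a real application by inspecting labeled fish).
salmon_lengths = np.array([2.0, 2.5, 3.1, 3.4, 2.8])
seabass_lengths = np.array([4.0, 4.6, 5.2, 3.9, 4.8])

# One simple way to pick l*: the midpoint between the two class means.
l_star = 0.5 * (salmon_lengths.mean() + seabass_lengths.mean())

def classify(length):
    """Decide 'sea bass' if the fish is longer than l*, else 'salmon'."""
    return "sea bass" if length > l_star else "salmon"

print(l_star, classify(4.2))
```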
Classification Example
Now we have two features to classify the fish: the lightness x1 and the width x2. Feature vector: x = [x1 x2]'. The feature extractor reduces the image of a fish to a feature vector x in a 2-D feature space. [Figure: training samples in the (x1, x2) plane with a decision boundary separating the two classes.]
Feature Extraction • The goal of the feature extractor is to characterize an object to be recognized by measurements whose values are very similar for objects in the same category, and very different for objects in different categories. • The features should be invariant to irrelevant transformations of the input. For example, the location of a fish on the belt is irrelevant, and thus the representation should be insensitive to the location of the fish.
Classification • The task of the classifier is to use feature vectors (provided by the feature extractor) to assign the object to a category. • Perfect classification is often impossible; a more general task is to determine the probability of each of the possible categories. • The process of using data to determine the classifier is referred to as training the classifier.
Classical Model
[Block diagram: Raw Data → Feature Extractor → (x1, x2, …, xd) → Classifier → Class 1 or 2 or … or c]
• We measure a fixed set of d features for an object that we want to classify. For example:
  • x1 = height
  • x2 = perimeter
  • ...
  • xd = average pixel intensity
Feature Vectors
• We can think of our feature set as a feature vector x, where x is the d-dimensional column vector x = [x1 x2 … xd]'.
• We can think of x as a point in a d-dimensional feature space.
• By this process of feature measurement, we can represent an object as a point in feature space.
[Figure: a point x plotted in a feature space with axes x1, x2, x3.]
What is ahead • Template matching • Minimum-distance classifiers • Metrics • Inner products • Linear discriminants • Bayesian approach
Template Matching • To classify one of the noisy characters, simply compare it to the two ‘templates’ on the left • Comparison can be done in many ways - here are two: • Count the number of places where the template and pattern agree. Pick the class that has the maximum number of agreements. • Count the number of places where the template and pattern disagree. Pick the class that has the smallest number of disagreements. • This may not work well when there is rotation, scaling, warping, occlusion, etc.
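As a sketch of the agreement-counting variant described above, assuming binary (0/1) images of equal size; the templates and noisy pattern below are hypothetical:

```python
import numpy as np

# Two class templates and one noisy input pattern (all hypothetical 3x3 binary images).
template_A = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [1, 1, 1]])
template_B = np.array([[1, 1, 0],
                       [1, 0, 1],
                       [1, 1, 0]])
noisy_pattern = np.array([[0, 1, 0],
                          [1, 0, 1],
                          [0, 1, 1]])

def agreements(pattern, template):
    """Number of pixels where pattern and template agree."""
    return int(np.sum(pattern == template))

scores = {"A": agreements(noisy_pattern, template_A),
          "B": agreements(noisy_pattern, template_B)}
print(scores, "->", max(scores, key=scores.get))  # pick the class with the most agreements
```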
Template Matching
[Figure: an input pattern f compared against a template g ("f = g?"); correlation-based matching is the most popular approach.]
Question: How can we achieve rotation invariance?
Minimum Distance Classifiers • Template matching can be expressed mathematically through a notion of distance. • Let x be the feature vector for the unknown input, and let m1, m2, ..., mc be templates (i.e., perfect, noise-free feature vectors) for the c classes. • The error in matching x against mk is given by || x - mk ||. • Choose the class for which the error is a minimum. • Since || x - mk || is the distance from x to mk, the technique is called minimum distance classification.
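A minimal minimum-distance classifier sketch in NumPy. The templates and the input vector are hypothetical 2-D examples:

```python
import numpy as np

# Hypothetical templates m_k (one noise-free feature vector per class).
templates = np.array([[4.3, 1.3],   # m1
                      [1.5, 0.3],   # m2
                      [3.0, 3.0]])  # m3
x = np.array([2.0, 0.5])            # unknown input feature vector

errors = np.linalg.norm(templates - x, axis=1)  # ||x - m_k|| for each class
predicted_class = int(np.argmin(errors)) + 1    # classes numbered 1..c
print(errors, predicted_class)
```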
Minimum Distance Classifiers
[Block diagram: the input x is compared against each template m1, m2, …, mc by a distance unit; a minimum selector outputs the class with the smallest distance.]
[Figure: templates m1, m2, m3 and a point x in feature space; the distance can be measured with the Euclidean metric or with the "sum of absolute values" (city-block) metric.]
x’ = [x1, x2, ….., xd] d x’y = x1 y1 + x2 y2 ….., xd yd = S xkyk k=1 x1 x2 x = • • • xd Euclidean Distance • x is a column vector of d features, x1, x2, ... , xd. • By using the transpose operator ' we can convert the column vector x to the row vector x': • The inner product of two column vectors x and y is defined by • Thus the norm of x (using the Euclidean metric) is given by || x || = sqrt( x' x )
Inner Products • Important additional properties of inner products: • x' y = y' x = || x || || y || cos( angle between x and y ) • x' ( y + z ) = x' y + x' z . • The inner product of x and y is maximum when the angle between them is zero, i.e., when one is just a positive multiple of the other. • Sometimes we say • that x' y is the correlation between x and y, and • that the correlation is maximum when x and y point in the same direction. • If x' y = 0, the vectors x and y are said to be orthogonal or uncorrelated.
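A short NumPy sketch of these definitions; the two vectors are hypothetical, chosen so that y is a positive multiple of x:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([6.0, 8.0])

inner = x @ y                    # x'y
norm_x = np.sqrt(x @ x)          # ||x|| = sqrt(x'x)
cos_angle = inner / (np.linalg.norm(x) * np.linalg.norm(y))

print(inner, norm_x, cos_angle)  # cos_angle is 1.0, since y points in the same direction as x
```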
Minimum Distance Classifiers Example: Let m1=[4.3 1.3]’ and m2=[1.5 0.3]’. Find the decision boundary.
Linear Discriminants
• For the minimum distance classifier, we chose the nearest class.
• Use the inner product to express the Euclidean distance from x to mk:
  ||x - mk||² = (x - mk)'(x - mk) = x'x - 2 mk'x + mk'mk = -2 [mk'x - 0.5 mk'mk] + x'x,
  where x'x is the same constant for every class and mk'mk is a constant for class k.
• To find the template mk that minimizes ||x - mk||, it is sufficient to find the mk that maximizes the bracketed term above.
• Define the linear discriminant function g(x) = mk'x - 0.5 ||mk||².
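A worked sketch of the earlier example (m1 = [4.3 1.3]', m2 = [1.5 0.3]') using these linear discriminants; the test point x is hypothetical:

```python
import numpy as np

m1 = np.array([4.3, 1.3])
m2 = np.array([1.5, 0.3])

def g(x, m):
    """Linear discriminant g(x) = m'x - 0.5 ||m||^2."""
    return m @ x - 0.5 * (m @ m)

# The boundary g1(x) = g2(x) reduces to (m1 - m2)'x = 0.5 (||m1||^2 - ||m2||^2),
# i.e. 2.8*x1 + 1.0*x2 = 8.92 for these templates (a straight line).
w = m1 - m2
b = 0.5 * (m1 @ m1 - m2 @ m2)
print(w, b)   # -> [2.8 1. ] 8.92

x = np.array([2.0, 1.0])
print("class 1" if g(x, m1) > g(x, m2) else "class 2")
```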
Min-Euclidean-Distance Classifier
[Block diagram: the input x feeds c linear discriminant units g1(x), g2(x), …, gc(x), built from the templates m1, m2, …, mc; a maximum selector outputs the winning class.]
• A minimum-Euclidean-distance classifier classifies an input feature vector x by computing c linear discriminant functions g1(x), g2(x), …, gc(x) and assigning x to the class corresponding to the maximum discriminant function.
Feature Scaling
• The numerical value of a feature x depends on the units used, i.e., on the scale.
• If x is multiplied by a scale factor a, both the mean and the standard deviation are multiplied by a.
• The variance is multiplied by a².
• Sometimes it is desirable to scale the data so that the resulting standard deviation is unity: divide x by the standard deviation s.
• Similarly, in measuring the distance from x to m, it often makes sense to measure it relative to the standard deviation.
Feature Scaling
• This suggests an important generalization of the minimum-Euclidean-distance classifier.
• Let x(i) be the value of Feature i, let m(i,j) be the mean value of Feature i for Class j, and let s(i,j) be the standard deviation of Feature i for Class j.
• In measuring the distance between the feature vector x and the mean vector mj for Class j, use the standardized distance
  r(x, mj)² = ((x1 - m1j)/s1j)² + ((x2 - m2j)/s2j)² + … + ((xd - mdj)/sdj)².
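A sketch of this standardized distance in NumPy; the class-j mean vector and per-feature standard deviations are hypothetical:

```python
import numpy as np

m_j = np.array([4.3, 1.3])   # mean of each feature for class j (hypothetical)
s_j = np.array([0.5, 0.1])   # standard deviation of each feature for class j (hypothetical)
x = np.array([4.0, 1.5])     # input feature vector

# r(x, m_j)^2: each deviation is divided by that feature's standard deviation.
r_squared = np.sum(((x - m_j) / s_j) ** 2)
print(r_squared)
```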
Covariance
• The covariance of two features measures their tendency to vary together, i.e., to co-vary.
• Where the variance is the average of the squared deviations of a feature from its mean, the covariance is the average of the products of the deviations of the feature values from their means.
• Consider Feature i and Feature j.
  • Let { x(1,i), x(2,i), …, x(n,i) } be a set of n examples of Feature i.
  • Let { x(1,j), x(2,j), …, x(n,j) } be a corresponding set of n examples of Feature j.
Covariance
• Let m(i) be the mean of Feature i, and m(j) be the mean of Feature j.
• Then the covariance of Feature i and Feature j is defined by
  c(i,j) = ( [x(1,i) - m(i)][x(1,j) - m(j)] + … + [x(n,i) - m(i)][x(n,j) - m(j)] ) / (n - 1).
• The covariance has several important properties:
  • If Feature i and Feature j tend to increase together, then c(i,j) > 0.
  • If Feature i tends to decrease when Feature j increases, then c(i,j) < 0.
  • If Feature i and Feature j are independent, then c(i,j) = 0.
  • |c(i,j)| <= s(i) s(j), where s(i) is the standard deviation of Feature i.
  • c(i,i) = s(i)², the variance of Feature i.
Covariance Matrix
• All of the covariances c(i,j) can be collected together into a covariance matrix C:
  C = [ c(1,1) c(1,2) … c(1,d)
        c(2,1) c(2,2) … c(2,d)
        …
        c(d,1) c(d,2) … c(d,d) ]
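A sketch of estimating this covariance matrix from n samples of d features, using the same (n - 1) normalization as the definition above; the data matrix is hypothetical:

```python
import numpy as np

X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2],
              [5.4, 3.9]])        # n = 4 samples, d = 2 features (hypothetical)

m = X.mean(axis=0)                 # per-feature means m(i)
C_manual = (X - m).T @ (X - m) / (X.shape[0] - 1)
C_numpy = np.cov(X, rowvar=False)  # same result; columns are treated as features

print(C_manual)
print(np.allclose(C_manual, C_numpy))  # True
```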
Covariance Matrix
• We need to normalize the distance. Recall what we did earlier to get a standardized distance for a single feature:
  r² = ((x - m)/s)² = (x - m)(1/s²)(x - m).
• What is the matrix generalization of this scalar equation? It is the "Mahalanobis distance":
  r² = (x - mx)' Cx^{-1} (x - mx).
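A minimal Mahalanobis-distance sketch; the mean vector and covariance matrix below are hypothetical (any symmetric positive-definite C would do):

```python
import numpy as np

m = np.array([4.3, 1.3])           # class mean (hypothetical)
C = np.array([[0.25, 0.03],
              [0.03, 0.01]])       # class covariance matrix (hypothetical)
x = np.array([4.0, 1.5])

diff = x - m
r_squared = diff @ np.linalg.inv(C) @ diff   # (x - m)' C^{-1} (x - m)
print(r_squared)
```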
Pros and Cons
• The use of the Mahalanobis distance removes several of the limitations of the Euclidean metric:
  • It automatically accounts for the scaling of the coordinate axes.
  • It corrects for correlation between the different features.
  • It can provide curved as well as linear decision boundaries.
• Cons:
  • Covariance matrices can be hard to determine accurately.
  • Memory and time requirements grow with the number of features.
Bayesian Decision Theory • Return to the fish example. There are two categories; denote them w1 for sea bass and w2 for salmon. • Assume that there is some prior probability (or simply prior) P(w1) that the next fish is sea bass, and some prior probability P(w2) that it is salmon. • Suppose that we must make a decision without making a measurement. The logical decision rule is: Decide w1 if P(w1) > P(w2); otherwise decide w2.
Bayesian Decision Theory • Suppose that we have a feature vector x; now the decision rule is: Decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2. • Using the Bayes formula, P(wi | x) = p(x | wi) P(wi) / p(x), where p(x) = Σ_{j} p(x | wj) P(wj).
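A sketch of this posterior computation for two classes, assuming the priors and the class-conditional densities evaluated at the observed x are already known (the numbers are hypothetical):

```python
import numpy as np

priors = np.array([0.6, 0.4])          # P(w1), P(w2) (hypothetical)
likelihoods = np.array([0.05, 0.20])   # p(x | w1), p(x | w2) at the observed x (hypothetical)

evidence = np.sum(likelihoods * priors)       # p(x)
posteriors = likelihoods * priors / evidence  # P(w_i | x) via the Bayes formula

decision = np.argmax(posteriors) + 1          # decide w1 or w2
print(posteriors, f"decide w{decision}")
```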
Bayesian Decision Theory • Define a set of discriminant functions gi(x), i = 1, …, c, and assign x to class wi if gi(x) > gj(x) for all j ≠ i. A natural choice is gi(x) = P(wi | x), or equivalently (since the logarithm is monotonic) gi(x) = ln p(x | wi) + ln P(wi).
Gaussian Density
• Univariate: p(x) = (1 / (sqrt(2π) σ)) exp( -(x - μ)² / (2σ²) ).
• Multivariate (d-dimensional): p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( -0.5 (x - μ)' Σ^{-1} (x - μ) ).
Gaussian Density
• The center of the cluster is determined by the mean vector, and the shape of the cluster is determined by the covariance matrix.
• The quantity (x - μ)' Σ^{-1} (x - μ) in the exponent is the "Mahalanobis distance" from x to the mean.
Discriminant Functions for Gaussian • Let us examine the discriminant function for normally distributed classes, p(x | wi) ~ N(μi, Σi):
  gi(x) = ln p(x | wi) + ln P(wi) = -0.5 (x - μi)' Σi^{-1} (x - μi) - (d/2) ln 2π - 0.5 ln |Σi| + ln P(wi).
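A sketch of this log-discriminant for one class; the mean, covariance matrix, and prior below are hypothetical parameters:

```python
import numpy as np

def g_i(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)' Sigma^{-1} (x-mu) - 0.5 d ln(2*pi) - 0.5 ln|Sigma| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    mahal = diff @ np.linalg.solve(Sigma, diff)   # Mahalanobis term without forming the inverse
    return (-0.5 * mahal
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

mu = np.array([4.3, 1.3])                 # hypothetical class mean
Sigma = np.array([[0.25, 0.03],
                  [0.03, 0.01]])          # hypothetical class covariance
print(g_i(np.array([4.0, 1.5]), mu, Sigma, prior=0.6))
```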
Discriminant Functions for Gaussian • Case I: Σi = σ²I (the features are statistically independent and each has the same variance σ²). Dropping the class-independent terms, the discriminant simplifies to gi(x) = -||x - μi||² / (2σ²) + ln P(wi); with equal priors this is just the minimum-Euclidean-distance classifier.
Discriminant Functions for Gaussian • Case I: As the priors change, the decision boundaries shift.
Discriminant Functions for Gaussian • Case I:
Discriminant Functions for Gaussian • Examples: Find the decision boundaries for 1-D and 2-D Gaussian data. The boundary is where the discriminants are equal, so solve for x from g1(x) = g2(x).
Parameter Estimation • We learned how we could design an optimal classifier if we knew the prior probabilities P(wi) and the class-conditional densities p(x|wi). • In a typical application, we rarely have complete knowledge. We typically have some general knowledge and a number of design samples (or training data). • We use the samples to estimate the unknown probabilities and probability densities, and then use these estimates as if they were true values. • If the densities could be parameterized, the problem is simplified significantly. (For example, for Gaussian distribution, mean and covariance matrix are the only parameters we need to estimate.)
Parameter Estimation • Gaussian case: given training samples x1, …, xn from a class, the maximum-likelihood estimates of the mean vector and the covariance matrix are
  m = (1/n) Σ_{k=1}^{n} xk and C = (1/n) Σ_{k=1}^{n} (xk - m)(xk - m)'.
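A sketch of these maximum-likelihood estimates on a hypothetical sample matrix X (n samples by d features):

```python
import numpy as np

X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2],
              [5.4, 3.9]])                      # hypothetical training samples from one class

n = X.shape[0]
mu_hat = X.mean(axis=0)                          # (1/n) * sum of x_k
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n    # ML estimate divides by n, not n - 1

print(mu_hat)
print(Sigma_hat)
```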
Dimensionality • The accuracy degrades when the dimensionality is large. • The dimensionality can be reduced by combining features. • Linear combinations are attractive because they are simple to compute and analytically tractable. • Dimensionality reduction techniques include • Principal Component Analysis • Fisher’s Discriminant Analysis
Principal Component Analysis (PCA) • Find a lower-dimensional space that best represents the data in a least-squares sense.
[Figure: data in the full N-dimensional space (here N = 2) projected onto a d-dimensional subspace (here d = 1). Figure credit: U. of Delaware.]
Principal Component Analysis (PCA) • We begin by considering the problem of representing the N-dimensional vectors x1, x2, …, xn by a single vector x0. • More specifically, we want to find a vector x0 such that the sum of squared distances between x0 and the xk is as small as possible. • Define the cost function to be minimized: J0(x0) = Σ_{k=1}^{n} ||x0 - xk||². • The solution is the sample mean: x0 = m = (1/n) Σ_{k=1}^{n} xk.
Principal Component Analysis (PCA) • The sample mean does not reveal any of the variability in the data. Let us now consider a solution of the form xk ≈ m + ak e, where ak is a scalar and e is a unit vector. • Define the cost function to be minimized: J1(a1, …, an, e) = Σ_{k=1}^{n} ||m + ak e - xk||². • The solution is ak = e'(xk - m), the projection of xk - m onto the direction e.
Principal Component Analysis (PCA) • What is the best direction e for the line? Using ak = e'(xk - m), we get
  J1(e) = -e' S e + Σ_{k=1}^{n} ||xk - m||², where S = Σ_{k=1}^{n} (xk - m)(xk - m)' is the scatter matrix.
• So we must find the e that maximizes e' S e, subject to ||e|| = 1.
Principal Component Analysis (PCA) • Using a Lagrange multiplier for the constraint ||e|| = 1, the solution is S e = λ e, where λ is the Lagrange multiplier; that is, e must be an eigenvector of the scatter matrix S. • Since e' S e = λ e'e = λ, we select the eigenvector corresponding to the largest eigenvalue.
Principal Component Analysis (PCA) • Generalize to d dimensions (d <= N): write xk ≈ m + Σ_{i=1}^{d} aki ei, and find the eigenvectors e1, e2, …, ed corresponding to the d largest eigenvalues of S.
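A sketch of this procedure in NumPy, following the derivation above (scatter matrix, eigendecomposition, keep the top-d eigenvectors); the data matrix X is randomly generated for illustration:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))  # n = 100 samples, N = 3 features (hypothetical)
d = 1                                                # target dimensionality

m = X.mean(axis=0)
S = (X - m).T @ (X - m)                # scatter matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric; eigenvalues in ascending order

E = eigvecs[:, ::-1][:, :d]            # the d eigenvectors with the largest eigenvalues
A = (X - m) @ E                        # coefficients a_k = e'(x_k - m)
X_approx = m + A @ E.T                 # best least-squares d-dimensional approximation

print(E.shape, A.shape, X_approx.shape)
```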
Face Recognition
[Figure: a gallery of face images and a probe image ("Probe ?") to be identified.]