Part III: The Nearest-Neighbor Classifier
Outline • Training and test data accuracy • The nearest-neighbor classifier
Classification Accuracy • Say we have N feature vectors • Say we know the true class label for each feature vector • We can measure how accurate a classifier is by how many feature vectors it classifies correctly • Accuracy = percentage of feature vectors correctly classified • training accuracy = accuracy on training data • test accuracy = accuracy on new data not used in training
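As a concrete illustration, here is a minimal MATLAB sketch of the accuracy calculation (the variable names predicted and truelabels are placeholders, not part of the assignment):

predicted  = [1; 2; 2; 1];                          % hypothetical classifier outputs for 4 feature vectors
truelabels = [1; 2; 1; 1];                          % hypothetical true class labels
accuracy   = 100 * mean(predicted == truelabels);   % percentage correctly classified (75 here)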
Training Data and Test Data • Training data • labeled data used to build a classifier • Test data • new data, not used in the training process, used to evaluate how well the classifier does on new data • Memorization versus Generalization • better training accuracy = better at "memorizing" the training data • better test accuracy = better at "generalizing" to new data • in general, we would like our classifier to perform well on new test data, not just on training data • i.e., we would like it to generalize well to new data
Examples of Training and Test Data • Speech Recognition • Training data: words recorded and labeled in a laboratory • Test data: words recorded from new speakers, new locations • Zipcode Recognition • Training data: zipcodes manually selected, scanned, labeled • Test data: actual letters being scanned in a post office • Credit Scoring • Training data: historical database of loan applications with payment history or decision at that time • Test data: you
Some Notation • Training Data • Dtrain = { [x(1), c(1)], [x(2), c(2)], … , [x(N), c(N)] } • N pairs of feature vectors and class labels • Feature Vectors and Class Labels: • x(i) is the ith training data feature vector • in MATLAB this could be the ith row of an N x d matrix • c(i) is the class label of the ith feature vector • in general, c(i) can take m different class values, e.g., c = 1, c = 2, ... • Let y be a new feature vector whose class label we do not know, i.e., we wish to classify it.
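In MATLAB, one way to hold this training data is sketched below (the variable names Xtrain, ctrain, and y, and the random placeholder values, are purely illustrative):

N = 100; d = 2; m = 2;                % hypothetical sizes: N vectors, d features, m classes
Xtrain = rand(N, d);                  % N x d matrix: row i is the feature vector x(i)
ctrain = randi(m, N, 1);              % N x 1 vector: ctrain(i) is the class label c(i)
y = rand(1, d);                       % a new feature vector whose class label is unknown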
Nearest Neighbor Classifier • y is a new feature vector whose class label is unknown • Search Dtrain for the closest feature vector to y • let this "closest feature vector" be x(j) • Classify y with the same label as x(j), i.e. • y is assigned label c(j) • How is the "closest" vector determined? • typically by minimum Euclidean distance (see the sketch below) • dE(x, y) = sqrt( Σi (xi − yi)² ) • Side note: this produces a "Voronoi tessellation" of the d-dimensional feature space • each training point "claims" the cell surrounding it • cell boundaries are polygons (polyhedra in higher dimensions) • Analogous to "memory-based" reasoning in humans
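A minimal MATLAB sketch of this rule, assuming the Xtrain, ctrain, and y defined above (it uses implicit expansion, so it needs MATLAB R2016b or later; this is only an illustration, not the assignment solution):

dists = sum((Xtrain - y).^2, 2);      % squared Euclidean distance from y to every row of Xtrain
[~, j] = min(dists);                  % index j of the closest training vector x(j)
predicted_label = ctrain(j);          % y is assigned the label c(j)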
Geometric Interpretation of Nearest Neighbor • [figure: training points labeled 1 and 2 plotted against Feature 1 and Feature 2]
Regions for Nearest Neighbors • Each data point defines a "cell" of space that is closest to it • All points within that cell are assigned that point's class • [figure: the labeled training points in the Feature 1 / Feature 2 plane]
Nearest Neighbor Decision Boundary • Overall decision boundary = union of the cell boundaries where the class decision is different on each side • [figure: the labeled training points in the Feature 1 / Feature 2 plane]
How should the new point be classified? • [figure: the labeled training points plus a new unlabeled point "?" in the Feature 1 / Feature 2 plane]
Local Decision Boundaries • Boundary? Points that are equidistant between points of class 1 and class 2 • Note: locally the boundary is (1) linear (because of Euclidean distance), (2) halfway between the two class points, and (3) at right angles to the connector between them • [figure: the labeled training points, the new point "?", and a local boundary segment in the Feature 1 / Feature 2 plane]
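Why the local boundary is linear: a point y equidistant from a class 1 point a and a class 2 point b satisfies ||y − a||² = ||y − b||²; expanding both sides gives 2(b − a)·y = ||b||² − ||a||², which is a linear equation in y. The resulting hyperplane is perpendicular to the connector b − a and passes through the midpoint (a + b)/2.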
Finding the Decision Boundaries • [sequence of figures: local boundaries constructed between neighboring points of different classes in the Feature 1 / Feature 2 plane]
Overall Boundary = Piecewise Linear • [figure: decision region for class 1 and decision region for class 2, separated by a piecewise linear boundary in the Feature 1 / Feature 2 plane]
Geometric Interpretation of kNN (k=1) • [figure: the labeled training points and the new point "?" in the Feature 1 / Feature 2 plane]
More Data Points • [figure: a larger set of training points labeled 1 and 2 in the Feature 1 / Feature 2 plane]
More Complex Decision Boundary • In general: the nearest-neighbor classifier produces piecewise linear decision boundaries • [figure: the more complex piecewise linear boundary for the larger data set]
K-Nearest Neighbor (kNN) Classifier • Find the k-nearest neighbors to y in Dtrain • i.e., rank the feature vectors according to Euclidean distance • select the k vectors which have smallest distance to y • Classification • ranking yields k feature vectors and a set of k class labels • pick the class label which is most common in this set (“vote”) • classify y as belonging to this class
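A sketch of the ranking-and-voting step in MATLAB, again using the hypothetical Xtrain, ctrain, and y from above and some odd k (illustrative only, not the assignment solution):

k = 3;                                  % hypothetical number of neighbors
dists = sum((Xtrain - y).^2, 2);        % squared Euclidean distances to all training vectors
[~, order] = sort(dists, 'ascend');     % rank training vectors by distance to y
kclasses = ctrain(order(1:k));          % class labels of the k nearest neighbors
predicted_label = mode(kclasses);       % majority vote: the most common label among the k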
K-Nearest Neighbor (kNN) Classifier • Theoretical Considerations • as k increases • we are averaging over more neighbors • the effective decision boundary is more “smooth” • as N increases, the optimal k value tends to increase in proportion to log N
K-Nearest Neighbor (kNN) Classifier • Notes: • In effect, the classifier uses the nearest k feature vectors from Dtrain to “vote” on the class label for y • the single-nearest neighbor classifier is the special case of k=1 • for two-class problems, if we choose k to be odd (i.e., k=1, 3, 5,…) then there will never be any “ties” • “training” is trivial for the kNN classifier, i.e., we just use Dtrain as a “lookup table” when we want to classify a new feature vector
K-Nearest Neighbor (kNN) Classifier • Extensions of the Nearest Neighbor classifier • weighted distances (see the sketch below) • e.g., give larger weights to more important features • e.g., give zero weight to irrelevant features • fast search techniques (indexing) to find the k nearest neighbors in d-space
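For instance, a weighted Euclidean distance between two feature vectors might look like this in MATLAB (the vectors x, y and weights w here are purely hypothetical):

x = [1 2 3]; y = [2 2 5];               % two hypothetical 1 x 3 feature vectors
w = [2 1 0];                            % hypothetical weights: feature 1 counts double, feature 3 is ignored
dw = sqrt(sum(w .* (x - y).^2));        % weighted Euclidean distance (here sqrt(2))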
Accuracy on Training Data versus Test Data • Training Accuracy = (1/n) Σ over Dtrain of I( o(i), c(i) ) • where I( o(i), c(i) ) = 1 if o(i) = c(i), and 0 otherwise • where o(i) is the output of the classifier for training feature vector x(i), and c(i) is the true class of x(i) • Let Dtest be a set of new data, unseen in the training process • but assume that Dtest is generated by the same "mechanism" that generated Dtrain • Test Accuracy = (1/ntest) Σ over Dtest of I( o(j), c(j) ) • Test accuracy is what we are really interested in • unfortunately, test accuracy is usually lower on average than training accuracy
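As a quick illustration, both quantities could be computed with the knn function from the assignment below, here with k = 1 (a sketch; the variable names follow the assignment's conventions):

train_pred = knn(traindata, trainlabels, 1, traindata);   % classify the training data itself
train_acc  = 100 * mean(train_pred == trainlabels);       % training accuracy (100% for k = 1, barring duplicate points)
test_pred  = knn(traindata, trainlabels, 1, testdata);    % classify new, unseen data
test_acc   = 100 * mean(test_pred == testlabels);         % test accuracy, usually lower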
Test Accuracy and Generalization • The accuracy of our classifier on new unseen data is a fair/honest assessment of the performance of our classifier • Why is training accuracy not good enough? • Training accuracy is optimistic • a classifier like nearest-neighbor can construct boundaries which always separate all training data points, but which do not separate new points • e.g., what is the training accuracy of kNN, k = 1? • A flexible classifier can “overfit” the training data • in effect it just memorizes the training data, but does not learn the general relationship between x and C
Test Accuracy and Generalization • Generalization • We are really interested in how our classifier generalizes to new data • test data accuracy is a good estimate of generalization performance
Assignment • 3 parts • classplot: plot classification data in two dimensions • knn: implement a nearest-neighbor classifier • knn_test: test the effect of the value of k on the accuracy of the classifier • Test data • knnclassifier.mat
Plotting Function

function classplot(data, x, y)
% function classplot(data, x, y)
%
% brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   data: a structure with the same fields as described above
%         (your comment header should describe the structure explicitly)
%   Note: if you only use certain fields of the structure in the function
%         below, you need only define those fields in the input comments

-------- Your code goes here --------
Nearest Neighbor Classifier

function [class_predictions] = knn(traindata, trainlabels, k, testdata)
% function [class_predictions] = knn(traindata, trainlabels, k, testdata)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   traindata:   N1 x d matrix of feature data (the "memory" for kNN)
%   trainlabels: N1 x 1 vector of class labels for traindata
%   k:           an odd positive integer, the number of neighbors to use
%   testdata:    N2 x d matrix of feature data for testing the kNN classifier
%
% Outputs
%   class_predictions: N2 x 1 vector of predicted class values

-------- Pseudocode --------
for each feature vector y in testdata
    kneighbors = the k nearest neighbors to y in traindata
    kclasses   = class labels of the kneighbors
    kvote      = the most common class value in kclasses
    predicted class of y = kvote
end
Accuracy of kNN Classifier as k is Varied

function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
% function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   traindata:   N1 x d matrix of feature data (the "memory" for kNN)
%   trainlabels: N1 x 1 vector of class labels for traindata
%   testdata:    N2 x d matrix of feature data for testing the kNN classifier
%   testlabels:  N2 x 1 vector of class labels for testdata
%   kmax:        an odd positive integer, the maximum number of neighbors to use
%   plotflag:    (optional argument) if 1, accuracy versus k is plotted; otherwise no plot
%
% Outputs
%   accuracies:  r x 1 vector of accuracies on testdata, where r is the
%                number of values of k that are tested

-------- Pseudocode --------
for k = 1, 3, 5, ..., kmax (odd values only)
    predictions = knn(traindata, trainlabels, k, testdata)
    accuracy for this k = 100 * (number of test points correctly classified) / (number of points in testdata)
end
if plotflag == 1, plot accuracy versus k
Summary • Important Concepts • classification is an important component in intelligent systems • a classifier = a mapping from feature space to a class label • decision boundaries = boundaries between classes • classification learning • using training data to define a classifier • the nearest-neighbor classifier • training accuracy versus test accuracy