The Nearest-Neighbor Classifier

Presentation Transcript


  1. The Nearest-Neighbor Classifier NASA Space Program

  2. Outline • Training and test data accuracy • The nearest-neighbor classifier

  3. Classification Accuracy • Say we have N feature vectors • Say we know the true class label for each feature vector • We can measure how accurate a classifier is by how many feature vectors it classifies correctly • Accuracy = percentage of feature vectors correctly classified • training accuracy = accuracy on training data • test accuracy = accuracy on new data not used in training
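
A minimal Python sketch of the accuracy measure on this slide (the slides themselves use MATLAB; the function name `accuracy` and the example labels are assumptions made for illustration):

```python
def accuracy(predicted, actual):
    """Percentage of labels in `predicted` that match `actual`."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    if not actual:
        raise ValueError("need at least one label")
    correct = sum(1 for p, c in zip(predicted, actual) if p == c)
    return 100.0 * correct / len(actual)


# Hypothetical labels for N = 5 feature vectors.
true_labels = [1, 2, 2, 1, 2]
classifier_output = [1, 2, 1, 1, 2]
print(accuracy(classifier_output, true_labels))  # 4 of 5 correct
```

The same function gives training accuracy when the labels come from the training data and test accuracy when they come from new data.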

  4. Training Data and Test Data • Training data • labeled data used to build a classifier • Test data • new data, not used in the training process, to evaluate how well a classifier does on new data • Memorization versus Generalization • better training accuracy: “memorizing” the training data • better test accuracy: “generalizing” to new data • in general, we would like our classifier to perform well on new test data, not just on training data, i.e., we would like it to generalize to new data

  5. Examples of Training and Test Data • Speech Recognition • Training data: words recorded and labeled in a laboratory • Test data: words recorded from new speakers, new locations • Zipcode Recognition • Training data: zipcodes manually selected, scanned, labeled • Test data: actual letters being scanned in a post office • Credit Scoring • Training data: historical database of loan applications with payment history or decision at that time • Test data: you

  6. Some Notation • Training Data • Dtrain = { [x(1), c(1)], [x(2), c(2)], …, [x(N), c(N)] } • N pairs of feature vectors and class labels • Feature Vectors and Class Labels: • x(i) is the ith training data feature vector • in MATLAB this could be the ith row of an N x d matrix • c(i) is the class label of the ith feature vector • in general, c(i) can take m different class values, e.g., c = 1, c = 2, ... • Let y be a new feature vector whose class label we do not know, i.e., we wish to classify it.

  7. Nearest Neighbor Classifier • y is a new feature vector whose class label is unknown • Search Dtrain for the closest feature vector to y • let this “closest feature vector” be x(j) • Classify y with the same label as x(j), i.e. • y is assigned label c(j) • How are “closest x” vectors determined? • typically use minimum Euclidean distance • $d_E(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$, where $x_i$ and $y_i$ are the ith components of x and y • Side note: this produces a “Voronoi tessellation” of the d-space • each point “claims” a cell surrounding it • in two dimensions, the cells are convex polygons • Analogous to “memory-based” reasoning in humans
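
The search described on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's MATLAB solution: Dtrain is represented as a list of (feature vector, label) pairs, and the names `euclidean`, `nearest_neighbor`, and the sample points are made up.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def nearest_neighbor(d_train, y):
    """Return the label c(j) of the training vector x(j) closest to y."""
    _, closest_label = min(d_train, key=lambda pair: euclidean(pair[0], y))
    return closest_label


# Hypothetical training set: two points of class 1, two of class 2.
d_train = [((1.0, 1.0), 1), ((2.0, 4.0), 1), ((5.0, 2.0), 2), ((6.0, 5.0), 2)]
print(nearest_neighbor(d_train, (4.5, 2.5)))  # closest point is (5.0, 2.0)
```

If two training vectors are exactly equidistant from y, `min` returns the one that appears first in the list; the slides do not specify a tie rule, so that choice is arbitrary.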

  8. Geometric Interpretation of Nearest Neighbor [Figure: scatter plot of training points labeled class 1 or class 2; axes: Feature 1, Feature 2]

  9. Regions for Nearest Neighbors • Each data point defines a “cell” of space that is closest to it • All points within that cell are assigned that class [Figure: scatter plot of class 1 and class 2 points; axes: Feature 1, Feature 2]

  10. Nearest Neighbor Decision Boundary • Overall decision boundary = union of cell boundaries where class decision is different on each side [Figure: scatter plot of class 1 and class 2 points; axes: Feature 1, Feature 2]

  11. How should the new point be classified? [Figure: scatter plot of class 1 and class 2 points with a new unlabeled point marked “?”; axes: Feature 1, Feature 2]

  12. Local Decision Boundaries • Boundary? Points that are equidistant between points of class 1 and 2 • Note: locally the boundary is (1) linear (because of Euclidean distance) (2) halfway between the 2 class points (3) at right angles to the line connecting them [Figure: scatter plot of class 1 and class 2 points with a new unlabeled point marked “?”; axes: Feature 1, Feature 2]
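
The three properties listed on this slide follow from setting the two squared Euclidean distances equal. A short derivation, where a and b are assumed names (not used in the slides) for one class-1 point and one class-2 point, and x is any point on the local boundary:

```latex
\begin{align*}
\lVert x - a \rVert^2 &= \lVert x - b \rVert^2 \\
x^\top x - 2 a^\top x + a^\top a &= x^\top x - 2 b^\top x + b^\top b \\
2 (b - a)^\top x &= b^\top b - a^\top a \\
(b - a)^\top \left( x - \tfrac{a + b}{2} \right) &= 0
\end{align*}
```

The last line is linear in x, is satisfied by the midpoint (a + b)/2, and has normal vector b - a, so the boundary is the perpendicular bisector of the segment joining the two points.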

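  13. Finding the Decision Boundaries [Figure: scatter plot of class 1 and class 2 points with a new unlabeled point marked “?”; axes: Feature 1, Feature 2]

  14. Finding the Decision Boundaries [Figure: scatter plot of class 1 and class 2 points with a new unlabeled point marked “?”; axes: Feature 1, Feature 2]

  15. Finding the Decision Boundaries [Figure: scatter plot of class 1 and class 2 points with a new unlabeled point marked “?”; axes: Feature 1, Feature 2]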

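  16. Overall Boundary = Piecewise Linear [Figure: scatter plot of class 1 and class 2 points with a new unlabeled point marked “?”; regions labeled “Decision Region for Class 1” and “Decision Region for Class 2”; axes: Feature 1, Feature 2]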

  17. Geometric Interpretation of kNN (k=1) [Figure: scatter plot of class 1 and class 2 points with a new unlabeled point marked “?”; axes: Feature 1, Feature 2]

  18. More Data Points [Figure: scatter plot with a larger number of class 1 and class 2 points; axes: Feature 1, Feature 2]

  19. More Complex Decision Boundary • In general: Nearest-neighbor classifier produces piecewise linear decision boundaries [Figure: scatter plot with a larger number of class 1 and class 2 points; axes: Feature 1, Feature 2]

  20. K-Nearest Neighbor (kNN) Classifier • Find the k-nearest neighbors to y in Dtrain • i.e., rank the feature vectors according to Euclidean distance • select the k vectors which have smallest distance to y • Classification • ranking yields k feature vectors and a set of k class labels • pick the class label which is most common in this set (“vote”) • classify y as belonging to this class
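
The rank, select, and vote steps on this slide could look as follows in Python. This is a sketch only: `knn_classify` and the sample data are hypothetical names and values, and Dtrain is assumed to be a list of (feature vector, label) pairs.

```python
import math
from collections import Counter


def knn_classify(d_train, y, k):
    """Classify y by majority vote among its k nearest training vectors."""
    # Rank the training pairs by Euclidean distance to y.
    ranked = sorted(d_train, key=lambda pair: math.dist(pair[0], y))
    # Keep the class labels of the k closest vectors.
    k_labels = [label for _, label in ranked[:k]]
    # Vote: pick the most common label (on a tied vote, the label seen first,
    # i.e. the one belonging to the nearer neighbor, is returned).
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical training data lying along a diagonal.
d_train = [((1, 1), 1), ((2, 2), 1), ((3, 3), 2), ((6, 6), 2), ((7, 7), 2)]
y = (2.6, 2.6)
for k in (1, 3, 5):
    print(k, knn_classify(d_train, y, k))
```

With these made-up points the prediction for y changes with k: the single nearest neighbor is a class-2 point, the three nearest contain two class-1 points, and all five contain three class-2 points.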

  21. K-Nearest Neighbor (kNN) Classifier • Theoretical Considerations • as k increases • we are averaging over more neighbors • the effective decision boundary is more “smooth” • as N increases, the optimal k value tends to increase in proportion to log N
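
One way to see the smoothing effect of larger k is to count how often the predicted label changes along a line of query points. The Python sketch below uses a made-up one-dimensional training set with two deliberately noisy labels; all names and values are assumptions for illustration.

```python
from collections import Counter

# Made-up 1-D training set: class 1 on the left, class 2 on the right,
# with one deliberately "noisy" label on each side (x = 2 and x = 7).
positions = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = [1, 1, 2, 1, 1, 2, 2, 1, 2, 2]


def knn_label(y, k):
    """Majority label among the k training positions closest to y."""
    ranked = sorted(range(len(positions)), key=lambda i: abs(positions[i] - y))
    return Counter(labels[i] for i in ranked[:k]).most_common(1)[0][0]


def boundary_crossings(k):
    """Count how often the predicted label changes along a grid of query points."""
    # Grid points 0.25, 0.75, ..., 9.25 are never equidistant from two training points.
    grid = [0.25 + 0.5 * i for i in range(19)]
    predictions = [knn_label(y, k) for y in grid]
    return sum(1 for a, b in zip(predictions, predictions[1:]) if a != b)


for k in (1, 3):
    print(k, boundary_crossings(k))
```

On this toy data, k = 1 produces five label changes along the grid because each noisy point carves out its own region, while k = 3 produces a single change near the middle.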

  22. K-Nearest Neighbor (kNN) Classifier • Notes: • In effect, the classifier uses the nearest k feature vectors from Dtrain to “vote” on the class label for y • the single-nearest neighbor classifier is the special case of k=1 • for two-class problems, if we choose k to be odd (i.e., k=1, 3, 5,…) then there will never be any “ties” • “training” is trivial for the kNN classifier, i.e., we just use Dtrain as a “lookup table” when we want to classify a new feature vector
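
Two of the notes above can be illustrated with a small Python sketch: "training" that only stores Dtrain, and the tied vote that an even k can produce in a two-class problem. The class `LookupTableKNN`, the helper `is_tied`, and the data are hypothetical.

```python
import math
from collections import Counter


class LookupTableKNN:
    """kNN whose "training" step only stores the labeled data."""

    def __init__(self, k):
        self.k = k
        self.memory = []

    def fit(self, feature_vectors, labels):
        # No parameters are estimated: Dtrain is kept as a lookup table.
        self.memory = list(zip(feature_vectors, labels))
        return self

    def votes(self, y):
        """Label counts among the k stored vectors closest to y."""
        ranked = sorted(self.memory, key=lambda pair: math.dist(pair[0], y))
        return Counter(label for _, label in ranked[: self.k])


def is_tied(votes):
    """True if the two most common labels received the same number of votes."""
    top = votes.most_common(2)
    return len(top) == 2 and top[0][1] == top[1][1]


x_train = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
c_train = [1, 1, 2, 2]
y = (2.4, 0.0)
for k in (2, 3):
    model = LookupTableKNN(k).fit(x_train, c_train)
    print(k, dict(model.votes(y)), is_tied(model.votes(y)))
```

For the query point chosen here, k = 2 splits the vote one to one, whereas k = 3 yields a two to one majority for class 1.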

  23. K-Nearest Neighbor (kNN) Classifier • Extensions of the Nearest Neighbor classifier • weighted distances • e.g., if some of the features are more important • e.g., if features are irrelevant • fast search techniques (indexing) to find k-nearest neighbors in d-space
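
A weighted distance can be sketched by scaling each squared feature difference before summing. The Python example below is one possible form, not necessarily the one intended in the slides; the per-feature `weights` argument, the function names, and the data are assumptions.

```python
import math


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a non-negative weight per feature."""
    return math.sqrt(sum(w * (xi - yi) ** 2 for w, xi, yi in zip(weights, x, y)))


def nearest_label(d_train, y, weights):
    """Label of the training vector closest to y under the weighted distance."""
    _, label = min(d_train, key=lambda pair: weighted_euclidean(pair[0], y, weights))
    return label


# Hypothetical data in which feature 2 is treated as irrelevant noise.
d_train = [((1.0, 9.0), 1), ((3.0, 1.0), 2)]
y = (1.2, 2.0)
print(nearest_label(d_train, y, (1.0, 1.0)))  # both features count equally
print(nearest_label(d_train, y, (1.0, 0.0)))  # feature 2 is ignored
```

With equal weights the large gap in feature 2 dominates and y is matched to the class-2 point; with the weight on feature 2 set to zero, y is matched to the class-1 point, which is much closer in feature 1.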

  24. Accuracy on Training Data versus Test Data • Training Accuracy = $\frac{1}{N} \sum_{i \in D_{train}} I(o(i), c(i))$ • where I( o(i), c(i) ) = 1 if o(i) = c(i), and 0 otherwise • o(i) is the output of the classifier for training feature vector x(i), and c(i) is the true class for x(i) • Let Dtest be a set of new data, unseen in the training process, but assume that Dtest is generated by the same “mechanism” that generated Dtrain • Test Accuracy = $\frac{1}{N_{test}} \sum_{j \in D_{test}} I(o(j), c(j))$ • Test accuracy is what we are really interested in; unfortunately, test accuracy is usually lower on average than training accuracy

  25. Test Accuracy and Generalization • The accuracy of our classifier on new unseen data is a fair/honest assessment of the performance of our classifier • Why is training accuracy not good enough? • Training accuracy is optimistic • a classifier like nearest-neighbor can construct boundaries which always separate all training data points, but which do not separate new points • e.g., what is the training accuracy of kNN, k = 1? • A flexible classifier can “overfit” the training data • in effect it just memorizes the training data, but does not learn the general relationship between x and c
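
The question about kNN with k = 1 can be explored with a small Python experiment. Everything here is made up for illustration (one-dimensional features, the data values, and the function names); it only shows the mechanism, not a general result.

```python
def nearest_label(d_train, y):
    """Label of the training pair whose (1-D) feature is closest to y."""
    _, label = min(d_train, key=lambda pair: abs(pair[0] - y))
    return label


def accuracy(d_train, d_eval):
    """Percentage of pairs in d_eval whose label the 1-NN classifier reproduces."""
    correct = sum(1 for x, c in d_eval if nearest_label(d_train, x) == c)
    return 100.0 * correct / len(d_eval)


# Made-up 1-D data: class 1 lies left of about 4, class 2 to the right,
# and the training point at 2.0 carries a noisy label.
d_train = [(0.0, 1), (1.0, 1), (2.0, 2), (3.0, 1), (5.0, 2), (6.0, 2), (7.0, 2)]
d_test = [(0.4, 1), (1.8, 1), (2.4, 1), (5.4, 2), (6.8, 2)]

print("training accuracy:", accuracy(d_train, d_train))
print("test accuracy:", accuracy(d_train, d_test))
```

Because every training point is its own nearest neighbor (distance zero, and the training features here are all distinct), the training accuracy is 100 percent even though one label is noisy. The test points that fall near the noisy point are misclassified, so the test accuracy on this toy set is lower.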

  26. Test Accuracy and Generalization • Generalization • We are really interested in how our classifier generalizes to new data • test data accuracy is a good estimate of generalization performance

  27. Assignment • 3 parts • classplot: Plot classification data in two dimensions • knn: Implement a nearest-neighbor classifier • knn_test: Test the effect of the value k on the accuracy of the classifier • Test data • knnclassifier.mat

  28. Plotting Function
      function classplot(data, x, y);
      % function classplot(data, x, y);
      %
      % brief description of what the function does
      % ......
      % Your Name, ICS 175A, date
      %
      % Inputs
      %   data: (a structure with the same fields as described above:
      %          your comment header should describe the structure explicitly)
      %   Note that if you are only using certain fields in the structure
      %   in the function below, you need only define these fields in the input comments
      -------- Your code goes here -------

  29. Nearest Neighbor Classifier
      function [class_predictions] = knn(traindata,trainlabels,k, testdata)
      % function [class_predictions] = knn(traindata,trainlabels,k, testdata)
      %
      % a brief description of what the function does
      % ......
      % Your Name, ICS 175A, date
      %
      % Inputs
      %   traindata: an N1 x d matrix of feature data (the "memory" for kNN)
      %   trainlabels: an N1 x 1 vector of class labels for traindata
      %   k: an odd positive integer indicating the number of neighbors to use
      %   testdata: an N2 x d matrix of feature data for testing the knn classifier
      %
      % Outputs
      %   class_predictions: N2 x 1 vector of predicted class values
      -------- Pseudocode -------
      Read in the training data set Dtrain
      For each feature vector y in testdata
          kneighbors = k-nearest neighbors to y in Dtrain
          kclasses = class values of the kneighbors
          kvote = the most common class value in kclasses
          predicted_class(y) = kvote
      end
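
For readers who want to check their understanding of the pseudocode outside MATLAB, here is a Python sketch with the same argument order as the `knn` stub. It is an illustration under assumptions (lists of lists instead of matrices, Euclidean distance, made-up data), not the assignment's reference solution.

```python
import math
from collections import Counter


def knn(traindata, trainlabels, k, testdata):
    """Return one predicted class label per row of testdata."""
    if k < 1 or k > len(traindata):
        raise ValueError("k must be between 1 and the number of training rows")
    class_predictions = []
    for y in testdata:
        # Indices of the training rows, ordered by distance to y.
        order = sorted(range(len(traindata)),
                       key=lambda i: math.dist(traindata[i], y))
        # Class values of the k nearest neighbors.
        kclasses = [trainlabels[i] for i in order[:k]]
        # The most common class value wins the vote.
        kvote = Counter(kclasses).most_common(1)[0][0]
        class_predictions.append(kvote)
    return class_predictions


# Hypothetical data: a class-1 cluster near (1, 1), a class-2 cluster near (6, 6).
traindata = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
trainlabels = [1, 1, 1, 2, 2, 2]
testdata = [[1.5, 1.5], [6.5, 6.5], [3, 3]]
print(knn(traindata, trainlabels, 3, testdata))
```

Sorting all N1 distances for each test row keeps the sketch short; the indexing techniques mentioned on slide 23 would avoid that full sort on large data sets.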

  30. Accuracy of kNN Classifier as k is varied
      function [accuracies] = knn_test(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
      % function [accuracies] = knn_test(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
      %
      % a brief description of what the function does
      % ......
      % Your Name, ICS 175A, date
      %
      % Inputs
      %   traindata: an N1 x d matrix of feature data (the "memory" for kNN)
      %   trainlabels: an N1 x 1 vector of class labels for traindata
      %   testdata: an N2 x d matrix of feature data for testing the knn classifier
      %   testlabels: an N2 x 1 vector of class labels for testdata
      %   kmax: an odd positive integer indicating the maximum number of neighbors
      %   plotflag: (optional argument) if 1, the accuracy versus k is plotted,
      %             otherwise no plot.
      %
      % Outputs
      %   accuracies: r x 1 vector of accuracies on testdata, where r is the
      %               number of values of k that are tested.
      -------- Pseudocode -------
      Read in the training data set Dtrain, and Dtest
      For k = 1, 3, 5, ... kmax (odd numbers)
          classify each point in Dtest using the k nearest neighbors in Dtrain
          accuracy_k = 100*(number of points correctly classified)/(number of points in Dtest)
      end
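
The loop over odd k can be sketched in Python as follows. This is an illustration under assumptions, not the assignment's reference solution: it omits the optional `plotflag` argument, includes its own small `knn` helper so that it runs on its own, and uses made-up one-dimensional data.

```python
import math
from collections import Counter


def knn(traindata, trainlabels, k, testdata):
    """Majority-vote kNN prediction for each row of testdata."""
    predictions = []
    for y in testdata:
        order = sorted(range(len(traindata)),
                       key=lambda i: math.dist(traindata[i], y))
        votes = Counter(trainlabels[i] for i in order[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


def knn_test(traindata, trainlabels, testdata, testlabels, kmax):
    """Test accuracy (in percent) for k = 1, 3, 5, ..., kmax."""
    accuracies = []
    for k in range(1, kmax + 1, 2):
        predicted = knn(traindata, trainlabels, k, testdata)
        correct = sum(1 for p, c in zip(predicted, testlabels) if p == c)
        accuracies.append(100.0 * correct / len(testlabels))
    return accuracies


# Made-up 1-D data: the labels at x = 2 and x = 7 are deliberately noisy.
traindata = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
trainlabels = [1, 1, 2, 1, 1, 2, 2, 1, 2, 2]
testdata = [[1.8], [2.2], [4.2], [4.8], [6.8], [7.3], [0.3], [8.7]]
testlabels = [1, 1, 1, 2, 2, 2, 1, 2]
print(knn_test(traindata, trainlabels, testdata, testlabels, 5))
```

On this toy data the accuracy is not monotone in k: k = 1 is misled by the two noisy training labels, k = 3 outvotes them, and k = 5 starts to blur the boundary between the classes.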

  31. Summary • Important Concepts • classification is an important component in intelligent systems • a classifier = a mapping from feature space to a class label • decision boundaries = boundaries between classes • classification learning • using training data to define a classifier • the nearest-neighbor classifier • training accuracy versus test accuracy
