Part III: The Nearest-Neighbor Classifier
Outline • Training and test data accuracy • The nearest-neighbor classifier
Classification Accuracy • Say we have N feature vectors • Say we know the true class label for each feature vector • We can measure how accurate a classifier is by how many feature vectors it classifies correctly • Accuracy = percentage of feature vectors correctly classified • training accuracy = accuracy on training data • test accuracy = accuracy on new data not used in training
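As a concrete illustration, here is a minimal MATLAB sketch of the accuracy calculation (the variable names predicted and truelabels are placeholders, not part of the assignment):

predicted  = [1; 2; 2; 1];                          % hypothetical classifier outputs for 4 feature vectors
truelabels = [1; 2; 1; 1];                          % hypothetical true class labels
accuracy   = 100 * mean(predicted == truelabels);   % percentage correctly classified (75 here)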
Training Data and Test Data • Training data • labeled data used to build a classifier • Test data • new data, not used in the training process, used to evaluate how well the classifier does on new data • Memorization versus Generalization • better training accuracy = better at "memorizing" the training data • better test accuracy = better at "generalizing" to new data • in general, we would like our classifier to perform well on new test data, not just on training data • i.e., we would like it to generalize well to new data
Examples of Training and Test Data • Speech Recognition • Training data: words recorded and labeled in a laboratory • Test data: words recorded from new speakers, new locations • Zipcode Recognition • Training data: zipcodes manually selected, scanned, labeled • Test data: actual letters being scanned in a post office • Credit Scoring • Training data: historical database of loan applications with payment history or decision at that time • Test data: you
Some Notation • Training Data • Dtrain = { [x(1), c(1)], [x(2), c(2)], … , [x(N), c(N)] } • N pairs of feature vectors and class labels • Feature Vectors and Class Labels: • x(i) is the ith training data feature vector • in MATLAB this could be the ith row of an N x d matrix • c(i) is the class label of the ith feature vector • in general, c(i) can take m different class values, e.g., c = 1, c = 2, ... • Let y be a new feature vector whose class label we do not know, i.e., we wish to classify it.
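In MATLAB, one way to hold this training data is sketched below (the variable names Xtrain, ctrain, and y, and the random placeholder values, are purely illustrative):

N = 100; d = 2; m = 2;                % hypothetical sizes: N vectors, d features, m classes
Xtrain = rand(N, d);                  % N x d matrix: row i is the feature vector x(i)
ctrain = randi(m, N, 1);              % N x 1 vector: ctrain(i) is the class label c(i)
y = rand(1, d);                       % a new feature vector whose class label is unknown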
Nearest Neighbor Classifier • y is a new feature vector whose class label is unknown • Search Dtrain for the closest feature vector to y • let this "closest feature vector" be x(j) • Classify y with the same label as x(j), i.e. • y is assigned label c(j) • How is the "closest" vector determined? • typically by minimum Euclidean distance (see the sketch below) • dE(x, y) = sqrt( Σi (xi − yi)² ) • Side note: this produces a "Voronoi tessellation" of the d-dimensional feature space • each training point "claims" the cell surrounding it • cell boundaries are polygons (polyhedra in higher dimensions) • Analogous to "memory-based" reasoning in humans
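A minimal MATLAB sketch of this rule, assuming the Xtrain, ctrain, and y defined above (it uses implicit expansion, so it needs MATLAB R2016b or later; this is only an illustration, not the assignment solution):

dists = sum((Xtrain - y).^2, 2);      % squared Euclidean distance from y to every row of Xtrain
[~, j] = min(dists);                  % index j of the closest training vector x(j)
predicted_label = ctrain(j);          % y is assigned the label c(j)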
Geometric Interpretation of Nearest Neighbor • [figure: training points labeled 1 and 2 plotted against Feature 1 and Feature 2]
Regions for Nearest Neighbors • Each data point defines a "cell" of space that is closest to it • All points within that cell are assigned that point's class • [figure: the labeled training points in the Feature 1 / Feature 2 plane]
Nearest Neighbor Decision Boundary • Overall decision boundary = union of the cell boundaries where the class decision is different on each side • [figure: the labeled training points in the Feature 1 / Feature 2 plane]
How should the new point be classified? • [figure: the labeled training points plus a new unlabeled point "?" in the Feature 1 / Feature 2 plane]
Local Decision Boundaries • Boundary? Points that are equidistant between points of class 1 and class 2 • Note: locally the boundary is (1) linear (because of Euclidean distance), (2) halfway between the two class points, and (3) at right angles to the connector between them • [figure: the labeled training points, the new point "?", and a local boundary segment in the Feature 1 / Feature 2 plane]
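Why the local boundary is linear: a point y equidistant from a class 1 point a and a class 2 point b satisfies ||y − a||² = ||y − b||²; expanding both sides gives 2(b − a)·y = ||b||² − ||a||², which is a linear equation in y. The resulting hyperplane is perpendicular to the connector b − a and passes through the midpoint (a + b)/2.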
Finding the Decision Boundaries • [sequence of figures: local boundaries constructed between neighboring points of different classes in the Feature 1 / Feature 2 plane]
Overall Boundary = Piecewise Linear • [figure: decision region for class 1 and decision region for class 2, separated by a piecewise linear boundary in the Feature 1 / Feature 2 plane]
Geometric Interpretation of kNN (k=1) • [figure: the labeled training points and the new point "?" in the Feature 1 / Feature 2 plane]
More Data Points • [figure: a larger set of training points labeled 1 and 2 in the Feature 1 / Feature 2 plane]
More Complex Decision Boundary • In general: the nearest-neighbor classifier produces piecewise linear decision boundaries • [figure: the more complex piecewise linear boundary for the larger data set]
K-Nearest Neighbor (kNN) Classifier • Find the k-nearest neighbors to y in Dtrain • i.e., rank the feature vectors according to Euclidean distance • select the k vectors which have smallest distance to y • Classification • ranking yields k feature vectors and a set of k class labels • pick the class label which is most common in this set (“vote”) • classify y as belonging to this class
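A sketch of the ranking-and-voting step in MATLAB, again using the hypothetical Xtrain, ctrain, and y from above and some odd k (illustrative only, not the assignment solution):

k = 3;                                  % hypothetical number of neighbors
dists = sum((Xtrain - y).^2, 2);        % squared Euclidean distances to all training vectors
[~, order] = sort(dists, 'ascend');     % rank training vectors by distance to y
kclasses = ctrain(order(1:k));          % class labels of the k nearest neighbors
predicted_label = mode(kclasses);       % majority vote: the most common label among the k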
K-Nearest Neighbor (kNN) Classifier • Theoretical Considerations • as k increases • we are averaging over more neighbors • the effective decision boundary is more “smooth” • as N increases, the optimal k value tends to increase in proportion to log N
K-Nearest Neighbor (kNN) Classifier • Notes: • In effect, the classifier uses the nearest k feature vectors from Dtrain to “vote” on the class label for y • the single-nearest neighbor classifier is the special case of k=1 • for two-class problems, if we choose k to be odd (i.e., k=1, 3, 5,…) then there will never be any “ties” • “training” is trivial for the kNN classifier, i.e., we just use Dtrain as a “lookup table” when we want to classify a new feature vector
K-Nearest Neighbor (kNN) Classifier • Extensions of the Nearest Neighbor classifier • weighted distances (see the sketch below) • e.g., give larger weights to more important features • e.g., give zero weight to irrelevant features • fast search techniques (indexing) to find the k nearest neighbors in d-space
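For instance, a weighted Euclidean distance between two feature vectors might look like this in MATLAB (the vectors x, y and weights w here are purely hypothetical):

x = [1 2 3]; y = [2 2 5];               % two hypothetical 1 x 3 feature vectors
w = [2 1 0];                            % hypothetical weights: feature 1 counts double, feature 3 is ignored
dw = sqrt(sum(w .* (x - y).^2));        % weighted Euclidean distance (here sqrt(2))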
Accuracy on Training Data versus Test Data • Training Accuracy = (1/n) Σ over Dtrain of I( o(i), c(i) ) • where I( o(i), c(i) ) = 1 if o(i) = c(i), and 0 otherwise • where o(i) is the output of the classifier for training feature vector x(i), and c(i) is the true class of x(i) • Let Dtest be a set of new data, unseen in the training process • but assume that Dtest is generated by the same "mechanism" that generated Dtrain • Test Accuracy = (1/ntest) Σ over Dtest of I( o(j), c(j) ) • Test accuracy is what we are really interested in • unfortunately, test accuracy is usually lower on average than training accuracy
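As a quick illustration, both quantities could be computed with the knn function from the assignment below, here with k = 1 (a sketch; the variable names follow the assignment's conventions):

train_pred = knn(traindata, trainlabels, 1, traindata);   % classify the training data itself
train_acc  = 100 * mean(train_pred == trainlabels);       % training accuracy (100% for k = 1, barring duplicate points)
test_pred  = knn(traindata, trainlabels, 1, testdata);    % classify new, unseen data
test_acc   = 100 * mean(test_pred == testlabels);         % test accuracy, usually lower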
Test Accuracy and Generalization • The accuracy of our classifier on new unseen data is a fair/honest assessment of the performance of our classifier • Why is training accuracy not good enough? • Training accuracy is optimistic • a classifier like nearest-neighbor can construct boundaries which always separate all training data points, but which do not separate new points • e.g., what is the training accuracy of kNN, k = 1? • A flexible classifier can “overfit” the training data • in effect it just memorizes the training data, but does not learn the general relationship between x and C
Test Accuracy and Generalization • Generalization • We are really interested in how our classifier generalizes to new data • test data accuracy is a good estimate of generalization performance
Assignment • 3 parts • classplot: plot classification data in two dimensions • knn: implement a nearest-neighbor classifier • knn_test: test the effect of the value of k on the accuracy of the classifier • Test data • knnclassifier.mat
Plotting Function

function classplot(data, x, y)
% function classplot(data, x, y)
%
% brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   data: a structure with the same fields as described above
%         (your comment header should describe the structure explicitly)
%   Note: if you only use certain fields of the structure in the function
%         below, you need only define those fields in the input comments

-------- Your code goes here --------
Nearest Neighbor Classifier

function [class_predictions] = knn(traindata, trainlabels, k, testdata)
% function [class_predictions] = knn(traindata, trainlabels, k, testdata)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   traindata:   N1 x d matrix of feature data (the "memory" for kNN)
%   trainlabels: N1 x 1 vector of class labels for traindata
%   k:           an odd positive integer, the number of neighbors to use
%   testdata:    N2 x d matrix of feature data for testing the kNN classifier
%
% Outputs
%   class_predictions: N2 x 1 vector of predicted class values

-------- Pseudocode --------
for each feature vector y in testdata
    kneighbors = the k nearest neighbors to y in traindata
    kclasses   = class labels of the kneighbors
    kvote      = the most common class value in kclasses
    predicted class of y = kvote
end
Accuracy of kNN Classifier as k is Varied

function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
% function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%   traindata:   N1 x d matrix of feature data (the "memory" for kNN)
%   trainlabels: N1 x 1 vector of class labels for traindata
%   testdata:    N2 x d matrix of feature data for testing the kNN classifier
%   testlabels:  N2 x 1 vector of class labels for testdata
%   kmax:        an odd positive integer, the maximum number of neighbors to use
%   plotflag:    (optional argument) if 1, accuracy versus k is plotted; otherwise no plot
%
% Outputs
%   accuracies:  r x 1 vector of accuracies on testdata, where r is the
%                number of values of k that are tested

-------- Pseudocode --------
for k = 1, 3, 5, ..., kmax (odd values only)
    predictions = knn(traindata, trainlabels, k, testdata)
    accuracy for this k = 100 * (number of test points correctly classified) / (number of points in testdata)
end
if plotflag == 1, plot accuracy versus k
Summary • Important Concepts • classification is an important component in intelligent systems • a classifier = a mapping from feature space to a class label • decision boundaries = boundaries between classes • classification learning • using training data to define a classifier • the nearest-neighbor classifier • training accuracy versus test accuracy