
Review of Assignment 2 (k-nearest neighbor classifier)



  1. Review of Assignment 2 (k-nearest neighbor classifier) CS 175, Fall 2007

  2. Road Map
  • Assignment 3 (due Thursday)
    • Perceptron classifier
  • Assignment 4
    • Image manipulation and image classification
  • Assignment 5
    • Template matching and edge detection in images
  • Assignment 6
    • Project proposals
  • Assignment 7
    • Project progress report
  • In-class presentation (last week of classes)
  • Final project report (due finals week)

  3. Class Grading
  • Assignments 1 through 7:
    • each worth 10% of your grade
    • the best 4 of the first 5 assignments are selected
    • => 40% of your grade
  • Assignments 6 and 7 (project proposal / progress report)
    • => 20% of your grade
  • Total for assignments is 60%
  • Final project report and in-class demonstration
    • worth 40% of your grade

  4. Scores on Assignments 1 and 2
  • Assignment 1 (out of 40)
    • Mean: 34
    • Max: 38
    • Most scores between 30 and 38
  • Assignment 2 (out of 50)
    • Mean: 34
    • Max: 46
    • Most scores between 30 and 46

  5. Assignment 2 (kNN classifier)
  • 4 MATLAB functions, 10 points each
    • 6 pts for functioning correctly
    • 2 pts for error checking
    • 2 pts for comments
  • Graph + discussion = 10 points
  • For a detailed explanation of point deductions, discuss with Nathan (TA)

  6. Comments from TA

Problems with knn: some submissions did not return a correctly sized vector, or returned something unrelated to the classification task at hand; in other cases the kNN computation itself was not done properly.

Comments on plot functions: in general these were all very good. Note in particular that when a large k and/or a large data set was passed to the knn_errors function, some people's code had trouble with the larger data sets (i.e., the plot not coming back for 10 minutes or so). In the future, test your code for scalability like this, as it is one of the things we test for.

Problems with write-up: the assignment simply asks for a plot or two and a short discussion. Many people included no significant discussion of the results, or didn't include a plot for clarity. For future write-ups, note that one sentence will not suffice as an analysis. You also lost points if you didn't at least try to discuss why the error rate starts increasing as k -> infinity; all we expected was some ideas about why this happens.
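
A minimal sketch of the kind of scalability check described above, assuming knn.m is on the path and reusing the simdata2 split from the script on slide 17 (the test-set sizes and k = 5 here are illustrative, not part of the assignment):

load simdata2;
traindata   = simdata2.features(1:1000,:);
trainlabels = simdata2.classlabels(1:1000);
for n = [100 500 1000]
    testdata = simdata2.features(1001:1000+n,:);
    tic;                                        % start a timer
    knn(traindata, trainlabels, 5, testdata);   % classify n test points with k = 5
    fprintf('n = %4d test points: %.2f seconds\n', n, toc);
end

If the time grows much faster than the test-set size, that is a sign the code will not handle the larger inputs we grade with.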

  7. Comments on Assignment 2
  • Use of sort.m for finding nearest neighbors:
    • [x index] = sort(distances)
    • x contains the sorted distances
    • index contains their original indices (after sorting)
    • so kdistances = distances(index(1:k)) will contain the smallest k distances
    • and klabels = labels(index(1:k)) will contain the labels corresponding to the smallest k distances
    • more convenient than finding the minimum, removing it, and repeating k times (see the sketch after this slide)
    • complexity? Sorting is O(n log n), versus O(nk) for finding the min k times
  • Use labels on your plots
    • see xlabel, ylabel, title, text, etc.
  • IMPORTANT: test your code before submitting it!
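
A short sketch of the sort-based selection described above; the distances and labels here are made up for illustration:

distances = [4.2; 0.5; 3.1; 1.7];      % distances from a test point to 4 training points
labels    = [1; 2; 1; 2];              % class labels of those training points
k = 3;
[x, index] = sort(distances);          % x = sorted distances, index = original positions
kdistances = distances(index(1:k));    % the k smallest distances (same as x(1:k))
klabels    = labels(index(1:k));       % labels of the k nearest training points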

  8. Example of classplot.m

function classplot(data, x, y);
% function classplot(data, x, y);
%
% Makes a 2D plot of feature vectors, and color codes each vector (point)
% depending on the 'class' of the vector
% A. Student, CS 175
%
% Inputs
%   data: A struct that contains the following fields:
%     shortname:   a char vector which stores the name of the data set
%     numfeatures: the number of features that each vector of data has,
%                  in a 1x1 matrix
%     classnames:  names of each class stored in a matrix, with each
%                  class name being on its own row
%     numclasses:  the number of classes, in a 1x1 matrix
%     description: a char vector which stores a short description
%     features:    an N x numfeatures matrix which stores feature values
%                  for each feature vector
%     classlabels: an N x 1 matrix which stores the class labels, which
%                  are doubles, for each feature vector
%   x: (optional) specifies the feature column to be plotted on the x-axis
%   y: (optional) specifies the feature column to be plotted on the y-axis

  9. classplot.m (continued)

% This section of code checks the important fields:

% Checks that data is a struct
if (~isstruct(data))
    error('The first input (data) must be a struct');
% Checks that the size of each feature vector is equal to numfeatures
elseif (size(data.features,2) ~= data.numfeatures)
    error('# of columns in data.features must equal data.numfeatures');
% Checks that the # of rows in 'features' equals # of rows in 'classlabels'
elseif (size(data.classlabels,1) ~= size(data.features,1))
    error('# of rows in data.classlabels must equal # of rows in data.features');
% Checks that the number of rows in classnames is equal to numclasses
elseif (size(data.classnames,1) ~= data.numclasses)
    error('The number of rows in classnames must be equal to numclasses');

  10. classplot.m (continued)

% Checks that the number of classes is at least 2
elseif (data.numclasses < 2)
    error('The number of classes must be at least 2');
% Checks that x and y are within acceptable bounds
elseif (nargin == 3 && (x > data.numfeatures || y > data.numfeatures))
    error('Neither x nor y can be greater than the numfeatures');
% Checks that x and y are different
elseif (nargin == 3 && x == y)
    error('The second and third inputs must not be equal');
end

  11. classplot.m (continued)

% initializes x and y to be 1 and 2 if the user does not specify otherwise
if (nargin ~= 3)
    x = 1;
    y = 2;
end

% assumption: there are only two classes (1 & 2)
figure   % makes a new figure
hold;    % holds the current plot

% Plots the data that has classlabels equal to 1 in red
plot(data.features(data.classlabels==1,x), data.features(data.classlabels==1,y), 'r*')
% Plots the data that has classlabels equal to 2 in green
plot(data.features(data.classlabels==2,x), data.features(data.classlabels==2,y), 'g*')

% Puts a title, class legend, and axis labels on the plot
title(data.shortname);
legend(data.classnames(1,:), data.classnames(2,:));
xlabel(cat(2, 'Feature ', num2str(x)));
ylabel(cat(2, 'Feature ', num2str(y)));

hold;    % releases the current plot
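
A hypothetical usage sketch: build a minimal two-class struct with the fields documented on slide 8 (all field values here are made up for illustration) and plot features 1 and 2:

data.shortname   = 'toydata';
data.numfeatures = 2;
data.classnames  = ['class one'; 'class two'];      % one (equal-length) name per row
data.numclasses  = 2;
data.description = 'a small synthetic two-class data set';
data.features    = [randn(20,2); randn(20,2) + 2];  % 40 points in 2 dimensions
data.classlabels = [ones(20,1); 2*ones(20,1)];      % first 20 are class 1, rest class 2
classplot(data, 1, 2);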

  12. Example of knn.m

function [class_predictions] = knn(traindata, trainlabels, k, testdata)
% function [class_predictions] = knn(traindata, trainlabels, k, testdata)
%
% produces class labels on the test data using a k-nearest-neighbor classifier
%
% A. Student, CS 175
%
% Inputs
%   traindata:   an N1 x d matrix of feature data (the "memory" for kNN)
%   trainlabels: an N1 x 1 vector of class labels for traindata
%   k:           an odd positive integer indicating the number of neighbors to use
%   testdata:    an N2 x d matrix of feature data for testing the knn classifier
%
% Outputs
%   class_predictions: an N2 x 1 vector of predicted class values

% determine the sizes of the input arrays
[ntest dtest] = size(testdata);
[ntrain dtrain] = size(traindata);
nlabels = length(trainlabels);

  13. knn.m (continued)

...
% check that the necessary dimensions match:
if (dtest ~= dtrain)
    error('Dimensions of training and test data do not match');
elseif (nlabels ~= ntrain)
    error('Number of training data points and labels do not match');
end

% initialize a vector to contain the class predictions
class_predictions = zeros(ntest,1);

for i=1:ntest                                      % for each test point
    x = testdata(i,:);
    distances = sqdistances(x, traindata);         % get distances to all training points
    [tmp index] = sort(distances);                 % sort the distances, store indices in "index"
    nnlabels = trainlabels(index(1:k));            % determine the labels of the k neighbors
    class_predictions(i) = round(mean(nnlabels));  % vote and get a label
end
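
The round(mean(...)) line implements a majority vote for the two-class case: with labels restricted to 1 and 2 and k odd, the mean of the neighbor labels rounds to whichever class holds the majority (it does not generalize to other labelings). A quick illustration at the prompt:

nnlabels = [1; 2; 2];     % two of three neighbors are class 2
round(mean(nnlabels))     % mean is 1.667, which rounds to 2 (the majority class)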

  14. Another example of knn.m

function [class_predictions] = knn(traindata, trainlabels, k, testdata)
% ... header comments (omitted)
% ... error checking (omitted)

for i = 1:size(testdata,1)
    % find the nearest neighbors
    [y, idx, d] = k_nearest_neighbor(testdata(i,:), traindata, k);
    % make a prediction
    if sum(trainlabels(idx) == 1) > k/2
        class_predictions(i) = 1;
    else
        class_predictions(i) = 2;
    end
end
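
A hypothetical call to either version of knn, with a tiny made-up data set (the shapes follow the header comments on slide 12):

traindata   = [0 0; 0 1; 5 5; 6 5];    % four training points in 2-D
trainlabels = [1; 1; 2; 2];
testdata    = [0.2 0.3; 5.5 5.1];      % two test points
preds = knn(traindata, trainlabels, 3, testdata)   % expected result: [1; 2]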

  15. k_nearest_neighbor.m

function [y, i, d] = k_nearest_neighbor(x, A, k);
% function [y, i, d] = k_nearest_neighbor(x, A, k);
%
% finds the k nearest neighbors to x from the rows of the matrix A
% A. N. Other Student, CS 175
%
% Inputs
%   x: a vector of numbers of size 1 x n
%   A: m vectors of size 1 x n, "stacked" in an m x n matrix
%   k: the number of neighbors to find
%
% Outputs
%   y: matrix of the k closest vectors in A to x (of size k x n)
%   i: the integer (row) indices of y in A (of size k x 1)
%   d: the Euclidean distances of y to x (of size k x 1)

% error checking
if size(x,1) ~= 1
    error('The first argument should be a row-vector');
end
if size(x,2) ~= size(A,2)
    error('The arguments should have the same number of columns');
end

  16. k_nearest_neighbor.m (continued)

m = size(A,1);   % number of rows in the A matrix

% replicate the original input m times
% (this allows us to avoid the use of a for-loop in computing distances below)
B = repmat(x, m, 1);

% calculate the distance between each row in A and B
diff = A - B;                       % matrix of size m x n
dist = sqrt(sum(diff.*diff, 2));    % sum squared differences across columns
% (the square root could be skipped here for speed)

% sort the distances and find the k nearest neighbors and indices
[d, i] = sort(dist);
i = i(1:k);
d = d(1:k);
y = A(i, :);
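
A hypothetical call, with a small made-up matrix of training vectors:

A = [0 0; 1 1; 2 2; 5 5];     % four training vectors (one per row)
x = [1.2 0.9];                % query vector
[y, i, d] = k_nearest_neighbor(x, A, 2);
% y = the 2 rows of A closest to x, i = their row indices (here [2; 3]),
% d = the corresponding Euclidean distances (here roughly [0.22; 1.36])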

  17. Script to compute kNN error rates as k varies

load simdata2;

% create training data and labels
traindata   = simdata2.features(1:1000,:);
trainlabels = simdata2.classlabels(1:1000);

% create test data and labels
testdata   = simdata2.features(1001:2000,:);
testlabels = simdata2.classlabels(1001:2000);

% define the range of k values
kmax = 75;
total_errors = zeros(kmax,1);

for k = 1:kmax
    y = knn(traindata, trainlabels, k, testdata);
    total_errors(k) = sum(y ~= testlabels);
    fprintf('Error rate on test data with k=%d is %5.3f\n', k, 100*total_errors(k)/length(y));
end
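
A minimal sketch of how a curve like the one on slide 20 could then be produced from the error counts computed above (variable names follow the script):

plot(1:kmax, 100*total_errors/length(testlabels));   % error rate (%) as a function of k
xlabel('k (number of neighbors)');
ylabel('Error rate on test data (%)');
title('Error rate versus k for simdata2');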

  18. A faster version of knntest.m

...
% Find the kmax closest vectors in traindata to each vector in testdata
% **Note: I did not use the function knn to classify the points because it
% would have been redundant (and slow) to calculate all 1..k neighbors for
% every single k, instead of just all 1..kmax at once (like I do here instead)
for i=1:ntest
    % find all kmax nearest neighbors to our current vector i in testdata
    [nearest, nearest_rows(i,:)] = nearest_neighbor(testdata(i,:), traindata, kmax);
    % find the class values for all those neighbors (copy them to klabels)
    for j=1:kmax
        klabels(i,j) = trainlabels(nearest_rows(i,j));
    end
end

  19. A faster version of knntest.m (continued)

...
% For k = 1, 3, 5, ..., kmax (odd numbers)
for k=1:2:kmax
    % classify each point in testdata using the k nearest neighbors
    for i=1:ntest
        % vote by finding the most common class value (of the k nearest to i)
        kvote1 = find(klabels(i,1:k) == 1);
        kvote2 = find(klabels(i,1:k) == 2);
        k1 = size(kvote1);
        k2 = size(kvote2);
        % make a class prediction for point i based on the vote
        if (k1(2) > k2(2))
            class_predictions(i, k) = 1;
        else
            class_predictions(i, k) = 2;
        end
    end
end
...
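
For two classes, the vote inside the inner loop can be written more compactly with sum, avoiding the intermediate find/size calls; an equivalent sketch (not the student's code), using the same loop variables i and k as above:

% the comparison yields a 0/1 vector, so sum counts the class-1 votes directly
if sum(klabels(i,1:k) == 1) > k/2
    class_predictions(i, k) = 1;
else
    class_predictions(i, k) = 2;
end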

  20. Error Rate versus k for Simdata2

  21. Smaller training set size

  22. Different training data set sizes
