Nearest Neighbor Classifiers
• other names:
  • instance-based learning
  • case-based learning (CBL)
  • non-parametric learning
  • model-free learning
1-NN
• save all training data
• to classify a test example:
  • compute the distance to each training example (Euclidean distance metric)
  • report the class of the nearest training example (sketch below)
• for binary attributes, use Hamming distance
• for nominal attributes, use equality (0 if equal, else 1) or VDM (Value-Difference Metric; Stanfill and Waltz, 1986) – squared differences of conditional probabilities, summed over classes
• Result: often surprisingly good accuracy, comparable with decision trees & neural nets
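A minimal 1-NN sketch of the procedure above, in Python. The training format (a list of (feature_vector, label) pairs) and the toy data are assumptions for illustration, not from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def classify_1nn(train, query):
    """Return the class of the single nearest training example.

    train: list of (feature_vector, label) pairs
    query: feature vector to classify
    """
    nearest = min(train, key=lambda ex: euclidean(ex[0], query))
    return nearest[1]

# hypothetical toy data
train = [([1.0, 2.0], "pos"), ([4.0, 5.0], "neg"), ([0.5, 1.5], "pos")]
print(classify_1nn(train, [1.2, 1.8]))   # nearest example is ([1.0, 2.0], "pos")
```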
k-NN
• 1-NN is sensitive to noise
• instead, take a majority vote over the k closest neighbors
• optimizing k: use a validation set
• distance-weighting: weight each neighbor's vote by its distance (sketch below)
  • with distance-weighting, can use all training examples
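A sketch of k-NN with optional distance weighting, assuming the same (feature_vector, label) training format as the 1-NN example above; the inverse-distance weighting scheme is one common choice, not prescribed by the slides.

```python
import math
from collections import defaultdict

def classify_knn(train, query, k=3, weighted=False):
    """Majority vote over the k nearest neighbors.

    If weighted, each neighbor votes with weight 1/(distance + eps),
    so nearer neighbors count more.
    """
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = math.dist(features, query)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)
```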
strengths of k-NN
• simple, accurate
• Theorem (Cover & Hart, 1967): in the limit (large N), the error of 1-NN is at most twice the error of the Bayes-optimal classifier
weaknesses of k-NN
• memory needed to store examples
• classification speed (indexing can help)
• no comprehensibility
• noise, curse of dimensionality, lack of adequate training examples
basis for generalization
• bias: similarity bias
NTGrowth (Aha and Kibler)
• during training, save only those examples on which mistakes are made (sketch below)
• also throw out examples that appear noisy
• reduces memory requirements, increases accuracy
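A rough sketch of the mistake-driven storage idea only; the actual NTGrowth algorithm also tracks per-example performance statistics to discard apparently noisy examples, which is omitted here.

```python
import math

def _nearest_label(store, query):
    """Label of the stored example nearest to query (1-NN lookup)."""
    return min(store, key=lambda ex: math.dist(ex[0], query))[1]

def grow_store(training_examples):
    """Keep a training example only if the examples stored so far misclassify it."""
    store = []
    for features, label in training_examples:
        if not store or _nearest_label(store, features) != label:
            store.append((features, label))
    return store
```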
Scaling of attributes
• for fairness, don't want attributes with large values to dominate the distance
• pre-whiten the data:
  • for continuous attributes, replace each value with its z-score, z = (x − m)/s, where m is the attribute's mean and s its standard deviation (sketch below)
  • binary and nominal attributes are already on a 0–1 scale
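A sketch of the z-score scaling step, assuming the data is a list of equal-length numeric feature vectors with at least two rows.

```python
import statistics

def zscore_columns(data):
    """Replace each continuous column with (x - mean) / stdev."""
    columns = list(zip(*data))
    means = [statistics.mean(col) for col in columns]
    # guard against constant columns (stdev of 0) by dividing by 1 instead
    stdevs = [statistics.stdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)]
            for row in data]
```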
Feature Weighting
• weighted Euclidean distance metric (sketch below)
• want to weight features by “relevance”:
  • conditional probability
  • negEntropy
  • chi-squared
• Mahalanobis metric
  • uses the inverse of the covariance matrix: dxy = (x − y)ᵀ S⁻¹ (x − y)
  • captures skewing of the data distribution
  • con: class-independent
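A sketch of a feature-weighted Euclidean distance and the Mahalanobis form from the slide. The weight vector and the covariance matrix S are assumed to be supplied (e.g., estimated from the training data); note the slide writes the squared Mahalanobis form, and the usual distance is its square root.

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """sqrt( sum_i w_i * (x_i - y_i)^2 ), with per-feature weights w."""
    x, y, w = np.asarray(x), np.asarray(y), np.asarray(w)
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

def mahalanobis_sq(x, y, S):
    """Squared form from the slide: (x - y)^T S^{-1} (x - y), S = covariance matrix."""
    diff = np.asarray(x) - np.asarray(y)
    return float(diff @ np.linalg.inv(S) @ diff)
```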
Feature Selection
• curse of dimensionality – many attributes often lead to lower accuracy
• PCA – principal component analysis
  • based on manipulation of the covariance matrix
  • choose new orthogonal dimensions as linear combinations of the original attributes, in order of most variance explained
• filter methods: try to estimate relevance
  • negEntropy; RELIEF: hits vs. misses of neighbors
• wrapper methods: use accuracy on the training data to pick the best features
  • SFS: stepwise-forward selection (sketch below)
  • SBE: stepwise-backward elimination
  • DIET: optimize the weight of one feature at a time by searching over a grid
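A sketch of stepwise-forward selection (SFS) as a wrapper method: greedily add the single feature that most improves accuracy, stopping when no candidate helps. The evaluate(feature_subset) callback is hypothetical; it would train a k-NN classifier on those features and return its accuracy.

```python
def stepwise_forward_selection(all_features, evaluate):
    """Greedy forward wrapper search over feature subsets."""
    selected, best_acc = [], 0.0
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # accuracy obtained by adding each remaining feature to the current set
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        acc, best_f = max(scored, key=lambda t: t[0])
        if acc <= best_acc:        # no candidate improves accuracy: stop
            break
        selected.append(best_f)
        best_acc = acc
    return selected
```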