Instance Based Learning • IB1 and IBk (find in the text) • An early approach
1-Nearest Neighbor • Basic distance function between attribute values • If real, the absolute value of the difference • If nominal, d(v1,v2) = 1 if v1 ≠ v2, else 0 • Distance between two instances is the square root of the sum of squares, i.e. Euclidean distance • sqrt( (x1-y1)^2 + ... + (xn-yn)^2 ) • May normalize real-valued distances for fairness amongst attributes
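A minimal sketch of this distance function in plain Python (an illustration, not Weka's IB1 code); the attribute ranges used for normalization are assumed to be known in advance.

```python
# Sketch: per-attribute differences combined by Euclidean distance.
# Real attributes are scaled to [0, 1] using assumed known min/max;
# nominal attributes contribute 1 if the values differ, 0 if they match.
import math

def attr_distance(v1, v2, lo=None, hi=None):
    """Distance between two attribute values.

    If lo/hi are given the attribute is treated as real and normalized;
    otherwise it is treated as nominal.
    """
    if lo is not None and hi is not None:          # real-valued attribute
        return abs(v1 - v2) / (hi - lo) if hi > lo else 0.0
    return 0.0 if v1 == v2 else 1.0                # nominal attribute

def instance_distance(x, y, ranges):
    """Euclidean distance: sqrt of the sum of squared per-attribute distances."""
    return math.sqrt(sum(attr_distance(a, b, *r) ** 2
                         for a, b, r in zip(x, y, ranges)))

# Example: (outlook, temperature, humidity, windy) with assumed ranges
# for the two real attributes.
ranges = [(None, None), (64, 85), (65, 96), (None, None)]
print(instance_distance(("sunny", 85, 85, "FALSE"),
                        ("rainy", 70, 96, "FALSE"), ranges))
```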
Prediction or classification • For an instance x, let y be the closest instance to x in the training set • Predict that the class of x is the class of y • On some data sets, this is the best algorithm • In general, there is no best learning algorithm
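A minimal 1-NN prediction sketch under the same assumptions (numeric features, unweighted Euclidean distance); the training pairs are hypothetical.

```python
# Sketch: the predicted class of x is the class of its nearest training instance.
import math

def one_nn_predict(x, training_set):
    """training_set: list of (instance, label) pairs with numeric instances."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, label = min(training_set, key=lambda pair: dist(pair[0], x))
    return label

# Hypothetical two-feature example.
train = [((1.0, 1.0), "yes"), ((5.0, 5.0), "no"), ((1.5, 0.5), "yes")]
print(one_nn_predict((1.2, 0.9), train))   # -> "yes"
```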
Voronoi Diagram • For each point, draw the boundary of all points closest to it • Each point’s sphere of influence is convex • If the data are noisy, this can be bad • http://www.cs.cornell.edu/Info/People/chew/Delaunay.html - nice applet
Problems and solutions • Noise • Remove bad examples • Use voting • Bad distance measure • Use probability class vector • Memory • Remove unneeded examples
Voting schemes • k-nearest neighbor • Let the k closest neighbors vote (use odd k) • Kernel K(x,y) - a similarity function • Let everyone vote, with weight decreasing according to K(x,y) • Ex: K(x,y) = e^(-distance(x,y)^2) • Ex: K(x,y) = inner product of x and y • Ex: K(x,y) = inner product of f(x) and f(y), where f is some mapping of x and y into R^n
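A sketch of kernel-weighted voting with the first example kernel, K(x,y) = e^(-distance(x,y)^2); the training pairs are hypothetical.

```python
# Sketch: every training instance votes for its class with a weight that
# decays with distance, using the kernel K(x,y) = exp(-distance(x,y)^2).
import math
from collections import defaultdict

def kernel_vote(x, training_set):
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    votes = defaultdict(float)
    for inst, label in training_set:
        votes[label] += math.exp(-dist(inst, x) ** 2)   # closer instances vote more heavily
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "yes"), ((5.0, 5.0), "no"), ((1.5, 0.5), "yes")]
print(kernel_vote((4.0, 4.5), train))   # -> "no"
```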
Choosing the parameter k • Divide the data into train and test sets • Run multiple values of k on the training set • Choose the k that does best on the test set
NOT • This is a serious methodological error • You have used the test data to pick k • Common in commercial evaluations of systems • Occasionally seen in academic papers
Fix: Internal Cross-validation • This can be used for selecting any parameter • Divide the data into train and test sets • Now do 10-fold CV on the training data to determine the appropriate value of k • Note: never touch the test data
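A sketch of this procedure using scikit-learn rather than Weka's IBk, with the iris data standing in for a real data set; the candidate values of k are an assumption.

```python
# Sketch: internal cross-validation for choosing k. Only the training data is
# split into folds; the held-out test set is touched exactly once at the end.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 10-fold CV on the training data only, for each candidate k.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_train, y_train, cv=10).mean()
    if score > best_score:
        best_k, best_score = k, score

# Evaluate on the untouched test set once, with the chosen k.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))
```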
Probability Class Vector • Let A be an attribute with values v1, v2, ..., vn • Suppose the classes are C1, C2, ..., Ck • The prob class vector for vi is <P(C1|A=vi), P(C2|A=vi), ..., P(Ck|A=vi)> • Distance(vi, vj) = distance between their probability class vectors
Weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Distance(sunny, rainy) = ? • <P(play=yes|sunny), P(play=no|sunny)> = <2/5, 3/5> = prob class vector for sunny • <P(play=yes|rainy), P(play=no|rainy)> = <3/5, 2/5> = prob class vector for rainy • Distance(sunny, rainy) = sqrt( (2/5 - 3/5)^2 + (3/5 - 2/5)^2 ) = (1/5)*sqrt(2) • Similarly: distance(sunny, overcast) = d(<2/5, 3/5>, <4/4, 0/4>) = (3/5)*sqrt(2)
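A sketch that reproduces these numbers from the weather data above (plain Python, illustration only); the class-count code assumes the outlook/play pairs exactly as listed in the ARFF data.

```python
# Sketch: probability class vectors for the outlook values and the Euclidean
# distance between them, using the (outlook, play) pairs from the weather data.
import math
from collections import Counter

data = [("sunny","no"), ("sunny","no"), ("overcast","yes"), ("rainy","yes"),
        ("rainy","yes"), ("rainy","no"), ("overcast","yes"), ("sunny","no"),
        ("sunny","yes"), ("rainy","yes"), ("sunny","yes"), ("overcast","yes"),
        ("overcast","yes"), ("rainy","no")]
classes = ["yes", "no"]

def pcv(value):
    """Probability class vector <P(yes|outlook=value), P(no|outlook=value)>."""
    counts = Counter(cls for v, cls in data if v == value)
    total = sum(counts.values())
    return [counts[c] / total for c in classes]

def pcv_distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pcv(v1), pcv(v2))))

print(pcv_distance("sunny", "rainy"))     # (1/5)*sqrt(2) ~ 0.283
print(pcv_distance("sunny", "overcast"))  # (3/5)*sqrt(2) ~ 0.849
```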
PCV • If an attribute is irrelevant and v and v’ are two of its values, then PCV(v) ≈ PCV(v’), so the distance will be close to 0 • This discounts irrelevant attributes • It also works for real-valued attributes, after binning • Binning is a way to make real values symbolic: simply break the data into k bins (k = 5 or 10 seems to work), or use decision trees; see the sketch below
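A sketch of simple equal-width binning; the choice of k = 5 bins and the bin-label naming are assumptions.

```python
# Sketch: equal-width binning to make a real-valued attribute symbolic
# before computing probability class vectors.
def equal_width_bins(values, k=5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0   # avoid division by zero if all values are equal
    # Map each value to a symbolic bin label "bin0" .. "bin{k-1}".
    return ["bin%d" % min(int((v - lo) / width), k - 1) for v in values]

# The temperature column from the weather data.
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
print(equal_width_bins(temps, k=5))
```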
Regression by NN • If 1-NN, use the value of the nearest example • If k-NN, interpolate (e.g. average) the values of the k nearest neighbors • Kernel methods work too; you avoid the choice of k, but hide it in the choice of kernel function
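A sketch of k-NN regression by averaging the target values of the k nearest neighbors; the training pairs and k = 3 are assumptions.

```python
# Sketch: predict the unweighted average of the target values of the
# k nearest training instances.
import math

def knn_regress(x, training_set, k=3):
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(training_set, key=lambda pair: dist(pair[0], x))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical one-feature training pairs.
train = [((1.0,), 1.2), ((2.0,), 2.1), ((3.0,), 2.9), ((8.0,), 8.3)]
print(knn_regress((2.5,), train, k=3))   # average of the 3 nearest targets
```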
Summary • NN works for multi-class problems and for regression • Sometimes called the “poor man’s neural net” • With enough data, its error rate is at most twice the Bayes-optimal error rate • Can be misled by bad examples and bad features • Separates classes via piecewise-linear boundaries