Machine Learning 21431


Presentation Transcript


  1. Machine Learning 21431 Instance Based Learning

  2. Outline • K-Nearest Neighbor • Locally weighted learning • Local linear models • Radial basis functions

  3. Literature & Software • T. Mitchell, “Machine Learning”, chapter 8, “Instance-Based Learning” • “Locally Weighted Learning”, Christopher Atkeson, Andrew Moore, Stefan Schaal, ftp:/ftp.cc.gatech.edu/pub/people/cga/air.html • R. Duda et al., “Pattern Classification”, chapter 4, “Non-Parametric Techniques” • Netlab toolbox • k-nearest neighbor classification • Radial basis function networks

  4. When to Consider Nearest Neighbors • Instances map to points in RN • Fewer than 20 attributes per instance • Lots of training data Advantages: • Training is very fast • Can learn complex target functions • Does not lose information Disadvantages: • Slow at query time • Easily fooled by irrelevant attributes

  5. Instance Based Learning Key idea: just store all training examples <xi, f(xi)> Nearest neighbor: • Given query instance xq, first locate the nearest training example xn, then estimate f*(xq) = f(xn) K-nearest neighbor: • Given xq, take a vote among its k nearest neighbors (if the target function is discrete-valued) • Take the mean of the f values of the k nearest neighbors (if real-valued): f*(xq) = (1/k) Σi=1..k f(xi)
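
As a concrete illustration of the two estimators above, a minimal NumPy sketch (the function name, the Euclidean distance, and the tie-breaking via Counter are assumptions, not prescribed by the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3, classify=True):
    """Plain k-NN: majority vote for discrete targets, mean of f values for real-valued ones."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distance to every stored example
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    if classify:
        return Counter(y_train[nearest]).most_common(1)[0][0]   # vote among neighbors
    return y_train[nearest].mean()                               # f*(x_q) = (1/k) Σ f(x_i)
```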

  6. Voronoi Diagram [figure: Voronoi diagram of the training set, with the query point and its nearest neighbor marked]

  7. 3-Nearest Neighbors [figure: query point with its 3 nearest neighbors: 2 labeled x, 1 labeled o]

  8. 7-Nearest Neighbors [figure: query point with its 7 nearest neighbors: 3 labeled x, 4 labeled o]

  9. Behavior in the Limit • Consider p(x), the probability that instance x is labeled positive (1) versus negative (0) • Nearest neighbor: as the number of training instances → ∞, approaches the Gibbs algorithm Gibbs algorithm: with probability p(x) predict 1, else 0 • K-nearest neighbor: as the number of training instances → ∞ and k grows large, approaches the Bayes optimal classifier Bayes optimal: if p(x) > 0.5 predict 1, else 0

  10. Nearest Neighbor (continuous) 3-nearest neighbor

  11. Nearest Neighbor (continuous) 5-nearest neighbor

  12. Nearest Neighbor (continuous) 1-nearest neighbor

  13. Locally Weighted Regression • Forms an explicit local approximation f*(x) for the region surrounding the query point xq • Fit a linear function to the k nearest neighbors • or fit a quadratic function • Produces a piecewise approximation of f • Squared error over the k nearest neighbors: E(xq) = Σ xi ∈ k nearest neighbors (f*(xi) − f(xi))² • Distance-weighted error over all training examples: E(xq) = Σi (f*(xi) − f(xi))² K(d(xi, xq))

  14. Locally Weighted Regression • Regression means approximating a real-valued target function • Residual is the error f*(x)-f(x) in approximating the target function • Kernel function is the function of distance that is used to determine the weight of each training example. In other words, the kernel function is the function K such that wi=K(d(xi,xq))

  15. Distance Weighted k-NN Give more weight to neighbors closer to the query point: f*(xq) = Σi=1..k wi f(xi) / Σi=1..k wi where wi = K(d(xq, xi)) and d(xq, xi) is the distance between xq and xi. Instead of only the k nearest neighbors, all training examples can be used (Shepard’s method).
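
A sketch of this prediction rule with the inverse-square kernel of slide 18 (the names and the eps guard against zero distance are assumptions; setting k = len(X_train) gives Shepard's method over all training examples):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_q, k=5, eps=1e-12):
    """Distance-weighted k-NN: f*(x_q) = Σ w_i f(x_i) / Σ w_i with w_i = 1/d(x_q, x_i)^2."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]                 # k closest training examples
    w = 1.0 / (dists[nearest] ** 2 + eps)           # eps avoids division by zero at exact matches
    return np.sum(w * y_train[nearest]) / np.sum(w)
```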

  16. Distance Weighted Average • Weighting the data: f*(xq) = Σi f(xi) K(d(xi, xq)) / Σi K(d(xi, xq)) The relevance of a data point (xi, f(xi)) is measured by the distance d(xi, xq) between the query xq and the input vector xi • Weighting the error criterion: E(xq) = Σi (f*(xq) − f(xi))² K(d(xi, xq)) The best estimate f*(xq) minimizes the cost E(xq), therefore ∂E(xq)/∂f*(xq) = 0, which yields exactly the weighted average above

  17. Kernel Functions

  18. Distance Weighted NN K(d(xq, xi)) = 1 / d(xq, xi)²

  19. Distance Weighted NN K(d(xq, xi)) = 1 / (d0 + d(xq, xi))²

  20. Distance Weighted NN K(d(xq, xi)) = exp(−(d(xq, xi)/σ0)²)
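
The three kernels of slides 18–20 as small helpers (d0 and sigma0 stand for the offset and width constants; the default values are placeholders):

```python
import numpy as np

def kernel_inverse_square(d):
    return 1.0 / d ** 2                       # slide 18: 1 / d^2

def kernel_offset_inverse_square(d, d0=0.1):
    return 1.0 / (d0 + d) ** 2                # slide 19: 1 / (d0 + d)^2, stays finite at d = 0

def kernel_gaussian(d, sigma0=1.0):
    return np.exp(-(d / sigma0) ** 2)         # slide 20: exp(-(d / sigma0)^2)
```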

  21. Example: Mexican Hat f(x1, x2) = sin(x1) sin(x2) / (x1 x2) [figure: locally weighted approximation of f]

  22. Example: Mexican Hat [figure: residual of the approximation]

  23. Locally Weighted Linear Regression • Local linear model: f*(x) = w0 + Σn wn xn • Error criterion: E(xq) = Σi (w0 + Σn wn xin − f(xi))² K(d(xi, xq)) • Gradient descent: Δwn = η Σi (f(xi) − f*(xi)) xin K(d(xi, xq)) • Least squares solution: w = ((KX)ᵀ KX)⁻¹ (KX)ᵀ f(X), where KX is the N×M matrix whose rows are K(d(xi, xq)) xi and f(X) is the vector whose i-th element is f(xi)
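
A sketch of the least squares route, written in the square-root weighting form of slide 27 (weight both the data matrix and the targets by K^(1/2), then solve an ordinary least squares problem); the Gaussian kernel, bandwidth, and names are assumptions:

```python
import numpy as np

def lwlr_predict(X, y, x_q, sigma=0.5):
    """Locally weighted linear regression: fit weights near x_q, then evaluate at x_q."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])            # bias column for w0
    d = np.linalg.norm(X - x_q, axis=1)
    k = np.exp(-(d / sigma) ** 2)                   # kernel weights K(d(x_i, x_q))
    sw = np.sqrt(k)                                 # square-root weighting of data and targets
    w, *_ = np.linalg.lstsq(sw[:, None] * Xb, sw * y, rcond=None)
    return np.r_[1.0, x_q] @ w                      # f*(x_q) = w0 + Σ w_n x_qn
```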

  24. Curse of Dimensionality Imagine instances described by 20 attributes, but only a few are relevant to the target function Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional One approach: • Stretch the j-th axis by weight zj, where z1, …, zn are chosen to minimize prediction error • Use cross-validation to automatically choose the weights z1, …, zn • Note that setting zj to zero eliminates that dimension altogether (feature subset selection)

  25. Linear Global Models • The model is linear in the parameters βk, which can be estimated with a least squares algorithm • f^(xi) = Σk=1..D βk xik, or F(X) = X β, where xi = (xi1, …, xiD), i = 1..N, with D the input dimension and N the number of data points • Estimate the βk by minimizing the error criterion E = Σi=1..N (f^(xi) − yi)² • Normal equations: (XᵀX) β = Xᵀ y • β = (XᵀX)⁻¹ Xᵀ y • In components: βk = Σm=1..D Σn=1..N [(XᵀX)⁻¹]km xnm yn
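
A minimal global fit corresponding to β = (XᵀX)⁻¹ Xᵀ y; np.linalg.lstsq is used rather than the explicit inverse for numerical stability, and the variable names are illustrative:

```python
import numpy as np

def fit_global_linear(X, y):
    """Global linear model: minimize sum_i (x_i . beta - y_i)^2 over all N data points."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves the normal equations (X^T X) beta = X^T y
    return beta

# Usage: predictions for the whole data set are F(X) = X beta
# beta = fit_global_linear(X, y); y_hat = X @ beta
```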

  26. Linear Regression Example

  27. Linear Local Models • Estimate the parameters βk such that they locally (near the query point xq) match the training data, either by • weighting the data: wi = K(d(xi, xq))^(1/2), and transforming zi = wi xi, vi = wi yi • or by weighting the error criterion: E = Σi=1..N (xiᵀβ − yi)² K(d(xi, xq)) which is still linear in β, with least squares solution β = ((WX)ᵀ WX)⁻¹ (WX)ᵀ v, where W = diag(wi) and v = (v1, …, vN)
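
A small numerical check (with made-up data) that the two routes above coincide: square-root weighting of the data followed by ordinary least squares gives the same β as solving the kernel-weighted normal equations directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)); y = rng.normal(size=50)
x_q = np.zeros(3)
k = np.exp(-np.linalg.norm(X - x_q, axis=1) ** 2)     # kernel weights K(d(x_i, x_q))

# Route 1: weight the data with w_i = K^(1/2), then ordinary least squares
w = np.sqrt(k)
beta1, *_ = np.linalg.lstsq(w[:, None] * X, w * y, rcond=None)

# Route 2: weight the error criterion, i.e. solve (X^T K X) beta = X^T K y
K = np.diag(k)
beta2 = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)

assert np.allclose(beta1, beta2)
```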

  28. Linear Local Model Example [figure: kernel K(x, xq) and local linear model f^(x) = β1 x + β0 around the query point xq = 0.35, giving f^(xq) = 0.266]

  29. Linear Local Model Example

  30. Design Issues in Local Regression • Local model order (constant, linear, quadratic) • Distance function d with feature scaling: d(x, q) = (Σj=1..d mj (xj − qj)²)^(1/2); irrelevant dimensions get mj = 0 • Kernel function K • Smoothing parameter (bandwidth) h in K(d(x, q)/h): • h = |m|: global bandwidth • h = distance to the k-th nearest neighbor • h = h(q): depending on the query point • h = hi: depending on the stored data points See the paper by Atkeson et al. [1996], “Locally Weighted Learning”

  31. Radial Basis Function Network • Global approximation to target function in terms of linear combination of local approximations • Used, e.g. for image classification • Similar to back-propagation neural network but activation function is Gaussian rather than sigmoid • Closely related to distance-weighted regression but ”eager” instead of ”lazy”

  32. Radial Basis Function Network [figure: input layer xi, kernel functions Kn(d(xn, x)) = exp(−½ d(xn, x)²/σn²), linear parameters wn, output f(x)] f(x) = w0 + Σn=1..k wn Kn(d(xn, x))
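
A sketch of this forward pass with Gaussian units; centers, widths, and weights are assumed to be given already (training them is the topic of the next slide):

```python
import numpy as np

def rbf_forward(x, centers, sigmas, w):
    """RBF network output: f(x) = w[0] + sum_n w[n] * exp(-0.5 * d(x_n, x)^2 / sigma_n^2)."""
    d2 = np.sum((centers - x) ** 2, axis=1)     # squared distances to the k centers
    phi = np.exp(-0.5 * d2 / sigmas ** 2)       # Gaussian activations K_n
    return w[0] + phi @ w[1:]                   # w has length k + 1, bias w0 first
```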

  33. Training Radial Basis Function Networks • How to choose the center xn for each kernel function Kn? • scatter them uniformly across the instance space • use the distribution of the training instances (clustering) • How to train the weights? • Choose the mean xn and variance σn² for each Kn by non-linear optimization or EM • Or hold the Kn fixed and use local linear regression to compute the optimal weights wn
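
A sketch of the second strategy: fix the kernels (centers chosen here as a random subset of the training data, standing in for clustering, with a shared width), then solve for the output weights, which enter linearly, by least squares. All names and the width choice are assumptions:

```python
import numpy as np

def train_rbf(X, y, k=10, sigma=1.0, seed=0):
    """Fix k Gaussian kernels, then fit the linear output weights by least squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # stand-in for clustering
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, k) squared distances
    Phi = np.exp(-0.5 * d2 / sigma ** 2)                        # fixed activations K_n(x_i)
    Phi = np.hstack([np.ones((len(X), 1)), Phi])                # bias column for w0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                 # weights enter linearly
    return centers, w
```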

  34. Radial Basis Network Example [figure: two Gaussian kernels, e.g. K1(d(x1, x)) = exp(−½ d(x1, x)²/σ²), each gating a local linear model such as w1 x + w0] f^(x) = K1 (w1 x + w0) + K2 (w3 x + w2)

  35. Lazy and Eager Learning • Lazy: wait for the query before generalizing • k-nearest neighbors, weighted linear regression • Eager: generalize before seeing the query • radial basis function networks, decision trees, back-propagation, LOLIMOT • An eager learner must create a global approximation • A lazy learner can create local approximations • If they use the same hypothesis space, lazy can represent more complex functions (e.g. H = linear functions)

  36. Laboration 3 • Distance weighted average • Cross-validation for the optimal kernel width σ • Leave-one-out cross-validation: f*(xq) = Σi≠q f(xi) K(d(xi, xq)) / Σi≠q K(d(xi, xq)) • Cross-validation for feature subset selection • Neural network
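
A sketch of the leave-one-out selection of the kernel width: each point is predicted from all other points by the distance-weighted average, and the σ with the smallest squared error is kept. The Gaussian kernel and the candidate grid are assumptions:

```python
import numpy as np

def loo_select_sigma(X, y, sigmas=(0.1, 0.3, 1.0, 3.0)):
    """Pick the kernel width minimizing leave-one-out error of the distance-weighted average."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    errors = []
    for s in sigmas:
        K = np.exp(-D2 / s ** 2)
        np.fill_diagonal(K, 0.0)                 # leave x_q out of its own estimate (i != q)
        f_hat = (K @ y) / K.sum(axis=1)          # sum_{i!=q} K f(x_i) / sum_{i!=q} K
        errors.append(np.mean((f_hat - y) ** 2))
    return sigmas[int(np.argmin(errors))]
```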
