Instance-Based Learning (基于实例的学习) • Shanghai Jiao Tong University • Adapted from T. Mitchell, Machine Learning
Topics • Introduction • k-Nearest Neighbor Learning (kNN) • Locally Weighted Regression (LWR) • Radial Basis Functions (RBF): from IBL to neural networks • Case-Based Reasoning (CBR) • Conclusion
IBL: Basic Idea • Key idea: • Store all training examples • When seeing a new instance: • Look at the most similar stored instances • Make a prediction based on those instances • E.g. k-nearest-neighbour: • Find the k most similar instances • Use the most frequent class (classification) or the mean target value (regression) as the prediction • “Nearest neighbour” = 1-nearest-neighbour • (A usage sketch follows below.)
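To make the idea concrete, here is a hypothetical usage sketch with scikit-learn's KNeighborsClassifier; the loan-style toy data and numbers are invented for illustration and are not from the slides.

```python
# Hypothetical illustration of the k-NN idea using scikit-learn.
# The toy data below is invented for demonstration only.
from sklearn.neighbors import KNeighborsClassifier

# Training examples: [amount, monthly_salary], class = loan approved (1) or not (0)
X_train = [[1000, 3000], [5000, 2500], [1200, 4000], [8000, 2000]]
y_train = [1, 0, 1, 0]

clf = KNeighborsClassifier(n_neighbors=3)  # "learning" just stores the examples
clf.fit(X_train, y_train)

# At query time: find the 3 most similar stored instances and take a majority vote
print(clf.predict([[2000, 3500]]))
```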
Example: MVT (now part of Agilent) • Machine vision for inspection of PCBs • Components present or absent? • Solder joints good or bad?
Components present? • [Figure: example component images labelled “Absent” and “Present”]
Properties of IBL • Advantages: • Learning is very fast • No information is lost • Disadvantages: • Slow at query time • Easily fooled by irrelevant attributes (the curse of dimensionality) • Good similarity measure necessary • Easiest in numeric space (Rn)
Keeping All Information • [Figure: scatter of + and - training examples with a complex decision boundary] • Advantage: no details lost • Disadvantage: "details" may be noise
Lazy vs. Eager • D-Trees, Naïve Bayes and ANNs are examples of eager ML algorithms • A D-Tree is built in advance, off-line • Less work to do at run time • k-NN is a lazy approach • Little work is done off-line • Keep the training examples • Find the k nearest at run time
Lazy vs. Eager: Differences • An eager learner creates one global approximation • One theory has to work for all predictions • A lazy learner creates local approximations on demand • In a sense, many different theories are used • With the same hypothesis space H, the lazy learner is in effect more expressive
Classifying apples and pears • To what class does this belong?
Consider a Loan Approval System • What does similar mean? • [Figure: two loan cases compared on the features Amount, Monthly_Sal, Job Category, Credit Score, Age]
Imagine just 2 features • Amount • Monthly_Sal • [Figure: scatter plot of x and o examples in the Amount / Monthly_Sal plane]
k-NN and Noise • 1-NN is easy to implement • but susceptible to noise • a misclassification every time a noisy pattern is retrieved • k-NN with k ≥ 3 will overcome this
k-Nearest Neighbor Learning • Instances are points in the n-dimensional space R^n • Instance representation: the feature vector <a1(x), a2(x), …, an(x)> • Distance metric: Euclidean distance • Target function: discrete-valued or real-valued
k-Nearest Neighbor Learning • Training algorithm: • For each training example <x, f(x)>, add the example to the list training-examples • Classification algorithm: • Given a query instance xq to be classified, • Let x1 … xk denote the k instances from training-examples that are nearest to xq • Return f̂(xq) ← argmax over v in V of Σ(i=1..k) δ(v, f(xi)), where δ(a,b) = 1 if a = b and δ(a,b) = 0 otherwise
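A minimal from-scratch sketch of the training and classification steps above; the function names, the Euclidean helper, and the toy data are my own assumptions, not from the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_examples, xq, k=3):
    """Return the most common class among the k training examples nearest to xq.

    training_examples: list of (feature_vector, label) pairs ("the list" from training).
    """
    # Keep the k stored examples nearest to the query
    neighbours = sorted(training_examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    # argmax_v sum_i delta(v, f(x_i)) is just a majority vote over neighbour labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy example: classify a query point
examples = [((1.0, 2.0), "x"), ((2.0, 1.5), "x"), ((8.0, 9.0), "o"), ((9.0, 8.5), "o")]
print(knn_classify(examples, (1.5, 1.0), k=3))  # -> "x"
```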
k-Nearest Neighbor with a heterogeneous distance • D: the set of training samples • Find the k nearest neighbours of query q in D according to a difference criterion such as the Heterogeneous Euclidean-Overlap Metric (HEOM): for each x in D, distance(q, x) = sqrt( Σ over features f of d_f(q_f, x_f)^2 ), where d_f is the overlap metric (0 if equal, 1 otherwise) for symbolic features and the range-normalised absolute difference for numeric features • The category of q is decided by its k nearest neighbours
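A rough sketch of an HEOM-style distance, assuming range-normalised differences for numeric features and the overlap metric for symbolic ones; the helper name and the example ranges are illustrative only.

```python
import math

def heom(q, x, numeric_ranges):
    """Heterogeneous Euclidean-Overlap Metric (sketch).

    q, x: feature vectors that may mix numbers and symbols.
    numeric_ranges: dict mapping a numeric feature index to (max - min) over the data.
    """
    total = 0.0
    for i, (qi, xi) in enumerate(zip(q, x)):
        if i in numeric_ranges:                  # numeric feature: normalised difference
            d = abs(qi - xi) / numeric_ranges[i]
        else:                                    # symbolic feature: overlap metric
            d = 0.0 if qi == xi else 1.0
        total += d * d
    return math.sqrt(total)

# Example: features = (amount, job_category), amount range assumed to be 10000
print(heom((2000, "engineer"), (5000, "teacher"), numeric_ranges={0: 10000}))
```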
Voronoi Diagrams • Indicate the areas in which the prediction is influenced by the same set of examples • [Figure: Voronoi cells around the training points, with query point q and its nearest neighbour x]
3-Nearest Neighbors • [Figure: query point q with its 3 nearest neighbours: 2 x, 1 o]
7-Nearest Neighbors • [Figure: query point q with its 7 nearest neighbours: 3 x, 4 o]
Functions and feature influence: an example • To what class does this belong?
Rescaling Numeric Features • Features with different original scales should have equal importance • E.g. A1: [0, 1]; A2: [-10, +10] • Differences w.r.t. A1 are always small, hence A1 has little influence • Solution: divide the difference by the feature range (1 and 20 respectively), or by the standard deviation (see the sketch below)
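A tiny sketch of range-based rescaling, assuming min-max normalisation; the values are illustrative only.

```python
def rescale(value, feature_min, feature_max):
    """Min-max rescaling so every feature contributes on a comparable [0, 1] scale."""
    return (value - feature_min) / (feature_max - feature_min)

# A1 in [0, 1], A2 in [-10, 10]: after rescaling, both features lie in [0, 1]
a1, a2 = 0.7, 4.0
print(rescale(a1, 0.0, 1.0), rescale(a2, -10.0, 10.0))  # 0.7, 0.7
```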
Curse of Dimensionality • The curse of dimensionality: a high-dimensional instance space (many features) has a bad effect on learnability (and on running time!) • Especially bad for IBL: • Assume there are 20 attributes, of which only 2 are relevant • Similarity w.r.t. the 18 irrelevant attributes dominates similarity w.r.t. the 2 relevant ones!
Dimension reduction in k-NN • Feature Selection: not all features are required; noisy features are a hindrance (keep only the p best features) • Case Selection (Prototyping): some examples are redundant, and retrieval time depends on the number of examples (keep m covering examples out of the original n)
Condensed NN • D: set of training samples • Goal: find E ⊆ D such that the NN rule used with E is (almost) as good as with D • Algorithm: choose x ∈ D at random; D ← D \ {x}; E ← {x}; repeat: learning ← FALSE; for each x ∈ D: classify x by NN using E; if the classification is incorrect then E ← E ∪ {x}; D ← D \ {x}; learning ← TRUE; until learning = FALSE
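A runnable sketch of the condensed NN loop above, assuming a 1-NN classifier with Euclidean distance and (features, label) pairs; the function names and toy data are my own.

```python
import math, random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nn_classify(prototypes, x):
    """1-NN: label of the single nearest prototype."""
    return min(prototypes, key=lambda p: euclidean(p[0], x))[1]

def condensed_nn(D):
    """Reduce training set D (list of (features, label)) to a condensed set E."""
    D = list(D)
    random.shuffle(D)
    E = [D.pop()]                      # seed E with one randomly chosen example
    learning = True
    while learning:                    # repeat passes until nothing is misclassified
        learning = False
        remaining = []
        for x in D:
            if nn_classify(E, x[0]) != x[1]:
                E.append(x)            # absorb misclassified examples into E
                learning = True
            else:
                remaining.append(x)
        D = remaining
    return E

data = [((1, 1), "x"), ((1.2, 0.9), "x"), ((8, 8), "o"), ((8.1, 7.9), "o")]
print(condensed_nn(data))
```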
Condensed NN • [Figure: 100 examples from 2 categories; different runs give different CNN solutions]
Improving Condensed NN • Different outcomes depending on data order, which is a bad property for an algorithm • Better: identify exemplars near the decision surface first • In the diagram, B (near the decision surface) is more useful than A, so it should be added first
CNN using NUN (nearest unlike neighbour) • [Figure: Condensed NN with NUN ordering on the same 100 examples from 2 categories, compared with the plain CNN solutions]
Distance-weighted kNN • Idea: give higher weight to closer instances • Using all training instances instead of only the k nearest is known as Shepard's method
Distance-Weighted kNN • Give greater weight to closer neighbours • Classification: f̂(xq) ← argmax over v in V of Σ(i=1..k) wi δ(v, f(xi)) • Regression: f̂(xq) ← Σ(i=1..k) wi f(xi) / Σ(i=1..k) wi • where wi = 1 / d(xq, xi)^2
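A short sketch of distance-weighted kNN for a real-valued target, assuming Euclidean distance and the inverse-square weights above; names and data are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_weighted_knn(training_examples, xq, k=3):
    """Real-valued prediction: distance-weighted mean of the k nearest targets."""
    neighbours = sorted(training_examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    num = den = 0.0
    for x, fx in neighbours:
        d = euclidean(x, xq)
        if d == 0.0:
            return fx              # query coincides with a stored example
        w = 1.0 / (d * d)          # w_i = 1 / d(x_q, x_i)^2
        num += w * fx
        den += w
    return num / den

examples = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((10.0,), 20.0)]
print(distance_weighted_knn(examples, (2.5,), k=3))
```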
Discussion on kNN • A highly effective inductive inference method • Robust to noisy training data • Quite effective when the training set is large enough • Inductive bias: the classification of a query is assumed to be similar to that of nearby instances • Dealing with irrelevant attributes: • Stretch the axes locally (may overfit; less common) • Eliminate the least relevant attributes completely [Moore and Lee, 1994] • Index the instance library, e.g. with a kd-tree [Bentley, 1975] (see the sketch below)
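As an illustration of instance-library indexing, here is a hypothetical example with SciPy's cKDTree (SciPy is not mentioned in the slides); the random data stands in for stored instances.

```python
# Build a kd-tree index once, off-line, then answer neighbour queries quickly.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((10000, 5))           # 10,000 stored instances with 5 numeric features

tree = cKDTree(X)                    # off-line index construction
query = rng.random(5)
dist, idx = tree.query(query, k=3)   # k nearest neighbours without a linear scan
print(idx, dist)
```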
Locally weighted regression • [Figure: + data points along a curve in the X-Y plane, with a new instance marked on the X axis] • Obvious problem with this kind of data: given a new x, what value of y would you predict? • What will k-NN (e.g. 3-NN) predict?
LWR: Building local models • Build a local model in the region around xq • e.g. a linear or quadratic model, … • Minimising: • the squared error over the k nearest neighbours • or the distance-weighted squared error over all neighbours • …
Locally Weighted Regression • Terms: • REGRESSION means approximating a real-valued target function • RESIDUAL is the error in approximating the target function • KERNEL FUNCTION is the function of distance used to determine the WEIGHT of each training example: wi = K(d(xi, xq))
Locally Weighted Regression • A generalization of kNN: construct an explicit approximation of f in the neighborhood surrounding xq • The local approximation may use a linear function, a quadratic function, a multi-layer neural network, …
Locally Weighted Regression • Local linear model: f̂(x) = w0 + w1 a1(x) + … + wn an(x) • Criterion: minimise a (possibly distance-weighted) squared error over the neighbourhood of xq, e.g. E(xq) = 1/2 Σ over the k nearest neighbours x of xq of (f(x) - f̂(x))^2 K(d(xq, x)) • The weights can be found by gradient descent: Δwj = η Σ K(d(xq, x)) (f(x) - f̂(x)) aj(x) [Atkeson, 1997; Bishop, 1995]
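A minimal sketch of locally weighted linear regression, assuming a Gaussian kernel and a closed-form weighted least-squares fit rather than the gradient-descent rule above; the function names and data are illustrative.

```python
import numpy as np

def lwr_predict(X, y, xq, tau=1.0):
    """Fit a linear model around query xq, weighting examples by a Gaussian kernel."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([np.ones((len(X), 1)), X])           # add bias term w0
    xqb = np.hstack([1.0, np.asarray(xq, dtype=float)])
    d2 = np.sum((X - xq) ** 2, axis=1)                   # squared distances to the query
    k = np.exp(-d2 / (2 * tau ** 2))                     # kernel weights K(d(xq, x))
    W = np.diag(k)
    # Weighted least squares: w = (X^T W X)^(-1) X^T W y
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xqb @ w

X = [[1.0], [2.0], [3.0], [4.0], [10.0]]
y = [1.1, 1.9, 3.2, 3.9, 2.0]                            # locally linear, globally not
print(lwr_predict(X, y, [2.5], tau=1.0))
```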
Radial Basis Functions • Related to distance-weighted regression and to artificial neural networks [Powell, 1987; Broomhead & Lowe, 1988; Moody & Darken, 1989] • The learned hypothesis has the form f̂(x) = w0 + Σ(u=1..k) wu Ku(d(xu, x)) • where each xu is an instance from X and Ku decreases as the distance d increases; typically Ku is a Gaussian centred at xu
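A small sketch of an RBF network in the above form, assuming the kernel centres are a subset of the training instances, a fixed Gaussian width, and output weights fit by least squares; all names and data are my own.

```python
import numpy as np

def rbf_design_matrix(X, centres, sigma):
    """Columns: a bias term plus one Gaussian kernel per centre."""
    d2 = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), K])

def train_rbf(X, y, centres, sigma=1.0):
    """Return weights (w0, w1..wk) of f(x) = w0 + sum_u w_u K_u(d(x_u, x))."""
    Phi = rbf_design_matrix(X, centres, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.sin(X).ravel()
centres = X[::2]                                   # use every other instance as a centre
w = train_rbf(X, y, centres, sigma=1.0)
print(rbf_design_matrix(np.array([[1.5]]), centres, 1.0) @ w)   # prediction at x = 1.5
```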
Radial Basis Functions • Other design choices: • allocate one Gaussian kernel function per training example <xi, f(xi)>, then combine them • or choose only a subset of the training examples as kernel centres • Summary of RBF networks: • they provide a global approximation to the target function, represented as a linear combination of many local kernel functions • each kernel's contribution is negligible outside its defined region (centre/width) • they can be trained efficiently, since kernel centres and output weights can be fit separately
Case-Based Reasoning • Features: • A lazy learning method • Instances are represented by richer, symbolic descriptions • Instance retrieval methods are correspondingly more elaborate • Applications: • Conceptual design of mechanical devices based on previous experience [Sycara, 1992] • Reasoning about new legal cases based on previous rulings [Ashley, 1990] • Solving planning and scheduling problems by reusing and combining portions of previous solutions to similar problems [Veloso, 1992]
Reference: CADET • CADET is a Case-based Design Tool: a system that aids the conceptual design of electro-mechanical devices and is based on the paradigm of case-based reasoning. • CADET consists of sub-systems, including CARD (Case-based Retrieval for Design). http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cadet/ftp/docs/CADET.html
CBR vs. kNN • Rich symbolic/relational descriptions of instances may require a similarity metric other than Euclidean distance • Multiple retrieved cases may be combined by knowledge-based reasoning rather than statistical methods • Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Lazy Learning vs. Eager Learning • When does each generalize beyond the training data? • Whether the new query instance is taken into account when deciding how to generalize beyond the training data • Whether the method can select a different hypothesis or local approximation to the target function for each query instance
Conclusion • IBL: a family of lazy learning methods • kNN: an IBL method for real- and discrete-valued target functions • LWR: a generalization of kNN • RBF: a type of artificial neural network related to IBL • CBR: an IBL method using more complex symbolic descriptions of instances
References • Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997 • 蔡自兴, Artificial Intelligence and Its Applications (人工智能及其应用), Tsinghua University Press • 陆汝钤, Artificial Intelligence (人工智能), Science Press • 黄梯云, Intelligent Decision Support Systems (智能决策支持系统), Publishing House of Electronics Industry • Chinese Journal of Computers (计算机学报), 2002.6 & 2002.8 • Journal of Chinese Information Processing (中文信息学报), 2002.3 • 冯是聪, PhD thesis proposal • 程兆伟, master's thesis • 侯明强, bachelor's thesis