Instance-Based Learning (基于实例的学习) • Shanghai Jiao Tong University • Adapted from T. Mitchell, Machine Learning
Topics • Introduction • k-Nearest Neighbor Learning (kNN) • Locally Weighted Regression (LWR) • Radial Basis Functions (RBF): from IBL to neural networks • Case-Based Reasoning (CBR) • Conclusion
IBL: Basic Idea • Key idea: • Store all training examples • When seeing a new instance: • Look at the most similar stored instances • Make a prediction based on those instances • E.g. k-nearest-neighbour: • Find the k most similar instances • Use the most frequent class (classification) or the mean target value (regression) as the prediction • “Nearest neighbour” = 1-nearest-neighbour • (A usage sketch follows below.)
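To make the idea concrete, here is a hypothetical usage sketch with scikit-learn's KNeighborsClassifier; the loan-style toy data and numbers are invented for illustration and are not from the slides.

```python
# Hypothetical illustration of the k-NN idea using scikit-learn.
# The toy data below is invented for demonstration only.
from sklearn.neighbors import KNeighborsClassifier

# Training examples: [amount, monthly_salary], class = loan approved (1) or not (0)
X_train = [[1000, 3000], [5000, 2500], [1200, 4000], [8000, 2000]]
y_train = [1, 0, 1, 0]

clf = KNeighborsClassifier(n_neighbors=3)  # "learning" just stores the examples
clf.fit(X_train, y_train)

# At query time: find the 3 most similar stored instances and take a majority vote
print(clf.predict([[2000, 3500]]))
```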
Example: MVT (now part of Agilent) • Machine vision for inspection of PCBs • Components present or absent? • Solder joints good or bad?
Components present? • [Figure: example component images labelled “Absent” and “Present”]
Properties of IBL • Advantages: • Learning is very fast • No information is lost • Disadvantages: • Slow at query time • Easily fooled by irrelevant attributes (the curse of dimensionality) • Good similarity measure necessary • Easiest in numeric space (Rn)
Keeping All Information • [Figure: scatter of + and - training examples with a complex decision boundary] • Advantage: no details lost • Disadvantage: "details" may be noise
Lazy vs. Eager • D-Trees, Naïve Bayes and ANNs are examples of eager ML algorithms • A D-Tree is built in advance, off-line • Less work to do at run time • k-NN is a lazy approach • Little work is done off-line • Keep the training examples • Find the k nearest at run time
Lazy vs. Eager: Differences • An eager learner creates one global approximation • One theory has to work for all predictions • A lazy learner creates local approximations on demand • In a sense, many different theories are used • With the same hypothesis space H, the lazy learner is in effect more expressive
Classifying apples and pears • To what class does this belong?
Consider a Loan Approval System • What does similar mean? • [Figure: two loan cases compared on the features Amount, Monthly_Sal, Job Category, Credit Score, Age]
Imagine just 2 features • Amount • Monthly_Sal • [Figure: scatter plot of x and o examples in the Amount / Monthly_Sal plane]
k-NN and Noise • 1-NN is easy to implement • but susceptible to noise • a misclassification every time a noisy pattern is retrieved • k-NN with k ≥ 3 will overcome this
k-Nearest Neighbor Learning • Instances are points in the n-dimensional space R^n • Instance representation: the feature vector <a1(x), a2(x), …, an(x)> • Distance metric: Euclidean distance • Target function: discrete-valued or real-valued
k-Nearest Neighbor Learning • Training algorithm: • For each training example <x, f(x)>, add the example to the list training-examples • Classification algorithm: • Given a query instance xq to be classified, • Let x1 … xk denote the k instances from training-examples that are nearest to xq • Return f̂(xq) ← argmax over v in V of Σ(i=1..k) δ(v, f(xi)), where δ(a,b) = 1 if a = b and δ(a,b) = 0 otherwise
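A minimal from-scratch sketch of the training and classification steps above; the function names, the Euclidean helper, and the toy data are my own assumptions, not from the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_examples, xq, k=3):
    """Return the most common class among the k training examples nearest to xq.

    training_examples: list of (feature_vector, label) pairs ("the list" from training).
    """
    # Keep the k stored examples nearest to the query
    neighbours = sorted(training_examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    # argmax_v sum_i delta(v, f(x_i)) is just a majority vote over neighbour labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy example: classify a query point
examples = [((1.0, 2.0), "x"), ((2.0, 1.5), "x"), ((8.0, 9.0), "o"), ((9.0, 8.5), "o")]
print(knn_classify(examples, (1.5, 1.0), k=3))  # -> "x"
```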
k-Nearest Neighbor with a heterogeneous distance • D: the set of training samples • Find the k nearest neighbours of query q in D according to a difference criterion such as the Heterogeneous Euclidean-Overlap Metric (HEOM): for each x in D, distance(q, x) = sqrt( Σ over features f of d_f(q_f, x_f)^2 ), where d_f is the overlap metric (0 if equal, 1 otherwise) for symbolic features and the range-normalised absolute difference for numeric features • The category of q is decided by its k nearest neighbours
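A rough sketch of an HEOM-style distance, assuming range-normalised differences for numeric features and the overlap metric for symbolic ones; the helper name and the example ranges are illustrative only.

```python
import math

def heom(q, x, numeric_ranges):
    """Heterogeneous Euclidean-Overlap Metric (sketch).

    q, x: feature vectors that may mix numbers and symbols.
    numeric_ranges: dict mapping a numeric feature index to (max - min) over the data.
    """
    total = 0.0
    for i, (qi, xi) in enumerate(zip(q, x)):
        if i in numeric_ranges:                  # numeric feature: normalised difference
            d = abs(qi - xi) / numeric_ranges[i]
        else:                                    # symbolic feature: overlap metric
            d = 0.0 if qi == xi else 1.0
        total += d * d
    return math.sqrt(total)

# Example: features = (amount, job_category), amount range assumed to be 10000
print(heom((2000, "engineer"), (5000, "teacher"), numeric_ranges={0: 10000}))
```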
Voronoi Diagrams • Indicate the areas in which the prediction is influenced by the same set of examples • [Figure: Voronoi cells around the training points, with query point q and its nearest neighbour x]
3-Nearest Neighbors • [Figure: query point q with its 3 nearest neighbours: 2 x, 1 o]
7-Nearest Neighbors • [Figure: query point q with its 7 nearest neighbours: 3 x, 4 o]
Functions and feature influence: an example • To what class does this belong?
Rescaling Numeric Features • Features with different original scales should have equal importance • E.g. A1: [0, 1]; A2: [-10, +10] • Differences w.r.t. A1 are always small, hence A1 has little influence • Solution: divide the difference by the feature range (1 and 20 respectively), or by the standard deviation (see the sketch below)
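A tiny sketch of range-based rescaling, assuming min-max normalisation; the values are illustrative only.

```python
def rescale(value, feature_min, feature_max):
    """Min-max rescaling so every feature contributes on a comparable [0, 1] scale."""
    return (value - feature_min) / (feature_max - feature_min)

# A1 in [0, 1], A2 in [-10, 10]: after rescaling, both features lie in [0, 1]
a1, a2 = 0.7, 4.0
print(rescale(a1, 0.0, 1.0), rescale(a2, -10.0, 10.0))  # 0.7, 0.7
```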
Curse of Dimensionality • The curse of dimensionality: a high-dimensional instance space (many features) has a bad effect on learnability (and on running time!) • Especially bad for IBL: • Assume there are 20 attributes, of which only 2 are relevant • Similarity w.r.t. the 18 irrelevant attributes dominates similarity w.r.t. the 2 relevant ones!
Dimension reduction in k-NN • Feature Selection: not all features are required; noisy features are a hindrance (keep only the p best features) • Case Selection (Prototyping): some examples are redundant, and retrieval time depends on the number of examples (keep m covering examples out of the original n)
Condensed NN • D: set of training samples • Goal: find E ⊆ D such that the NN rule used with E is (almost) as good as with D • Algorithm: choose x ∈ D at random; D ← D \ {x}; E ← {x}; repeat: learning ← FALSE; for each x ∈ D: classify x by NN using E; if the classification is incorrect then E ← E ∪ {x}; D ← D \ {x}; learning ← TRUE; until learning = FALSE
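A runnable sketch of the condensed NN loop above, assuming a 1-NN classifier with Euclidean distance and (features, label) pairs; the function names and toy data are my own.

```python
import math, random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nn_classify(prototypes, x):
    """1-NN: label of the single nearest prototype."""
    return min(prototypes, key=lambda p: euclidean(p[0], x))[1]

def condensed_nn(D):
    """Reduce training set D (list of (features, label)) to a condensed set E."""
    D = list(D)
    random.shuffle(D)
    E = [D.pop()]                      # seed E with one randomly chosen example
    learning = True
    while learning:                    # repeat passes until nothing is misclassified
        learning = False
        remaining = []
        for x in D:
            if nn_classify(E, x[0]) != x[1]:
                E.append(x)            # absorb misclassified examples into E
                learning = True
            else:
                remaining.append(x)
        D = remaining
    return E

data = [((1, 1), "x"), ((1.2, 0.9), "x"), ((8, 8), "o"), ((8.1, 7.9), "o")]
print(condensed_nn(data))
```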
Condensed NN • [Figure: 100 examples from 2 categories; different runs give different CNN solutions]
Improving Condensed NN • Different outcomes depending on data order, which is a bad property for an algorithm • Better: identify exemplars near the decision surface first • In the diagram, B (near the decision surface) is more useful than A, so it should be added first
CNN using NUN (nearest unlike neighbour) • [Figure: Condensed NN with NUN ordering on the same 100 examples from 2 categories, compared with the plain CNN solutions]
Distance-weighted kNN • Idea: give higher weight to closer instances • Using all training instances instead of only the k nearest is known as Shepard's method
Distance-Weighted kNN • Give greater weight to closer neighbours • Classification: f̂(xq) ← argmax over v in V of Σ(i=1..k) wi δ(v, f(xi)) • Regression: f̂(xq) ← Σ(i=1..k) wi f(xi) / Σ(i=1..k) wi • where wi = 1 / d(xq, xi)^2
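A short sketch of distance-weighted kNN for a real-valued target, assuming Euclidean distance and the inverse-square weights above; names and data are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_weighted_knn(training_examples, xq, k=3):
    """Real-valued prediction: distance-weighted mean of the k nearest targets."""
    neighbours = sorted(training_examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    num = den = 0.0
    for x, fx in neighbours:
        d = euclidean(x, xq)
        if d == 0.0:
            return fx              # query coincides with a stored example
        w = 1.0 / (d * d)          # w_i = 1 / d(x_q, x_i)^2
        num += w * fx
        den += w
    return num / den

examples = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((10.0,), 20.0)]
print(distance_weighted_knn(examples, (2.5,), k=3))
```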
Discussion on kNN • A highly effective inductive inference method • Robust to noisy training data • Quite effective when the training set is large enough • Inductive bias: the classification of a query is assumed to be similar to that of nearby instances • Dealing with irrelevant attributes: • Stretch the axes locally (may overfit; less common) • Eliminate the least relevant attributes completely [Moore and Lee, 1994] • Index the instance library, e.g. with a kd-tree [Bentley, 1975] (see the sketch below)
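As an illustration of instance-library indexing, here is a hypothetical example with SciPy's cKDTree (SciPy is not mentioned in the slides); the random data stands in for stored instances.

```python
# Build a kd-tree index once, off-line, then answer neighbour queries quickly.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((10000, 5))           # 10,000 stored instances with 5 numeric features

tree = cKDTree(X)                    # off-line index construction
query = rng.random(5)
dist, idx = tree.query(query, k=3)   # k nearest neighbours without a linear scan
print(idx, dist)
```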
Locally weighted regression • [Figure: + data points along a curve in the X-Y plane, with a new instance marked on the X axis] • Obvious problem with this kind of data: given a new x, what value of y would you predict? • What will k-NN (e.g. 3-NN) predict?
LWR: Building local models • Build a local model in the region around xq • e.g. a linear or quadratic model, … • Minimising: • the squared error over the k nearest neighbours • or the distance-weighted squared error over all neighbours • …
Locally Weighted Regression • Terms: • REGRESSION means approximating a real-valued target function • RESIDUAL is the error in approximating the target function • KERNEL FUNCTION is the function of distance used to determine the WEIGHT of each training example: wi = K(d(xi, xq))
Locally Weighted Regression • A generalization of kNN: construct an explicit approximation of f in the neighborhood surrounding xq • The local approximation may use a linear function, a quadratic function, a multi-layer neural network, …
Locally Weighted Regression • Local linear model: f̂(x) = w0 + w1 a1(x) + … + wn an(x) • Criterion: minimise a (possibly distance-weighted) squared error over the neighbourhood of xq, e.g. E(xq) = 1/2 Σ over the k nearest neighbours x of xq of (f(x) - f̂(x))^2 K(d(xq, x)) • The weights can be found by gradient descent: Δwj = η Σ K(d(xq, x)) (f(x) - f̂(x)) aj(x) [Atkeson, 1997; Bishop, 1995]
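A minimal sketch of locally weighted linear regression, assuming a Gaussian kernel and a closed-form weighted least-squares fit rather than the gradient-descent rule above; the function names and data are illustrative.

```python
import numpy as np

def lwr_predict(X, y, xq, tau=1.0):
    """Fit a linear model around query xq, weighting examples by a Gaussian kernel."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([np.ones((len(X), 1)), X])           # add bias term w0
    xqb = np.hstack([1.0, np.asarray(xq, dtype=float)])
    d2 = np.sum((X - xq) ** 2, axis=1)                   # squared distances to the query
    k = np.exp(-d2 / (2 * tau ** 2))                     # kernel weights K(d(xq, x))
    W = np.diag(k)
    # Weighted least squares: w = (X^T W X)^(-1) X^T W y
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xqb @ w

X = [[1.0], [2.0], [3.0], [4.0], [10.0]]
y = [1.1, 1.9, 3.2, 3.9, 2.0]                            # locally linear, globally not
print(lwr_predict(X, y, [2.5], tau=1.0))
```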
Radial Basis Functions • Related to distance-weighted regression and to artificial neural networks [Powell, 1987; Broomhead & Lowe, 1988; Moody & Darken, 1989] • The learned hypothesis has the form f̂(x) = w0 + Σ(u=1..k) wu Ku(d(xu, x)) • where each xu is an instance from X and Ku decreases as the distance d increases; typically Ku is a Gaussian centred at xu
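A small sketch of an RBF network in the above form, assuming the kernel centres are a subset of the training instances, a fixed Gaussian width, and output weights fit by least squares; all names and data are my own.

```python
import numpy as np

def rbf_design_matrix(X, centres, sigma):
    """Columns: a bias term plus one Gaussian kernel per centre."""
    d2 = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), K])

def train_rbf(X, y, centres, sigma=1.0):
    """Return weights (w0, w1..wk) of f(x) = w0 + sum_u w_u K_u(d(x_u, x))."""
    Phi = rbf_design_matrix(X, centres, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.sin(X).ravel()
centres = X[::2]                                   # use every other instance as a centre
w = train_rbf(X, y, centres, sigma=1.0)
print(rbf_design_matrix(np.array([[1.5]]), centres, 1.0) @ w)   # prediction at x = 1.5
```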
Radial Basis Functions • Other design choices: • allocate one Gaussian kernel function per training example <xi, f(xi)>, then combine them • or choose only a subset of the training examples as kernel centres • Summary of RBF networks: • they provide a global approximation to the target function, represented as a linear combination of many local kernel functions • each kernel's contribution is negligible outside its defined region (centre/width) • they can be trained efficiently, since kernel centres and output weights can be fit separately
Case-Based Reasoning • Features: • A lazy learning method • Instances are represented by richer, symbolic descriptions • Instance retrieval methods are correspondingly more elaborate • Applications: • Conceptual design of mechanical devices based on previous experience [Sycara, 1992] • Reasoning about new legal cases based on previous rulings [Ashley, 1990] • Solving planning and scheduling problems by reusing and combining portions of previous solutions to similar problems [Veloso, 1992]
Reference: CADET • CADET is a Case-based Design Tool: a system that aids the conceptual design of electro-mechanical devices and is based on the paradigm of case-based reasoning. • CADET consists of sub-systems, including CARD (Case-based Retrieval for Design). http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cadet/ftp/docs/CADET.html
CBR vs. kNN • Rich symbolic/relational descriptions of instances may require a similarity metric other than Euclidean distance • Multiple retrieved cases may be combined by knowledge-based reasoning rather than statistical methods • Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Lazy Learning vs. Eager Learning • When does each generalize beyond the training data? • Whether the new query instance is taken into account when deciding how to generalize beyond the training data • Whether the method can select a different hypothesis or local approximation to the target function for each query instance
Conclusion • IBL: a family of lazy learning methods • kNN: an IBL method for real- and discrete-valued target functions • LWR: a generalization of kNN • RBF: a type of artificial neural network related to IBL • CBR: an IBL method using more complex symbolic descriptions of instances
References • Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997 • 蔡自兴, Artificial Intelligence and Its Applications (人工智能及其应用), Tsinghua University Press • 陆汝钤, Artificial Intelligence (人工智能), Science Press • 黄梯云, Intelligent Decision Support Systems (智能决策支持系统), Publishing House of Electronics Industry • Chinese Journal of Computers (计算机学报), 2002.6 & 2002.8 • Journal of Chinese Information Processing (中文信息学报), 2002.3 • 冯是聪, PhD thesis proposal • 程兆伟, master's thesis • 侯明强, bachelor's thesis