CS 6243 Machine Learning Instance-based learning
Lazy vs. Eager Learning • Eager learning (e.g., decision tree learning): Given a training set, constructs a classification model before receiving new (e.g., test) data to classify • Lazy learning (e.g., instance-based learning): Simply stores the training data (with at most minor processing) and waits until it is given a test tuple • Lazy: less time in training but more time in predicting
Nearest neighbor classifier - Basic idea • For each test case h • Find the k training instances that are closest to h • Return the most frequent class label
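A minimal sketch of this basic procedure in Python/NumPy (the toy data, the choice of k = 3, and the use of Euclidean distance are illustrative assumptions, not part of the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Return the most frequent class label among the k training instances closest to x_test."""
    # Euclidean distance from the test case to every training instance
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example (assumed data)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```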
Practical issues • Similarity / distance function • Number of neighbors • Instance weighting • Attribute weighting • Algorithms / data structure to improve efficiency • Explicit concept generalization
Similarity / distance measure • Euclidean distance • City-block (Manhattan) distance • Dot product / cosine function • Good for high-dimensional sparse feature vectors • Popular in document classification / information retrieval • Pearson correlation coefficient • Measures linear dependency • Popular in biology • Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
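The measures listed above could be coded as follows (a sketch; the function names are my own, and each function assumes NumPy arrays of equal length):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):          # city-block distance
    return np.sum(np.abs(a - b))

def cosine_similarity(a, b):  # good for high-dimensional sparse feature vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):            # measures linear dependency
    a_c, b_c = a - a.mean(), b - b.mean()
    return np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

def nominal_distance(a, b):   # nominal attributes: 0 if values are equal, 1 otherwise
    return int(a != b)
```

Note that cosine similarity and Pearson correlation are similarities, not distances; for nearest-neighbor search one would rank by their negation (or by 1 minus the similarity).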
Normalization and other issues • Different attributes are measured on different scales and need to be normalized: a_i = (v_i − min v_i) / (max v_i − min v_i), where v_i is the actual value of attribute i • Row normalization / column normalization • Common policy for missing values: assumed to be maximally distant (given normalized attributes) (Witten & Eibe)
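A sketch of the column (per-attribute) min-max rescaling given above; the guard against zero-range attributes is my own addition:

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise normalization to [0, 1]: a_i = (v_i - min v_i) / (max v_i - min v_i)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs - mins == 0, 1, maxs - mins)  # avoid division by zero for constant attributes
    return (X - mins) / ranges
```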
Number of neighbors • 1-NN is sensitive to noisy instances • In general, the larger the number of training instances, the larger the value of k • k can be determined by minimizing the estimated classification error (using cross-validation) • Search over K = 1, 2, 3, …, Kmax; choose the search limit Kmax based on compute constraints • Estimate the average classification error for each K • Pick the K that minimizes the estimated error
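One way this search could look (a sketch, reusing knn_predict from the earlier example; the fold count and Kmax default are assumptions):

```python
import numpy as np

def choose_k(X, y, k_max=25, n_folds=5):
    """Pick k in 1..k_max that minimizes cross-validated classification error."""
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    errors = np.zeros(k_max)
    for k in range(1, k_max + 1):
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
            preds = [knn_predict(X[train_idx], y[train_idx], X[i], k) for i in test_idx]
            errors[k - 1] += np.mean(preds != y[test_idx])  # accumulate fold error
    return int(np.argmin(errors)) + 1
```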
Instance weighting • We might want to weight nearer neighbors more heavily • Each nearest neighbor casts its vote with a weight • Final prediction is the class with the highest sum of weights • In this case we may use all instances (no need to choose k) • Shepard’s method • Can also do numerical prediction
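A sketch of distance-weighted voting over all training instances; the inverse-distance weight 1/d is one common choice and is an assumption here, as is the small epsilon that guards against division by zero:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_test):
    """Distance-weighted vote over all training instances (inverse-distance weights)."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    weights = 1.0 / (dists + 1e-12)           # nearer neighbors get heavier votes
    votes = defaultdict(float)
    for label, w in zip(y_train, weights):
        votes[label] += w
    return max(votes, key=votes.get)          # class with the highest sum of weights
```

For numerical prediction, the vote would be replaced by a weighted average of the neighbors' target values.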
Attribute weighting • Simple strategy: • Calculate correlation between attribute values and class labels • More relevant attributes have higher weights • More advanced strategy: • Iterative updating (IBk) • Slides for Ch6
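The simple correlation-based strategy could be sketched as follows (assumes numeric class labels, e.g., 0/1; the use of the absolute Pearson correlation as the weight is an illustrative choice):

```python
import numpy as np

def correlation_attribute_weights(X, y):
    """Weight each attribute by |Pearson correlation| between its values and the class labels."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

def weighted_euclidean(a, b, w):
    """Euclidean distance with per-attribute weights."""
    return np.sqrt(np.sum(w * (a - b) ** 2))
```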
Other issues • Algorithms / data structures to improve efficiency • Data structures for efficiently finding nearest neighbors: kD tree, ball tree • Does not affect classification results • Ch4 slides • Algorithms to select prototypes • May affect classification results • IBk, Ch6 slides • Concept generalization • Should we do it or not? • Ch6 slides
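As an illustration of the efficiency point, SciPy's kD tree can answer nearest-neighbor queries without scanning all training instances; the random data below is an assumption, and the search returns the same neighbors as brute force:

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 8)      # assumed training data (10,000 instances, 8 attributes)
tree = cKDTree(X_train)                 # build the kD tree once

x_test = np.random.rand(8)
dists, idx = tree.query(x_test, k=5)    # k nearest neighbors without a full linear scan
print(idx)                              # indices into X_train; their labels vote as before
```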
Discussion of kNN • Pros: • Often very accurate • Easy to implement • Fast to train • Arbitrary decision boundary • Cons: • Classification is slow (remedy: ball tree, prototype selection) • Assumes all attributes are equally important (remedy: attribute selection or weights, but, still, curse of dimensionality) • No explicit knowledge discovery (Witten & Eibe)
Decision boundary (Sec. 14.6) [figure]