Learn about instance based learning and nearest neighbor classification in machine learning. Understand the importance of kD-trees, locally weighted learning, and the challenges of noisy data. Discover strategies for reducing the number of exemplars and tuning the value of K.
Machine Learning in Practice, Lecture 21 Carolyn Penstein Rosé Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day • Announcements • No quiz • Last assignment Thursday • 2nd midterm goes out next Thursday after class • Finish Instance Based Learning • Weka helpful hints • Clustering • Advanced Statistical Models • More on Optimization and Tuning
Instance Based Learning • Rote learning is at the extreme end of instance based representations • A more general form of instance based representation is where membership is computed based on a similarity measure between a centroid vector and the vector of the example instance • Advantage: Possible to learn incrementally
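A minimal sketch of the centroid-based similarity idea described above (plain NumPy, not a Weka implementation): compute one centroid per class and assign a new instance to the class whose centroid is closest. All function and variable names here are illustrative.

```python
import numpy as np

def train_centroids(X, y):
    """Compute one centroid vector per class label."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify(x, centroids):
    """Assign x to the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

# Incremental learning is easy here: keep per-class counts and running means.
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3]])
y = np.array(["a", "a", "b", "b"])
centroids = train_centroids(X, y)
print(classify(np.array([4.9, 5.0]), centroids))  # -> "b"
```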
Why “Lazy”? • Instance based learners are called “lazy” because no model is built at training time; all of the work is deferred until a new instance has to be classified. [Image: http://www.cs.mcgill.ca/~cs644/Godfried/2005/Fall/mbouch/rubrique_fichiers/image003-3.png]
Finding Nearest Neighbors Efficiently • Brute force method – compute the distance between the new vector and every vector in the training set, and pick the one with the smallest distance • Better method: divide the search space, and strategically select relevant regions so you only have to compare to a subset of instances
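A hedged sketch of the brute-force method in plain NumPy: compare the new vector against every training vector and return the label of the closest one. Names are illustrative.

```python
import numpy as np

def brute_force_nn(X_train, y_train, x_new):
    """Compute the distance to every training instance and return the
    label of the single closest one (O(n) work per query)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(dists)]
```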
kD-trees • kD-trees partition the space so that nearest neighbors can be found more efficiently • Each split takes place along one attribute and splits the examples at the parent node roughly in half • Split points chosen in such a way as to keep the tree as balanced as possible (*not* to optimize for accuracy or information gain) • Can you guess why you would want the tree to be as balanced as possible? • Hint – think about computational complexity of search algorithms
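To illustrate the partitioning idea, here is a sketch using scikit-learn's KDTree rather than Weka. The point is only that a balanced space-partitioning tree answers nearest-neighbor queries in roughly logarithmic expected time (in low dimensions) instead of scanning every instance.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))

tree = KDTree(X_train)                 # splits alternate across the attributes
query = rng.normal(size=(1, 3))
dist, idx = tree.query(query, k=1)     # expected O(log n) search in low dimensions
print(idx[0][0], dist[0][0])
```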
[Figure: kD-tree search – the space is split into regions A–G; the algorithm descends the tree for the new instance and returns an approximate nearest neighbor from the relevant regions]
[Figure: the same kD-tree search, with two refinements] • Tweak: sometimes you average over the k nearest neighbors rather than taking the absolute nearest neighbor • Tweak: use ball-shaped regions rather than rectangles to keep the number of overlapping regions down
Locally Weighted Learning • Base predictions on models trained specifically for regions within the vector space • Caveat! This is an over-simplification of what’s happening • Weighting of examples accomplished as in cost sensitive classification • Similar idea to M5P (learning separate regressions for different regions of the vector space determined by a path through a decision tree)
Locally Weighted Learning • LBR (Lazy Bayesian Rules) is Bayesian classification that relaxes the independence assumptions using similarity between training and test instances • Only assumes independence within a neighborhood • LWL is a general locally weighted learning approach • Note that Bayesian networks are another way of taking non-independence into account with probabilistic models by explicitly modeling interactions (see last section of Chapter 6)
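A rough sketch of the general locally weighted learning idea (not Weka's LWL or LBR): weight each training instance by its similarity to the test instance, in the style of cost-sensitive classification, then fit a simple base model with those weights. The kernel width and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lwl_predict(X_train, y_train, x_test, bandwidth=1.0):
    """Weight training instances by closeness to x_test, fit a simple base
    classifier on those weights, and predict for x_test only."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    weights = np.exp(-(dists / bandwidth) ** 2)       # Gaussian kernel weights
    model = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
    return model.predict(x_test.reshape(1, -1))[0]
```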
Problems with Nearest-Neighbor Classification • Slow for large numbers of exemplars • Performs poorly with noisy data if only the single closest exemplar is used for classification • All attributes contribute equally to the distance comparison • If no normalization is done, then attributes with the biggest range have the biggest effect, regardless of their importance for classification
Problems with Nearest-Neighbor Classification • Even if you normalize, you still have the problem that attributes are not weighted by importance • Normally does not do any sort of explicit generalization
Reducing the Number of Exemplars • Normally unnecessary to retain all examples ever seen • Ideally only one important example per section of instance space is needed • One strategy that works reasonably well is to only keep exemplars that were initially classified wrong • Over time the number of exemplars kept increases, and the error rate goes down
Reducing the Number of Exemplars • One problem is that sometimes it is not clear that an exemplar is important until sometime after it has been thrown away • Also, this strategy of keeping just those exemplars that are classified wrong is bad for noisy data, because it will tend to keep the noisy examples
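A sketch of the "keep only the exemplars that were initially misclassified" strategy from the previous slide, in plain NumPy; the helper names are made up for illustration.

```python
import numpy as np

def incremental_exemplar_store(stream, initial):
    """Keep a new example only if the current exemplar store misclassifies it."""
    X_keep = [initial[0]]
    y_keep = [initial[1]]
    for x, y in stream:                    # stream of (vector, label) pairs
        dists = [np.linalg.norm(x - xi) for xi in X_keep]
        predicted = y_keep[int(np.argmin(dists))]
        if predicted != y:                 # wrong -> this example adds information
            X_keep.append(x)
            y_keep.append(y)
    return np.array(X_keep), np.array(y_keep)
```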
Tuning K for K-Nearest Neighbors • Compensates for noise * Tune for the optimal value of K (see the sketch below)
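One way to tune K is cross-validation. A scikit-learn sketch is below; note that Weka's IBk crossValidate option uses leave-one-out, whereas this snippet assumes 5-fold cross-validation for simplicity.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def tune_k(X, y, candidate_ks=(1, 3, 5, 7, 9, 15)):
    """Pick the K with the best 5-fold cross-validated accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=5).mean()
              for k in candidate_ks}
    return max(scores, key=scores.get), scores
```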
Pruning Noisy Examples • Using success ratios, it is possible to reduce the number of examples you are paying attention to based on their observed reliability • You can compute a success ratio for every instance within range K of the new instance, based on the accuracy of its predictions, computed over the examples seen since it was added to the space
Pruning Noisy Examples • Keep an upper and lower threshold • Throw out examples that fall below the lower threshold • Only use exemplars that are above the upper threshold • But keep updating the success ratio of all exemplars
Don’t do anything rash! • We can compute confidence intervals on the success ratios based on the number of observations we have made • You won’t pay attention to an exemplar that just happens to look good at first • You won’t throw instances away carelessly; exemplars that stay unreliable will eventually be thrown out
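A simplified sketch of the success-ratio bookkeeping with upper and lower thresholds. The confidence interval here is a normal approximation (a Wilson interval would also work), and the threshold values are illustrative, not from the lecture.

```python
import math

class Exemplar:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.hits, self.trials = 0, 0   # updated every time this exemplar votes

    def success_ratio(self):
        return self.hits / self.trials if self.trials else 0.5

    def interval(self, z=1.96):
        """Approximate 95% confidence interval on the success ratio."""
        if self.trials == 0:
            return 0.0, 1.0
        p = self.success_ratio()
        half = z * math.sqrt(p * (1 - p) / self.trials)
        return p - half, p + half

def usable(ex, upper=0.7):
    lo, _ = ex.interval()
    return lo >= upper    # only use exemplars confidently above the upper threshold

def droppable(ex, lower=0.4):
    _, hi = ex.interval()
    return hi <= lower    # only discard exemplars confidently below the lower threshold
```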
What do we do about irrelevant attributes? • You can compensate for irrelevant attributes by scaling attribute values based on importance • Attribute weights modified after a new example is added to the space • Use the most similar exemplar to the new training instance
What do we do about irrelevant attributes? • Adjust the weights so that the new instance comes closer to the most similar exemplar if it classified it correctly or farther away if it was wrong • Weights are usually renormalized after this adjustment • Weights will be trained to emphasize attributes that lead to useful generalizations
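A sketch of one plausible attribute-weighting update in that spirit (not the exact formula from any particular algorithm): after adding a new training instance, find its most similar exemplar and nudge each attribute weight up or down depending on whether that exemplar classified it correctly. Attributes are assumed normalized to [0, 1], and the learning rate is an illustrative choice.

```python
import numpy as np

def update_attribute_weights(weights, x_new, y_new, x_nearest, y_nearest, lr=0.1):
    """Boost attributes on which the pair agrees when the prediction was right,
    penalize them when it was wrong, then renormalize."""
    diff = np.abs(x_new - x_nearest)          # per-attribute disagreement in [0, 1]
    if y_new == y_nearest:
        weights = weights + lr * (1 - diff)   # pull the pair closer together
    else:
        weights = weights - lr * (1 - diff)   # push the pair farther apart
    weights = np.clip(weights, 1e-6, None)
    return weights / weights.sum()            # renormalize after the adjustment
```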
Instance Based Learning with Generalization • Instances are generalized to regions • Allows instance based learning algorithms to behave like other machine learning algorithms (just another complex decision boundary) • Key idea is determining how far to generalize from each instance
IB1: Plain Vanilla Nearest Neighbor Algorithm • Keeps all training instances, doesn’t normalize • Uses Euclidean distance • Bases prediction on the first instance found with the shortest distance • Nothing to optimize • Published in 1991 by my AI programming professor from UCI!
IBK: More general than IB1 • kNN: how many neighbors to pay attention to • crossValidate: use leave one out cross-validation to select the optimal K • distanceWeighting: allows you to select the method for weighting based on distance • meanSquared: if it’s true, use mean squared error rather than absolute error for regression problems
IBK: More general than IB1 • noNormalization: turns off normalization • windowSize: sets the maximum number of instances to keep. Prunes off older instances when necessary. 0 means no limit.
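The option names above are Weka's IBk parameters. For readers working outside Weka, a roughly equivalent scikit-learn configuration looks like the following; the mapping is approximate and the parameter values are illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

# kNN -> n_neighbors; distanceWeighting -> weights="distance".
# There is no direct noNormalization flag: scale (or don't scale) the data yourself.
clf = KNeighborsClassifier(n_neighbors=5, weights="distance")
```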
K* • Uses an entropy based distance metric rather than Euclidean distance • Much slower than IBK! • Optimizations related to concepts we aren’t learning in this course • Allows you to choose what to do with missing values
What is special about K*? • Distance is computed based on a computation of how many transformation operations it would take to map one vector onto another • There may be multiple transformation paths, and all of them are taken into account • So the distance is an average over all possible transformation paths (randomly generated – so branching factor matters!) • That’s why it’s slow!!! • Allows for a more natural way of handling distance when your attribute space has many different types of attributes
What is special about K*? • Also allows a natural way of handling unknown values (probabilistically imputing values) • K* is likely to do better than other approaches if you have lots of unknown values or a very heterogeneous feature space (in terms of types of features)
Locally Weighted Numeric Prediction • Two Main types of trees used for numeric prediction • Regression trees: average values computed at leaf nodes • Model trees: regression functions trained at leaf nodes • Rather than maximize information gain, these algorithms minimize variation within subsets at leaf nodes
Locally Weighted Numeric Prediction • Locally weighted regression is an alternative to regression trees where the regression is computed at testing time rather than training time • Compute a regression for instances that are close to the testing instance
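A sketch of locally weighted (kernel-weighted) linear regression computed at prediction time, using NumPy's weighted least squares; the bandwidth and names are illustrative.

```python
import numpy as np

def locally_weighted_regression(X, y, x_query, bandwidth=1.0):
    """Fit a weighted linear regression around x_query and predict there.
    The fit happens at testing time, one query point at a time."""
    dists = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))   # Gaussian kernel weights
    Xb = np.hstack([np.ones((len(X), 1)), X])          # add an intercept column
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
    return np.concatenate([[1.0], x_query]) @ beta
```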
Summary of Locally Weighted Learning • Use Instance Based Learning together with a base classifier – almost like a wrapper • Learn a model within a neighborhood • Basic idea: approximate non-linear function learning with simple linear algorithms
Summary of Locally Weighted Learning • Big advantage: allows for incremental learning, whereas things like SVM do not • If you don’t need the incrementality, then it is probably better not to go with instance based learning
Take Home Message • Many ways of evaluating similarity of instances, which lead to different results • Instance based learning and clustering both make use of these approaches • Locally weighted learning is another way (besides the “kernel trick”) to get nonlinearity into otherwise linear approaches
Remember SMOreg vs SMO… • SMO is for classification • SMOreg is for numeric prediction!
Setting the Exponent in SMO * Note that an exponent larger than 1.0 means you are using a non-linear kernel.
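To see why an exponent above 1.0 makes the kernel non-linear, here is a scikit-learn analogue of the polynomial kernel setting (SVC is not Weka's SMO, but its degree parameter plays the same role as the exponent).

```python
from sklearn.svm import SVC

linear_model = SVC(kernel="poly", degree=1)     # exponent 1.0: a linear boundary
nonlinear_model = SVC(kernel="poly", degree=2)  # exponent 2.0: a quadratic boundary
```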
What is clustering? • Finding natural groupings of your data • Not supervised! No class attribute. • Usually only works well if you have a huge amount of data!
What does clustering do? • Finds natural breaks in your data • If there are obvious clusters, you can do this with a small amount of data • If you have lots of weak predictors, you need a huge amount of data to make it work
Clustering in Weka * You can pick which clustering algorithm you want to use and how many clusters you want.
Clustering in Weka * Clustering is unsupervised, so you want it to ignore your class attribute! [Screenshot callouts: “Click here” / “Select the class attribute”]
Clustering in Weka * You can evaluate the clustering in comparison with class attribute assignments
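Outside Weka, the same workflow – cluster on the predictors only, then compare the clustering against the class labels – might look like this with scikit-learn's KMeans. The "classes to clusters" comparison here is just a cross-tabulation, a simplification of Weka's evaluation; all names are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

def cluster_and_compare(X, y, n_clusters=3):
    """Cluster without looking at the class attribute, then cross-tabulate
    cluster assignments against the classes to see how well they line up."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return pd.crosstab(pd.Series(y, name="class"), pd.Series(clusters, name="cluster"))
```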
Adding a Cluster Feature * You should set it explicitly to ignore the class attribute * Set the pulldown menu to No Class
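Adding the cluster assignment as a new attribute, again as a scikit-learn sketch rather than the Weka filter: the class attribute is excluded from X before clustering, and the names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def add_cluster_feature(X, n_clusters=3):
    """Append each instance's cluster id as an extra attribute."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_ids = km.fit_predict(X)                # class attribute excluded from X
    return np.column_stack([X, cluster_ids]), km   # keep km to assign test instances
```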
Why add cluster features? [Figure: instances labeled Class 1 and Class 2 plotted in the attribute space, showing how cluster structure relates to the classes]