K nearest neighbor and Rocchio algorithm LING 572 Fei Xia 1/11/2007
Announcement • Hw2 is online now. It is due on Jan 20. • Hw1 is due at 11pm on Jan 13 (Sat). • Lab session after the class. • Read DT Tutorial before next Tues’ class.
Instance-based (IB) learning • No training: store all training instances. “Lazy learning” • Examples: • kNN • Locally weighted regression • Radial basis functions • Case-based reasoning • … • The most well-known IB method: kNN
kNN • For a new document d, • find k training documents that are closest to d. • perform majority voting or weighted voting. • Properties: • A “lazy” classifier. No training. • Feature selection and distance measure are crucial.
The algorithm • Determine the parameter K • Calculate the distance between the query instance and all the training instances • Sort the distances and determine the K nearest neighbors • Gather the labels of the K nearest neighbors • Use simple majority voting or weighted voting to predict the label.
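A minimal sketch of these steps in Python. The data representation (lists of numeric feature vectors with label values) and the use of plain Euclidean distance are illustrative assumptions, not something the slides fix:

```python
import math
from collections import Counter

def euclidean(x, y):
    # distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, train_x, train_y, k=3):
    # 1-2. compute the distance from the query to every training instance
    dists = [(euclidean(query, x), y) for x, y in zip(train_x, train_y)]
    # 3. sort and keep the K nearest neighbors
    neighbors = sorted(dists, key=lambda t: t[0])[:k]
    # 4-5. gather their labels and take a simple majority vote
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```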
Picking K • Use N-fold cross validation: pick the K that minimizes cross-validation error.
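One possible way to do this, reusing the knn_classify() helper from the sketch above; the interleaved fold construction and the candidate values of K are illustrative assumptions:

```python
def cv_error(train_x, train_y, k, n_folds=5):
    # N-fold cross-validation error for a given K
    errors = 0
    for i in range(n_folds):
        test_idx = list(range(i, len(train_x), n_folds))  # simple interleaved folds
        held_out = set(test_idx)
        tr_x = [x for j, x in enumerate(train_x) if j not in held_out]
        tr_y = [y for j, y in enumerate(train_y) if j not in held_out]
        for j in test_idx:
            if knn_classify(train_x[j], tr_x, tr_y, k) != train_y[j]:
                errors += 1
    return errors / len(train_x)

def pick_k(train_x, train_y, candidates=(1, 3, 5, 7, 9)):
    # choose the K with the lowest cross-validation error
    return min(candidates, key=lambda k: cv_error(train_x, train_y, k))
```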
Normalizing attribute values • Distance could be dominated by some attributes with large numbers: • Ex: features: age, income • Original data: x1=(35, 76K), x2=(36, 80K), x3=(70, 79K) • Assume: age ∈ [0, 100], income ∈ [0, 200K] • After normalization: x1=(0.35, 0.38), x2=(0.36, 0.40), x3=(0.70, 0.395).
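A small sketch of this min-max scaling, using the assumed ranges from the slide:

```python
def normalize(x, ranges):
    # scale each attribute to [0, 1] given a (min, max) pair per dimension
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(x, ranges))

ranges = [(0, 100), (0, 200_000)]        # age in [0, 100], income in [0, 200K]
print(normalize((35, 76_000), ranges))   # (0.35, 0.38)
print(normalize((70, 79_000), ranges))   # (0.7, 0.395)
```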
The Choice of Features • Imagine there are 100 features, and only 2 of them are relevant to the target label. • kNN is easily misled in high-dimensional space → use feature weighting or feature selection.
Feature weighting • Stretch the j-th axis by weight wj • Use cross-validation to automatically choose the weights w1, …, wn • Setting wj to zero eliminates that dimension altogether.
Similarity measure • Euclidean distance: d(x, y) = sqrt(Σj (xj − yj)²) • Weighted Euclidean distance: dw(x, y) = sqrt(Σj wj (xj − yj)²) • Similarity measure: cosine, sim(x, y) = (x · y) / (||x|| ||y||)
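A sketch of the weighted distance and the cosine similarity (the plain Euclidean case appears in the earlier kNN sketch); the vector representation is the same illustrative list-of-numbers assumption as before:

```python
import math

def weighted_euclidean(x, y, w):
    # weight w_j stretches the j-th axis; w_j = 0 drops that dimension
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))

def cosine_sim(x, y):
    # cosine of the angle between two vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```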
Voting • Majority voting: c* = arg maxc Σi δ(c, fi(x)), where δ(c, fi(x)) is 1 if neighbor xi has label c and 0 otherwise • Weighted voting: weight each neighbor's vote: c* = arg maxc Σi wi δ(c, fi(x)), with wi = 1/dist(x, xi) • With weighted voting we can use all the training examples.
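A sketch of weighted voting over all training examples, again assuming the illustrative vector representation and Euclidean distance:

```python
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_vote(query, train_x, train_y):
    # every training example votes; each vote is weighted by 1/distance
    scores = defaultdict(float)
    for x, y in zip(train_x, train_y):
        d = euclidean(query, x)
        if d == 0:
            return y          # exact match: take its label directly
        scores[y] += 1.0 / d
    return max(scores, key=scores.get)
```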
Summary of kNN • Strengths: • Simplicity (conceptual) • Efficiency at training time: no training • Handling multi-class problems • Stability and robustness: averaging over k neighbors • Prediction accuracy: good when the training data is large • Weaknesses: • Efficiency at testing time: need to compute the distance to every training instance • Theoretical validity • It is not clear which distance measure and features to use.
Relevance Feedback for IR • The issue: vocabulary mismatch, e.g., the query says “plane” but relevant documents say “aircraft” • Take advantage of user feedback on the relevance of documents to improve IR results: • The user issues a short, simple query. • The user marks returned documents as relevant or non-relevant. • The system computes a better representation of the information need based on the feedback. • Relevance feedback can go through one or more iterations. • Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate.
Rocchio Algorithm • The Rocchio algorithm incorporates relevance feedback information into the vector space model. • Want to maximize sim(Q, Cr) − sim(Q, Cnr) • The optimal query vector for separating relevant and non-relevant documents (with cosine similarity): Qopt = (1/|Cr|) Σdj∈Cr dj − (1/(N − |Cr|)) Σdj∉Cr dj • Qopt = optimal query; Cr = set of relevant doc vectors; N = collection size
Rocchio 1971 Algorithm (SMART) • qm = α q0 + β (1/|Dr|) Σdj∈Dr dj − γ (1/|Dnr|) Σdj∈Dnr dj • qm = modified query vector; q0 = original query vector; α, β, γ: weights • Dr = set of known relevant doc vectors; Dnr = set of known non-relevant doc vectors
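A sketch of this query update on dense term-weight vectors; the default weights alpha=1.0, beta=0.75, gamma=0.15 are common textbook choices, not values given on the slide:

```python
def rocchio_update(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # qm = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)
    dim = len(q0)

    def centroid(docs):
        if not docs:
            return [0.0] * dim
        return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]

    cr = centroid(rel_docs)       # centroid of known relevant doc vectors
    cnr = centroid(nonrel_docs)   # centroid of known non-relevant doc vectors
    return [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(dim)]
```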
Relevance feedback assumptions • Relevance prototypes are “well-behaved”: • Term distributions in relevant documents will be similar to each other. • Term distributions in non-relevant documents will be different from those in relevant documents. • Either: all relevant documents are tightly clustered around a single prototype. • Or: there are different prototypes, but they have significant vocabulary overlap. • Similarities between relevant and non-relevant documents are small.
Rocchio Algorithm for text classification • Training time: construct a set of prototype vectors, one vector per class. • Testing time: for a new document, find the most similar prototype vector.
Training time • Prototype vector for class j: cj = α (1/|Cj|) Σd∈Cj d/||d|| − β (1/|D − Cj|) Σd∈D−Cj d/||d|| • Cj: the set of positive examples for class j • D: the set of positive and negative examples • α, β: weights
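A sketch of building one prototype per class from length-normalized document vectors; the function names and the dense-list document representation are illustrative assumptions:

```python
import math

def unit(v):
    # length-normalize a document vector
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def centroid(docs, dim):
    if not docs:
        return [0.0] * dim
    return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]

def train_prototypes(docs, labels, classes, alpha=1.0, beta=1.0):
    # one prototype per class j:
    #   alpha * centroid(positive examples) - beta * centroid(negative examples)
    dim = len(docs[0])
    protos = {}
    for c in classes:
        pos = [unit(d) for d, y in zip(docs, labels) if y == c]
        neg = [unit(d) for d, y in zip(docs, labels) if y != c]
        cp, cn = centroid(pos, dim), centroid(neg, dim)
        protos[c] = [alpha * cp[i] - beta * cn[i] for i in range(dim)]
    return protos
```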
Why this formula? • Rocchio shows that when α = β = 1, each prototype vector maximizes the average similarity to its positive examples minus the average similarity to its negative examples. • But how does maximizing this quantity connect to classification accuracy?
Testing time • Given a new document d, assign it to the class whose prototype vector is most similar to d: c* = arg maxj sim(d, cj)
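A short sketch of the testing step with cosine similarity, using the prototype dictionary produced by the training sketch above:

```python
import math

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def classify(d, protos):
    # assign d to the class whose prototype is most similar to it
    return max(protos, key=lambda c: cosine_sim(d, protos[c]))
```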
kNN vs. Rocchio • kNN: • Lazy learning: no training • Uses all the training instances at testing time • Rocchio algorithm: • At training time, calculate the prototype vectors. • At testing time, use only the prototype vectors instead of all the training instances. • Linear classifier: not as expressive as kNN.
Summary of Rocchio • Strengths: • Simplicity (conceptual) • Efficiency at training time • Efficiency at testing time • Handling multi-class problems • Weaknesses: • Theoretical validity • Stability and robustness • Prediction accuracy: it does not work well when the categories are not linearly separable.
Three major design choices • The weight of feature fk in document di: e.g., tf-idf • The document length normalization • The similarity measure: e.g., cosine
Extending Rocchio? • Generalized instance set (GIS) algorithm (Lam and Ho, 1998). • …