K nearest neighbor and Rocchio algorithm LING 572 Fei Xia 1/11/2007
Announcement • Hw2 is online now. It is due on Jan 20. • Hw1 is due at 11pm on Jan 13 (Sat). • Lab session after the class. • Read DT Tutorial before next Tues’ class.
Instance-based (IB) learning • No training: store all training instances. “Lazy learning” • Examples: • kNN • Locally weighted regression • Radial basis functions • Case-based reasoning • … • The most well-known IB method: kNN
kNN • For a new document d, • find k training documents that are closest to d. • perform majority voting or weighted voting. • Properties: • A “lazy” classifier. No training. • Feature selection and distance measure are crucial.
The algorithm • Determine the parameter K • Calculate the distance between the query instance and all the training instances • Sort the distances and determine the K nearest neighbors • Gather the labels of the K nearest neighbors • Use simple majority voting or weighted voting to predict the label.
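A minimal sketch of these steps in Python. The data representation (lists of numeric feature vectors with label values) and the use of plain Euclidean distance are illustrative assumptions, not something the slides fix:

```python
import math
from collections import Counter

def euclidean(x, y):
    # distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, train_x, train_y, k=3):
    # 1-2. compute the distance from the query to every training instance
    dists = [(euclidean(query, x), y) for x, y in zip(train_x, train_y)]
    # 3. sort and keep the K nearest neighbors
    neighbors = sorted(dists, key=lambda t: t[0])[:k]
    # 4-5. gather their labels and take a simple majority vote
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```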
Picking K • Use N-fold cross validation: pick the K that minimizes cross-validation error.
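One possible way to do this, reusing the knn_classify() helper from the sketch above; the interleaved fold construction and the candidate values of K are illustrative assumptions:

```python
def cv_error(train_x, train_y, k, n_folds=5):
    # N-fold cross-validation error for a given K
    errors = 0
    for i in range(n_folds):
        test_idx = list(range(i, len(train_x), n_folds))  # simple interleaved folds
        held_out = set(test_idx)
        tr_x = [x for j, x in enumerate(train_x) if j not in held_out]
        tr_y = [y for j, y in enumerate(train_y) if j not in held_out]
        for j in test_idx:
            if knn_classify(train_x[j], tr_x, tr_y, k) != train_y[j]:
                errors += 1
    return errors / len(train_x)

def pick_k(train_x, train_y, candidates=(1, 3, 5, 7, 9)):
    # choose the K with the lowest cross-validation error
    return min(candidates, key=lambda k: cv_error(train_x, train_y, k))
```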
Normalizing attribute values • Distance could be dominated by some attributes with large numbers: • Ex: features: age, income • Original data: x1=(35, 76K), x2=(36, 80K), x3=(70, 79K) • Assume: age ∈ [0, 100], income ∈ [0, 200K] • After normalization: x1=(0.35, 0.38), x2=(0.36, 0.40), x3=(0.70, 0.395).
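A small sketch of this min-max scaling, using the assumed ranges from the slide:

```python
def normalize(x, ranges):
    # scale each attribute to [0, 1] given a (min, max) pair per dimension
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(x, ranges))

ranges = [(0, 100), (0, 200_000)]        # age in [0, 100], income in [0, 200K]
print(normalize((35, 76_000), ranges))   # (0.35, 0.38)
print(normalize((70, 79_000), ranges))   # (0.7, 0.395)
```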
The Choice of Features • Imagine there are 100 features, and only 2 of them are relevant to the target label. • kNN is easily misled in high-dimensional space → use feature weighting or feature selection.
Feature weighting • Stretch the j-th axis by weight wj • Use cross-validation to automatically choose the weights w1, …, wn • Setting wj to zero eliminates that dimension altogether.
Similarity measure • Euclidean distance: d(x, y) = sqrt(Σj (xj − yj)²) • Weighted Euclidean distance: dw(x, y) = sqrt(Σj wj (xj − yj)²) • Similarity measure: cosine, sim(x, y) = (x · y) / (||x|| ||y||)
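A sketch of the weighted distance and the cosine similarity (the plain Euclidean case appears in the earlier kNN sketch); the vector representation is the same illustrative list-of-numbers assumption as before:

```python
import math

def weighted_euclidean(x, y, w):
    # weight w_j stretches the j-th axis; w_j = 0 drops that dimension
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))

def cosine_sim(x, y):
    # cosine of the angle between two vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```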
Voting • Majority voting: c* = arg maxc Σi δ(c, fi(x)), where δ(c, fi(x)) is 1 if neighbor xi has label c and 0 otherwise • Weighted voting: weight each neighbor's vote: c* = arg maxc Σi wi δ(c, fi(x)), with wi = 1/dist(x, xi) • With weighted voting we can use all the training examples.
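A sketch of weighted voting over all training examples, again assuming the illustrative vector representation and Euclidean distance:

```python
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_vote(query, train_x, train_y):
    # every training example votes; each vote is weighted by 1/distance
    scores = defaultdict(float)
    for x, y in zip(train_x, train_y):
        d = euclidean(query, x)
        if d == 0:
            return y          # exact match: take its label directly
        scores[y] += 1.0 / d
    return max(scores, key=scores.get)
```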
Summary of kNN • Strengths: • Simplicity (conceptual) • Efficiency at training time: no training • Handling multi-class problems • Stability and robustness: averaging over k neighbors • Prediction accuracy: good when the training data is large • Weaknesses: • Efficiency at testing time: need to compute the distance to every training instance • Theoretical validity • It is not clear which distance measure and features to use.
Relevance Feedback for IR • The issue: vocabulary mismatch, e.g., the query says “plane” but relevant documents say “aircraft” • Take advantage of user feedback on the relevance of documents to improve IR results: • The user issues a short, simple query. • The user marks returned documents as relevant or non-relevant. • The system computes a better representation of the information need based on the feedback. • Relevance feedback can go through one or more iterations. • Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate.
Rocchio Algorithm • The Rocchio algorithm incorporates relevance feedback information into the vector space model. • Want to maximize sim(Q, Cr) − sim(Q, Cnr) • The optimal query vector for separating relevant and non-relevant documents (with cosine similarity): Qopt = (1/|Cr|) Σdj∈Cr dj − (1/(N − |Cr|)) Σdj∉Cr dj • Qopt = optimal query; Cr = set of relevant doc vectors; N = collection size
Rocchio 1971 Algorithm (SMART) • qm = α q0 + β (1/|Dr|) Σdj∈Dr dj − γ (1/|Dnr|) Σdj∈Dnr dj • qm = modified query vector; q0 = original query vector; α, β, γ: weights • Dr = set of known relevant doc vectors; Dnr = set of known non-relevant doc vectors
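A sketch of this query update on dense term-weight vectors; the default weights alpha=1.0, beta=0.75, gamma=0.15 are common textbook choices, not values given on the slide:

```python
def rocchio_update(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # qm = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)
    dim = len(q0)

    def centroid(docs):
        if not docs:
            return [0.0] * dim
        return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]

    cr = centroid(rel_docs)       # centroid of known relevant doc vectors
    cnr = centroid(nonrel_docs)   # centroid of known non-relevant doc vectors
    return [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(dim)]
```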
Relevance feedback assumptions • Relevance prototypes are “well-behaved”: • Term distributions in relevant documents will be similar to each other. • Term distributions in non-relevant documents will be different from those in relevant documents. • Either: all relevant documents are tightly clustered around a single prototype. • Or: there are different prototypes, but they have significant vocabulary overlap. • Similarities between relevant and non-relevant documents are small.
Rocchio Algorithm for text classification • Training time: construct a set of prototype vectors, one vector per class. • Testing time: for a new document, find the most similar prototype vector.
Training time • Prototype vector for class j: cj = α (1/|Cj|) Σd∈Cj d/||d|| − β (1/|D − Cj|) Σd∈D−Cj d/||d|| • Cj: the set of positive examples for class j • D: the set of positive and negative examples • α, β: weights
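A sketch of building one prototype per class from length-normalized document vectors; the function names and the dense-list document representation are illustrative assumptions:

```python
import math

def unit(v):
    # length-normalize a document vector
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def centroid(docs, dim):
    if not docs:
        return [0.0] * dim
    return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]

def train_prototypes(docs, labels, classes, alpha=1.0, beta=1.0):
    # one prototype per class j:
    #   alpha * centroid(positive examples) - beta * centroid(negative examples)
    dim = len(docs[0])
    protos = {}
    for c in classes:
        pos = [unit(d) for d, y in zip(docs, labels) if y == c]
        neg = [unit(d) for d, y in zip(docs, labels) if y != c]
        cp, cn = centroid(pos, dim), centroid(neg, dim)
        protos[c] = [alpha * cp[i] - beta * cn[i] for i in range(dim)]
    return protos
```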
Why this formula? • Rocchio shows that when α = β = 1, each prototype vector maximizes the average similarity to its positive examples minus the average similarity to its negative examples. • But how does maximizing this quantity connect to classification accuracy?
Testing time • Given a new document d, assign it to the class whose prototype vector is most similar to d: c* = arg maxj sim(d, cj)
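A short sketch of the testing step with cosine similarity, using the prototype dictionary produced by the training sketch above:

```python
import math

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def classify(d, protos):
    # assign d to the class whose prototype is most similar to it
    return max(protos, key=lambda c: cosine_sim(d, protos[c]))
```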
kNN vs. Rocchio • kNN: • Lazy learning: no training • Uses all the training instances at testing time • Rocchio algorithm: • At training time, calculate the prototype vectors. • At testing time, use only the prototype vectors instead of all the training instances. • Linear classifier: not as expressive as kNN.
Summary of Rocchio • Strengths: • Simplicity (conceptual) • Efficiency at training time • Efficiency at testing time • Handling multi-class problems • Weaknesses: • Theoretical validity • Stability and robustness • Prediction accuracy: it does not work well when the categories are not linearly separable.
Three major design choices • The weight of feature fk in document di: e.g., tf-idf • The document length normalization • The similarity measure: e.g., cosine
Extending Rocchio? • Generalized instance set (GIS) algorithm (Lam and Ho, 1998). • …