New Algorithms for Efficient High-Dimensional Nonparametric Classification Ting Liu, Andrew W. Moore, and Alexander Gray
Overview • Introduction • k Nearest Neighbors (k-NN) • KNS1: conventional k-NN search • New algorithms for k-NN classification • KNS2: for skewed-class data • KNS3: "are at least t of the k-NN positive?" • Results • Comments
Introduction: k-NN • k-NN • Nonparametric classification method. • Given a data set of n data points, it finds the k points closest to a query point and assigns the query the majority label among them. • The computational cost is too high for many applications, especially in the high-dimensional case.
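As a point of reference for the cost being discussed, here is a minimal brute-force sketch of k-NN classification. It assumes Euclidean distance and points stored as tuples; the function name `knn_classify` and the toy data are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def knn_classify(train, labels, q, k):
    """Label q by majority vote among its k nearest training points."""
    # Rank every training point by Euclidean distance to the query: O(n log n) per query.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], q))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration).
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["neg", "neg", "neg", "pos", "pos"]
print(knn_classify(train, labels, (0.95, 1.0), k=3))
```

Every query touches all n points, which is the cost that the tree-based searches in the following slides try to avoid.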
Introduction: KNS1 • KNS1: • Conventional k-NN search with a ball-tree. • Ball-tree (binary): • The root node represents the full set of points. • A leaf node contains some points. • A non-leaf node has two child nodes. • Pivot of a node: one of the points in the node, or the centroid of the points. • Radius of a node: the largest distance from the pivot to any point in the node, $\text{Node.Radius} = \max_{x \in \text{Points(Node)}} |\text{Node.Pivot} - x|$.
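Introduction: KNS1 • Bound the distance from a query point q to any point x in a node, using the triangle inequality: $|q - \text{Node.Pivot}| - \text{Node.Radius} \le |x - q| \le |q - \text{Node.Pivot}| + \text{Node.Radius}$. • Tree construction trades off the cost of building the tree against the tightness of the ball radii.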
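The bound on the distance from q to the points of a ball follows from the triangle inequality applied to the pivot and the radius. The sketch below assumes a centroid pivot and Euclidean distance; `make_ball` and `distance_bounds` are hypothetical helper names.

```python
import math

def make_ball(points):
    """Return (pivot, radius): centroid pivot and the largest pivot-to-point distance."""
    dim = len(points[0])
    pivot = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    radius = max(math.dist(pivot, p) for p in points)
    return pivot, radius

def distance_bounds(pivot, radius, q):
    """Triangle-inequality bounds on |x - q| for every point x owned by the ball."""
    d = math.dist(pivot, q)
    return max(d - radius, 0.0), d + radius

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
pivot, radius = make_ball(points)
lo, hi = distance_bounds(pivot, radius, (3.0, 3.0))
# Every true distance falls inside [lo, hi], so the ball can be pruned using lo alone.
print(lo, hi, [round(math.dist(p, (3.0, 3.0)), 3) for p in points])
```

A tighter radius gives a larger lower bound and therefore more pruning, which is what the construction trade-off is about.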
Introduction: KNS1 • Recursive procedure: PSout = BallKNN(PSin, Node) • PSin consists of the k-NN of q in V (the set of points searched so far). • PSout consists of the k-NN of q in V and Node.
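The recursion can be sketched roughly as follows. This is an illustrative reading of the slide, not the authors' code: the `Node` class, the centroid pivot, the median split on the widest coordinate, `leaf_size`, and the heap representation of PSin/PSout are all assumptions made for the example.

```python
import heapq
import math

class Node:
    """Hypothetical ball-tree node: centroid pivot, covering radius, two children."""
    def __init__(self, points, leaf_size=2):
        dim = len(points[0])
        self.pivot = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
        self.radius = max(math.dist(self.pivot, p) for p in points)
        self.points, self.children = points, []
        if len(points) > leaf_size:
            # Assumed split rule: median cut along the coordinate with the widest spread.
            axis = max(range(dim),
                       key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
            ordered = sorted(points, key=lambda p: p[axis])
            mid = len(ordered) // 2
            self.children = [Node(ordered[:mid], leaf_size), Node(ordered[mid:], leaf_size)]

def ball_knn(ps_in, node, q, k):
    """ps_in is a heap of (-distance, point) holding the k-NN of q found so far."""
    lower = max(math.dist(node.pivot, q) - node.radius, 0.0)
    d_sofar = -ps_in[0][0] if len(ps_in) == k else math.inf
    if lower >= d_sofar:
        return ps_in  # no point in this ball can beat the current k-th neighbor
    if not node.children:
        for p in node.points:
            d = math.dist(p, q)
            if len(ps_in) < k:
                heapq.heappush(ps_in, (-d, p))
            elif d < -ps_in[0][0]:
                heapq.heapreplace(ps_in, (-d, p))  # evict the current farthest neighbor
        return ps_in
    # Visit the child with the closer pivot first so the bound tightens early.
    for child in sorted(node.children, key=lambda c: math.dist(c.pivot, q)):
        ps_in = ball_knn(ps_in, child, q, k)
    return ps_in

pts = [(x * 0.37 % 1.0, x * 0.61 % 1.0) for x in range(1, 60)]
root = Node(pts)
result = ball_knn([], root, (0.5, 0.5), 3)
print(sorted((round(-nd, 4), p) for nd, p in result))
```

Calling `ball_knn([], root, q, k)` starts with an empty V; the returned heap plays the role of PSout for the whole tree.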
KNS2 • KNS2: • For skewed-class data: one class is much more frequent than the other. • Find how many of the k nearest neighbors are in the positive class without explicitly finding the k-NN set. • Basic idea: • Build two ball-trees: Postree (small) and Negtree. • "Find positive": search Postree to find the k-NN set Posset_k using KNS1. • "Insert negative": search Negtree, using Posset_k as bounds to prune faraway nodes and to estimate the number of negative points to be inserted into the true nearest-neighbor set.
KNS2 • Definitions: • Dists = {Dist_1, …, Dist_k}: the distances to the k nearest positive neighbors of q, sorted in increasing order. • V: the set of points in the negative balls visited so far. • (n, C): n is the number of positive points in the k-NN of q; C = {C_1, …, C_n}, where C_i is the number of negative points in V closer to q than the i-th positive neighbor.
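To make the (n, C) bookkeeping concrete, the sketch below computes it by brute force, taking V to be all negative points (that is, the state after the whole Negtree has been visited). It does no pruning and is not the NegCount recursion; the function name and the rule "the i-th positive stays in the k-NN set when C_i + i <= k" are the example's own reading of the definitions, and ties are ignored.

```python
import math

def summarize_negatives(pos, neg, q, k):
    """Return (n, C) for query q, with V taken to be every negative point."""
    # Dists: distances to the k nearest positive neighbors, in increasing order.
    dists = sorted(math.dist(p, q) for p in pos)[:k]
    neg_d = [math.dist(x, q) for x in neg]
    # counts[i-1]: number of negatives strictly closer to q than the i-th positive.
    counts = [sum(1 for d in neg_d if d < dist_i) for dist_i in dists]
    # The i-th positive survives only if it and the negatives ahead of it fit in k slots.
    n = 0
    for i, c in enumerate(counts, start=1):
        if c + i <= k:
            n = i
    return n, counts[:n]

pos = [(1.0, 0.0), (3.0, 0.0)]
neg = [(0.5, 0.0), (2.0, 0.0), (2.5, 0.0), (4.0, 0.0)]
# The 3 nearest neighbors of the origin are neg (0.5), pos (1.0), neg (2.0).
print(summarize_negatives(pos, neg, (0.0, 0.0), k=3))
```

Here n is obtained without ever materializing the mixed k-NN set, which is the point of KNS2; the tree version additionally skips negative balls that cannot change any C_i.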
KNS2 • Step 2, "insert negative", is implemented by the recursive function (nout, Cout) = NegCount(nin, Cin, Node, jparent, Dists). • (nin, Cin) summarizes the interesting negative points for V. • (nout, Cout) summarizes the interesting negative points for V and Node.
KNS3 • KNS3 • "Are at least t of the k nearest neighbors positive?" • No constraint on the skewness of the classes. • Proposition: • Instead of directly computing the exact values, we compute lower and upper bounds, since m + t = k + 1.
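The formula of the proposition did not survive on this slide. One reading consistent with m + t = k + 1 is that at least t of the k nearest neighbors are positive exactly when the t-th nearest positive point is no farther from q than the m-th nearest negative point. The sketch below checks that reading by brute force on random data; the function names and the data are made up for illustration, and ties are resolved in favor of positives.

```python
import math
import random

def at_least_t_positive(pos, neg, q, k, t):
    """Brute force: are at least t of the k nearest neighbors of q positive?"""
    # Tag positives with 0 and negatives with 1 so that ties sort positives first.
    tagged = [(math.dist(p, q), 0) for p in pos] + [(math.dist(x, q), 1) for x in neg]
    return sum(1 - is_neg for _, is_neg in sorted(tagged)[:k]) >= t

def threshold_test(pos, neg, q, k, t):
    """Compare the t-th nearest positive with the m-th nearest negative, m = k + 1 - t."""
    m = k + 1 - t
    dist_t_pos = sorted(math.dist(p, q) for p in pos)[t - 1]
    dist_m_neg = sorted(math.dist(x, q) for x in neg)[m - 1]
    return dist_t_pos <= dist_m_neg

random.seed(0)
pos = [(random.random(), random.random()) for _ in range(20)]
neg = [(random.random(), random.random()) for _ in range(40)]
k, t = 9, math.ceil(9 / 2)
queries = [(random.random(), random.random()) for _ in range(200)]
agree = all(at_least_t_positive(pos, neg, q, k, t) == threshold_test(pos, neg, q, k, t)
            for q in queries)
print(agree)
```

Under this reading, only the two threshold distances matter, so lower and upper bounds on them can settle the question before either is known exactly.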
KNS3 • P is a set of balls from Postree; N consists of balls from Negtree.
Experimental results • Real data
Experimental results • k = 9, t = ceiling(k/2). • Randomly pick 1% of the negative records and 50% of the positive records as the test set (986 points). • Train on the remaining 87372 data points.
Comments • Why k-NN? It serves as a baseline. • No free lunch: • For uniform high-dimensional data, there is no benefit. • The speedups obtained suggest that the intrinsic dimensionality of the real data is much lower than its nominal dimensionality.