
New Algorithms for Efficient High-Dimensional Nonparametric Classification

Ting Liu, Andrew W. Moore, and Alexander Gray


Presentation Transcript


  1. New Algorithms for Efficient High-Dimensional Nonparametric Classification • Ting Liu, Andrew W. Moore, and Alexander Gray

  2. Overview • Introduction • k Nearest Neighbors (k-NN) • KNS1: conventional k-NN search • New algorithms for k-NN classification • KNS2: for skewed-class data • KNS3: “are at least t of the k-NN positive?” • Results • Comments

  3. Introduction: k-NN • k-NN • Nonparametric classification method. • Given a data set of n data points, it finds the k points closest to a query point and chooses the label held by the majority of them. • The computational cost is too high in many settings, especially for high-dimensional data.
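To make the cost concrete, here is a minimal brute-force sketch of the k-NN rule described on this slide. It is not the authors' code: the function name `knn_classify`, the toy data, and the use of Euclidean distance are assumptions made for illustration. Every query scans all n points, which is the cost the ball-tree methods on the following slides try to avoid.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Brute-force k-NN: rank all n points by distance to the query, then vote."""
    # Euclidean distance is an assumption; the slide does not fix a metric.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three negative points near the origin, two positive ones far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels = ["neg", "neg", "neg", "pos", "pos"]
print(knn_classify(points, labels, (0.3, 0.3), k=3))
```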

  4. Introduction: KNS1 • KNS1: • Conventional k-NN search with a ball-tree. • Ball-tree (binary): • The root node represents the full set of points. • A leaf node contains some points. • A non-leaf node has two child nodes. • Pivot of a node: one of the points in the node, or the centroid of the points. • Radius of a node: $\text{Radius}(\text{Node}) = \max_{x \in \text{Node}} \lVert x - \text{Pivot}(\text{Node}) \rVert$

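  5. Introduction: KNS1 • Bound the distance from a query point q to any point x in a node: $\lVert q - \text{Pivot}(\text{Node}) \rVert - \text{Radius}(\text{Node}) \le \lVert x - q \rVert \le \lVert q - \text{Pivot}(\text{Node}) \rVert + \text{Radius}(\text{Node})$ (by the triangle inequality). • Trade off the cost of construction against the tightness of the radius of the balls.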
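The bound can be checked numerically. The sketch below assumes the usual ball-tree bound that follows from the triangle inequality: every point in a ball lies between `d - radius` and `d + radius` of the query, where `d` is the query-to-pivot distance. The helper name `ball_bounds` and the sample points are hypothetical.

```python
import math

def ball_bounds(query, pivot, radius):
    """Lower and upper bounds on the distance from query to any point in the ball."""
    d = math.dist(query, pivot)
    # The lower bound is clamped at 0 for queries that fall inside the ball.
    return max(d - radius, 0.0), d + radius

# Hypothetical node: pivot is the centroid, radius is the largest pivot-to-point distance.
node_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0)]
pivot = tuple(sum(c) / len(node_points) for c in zip(*node_points))
radius = max(math.dist(pivot, p) for p in node_points)

lower, upper = ball_bounds((6.0, 2.0), pivot, radius)
print(lower, upper)
```

A tighter radius gives a larger lower bound and therefore more pruning, which is the trade-off against construction cost that the slide mentions.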

  6. Introduction: KNS1 • Recursive procedure: PS_out = BallKNN(PS_in, Node) • PS_in consists of the k-NN of q in V (the set of points searched so far). • PS_out consists of the k-NN of q in V ∪ Node.
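The recursion can be sketched as follows. This is an illustrative reading of the slide rather than the authors' implementation: the `Ball` class, the split heuristic (two far-apart anchor points), the leaf size, and the heap representation of the candidate set are all assumptions.

```python
import heapq
import math

class Ball:
    """Binary ball-tree node: centroid pivot, radius, and zero or two children."""
    def __init__(self, points, leaf_size=4):
        self.points = points
        self.pivot = tuple(sum(c) / len(points) for c in zip(*points))
        self.radius = max(math.dist(self.pivot, p) for p in points)
        self.children = []
        if len(points) > leaf_size:
            # Assumed split heuristic: pick two far-apart anchors, send each point to the nearer.
            a = max(points, key=lambda p: math.dist(p, points[0]))
            b = max(points, key=lambda p: math.dist(p, a))
            left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
            right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
            if left and right:  # stays a leaf if all points coincide
                self.children = [Ball(left, leaf_size), Ball(right, leaf_size)]

def ball_knn(ps_in, node, query, k):
    """Return PS_out, the k-NN of query in V plus node, given PS_in, the k-NN in V.

    The candidate set is a max-heap of (-distance, point) pairs, updated in place.
    """
    worst = -ps_in[0][0] if len(ps_in) == k else math.inf
    if math.dist(query, node.pivot) - node.radius >= worst:
        return ps_in  # no point in this ball can beat the current k-th neighbor
    if not node.children:
        for p in node.points:
            d = math.dist(query, p)
            if len(ps_in) < k:
                heapq.heappush(ps_in, (-d, p))
            elif d < -ps_in[0][0]:
                heapq.heapreplace(ps_in, (-d, p))
        return ps_in
    # Visit the nearer child first so the pruning threshold tightens early.
    for child in sorted(node.children, key=lambda c: math.dist(query, c.pivot)):
        ps_in = ball_knn(ps_in, child, query, k)
    return ps_in
```

A search starts with an empty candidate set, for example `ball_knn([], Ball(data), q, k)`, and the distances are recovered by negating the first element of each returned pair.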

  7. KNS2 • KNS2: • For skewed-class data: one class is much more frequent than the other. • Find the number of the k-NN that are in the positive class without explicitly finding the k-NN set. • Basic idea: • Build two ball-trees: Postree (small) and Negtree. • “Find positive”: search Postree to find the k-NN set Posset_k using KNS1; • “Insert negative”: search Negtree, using Posset_k as bounds to prune distant nodes and to estimate the number of negative points to be inserted into the true nearest neighbor set.

  8. KNS2 • Definitions: • Dists = {Dist_1, …, Dist_k}: the distances to the k nearest positive neighbors of q, sorted in increasing order. • V: the set of points in the negative balls visited so far. • (n, C): n is the number of positive points in the k-NN of q. C = {C_1, …, C_n}, where C_i is the number of negative points in V closer to q than the i-th positive neighbor. • and
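To illustrate what (n, C) mean, the sketch below computes them by brute force with V taken to be the whole negative set. It is a reference calculation, not the pruned Negtree search. It rests on one assumption derived from the definitions: ignoring ties, the i-th positive neighbor belongs to the k-NN of q exactly when i + C_i ≤ k. The function name and data are hypothetical.

```python
import math

def positive_count_brute_force(pos, neg, query, k):
    """Return (n, C) with V equal to all negative points; no pruning is done."""
    dists = sorted(math.dist(query, p) for p in pos)[:k]   # Dist_1 .. Dist_k
    neg_d = [math.dist(query, x) for x in neg]
    # C_i: number of negative points strictly closer than the i-th positive neighbor.
    counts = [sum(d < dist_i for d in neg_d) for dist_i in dists]
    n = 0
    for i, c_i in enumerate(counts, start=1):
        if i + c_i <= k:   # the i-th positive still fits among the k nearest
            n = i
    return n, counts[:n]

# Hypothetical 1-D-like data laid out along the x axis.
pos = [(0.0, 0.0), (3.0, 0.0)]
neg = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
print(positive_count_brute_force(pos, neg, (0.0, 0.0), k=3))
```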

  9. KNS2 • Step 2, “Insert negative”, is implemented by the recursive function (n_out, C_out) = NegCount(n_in, C_in, Node, j_parent, Dists). • (n_in, C_in) summarizes the interesting negative points for V; • (n_out, C_out) summarizes the interesting negative points for V and Node.

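  10. KNS3 • KNS3 • “Are at least t of the k nearest neighbors positive?” • No constraint on the skewness of the classes. • Proposition: at least t of the k nearest neighbors are positive if and only if the distance to the t-th nearest positive neighbor is no greater than the distance to the m-th nearest negative neighbor, where m + t = k + 1. • Instead of directly computing the exact values of these two distances, we compute lower and upper bounds on them.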
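The threshold question reduces to comparing two distances. The brute-force sketch below assumes the equivalence that follows from m + t = k + 1: at least t of the k-NN are positive exactly when the t-th nearest positive point is no farther than the m-th nearest negative point. It also assumes ties are broken in favor of the positive class and that there are at least k points in total. It computes both distances exactly, whereas KNS3 as described only bounds them; the function name is hypothetical.

```python
import math

def at_least_t_positive(pos, neg, query, k, t):
    """Brute-force check of 'are at least t of the k-NN positive?'."""
    m = k + 1 - t   # from m + t = k + 1
    pos_d = sorted(math.dist(query, p) for p in pos)
    neg_d = sorted(math.dist(query, x) for x in neg)
    # A missing t-th positive (or m-th negative) is treated as infinitely far away.
    dist_t_pos = pos_d[t - 1] if len(pos_d) >= t else math.inf
    dist_m_neg = neg_d[m - 1] if len(neg_d) >= m else math.inf
    return dist_t_pos <= dist_m_neg

# Hypothetical data: positives at x = 0 and 3, negatives at x = 1, 2, 4, 5.
pos = [(0.0, 0.0), (3.0, 0.0)]
neg = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
print(at_least_t_positive(pos, neg, (0.0, 0.0), k=3, t=2))
```

Because only the comparison matters, a search can stop as soon as an upper bound on one distance falls below a lower bound on the other, which is why bounds suffice.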

  11. KNS3 P is a set of balls from Postree, N consists of balls from Negtree.

  12. Experimental results • Real data

  13. Experimental results • k = 9, t = ceiling(k/2). • Randomly pick 1% of the negative records and 50% of the positive records as the test set (986 points). • Train on the remaining 87372 data points.

  14. Comments • Why k-NN? It is a baseline. • No free lunch: • For uniform high-dimensional data, there are no benefits. • The results suggest that the intrinsic dimensionality of the real data is much lower than its nominal dimensionality.
