New Algorithms for Efficient High-Dimensional Nonparametric Classification Ting Liu, Andrew W. Moore, and Alexander Gray
Overview • Introduction • k Nearest Neighbors (k-NN) • KNS1: conventional k-NN search • New algorithms for k-NN classification • KNS2: for skewed-class data • KNS3: "are at least t of k-NN positive?" • Results • Comments
Introduction: k-NN • k-NN • Nonparametric classification method. • Given a data set of n data points, it finds the k closest points to a query point q, and chooses the label corresponding to the majority. • Computational cost is too high in many applications, especially in the high-dimensional case.
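To make the cost concrete, here is a minimal brute-force sketch (NumPy assumed, labels assumed to be small non-negative integers): every query scans all n points, which is exactly what the ball-tree methods below avoid.

```python
import numpy as np

def knn_classify(X, y, q, k):
    """Brute-force k-NN: label q by majority vote among its k closest
    training points. Cost is O(n*d) per query."""
    dists = np.linalg.norm(X - q, axis=1)   # distance from q to every point
    idx = np.argsort(dists)[:k]             # indices of the k nearest points
    return np.bincount(y[idx]).argmax()     # majority label among the k
```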
Introduction: KNS1 • KNS1: • Conventional k-NN search with a ball-tree. • Ball-tree (binary): • Root node represents the full set of points. • Leaf node contains some points. • Non-leaf node has two child nodes. • Pivot of a node: one of the points in the node, or the centroid of the points. • Radius of a node: the maximum distance from the pivot to any point in the node, Radius(Node) = max over x in Node of |x − Pivot(Node)|.
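A minimal ball-tree construction sketch, assuming centroid pivots and a simple widest-dimension split; the paper's actual construction procedure may differ, and the construction rule is one of the trade-offs mentioned below.

```python
import numpy as np

class BallNode:
    """One ball in the tree: a pivot, a radius covering all of the
    node's points, and (for non-leaf nodes) two children."""
    def __init__(self, points, leaf_size=20):
        self.points = points
        self.pivot = points.mean(axis=0)      # centroid pivot
        self.radius = np.max(np.linalg.norm(points - self.pivot, axis=1))
        self.child1 = self.child2 = None
        if len(points) > leaf_size:
            # Simple split rule: partition on the dimension of widest spread.
            d = np.argmax(points.max(axis=0) - points.min(axis=0))
            order = np.argsort(points[:, d])
            mid = len(points) // 2
            self.child1 = BallNode(points[order[:mid]], leaf_size)
            self.child2 = BallNode(points[order[mid:]], leaf_size)
```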
Introduction: KNS1 • Bound the distance from a query point q to any point x inside a node via the triangle inequality: |q − Pivot| − Radius ≤ |x − q| ≤ |q − Pivot| + Radius (lower bound clipped at 0). • Trade off the cost of construction against the tightness of the radius of the balls.
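A small helper for these bounds, assuming the BallNode sketch above:

```python
def dist_bounds(q, node):
    """Triangle-inequality bounds on the distance from q to any point
    x inside the ball: |q-pivot| - r <= |q-x| <= |q-pivot| + r."""
    d = np.linalg.norm(q - node.pivot)
    return max(d - node.radius, 0.0), d + node.radius
```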
Introduction: KNS1 • Recursive procedure: PSout = BallKNN(PSin, Node) • PSin consists of the k-NN of q in V (the set of points searched so far) • PSout consists of the k-NN of q in V ∪ Node
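A sketch of the BallKNN recursion under the same assumptions, keeping the current candidates in a max-heap of (-dist, point) pairs so the k-th-nearest distance is always available for pruning:

```python
import heapq
import numpy as np

def ball_knn(ps_in, node, q, k):
    """ps_in: max-heap holding the k-NN of q among the points searched
    so far; returns the k-NN of q among those points plus Node's points."""
    d_k = -ps_in[0][0] if len(ps_in) == k else np.inf
    lower = max(np.linalg.norm(q - node.pivot) - node.radius, 0.0)
    if lower >= d_k:
        return ps_in                          # whole ball is too far: prune
    if node.child1 is None:                   # leaf: check each point
        for x in node.points:
            d = np.linalg.norm(q - x)
            if len(ps_in) < k or d < d_k:
                heapq.heappush(ps_in, (-d, tuple(x)))
                if len(ps_in) > k:
                    heapq.heappop(ps_in)      # drop the farthest candidate
                d_k = -ps_in[0][0] if len(ps_in) == k else np.inf
    else:                                     # visit the nearer child first
        c1, c2 = node.child1, node.child2
        if np.linalg.norm(q - c1.pivot) > np.linalg.norm(q - c2.pivot):
            c1, c2 = c2, c1
        ps_in = ball_knn(ps_in, c1, q, k)
        ps_in = ball_knn(ps_in, c2, q, k)
    return ps_in
```

Calling `ball_knn([], root, q, k)` then returns the exact k-NN of q in the whole tree.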
KNS2 • KNS2: • For skewed-class data: one class is much more frequent than the other. • Find the # of positive points among the k-NN of q without explicitly finding the k-NN set. • Basic idea: • Build two ball-trees: Postree (small), Negtree • "Find positive": search Postree with KNS1 to find the k-NN set Posset_k; • "Insert negative": search Negtree, using Posset_k as bounds to prune faraway nodes and to count the negative points that belong in the true nearest-neighbor set.
KNS2 • Definitions: • Dists = {Dist_1, …, Dist_k}: the distances to the k nearest positive neighbors of q, sorted in increasing order. • V: the set of points in the negative balls visited so far. • (n, C): n is the # of positive points among the k-NN of q; C = {C_1, …, C_n}, where C_i is the # of negative points in V closer to q than the i-th positive neighbor. • The i-th positive neighbor stays among the k-NN of q exactly when C_i + i ≤ k, so n = max{ i : C_i + i ≤ k }.
KNS2 Step 2 "insert negative" is implemented by the recursive function (n_out, C_out) = NegCount(n_in, C_in, Node, j_parent, Dists): (n_in, C_in) summarize the interesting negative points for V; (n_out, C_out) summarize the interesting negative points for V and Node. A simplified sketch follows.
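The sketch below builds on `ball_knn` above. It counts each C_i by walking Negtree with the Dist_k prune and then reads off n via the C_i + i ≤ k test; the paper's NegCount threads (n_in, C_in) through the recursion and prunes more aggressively, which this simplified version omits.

```python
import numpy as np

def neg_count(node, q, dists, counts):
    """counts[i] accumulates C_{i+1}: negatives closer to q than dists[i]."""
    if np.linalg.norm(q - node.pivot) - node.radius > dists[-1]:
        return counts                        # whole ball beyond Dist_k: prune
    if node.child1 is None:                  # leaf: tally each negative point
        for x in node.points:
            d = np.linalg.norm(q - x)
            for i, di in enumerate(dists):
                if d < di:
                    counts[i] += 1
        return counts
    counts = neg_count(node.child1, q, dists, counts)
    return neg_count(node.child2, q, dists, counts)

def kns2_positive_count(q, postree, negtree, k):
    """n = # of positives among q's k-NN = max{ i : C_i + i <= k }."""
    posset = ball_knn([], postree, q, k)         # step 1 "find positive"
    dists = sorted(-d for d, _ in posset)        # Dist_1 <= ... <= Dist_k
    C = neg_count(negtree, q, dists, [0] * len(dists))  # step 2 "insert negative"
    n = 0
    for i, c in enumerate(C):                    # i-th positive, 1-indexed i+1
        if c + (i + 1) <= k:
            n = i + 1
    return n
```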
KNS3 • KNS3 • "Are at least t of the k nearest neighbors positive?" • No assumption of skewness between the classes. • Proposition: at least t of the k nearest neighbors of q are positive iff D_t^pos < D_m^neg, where D_t^pos is the distance to the t-th nearest positive point, D_m^neg is the distance to the m-th nearest negative point, and m + t = k + 1. • Instead of computing these two distances exactly, we compute lower and upper bounds on them, which is enough to resolve the comparison.
KNS3 P is a set of balls from Postree and N is a set of balls from Negtree; the search refines the bounds on D_t^pos and D_m^neg over these two sets until the comparison in the proposition is decided.
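A drastically simplified sketch: the real KNS3 only tightens lower/upper bounds over the ball sets P and N, splitting balls until the bounds separate, whereas the version below just computes the two distances exactly with `ball_knn` and compares them, answering the same question less efficiently.

```python
def kns3_at_least_t_positive(q, postree, negtree, k, t):
    """"Are at least t of the k nearest neighbors of q positive?"
    By the proposition: yes iff D_t^pos < D_m^neg with m = k - t + 1."""
    m = k - t + 1
    d_t_pos = -ball_knn([], postree, q, t)[0][0]   # t-th nearest positive dist
    d_m_neg = -ball_knn([], negtree, q, m)[0][0]   # m-th nearest negative dist
    return d_t_pos < d_m_neg
```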
Experimental results • Real data
Experimental results k=9, t=ceiling(k/2), Randomly pick 1% negative records and 50% positive records as test (986 points) Train on the reaming 87372 data points
Comments • Why k-NN? It is a standard baseline. • No free lunch: • For uniformly distributed high-dimensional data, there is no benefit. • The observed speedups suggest that the intrinsic dimensionality of real data is much lower than the ambient dimensionality.