Fast N-Body Algorithms for Massive Datasets

Alexander Gray, Georgia Institute of Technology

Is science in 2007 different from science in 1907? Instruments [Science, Szalay & J. Gray, 2001]

Data: CMB Maps
• 1990 COBE: 1,000
• 2000 Boomerang: 10,000
• 2002 CBI: 50,000
• 2003 WMAP: 1 million
• 2008 Planck: 10 million

Data: Local Redshift Surveys
• 1986 CfA: 3,500
• 1996 LCRS: 23,000
• 2003 2dF: 250,000
• 2005 SDSS: 800,000

Data: Angular Surveys
• 1970 Lick: 1M
• 1990 APM: 2M
• 2005 SDSS: 200M
• 2008 LSST: 2B

Sloan Digital Sky Survey (SDSS)

1 billion objects, 144 dimensions (~250M galaxies in 5 colors, ~1M 2000-D spectra) • Size matters! Now possible: • low noise: subtle patterns • global properties and patterns • rare objects and patterns • more info: 3D, deeper/earlier, bands • in parallel: more accurate simulations • 2008: LSST – time-varying phenomena

Happening everywhere! • Molecular biology: microarray chips • Drug discovery: nuclear magnetic resonance • Earth sciences: satellite topography • Physical simulation: microprocessor • Neuroscience: functional MRI • Internet: fiber optics

Astrophysicist:
• How did galaxies evolve?
• What was the early universe like?
• Does dark energy exist?
• Is our model (GR+inflation) right?

Machine learning/statistics guy:
• Kernel density estimator: O(N^2)
• n-point spatial statistics: O(N^n)
• Nonparametric Bayes classifier: O(N^2)
• Support vector machine: O(N^2)
• Nearest-neighbor statistics: O(N^2)
• Gaussian process regression: O(N^3)
• Bayesian inference: O(c^D T(N))

Astrophysicist: But I have 1 million points.

Astrophysicists: R. Nichol, Inst. Cosmol. Gravitation; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Kulkarni, Inst. Cosmol. Gravitation; D. Wake, Inst. Cosmol. Gravitation; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astronomy; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics

Data: The Stack

Making fast algorithms • There are many large datasets, and many questions we want to ask of them. • Why we must not get obsessed with one specific dataset. • Why we must not get obsessed with one specific question. • The activity I’ll describe is about accelerating computations that occur commonly across many ML methods.

Scope • Nearest neighbor • K-means • Hierarchical clustering • N-point correlation functions • Kernel density estimation • Locally-weighted regression • Mean shift tracking • Mixtures of Gaussians • Gaussian process regression • Manifold learning • Support vector machines • Affinity propagation • PCA • ….

Scope • ML methods with distances underneath • Distances only • Continuous kernel functions • ML methods with counting underneath

Scope • Computational ideas in this tutorial: • Data structures • Monte Carlo • Series expansions • Problem/solution abstractions • Challenges • Don’t introduce error, if possible • Don’t introduce tweak parameters, if possible

Two canonical problems • Nearest-neighbor search • Kernel density estimation

Ideas • Data structures and how to use them • Monte Carlo • Series expansions • Problem/solution abstractions

Nearest Neighbor - Naïve Approach • Given a query point X. • Scan through each point Y: • Calculate the distance d(X,Y). • If d(X,Y) < best_seen, then Y is the new nearest neighbor (and best_seen is updated). • Takes O(N) time for each query! (Figure: 33 distance computations.) Slides by Jeremy Kubica
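The linear scan above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function name `naive_nearest_neighbor` and the sample points are made up for the example and are not from the slides.

```python
import math

def naive_nearest_neighbor(query, points):
    """Linear scan over all points.

    Returns (nearest point, its distance, number of distance computations).
    """
    best_point, best_seen = None, math.inf
    n_computed = 0
    for y in points:
        d = math.dist(query, y)      # Euclidean distance d(X, Y)
        n_computed += 1
        if d < best_seen:            # Y is the new nearest neighbor
            best_point, best_seen = y, d
    return best_point, best_seen, n_computed

# Hypothetical sample data: four 2-D points and one query.
points = [(0.1, 0.9), (0.4, 0.4), (0.8, 0.2), (0.6, 0.7)]
nearest, dist, count = naive_nearest_neighbor((0.5, 0.5), points)
```

Every query touches all N points, which is the O(N) per-query cost the slide refers to.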

Speeding Up Nearest Neighbor • We can speed up the search for the nearest neighbor: • Examine nearby points first. • Ignore any points that are farther than the nearest point found so far. • Do this using a KD-tree: • Tree-based data structure. • Recursively partitions points into axis-aligned boxes.

KD-Tree Construction We start with a list of n-dimensional points.

KD-Tree Construction We can split the points into two groups by choosing a dimension X and a value V, and separating the points into X > V and X <= V. (Figure: the root node tests X > .5, with YES and NO branches.)

KD-Tree Construction We can then consider each group separately and possibly split again, along the same or a different dimension. (Figure: below the X > .5 test, one group is split again on Y > .1.)

KD-Tree Construction We can keep splitting the points in each set to create a tree structure. Each node with no children (leaf node) contains a list of points.

KD-Tree Construction We will keep one additional piece of information at each node: the (tight) bounds of the points at or below this node.

KD-Tree Construction Use heuristics to make splitting decisions: • Which dimension do we split along? The widest. • Which value do we split at? The median of the points' values in the split dimension. • When do we stop? When there are fewer than m points left OR the box has hit some minimum width.
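The construction steps and heuristics above can be sketched as follows. This is one possible reading, not the slides' own code: the `Node` layout, the name `build_kdtree`, the `min_width` default, and the midpoint fallback used when many points tie at the median are all assumptions made for the illustration. It also assumes a non-empty list of points.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    lo: Point                              # tight lower bounds of the points at or below this node
    hi: Point                              # tight upper bounds of the same points
    points: Optional[List[Point]] = None   # set only on leaf nodes
    split_dim: int = -1                    # dimension tested at an internal node
    split_val: float = 0.0                 # left holds x <= split_val, right holds x > split_val
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_kdtree(points: Sequence[Point], m: int = 3, min_width: float = 1e-12) -> Node:
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    node = Node(lo, hi)
    dim = max(dims, key=lambda d: hi[d] - lo[d])        # split along the widest dimension
    if len(points) < m or hi[dim] - lo[dim] <= min_width:
        node.points = list(points)                       # stopping rule reached: make a leaf
        return node
    values = sorted(p[dim] for p in points)
    split = values[(len(values) - 1) // 2]               # (lower) median in the split dimension
    if split == hi[dim]:                                 # assumption: ties at the median would
        split = (lo[dim] + hi[dim]) / 2.0                # empty the right side, so use the midpoint
    node.split_dim, node.split_val = dim, split
    node.left = build_kdtree([p for p in points if p[dim] <= split], m, min_width)
    node.right = build_kdtree([p for p in points if p[dim] > split], m, min_width)
    return node

def leaves(node: Node) -> List[Node]:
    """Collect the leaf nodes, left to right."""
    if node.points is not None:
        return [node]
    return leaves(node.left) + leaves(node.right)

# Hypothetical sample data.
pts = [(0.1, 0.2), (0.3, 0.9), (0.5, 0.5), (0.7, 0.1),
       (0.9, 0.8), (0.2, 0.6), (0.6, 0.3), (0.8, 0.4)]
root = build_kdtree(pts, m=3)
```

Because the bounds are recomputed from the points at each node, they are tight boxes rather than the looser half-spaces implied by the split values alone.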

Exclusion and inclusion, using point-node kd-tree bounds. Bounds on the minimum and maximum distance between a point and a node take O(D) time to compute.
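The slide's formula did not survive extraction, so the following is only a sketch of the standard point-to-box bounds, assuming Euclidean distance and a node described by its tight lower and upper corners. The function names are hypothetical. Each bound is a single pass over the D coordinates.

```python
import math

def min_dist_to_box(q, lo, hi):
    """Lower bound on the distance from q to any point inside the box [lo, hi]."""
    total = 0.0
    for x, l, h in zip(q, lo, hi):
        gap = max(l - x, 0.0, x - h)   # zero when q lies inside the box along this axis
        total += gap * gap
    return math.sqrt(total)

def max_dist_to_box(q, lo, hi):
    """Upper bound on the distance from q to any point inside the box [lo, hi]."""
    total = 0.0
    for x, l, h in zip(q, lo, hi):
        far = max(abs(x - l), abs(h - x))   # distance to the farther face along this axis
        total += far * far
    return math.sqrt(total)

# Hypothetical example: a query at the origin and a box with corners (3, 4) and (6, 8).
d_min = min_dist_to_box((0.0, 0.0), (3.0, 4.0), (6.0, 8.0))
d_max = max_dist_to_box((0.0, 0.0), (3.0, 4.0), (6.0, 8.0))
```

A node can be excluded when its minimum distance already exceeds the threshold of interest, and included wholesale when its maximum distance falls within it.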

Nearest Neighbor with KD Trees We traverse the tree looking for the nearest neighbor of the query point.

Nearest Neighbor with KD Trees Examine nearby points first: Explore the branch of the tree that is closest to the query point first.

Nearest Neighbor with KD Trees When we reach a leaf node: compute the distance to each point in the node.

Nearest Neighbor with KD Trees Then we can backtrack and try the other branch at each node visited.

Nearest Neighbor with KD Trees Each time a new closest point is found, we can update the distance bound.

Nearest Neighbor with KD Trees Using the distance bound and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.

Simple recursive algorithm (k=1 case)

NN(xq, R, dlo, xsofar, dsofar)
{
  if dlo > dsofar, return.
  if leaf(R), [xsofar, dsofar] = NNBase(xq, R, dsofar).
  else,
    [R1, d1, R2, d2] = orderByDist(xq, R.l, R.r).
    NN(xq, R1, d1, xsofar, dsofar).
    NN(xq, R2, d2, xsofar, dsofar).
}
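A runnable sketch of the recursion above, in Python. It is an illustration under stated assumptions rather than the slides' implementation: the dictionary-based tree, the helper names `build`, `min_dist`, and `nn`, and the use of a mutable `[point, distance]` pair in place of `xsofar`/`dsofar` are choices made here so that the second recursive call sees the bound updated by the first.

```python
import math
import random

def build(points, m=3):
    """Tiny kd-tree: split the widest dimension at the median; leaves hold fewer than m points."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"lo": lo, "hi": hi, "points": None, "l": None, "r": None}
    dim = max(dims, key=lambda d: hi[d] - lo[d])
    if len(points) < m or hi[dim] == lo[dim]:
        node["points"] = list(points)
        return node
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    node["l"], node["r"] = build(ordered[:mid], m), build(ordered[mid:], m)
    return node

def min_dist(q, node):
    """Lower bound on the distance from q to any point in the node's box (the dlo of the slide)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, node["lo"], node["hi"])))

def nn(q, node, d_lo, best):
    """best = [xsofar, dsofar]; updated in place."""
    if d_lo > best[1]:                       # prune: this node cannot hold a closer point
        return
    if node["points"] is not None:           # leaf: base case, scan its points
        for p in node["points"]:
            d = math.dist(q, p)
            if d < best[1]:
                best[0], best[1] = p, d
        return
    near, far = node["l"], node["r"]         # order the children by distance to the query
    d_near, d_far = min_dist(q, near), min_dist(q, far)
    if d_far < d_near:
        near, far, d_near, d_far = far, near, d_far, d_near
    nn(q, near, d_near, best)                # closer branch first
    nn(q, far, d_far, best)                  # then backtrack into the other branch

# Hypothetical data: 200 random 2-D points and one query.
random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
root = build(pts)
q = (0.5, 0.5)
best = [None, math.inf]
nn(q, root, min_dist(q, root), best)
```

The answer is exact: pruning only skips nodes whose lower bound already exceeds the best distance found so far, so the result matches a linear scan.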

Nearest Neighbor with KD Trees Instead, some animations showing real data… • kd-tree with cached sufficient statistics • nearest-neighbor with kd-trees • range-count with kd-trees For animations, see: http://www.cs.cmu.edu/~awm/animations/kdtree

Range-count example

Range-count example Pruned! (inclusion)
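A range count with both kinds of pruning might look like the sketch below. It assumes Euclidean distance, a closed ball of the given radius around the query, and a point count cached at every node (one example of a cached sufficient statistic). The tree layout and the names `build`, `dist_bounds`, and `range_count` are hypothetical.

```python
import math
import random

def build(points, m=3):
    """Tiny kd-tree whose nodes cache the number of points below them."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"lo": lo, "hi": hi, "count": len(points), "points": None, "l": None, "r": None}
    dim = max(dims, key=lambda d: hi[d] - lo[d])
    if len(points) < m or hi[dim] == lo[dim]:
        node["points"] = list(points)
        return node
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    node["l"], node["r"] = build(ordered[:mid], m), build(ordered[mid:], m)
    return node

def dist_bounds(q, node):
    """(min, max) distance from q to the node's bounding box, each in O(D)."""
    dmin = dmax = 0.0
    for x, l, h in zip(q, node["lo"], node["hi"]):
        dmin += max(l - x, 0.0, x - h) ** 2
        dmax += max(abs(x - l), abs(h - x)) ** 2
    return math.sqrt(dmin), math.sqrt(dmax)

def range_count(q, radius, node):
    """Number of points within `radius` of q."""
    dmin, dmax = dist_bounds(q, node)
    if dmin > radius:                 # exclusion: the whole box lies outside the ball
        return 0
    if dmax <= radius:                # inclusion: the whole box lies inside the ball
        return node["count"]
    if node["points"] is not None:    # leaf straddling the boundary: check each point
        return sum(1 for p in node["points"] if math.dist(q, p) <= radius)
    return range_count(q, radius, node["l"]) + range_count(q, radius, node["r"])

# Hypothetical data: 500 random 2-D points.
random.seed(1)
pts = [(random.random(), random.random()) for _ in range(500)]
root = build(pts)
q, radius = (0.5, 0.5), 0.25
count = range_count(q, radius, root)
```

Inclusion pruning is what distinguishes this from nearest-neighbor search: a node that lies entirely inside the query ball contributes its cached count without visiting any of its points.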

