
Fast N-Body Algorithms for Massive Datasets

Presentation Transcript


  1. Fast N-Body Algorithms for Massive Datasets. Alexander Gray, Georgia Institute of Technology

  2. Is science in 2007 different from science in 1907? Instruments [Science, Szalay & J. Gray, 2001]

  3. Is science in 2007 different from science in 1907? Instruments [Science, Szalay & J. Gray, 2001]
     Data: CMB Maps: 1990 COBE 1,000 • 2000 Boomerang 10,000 • 2002 CBI 50,000 • 2003 WMAP 1 Million • 2008 Planck 10 Million
     Data: Local Redshift Surveys: 1986 CfA 3,500 • 1996 LCRS 23,000 • 2003 2dF 250,000 • 2005 SDSS 800,000
     Data: Angular Surveys: 1970 Lick 1M • 1990 APM 2M • 2005 SDSS 200M • 2008 LSST 2B

  4. Sloan Digital Sky Survey (SDSS)

  5. 1 billion objects, 144 dimensions (~250M galaxies in 5 colors, ~1M 2000-D spectra)
     Size matters! Now possible:
     • low noise: subtle patterns
     • global properties and patterns
     • rare objects and patterns
     • more info: 3d, deeper/earlier, bands
     • in parallel: more accurate simulations
     • 2008: LSST – time-varying phenomena

  6. Happening everywhere!
     • Molecular biology: microarray chips
     • Drug discovery: nuclear mag. resonance
     • Earth sciences: satellite topography
     • Physical simulation: microprocessor
     • Neuroscience: functional MRI
     • Internet: fiber optics

  7. • How did galaxies evolve? • What was the early universe like? • Does dark energy exist? • Is our model (GR+inflation) right?
     Astrophysicist
     R. Nichol, Inst. Cosmol. Gravitation; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Kulkarni, Inst. Cosmol. Gravitation; D. Wake, Inst. Cosmol. Gravitation; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astronomy; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
     Machine learning/statistics guy

  8. • How did galaxies evolve? • What was the early universe like? • Does dark energy exist? • Is our model (GR+inflation) right?
     Astrophysicist
     R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
     • Kernel density estimator: O(N^2)
     • n-point spatial statistics: O(N^n)
     • Nonparametric Bayes classifier: O(N^2)
     • Support vector machine: O(N^2)
     • Nearest-neighbor statistics: O(N^2)
     • Gaussian process regression: O(N^3)
     • Bayesian inference: O(c^D T(N))
     Machine learning/statistics guy

  10. • How did galaxies evolve? • What was the early universe like? • Does dark energy exist? • Is our model (GR+inflation) right?
     Astrophysicist
     R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
     • Kernel density estimator: O(N^2)
     • n-point spatial statistics: O(N^n)
     • Nonparametric Bayes classifier: O(N^2)
     • Support vector machine: O(N^2)
     • Nearest-neighbor statistics: O(N^2)
     • Gaussian process regression: O(N^3)
     • Bayesian inference: O(c^D T(N))
     But I have 1 million points
     Machine learning/statistics guy

  11. Data: The Stack

  12. Data: The Stack

  13. Making fast algorithms • There are many large datasets. There are many questions we want to ask them. • Why we must not get obsessed with one specific dataset. • Why we must not get obsessed with one specific question. • The activity I’ll describe is about accelerating computations that occur commonly across many ML methods.

  14. Scope • Nearest neighbor • K-means • Hierarchical clustering • N-point correlation functions • Kernel density estimation • Locally-weighted regression • Mean shift tracking • Mixtures of Gaussians • Gaussian process regression • Manifold learning • Support vector machines • Affinity propagation • PCA • ….

  15. Scope
     • ML methods with distances underneath
       • Distances only
       • Continuous kernel functions
     • ML methods with counting underneath

  16. Scope
     • Computational ideas in this tutorial:
       • Data structures
       • Monte Carlo
       • Series expansions
       • Problem/solution abstractions
     • Challenges
       • Don’t introduce error, if possible
       • Don’t introduce tweak parameters, if possible

  17. Two canonical problems • Nearest-neighbor search • Kernel density estimation

  18. Ideas • Data structures and how to use them • Monte Carlo • Series expansions • Problem/solution abstractions

  19. Nearest Neighbor - Naïve Approach • Given a query point X. • Scan through each point Y: • Calculate the distance d(X,Y) • If d(X,Y) < best_seen then Y is the new nearest neighbor. • Takes O(N) time for each query! [Figure: 33 distance computations] Slides by Jeremy Kubica
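
The naïve scan on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function name `naive_nearest_neighbor` and the sample points are made up for the example.

```python
import math

def naive_nearest_neighbor(query, points):
    """Linear scan: one distance computation per stored point, O(N) per query."""
    best_point, best_seen = None, math.inf
    for y in points:
        d = math.dist(query, y)          # Euclidean distance d(X, Y)
        if d < best_seen:                # Y becomes the new nearest neighbor
            best_point, best_seen = y, d
    return best_point, best_seen

# Hypothetical sample data, one distance computation per point.
points = [(0.1, 0.9), (0.4, 0.4), (0.8, 0.2)]
nearest, dist = naive_nearest_neighbor((0.5, 0.5), points)
```

Every query touches every point, which is the cost the kd-tree slides that follow try to avoid.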

  20. Speeding Up Nearest Neighbor • We can speed up the search for the nearest neighbor: • Examine nearby points first. • Ignore any points that are farther away than the nearest point found so far. • Do this using a KD-tree: • Tree-based data structure • Recursively partitions points into axis-aligned boxes. Slides by Jeremy Kubica

  21. KD-Tree Construction We start with a list of n-dimensional points. Slides by Jeremy Kubica

  22. KD-Tree Construction X>.5 YES NO We can split the points into 2 groups by choosing a dimension X and value V and separating the points into X > V and X <= V. Slides by Jeremy Kubica

  23. KD-Tree Construction X>.5 YES NO We can then consider each group separately and possibly split again (along same/different dimension). Slides by Jeremy Kubica

  24. KD-Tree Construction X>.5 YES NO Y>.1 NO YES We can then consider each group separately and possibly split again (along same/different dimension). Slides by Jeremy Kubica

  25. KD-Tree Construction We can keep splitting the points in each set to create a tree structure. Each node with no children (leaf node) contains a list of points. Slides by Jeremy Kubica

  26. KD-Tree Construction We will keep around one additional piece of information at each node. The (tight) bounds of the points at or below this node. Slides by Jeremy Kubica

  27. KD-Tree Construction Use heuristics to make splitting decisions: • Which dimension do we split along? The widest. • Which value do we split at? The median value of the split dimension over the points. • When do we stop? When there are fewer than m points left OR the box has hit some minimum width. Slides by Jeremy Kubica
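
The construction heuristics above (widest dimension, median split, stop on few points or a narrow box, tight bounds kept at every node) might be sketched as follows. This is an illustration under assumptions, not the code behind the slides: the names `Node`, `build_kdtree`, `leaf_size`, and `min_width` are hypothetical, and `leaf_size` plays the role of the slide's m.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    lo: Point                               # tight lower bounds of the points at or below this node
    hi: Point                               # tight upper bounds
    points: Optional[List[Point]] = None    # set only at leaf nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_kdtree(points: List[Point], leaf_size: int = 2,
                 min_width: float = 1e-9) -> Node:
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    split_dim = max(dims, key=lambda d: hi[d] - lo[d])    # widest dimension
    # Stop when few points are left or the box is already very narrow.
    if len(points) <= leaf_size or hi[split_dim] - lo[split_dim] <= min_width:
        return Node(lo, hi, points=list(points))
    ordered = sorted(points, key=lambda p: p[split_dim])
    mid = len(ordered) // 2                               # split at the median
    return Node(lo, hi,
                left=build_kdtree(ordered[:mid], leaf_size, min_width),
                right=build_kdtree(ordered[mid:], leaf_size, min_width))

# Hypothetical sample data.
sample = [(0.1, 0.2), (0.9, 0.4), (0.5, 0.5), (0.3, 0.8), (0.7, 0.1)]
root = build_kdtree(sample, leaf_size=2)
```

Sorting at every level keeps the sketch short; a production build would more likely use a linear-time median selection.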

  28. Exclusion and inclusion, using point-node kd-tree bounds. O(D) bounds on distance minima/maxima: Slides by Jeremy Kubica

  29. Exclusion and inclusion, using point-node kd-tree bounds. O(D) bounds on distance minima/maxima: Slides by Jeremy Kubica
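
The formulas for these bounds did not survive in the transcript. One standard way to get O(D) lower and upper bounds on the distance from a query point to any point inside an axis-aligned box is sketched below, assuming Euclidean distance; the function names are hypothetical and the slides may have used a different but equivalent form.

```python
import math

def min_dist_to_box(q, lo, hi):
    """Lower bound: distance from q to the closest point of the box [lo, hi]."""
    # Per dimension the gap is zero when q lies between lo and hi.
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def max_dist_to_box(q, lo, hi):
    """Upper bound: distance from q to the farthest corner of the box."""
    # Per dimension, take the farther of the two box faces.
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(q, lo, hi)))

d_min = min_dist_to_box((0.0, 0.0), (1.0, 1.0), (2.0, 3.0))
d_max = max_dist_to_box((0.0, 0.0), (1.0, 1.0), (2.0, 3.0))
```

For a search radius r, a node can be excluded when `d_min > r` and included wholesale when `d_max <= r`, each decided in one pass over the D dimensions.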

  30. Nearest Neighbor with KD Trees We traverse the tree looking for the nearest neighbor of the query point. Slides by Jeremy Kubica

  31. Nearest Neighbor with KD Trees Examine nearby points first: Explore the branch of the tree that is closest to the query point first. Slides by Jeremy Kubica

  32. Nearest Neighbor with KD Trees Examine nearby points first: Explore the branch of the tree that is closest to the query point first. Slides by Jeremy Kubica

  33. Nearest Neighbor with KD Trees When we reach a leaf node: compute the distance to each point in the node. Slides by Jeremy Kubica

  34. Nearest Neighbor with KD Trees When we reach a leaf node: compute the distance to each point in the node. Slides by Jeremy Kubica

  35. Nearest Neighbor with KD Trees Then we can backtrack and try the other branch at each node visited. Slides by Jeremy Kubica

  36. Nearest Neighbor with KD Trees Each time a new closest point is found, we can update the distance bounds. Slides by Jeremy Kubica

  37. Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor. Slides by Jeremy Kubica

  38. Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor. Slides by Jeremy Kubica

  39. Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor. Slides by Jeremy Kubica

  40. Simple recursive algorithm (k=1 case)
     NN(xq, R, dlo, xsofar, dsofar)
     {
       if dlo > dsofar, return.
       if leaf(R), [xsofar, dsofar] = NNBase(xq, R, dsofar).
       else,
         [R1, d1, R2, d2] = orderByDist(xq, R.l, R.r).
         NN(xq, R1, d1, xsofar, dsofar).
         NN(xq, R2, d2, xsofar, dsofar).
     }
     Slides by Jeremy Kubica
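
A runnable Python rendering of the recursive pseudocode on this slide is sketched below. It is an illustration under assumptions: the tree is a small dict-based kd-tree built by a hypothetical `build` helper, distances are Euclidean, and because Python has no pass-by-reference the pair (xsofar, dsofar) is returned rather than updated in place.

```python
import math

def build(points, leaf_size=2):
    """Tiny kd-tree: dict nodes with tight bounds 'lo' and 'hi' (hypothetical layout)."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"lo": lo, "hi": hi}
    d = max(dims, key=lambda i: hi[i] - lo[i])           # widest dimension
    if len(points) <= leaf_size or hi[d] == lo[d]:
        node["points"] = list(points)                    # leaf node
        return node
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2                                  # median split
    node["l"], node["r"] = build(pts[:mid], leaf_size), build(pts[mid:], leaf_size)
    return node

def min_dist(q, node):
    """Lower bound d_lo: distance from q to the closest point of the node's box."""
    clamped = tuple(min(max(x, l), h) for x, l, h in zip(q, node["lo"], node["hi"]))
    return math.dist(q, clamped)

def nn(q, node, d_lo, best):
    """best is the pair (x_sofar, d_sofar); the updated pair is returned."""
    if d_lo > best[1]:
        return best                                      # prune: box cannot beat d_sofar
    if "points" in node:                                 # NNBase: scan the leaf
        for p in node["points"]:
            d = math.dist(q, p)
            if d < best[1]:
                best = (p, d)
        return best
    kids = [(min_dist(q, c), c) for c in (node["l"], node["r"])]
    kids.sort(key=lambda t: t[0])                        # orderByDist: closer child first
    for d, child in kids:
        best = nn(q, child, d, best)
    return best

def nearest_neighbor(q, root):
    return nn(q, root, min_dist(q, root), (None, math.inf))
```

Visiting the closer child first tends to shrink d_sofar early, so the second recursive call is often cut off by the first test.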

  41. Nearest Neighbor with KD Trees Instead, some animations showing real data… • kd-tree with cached sufficient statistics • nearest-neighbor with kd-trees • range-count with kd-trees For animations, see: http://www.cs.cmu.edu/~awm/animations/kdtree Slides by Jeremy Kubica

  42. Range-count example

  43. Range-count example

  44. Range-count example

  45. Range-count example

  46. Range-count example Pruned! (inclusion)

  47. Range-count example

  48. Range-count example

  49. Range-count example

  50. Range-count example
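
The range-count slides show figures only. A single-query range count with both pruning rules (exclusion and inclusion) might look like the sketch below. It assumes Euclidean distance, a hypothetical dict-based kd-tree that caches a point count at every node, and made-up helper names `build` and `range_count`; it is not taken from the slides.

```python
import math

def build(points, leaf_size=2):
    """Tiny kd-tree; every node caches tight bounds and its point count."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"lo": lo, "hi": hi, "count": len(points)}    # cached sufficient statistic
    d = max(dims, key=lambda i: hi[i] - lo[i])
    if len(points) <= leaf_size or hi[d] == lo[d]:
        node["points"] = list(points)
        return node
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2
    node["l"], node["r"] = build(pts[:mid], leaf_size), build(pts[mid:], leaf_size)
    return node

def range_count(q, r, node):
    """Count the points within distance r of q."""
    lo, hi = node["lo"], node["hi"]
    closest = tuple(min(max(x, l), h) for x, l, h in zip(q, lo, hi))
    farthest = tuple(l if abs(x - l) > abs(x - h) else h
                     for x, l, h in zip(q, lo, hi))
    if math.dist(q, closest) > r:
        return 0                          # exclusion: whole box lies outside the radius
    if math.dist(q, farthest) <= r:
        return node["count"]              # inclusion: whole box lies inside the radius
    if "points" in node:                  # leaf: test each point
        return sum(1 for p in node["points"] if math.dist(q, p) <= r)
    return range_count(q, r, node["l"]) + range_count(q, r, node["r"])
```

The inclusion prune is what the cached count buys: a node that lies entirely inside the radius contributes its count without any of its points being visited.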
