
Parallel 2D Kolmogorov-Smirnov Statistic

Parallel 2D Kolmogorov-Smirnov Statistic. Ian Chan 5/12/02 6.338J/18.337J http://web.mit.edu/ianchan/www/KS2D. Motivation: my friend’s research. A colossal X-ray flare, likely sparked by a central Milky Way black hole, produced the bright spot in this Chandra image. [Source CNN].


Presentation Transcript


  1. Parallel 2D Kolmogorov-Smirnov Statistic Ian Chan 5/12/02 6.338J/18.337J http://web.mit.edu/ianchan/www/KS2D

  2. Motivation: my friend’s research A colossal X-ray flare, likely sparked by a central Milky Way black hole, produced the bright spot in this Chandra image. [Source CNN]

  3. 1D Kolmogorov-Smirnov Statistic • Tests nonparametrically whether two empirical distributions differ (F ≠ G) • D statistic: the maximum absolute difference between the two CDFs
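A minimal sketch of the two-sample D statistic described on this slide, in plain Python. The function name `ks_1d` is hypothetical and not taken from the presentation; the sketch relies only on the fact that empirical CDFs are step functions, so the largest gap occurs at one of the sample values.

```python
import bisect

def ks_1d(sample_f, sample_g):
    """Two-sample KS statistic: max |F(x) - G(x)| over the empirical CDFs."""
    f = sorted(sample_f)
    g = sorted(sample_g)
    d = 0.0
    # Both empirical CDFs only change at sample values, so it is enough
    # to compare them at every value that appears in either sample.
    for x in f + g:
        cdf_f = bisect.bisect_right(f, x) / len(f)  # fraction of f <= x
        cdf_g = bisect.bisect_right(g, x) / len(g)  # fraction of g <= x
        d = max(d, abs(cdf_f - cdf_g))
    return d
```

For two samples that do not overlap at all, such as `ks_1d([1, 2], [3, 4])`, the statistic reaches its maximum value of 1.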

  4. 1D KS Test Bound • Kolmogorov (1933) asymptotic bound:
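For reference, the classical asymptotic result usually attributed to Kolmogorov (1933) is sketched below for the one-sample statistic $D_n$ with sample size $n$. That the slide showed exactly this form is an assumption. In the two-sample case, $n$ is commonly replaced by $n_1 n_2 / (n_1 + n_2)$.

```latex
\lim_{n \to \infty} P\left(\sqrt{n}\, D_n \le z\right)
  = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 z^2}, \qquad z > 0
```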

  5. 2D Analog of KS Test • Peacock J, Monthly Notices of the Royal Astronomical Society, 1983, vol 202 p615: Two-Dimensional Goodness-of-Fit Testing in Astronomy • D statistic: considering all possible quadrant divisions, the largest possible difference in CDFs

  6. 2D KS Test Bound • Monte Carlo simulated bounds, Z = D n^(1/2)

  7. KS2D Test Brute Force Algorithm • O(n^2), not exhaustive: quadrants centered at each data point • O(n^3), exhaustive: quadrants centered at each possible combination of a data x and a data y
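The two brute-force variants can be sketched as follows. This is an illustrative sketch, not the presenter's code: the names `quadrant_fractions` and `ks2d_brute` are hypothetical, and the handling of points that lie exactly on a quadrant boundary (they are counted on the "less than or equal" side) is an assumption.

```python
def quadrant_fractions(points, cx, cy):
    """Fraction of points falling in each of the four quadrants around (cx, cy)."""
    counts = [0, 0, 0, 0]
    for x, y in points:
        # Index 0..3 encodes (x > cx, y > cy); boundary points count as "not greater".
        counts[2 * (x > cx) + (y > cy)] += 1
    return [c / len(points) for c in counts]

def ks2d_brute(a, b, exhaustive=False):
    """Largest difference in quadrant fractions between samples a and b."""
    pooled = a + b
    if exhaustive:
        # O(n^3): every pairing of a data x with a data y is a quadrant corner.
        corners = [(x, y) for x, _ in pooled for _, y in pooled]
    else:
        # O(n^2): only the data points themselves are quadrant corners.
        corners = pooled
    d = 0.0
    for cx, cy in corners:
        fa = quadrant_fractions(a, cx, cy)
        fb = quadrant_fractions(b, cx, cy)
        d = max(d, max(abs(p - q) for p, q in zip(fa, fb)))
    return d
```

Because the exhaustive corner set contains every data point, the exhaustive result is never smaller than the non-exhaustive one.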

  8. O(n log n) KS2D algorithm • Author: A. Cooke (1999) • Construction of a binary tree data structure (O(n log n)); requires the sample data to be pre-sorted by y

  9. How it works: (1) Tree construction • For a quadrant centered at (x,y), the upper left quadrant must contain all samples (a,b) where a < x AND b < y • For a childless node, Dmin = Dmax = 1/Nsquare or -1/Ncircle, depending on its class

  10. How it works: (2) Upward Propagation At node (2,3), we find the MIN and MAX from 3 choices: (1) Inherit Dmin/max from its left child (1,2), which implies that Q excludes (2,3), where Q is the quadrant that contains the largest |D|. (2) D = delta(left child) + (0/Ns - 1/Nc), which implies that Q contains (2,3) and has (2,3) on its border. Delta(x) = difference in CDF if the quadrant contains all samples in the subtree at x. (3) D = delta(left child) + (0/Ns - 1/Nc) + Dmin/max(right child), which implies that Q contains (2,3) and (2,3) is not on its border.
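The three-choice rule above can be transcribed into a small combine step. This is a hedged sketch of the rule as stated on the slide, not Cooke's implementation: the `Summary` class, the `combine` function, and the treatment of a missing child (its delta is taken as 0 and the choice that depends on it is dropped) are assumptions. The argument `own` is the node's own contribution, +1/Ns for one class or -1/Nc for the other.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Summary:
    dmin: float   # smallest D over quadrants ending inside this subtree
    dmax: float   # largest D over quadrants ending inside this subtree
    delta: float  # D if the quadrant contains every sample in this subtree

def combine(own: float,
            left: Optional[Summary] = None,
            right: Optional[Summary] = None) -> Summary:
    left_delta = left.delta if left is not None else 0.0
    right_delta = right.delta if right is not None else 0.0
    # Choice 2: the quadrant contains this node and the node is on its border.
    on_border = left_delta + own
    lows, highs = [on_border], [on_border]
    if left is not None:
        # Choice 1: inherit from the left child (quadrant excludes this node).
        lows.append(left.dmin)
        highs.append(left.dmax)
    if right is not None:
        # Choice 3: the quadrant extends past this node into the right subtree.
        lows.append(on_border + right.dmin)
        highs.append(on_border + right.dmax)
    return Summary(min(lows), max(highs), left_delta + own + right_delta)
```

For a childless node only choice 2 applies, so `combine(own)` gives Dmin = Dmax = `own`, which matches the leaf rule from the tree construction slide.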

  11. The other 3 quadrants • We have considered the Top Left quadrant, but the Top Right quadrant can be obtained from the same tree structure if we modify the upward propagation rule by swapping left and right, i.e. at node (2,3), we find the MIN and MAX from 3 choices: (1) Inherit Dmin/max from its right child (1,2), which implies that Q excludes (2,3), where Q is the quadrant that contains the largest |D|. (2) D = delta(right child) + (0/Ns - 1/Nc), which implies that Q contains (2,3) and has (2,3) on its border. Delta(x) = difference in CDF if the quadrant contains all samples in the subtree at x. (3) D = delta(right child) + (0/Ns - 1/Nc) + Dmin/max(left child), which implies that Q contains (2,3) and (2,3) is not on its border. • The Bottom Left/Right quadrants can be obtained if the tree is built with the samples sorted in reverse order of y.

  12. Parallel KS2D Algorithm • Speed can potentially scale linearly with the number of processors during the upward propagation step; the tree construction step cannot be parallelized • Problem size scales linearly with the number of processors because the sample nodes are distributed across the processors. Challenges • Load balancing: dividing the tree nodes equally among processors • Minimizing communication: try to store an entire subtree on a single processor so that less inter-processor communication is necessary.

  13. Load Balancing and Minimum Communications • Ideally…

  14. Load Balancing Strategy (1): pre-processing Randomly sample 1000 data points and sort them by x. Take the samples at positions (1/numproc)*1000, (2/numproc)*1000, …, ((numproc-1)/numproc)*1000 and use them to define the intervals for load balancing. Drawback: assumes x and y are more or less independent
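The pre-processing step might look like the sketch below. The function names `make_intervals` and `owner`, the `seed` parameter, and the fallback to the full data set when fewer than 1000 points are available are assumptions added for illustration.

```python
import bisect
import random

def make_intervals(points, numproc, sample_size=1000, seed=0):
    """Return numproc-1 cut points in x taken from a random subsample."""
    rng = random.Random(seed)
    k = min(sample_size, len(points))  # assumption: use all points if fewer than 1000
    xs = sorted(x for x, _ in rng.sample(points, k))
    # Cut at the (i/numproc)*k-th sorted x value for i = 1 .. numproc-1.
    return [xs[i * k // numproc] for i in range(1, numproc)]

def owner(cuts, x):
    """Index of the processor whose x interval contains x."""
    return bisect.bisect_right(cuts, x)
```

Each sample is then routed with `owner(cuts, x)`, so every processor receives roughly the same number of nodes as long as the subsample is representative of the x distribution.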

  15. Load Balancing Strategy (2): adaptive Keep a running average of the x values of the nodes stored on each processor. After every CHECKPOINT (= 2000) samples, if the load is skewed (the difference in load between the most heavily and most lightly loaded processors exceeds 30% of the lightest processor's load), change the load-balancing intervals to the midpoints of the running averages.
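A sketch of the adaptive rule, assuming it only changes where future samples are sent (the slide does not say whether nodes already stored are migrated). The class name `AdaptiveBalancer`, the `SKEW` constant, and the handling of processors that have no samples yet (their neighbouring cut is left unchanged) are assumptions.

```python
import bisect

CHECKPOINT = 2000  # samples between skew checks, as on the slide
SKEW = 0.30        # tolerated load difference relative to the lightest processor

class AdaptiveBalancer:
    def __init__(self, cuts):
        self.cuts = list(cuts)            # numproc-1 interval boundaries in x
        numproc = len(self.cuts) + 1
        self.load = [0] * numproc         # nodes stored per processor
        self.x_sum = [0.0] * numproc      # running sum of x per processor
        self.seen = 0

    def assign(self, x):
        """Route one sample to a processor and update the running statistics."""
        p = bisect.bisect_right(self.cuts, x)
        self.load[p] += 1
        self.x_sum[p] += x
        self.seen += 1
        if self.seen % CHECKPOINT == 0:
            self._rebalance()
        return p

    def _rebalance(self):
        lightest, heaviest = min(self.load), max(self.load)
        if heaviest - lightest <= SKEW * lightest:
            return  # load is balanced enough
        means = [s / c if c else None for s, c in zip(self.x_sum, self.load)]
        for i in range(len(self.cuts)):
            # Assumption: a cut moves only if both neighbours have a running average.
            if means[i] is not None and means[i + 1] is not None:
                self.cuts[i] = (means[i] + means[i + 1]) / 2
        self.cuts.sort()  # guard so the boundaries stay ordered
```

With two processors and an initial cut at 0.5, sending 1800 samples at x = 0.1 and 200 samples at x = 0.6 triggers a rebalance at the checkpoint and moves the cut to about 0.35, the midpoint of the two running averages.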

  16. Performance (1) • 20,000 and 200,000 samples from the uniform [0,1] distribution

  17. Performance (2) • Effects of adaptive load-balancing on performance for samples from standard normal distributions centered at (-0.7,0.7) and (0.7, -0.7)

  18. Conclusion for Parallel KS2D • Speedup is not great, especially when more processors are used, because of communication overhead. • Load-balancing strategies are noticeably effective for certain data distributions; whether they are needed depends on the samples • Distributed memory: gains the ability to solve larger problems
