Parallel 2D Kolmogorov-Smirnov Statistic

Parallel 2D Kolmogorov-Smirnov Statistic Ian Chan 5/12/02 6.338J/18.337J http://web.mit.edu/ianchan/www/KS2D

Motivation: my friend’s research A colossal X-ray flare, likely sparked by a central Milky Way black hole, produced the bright spot in this Chandra image. [Source CNN]

1D Kolmogorov Smirnov Statistic • test difference in two empirical distributions F ¹ G nonparametrically • D statistic: maximum difference between 2 CDF’s

1D KS Test Bound • Kolmogorov(1933) asymptotic bound:

2D Analog of KS Test • Peacock J, Monthly Notices of the Royal Astronomical Society, 1983, vol 202 p615: Two-Dimensional Goodness-of-Fit Testing in Astronomy • D statistic: considering all possible quadrant divisions, the largest possible difference in CDFs

2D KS Test Bound • Monte Carlo simulated bounds Z = D n1/2

KS2D Test Brute Force Algorithm • O(n2), not exhaustive, quadrants centered at each data ponts • O(n3), exhaustive, quadrants centered at each possible data x and data y combination

O(nlogn) KS2D algorithm • Author: A. Cooke (1999) • construction of binary tree data structure ( O(nlogn) ), require pre-sorted sample data by y

How it works: (1) Tree construction • quadrants centered at (x,y) must have upper left quadrant contains all samples (a,b) where a < x AND b < y • If childless node, Dmin = Dmax = 1/Nsquare or –1/Ncircle, depending on class

How it works: (2) Upward Propagation At node (2,3), we find the MIN and MAX from the 3 choices: 1 inherit Dmin/max from its left child (1,2), which implies that Q excludes (2,3) where Q is the quadrant that contains the largest |D| 2 D = delta(left child) + (0/Ns-1/Nc), which implies Q contains (2,3) and has (2,3) on its border. Delta(x) = diff in CDF if quadrant contains all samples in subtree at x 3 D = delta(left child) + (0/Ns-1/Nc) + Dmin/max (right child), which implies Q contains (2,3) and (2,3) is not on its border

The other 3 quadrants • We have considered the Top Left Quadrant, but the Top Right quadrant can be obtained from the same tree structure if we modify the upward propagation rule by swapping left & right, i.e. At node (2,3), we find the MIN and MAX from the 3 choices: 1 inherit Dmin/max from its right child (1,2), which implies that Q excludes (2,3) where Q is the quadrant that contains the largest |D| 2 D = delta(right child) + (0/Ns-1/Nc), which implies Q contains (2,3) and has (2,3) on its border. Delta(x) = diff in CDF if quadrant contains all samples in subtree at x 3 D = delta(right child) + (0/Ns-1/Nc) + Dmin/max (left child), which implies Q contains (2,3) and (2,3) is not on its border • The Bottom Left/Right Quadrants can be obtained if the tree is built with samples sorted by reverse order of y.

Parallel KS2D Algorithm • Speed possibly scales linearly with number of processors during the upward propagation step, cannot parallelize the tree construction step • Problem size scales linearly with number of processors because sample nodes are stored in processors distributively Challenges • Load Balancing: Dividing the tree nodes equally among processors • Minimize communications: Try to store an entire subtree into a single processor so that less inter-processor communication is necessary.

Load Balancing and Minimum Communications • Ideally…

Load Balancing Strategy (1) Pre-processing Randomly sample 1000 data points. Sort them by x. Consider the 1/numproc*1000th, 2/numproc*1000th…, (numproc-1)/n*1000th positions and use them to define intervals for load balancing Drawback: assumes x and y to be more or less independent

Load Balancing Strategy (2): adaptive Keep a running average of the x values of nodes stored in each processor. For every CHECKPOINT(=2000) number of samples, if the load is skewed (if difference of load between the heaviest load processor and the lightest load processor > 30% of load of lightest processor) change the load balancing intervals to midpoints of the running averages.

Performance(1) • 20,000 and 200,000 samples from uniform [0,1] distribution

Performance (2) • Effects of adaptive load-balancing on performance for samples from standard normal distributions centered at (-0.7,0.7) and (0.7, -0.7)

Conclusion for Parallel KS2D • Speedup is not great, especially when more processors are used because of communication overhead. • Load balancing strategies is noticeably effective for certain data distributions, need dependent on samples • Distributive Memory: Gains the ability to solve larger problems

Parallel 2D Kolmogorov-Smirnov Statistic

Parallel 2D Kolmogorov-Smirnov Statistic

Presentation Transcript

A GENERALIZED KOLMOGOROV-SMIRNOV STATISTIC FOR DETRITAL ZIRCON ANALYSIS OF MODERN RIVERS

A Kolmogorov -Smirnov Correlation-Based Filter for Microarray Data

Proving Lines Parallel (G.2d)

Kolmogorov :

The Kolmogorov-Smirnov Test

Kolmogorov -Smirnov Test

Kolmogorov :