Sublinear Algorithms
Sloan Digital Sky Survey: 4 petabytes (~1MG), 10 petabytes/yr
Biomedical imaging: 150 petabytes/yr
Massive input, sample only a tiny fraction, compute the output: sublinear algorithms
Approximate MST [CRT ’01]. Optimal!
E[estimator] = no. of connected components
var[estimator] << (no. of connected components)^2
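To make this concrete, here is a minimal sketch in the spirit of [CRT ’01], not the exact procedure: the number of connected components equals the sum over vertices v of 1/|C(v)|, which can be estimated by sampling vertices and measuring component sizes with a BFS truncated at a cap; plugging such estimates into MST weight = n - w + sum of c_i for i = 1..w-1 (c_i = number of components among edges of weight at most i) gives the weight estimate. The graph representation, sample size, and cap below are illustrative assumptions, and this sketch rebuilds each threshold subgraph explicitly instead of querying it sublinearly.

```python
import random
from collections import deque

def capped_component_size(adj, v, cap):
    """Size of v's connected component, capped at `cap` (BFS stops early)."""
    seen = {v}
    queue = deque([v])
    while queue and len(seen) < cap:
        u = queue.popleft()
        for nbr in adj[u]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
                if len(seen) >= cap:
                    break
    return len(seen)

def estimate_components(adj, samples, cap):
    """Estimate #components: c = sum_v 1/|C(v)|, approximated by sampling
    vertices and capping the component size at `cap`."""
    n = len(adj)
    total = 0.0
    for _ in range(samples):
        v = random.randrange(n)
        total += 1.0 / capped_component_size(adj, v, cap)
    return n * total / samples

def estimate_mst_weight(n, weighted_edges, max_weight, samples=500, cap=50):
    """MST weight = n - max_weight + sum_{i=1}^{max_weight-1} c_i, where
    c_i = #components of the subgraph with edge weights <= i."""
    est = float(n - max_weight)
    for i in range(1, max_weight):
        adj = [[] for _ in range(n)]
        for u, v, w in weighted_edges:
            if w <= i:
                adj[u].append(v)
                adj[v].append(u)
        est += estimate_components(adj, samples, cap)
    return est
```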
Shortest Paths [CLM ’03]
Ray Shooting [CLM ’03]. Optimal! Also: volume, intersection, point location
Self-Improving Algorithms
011010110110101010110010101010110100111001101010010100010 low-entropy data • Takens embeddings • Markov models (speech)
Self-Improving Algorithms
Arbitrary, unknown random source
Sorting, matching, MaxCut, all-pairs shortest paths, transitive closure, clustering
Self-Improving Algorithms: arbitrary, unknown random source
1. Run an algorithm with the best worst-case behavior, or the best behavior under the uniform distribution, or under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles into a stationary state: optimal expected complexity under the (still unknown) random source.
Self-Improving Algorithms
0110101100101000101001001010100010101001
Successive runs take times T1, T2, T3, T4, T5, …
E[Tk] converges to the optimal expected time for the random source
Sorting (x1, x2, …, xn), each xi drawn independently from Di
H = entropy of the rank distribution. Optimal!
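A minimal sketch of the self-improving idea for sorting, under illustrative assumptions of my own: bucket boundaries are learned from a handful of training instances drawn from the same source, and the class name and parameters are made up. When the rank distribution has low entropy H, the learned buckets stay small, which is what the optimality claim above refers to; the real algorithm's per-coordinate search structures and analysis are omitted here.

```python
import bisect

class SelfImprovingSorter:
    """Toy self-improving sorter: a learning phase samples a few input
    instances from the (unknown) source and picks bucket boundaries;
    afterwards each element is routed to its bucket by binary search and
    the buckets are sorted individually."""

    def __init__(self, n):
        self.n = n
        self.boundaries = None

    def train(self, training_instances):
        """Learning phase: pool the training data and keep roughly every
        n-th order statistic as a bucket boundary."""
        pool = sorted(x for inst in training_instances for x in inst)
        step = max(1, len(pool) // self.n)
        self.boundaries = pool[step::step]

    def sort(self, xs):
        if self.boundaries is None:
            return sorted(xs)           # fall back before training
        buckets = [[] for _ in range(len(self.boundaries) + 1)]
        for x in xs:
            buckets[bisect.bisect_left(self.boundaries, x)].append(x)
        out = []
        for b in buckets:
            b.sort()                    # small in the typical case
            out.extend(b)
        return out
```

After train() on a few instances, sort() routes each element with one binary search and sorts small buckets, so the steady-state cost tracks the source rather than the worst case.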
Clustering: k-median (k = 2)
Minimize the sum of distances, on the Hamming cube {0,1}^d
NP-hard
[KSS]
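For concreteness, a tiny brute-force reading of the k = 2 objective, with a simplifying assumption of mine (made only to keep the example small) that candidate centers are restricted to the input points; the general problem allows arbitrary cube vertices as centers, which is part of why approximations such as [KSS] matter.

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(x != y for x, y in zip(a, b))

def two_median_cost(points, c1, c2):
    """Sum over all points of the distance to the closer of the two centers."""
    return sum(min(hamming(p, c1), hamming(p, c2)) for p in points)

def brute_force_two_median(points):
    """Try every pair of input points as centers (a simplification:
    the optimal centers may lie elsewhere on the cube)."""
    best = None
    for c1, c2 in combinations(points, 2):
        cost = two_median_cost(points, c1, c2)
        if best is None or cost < best[0]:
            best = (cost, c1, c2)
    return best

pts = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
print(brute_force_two_median(pts))   # (cost, center1, center2)
```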
How to achieve linear limiting expected time? Input space {0,1}^{dn}. Identify a core of typical inputs; use KSS on the rest. Tail: probability < O(dn)/cost(KSS), so the fallback contributes only O(dn) to the expectation.
NP vs. P: input vicinity vs. algorithmic vicinity
How to achieve linear limiting expected time? Store a sample of precomputed KSS solutions; for a new input, find the nearest neighbor among them and run an incremental algorithm from there.
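A rough sketch of that pipeline; the instance distance, the warm start, and the single-swap local search below are all illustrative assumptions standing in for the actual construction: precompute solutions (e.g., by a KSS-style algorithm) for sampled instances, match a new instance to its nearest stored one, and improve that solution incrementally.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def cost(points, centers):
    """k-median objective: sum of distances to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)

def instance_distance(inst_a, inst_b):
    """Crude distance between two instances: pointwise Hamming distance
    (assumes both instances list the same number of points, in order)."""
    return sum(hamming(p, q) for p, q in zip(inst_a, inst_b))

def local_improve(points, centers, rounds=50):
    """Incremental step: try swapping one center for an input point and
    keep the swap whenever the objective decreases."""
    centers = list(centers)
    best = cost(points, centers)
    for _ in range(rounds):
        i = random.randrange(len(centers))
        candidate = random.choice(points)
        trial = centers[:i] + [candidate] + centers[i + 1:]
        c = cost(points, trial)
        if c < best:
            centers, best = trial, c
    return centers, best

class SelfImprovingKMedian:
    """Store precomputed solutions for sampled instances; solve a new
    instance by warm-starting from the nearest stored one."""

    def __init__(self, solved_instances):
        # solved_instances: list of (instance, centers) pairs, where the
        # centers were computed offline (e.g., by a KSS-style algorithm).
        self.store = solved_instances

    def solve(self, points):
        _, warm = min(self.store, key=lambda s: instance_distance(s[0], points))
        return local_improve(points, warm)
```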
Online Data Reconstruction
011010110110101010110010101010110100111001101010010100010
011010110***110101010110010101010***10011100**10010***010
1. The data is accessible before the noise
2. Or it is not
011010110***110101010110010101010***10011100**10010***010
1. The data is accessible before the noise:
Error-correcting codes: encode the data (011010110110101010110010101010110100111001101010010100010), then decode the noisy codeword (010*10*0**001) to recover it.
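As a toy illustration only (a 3-fold repetition code, far weaker than the codes one would actually deploy), encoding before the noise and majority-vote decoding afterwards:

```python
def encode(bits, r=3):
    """Repetition code: replicate each bit r times."""
    return [b for b in bits for _ in range(r)]

def decode(received, r=3):
    """Majority vote within each block of r symbols; each bit survives
    as long as at most (r - 1) // 2 of its copies are corrupted."""
    out = []
    for i in range(0, len(received), r):
        block = received[i:i + r]
        out.append(1 if sum(block) * 2 > len(block) else 0)
    return out

data = [0, 1, 1, 0, 1]
noisy = encode(data)
noisy[1] ^= 1          # flip one copy; the majority vote still recovers bit 0
assert decode(noisy) == data
```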
011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise: assumptions are necessary!
011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise. Assumptions:
1. Sorted sequence
2. Bipartite graph, expander
3. Solid with angular constraints
4. Low-dimensional attractor set
011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise: the data must satisfy some property P, but does not quite.
f(x) = ?  Access function f: query x, get f(x) back from the data.
But life being what it is…
Humans: define a distance from any object to the data class.
No undo. The filter sits between the user and the data: a query x to the filter triggers queries x1, x2, … to the data, which returns f(x1), f(x2), …; the filter answers g(x), where g is the access function for a nearby version of the data that satisfies the property.
Similar to self-correction [RS96, BLR ’93], except:
• about data, not functions
• error-free
• allows O(distance to property)
Monotone functions [n]^d → R: the filter requires polylog(n) queries
Online reconstruction: early decisions are crucial; don't mortgage the future!
Frequency of a point x: the smallest interval I containing x with more than |I|/2 violations involving f(x)
Given x:
1. Estimate its frequency
2. If nonzero, find the “smallest” interval around x whose two endpoints both have zero frequency
3. Interpolate between the f-values at those endpoints
(see the sketch below)
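A minimal sketch of those three steps for a function on {0, …, n-1}, with simplifying assumptions of mine: frequencies are computed exactly by scanning all intervals (the real filter estimates them by sampling in polylog(n) time), and the interpolation is replaced by a simple stand-in that clamps f(x) between the values at the nearest zero-frequency points.

```python
def violations(f, x, lo, hi):
    """Number of points y in [lo, hi] that violate monotonicity w.r.t. x:
    y < x with f(y) > f(x), or y > x with f(y) < f(x)."""
    fx = f(x)
    return sum(1 for y in range(lo, hi + 1)
               if (y < x and f(y) > fx) or (y > x and f(y) < fx))

def frequency_is_zero(f, n, x):
    """Zero frequency: no interval containing x has more than half of its
    points in violation with f(x). (Exact scan; the real filter samples.)"""
    for lo in range(0, x + 1):
        for hi in range(x, n):
            if 2 * violations(f, x, lo, hi) > (hi - lo + 1):
                return False
    return True

def filtered_value(f, n, x):
    """Answer a single query x in [0, n): return f(x) when its frequency is
    zero, otherwise a value clamped into the range allowed by the nearest
    zero-frequency points on each side."""
    if frequency_is_zero(f, n, x):                       # step 1
        return f(x)
    # step 2: nearest zero-frequency points around x
    left = next((y for y in range(x - 1, -1, -1) if frequency_is_zero(f, n, y)), None)
    right = next((y for y in range(x + 1, n) if frequency_is_zero(f, n, y)), None)
    # step 3: clamp f(x) between the endpoint values
    lo = f(left) if left is not None else float("-inf")
    hi = f(right) if right is not None else float("inf")
    return min(max(f(x), lo), hi)
```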
To prove:
1. Frequencies can be estimated in polylog time
2. The function is monotone over the zero-frequency domain
3. The zero-frequency domain occupies a (1 - 2ε) fraction (ε = distance to the property)
Bivariate concave functions: the filter requires polylog(n) queries