Sublinear Algorithms
Sloan Digital Sky Survey: 4 petabytes (~1MG), 10 petabytes/yr
Biomedical imaging: 150 petabytes/yr
Massive input, sample only a tiny fraction, compute the output: sublinear algorithms
Approximate MST [CRT ’01]. Optimal!
E[estimator] = no. of connected components
var[estimator] << (no. of connected components)^2
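To make this concrete, here is a minimal sketch in the spirit of [CRT ’01], not the exact procedure: the number of connected components equals the sum over vertices v of 1/|C(v)|, which can be estimated by sampling vertices and measuring component sizes with a BFS truncated at a cap; plugging such estimates into MST weight = n - w + sum of c_i for i = 1..w-1 (c_i = number of components among edges of weight at most i) gives the weight estimate. The graph representation, sample size, and cap below are illustrative assumptions, and this sketch rebuilds each threshold subgraph explicitly instead of querying it sublinearly.

```python
import random
from collections import deque

def capped_component_size(adj, v, cap):
    """Size of v's connected component, capped at `cap` (BFS stops early)."""
    seen = {v}
    queue = deque([v])
    while queue and len(seen) < cap:
        u = queue.popleft()
        for nbr in adj[u]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
                if len(seen) >= cap:
                    break
    return len(seen)

def estimate_components(adj, samples, cap):
    """Estimate #components: c = sum_v 1/|C(v)|, approximated by sampling
    vertices and capping the component size at `cap`."""
    n = len(adj)
    total = 0.0
    for _ in range(samples):
        v = random.randrange(n)
        total += 1.0 / capped_component_size(adj, v, cap)
    return n * total / samples

def estimate_mst_weight(n, weighted_edges, max_weight, samples=500, cap=50):
    """MST weight = n - max_weight + sum_{i=1}^{max_weight-1} c_i, where
    c_i = #components of the subgraph with edge weights <= i."""
    est = float(n - max_weight)
    for i in range(1, max_weight):
        adj = [[] for _ in range(n)]
        for u, v, w in weighted_edges:
            if w <= i:
                adj[u].append(v)
                adj[v].append(u)
        est += estimate_components(adj, samples, cap)
    return est
```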
Shortest Paths [CLM ’03]
Ray Shooting [CLM ’03]. Optimal! Also: volume, intersection, point location
Self-Improving Algorithms
011010110110101010110010101010110100111001101010010100010 low-entropy data • Takens embeddings • Markov models (speech)
Self-Improving Algorithms
Arbitrary, unknown random source
Sorting, matching, MaxCut, all-pairs shortest paths, transitive closure, clustering
Self-Improving Algorithms: arbitrary, unknown random source
1. Run an algorithm with the best worst-case behavior, or the best behavior under the uniform distribution, or under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles into a stationary state: optimal expected complexity under the (still unknown) random source.
Self-Improving Algorithms
0110101100101000101001001010100010101001
Successive runs take times T1, T2, T3, T4, T5, …
E[Tk] converges to the optimal expected time for the random source
Sorting (x1, x2, …, xn), each xi drawn independently from Di
H = entropy of the rank distribution. Optimal!
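A minimal sketch of the self-improving idea for sorting, under illustrative assumptions of my own: bucket boundaries are learned from a handful of training instances drawn from the same source, and the class name and parameters are made up. When the rank distribution has low entropy H, the learned buckets stay small, which is what the optimality claim above refers to; the real algorithm's per-coordinate search structures and analysis are omitted here.

```python
import bisect

class SelfImprovingSorter:
    """Toy self-improving sorter: a learning phase samples a few input
    instances from the (unknown) source and picks bucket boundaries;
    afterwards each element is routed to its bucket by binary search and
    the buckets are sorted individually."""

    def __init__(self, n):
        self.n = n
        self.boundaries = None

    def train(self, training_instances):
        """Learning phase: pool the training data and keep roughly every
        n-th order statistic as a bucket boundary."""
        pool = sorted(x for inst in training_instances for x in inst)
        step = max(1, len(pool) // self.n)
        self.boundaries = pool[step::step]

    def sort(self, xs):
        if self.boundaries is None:
            return sorted(xs)           # fall back before training
        buckets = [[] for _ in range(len(self.boundaries) + 1)]
        for x in xs:
            buckets[bisect.bisect_left(self.boundaries, x)].append(x)
        out = []
        for b in buckets:
            b.sort()                    # small in the typical case
            out.extend(b)
        return out
```

After train() on a few instances, sort() routes each element with one binary search and sorts small buckets, so the steady-state cost tracks the source rather than the worst case.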
Clustering: k-median (k = 2)
Minimize the sum of distances, on the Hamming cube {0,1}^d
NP-hard
[KSS]
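For concreteness, a tiny brute-force reading of the k = 2 objective, with a simplifying assumption of mine (made only to keep the example small) that candidate centers are restricted to the input points; the general problem allows arbitrary cube vertices as centers, which is part of why approximations such as [KSS] matter.

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(x != y for x, y in zip(a, b))

def two_median_cost(points, c1, c2):
    """Sum over all points of the distance to the closer of the two centers."""
    return sum(min(hamming(p, c1), hamming(p, c2)) for p in points)

def brute_force_two_median(points):
    """Try every pair of input points as centers (a simplification:
    the optimal centers may lie elsewhere on the cube)."""
    best = None
    for c1, c2 in combinations(points, 2):
        cost = two_median_cost(points, c1, c2)
        if best is None or cost < best[0]:
            best = (cost, c1, c2)
    return best

pts = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
print(brute_force_two_median(pts))   # (cost, center1, center2)
```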
How to achieve linear limiting expected time? Input space {0,1}^{dn}. Identify a core of typical inputs; use KSS on the rest. Tail: probability < O(dn)/cost(KSS), so the fallback contributes only O(dn) to the expectation.
NP vs. P: input vicinity vs. algorithmic vicinity
How to achieve linear limiting expected time? Store a sample of precomputed KSS solutions; for a new input, find the nearest neighbor among them and run an incremental algorithm from there.
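A rough sketch of that pipeline; the instance distance, the warm start, and the single-swap local search below are all illustrative assumptions standing in for the actual construction: precompute solutions (e.g., by a KSS-style algorithm) for sampled instances, match a new instance to its nearest stored one, and improve that solution incrementally.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def cost(points, centers):
    """k-median objective: sum of distances to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)

def instance_distance(inst_a, inst_b):
    """Crude distance between two instances: pointwise Hamming distance
    (assumes both instances list the same number of points, in order)."""
    return sum(hamming(p, q) for p, q in zip(inst_a, inst_b))

def local_improve(points, centers, rounds=50):
    """Incremental step: try swapping one center for an input point and
    keep the swap whenever the objective decreases."""
    centers = list(centers)
    best = cost(points, centers)
    for _ in range(rounds):
        i = random.randrange(len(centers))
        candidate = random.choice(points)
        trial = centers[:i] + [candidate] + centers[i + 1:]
        c = cost(points, trial)
        if c < best:
            centers, best = trial, c
    return centers, best

class SelfImprovingKMedian:
    """Store precomputed solutions for sampled instances; solve a new
    instance by warm-starting from the nearest stored one."""

    def __init__(self, solved_instances):
        # solved_instances: list of (instance, centers) pairs, where the
        # centers were computed offline (e.g., by a KSS-style algorithm).
        self.store = solved_instances

    def solve(self, points):
        _, warm = min(self.store, key=lambda s: instance_distance(s[0], points))
        return local_improve(points, warm)
```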
Online Data Reconstruction
011010110110101010110010101010110100111001101010010100010
011010110***110101010110010101010***10011100**10010***010
1. The data is accessible before the noise
2. Or it is not
011010110***110101010110010101010***10011100**10010***010
1. The data is accessible before the noise:
Error-correcting codes: encode the data (011010110110101010110010101010110100111001101010010100010), then decode the noisy codeword (010*10*0**001) to recover it.
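As a toy illustration only (a 3-fold repetition code, far weaker than the codes one would actually deploy), encoding before the noise and majority-vote decoding afterwards:

```python
def encode(bits, r=3):
    """Repetition code: replicate each bit r times."""
    return [b for b in bits for _ in range(r)]

def decode(received, r=3):
    """Majority vote within each block of r symbols; each bit survives
    as long as at most (r - 1) // 2 of its copies are corrupted."""
    out = []
    for i in range(0, len(received), r):
        block = received[i:i + r]
        out.append(1 if sum(block) * 2 > len(block) else 0)
    return out

data = [0, 1, 1, 0, 1]
noisy = encode(data)
noisy[1] ^= 1          # flip one copy; the majority vote still recovers bit 0
assert decode(noisy) == data
```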
011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise: assumptions are necessary!
011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise. Assumptions:
1. Sorted sequence
2. Bipartite graph, expander
3. Solid with angular constraints
4. Low-dimensional attractor set
011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise: the data must satisfy some property P, but does not quite.
f(x) = ?  Access function f: query x, get f(x) back from the data.
But life being what it is…
Humans: define a distance from any object to the data class.
No undo. The filter sits between the user and the data: a query x to the filter triggers queries x1, x2, … to the data, which returns f(x1), f(x2), …; the filter answers g(x), where g is the access function for a nearby version of the data that satisfies the property.
Similar to self-correction [RS96, BLR ’93], except:
• about data, not functions
• error-free
• allows O(distance to property)
Monotone functions [n]^d → R: the filter requires polylog(n) queries
Online reconstruction: early decisions are crucial; don't mortgage the future!
Frequency of a point x: the smallest interval I containing x with more than |I|/2 violations involving f(x)
Given x:
1. Estimate its frequency
2. If nonzero, find the “smallest” interval around x whose two endpoints both have zero frequency
3. Interpolate between the f-values at those endpoints
(see the sketch below)
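A minimal sketch of those three steps for a function on {0, …, n-1}, with simplifying assumptions of mine: frequencies are computed exactly by scanning all intervals (the real filter estimates them by sampling in polylog(n) time), and the interpolation is replaced by a simple stand-in that clamps f(x) between the values at the nearest zero-frequency points.

```python
def violations(f, x, lo, hi):
    """Number of points y in [lo, hi] that violate monotonicity w.r.t. x:
    y < x with f(y) > f(x), or y > x with f(y) < f(x)."""
    fx = f(x)
    return sum(1 for y in range(lo, hi + 1)
               if (y < x and f(y) > fx) or (y > x and f(y) < fx))

def frequency_is_zero(f, n, x):
    """Zero frequency: no interval containing x has more than half of its
    points in violation with f(x). (Exact scan; the real filter samples.)"""
    for lo in range(0, x + 1):
        for hi in range(x, n):
            if 2 * violations(f, x, lo, hi) > (hi - lo + 1):
                return False
    return True

def filtered_value(f, n, x):
    """Answer a single query x in [0, n): return f(x) when its frequency is
    zero, otherwise a value clamped into the range allowed by the nearest
    zero-frequency points on each side."""
    if frequency_is_zero(f, n, x):                       # step 1
        return f(x)
    # step 2: nearest zero-frequency points around x
    left = next((y for y in range(x - 1, -1, -1) if frequency_is_zero(f, n, y)), None)
    right = next((y for y in range(x + 1, n) if frequency_is_zero(f, n, y)), None)
    # step 3: clamp f(x) between the endpoint values
    lo = f(left) if left is not None else float("-inf")
    hi = f(right) if right is not None else float("inf")
    return min(max(f(x), lo), hi)
```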
To prove:
1. Frequencies can be estimated in polylog time
2. The function is monotone over the zero-frequency domain
3. The zero-frequency domain occupies a (1 - 2ε) fraction (ε = distance to the property)
Bivariate concave functions: the filter requires polylog(n) queries