So Much Data, So Little Time. Bernard Chazelle, Princeton University
So Many Slides (before lunch), So Little Time. Bernard Chazelle, Princeton University
math, algorithms, experimentation, computation (2006)
End of Moore's Law: by 2020, the party's over!
algorithms, experimentation, computation
This is not me. Long multiplication, 32 × 17: partial products 224 and 320, total 544.
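The grade-school computation on the slide, spelled out: the partial products of 32 × 17 are 224 (= 32 × 7) and 320 (= 32 × 10), summing to 544. A small sketch (the function name is illustrative):

```python
def partial_products(a, b):
    """Long multiplication of a * b: return the partial products of b,
    one per decimal digit (shifted by the digit's place value)."""
    parts = []
    shift = 0
    while b > 0:
        digit = b % 10
        parts.append(a * digit * 10 ** shift)
        b //= 10
        shift += 1
    return parts

parts = partial_products(32, 17)
print(parts)       # [224, 320]
print(sum(parts))  # 544
```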
FFT RSA
The Era of the Algorithm
Data: big, noisy, uncertain, unevenly priced, low entropy
Sloan Digital Sky Survey: 4 petabytes (~1MG), 10 petabytes/yr; biomedical imaging: 150 petabytes/yr
My A(9,9)-th paper: Collected Works of Micha Sharir
Sublinear Algorithms: sample a tiny fraction of a massive input to produce the output
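The simplest instance of the idea: estimate a global statistic (here, the mean) from a tiny random sample instead of reading the whole input. A hedged sketch; `approx_mean` and its parameters are chosen for illustration, not taken from the talk:

```python
import random

def approx_mean(data, sample_size=1000, seed=0):
    """Estimate the mean of a huge sequence by sampling a tiny fraction
    of its entries uniformly at random (with replacement)."""
    rng = random.Random(seed)
    sample = [data[rng.randrange(len(data))] for _ in range(sample_size)]
    return sum(sample) / len(sample)

data = list(range(1_000_000))  # true mean is 499999.5
est = approx_mean(data)        # close to the true mean, reading 0.1% of the data
```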
Shortest Paths [C-Liu-Magen '03]: New York to Delphi
Ray shooting (optimal!), volume, intersection, point location
Approximate MST [C-Rubinfeld-Trevisan ’01] Optimal!
E[ĉ] = no. of connected components, Var[ĉ] << (no. of connected components)²
so ĉ is, whp, a good estimator of the number of connected components
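A minimal sketch of an estimator of this flavor, assuming adjacency-list input: sample a few vertices, explore each one's component by a capped BFS, and average the reciprocals of the explored sizes. This is a toy variant for illustration, not the exact [C-Rubinfeld-Trevisan '01] algorithm; `estimate_components`, `cap`, and `num_samples` are made-up names.

```python
import random
from collections import deque

def estimate_components(adj, num_samples=200, cap=50, seed=1):
    """Estimate the number of connected components of a graph given as
    adjacency lists.  For each sampled vertex u, a BFS explores u's
    component but stops after `cap` vertices; the contribution
    1/min(|C(u)|, cap) averages to roughly (#components)/n when
    components are small, so n times the average estimates #components."""
    n = len(adj)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        u = rng.randrange(n)
        seen = {u}
        queue = deque([u])
        while queue and len(seen) < cap:   # capped BFS: sublinear work
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        total += 1.0 / len(seen)
    return n * total / num_samples
```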
input space: worst case vs. average case (uniform)
“OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”
Self-Improving Algorithms: inputs drawn from an arbitrary, unknown random source
0110101100101000101001001010100010101001
inputs processed in times T1, T2, T3, T4, …
E[Tk] → optimal expected time for the random source
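As a toy illustration of the self-improving idea (not the construction from the talk): a training phase learns approximate quantiles of the unknown input distribution, after which each input is bucket-sorted using the learned boundaries, so later runs exploit what earlier runs revealed. Names and parameters below are illustrative.

```python
import bisect

class SelfImprovingSorter:
    """Toy self-improving sorter: train() learns approximate quantiles
    from sample inputs; sort() then buckets items by those quantiles and
    sorts each (expectedly small) bucket.  Falls back to plain sorting
    when untrained."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        self.boundaries = None

    def train(self, samples):
        """Learn bucket boundaries (approximate quantiles) from samples."""
        flat = sorted(x for s in samples for x in s)
        step = max(1, len(flat) // self.num_buckets)
        self.boundaries = flat[step::step][: self.num_buckets - 1]

    def sort(self, arr):
        if not self.boundaries:
            return sorted(arr)  # no training yet: plain sort
        buckets = [[] for _ in range(len(self.boundaries) + 1)]
        for x in arr:
            buckets[bisect.bisect_left(self.boundaries, x)].append(x)
        out = []
        for b in buckets:
            out.extend(sorted(b))  # boundaries are sorted, so concatenation is sorted
        return out
```

Because the boundaries partition the value range in order, concatenating the sorted buckets always yields a correctly sorted output; training only affects speed, never correctness.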
Clustering [Ailon-C-Liu-Comandur '05]: k-median over the Hamming cube
minimize the sum of distances (NP-hard)
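k-median is NP-hard in general, but the k = 1 case over the Hamming cube has a closed form: the cost decomposes coordinate by coordinate, so taking the majority bit in each coordinate is optimal. A small sketch; function names are illustrative:

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit tuples."""
    return sum(x != y for x, y in zip(a, b))

def one_median(points):
    """1-median over the Hamming cube: the sum of distances splits into
    one term per coordinate, and the majority bit minimizes each term."""
    d = len(points[0])
    return tuple(int(sum(p[i] for p in points) * 2 > len(points))
                 for i in range(d))

pts = [(0, 1, 1), (0, 1, 0), (1, 1, 0)]
center = one_median(pts)                      # (0, 1, 0)
cost = sum(hamming(center, p) for p in pts)   # 2
```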
[Kumar-Sabharwal-Sen '04]: COST ≤ (1 + ε) · OPT
How to achieve linear limiting time? Input space {0,1}^dn.
Identify the core. Tail: use KSS, taken with prob < O(dn)/KSS (so the expected tail cost stays linear).
Store a sample of precomputed KSS solutions; nearest neighbor; incremental algorithm
Bring in da noise!
original:  011010110110101010110010101010110100111001101010010100010
corrupted: 011010110***110101010110010101010***10011100**10010***010
encode / decode: error-correcting codes
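The simplest error-correcting code that recovers data of this kind is the repetition code: repeat each bit r times and majority-decode each block, correcting up to (r-1)//2 flips per block. A toy sketch, not the codes used in practice:

```python
def encode(bits, r=3):
    """Repetition code: repeat every bit r times."""
    return [b for b in bits for _ in range(r)]

def decode(codeword, r=3):
    """Majority-decode each block of r symbols; corrects up to
    (r - 1) // 2 flipped bits per block."""
    out = []
    for i in range(0, len(codeword), r):
        block = codeword[i:i + r]
        out.append(int(sum(block) * 2 > len(block)))
    return out

msg = [0, 1, 1, 0, 1]
cw = encode(msg)
cw[1] ^= 1                 # one bit of noise
assert decode(cw) == msg   # recovered despite the flip
```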
The data was inaccessible before the noise, so what makes you think it's wrong?
It must satisfy some property (e.g., convex, bipartite) but does not quite.
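A property that "does not quite" hold can be spot-checked with a property tester. Below is a toy monotonicity tester assuming query access to f on {0, …, n-1}: sample random pairs i < j and check f(i) ≤ f(j). It always accepts monotone functions and rejects functions far from monotone with good probability; this is a simple pair sampler for illustration, not the testers behind the results in the talk.

```python
import random

def looks_monotone(f, n, trials=100, seed=0):
    """Spot-check whether f: {0,...,n-1} -> R looks monotone by sampling
    random index pairs i <= j and checking f(i) <= f(j)."""
    rng = random.Random(seed)
    for _ in range(trials):
        i = rng.randrange(n)
        j = rng.randrange(n)
        if i > j:
            i, j = j, i
        if f(i) > f(j):
            return False   # found a violating pair: not monotone
    return True            # no violation seen in `trials` samples
```

Note the one-sided guarantee: a `True` answer only means no violation was sampled, which is exactly the regime where the talk's filters step in.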
f(x) = ?  (f = the access function: query x, receive f(x) from the data)
But life being what it is…
Humans: define a distance from any object to the data class
No undo! f(x) = ? goes to the filter: the query x triggers lookups x1, x2, …; the answers f(x1), f(x2), … come back, and g(x) is returned.
g is an access function for a dataset that satisfies the property.
Online Data Reconstruction: early decisions are crucial!
Monotone functions [n]^d → R: the filter requires polylog(n) lookups [Ailon-C-Liu-Comandur '04]
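The online polylog-lookup filter is beyond a slide-sized sketch, but its offline analogue is easy to show: repair a nearly monotone sequence by replacing each entry with the running maximum, which yields a monotone output that agrees with the input wherever the input already respects its prefix maxima. A much simpler stand-in, for intuition only:

```python
def repair_monotone(values):
    """Offline 'repair' of a nearly monotone sequence: replace each entry
    by the running maximum seen so far.  The output is monotone, and
    entries that were already consistent are left unchanged."""
    out = []
    best = float("-inf")
    for v in values:
        best = max(best, v)   # running maximum so far
        out.append(best)
    return out

noisy = [1, 2, 0, 4, 3, 6]
fixed = repair_monotone(noisy)   # [1, 2, 2, 4, 4, 6]
```

The online setting is what makes the problem hard: the filter must answer each query immediately and consistently, with no undo, which is why early decisions are crucial.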