Density estimation in linear time (+ approximating L1-distances)
Satyaki Mahalanabis, Daniel Štefankovič
University of Rochester
Density estimation: F = a family of densities (f_1, ..., f_6 in the figure) + DATA → output one density from F.
Density estimation - example. F = the family of normal densities N(μ,1) with σ = 1. DATA: 0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625.
Measure of quality: the L1-distance from the truth, |f−g|_1 = ∫|f(x)−g(x)| dx, where g = TRUTH and f = OUTPUT. Why L1? 1) small L1 ⇒ all events are estimated with small additive error; 2) it is scale invariant.
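(Not from the talk, just a numeric illustration.) A minimal sketch approximating |f−g|_1 on a finite grid; the grid bounds and the scipy densities are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm

def l1_distance(f, g, lo=-10.0, hi=10.0, num=100001):
    """Approximate |f - g|_1 = integral of |f(x) - g(x)| dx on [lo, hi]."""
    x = np.linspace(lo, hi, num)
    return np.trapz(np.abs(f(x) - g(x)), x)

# Two unit-variance normals with means d apart have exact L1 distance
# 2*(2*Phi(d/2) - 1); for d = 1 this is ~0.7658.
print(l1_distance(norm(0, 1).pdf, norm(1, 1).pdf))  # ~0.7658
```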
Obstacles to "quality": 1) bad data; 2) a weak class of densities, measured by dist_1(g,F) = min over f ∈ F of |f−g|_1.
What is bad data? With g = TRUTH and h = DATA (the empirical density), instead of the full |h−g|_1 we measure it by 2 max over A ∈ Y(F) of |h(A)−g(A)|, where Y(F) is the Yatracos class of F: the sets A_ij = { x | f_i(x) > f_j(x) } (e.g. A_12, A_13, A_23 for f_1, f_2, f_3).
Density estimation: from DATA (h) and F, output f with small |g−f|_1, assuming these are small: dist_1(g,F) and ε = 2 max over A ∈ Y(F) of |h(A)−g(A)|.
Why would these be small??? They will be if: 1) we pick a large enough F (making dist_1(g,F) small); 2) we pick a small enough F, so that the VC-dimension of Y(F) is small; 3) the data are i.i.d. from g. Theorem (Haussler, Dudley, Vapnik, Chervonenkis): E[ max over A ∈ Y of |h(A)−g(A)| ] = O(√(VC(Y)/#samples)).
How to choose from 2 densities? Use the test function T = sgn(f1 − f2), which is +1 where f1 > f2 and −1 elsewhere, and compare ∫T f1, ∫T f2 and ∫T dh. Scheffé: if ∫T dh > ∫T (f1+f2)/2, output f1, else f2. Theorem (see DL'01): |f−g|_1 ≤ 3 dist_1(g,F) + 2ε.
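A sketch of the Scheffé test as stated above, with the integrals against f1 and f2 approximated on a grid x and ∫T dh taken as a sample average; the helper name and signature are mine, not the authors':

```python
import numpy as np

def scheffe_test(f1, f2, data, x):
    """Scheffé test between densities f1 and f2, given samples `data`;
    x is a grid used to approximate the integrals against f1 and f2."""
    T = lambda y: np.sign(f1(y) - f2(y))    # test function T = sgn(f1 - f2)
    T_h = np.mean(T(data))                  # int T dh, h = empirical measure
    T_mid = np.trapz(T(x) * (f1(x) + f2(x)) / 2, x)  # int T (f1 + f2)/2
    return f1 if T_h > T_mid else f2        # Scheffé's rule from the slide
```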
Test functions. For F = {f1, f2, ..., fN} let T_ij(x) = sgn(f_i(x) − f_j(x)). Then ∫T_ij (f_i − f_j) = ∫(f_i − f_j) sgn(f_i − f_j) = |f_i − f_j|_1. Compare ∫T_ij dh with ∫T_ij f_i and ∫T_ij f_j: f_i wins if ∫T_ij dh is closer to ∫T_ij f_i, otherwise f_j wins.
Density estimation algorithms. Scheffé tournament (≈N^2 tests): pick the density with the most wins. Theorem (DL'01): |f−g|_1 ≤ 9 dist_1(g,F) + 8ε. Minimum distance estimate (Y'85) (≈N^3 time): output the f_k ∈ F that minimizes max over i,j of |∫(f_k − h) T_ij|. Theorem (DL'01): |f−g|_1 ≤ 3 dist_1(g,F) + 2ε. Can we do better?
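For concreteness, a sketch of Yatracos' minimum distance estimate as just described (grid-based integrals again; names mine). The N^2 pairs times N candidates is where the ≈N^3 comes from:

```python
import numpy as np
from itertools import combinations

def minimum_distance_estimate(fs, data, x):
    """Return the f_k in fs minimizing max_{i<j} |int (f_k - h) T_ij|,
    where T_ij = sgn(f_i - f_j) and h is the empirical measure of data."""
    pairs = []
    for i, j in combinations(range(len(fs)), 2):
        T = np.sign(fs[i](x) - fs[j](x))                   # T_ij on the grid
        T_h = np.mean(np.sign(fs[i](data) - fs[j](data)))  # int T_ij dh
        pairs.append((T, T_h))
    def loss(f):
        fx = f(x)
        return max(abs(np.trapz(T * fx, x) - T_h) for T, T_h in pairs)
    return min(fs, key=loss)
```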
Our algorithm: efficient minimum loss-weight. Repeat until one distribution is left: 1) pick the pair of remaining distributions in F that are furthest apart (in L1); 2) run the Scheffé test and eliminate the loser. Idea: take the most "discriminative" action. Theorem [MS'08]: |f−g|_1 ≤ 3 dist_1(g,F) + 2ε, with only ≈N Scheffé tests* (* after preprocessing F).
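A naive sketch of the elimination loop (reusing scheffe_test from above; names mine). It runs only |F| − 1 Scheffé tests, but re-finds the furthest surviving pair by brute force; speeding up that step is exactly the tournament revelation problem below:

```python
import numpy as np

def min_loss_weight(fs, data, x):
    """Efficient minimum loss-weight, naive pair-finding: repeatedly run
    the Scheffé test on the two surviving densities that are furthest
    apart in L1 and eliminate the loser."""
    dist = {(i, j): np.trapz(np.abs(fs[i](x) - fs[j](x)), x)
            for i in range(len(fs)) for j in range(i + 1, len(fs))}
    alive = list(range(len(fs)))
    while len(alive) > 1:
        i, j = max(((a, b) for a in alive for b in alive if a < b),
                   key=lambda p: dist[p])      # furthest surviving pair
        winner = scheffe_test(fs[i], fs[j], data, x)
        alive.remove(j if winner is fs[i] else i)   # eliminate the loser
    return fs[alive[0]]
```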
Tournament revelation problem. INPUT: a weighted undirected graph G (wlog all edge weights distinct). REPORT the heaviest edge {u1,v1} in G; the ADVERSARY eliminates u1 or v1, giving G1. REPORT the heaviest edge {u2,v2} in G1; the ADVERSARY eliminates u2 or v2, giving G2; ..... OBJECTIVE: minimize the total time spent generating reports.
Tournament revelation problem - example: K4 on {A,B,C,D} with distinct edge weights 1-6. Report the heaviest edge, BC; the adversary eliminates B. Report the heaviest remaining edge, AD; the adversary eliminates A. Report the heaviest remaining edge, CD.
Tournament revelation problem - solutions. Precomputing the full decision tree of reports (one branch per adversary choice: BC, then AD or BD, then ...): 2^O(|F|) preprocessing, O(|F|) run-time. Alternatively: O(|F|^2 log |F|) preprocessing, O(|F|^2) run-time. WE DO NOT KNOW: can one get O(|F|) run-time with polynomial preprocessing???
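A sketch of the second trade-off, under the assumption (true in our application, where every pair of densities has a distance) that the graph is complete; the names and the demo's weight assignment are mine, chosen to match the worked example:

```python
def tournament_reports(weights, eliminate):
    """O(|F|^2 log |F|) preprocessing / O(|F|^2) run-time strategy:
    sort the edges once; each report scans forward past edges with an
    eliminated endpoint, and the scan pointer never moves back.
    weights: {frozenset({u, v}): weight}, all weights distinct.
    eliminate: adversary callback; given the reported pair, returns the
    endpoint to remove."""
    edges = sorted(weights, key=weights.get, reverse=True)  # preprocessing
    alive = {v for e in weights for v in e}
    pos = 0
    while len(alive) > 1:
        while not edges[pos] <= alive:   # skipped edges stay dead, so the
            pos += 1                     # total scanning is O(|F|^2)
        u, v = edges[pos]                # report the heaviest live edge
        alive.remove(eliminate(u, v))    # adversary eliminates u or v
    return alive.pop()

# Demo on a K4 consistent with the slides' example (weights 1..6, BC
# heaviest); the adversary removes the alphabetically smaller endpoint,
# so the reports are BC, AD, CD and the survivor is D.
w = {frozenset(e): i + 1 for i, e in enumerate(
    [("C", "D"), ("A", "C"), ("A", "B"), ("A", "D"), ("B", "D"), ("B", "C")])}
print(tournament_reports(w, lambda u, v: min(u, v)))  # "D"
```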
Back to efficient minimum loss-weight: step 1 (pick the furthest-apart surviving pair) is exactly the tournament revelation problem, although in practice step 2, the Scheffé test, is more costly. The trade-offs above apply: 2^O(|F|) preprocessing with O(|F|) run-time, or O(|F|^2 log |F|) preprocessing with O(|F|^2) run-time; O(|F|) run-time with polynomial preprocessing remains open.
Efficient minimum loss-weight - analysis. Theorem: |f−g|_1 ≤ 3 dist_1(g,F) + 2ε. Proof idea: "that guy lost even more badly!" For every f′ to which f loses, |f−f′|_1 ≤ max over the f″ to which f′ loses of |f′−f″|_1.
Proof sketch: let BEST = f2 (the density in F closest to g) and suppose the output f1 survived while f2 suffered a bad loss to some f3. The Scheffé win of f3 over f2 gives 2∫T_23 dh ≥ ∫T_23 f2 + ∫T_23 f3; since pairs are examined in order of decreasing L1-distance, ∫(f1−f2)T_12 ≤ ∫(f2−f3)T_23; and ∫(f_i−f_j)(T_ij−T_kl) ≥ 0 for all pairs, since T_ij maximizes ∫(f_i−f_j)T over tests T. Combining these with the data-error bound on terms of the form ∫(f−h)T yields |f1−g|_1 ≤ 3|f2−g|_1 + 2ε.
Application: kernel density estimates (Akaike'54, Parzen'62, Rosenblatt'56). K = kernel, g = true density, x1, x2, ..., xn i.i.d. samples from g; the kernel smooths the empirical density h: (1/n) Σ_{i=1..n} K(y − x_i) = h * K → g * K as n → ∞.
What K should we choose? A Dirac kernel would be good for the bias (g * K = g) but is not good for the finite-sample estimate h * K. Something in between - bandwidth selection for kernel density estimates: K_s(x) = (1/s) K(x/s), so K_s → Dirac as s → 0. Theorem (see DL'01): if s → 0 with sn → ∞, then |h * K_s − g|_1 → 0.
Data splitting methods for kernel density estimates. How to pick the smoothing factor s in f_s(y) = (1/(ns)) Σ_{i=1..n} K((y−x_i)/s)? Split x1, x2, ..., xn: build the candidates f_s(y) = (1/((n−m)s)) Σ_{i=1..n−m} K((y−x_i)/s) from x1, ..., x_{n−m}, then choose s by running density estimation on the held-out points x_{n−m+1}, ..., x_n.
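A sketch of the data-splitting scheme with a boxcar (piecewise uniform) kernel, reusing min_loss_weight from the earlier sketch; `samples` is a NumPy array, and the kernel choice and function names are mine:

```python
import numpy as np

def kde(points, s, y):
    """Boxcar-kernel density estimate with bandwidth s: K(u) = 1/2 on
    [-1, 1], so f_s(y) = 1/(n s) * sum_i K((y - x_i) / s)."""
    return np.mean(np.abs(y[None, :] - points[:, None]) <= s, axis=0) / (2 * s)

def select_bandwidth(samples, bandwidths, m, x):
    """Data splitting: one candidate f_s per bandwidth, built from the
    first n - m samples; s is chosen by running min_loss_weight on the
    held-out m samples, with integrals approximated on the grid x."""
    train, holdout = samples[:-m], samples[-m:]
    candidates = [lambda y, s=s: kde(train, s, y) for s in bandwidths]
    return min_loss_weight(candidates, holdout, x)
```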
Kernels we will use in f_s(y) = (1/(ns)) Σ_i K((y−x_i)/s): piecewise uniform and piecewise linear (so each candidate density is itself piecewise uniform or piecewise linear).
Bandwidth selection for uniform kernels. Setting: N candidate densities, each piecewise uniform with n pieces, and m data points (e.g. N ≈ n^(1/2), m ≈ n^(5/4)). Goal: run the density estimation algorithms efficiently. Costs: one evaluation of ∫(f_i+f_j)T_ij and ∫T_ij dh takes n + m log n; one distance |f_i−f_j|_1 takes n. TIME: MD performs ≈N^2 evaluations of ∫(f_k−h)T_kj, for N^2 (n + m log n) total; EMLW performs ≈N Scheffé tests plus ≈N^2 distance computations |f_i−f_j|_1, for N (n + m log n) + N^2 n total. Can we speed this up? For the distances, absolute error is bad but relative error is good, so approximation suffices.
Approximating L1-distances between distributions: N piecewise uniform densities (each with n pieces). TRIVIAL (exact): N^2 n. WE WILL DO: O((N^2 + Nn)(log N)/ε^2), with (1±ε) relative error.
Dimension reduction for L2. Johnson-Lindenstrauss Lemma ('82): for |S| = n there is π: L2 → L2^t with t = O(ε^{-2} ln n) such that (∀x,y ∈ S) d(x,y) ≤ d(π(x),π(y)) ≤ (1+ε) d(x,y); take π random linear with i.i.d. N(0, t^{-1/2}) entries.
Dimension reduction for L1. Cauchy Random Projection (Indyk'00): for |S| = n there is π: L1 → L1^t with t = O(ε^{-2} ln n) such that (∀x,y ∈ S) d(x,y) ≤ est(π(x),π(y)) ≤ (1+ε) d(x,y); the N(0, t^{-1/2}) entries are replaced by i.i.d. C(0, 1/t) entries, and est is a nonlinear estimator. (Charikar, Brinkman'03: est cannot be replaced by d.)
Cauchy distribution C(0,1), density function 1/(π(1+x^2)). FACTS: X ~ C(0,1) ⇒ aX ~ C(0,|a|); X ~ C(0,a), Y ~ C(0,b) independent ⇒ X+Y ~ C(0,a+b).
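A quick Monte Carlo sanity check of the two facts (my example, not from the slides): C(0, a) has CDF 1/2 + arctan(x/a)/π, so if X + Y ~ C(0, 3) then P(X + Y < 3) should be 3/4:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.standard_cauchy(n)        # X ~ C(0, 1)
Y = 2.0 * rng.standard_cauchy(n)  # aX ~ C(0, |a|), so Y ~ C(0, 2)

# X + Y ~ C(0, 3) by the addition fact; P(X + Y < 3) = 1/2 + arctan(1)/pi.
print(np.mean(X + Y < 3.0))       # ~0.75
```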
Cauchy random projection for L1 (Indyk'00), applied to piecewise uniform densities. Partition the line into intervals of lengths z_1, ..., z_9 and attach an independent X_i ~ C(0, z_i) to interval i. A piecewise uniform density projects to the sum of its values weighted by the X_i: a density with values A and B projects to A(X2+X3) + B(X5+X6+X7+X8), one with constant value D projects to D(X1+X2+...+X8+X9). The difference of the projections of two such densities μ and ν is distributed Cauchy(0, |μ−ν|_1).
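A sketch of this projection for piecewise uniform densities on a shared partition (names mine). To turn the Cauchy-distributed differences into distance estimates it uses t independent repetitions and a median, which is consistent because the median of |C(0, d)| is exactly d; the authors' est may differ:

```python
import numpy as np

def all_pairs_l1(values, lengths, t=400, seed=0):
    """Estimate |f_i - f_j|_1 for piecewise uniform densities on a common
    partition.  values[i] lists f_i's value on each interval; lengths are
    the interval lengths.  Each repetition draws fresh X_k ~ C(0, z_k);
    the difference of two projections is C(0, |f_i - f_j|_1) per row."""
    rng = np.random.default_rng(seed)
    lengths = np.asarray(lengths, dtype=float)
    V = np.asarray(values, dtype=float)                   # shape (N, k)
    X = rng.standard_cauchy((t, len(lengths))) * lengths  # X[r, k] ~ C(0, z_k)
    W = V @ X.T                                           # (N, t) projections
    N = len(V)
    return {(i, j): float(np.median(np.abs(W[i] - W[j])))
            for i in range(N) for j in range(i + 1, N)}
```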
All pairs L1-distances - piecewise linear densities. A linear piece can be handled by mixing the Cauchy variables of its endpoints: with X1, X2 ~ C(0, 1/2), the red piece projects to R = (3/4)X1 + (1/4)X2 and the blue one to B = (3/4)X2 + (1/4)X1, so R − B = (1/2)(X1 − X2) ~ C(0, 1/2), matching |red − blue|_1 = 1/2.
All pairs L1-distances - piecewise linear densities. Problem: too many intersections! Solution: cut into even smaller pieces! Stochastic measures are useful here.
Brownian motion: increment density (1/√(2π)) exp(−x^2/2). Cauchy motion: increment density 1/(π(1+x^2)).
Brownian motion: computing integrals is easy. For f: R → R^d, ∫f dL = Y ~ N(0, S), a d-dimensional Gaussian (covariance S_ij = ∫ f_i f_j).
Cauchy motion: for d = 1 computing integrals is easy: f: R → R gives ∫f dL = Y ~ C(0, s) with s = ∫|f|. For d > 1 computing integrals is hard* (* obtaining an explicit expression for the density).
What were we doing? On the common partition X1, ..., X9, one coordinate of the projection is just an integral against Cauchy motion L: ∫(f1, f2, f3) dL = ((w1)_1, (w2)_1, (w3)_1), the first coordinates of the projections w_i. Can we efficiently compute integrals ∫φ dL for φ piecewise linear?
Can we efficiently compute integrals ∫φ dL for φ piecewise linear? The base case is φ: R → R^2, φ(z) = (1, z), with (X, Y) = ∫φ dL. After the change of variables (u, v) ↦ (u+v, u−v), i.e. passing to (2(X−Y), 2Y), the joint density admits an explicit expression.