
Satyaki Mahalanabis Daniel Štefankovič



Presentation Transcript


  1. Density estimation in linear time (+ approximating $L_1$-distances). Satyaki Mahalanabis, Daniel Štefankovič. University of Rochester.

  2. Density estimation. $F$ = a family of densities (figure: $f_1, \dots, f_6$) + DATA $\to$ density.

  3. Density estimation, example. DATA: 0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625. $F$ = a family of normal densities with $\sigma = 1$. DATA + $F$ $\to$ $N(\mu, 1)$.

  4. Measure of quality: $g$ = TRUTH, $f$ = OUTPUT. $L_1$-distance from the truth: $|f-g|_1 = \int |f(x)-g(x)|\,dx$. Why $L_1$? 1) Small $L_1$ $\Rightarrow$ all events are estimated with small additive error. 2) It is scale invariant.
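The two properties claimed for the $L_1$ measure can be checked numerically. The sketch below is only an illustration: the helper names `normal_pdf` and `l1_distance`, the integration ranges, and the step count are assumptions, not part of the talk. It approximates $\int |f-g|$ for two unit-variance normals and then for the same pair with the x-axis stretched by 3, where scale invariance predicts the same value.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def l1_distance(f, g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of |f(x) - g(x)| over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += abs(f(x) - g(x))
    return total * dx

# Two unit-variance normals whose means differ by one standard deviation.
d_unit = l1_distance(lambda x: normal_pdf(x, 0.0, 1.0),
                     lambda x: normal_pdf(x, 1.0, 1.0), -10.0, 11.0)

# The same pair with the x-axis stretched by a factor of 3.
d_scaled = l1_distance(lambda x: normal_pdf(x, 0.0, 3.0),
                       lambda x: normal_pdf(x, 3.0, 3.0), -30.0, 33.0)

print(d_unit, d_scaled)
```

For any event $A$, $|f(A)-g(A)| \le |f-g|_1 / 2$, which is the sense in which a small $L_1$-distance gives a small additive error on every event.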

  5. Obstacles to “quality” (given $F$ + DATA): bad data $\to$ ?; weak class of densities $\to$ $\mathrm{dist}_1(g,F)$.

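  6. What is bad data? $g$ = TRUTH, $h$ = DATA (empirical density). Measured not by $|h-g|_1$ but by $\Delta = 2\max_{A \in Y(F)} |h(A)-g(A)|$, where $Y(F)$ = Yatracos class of $F$, consisting of the sets $A_{ij} = \{ x \mid f_i(x) > f_j(x) \}$. (Figure: $f_1$, $f_2$, $f_3$ with the sets $A_{12}$, $A_{13}$, $A_{23}$.)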
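To make the quantity $2\max_{A \in Y(F)} |h(A)-g(A)|$ concrete, here is a minimal sketch under a strong simplifying assumption: every density is a histogram over the same few bins, stored as a list of bin probabilities. The function names and the toy numbers are hypothetical.

```python
def yatracos_sets(F):
    """All sets A_ij = {bins where f_i > f_j}, over ordered pairs i != j."""
    sets = []
    for i, fi in enumerate(F):
        for j, fj in enumerate(F):
            if i != j:
                sets.append([b for b in range(len(fi)) if fi[b] > fj[b]])
    return sets

def mass(q, A):
    """Probability that q assigns to the set of bins A."""
    return sum(q[b] for b in A)

def delta(F, h, g):
    """2 * max over the Yatracos class of |h(A) - g(A)|."""
    return 2.0 * max(abs(mass(h, A) - mass(g, A)) for A in yatracos_sets(F))

# Toy family of three histograms on four bins (illustrative numbers only).
F = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.25, 0.25, 0.25, 0.25]]
g = F[0]                      # pretend the truth is the first density
h = [0.6, 0.2, 0.1, 0.1]      # empirical histogram of the data

print(delta(F, h, g))
```

Only the sets on which two members of $F$ disagree are inspected, so the quantity can be small even when $h$ and $g$ differ a great deal elsewhere.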

  7. Density estimation: $F$ + DATA ($h$) $\to$ $f$ with small $|g-f|_1$, assuming these are small: $\mathrm{dist}_1(g,F)$ and $\Delta = 2\max_{A \in Y(F)} |h(A)-g(A)|$.

  8. Why would these be small? $\mathrm{dist}_1(h,F)$ and $\Delta = 2\max_{A \in Y(F)} |h(A)-g(A)|$ will be small if: 1) we pick a large enough $F$; 2) we pick a small enough $F$, so that the VC-dimension of $Y(F)$ is small; 3) the data are iid from $h$. Theorem (Haussler, Dudley, Vapnik, Chervonenkis): $E\left[\max_{A \in Y} |h(A)-g(A)|\right] \le O\left(\sqrt{\mathrm{VC}(Y) / \#\text{samples}}\right)$.

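  9. How to choose from 2 densities? (Figure: two densities $f_1$ and $f_2$.)

  10. How to choose from 2 densities? (Figure: $f_1$ and $f_2$, with regions labeled $+1, +1, +1, -1$.)

  11. How to choose from 2 densities? (Figure: $f_1$, $f_2$ and the test function $T$ with values $+1, +1, +1, -1$; quantities shown: $\int T f_1$, $\int T f_2$, $\int T h$.)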

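  12. How to choose from 2 densities? Compare $\int T h$ with $\int T f_1$ and $\int T f_2$, where the test function $T$ takes values $+1$ and $-1$ (figure). Scheffé: if $\int T h > \int T (f_1+f_2)/2$ then output $f_1$, else output $f_2$. Theorem (see DL’01): $|f-g|_1 \le 3\,\mathrm{dist}_1(g,F) + 2\Delta$.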
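The Scheffé rule on slide 12 can be tried on the seven data points from slide 3. This is a sketch under assumptions: the two candidates $N(0,1)$ and $N(2,1)$, the helper names, and the numerical integration range are chosen for illustration, and $T = \mathrm{sgn}(f_1 - f_2)$ is taken as the test function (as defined on slide 14). Since $h$ is the empirical density, $\int T h$ is simply the average of $T$ over the samples.

```python
import math

def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def sign(v):
    return (v > 0) - (v < 0)

def scheffe_winner(f1, f2, samples, lo, hi, steps=4000):
    """Return 1 if f1 wins the Scheffé test against f2 on the samples, else 2."""
    def T(x):
        return sign(f1(x) - f2(x))
    # Integral of T against the empirical density = mean of T over the data.
    t_h = sum(T(x) for x in samples) / len(samples)
    # Midpoint-rule integral of T * (f1 + f2) / 2.
    dx = (hi - lo) / steps
    t_mid = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        t_mid += T(x) * (f1(x) + f2(x)) / 2.0
    t_mid *= dx
    return 1 if t_h > t_mid else 2

data = [0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625]
winner = scheffe_winner(lambda x: normal_pdf(x, 0.0),
                        lambda x: normal_pdf(x, 2.0), data, -9.0, 11.0)
print(winner)
```

Here $T = +1$ to the left of $x = 1$ and $-1$ to the right, five of the seven points fall on the left, so $\int T h = 3/7$, while $\int T (f_1+f_2)/2$ is 0 by symmetry; the test therefore selects $f_1$.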

  13. Density estimation: $F$ + DATA ($h$) $\to$ $f$ with small $|g-f|_1$, assuming these are small: $\mathrm{dist}_1(g,F)$ and $\Delta = 2\max_{A \in Y(F)} |h(A)-g(A)|$.

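  14. Test functions. $F = \{f_1, f_2, \dots, f_N\}$. $T_{ij}(x) = \mathrm{sgn}(f_i(x) - f_j(x))$, so $\int T_{ij}(f_i - f_j) = \int (f_i-f_j)\,\mathrm{sgn}(f_i-f_j) = |f_i - f_j|_1$. (Figure: a line from $\int T_{ij} f_j$ to $\int T_{ij} f_i$ with $\int T_{ij} h$ marked on it; $f_j$ wins if $\int T_{ij} h$ lies in the half nearer $\int T_{ij} f_j$, and $f_i$ wins otherwise.)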

  15. Density estimation algorithms. Scheffé tournament: pick the density with the most wins. Theorem (DL’01): $|f-g|_1 \le 9\,\mathrm{dist}_1(g,F) + 8\Delta$; time $n^2$. Minimum distance estimate (Y’85): output the $f_k \in F$ that minimizes $\max_{ij} \left| \int (f_k-h)\, T_{ij} \right|$. Theorem (DL’01): $|f-g|_1 \le 3\,\mathrm{dist}_1(g,F) + 2\Delta$; time $n^3$.

  16. Density estimation algorithms (the two algorithms and bounds of slide 15). Can we do better?
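Both baseline algorithms can be sketched in a few lines if one assumes, purely for illustration, that every density is a histogram on shared bins (a list of bin probabilities) and that $h$ is the empirical histogram. The function names and the toy family are hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)

def t_integral(fi, fj, q):
    """Integral of T_ij * q over the bins, with T_ij = sgn(f_i - f_j)."""
    return sum(sign(a - b) * c for a, b, c in zip(fi, fj, q))

def scheffe_tournament(F, h):
    """Run all pairwise Scheffé tests; return the index with the most wins."""
    wins = [0] * len(F)
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            midpoint = [(a + b) / 2.0 for a, b in zip(F[i], F[j])]
            if t_integral(F[i], F[j], h) > t_integral(F[i], F[j], midpoint):
                wins[i] += 1
            else:
                wins[j] += 1
    return max(range(len(F)), key=lambda k: wins[k])

def minimum_distance(F, h):
    """Return the index k minimizing max over pairs (i, j) of |int (f_k - h) T_ij|."""
    pairs = [(i, j) for i in range(len(F)) for j in range(i + 1, len(F))]
    def score(k):
        diff = [a - b for a, b in zip(F[k], h)]
        return max(abs(t_integral(F[i], F[j], diff)) for i, j in pairs)
    return min(range(len(F)), key=score)

F = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.25, 0.25, 0.25, 0.25]]
h = [0.6, 0.2, 0.1, 0.1]

print(scheffe_tournament(F, h), minimum_distance(F, h))
```

The loop structure mirrors the costs on the slide: the tournament evaluates one test per pair, while the minimum distance estimate evaluates every pair for every candidate.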

  17. Our algorithm: Efficient minimum loss-weight. Repeat until one distribution is left: 1) pick the pair of distributions in $F$ that are furthest apart (in $L_1$); 2) eliminate the loser. That is, take the most “discriminative” action. Theorem [MS’08]: $|f-g|_1 \le 3\,\mathrm{dist}_1(g,F) + 2\Delta$; time $n$ (after preprocessing $F$).
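The elimination loop on slide 17 can be sketched as follows, again assuming histogram densities on shared bins and a Scheffé test to decide the loser. This naive version recomputes the furthest pair in every round, so it does not achieve the linear running time of the theorem, which relies on preprocessing $F$; all names and numbers are illustrative.

```python
def sign(v):
    return (v > 0) - (v < 0)

def l1(p, q):
    """L1 distance between two histograms on the same bins."""
    return sum(abs(a - b) for a, b in zip(p, q))

def scheffe_loser(F, i, j, h):
    """Index (i or j) of the density that loses the Scheffé test on data h."""
    T = [sign(a - b) for a, b in zip(F[i], F[j])]
    t_h = sum(t * c for t, c in zip(T, h))
    t_mid = sum(t * (a + b) / 2.0 for t, a, b in zip(T, F[i], F[j]))
    return j if t_h > t_mid else i

def emlw(F, h):
    """Repeatedly test the furthest-apart surviving pair and drop the loser."""
    alive = set(range(len(F)))
    while len(alive) > 1:
        i, j = max(((a, b) for a in alive for b in alive if a < b),
                   key=lambda p: l1(F[p[0]], F[p[1]]))
        alive.remove(scheffe_loser(F, i, j, h))
    return alive.pop()

F = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.25, 0.25, 0.25, 0.25]]
h = [0.6, 0.2, 0.1, 0.1]

print(emlw(F, h))
```

With $N$ densities the loop performs exactly $N-1$ tests, one per elimination; the remaining cost is in finding the furthest pair each round.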

  18. Tournament revelation problem. INPUT: a weighted undirected graph $G$ (wlog all edge-weights distinct). OUTPUT: REPORT the heaviest edge $\{u_1,v_1\}$ in $G$; the ADVERSARY eliminates $u_1$ or $v_1$, giving $G_1$; REPORT the heaviest edge $\{u_2,v_2\}$ in $G_1$; the ADVERSARY eliminates $u_2$ or $v_2$, giving $G_2$; and so on. OBJECTIVE: minimize the total time spent generating reports.

  19. Tournament revelation problem. (Figure: a graph on vertices A, B, C, D with edge weights 1 to 6.) Report the heaviest edge.

  20. Tournament revelation problem. Report the heaviest edge: BC.

  21. Tournament revelation problem. Report the heaviest edge: BC. Eliminate B. (Figure: the remaining graph on A, C, D with edge weights 1, 2, 3.) Report the heaviest edge.

  22. Tournament revelation problem. Report the heaviest edge: BC. Eliminate B. Report the heaviest edge: AD.

  23. Tournament revelation problem. Report the heaviest edge: BC. Eliminate B. Report the heaviest edge: AD. Eliminate A. Report the heaviest edge: CD.

  24. Tournament revelation problem. (Figure: the example graph on A, B, C, D and the tree of possible reports, rooted at BC.) $2^{O(F)}$ preprocessing $\to$ $O(F)$ run-time. $O(F^2 \log F)$ preprocessing $\to$ $O(F^2)$ run-time. WE DO NOT KNOW: can we get $O(F)$ run-time with polynomial preprocessing?
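One way to reach the $O(F^2 \log F)$ preprocessing and $O(F^2)$ run-time trade-off is to sort the edges once and then scan the sorted list, skipping edges with an eliminated endpoint. The sketch below is an assumption about how this could be done, not necessarily the authors' construction. The example graph follows slides 19 to 23; only BC = 6 and the report order are given there, so the other weight assignments are guesses consistent with that order.

```python
def tournament_reports(edges, choose_loser):
    """Report heaviest surviving edges; choose_loser(u, v) plays the adversary."""
    # Preprocessing: sort the edges once by decreasing weight.
    order = sorted(edges, key=edges.get, reverse=True)
    eliminated = set()
    reports = []
    pos = 0
    # Run-time: a single left-to-right scan over the sorted edge list.
    while pos < len(order):
        u, v = order[pos]
        if u in eliminated or v in eliminated:
            pos += 1          # this edge is gone, move to the next heaviest
            continue
        reports.append((u, v))
        eliminated.add(choose_loser(u, v))
    return reports

# Example graph; weights other than BC = 6 are assumed for illustration.
edges = {("B", "C"): 6, ("B", "D"): 5, ("A", "B"): 4,
         ("A", "D"): 3, ("A", "C"): 2, ("C", "D"): 1}

# Scripted adversary reproducing the eliminations on the slides (B, then A).
script = {("B", "C"): "B", ("A", "D"): "A", ("C", "D"): "C"}
reports = tournament_reports(edges, lambda u, v: script[(u, v)])
print(reports)
```

Each edge is visited at most once after sorting, which is where the quadratic run-time in the number of vertices comes from.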

  25. Efficient minimum loss-weight. Repeat until one distribution is left: 1) pick the pair of distributions that are furthest apart (in $L_1$); 2) eliminate the loser (in practice 2) is more costly). $2^{O(F)}$ preprocessing $\to$ $O(F)$ run-time. $O(F^2 \log F)$ preprocessing $\to$ $O(F^2)$ run-time. WE DO NOT KNOW: can we get $O(F)$ run-time with polynomial preprocessing?

  26. Efficient minimum loss-weight. Repeat until one distribution is left: 1) pick the pair of distributions that are furthest apart (in $L_1$); 2) eliminate the loser. Theorem: $|f-g|_1 \le 3\,\mathrm{dist}_1(g,F) + 2\Delta$; time $n$. Proof: “that guy lost even more badly!” For every $f'$ to which $f$ loses, $|f-f'|_1 \le \max_{f' \text{ loses to } f''} |f'-f''|_1$.

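  27. Proof: “that guy lost even more badly!” For every $f'$ to which $f$ loses, $|f-f'|_1 \le \max_{f' \text{ loses to } f''} |f'-f''|_1$. Quantities shown in the figure: $2\int h T_{23}$ and $\int f_2 T_{23} + \int f_3 T_{23}$; $\int (f_1-f_2) T_{12}$; $\int (f_2-f_3) T_{23}$; $\int (f_4-h) T_{23}$. Fact used: $\int (f_i-f_j)(T_{ij}-T_{kl}) \ge 0$. Labels in the figure: bad loss, $f_1$, $f_3$, BEST $= f_2$. Conclusion: $|f_1-g|_1 \le 3|f_2-g|_1 + 2\Delta$.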

  28. Application: kernel density estimates (Akaike’54, Parzen’62, Rosenblatt’56). $K$ = kernel, $h$ = density; the kernel is used to smooth the empirical $g$ ($x_1, x_2, \dots, x_n$ i.i.d. samples from $h$): $g * K = \frac{1}{n} \sum_{i=1}^{n} K(y-x_i) \to h * K$ as $n \to \infty$.

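  29. What $K$ should we choose? $g * K = \frac{1}{n} \sum_{i=1}^{n} K(y-x_i) \to h * K$ as $n \to \infty$. For $g * K$, Dirac $\delta$ is not good; for $h * K$, Dirac $\delta$ would be good. Something in-between: bandwidth selection for kernel density estimates, with $K_s(x) = K(x/s)/s$, where $K_s(x) \to$ Dirac $\delta$ as $s \to 0$. Theorem (see DL’01): as $s \to 0$ with $sn \to \infty$, $|g * K_s - h|_1 \to 0$.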
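The effect of the bandwidth $s$ can be seen on the seven data points from slide 3. The sketch assumes a uniform (box) kernel on $[-1/2, 1/2]$; the function names are hypothetical.

```python
def uniform_kernel(u):
    """Box kernel: density of the uniform distribution on [-1/2, 1/2]."""
    return 1.0 if -0.5 <= u <= 0.5 else 0.0

def kde(y, samples, s, kernel=uniform_kernel):
    """Kernel density estimate (1 / (n s)) * sum_i K((y - x_i) / s)."""
    return sum(kernel((y - x) / s) for x in samples) / (len(samples) * s)

data = [0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625]

# Small bandwidth: the estimate follows individual data points closely.
print(kde(0.0, data, 1.0), kde(-1.2, data, 1.0))

# Large bandwidth: every point lies within s/2 of 0, so the estimate flattens.
print(kde(0.0, data, 4.0))
```

With $s = 1$ the estimate at $0$ counts one point and at $-1.2$ counts three, while with $s = 4$ all seven points contribute equally, which is the tension between too little and too much smoothing that bandwidth selection has to resolve.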

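  30. Data splitting methods for kernel density estimates. How to pick the smoothing factor $s$ in $\frac{1}{ns} \sum_{i=1}^{n} K\left(\frac{y-x_i}{s}\right)$? Split $x_1, x_2, \dots, x_n$: use $x_1, \dots, x_{n-m}$ to build $f_s = \frac{1}{(n-m)s} \sum_{i=1}^{n-m} K\left(\frac{y-x_i}{s}\right)$, and use $x_{n-m+1}, \dots, x_n$ to choose $s$ using density estimation.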

  31. Kernels we will use in $\frac{1}{ns} \sum_{i} K\left(\frac{y-x_i}{s}\right)$: piecewise uniform, piecewise linear.

  32. Bandwidth selection for uniform kernels. $N$ distributions, each piecewise uniform with $n$ pieces; $m$ datapoints (e.g. $N \approx n^{1/2}$, $m \approx n^{5/4}$). Goal: run the density estimation algorithm efficiently. TIME: MD computes $N^2$ integrals $\int (f_k-h)\, T_{kj}$ at $n + m \log n$ each. EMLW computes $N$ pairs $\int (f_i+f_j)\, T_{ij}$, $\int g\, T_{ij}$ at $n + m \log n$ each, plus $N^2$ distances $|f_i-f_j|_1$ at $n$ each.

  33. Bandwidth selection for uniform kernels (same setting and TIME table as slide 32). Can we speed this up?

  34. Bandwidth selection for uniform kernels (same setting and TIME table as slide 32). Can we speed this up? Absolute error: bad. Relative error: good.

  35. Approximating $L_1$-distances between distributions. $N$ piecewise uniform densities (each with $n$ pieces). TRIVIAL (exact): $N^2 n$. WE WILL DO: $(N^2 + Nn)(\log N)\,\varepsilon^{-2}$.

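  36. Dimension reduction for $L_2$. $|S| = n$. Johnson-Lindenstrauss Lemma (’82): $\varphi: L_2 \to L_2^t$ with $t = O(\varepsilon^{-2} \ln n)$ such that $(\forall x,y \in S)$ $d(x,y) \le d(\varphi(x),\varphi(y)) \le (1+\varepsilon)\, d(x,y)$. Random projection using $N(0, t^{-1/2})$.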

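  37. Dimension reduction for $L_1$. $|S| = n$. Cauchy Random Projection (Indyk’00): $\varphi: L_1 \to L_1^t$ with $t = O(\varepsilon^{-2} \ln n)$ such that $(\forall x,y \in S)$ $d(x,y) \le \mathrm{est}(\varphi(x),\varphi(y)) \le (1+\varepsilon)\, d(x,y)$. The Gaussian $N(0, t^{-1/2})$ is replaced by the Cauchy $C(0, 1/t)$. (Charikar, Brinkman’03: cannot replace est by $d$.)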

  38. Cauchy distribution $C(0,1)$. Density function: $\frac{1}{\pi(1+x^2)}$. FACTS: $X \sim C(0,1) \Rightarrow aX \sim C(0,|a|)$; $X \sim C(0,a)$, $Y \sim C(0,b)$ $\Rightarrow X+Y \sim C(0,a+b)$.
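The two facts on slide 38 imply that, for independent $X_k \sim C(0,1)$, the sum $\sum_k (u_k - v_k) X_k$ is distributed as $C(0, |u-v|_1)$, and the median of the absolute value of a $C(0,\gamma)$ variable is $\gamma$. The sketch below uses this to estimate an $L_1$-distance from random projections. The median estimator, the number of projections `t`, the seed, and the helper names are assumptions made for illustration, not necessarily the estimator used in the talk.

```python
import math
import random
import statistics

def cauchy(rng):
    """One draw from the standard Cauchy distribution C(0, 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def sketch(vec, rows):
    """Project vec onto each row of independent Cauchy coefficients."""
    return [sum(c * x for c, x in zip(row, vec)) for row in rows]

rng = random.Random(0)
t = 2001                                   # number of projections (assumed)
u = [0.7, 0.1, 0.1, 0.1]
v = [0.1, 0.7, 0.1, 0.1]
rows = [[cauchy(rng) for _ in u] for _ in range(t)]

su, sv = sketch(u, rows), sketch(v, rows)  # the same rows are used for both
# Each difference is C(0, |u - v|_1); the median of its absolute value
# estimates the scale, i.e. the L1 distance.
estimate = statistics.median([abs(a - b) for a, b in zip(su, sv)])
exact = sum(abs(a - b) for a, b in zip(u, v))
print(estimate, exact)
```

The estimate is read off the sketches alone, which is why the estimator is a function `est` of the projected vectors rather than a distance between them.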

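  39. Cauchy random projection for $L_1$ (Indyk’00). (Figure: piecewise uniform densities with heights $A$, $B$, $D$ over intervals carrying independent variables $X_1, \dots, X_9$; $X_1 \sim C(0,z)$, with $z$ marked in the figure.) Projection: $A(X_2+X_3) + B(X_5+X_6+X_7+X_8)$.

  40. Cauchy random projection for $L_1$ (Indyk’00). (Same figure as slide 39.) Projections: $A(X_2+X_3) + B(X_5+X_6+X_7+X_8)$ and $D(X_1+X_2+\dots+X_8+X_9)$. Their difference is Cauchy with location 0 and scale equal to the $L_1$-distance between the two densities.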

  41. All pairs $L_1$-distances: piecewise linear densities.

  42. All pairs $L_1$-distances: piecewise linear densities. $X_1, X_2 \sim C(0,1/2)$; $R = (3/4)X_1 + (1/4)X_2$; $B = (3/4)X_2 + (1/4)X_1$; $R - B \sim C(0,1/2)$.

  43. All pairs $L_1$-distances: piecewise linear densities. Problem: too many intersections! Solution: cut into even smaller pieces! Stochastic measures are useful.

  44. Brownian motion: $\frac{1}{(2\pi)^{1/2}} \exp(-x^2/2)$. Cauchy motion: $\frac{1}{\pi(1+x^2)}$.

  45. Brownian motion: $\frac{1}{(2\pi)^{1/2}} \exp(-x^2/2)$. Computing integrals is easy: for $f: R \to R^d$, $\int f\, dL = Y \sim N(0,S)$.

  46. Cauchy motion: $\frac{1}{\pi(1+x^2)}$. For $f: R \to R^d$: computing integrals is easy for $d = 1$, where $\int f\, dL = Y \sim C(0,s)$; computing integrals is hard (*) for $d > 1$. (*) Hard here means obtaining an explicit expression for the density.

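  47. What were we doing? (Figure: $X_1, X_2, \dots, X_9$.) $\int (f_1,f_2,f_3)\, dL = ((w_1)_1, (w_2)_1, (w_3)_1)$.

  48. What were we doing? (Figure: $X_1, X_2, \dots, X_9$.) $\int (f_1,f_2,f_3)\, dL = ((w_1)_1, (w_2)_1, (w_3)_1)$. Can we efficiently compute integrals $\int \varphi\, dL$ for piecewise linear $\varphi$?

  49. Can we efficiently compute integrals $\int \varphi\, dL$ for piecewise linear $\varphi$? $\varphi: R \to R^2$, $\varphi(z) = (1,z)$, $(X,Y) = \int \varphi\, dL$.

  50. $\varphi: R \to R^2$, $\varphi(z) = (1,z)$, $(X,Y) = \int \varphi\, dL$. $(2(X-Y), 2Y)$ has an explicit density at $(u+v, u-v)$ (formula not preserved in this transcript).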
