Big Data Lecture 5: Estimating the second moment, dimension reduction, applications
The second moment Stream: A,B,A,C,D,D,A,A,E,B,E,E,F,… The second moment: $F_2 = \sum_i f_i^2$, where $f_i$ is the number of occurrences of item $i$ in the stream.
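As a baseline, the second moment can be computed exactly by storing one counter per distinct item. This is a minimal sketch: the function name `second_moment` is illustrative and not from the lecture, and the input is the finite prefix of the example stream shown on the slide.

```python
from collections import Counter

def second_moment(stream):
    """Exact F_2: the sum of squared item frequencies (one counter per distinct item)."""
    freq = Counter(stream)
    return sum(count * count for count in freq.values())

# Finite prefix of the slide's example stream (the trailing "…" is dropped).
stream = list("ABACDDAAEBEEF")
print(second_moment(stream))  # frequencies A=4, B=2, C=1, D=2, E=3, F=1 -> 35
```

The exact computation needs space proportional to the number of distinct items, which is the cost a sketch-based estimator tries to avoid.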
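Alon, Matias, Szegedy 96 (Gödel Prize 2005) Maintain: $Z = \sum_i f_i\, h(i)$, where $h : [d] \to \{-1,+1\}$ is a random hash function and $f_i$ is the frequency of item $i$; each arriving item $x$ updates $Z \leftarrow Z + h(x)$.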
2-wise independent hash family Suppose $h : [d] \to [T]$. Fix 2 values $t_1$ and $t_2$ in the range of $h$. Fix 2 values $x_1 \ne x_2$ in the domain of $h$. What is the probability that $h(x_1) = t_1$ and $h(x_2) = t_2$?
2-wise independent hash family $H$, a family of hash functions $h$, is 2-wise independent iff $\forall x_1 \ne x_2,\ \forall t_1, t_2$: $\Pr_{h \in H}\big(h(x_1) = t_1 \text{ and } h(x_2) = t_2\big) = 1/|T|^2$, where $|T|$ is the size of the range.
2-wise independent hash family $H = \{(ax+b) \bmod T \mid 0 \le a,b < T\}$ is 2-wise independent if $T$ is a prime $> d$. $H = \{2\,(((ax+b) \bmod T) \bmod 2) - 1 \mid 0 \le a,b < T\}$ is approximately 2-wise independent from $[d]$ to $\{-1,1\}$. We can get exact 2-wise independence with more complicated constructions.
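The 2-wise independence of the family $(ax+b) \bmod T$ can be checked by brute force for small parameters. The following is a sketch under stated assumptions: the function name `is_two_wise_independent` is hypothetical, the domain is taken to be $\{0,\dots,d-1\}$, and the check simply counts, for every pair of distinct inputs and every pair of targets, how many $(a,b)$ hit both targets.

```python
from itertools import product

def is_two_wise_independent(T, d):
    """Brute-force check of the family h_{a,b}(x) = (a*x + b) mod T on domain 0..d-1."""
    family = list(product(range(T), repeat=2))  # all (a, b) with 0 <= a, b < T
    for x1, x2 in product(range(d), repeat=2):
        if x1 == x2:
            continue
        for t1, t2 in product(range(T), repeat=2):
            hits = sum(
                1 for a, b in family
                if (a * x1 + b) % T == t1 and (a * x2 + b) % T == t2
            )
            # The required probability is 1/T^2, i.e. exactly one (a, b) out of T^2.
            if hits * T * T != len(family):
                return False
    return True

print(is_two_wise_independent(7, 5))  # prime T > d
print(is_two_wise_independent(6, 4))  # composite T: the property fails
```

For a prime $T$ the map $(a,b) \mapsto (h(x_1), h(x_2))$ is a bijection whenever $x_1 \ne x_2$, which is why each target pair is hit exactly once.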
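Draw $h$ from a 2-wise independent family and let $Z = \sum_i f_i\, h(i)$ be the maintained counter. Then $E[Z^2] = \sum_i f_i^2\, E[h(i)^2] + \sum_{i \ne j} f_i f_j\, E[h(i)h(j)] = F_2$, since $h(i)^2 = 1$ and $E[h(i)h(j)] = 0$ for $i \ne j$. So $Z^2$ is an unbiased estimator for $F_2$!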
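The unbiasedness claim can be verified exactly on a small stream by averaging $Z^2$ over every possible sign assignment. This is a sketch with one loudly stated assumption: instead of a compact 2-wise independent family it enumerates all sign functions on the distinct items (a fully independent family, which is in particular 2-wise independent). The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def ams_counter(stream, sign):
    """Streaming update: add the sign of each arriving item to a single counter Z."""
    z = 0
    for item in stream:
        z += sign[item]
    return z

def expected_z_squared(stream):
    """Exact E[Z^2] over ALL sign functions (assumption: full independence)."""
    items = sorted(set(stream))
    total = 0
    for signs in product((-1, 1), repeat=len(items)):
        z = ams_counter(stream, dict(zip(items, signs)))
        total += z * z
    return Fraction(total, 2 ** len(items))

stream = list("ABACDDAAEBEEF")
print(expected_z_squared(stream))  # equals F_2 = 35 for this stream
```

The cross terms cancel in the average because each product of two distinct signs is +1 and -1 equally often.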
What is the variance of $Z^2$? Here we will assume that $h$ is drawn from a 4-wise independent family $H$. Then $\mathrm{Var}(Z^2) = E[Z^4] - F_2^2 \le 2F_2^2$.
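Chebyshev’s Inequality $\Pr\big(|Z^2 - F_2| \ge \varepsilon F_2\big) \le \mathrm{Var}(Z^2) / (\varepsilon^2 F_2^2) \le 2/\varepsilon^2$. If $\varepsilon$ is small this is meaningless… We need to reduce the variance. How?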
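Averaging Draw $k$ independent hash functions $h_1, h_2, \dots, h_k$ with counters $Z_1, \dots, Z_k$. Use $Y = \frac{1}{k}\sum_{i=1}^{k} Z_i^2$, so that $E[Y] = F_2$ and $\mathrm{Var}(Y) = \mathrm{Var}(Z^2)/k$.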
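Boosting the confidence – Chernoff bounds Pick $k = 8/\varepsilon^2$; then by Chebyshev $\Pr\big(|Y - F_2| \ge \varepsilon F_2\big) \le 2/(k\varepsilon^2) = 1/4$, where $Y$ is the average of $k$ independent copies of $Z^2$.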
Boosting the confidence – Chernoff bounds Now repeat the experiment $s = O(\log(1/\delta))$ times. We get $A_1, \dots, A_s$ (assume they are sorted). Return their median. Why is this good?
Boosting the confidence – Chernoff bounds Each of $A_1, \dots, A_s$ is bad (outside $(1 \pm \varepsilon)F_2$) with probability ≤ ¼. For the median to be bad, at least ½ of $A_1, \dots, A_s$ must be bad. (Remove the pair consisting of the largest and the smallest value and repeat: if both components of some pair are good, then the median lies between them and is good.)
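Boosting the confidence – Chernoff bounds What is the probability that at least ½ are bad? Chernoff: let $X = X_1 + \dots + X_s$, where the $X_i$ are independent Bernoulli variables with $p = ¼$; then $\Pr(X \ge s/2) \le e^{-cs}$ for some constant $c > 0$, which is at most $\delta$ for $s = O(\log(1/\delta))$ with a large enough constant.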
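The averaging and median steps combine into a median-of-means estimator, sketched below. Several things here are assumptions rather than the lecture's construction: the function name `estimate_f2`, the choice $k = 8/\varepsilon^2$ per group, and above all the use of an explicit, fully random sign table per counter in place of a compact 4-wise independent hash (so this version does not save space; it only illustrates the estimator).

```python
import random
import statistics

def estimate_f2(stream, eps, num_groups, seed=0):
    """Median of `num_groups` means, each mean averaging k squared AMS counters."""
    rng = random.Random(seed)
    k = max(1, round(8 / eps ** 2))   # counters per group (assumed constant 8)
    items = sorted(set(stream))       # simplification: explicit sign tables
    group_means = []
    for _ in range(num_groups):
        squares = []
        for _ in range(k):
            sign = {x: rng.choice((-1, 1)) for x in items}
            z = sum(sign[x] for x in stream)   # the counter Z for this hash
            squares.append(z * z)
        group_means.append(sum(squares) / k)   # averaging reduces the variance
    return statistics.median(group_means)      # the median boosts the confidence

stream = list("ABACDDAAEBEEF") * 20   # true F_2 = 35 * 20**2 = 14000
print(estimate_f2(stream, eps=0.5, num_groups=9))
```

The number of groups plays the role of $s = O(\log(1/\delta))$; the constant hidden in that bound is not specified in the slides, so `num_groups` is left as a parameter.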
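Recap $(Z_1, \dots, Z_k)^T = H f$, where $f$ is the frequency vector and row $i$ of the $k \times d$ matrix $H$ is $(h_i(1), \dots, h_i(d)) \in \{-1,+1\}^d$.
This is a random projection. It preserves distances in the sense: for $k = 8/\varepsilon^2$, with probability at least ¾, $(1-\varepsilon)\|f\|_2^2 \le \frac{1}{k}\|Hf\|_2^2 \le (1+\varepsilon)\|f\|_2^2$.
Make it look more familiar: scale $H$ by $1/\sqrt{k}$, so that the sketch is $\frac{1}{\sqrt{k}} H f$ and its squared norm estimates $\|f\|_2^2 = F_2$.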
Dimension reduction $y = Px$, where $P$ is a random orthonormal $k \times d$ matrix. We project into a random k-dim. subspace. JL: for $\varepsilon \in [0,1]$, lengths are preserved up to a factor $1 \pm \varepsilon$ (after rescaling by $\sqrt{d/k}$) with high probability.
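Johnson-Lindenstrauss JL: Project the vectors $x_1, \dots, x_n$ into a random k-dimensional subspace for $k = O(\log(n)/\varepsilon^2)$, and let $P$ denote the projection; then with probability $1 - 1/n^c$, for all $i, j$: $(1-\varepsilon)\sqrt{k/d}\,\|x_i - x_j\|_2 \le \|Px_i - Px_j\|_2 \le (1+\varepsilon)\sqrt{k/d}\,\|x_i - x_j\|_2$.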
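Distance preservation under a random projection can be observed numerically. The sketch below swaps in a different projection from the one on the slide: instead of a random orthonormal $k \times d$ matrix it uses a random ±1 matrix scaled by $1/\sqrt{k}$, which is easier to generate with the standard library and needs no $\sqrt{k/d}$ rescaling. Function names and the test sizes are illustrative.

```python
import math
import random

def random_sign_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a scaled random ±1 matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(k)]
    return [
        [scale * sum(r * v for r, v in zip(row, p)) for row in matrix]
        for p in points
    ]

def max_distortion(points, projected):
    """Largest relative error |projected distance / original distance - 1| over all pairs."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = math.dist(projected[i], projected[j]) / math.dist(points[i], points[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst

rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(6)]
projected = random_sign_projection(points, k=600)
print(max_distortion(points, projected))  # small; shrinks roughly like 1/sqrt(k)
```

Here $k$ is larger than $d$ only to make the concentration visible on a tiny example; the point of the lemma is that $k$ depends on $n$ and $\varepsilon$ but not on $d$.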
The proof $y = Px$ ($P$ a random orthonormal $k \times d$ matrix) Obs1: It's enough to prove the claim for vectors such that $\|x\|_2 = 1$, since the projection is linear.
The proof Obs2: Instead of projecting a fixed unit vector into a random k-dim subspace, look at the first k coordinates of a random unit vector; the two have the same distribution.
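Obs2 suggests a simple simulation: draw random unit vectors and look at the squared length of their first $k$ coordinates, whose expectation is $k/d$ by symmetry. This is a sketch; the function names are illustrative, and the random unit vector is generated by normalizing a vector of independent Gaussians, which is a standard way to sample uniformly from the sphere.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform random point on the unit sphere in R^d (normalized Gaussian vector)."""
    g = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def mean_squared_prefix_length(d, k, trials, seed=0):
    """Average squared length of the first k coordinates over many random unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = random_unit_vector(d, rng)
        total += sum(v * v for v in u[:k])
    return total / trials

print(mean_squared_prefix_length(d=20, k=5, trials=2000))  # close to k/d = 0.25
```

The JL proof needs more than the expectation: it shows that this squared length is concentrated around $k/d$, which is what the factor $\sqrt{k/d}$ in the statement reflects.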
The case k=1 Take a random unit vector $x = (x_1, \dots, x_d)$; its projection is the first coordinate $x_1$, and by symmetry $E[x_1^2] = 1/d$. JL: $\varepsilon \in [0,1]$ [figure: unit sphere]
An application: approximate period Given a sequence of length m: 10,3,20,1,10,3,18,1,11,5,20,2,12,1,19,1,… Find r such that the total distance between the length-r blocks of the sequence is minimized.
An exact algorithm Evaluating the objective for one value of r takes linear time, so trying every r takes $O(m^2)$ in total. We can sketch/project all windows of length r and compare the sketches… but that costs $O(m^2 k)$ just for sketching…
Obs1: We can sketch faster. Each sketch coordinate is a running inner product of the sequence A with a fixed unit vector h. This is similar to a convolution of two vectors. [figure: h sliding along A, producing B]
Convolution [animation: one vector slides along the other; values shown: 4 5 0 2 1 3 3 1 2 0] We can compute the convolution in $O(m \log r)$ time using the FFT.
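The link between running inner products and convolution can be made concrete: the inner product of `h` with every length-r window of `a` is the convolution of `a` with the reversed `h`. The sketch below uses a small recursive radix-2 FFT written with the standard library only; all function names are illustrative. It runs one full-length FFT, which is $O(m \log m)$; reaching $O(m \log r)$ would require splitting `a` into blocks of length about r, which is not shown here.

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of 2."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def convolve(a, b):
    """Linear convolution of two real sequences via FFT."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:
        size *= 2
    fa = fft([complex(x) for x in a] + [0j] * (size - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (size - len(b)))
    prod = [x * y for x, y in zip(fa, fb)]
    inv = fft(prod, invert=True)
    return [v.real / size for v in inv[:out_len]]   # the inverse needs the 1/size factor

def sliding_inner_products(a, h):
    """Inner product of h with every length-len(h) window of a."""
    r = len(h)
    conv = convolve(a, h[::-1])      # correlation = convolution with reversed h
    return conv[r - 1:len(a)]        # entries where h fully overlaps a

a = [10, 3, 20, 1, 10, 3, 18, 1]
h = [1, -1, 1, -1]
print([round(v) for v in sliding_inner_products(a, h)])
```

Repeating this for k random vectors `h` gives the k sketch coordinates of every window position.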
Obs1: We can sketch faster. We can compute the first coordinate of all sketches in $O(m \log r)$ time. We can sketch all positions in $O(m \log(r) \cdot k)$ time. But we still have many possible values for r…
Obs2: Sketch only in powers of 2. We compute all sketches in $O(\log(m) \cdot m \log(r) \cdot k)$ time.
When r is not a power of 2? Cover the window z of length r by two windows x and y whose length is a power of 2, with sketches S(x) and S(y). Use S(x) + S(y) as S(z).
The algorithm Compute sketches in powers of 2 in $O(\log(m) \cdot m \log(r) \cdot k)$ time. For a fixed r we can approximate the objective in $O((m/r) \cdot k)$ time. Summing over r we get $O(m \log(m) \cdot k)$. Total running time is $O(m \log^3 m)$, taking $k = O(\log m)$.
Bibliography
• Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999)
• W. B. Johnson, J. Lindenstrauss: Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26: 189-206 (1984)
• Jiří Matoušek: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2): 142-156 (2008)
• Piotr Indyk, Nick Koudas, S. Muthukrishnan: Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB 2000: 363-372