Improved Sketching of Hamming Distance with Error Correcting. Ely Porat (Bar-Ilan University, Google Inc.), Ohad Lipsky (Bar-Ilan University, Check Point Inc.). December 2003
Problem Definition (1): Alice holds T_A and Bob holds T_B, each of length n. The goal is hamm(T_A, T_B). Given: k, a bound on the number of mismatches.
Problem Definition (2): T_A and T_B (each of length n) are mapped by the sketch S to S_A and S_B. • Calculate hamm(T_A, T_B) given only S_A, S_B. • Find the mistakes. Given: k, a bound on the number of mismatches.
Motivations • Databases • Internet • Error correcting [Figure: Router A, Router B, Router C, Router D]
Outline: • Simple Solution • Error Correcting • Improved Solution • Improve more • Recursion • File sharing
Simplest Solution - O(k^2 log(1/δ)) • Binary alphabet. • Allocate k^2 cells. • Take the input array and hash each bit to one of the cells. • In each cell, remember the XOR of all the values hashed to it. [Figure: cells 0 1 1 0]
Simplest Solution - O(k^2 log(1/δ)) [Figure: cells 0 1 0 0 1 1 0 0]
Simplest Solution - O(k^2 log(1/δ)) • Due to the birthday principle, the probability that 2 errors fall into the same cell is < 1/2. • The log(1/δ) factor comes from repeating the scheme to get a failure probability of δ. [Figure: cells 0 1 1 0]
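The simplest solution can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the use of a shared seed as the common hash function, and the default of 20 runs are all assumptions made for the example. A cell differs between the two sketches only if an odd number of mismatches landed in it, so the count of differing cells never exceeds the true Hamming distance, and taking the maximum over independent runs recovers it with high probability.

```python
import random

def xor_sketch(bits, num_cells, seed):
    """Hash every position to a cell and keep the XOR of the bits that land there."""
    rng = random.Random(seed)      # assumption: a shared seed stands in for a shared hash function
    cells = [0] * num_cells
    for bit in bits:
        cells[rng.randrange(num_cells)] ^= bit
    return cells

def estimate_hamming(bits_a, bits_b, k, runs=20):
    """Max over independent runs of the number of cells where the two sketches differ."""
    best = 0
    for seed in range(runs):
        sketch_a = xor_sketch(bits_a, k * k, seed)
        sketch_b = xor_sketch(bits_b, k * k, seed)
        differing = sum(x != y for x, y in zip(sketch_a, sketch_b))
        best = max(best, differing)
    return best
```

Both parties must use the same seeds and the same input length so that position i is hashed to the same cell on each side.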
Alphabet • Denote by S the size of the alphabet. • We can encode each letter by its unary representation. • The only effect is that each mistake will be counted twice. Encoding: 0 -> 1000000…0, 1 -> 0100000…0, …, S-1 -> 0000000…1. Example: 0 -> 1000000…0, 5 -> 0000010…0
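A short Python check of the "counted twice" claim, assuming letters are integers in the range 0 to S-1 (the helper names are hypothetical):

```python
def unary_encode(text, alphabet_size):
    """Replace each letter by a block of alphabet_size bits with a single 1."""
    bits = []
    for symbol in text:
        block = [0] * alphabet_size
        block[symbol] = 1
        bits.extend(block)
    return bits

def hamming(x, y):
    """Number of positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))
```

A mismatch between two letters flips exactly two bits of the encoding (the 1 of the old letter and the 1 of the new letter), so the binary Hamming distance is exactly twice the original one.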
Error correcting - O(k^2 log NS) • Here we allocate two kinds of k^2 cells: k^2 cells of log S bits and k^2 cells of log NS bits. C1[h(A[i])] += A[i] [Figure: C1 cells 5 8 3 2] C2[h(A[i])] += i*A[i] [Figure: C2 cells 15 6 7 8]
Error correcting - O(k^2 log NS) • As before, with probability > 1/2 no 2 errors fall into the same cell. C1[h(A[i])] += A[i] [Figure: C1 cells 5 8 3 2] C2[h(A[i])] += i*A[i] [Figure: C2 cells 15 6 7 8]
Error correcting - O(k^2 log NS) • We get from the red cells (C1[h(A[i])] += A[i]): the two C1 arrays are 5 8 3 2 and 5 6 3 2, and the mismatched letters are 5 and 3, so 8 - 6 = 5 - 3.
Error correcting - O(k^2 log NS) • We get from the blue cells (C2[h(A[i])] += i*A[i]): positions 0 1 2, mismatched letters 5 and 3; the two C2 arrays are 15 11 7 5 and 15 9 7 5. 11 - 9 = 2*(5 - 3) => i = 2
Error correcting - O(k^2 log NS) • The probability of success is about 1/2. • To lower the failure probability we run it 3 times. • We get a list of possible mistakes each time. • Output all the mistakes that appear in at least 2 of the 3 runs.
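The two-array scheme and the 2-of-3 vote can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sketch hashes the position i (the slides write h(A[i])), uses a shared seed as the common hash function, uses 0-based positions, and all function names are hypothetical. In a cell that received exactly one mismatch, the C1 difference is the difference of the two letters and the C2 difference is i times that, so their quotient is the position.

```python
import random
from collections import Counter

def build_cells(text, num_cells, seed):
    """C1 accumulates letter values, C2 accumulates position * value."""
    rng = random.Random(seed)      # assumption: shared seed = shared hash of positions
    c1 = [0] * num_cells
    c2 = [0] * num_cells
    for i, value in enumerate(text):
        cell = rng.randrange(num_cells)
        c1[cell] += value
        c2[cell] += i * value
    return c1, c2

def candidates(sketch_a, sketch_b, n):
    """Decode every differing cell as if it held exactly one mismatch."""
    (a1, a2), (b1, b2) = sketch_a, sketch_b
    found = set()
    for x1, x2, y1, y2 in zip(a1, a2, b1, b2):
        d1, d2 = x1 - y1, x2 - y2
        if d1 != 0 and d2 % d1 == 0 and 0 <= d2 // d1 < n:
            found.add((d2 // d1, d1))      # (position, letter difference)
    return found

def find_mismatches(text_a, text_b, k, runs=3):
    """Keep the candidates reported by at least 2 of the runs."""
    votes = Counter()
    for seed in range(runs):
        sketch_a = build_cells(text_a, k * k, seed)
        sketch_b = build_cells(text_b, k * k, seed)
        votes.update(candidates(sketch_a, sketch_b, len(text_a)))
    return {m for m, count in votes.items() if count >= 2}
```

For example, with texts `[7, 1, 5, 4]` and `[7, 1, 3, 4]` the only differing cell has a C1 difference of 5 - 3 = 2 and a C2 difference of 2*(5 - 3) = 4, which decodes to position 2.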
O(k log^2 k) - Solution • The idea is two-stage hashing: first hash into k/log k cells; w.h.p. each cell receives O(log k) mismatches. (Bar-Yossef, Jayram, Kumar, Sivakumar 03)
O(k log^2 k) - Solution • Inside each first-stage cell, hash the O(log k) mismatches into O(log^2 k) cells and keep the accumulated XOR. • The probability to fail is less than 1/2. • Run it 2 log k times and take the max => failure probability less than 1/k^2. • Space = O(log^3 k). (Bar-Yossef, Jayram, Kumar, Sivakumar 03)
O(k log^2 k) - Solution • k/log k first-stage cells, each of size O(log^3 k): O(k log^2 k) in total. • P(failure) <= k/log k * 1/k^2 < 1/k. (Bar-Yossef, Jayram, Kumar, Sivakumar 03)
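The space and failure bounds of the two-stage scheme follow from a short calculation, written out here as a worked derivation. It assumes base-2 logarithms, independent runs, and k > 2 so that log k > 1.

```latex
\Pr[\text{one first-stage cell fails}] < \left(\tfrac{1}{2}\right)^{2\log_2 k} = \frac{1}{k^{2}},
\qquad
\Pr[\text{some cell fails}] \le \frac{k}{\log k}\cdot\frac{1}{k^{2}} = \frac{1}{k\log k} < \frac{1}{k},
\qquad
\text{space} = \frac{k}{\log k}\cdot O(\log^{3} k) = O(k\log^{2} k).
```

The first bound holds because taking the max over 2 log k runs fails only if every run fails; the second is a union bound over the k/log k first-stage cells.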
O(k 2^(log* k) log k) - Idea (recursion) • k/log k cells • Pr(F) < 1/log^c k • log k/loglog k • log k/loglog k runs, take max
Error Correcting O(k log NS) [Figure: Alice holds T_A, Bob holds T_B, each of length n] • r_0, r_1, r_2, … • p on the order of N^3 S • Constant probability
Error Correcting O(k log NS) [Figure: Alice holds T_A, Bob holds T_B, each of length n] • If we are wrong, then w.h.p. j > n
Error Correcting O(k log NS) [Figure: Alice holds T_A, Bob holds T_B, each of length n] • r_j, a_j - b_j
Error Correcting O(k log NS) [Figure: Alice holds T_A, Bob holds T_B, each of length n] • O(k ln k)
Recursion [Figure: Alice holds T_A, Bob holds T_B, each of length n; label ck; T_A and T_B, each of length n]
Recursion [Figure: Alice holds T_A, Bob holds T_B, each of length n; label ck] • O(k log NS)
Complexity [Figure: T_A and T_B, each of length n, are mapped by S to S_A and S_B] • Size: O(k log NS) • Computing sketch: O(n log k) • Comparing sketches: O(k log k)
O(k log k) - Solution • We can just encode in unary, hash the input to k^3 cells, and then run the O(k log NS) = O(k log k) algorithm.
Reed-Solomon Codes • We managed to develop a deterministic algorithm based on Reed-Solomon codes, but the encoding and the decoding are slower. References: Amir, Farach 95; Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01; Bar-Yossef, Jayram, Kumar, Sivakumar 03; Efremenko, Porat, Rothschild 06; Efremenko, Porat 07
File Sharing - Napster [Figure: source, file of size n] • The source needs to stay until someone else has the whole file (and is willing to stay). • There is a bottleneck at the end.
File Sharing - emule/kazaa/torrent [Figure: source, file of size n] • The source has to send n ln n blocks before disconnecting. • Sometimes there are some bottlenecks.
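One way to read the n ln n figure is as a coupon-collector bound. That reading is an assumption, not something the slides state: if each transmission carries a uniformly random block out of n, the expected number of transmissions until every block has been sent at least once is n times the n-th harmonic number, which is roughly n ln n. A minimal Python sketch (the function name is hypothetical):

```python
import math

def expected_blocks(n):
    """Expected number of uniformly random blocks sent until all n distinct blocks appear."""
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    return n * harmonic

def n_ln_n(n):
    """The n ln n approximation quoted on the slide."""
    return n * math.log(n)
```

For large n the ratio `expected_blocks(n) / n_ln_n(n)` approaches 1 from above, since the harmonic number exceeds ln n by about 0.577.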
Improved File Sharing - Ver 1 [Figure: source holding the file a_0 a_1 a_2 … a_{n-1} of size n; n^6 points]
Improved File Sharing - Ver 1 [Figure: n^6 points] • Each client that got n points can recreate the file. • There is no more n ln n. • Almost no bottlenecks.
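The slides do not spell out how the points are built. One natural reading, assumed here for illustration, is that the blocks a_0 … a_{n-1} are the coefficients of a polynomial over a prime field and each point is one evaluation of it, so any n points with distinct x values determine the file by Lagrange interpolation. The field size `P` and both function names are hypothetical choices for the sketch.

```python
P = 2**31 - 1  # assumption: a prime field large enough to hold one block per element

def evaluate(coeffs, x):
    """Evaluate the file polynomial a_0 + a_1 x + ... at x (Horner's rule, mod P)."""
    y = 0
    for c in reversed(coeffs):
        y = (y * x + c) % P
    return y

def interpolate(points):
    """Recover the n coefficients from n points (x, y) with distinct x values."""
    n = len(points)
    coeffs = [0] * n
    for j, (xj, yj) in enumerate(points):
        basis = [1]   # coefficients of prod_{m != j} (x - x_m), lowest degree first
        denom = 1     # prod_{m != j} (x_j - x_m)
        for m, (xm, _) in enumerate(points):
            if m == j:
                continue
            shifted = [0] * (len(basis) + 1)
            for d, b in enumerate(basis):
                shifted[d] = (shifted[d] - xm * b) % P
                shifted[d + 1] = (shifted[d + 1] + b) % P
            basis = shifted
            denom = denom * (xj - xm) % P
        scale = yj * pow(denom, P - 2, P) % P   # modular inverse via Fermat's little theorem
        for d, b in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * b) % P
    return coeffs
```

Under this reading, a client does not care which points it receives, only how many, which is why the coupon-collector overhead disappears. This quadratic interpolation is only meant to show correctness, not an efficient decoder.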
Improved File Sharing - Ver 2 [Figure: source holding the file a_0 a_1 a_2 … a_{n-1} of size n] • Send linear equations on the file.
Improved File Sharing - Ver 2 [Figure: source holding the file a_0 a_1 a_2 … a_{n-1} of size n] Problems: 1. Heavy to encode: for each packet we need to go over the whole file. 2. Very heavy to decode: O(n^2) block operations + O(n^3) field operations. Facts: 1. If you get n^(1/2-ε) random combinations of two blocks, you won't have dependencies w.h.p. 2. If you have d pair combinations, you can easily reduce your system to n-d variables. Solution: use sparse functionals.
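The second fact can be illustrated with a small Python sketch. It assumes, purely for the example, that a "pair combination" is the XOR of two blocks and that blocks are integers; the class and method names are hypothetical. Each independent pair equation a_i XOR a_j = value merges two variables into one, which a union-find structure with XOR offsets tracks in near-constant time per equation.

```python
class PairSystem:
    """Tracks equations a[i] ^ a[j] == value over n unknown blocks."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.offset = [0] * n      # invariant: a[i] == a[parent[i]] ^ offset[i]

    def find(self, i):
        """Return (root, off) such that a[i] == a[root] ^ off."""
        if self.parent[i] == i:
            return i, 0
        root, off = self.find(self.parent[i])
        self.parent[i] = root      # path compression
        self.offset[i] ^= off
        return root, self.offset[i]

    def add_pair(self, i, j, value):
        """Record a[i] ^ a[j] == value; return True if it eliminated a variable."""
        root_i, off_i = self.find(i)
        root_j, off_j = self.find(j)
        if root_i == root_j:
            return False           # dependent on earlier equations
        self.parent[root_i] = root_j
        self.offset[root_i] = off_i ^ off_j ^ value
        return True

    def free_variables(self):
        """Number of unknowns left after substitution."""
        return sum(1 for i, p in enumerate(self.parent) if i == p)
```

After d independent pair equations, `free_variables()` returns n - d, and every eliminated block can be rebuilt from its root block with a single XOR.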
Improved File Sharing - Ver 2 [Figure: source holding the file a_0 a_1 a_2 … a_{n-1} of size n] Features: • Backward compatibility. • Even if you don't have the whole file, you can mix functionals.