Fast Fingerprint Calculations Thomas Schwarz, S.J.
Fingerprints
Definition: A fingerprint (a.k.a. signature) of an object Ob is a small string f(Ob) with the following properties:
1. f is a function of Ob. In particular, if two objects are equal, then so are their fingerprints.
2. Prob(f(Ob1) = f(Ob2)) << 1 for “random” objects Ob1 ≠ Ob2.
Usage
Fingerprints are used to:
• Identify objects
• Compare objects remotely
• Test an object for changes
Since fingerprints are smaller than the objects themselves, they are very useful as stand-ins for remote objects.
Usage
Object identification example: Software Cloning
During maintenance, the need arises for a module very similar in character to one that already exists. Because of time pressure, the existing module is simply copied, all names are systematically changed, and the copy is then modified to serve the new needs.
Maintenance Problem: If a bug is detected in a clone or in the original, it probably also resides in the original and/or the other clones. Besides, clones arise because of time pressure, but the short-cut ends up costing more in the long run. Thus, it is better to identify clones during maintenance.
Clone Identification: Systematically suppress names, then test whether the function code is identical.
Johnson, J.H. Substring Matching for Software Clone Detection, and Change Tracking. International Conference on Software Maintenance. Victoria, BC, 1994, p. 120-126.
Usage
Similarity Testing for Files
n-gram: Contiguous substring of n characters in a file.
File Similarity: Count the number of occurrences of particular n-grams in the files. Use the fingerprint of an n-gram as a hash value, and count the fingerprints instead of the n-grams themselves.
Cohen, J.D. Recursive Hashing Functions for n-grams. ACM Trans. Information Systems, p. 291-320.
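The counting idea above can be sketched as follows. This is only an illustration, not the method of the cited paper: the names `ngram_fingerprint_counts` and `similarity`, the base 257, the modulus 2^61 - 1, and the overlap measure (shared counts divided by combined counts) are all assumptions made for the example.

```python
from collections import Counter

BASE = 257                 # assumed multiplier, larger than any byte value
MODULUS = (1 << 61) - 1    # assumed prime modulus

def ngram_fingerprint_counts(data: bytes, n: int) -> Counter:
    """Count the rolling fingerprints of all n-grams in data."""
    counts = Counter()
    if n <= 0 or len(data) < n:
        return counts
    top = pow(BASE, n - 1, MODULUS)      # weight of the symbol leaving the window
    fp = 0
    for byte in data[:n]:                # fingerprint of the first n-gram
        fp = (fp * BASE + byte) % MODULUS
    counts[fp] += 1
    for i in range(n, len(data)):        # slide the window one symbol at a time
        fp = ((fp - data[i - n] * top) * BASE + data[i]) % MODULUS
        counts[fp] += 1
    return counts

def similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Hypothetical overlap measure on fingerprint counts (1.0 means identical counts)."""
    ca, cb = ngram_fingerprint_counts(a, n), ngram_fingerprint_counts(b, n)
    shared = sum((ca & cb).values())     # minimum count per fingerprint
    total = sum((ca | cb).values())      # maximum count per fingerprint
    return shared / total if total else 1.0
```

Only the fingerprints are stored and compared, so the tables stay small even for long n-grams.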
Usage
Remote String Searches: Find all occurrences of a given string in files on remote servers. Instead of sending the string to all servers, only a fingerprint and the length l of the string are sent. The servers generate running fingerprints of all l-grams in their files and compare them with the string’s fingerprint.
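A minimal sketch of the server side of such a search, assuming a Karp-Rabin style fingerprint over the integers modulo a prime. The function names, the base 257, and the modulus 2^61 - 1 are illustrative choices, not taken from the slides.

```python
BASE = 257                 # assumed multiplier
MODULUS = (1 << 61) - 1    # assumed prime modulus

def fingerprint(data: bytes) -> int:
    """Fingerprint computed by the client for the search string."""
    fp = 0
    for byte in data:
        fp = (fp * BASE + byte) % MODULUS
    return fp

def server_search(file_data: bytes, target_fp: int, length: int) -> list[int]:
    """Return the offsets whose l-gram fingerprint equals target_fp (candidate matches)."""
    if length <= 0 or len(file_data) < length:
        return []
    top = pow(BASE, length - 1, MODULUS)   # weight of the symbol leaving the window
    fp = fingerprint(file_data[:length])
    hits = [0] if fp == target_fp else []
    for i in range(length, len(file_data)):
        # Running fingerprint: drop the oldest symbol, shift, add the newest.
        fp = ((fp - file_data[i - length] * top) * BASE + file_data[i]) % MODULUS
        if fp == target_fp:
            hits.append(i - length + 1)
    return hits
```

For example, `server_search(b"abracadabra", fingerprint(b"abra"), 4)` reports offsets 0 and 7. In general a hit is only a candidate, since two different l-grams can share a fingerprint.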
Usage
Remote File Comparison
Original Problem: How to compare pages of remote replicas of a database.
Solution: Calculate fingerprints (“signatures”) of each page. Calculate a super-signature from the page signatures. If the super-signatures coincide, conclude that the replicas are in sync. If not, run a “smart” protocol to find the non-fitting signatures.
Abdel-Ghaffar, K. A. S., El-Abbadi, A. Efficient Detection of Corrupted Pages in a Replicated File. ACM Symp. Distributed Computing, 1993, p. 219-227.
Barbara, D., Garcia-Molina, H., Feijoo, B. Exploiting Symmetries for Low-Cost Comparison of File Copies. Proc. Int. Conf. Distributed Computing Systems, 1988, p. 471-479.
Barbara, D., Lipton, R. J. A Class of Randomized Strategies for Low-Cost Comparison of File Copies. IEEE Trans. Parallel and Distributed Systems, vol. 2(2), 1991, p. 160-170.
Fuchs, W., Wu, K. L., Abraham, J. A. Low-Cost Comparison and Diagnosis of Large, Remotely Located Files. Proc. Symp. Reliability Distributed Software and Database Systems, 1986, p. 67-73.
Schwarz, Th., Bowdidge, B., Burkhard, W. Low Cost Comparison of Files. Int. Conf. on Distr. Comp. Syst. (ICDCS 90), p. 196-201.
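The comparison scheme can be sketched roughly as below. The fallback used here simply compares all page signatures one by one; it stands in for the “smart” protocols of the cited papers, which are not reproduced. All names, the multiplier, and the modulus are assumptions, and both replicas are assumed to have the same number of pages.

```python
BASE = 1_000_003           # assumed multiplier
MODULUS = (1 << 61) - 1    # assumed prime modulus

def signature(symbols) -> int:
    """Fingerprint of a sequence of non-negative integers (bytes or page signatures)."""
    sig = 0
    for s in symbols:
        sig = (sig * BASE + s) % MODULUS
    return sig

def page_signatures(data: bytes, page_size: int) -> list[int]:
    """One signature per page of page_size bytes."""
    return [signature(data[i:i + page_size]) for i in range(0, len(data), page_size)]

def differing_pages(local: bytes, remote_sigs: list[int], remote_super: int,
                    page_size: int) -> list[int]:
    """Return indices of pages whose signatures differ from the remote replica."""
    local_sigs = page_signatures(local, page_size)
    if signature(local_sigs) == remote_super:
        return []                      # super-signatures coincide: assume in sync
    # Naive fallback (assumption): compare every page signature.
    return [i for i, (a, b) in enumerate(zip(local_sigs, remote_sigs)) if a != b]
```

In the common case where the replicas agree, only the single super-signature has to cross the network.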
Usage
Secure Signatures: To identify an object, maintain its signature. An adversary cannot alter the object in a computationally feasible way without also changing the signature. Such a signature is called a “cryptographically secure signature”. Examples: SHA-1, MD5. Used for authentication, e.g. in computer forensics, digital signatures, etc.
SHA-1
• 20 bytes long.
• Designed for fast calculation.
• Considered unbreakable.
• Used increasingly in applications where cryptographic security is not needed.
Radia Perlman’s Law of Cryptography: “If a lot of smart people spent lots of time trying to break a scheme, and did not succeed, then it cannot be done.”
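As a small illustration of using SHA-1 as a signature for change detection, the sketch below relies on Python's standard `hashlib` module. The helper names `sha1_signature` and `is_unchanged` are hypothetical.

```python
import hashlib

def sha1_signature(data: bytes) -> bytes:
    """Return the 20-byte SHA-1 digest of data."""
    return hashlib.sha1(data).digest()

def is_unchanged(data: bytes, stored_signature: bytes) -> bool:
    """Test an object for changes by comparing against a stored signature."""
    return sha1_signature(data) == stored_signature
```

A caller would store `sha1_signature(obj)` once and later call `is_unchanged(obj, stored)` instead of keeping a full copy of the object.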
Useful Properties of Fingerprints
• Fast calculation.
• Low collision rate. If the fingerprints have length l bits, then the probability of a collision should be 2^(-l). Even small changes to an object should change its fingerprint.
• Cryptographically unbreakable. Given a signature, one cannot feasibly construct an object with this signature.
Useful Properties of Fingerprints
• Updatable: If the object changes, then we can update the signature from the old signature and the changes.
• Concatenation of objects: If an object is made up of several objects, then we can calculate the signature of the super-object from the signatures of its constituents, possibly in a way that allows us to quickly pinpoint the component objects that differ.
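The update property can be illustrated with a polynomial-style fingerprint over the integers modulo a prime. The multiplier `ALPHA`, the modulus, and the function names are assumptions for this sketch; the point is only that a single-symbol change is folded into the old signature without rereading the object.

```python
ALPHA = 257                # assumed multiplier
MODULUS = (1 << 61) - 1    # assumed prime modulus

def signature(symbols) -> int:
    """sig = sum of a_i * ALPHA^(N-1-i), computed with one multiply and one add per symbol."""
    sig = 0
    for s in symbols:
        sig = (sig * ALPHA + s) % MODULUS
    return sig

def update(sig: int, length: int, index: int, old: int, new: int) -> int:
    """Signature after symbol `index` changes from old to new, using only the old signature."""
    weight = pow(ALPHA, length - 1 - index, MODULUS)
    return (sig + (new - old) * weight) % MODULUS
```

The cost of `update` depends on the position through one modular exponentiation, not on the size of the object.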
Karp Rabin Style Fingerprints
Here, the calculation takes place in a ring R with multiplication and addition.
Karp, R. M., Rabin, M. O. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, Vol. 31, No. 2, March 1987.
Karp Rabin Style Fingerprints
• Calculation time is linear in the object length N (1 multiplication and 1 addition per symbol).
Karp Rabin Style Fingerprints
• Easy to calculate consecutive n-grams.
• Easy to calculate signatures of concatenations:
  sig(a_0, a_1, …, a_(l-1), a_l, a_(l+1), …, a_(l+n-1)) = α^n · sig(a_0, a_1, …, a_(l-1)) + sig(a_l, a_(l+1), …, a_(l+n-1))
  where α is the ring element used as the multiplier. Possibly use a table of values of α^n.
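The concatenation rule can be checked numerically. The sketch below assumes the ring of integers modulo a prime and a Horner-style signature in which the signature of a single symbol is the symbol itself; `ALPHA`, `MODULUS`, and the function names are illustrative.

```python
ALPHA = 257                # assumed multiplier
MODULUS = (1 << 61) - 1    # assumed prime modulus

def sig(symbols) -> int:
    """One multiplication and one addition per symbol."""
    value = 0
    for s in symbols:
        value = (value * ALPHA + s) % MODULUS
    return value

def concat_sig(sig_left: int, sig_right: int, right_length: int) -> int:
    """Signature of left + right from the two signatures and the length n of the right part."""
    return (pow(ALPHA, right_length, MODULUS) * sig_left + sig_right) % MODULUS
```

With `left = b"fast "` and `right = b"fingerprints"`, `concat_sig(sig(left), sig(right), len(right))` equals `sig(left + right)`. A precomputed table of powers of the multiplier would replace the call to `pow`.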
Karp Rabin Style Fingerprints
• Not cryptographically secure. A cryptographically secure signature is a one-way function; to be efficiently calculable, it needs to process large portions of a string at a time. Thus, cryptographic security is not a desirable property in general.
Choice of Ring R
1. Integers modulo a prime p. The ring is then a well-understood field. But reducing modulo p is a costly operation.
2. Integers modulo 2^f. Powers of two are zero divisors, e.g. 2 · 2^(f-1) = 0. This excludes powers of 2 as the multiplier α.
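A short sketch of why a power of two is a poor multiplier modulo 2^f: after f multiplications the contribution of a symbol is shifted out entirely. The function name and the sample data are made up for the illustration.

```python
def sig_mod_power_of_two(symbols, alpha: int, f: int) -> int:
    """Horner-style signature in the integers modulo 2^f."""
    mask = (1 << f) - 1
    value = 0
    for s in symbols:
        value = (value * alpha + s) & mask   # reduction modulo 2^f is a cheap mask
    return value

F_BITS = 8
a = [7, 1, 2, 3, 4, 5, 6, 7, 8]
b = [200, 1, 2, 3, 4, 5, 6, 7, 8]            # differs from a only in the first symbol

# With multiplier 2, the first symbol is multiplied by 2^8 = 0 (mod 2^8) and vanishes.
same_with_two = sig_mod_power_of_two(a, 2, F_BITS) == sig_mod_power_of_two(b, 2, F_BITS)
# An odd multiplier is invertible modulo 2^8, so the difference survives.
same_with_three = sig_mod_power_of_two(a, 3, F_BITS) == sig_mod_power_of_two(b, 3, F_BITS)
```

Here `same_with_two` is true and `same_with_three` is false, so the multiplier 2 fails to detect the change.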
Choice of Ring R
3. Reduction by a polynomial. The space of unsigned integers 0, ..., 2^32 - 1 can be naturally identified with the space of all polynomials in k[t] over k = {0,1} with degree up to 31. Select a polynomial g(t) with degree f up to 31 and consider the ring R = k[t]/(g(t)). Elements of this ring are naturally identified with all unsigned integers 0, ..., 2^f - 1. Addition of these polynomials corresponds to the fast XOR. Multiplication is more difficult, but multiplication by t is a left shift followed by conditionally XORing with g. This is the most promising construction.
R = k[t]/(g(t)) Example
• Set g = t^4 + t + 1, that is, g = 10011.
• Elements of R are all bit strings of length 4.
• To add 0101 and 1100, just XOR: 0101 + 1100 = 1001.
• To multiply with t = 0010, left shift and XOR conditionally with g.
• To multiply 0010 with 0010, left-shift the first and obtain 0,0100. The leading coefficient is zero and is dropped. The result is 0100.
• To multiply 1100 with 0010, left-shift to obtain 1,1000. The leading coefficient is one, so XOR with g = 10011 to obtain the result 1011.
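The arithmetic of this example can be written down directly. The sketch below hard-codes the polynomial 10011 of degree 4 from the slide; the names `G`, `F`, `add`, and `mul_by_t` are illustrative.

```python
G = 0b10011   # reduction polynomial t^4 + t + 1 from the example
F = 4         # degree of the polynomial; elements are 4-bit strings

def add(a: int, b: int) -> int:
    """Addition in R is a bitwise XOR."""
    return a ^ b

def mul_by_t(a: int) -> int:
    """Multiplication by t: left shift, then XOR with G if the leading coefficient is one."""
    a <<= 1
    if a & (1 << F):
        a ^= G
    return a
```

Calling `mul_by_t(0b1100)` gives `0b1011` and `mul_by_t(0b0010)` gives `0b0100`, as in the bullets above.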
Galois Fields
If the reduction polynomial is irreducible, then R is a Galois field. If we use a Galois field, we can concatenate several fingerprints to obtain a signature.
Galois Fields
If we use the multiplier vector (α, α^2, α^3, …, α^n), then the signature consists of the parity symbols of a generalized, non-systematic Reed-Solomon code. Since these codes are MDS, any change of up to n symbols in the object changes the signature.
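The guarantee can be checked by brute force in a small field. The sketch below assumes GF(2^4) with the polynomial t^4 + t + 1, the multiplier α = t, a Horner-style component fingerprint, and objects shorter than 15 symbols (the order of t). All function names are hypothetical. Because the signature is linear, it is enough to check that no nonzero difference touching at most n symbols has an all-zero signature.

```python
from itertools import combinations, product

G, F = 0b10011, 4   # assumed field GF(2^4) with reduction polynomial t^4 + t + 1

def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^4)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << F):
            a ^= G
    return result

def gf_pow(a: int, e: int) -> int:
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def fingerprint(symbols, beta: int) -> int:
    """Component fingerprint with multiplier beta."""
    value = 0
    for s in symbols:
        value = gf_mul(value, beta) ^ s
    return value

def signature(symbols, n: int, alpha: int = 0b0010) -> tuple:
    """n-fold signature: fingerprints for alpha, alpha^2, ..., alpha^n, concatenated."""
    return tuple(fingerprint(symbols, gf_pow(alpha, j)) for j in range(1, n + 1))

def detects_all_changes(length: int, n: int) -> bool:
    """True if every nonzero difference in at most n symbols has a nonzero n-fold signature."""
    zero = (0,) * n
    for k in range(1, n + 1):
        for positions in combinations(range(length), k):
            for values in product(range(1, 1 << F), repeat=k):
                diff = [0] * length
                for p, v in zip(positions, values):
                    diff[p] = v
                if signature(diff, n) == zero:
                    return False
    return True
```

For instance, the two-symbol difference `[0b0001, 0b0010]` has a zero 1-fold signature but a nonzero 2-fold signature, which is why a single fingerprint is not enough for two changes.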
Galois Field Signatures
To calculate a Galois field footprint, we only need per symbol:
• One XOR.
• One left-shift.
• One test whether the leading coefficient is now one.
• Conditionally one XOR.
Speeding up Galois field footprints
However, we do not have to execute the reduction step each time. Instead, left shift and XOR b times (Broder’s idea). Then use a table look-up to reduce the “overhang”.
Broder, A. Some applications of Rabin's fingerprinting method. In Capocelli, De Santis, and Vaccaro (eds.), Sequences II: Methods in Communications, Security, and Computer Science, p. 143-152. Springer-Verlag, 1993.
Speeding up Galois field footprints
String is (1000, 1100, 1010, 0111, 0011, ...). Choose the reduction polynomial g = 10011. This is an irreducible polynomial.
Step 0: 1000
Step 1: 1,0000 + 1100 = 1,1100
Step 2: 11,1000 + 1010 = 11,0010
Step 3: 110,0100 + 0111 = 110,0011
Step 4: 1100,0110 + 0011 = 1100,0101
Now use table look-up to calculate 0101 + table[1100] = 0101 + 0111 = 0010.
8 elementary ops + 1 shift right + 1 table look-up + 1 XOR.
Speeding up Galois field footprints
For comparison, the same string (1000, 1100, 1010, 0111, 0011, ...) with a reduction by g = 10011 after every step:
Step 0: 1000
Step 1: 1,0000 + 1100 = 1,1100; reduce: 1,1100 + g = 1,1100 + 1,0011 = 1111.
Step 2: 1,1110 + 1010 = 1,0100; reduce: 1,0100 + 1,0011 = 0111.
Step 3: 0,1110 + 0111 = 1001.
Step 4: 1,0010 + 0011 = 1,0001; reduce: 1,0001 + 1,0011 = 0010.
8 elementary ops + 4 condition evaluations + 2 elementary ops on average.
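The step-by-step calculation with a reduction after every symbol can be sketched as follows, assuming 4-bit symbols and the polynomial 10011 from the example. The function name and the decision to return all intermediate values are illustrative.

```python
G, F = 0b10011, 4   # reduction polynomial t^4 + t + 1 and its degree

def footprint_stepwise(symbols) -> list[int]:
    """Return the reduced footprint after each symbol (reduction at every step)."""
    value = 0
    trace = []
    for s in symbols:
        value = (value << 1) ^ s        # one left shift, one XOR
        if value & (1 << F):            # one test of the leading coefficient
            value ^= G                  # conditionally one XOR
        trace.append(value)
    return trace

steps = footprint_stepwise([0b1000, 0b1100, 0b1010, 0b0111, 0b0011])
```

For the example string, `steps` holds 1000, 1111, 0111, 1001, 0010, matching Steps 0 to 4.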
Speeding up Galois field footprints
How do we calculate the table entries? Systematically reduce by t-multiples of g = 10011. To calculate the table entry for 12 = 1100, reduce 1100,0000 in four steps.
Reduce with t^3·g: 1100,0000 + 1001,1000 = 0101,1000.
Reduce with t^2·g: 0101,1000 + 0100,1100 = 0001,0100.
No step with t·g, since the corresponding coefficient is zero.
Reduce with g: 0001,0100 + 0001,0011 = 0000,0111 = 0111.
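Table construction and the deferred reduction can be combined in a short sketch. It assumes 4-bit symbols, the polynomial 10011, and a batch of four shifts between look-ups. The batching starts with the first symbol, so the intermediate values differ slightly from the worked example, but the final footprint is the same. The names `build_table` and `footprint_deferred` are hypothetical.

```python
G, F = 0b10011, 4   # reduction polynomial t^4 + t + 1 and its degree

def build_table(overhang_bits: int) -> list[int]:
    """table[h] = (h * t^F) mod G for every possible overhang h."""
    table = []
    for h in range(1 << overhang_bits):
        value = h << F
        for k in range(overhang_bits - 1, -1, -1):   # reduce with t^k * G, highest first
            if value & (1 << (F + k)):
                value ^= G << k
        table.append(value)
    return table

def footprint_deferred(symbols, batch: int = 4) -> int:
    """Shift and XOR `batch` times, then reduce the overhang with one table look-up."""
    table = build_table(batch)
    mask = (1 << F) - 1
    value, shifts = 0, 0
    for s in symbols:
        value = (value << 1) ^ s
        shifts += 1
        if shifts == batch:
            value = (value & mask) ^ table[value >> F]
            shifts = 0
    return (value & mask) ^ table[value >> F]        # reduce any remaining overhang
```

`build_table(4)[0b1100]` is `0b0111`, the entry derived above, and the example string again yields the footprint 0010.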
Speeding up Galois field footprints
The optimal table size needs to be determined experimentally. The table needs to fit in cache, so it cannot be much bigger than 2^16 entries. If the table is too small, then the look-up cost does not amortize well.
Galois field signatures
Galois field signatures are concatenations of Galois field fingerprints. Broder tables work for multiplication with t^2, t^3, t^4 as well, but less efficiently, since now we shift two, three, or four times per symbol, so that we need to use table look-up more often.
Performance Results
• 1.772 msec per MB for 16 bit parity.
• 2.012 msec per MB for 16 bit 1power.
• 3.114 msec per MB for 16 bit 2power.
Results for a 1.99 GHz Pentium 4 with 512 MB memory.
How Long should Signatures be?
• Key fact: There are 31,557,600 seconds in a year.
• At x calculations per second, there will be 31,557,600x incidents per year, which will lead to a collision fewer than 2^(-31) · 31,557,600x ≈ 0.015x times per year for 32 bit signatures.
• So, the minimum length should be 64 bits. We can achieve this easily.
• With larger signatures, collisions become rarer than hard drive failures (writing on the wrong track), software failures, etc.
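The estimate is simple arithmetic and can be reproduced as below. The per-comparison collision probability 2^(-31) for 32 bit signatures is taken from the slide; the value 2^(-63) used for 64 bit signatures is an assumption by analogy, and the function name is illustrative.

```python
SECONDS_PER_YEAR = 31_557_600   # 365.25 days of 86,400 seconds

def collisions_per_year(rate_per_second: float, collision_probability: float) -> float:
    """Expected number of collisions per year at the given calculation rate."""
    return SECONDS_PER_YEAR * rate_per_second * collision_probability

per_year_32 = collisions_per_year(1, 2 ** -31)   # roughly 0.015 per year at x = 1
per_year_64 = collisions_per_year(1, 2 ** -63)   # assumed bound for 64 bit signatures
```

At one calculation per second, `per_year_32` is about 0.0147, while `per_year_64` is on the order of 10^(-12).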
Research Questions
• * property of a signature: Changing up to n symbols changes the n-fold signature for sure.
• It is known that if we change to a different multiplier vector, e.g. one where the components are all primitive elements, we lose the * property. Are there other multiplier vectors with this property?
• We can use Broder tabling with different irreducible polynomials and then concatenate the Galois field footprints. Can we find a condition under which the * property holds?
• What properties hold when the reduction polynomial is not irreducible? It seems statistically fine as long as the polynomial has a nonzero constant coefficient.