Fast Modular Reduction
Will Hasenplaugh, Gunnar Gaubatz, Vinodh Gopal
June 27, 2007
Modular Multiplication
• Modular multiplication is used in public-key cryptography:
  • Diffie-Hellman and RSA
  • Prime-field elliptic curve cryptography
• Compute $AB \bmod M$, where $A$, $B$ and $M$ are typically hundreds to thousands of bits long
• We present a variant of Barrett's modular reduction algorithm that exploits Karatsuba multiplication and modular folding
• The analysis is software-focused
• We use an abstract processor to compare algorithms fairly:
  • The native word size is $w$ bits (a power of 2)
  • 1-cycle add and an $m$-cycle multiply
• We present example data on an 8-bit processor with a 2-cycle multiplier
  • Atmel AVR series, representative of embedded handheld devices
• Our algorithm is also applicable to hardware acceleration
Digital Enterprise Group
Montgomery vs. Barrett
Word-Serial Montgomery
• Pro:
  • Regularity
  • Interleaved multiply and reduce
  • Low-complexity quotient estimation
  • Right-to-left computation leads to convenient hardware pipelines
• Con:
  • Transformation overhead
  • $n^2$ complexity
Barrett
• Pro:
  • No transformation overhead
  • Large-digit-based computation
  • Allows sub-$n^2$ multiplication techniques
  • Flexible 'off the shelf' hardware
• Con:
  • Quotient estimation requires a 'large digit' multiplication
  • Left-to-right computation is less convenient for hardware
Barrett vs. Montgomery
• Performance of $n^2$ Barrett approaches ~2/3 of Montgomery
• Quotient estimation for Montgomery is amortized as operands grow
Karatsuba Multiplication
• Recursive multiplication algorithm with $O(n^{1.585})$ complexity.
• 'Schoolbook' multiplication complexity scales as $O(n^2)$, but requires fewer additions per recursion.
• $N = AB$, where $A = a_1 2^n + a_0$ and $B = b_1 2^n + b_0$
• Schoolbook multiplication: $N = a_1 b_1 2^{2n} + (a_1 b_0 + a_0 b_1) 2^n + a_0 b_0$
• Karatsuba multiplication: $N = a_1 b_1 2^{2n} + [(a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0] 2^n + a_0 b_0$
[Diagram: $A$ and $B$ are split into halves $a_1, a_0$ and $b_1, b_0$; the products $a_1 b_1$, $a_0 b_0$ and $(a_1 + a_0)(b_1 + b_0)$ are formed, and $a_0 b_0$ and $a_1 b_1$ are subtracted from the last one to assemble $N = AB$.]
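The Karatsuba identity above can be sketched in a few lines of Python. This is an illustration only, not the authors' AVR implementation: the function name `karatsuba`, the `threshold` parameter and the use of Python's built-in multiply as the base case are all assumptions made for the example.

```python
def karatsuba(a: int, b: int, n: int, threshold: int = 32) -> int:
    """Multiply non-negative integers a and b, each assumed to fit in n bits.

    `threshold` (hypothetical) is the size below which we fall back to the
    built-in multiply, standing in for a native or schoolbook multiply.
    It must be at least 3 so that every recursive call shrinks n.
    """
    if n <= threshold:
        return a * b

    half = (n + 1) // 2
    mask = (1 << half) - 1
    a1, a0 = a >> half, a & mask  # A = a1 * 2^half + a0
    b1, b0 = b >> half, b & mask  # B = b1 * 2^half + b0

    high = karatsuba(a1, b1, half, threshold)  # a1 * b1
    low = karatsuba(a0, b0, half, threshold)   # a0 * b0
    # The sums a1 + a0 and b1 + b0 may carry one bit past `half`.
    cross = karatsuba(a1 + a0, b1 + b0, half + 1, threshold) - high - low

    # N = a1*b1 * 2^(2*half) + cross * 2^half + a0*b0
    return (high << (2 * half)) + (cross << half) + low
```

Three recursive multiplies replace the four of the schoolbook split, at the cost of a few extra additions and subtractions per level.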
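Recursive Karatsuba Decomposition
• Each recursion splits $A$ into $a_1$ and $a_0$ and forms the sum $a_1 + a_0$, which can carry into an 'extra' word.
• The value in the 'extra' word is $\le 1$ after one recursion, $\le 2$ after two and $\le 3$ after three.
• For $k$ recursions, the 'extra' word is $\le \log_2 k$ bits.
• Just one extra word on an 8-bit machine is sufficient to handle multiplication of numbers up to $2^{258}$ bits. There are fewer particles in the universe than that. So, we probably won't need to rewrite this code.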
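The growth of the 'extra' word can be checked numerically. The sketch below is an assumption-laden illustration: the helper name `extra_word_values` is hypothetical, and it takes an all-ones operand as the worst case rather than proving that it is one.

```python
def extra_word_values(n_bits: int, depth: int) -> list[int]:
    """Track the overflow above the low half across `depth` recursions.

    Starts from an all-ones n_bits operand (assumed worst case) and
    repeatedly replaces it with high_half + low_half, as Karatsuba does
    when it forms a1 + a0.
    """
    value = (1 << n_bits) - 1
    width = n_bits
    extras = []
    for _ in range(depth):
        half = width // 2
        high, low = value >> half, value & ((1 << half) - 1)
        value = high + low             # the a1 + a0 operand of the next level
        width = half
        extras.append(value >> width)  # whatever spills past `width` bits
    return extras


# An 8-bit extra word holds values up to 255, i.e. 255 recursions,
# each of which halves the operand down to one 8-bit word:
max_operand_bits = 8 * 2**255
```

Under these assumptions the overflow grows by at most one per level, and `max_operand_bits` works out to $2^{258}$, matching the figure on the slide.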
Carry Handling
• There is considerable overhead in the naïve implementation of Karatsuba.
• At a recursion depth of 4, ~20% of the multiplies are with sparsely populated 'extra' words.
• We turn sparsely populated multiplies into branches and adds.
• $N = AB$, where $A = a_h 2^n + a_l$ and $B = b_h 2^n + b_l$; $a_h$ and $b_h$ are booleans.
• $N = a_h b_h 2^{2n} + [a_h b_l + b_h a_l] 2^n + a_l b_l$
[Diagram: compute $a_l b_l$; if $b_h = 1$, add $a_l$; if $a_h = 1$, add $b_l$; if $a_h \,\&\, b_h = 1$, add 1; each addend is aligned as in the formula above to give $N$.]
• Each recursion is a conveniently sized multiply, so there are no 'extra' words.
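A minimal sketch of the branch-and-add idea, assuming each operand is an $n$-bit value plus a single carry bit. The function name `mul_with_carry_bits` is hypothetical, and Python's built-in multiply stands in for the $n$-bit multiply.

```python
def mul_with_carry_bits(a: int, b: int, n: int) -> int:
    """Multiply two (n+1)-bit operands using one n-bit multiply plus adds.

    a = ah * 2^n + al and b = bh * 2^n + bl, where ah and bh are single bits.
    """
    mask = (1 << n) - 1
    ah, al = a >> n, a & mask
    bh, bl = b >> n, b & mask
    if ah > 1 or bh > 1:
        raise ValueError("operands must fit in n + 1 bits")

    result = al * bl               # the only real multiply
    if bh:
        result += al << n          # bh * al * 2^n
    if ah:
        result += bl << n          # ah * bl * 2^n
    if ah and bh:
        result += 1 << (2 * n)     # ah * bh * 2^(2n)
    return result
```

The three conditional additions replace what would otherwise be multiplies by a nearly empty 'extra' word.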
Karatsuba vs. Schoolbook Multiplication
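Barrett's Algorithm
• $A$, $B$ and $M$ are $n$-bit numbers. We seek $R = AB \bmod M$ using Barrett's algorithm.
• A total of 3 $n$-bit multiplies.
[Diagram: $N = A \times B$ is split into $N / 2^n$ and $N \bmod 2^n$; the upper part $N / 2^n$ is multiplied by $\mu$ to give $\mu N / 2^n$, whose upper part approximates $\mu N / 2^{2n}$; that estimate is multiplied by $M$ to give approximately $\mu N M / 2^{2n}$, which is subtracted from $N$ to give $R$.]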
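The three multiplies can be sketched as follows. The slide does not define $\mu$; the sketch assumes the usual Barrett constant $\mu = \lfloor 2^{2n} / M \rfloor$, precomputed once per modulus, and adds a final correction loop because the quotient estimate can fall slightly short. The names `barrett_setup` and `barrett_mulmod` are hypothetical.

```python
def barrett_setup(m: int, n: int) -> int:
    """Precompute mu = floor(2^(2n) / m) for an n-bit modulus m (assumed definition)."""
    if m.bit_length() != n:
        raise ValueError("m must be exactly n bits long")
    return (1 << (2 * n)) // m


def barrett_mulmod(a: int, b: int, m: int, n: int, mu: int) -> int:
    """Return a * b mod m for a, b < m using three large multiplies."""
    big_n = a * b                       # multiply 1: N = A * B
    q = ((big_n >> n) * mu) >> n        # multiply 2: q ~ mu * N / 2^(2n) ~ N / M
    r = big_n - q * m                   # multiply 3: subtract the estimated multiple of M
    while r >= m:                       # q never overshoots, so only subtractions are needed
        r -= m
    return r
```

A possible use, with an arbitrary 64-bit modulus chosen purely for illustration:

```python
m = (1 << 64) - 59
mu = barrett_setup(m, 64)
r = barrett_mulmod(0x0123456789ABCDEF, 0x0FEDCBA987654321, m, 64, mu)
```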
Barrett vs. Montgomery
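Folding
• We accelerate the reduction process by partially reducing $N$ ($= AB$) with an inexpensive method called folding.
• Precompute $M' = 2^{3s} \bmod M$.
• $N' = (N \bmod 2^{3s}) + \lfloor N / 2^{3s} \rfloor \cdot M'$
[Diagram: $N$ is split into $N / 2^{3s}$ and $N \bmod 2^{3s}$; the upper part is multiplied by $M'$ to give approximately $N M' / 2^{3s}$, which is added to the lower part to give $N'$.]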
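A single fold can be sketched as below. The slide does not say what $s$ is; the example assumes $s = n/2$, so the fold point $2^{3s}$ sits at $1.5n$ bits. The function name `fold` and its `k` parameter (the fold point in bits) are hypothetical.

```python
def fold(big_n: int, m: int, k: int) -> int:
    """Partially reduce big_n modulo m by folding at bit position k.

    Uses 2^k = M' (mod m), so high * 2^k + low is congruent to
    high * M' + low. In practice M' would be precomputed once per modulus.
    """
    m_prime = (1 << k) % m
    high, low = big_n >> k, big_n & ((1 << k) - 1)
    return low + high * m_prime
```

For a $2n$-bit product and an $n$-bit modulus, `fold(N, m, 3 * n // 2)` gives a value congruent to $N$ modulo $m$ that is only about $1.5n$ bits long, using a multiply of roughly half the width of a full Barrett step.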
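Iterative Folding
• We can play the same trick again. $F$ times, in fact.
[Diagram: $N$ is split into $N / 2^{1.5n}$ and $N \bmod 2^{1.5n}$; the upper part is multiplied by $M^{(1)}$ and added to the lower part to give $N^{(1)}$. $N^{(1)}$ is then split at $2^{1.25n}$, its upper part multiplied by $M^{(2)}$ and added to $N^{(1)} \bmod 2^{1.25n}$ to give $N^{(2)}$, which can in turn be split at $2^{1.125n}$.]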
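Repeated folding can be sketched as a loop over fold points $1.5n$, $1.25n$, $1.125n$, and so on. The constants $M^{(i)}$ are assumed here to be $2^{k_i} \bmod M$ for fold point $k_i$, by analogy with $M'$ in the single fold; the slides do not spell this out. The function name `iterative_fold` is hypothetical.

```python
def iterative_fold(big_n: int, m: int, n: int, folds: int) -> int:
    """Fold a 2n-bit value `folds` times at 1.5n, 1.25n, ... bits.

    The result is congruent to big_n modulo m but is not fully reduced;
    a final reduction (for example Barrett) is still needed afterwards.
    """
    if n % (1 << folds) != 0:
        raise ValueError("n must be divisible by 2**folds")
    for i in range(1, folds + 1):
        k = n + (n >> i)            # fold point: n * (1 + 2^-i) bits
        m_i = (1 << k) % m          # assumed M^(i); precomputed in practice
        high, low = big_n >> k, big_n & ((1 << k) - 1)
        big_n = low + high * m_i
    return big_n
```

Each fold uses a smaller multiply than the one before it, since the part above the fold point shrinks from about $0.5n$ bits to $0.25n$ and so on.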
Iterative Folding (F = 2)
Summary
• This Fast Modular Reduction technique is ~2x faster than Montgomery for RSA encryption with 512- to 1024-bit keys.
• As security requirements heighten, key sizes will grow to meet them, and the asymptotic advantage of Karatsuba will continue to shine. We see a ~3x and ~4x advantage for 2048- and 4096-bit keys, respectively.
• The speedup of a multiplier-bound, $w$-bit architecture is [expression not preserved in this transcript]
• Strong encryption on low-power handheld devices is challenging
  • Example: a 16 MHz 8-bit Atmel AVR computes a 4096-bit RSA operation in almost 4 minutes with Montgomery, but we can do it in 1 minute.