
An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm

Miaoqing Huang. Nov. 5, 2010.



Presentation Transcript


  1. An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm Miaoqing Huang Nov. 5, 2010

  2. Outline • Background • Optimized hardware architecture • Avoid the extra clock cycle delay • The overall architecture • Each PE focuses on the computation of one word of S • The data dependency graph of the proposed architecture • Comparison with other published architectures • Demonstration of computation • Resource utilization and performance comparison • High-radix architecture • Conclusion • References

  3. Background • The Montgomery multiplication algorithm is used in modular exponentiation to avoid division by the modulus M. • One implementation is the radix-2 Montgomery multiplication algorithm. For n-bit operands X, Y and M (M odd), it computes the Montgomery product S = X • Y • 2^(-n) mod M rather than X • Y mod M directly; S is also n bits long once it is reduced below M.
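The algorithm listing from this slide did not survive in the transcript. The following is a minimal software sketch of the textbook radix-2 Montgomery multiplication, not the presenter's own listing. The function name `mont_mul_radix2` and the use of Python integers are assumptions made for illustration.

```python
def mont_mul_radix2(x, y, m, n):
    """Radix-2 Montgomery product (illustrative sketch).

    Assumes m is odd and 0 <= x, y < m < 2**n.
    Returns S in [0, 2*m) with S * 2**n congruent to x * y modulo m;
    a final conditional subtraction of m would bring S below m.
    """
    s = 0
    for i in range(n):
        xi = (x >> i) & 1   # scan X bit by bit, least significant bit first
        s += xi * y         # S = S + x_i * Y
        if s & 1:           # parity test on the least significant bit of S
            s += m          # adding the odd modulus makes S even
        s >>= 1             # exact division by 2, no trial division needed
    return s


if __name__ == "__main__":
    s = mont_mul_radix2(7, 5, 13, 4)
    # Check the defining congruence instead of hard-coding a value.
    print(s, (s * 2**4) % 13 == (7 * 5) % 13)
```

Each loop iteration needs only an addition, a parity test and a shift, which is why the method maps well to hardware.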

  4. Background (cont.) • Some definitions • n: the bit length of the original operands • w: the word length used in the actual computation • e = ⌈(n+1)/w⌉: the number of words needed to store S • S(j): the j-th word of S • Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm • Scans X bit by bit and scans Y and M word by word • Calculates S word by word • Well suited to hardware implementation because carry propagation stays short (within one w-bit word)
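The word-by-word scanning described above can be sketched in software as follows. This is an illustrative model under the slide's definitions of n, w and e, not the hardware design itself. The helper names `to_words`, `from_words` and `mwr2mm` are hypothetical.

```python
def to_words(v, w, e):
    """Split integer v into e words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(v >> (w * j)) & mask for j in range(e)]


def from_words(words, w):
    """Reassemble an integer from w-bit words, least significant word first."""
    return sum(word << (w * j) for j, word in enumerate(words))


def mwr2mm(x, y, m, n, w):
    """Multiple-word radix-2 Montgomery multiplication (software sketch).

    Assumes m is odd and 0 <= x, y < m < 2**n.
    Returns S in [0, 2*m) with S * 2**n congruent to x * y modulo m.
    """
    e = -(-(n + 1) // w)                 # e = ceil((n + 1) / w)
    mask = (1 << w) - 1
    Y, M = to_words(y, w, e), to_words(m, w, e)
    S = [0] * e
    for i in range(n):                   # scan X bit by bit
        xi = (x >> i) & 1
        qi = (S[0] + xi * Y[0]) & 1      # parity decides whether M is added
        carry, prev = 0, 0
        for j in range(e):               # scan Y and M word by word
            t = S[j] + xi * Y[j] + qi * M[j] + carry
            carry, word = t >> w, t & mask
            if j > 0:
                # one-bit right shift: bit 0 of word j becomes the top bit of word j-1
                S[j - 1] = (prev >> 1) | ((word & 1) << (w - 1))
            prev = word
        # the final carry (0 or 1 under the stated bounds) is shifted into the top word
        S[e - 1] = (prev >> 1) | (carry << (w - 1))
    return from_words(S, w)
```

Note that the update of word j-1 needs bit 0 of word j, which is the data dependency the later slides set out to remove.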

  5. Background (cont.) • Data dependency in the original architecture [4] of the MWR2MM algorithm • Task A consists of three steps: • Test the parity of S, i.e. its least significant bit • Add the words from S, xi•Y, and M (if applicable) • Shift the S word right by one bit • Task B corresponds to the last two steps of Task A [4] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In CHES 99, LNCS, vol.1717, pp.94--108, 1999

  6. Background (cont.) • One PE is in charge of the computation of one column, which corresponds to updating S with respect to one single bit Xi. • The delay between two consecutive PEs is 2 clock cycles. • The minimum computation time is 2•n+e clock cycles, provided that ⌈(e+1)/2⌉ PEs are implemented to work in parallel.

  7. Avoid the extra clock cycle delay • The origin of the extra clock cycle delay • The computation of S(j-1) (of the next round) requires one extra bit from S(j) (of the current round): its least significant bit, S(j)_0 • Solution • Compute the two possible results of S(j) (of the next round) in the same clock cycle in which S(j+1) (of the current round) is computed, and choose between them at the end of the clock cycle

  8. Avoid the extra clock cycle delay (cont.) • One single PE is responsible for updating one fixed word of S • It has two branches, corresponding to the two possible values of S(i+1)_0, the least significant bit of S(i+1) • The correct results, the carry and S(i)_(w-1) (the most significant bit of S(i)), are selected by S(i+1)_0 from the two sets of possible results, which are available and registered at the same moment
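The precompute-then-select idea can be illustrated in software. The sketch below is a simplified model, not the presenter's circuit: it ignores registers and timing, and the names `pe_two_branch` and `select_branch` are hypothetical. It assumes the PE knows the low w-1 bits of its word and is only missing the top bit, which comes from the neighbouring word.

```python
def pe_two_branch(low_bits, xi, qi, yj, mj, carry_in, w):
    """Compute both candidate results for one word of S.

    low_bits : the known w-1 low bits of the (already shifted) word of S
    The unknown top bit is tried as 0 and as 1, giving two (carry, word) pairs.
    """
    mask = (1 << w) - 1
    candidates = []
    for msb in (0, 1):
        s_word = low_bits | (msb << (w - 1))         # assume a value for the missing bit
        t = s_word + xi * yj + qi * mj + carry_in    # same addition in both branches
        candidates.append((t >> w, t & mask))        # (carry out, result word)
    return candidates


def select_branch(candidates, actual_msb):
    """Pick the branch that matches the bit delivered by the neighbouring word."""
    return candidates[actual_msb]
```

The design trade-off is the one familiar from carry-select adders: the addition logic is duplicated so that the late-arriving bit only drives a multiplexer instead of delaying the whole addition.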

  9. The overall architecture • Every PE focuses on the computation of one single word of S [Figures: the computation pattern of the architecture in [4]; the computation pattern of the proposed architecture]

  10. The overall architecture (cont.) • The data dependency graph of the proposed architecture • Task D consists of three steps • Generate qi • Pre-compute two sets of data • Select one of the two sets • Task E corresponds to the last steps of Task D • Task F (not shown in the graph) is responsible for computing S(e-1) • It has only one branch

  11. The overall architecture (cont.) • e PEs are required to compute the e words of S, one word per PE. • Two shift registers run alongside these PEs: one supplies the single bits of X and the other supplies the parity bits S(0)_0 (the least significant bit of S(0)). • n+e-1 clock cycles are required to process the Montgomery multiplication of two n-bit operands.
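To put the two latency figures quoted in these slides side by side, here is a small calculation. It simply evaluates 2•n+e (original architecture, with enough PEs) and n+e-1 (proposed architecture) with e = ⌈(n+1)/w⌉. The function names are hypothetical.

```python
def words_needed(n, w):
    """e = ceil((n + 1) / w), the number of w-bit words used to store S."""
    return -(-(n + 1) // w)


def cycles_original(n, w):
    """Minimum clock cycles quoted for the architecture in [4]: 2*n + e."""
    return 2 * n + words_needed(n, w)


def cycles_proposed(n, w):
    """Clock cycles quoted for the proposed architecture: n + e - 1."""
    return n + words_needed(n, w) - 1


if __name__ == "__main__":
    n, w = 1024, 16
    print(words_needed(n, w), cycles_original(n, w), cycles_proposed(n, w))
    # prints: 65 2113 1088
```

For n=1024 and w=16 the proposed count is a little over half of the original one, consistent with the "half the time" claim in the conclusion.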

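  12. Demonstration of computation • Sequential: for bit X0, the words S(0), S(1), S(2), ..., S(e-1) are updated one after another • Tenca & Koç’s proposal: PE#0 processes X0, PE#1 processes X1, PE#2 processes X2, and so on; each PE steps through S(0), S(1), S(2), ... and starts later than the PE before it [Figure: timing diagrams of the sequential computation and of Tenca & Koç’s proposal]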

  13. Demonstration of computation (cont.) • The proposed optimized architecture: PE#0 always computes S(0), PE#1 always computes S(1), ..., PE#(e-1) always computes S(e-1); each PE consumes the bits X0, X1, X2, ... in successive clock cycles and starts later than the PE before it [Figure: timing diagram of the proposed architecture]
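The staggered pattern of the proposed architecture can be reproduced with a short schedule generator. It rests on one assumption that the transcript does not state explicitly but that matches the n+e-1 cycle count: PE#j processes bit Xi in clock cycle i+j (counting from 0). The function name `proposed_schedule` is hypothetical.

```python
def proposed_schedule(n, e):
    """Map clock cycle -> list of (pe, bit) pairs.

    Assumption for illustration: PE#j handles bit X_i at cycle i + j,
    i.e. each PE starts one cycle after its lower neighbour.
    """
    schedule = {}
    for j in range(e):          # one PE per word of S
        for i in range(n):      # every PE sees every bit of X
            schedule.setdefault(i + j, []).append((j, i))
    return schedule


if __name__ == "__main__":
    sched = proposed_schedule(n=4, e=3)
    for cycle in sorted(sched):
        busy = ["PE#%d: S(%d) <- X%d" % (j, j, i) for j, i in sched[cycle]]
        print(cycle, busy)
    print("total cycles:", len(sched))   # n + e - 1 under this assumption
```

Under this assumption the last activity is PE#(e-1) handling bit X(n-1) in cycle n+e-2, so the schedule spans n+e-1 cycles.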

  14. Resource utilization and performance comparison • Test platform: Xilinx Virtex-II 6000 FF1517-4

  15. High-radix architecture • The same optimization concept can be applied to a high-radix implementation • The number of pre-computation branches is 2^k (for radix 2^k) • Hardware implementation beyond radix-4 becomes less viable [Table: comparison between the radix-2 and radix-4 versions of the proposed architecture (n=1024, w=16)]

  16. Conclusion • An optimized hardware architecture for implementing the MWR2MM algorithm is proposed • The radix-2 version of this architecture takes n+e-1 clock cycles to process the Montgomery multiplication of two n-bit operands • Compared with the original architecture by Tenca & Koç, the new approach takes about half the processing time and incurs an area penalty of less than 10% • The same optimization technique can be applied to the original architecture by Tenca & Koç, keeping its scalability while halving the processing latency

  17. References • [1] Rivest, R.L., Shamir, A. and Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, vol.21, no.2, pp.120--126, 1978 • [2] Montgomery, P.L.: Modular multiplication without trial division. Mathematics of Computation, vol.44, no.170, pp.519--521, 1985 • [3] Gaj, K., et al.: Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware. In CHES 2006, LNCS, vol.4249, pp.119--133, 2006 • [4] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In CHES 99, LNCS, vol.1717, pp.94--108, 1999 • [5] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Trans. Computers, vol.52, no.9, pp.1215--1221, 2003 • [6] Tenca, A.F., Todorov, G. and Koç, Ç.K.: High-radix design of a scalable modular multiplier. In CHES 2001, LNCS, vol.2162, pp.185--201, 2001

  18. References (cont.) • [7] Harris, D., Krishnamurthy, R., Anders, M., Mathew, S. and Hsu, S.: An Improved Unified Scalable Radix-2 Montgomery Multiplier. In Proc. ARITH 17, pp.172--178, 2005 • [8] Michalski, E.A. and Buell, D.A.: A scalable architecture for RSA cryptography on large FPGAs. In Proc. FPL 2006, pp.145--152, 2006 • [9] Koç, Ç.K., Acar, T. and Kaliski Jr., B.S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, vol.16, no.3, pp.26--33, 1996 • [10] McIvor, C., McLoone, M. and McCanny, J.V.: High-Radix Systolic Modular Multiplication on Reconfigurable Hardware. In Proc. FPT 2005, pp.13--18, 2005 • [11] McIvor, C., McLoone, M. and McCanny, J.V.: Modified Montgomery Modular Multiplication and RSA Exponentiation Techniques. IEE Proceedings -- Computers & Digital Techniques, vol.151, no.6, pp.402--408, 2004
