
Workshop on Factorization, CITS Bochum, 11-12 September 2009

Implementing the Elliptic Curve Method (ECM) on Special-Purpose Hardware. Ralf Zimmermann, Tim Güneysu, Christof Paar. Horst Görtz Institute for IT-Security, Ruhr-University Bochum.


Presentation Transcript


  1. Workshop on Factorization, CITS Bochum, 11-12 September 2009. Implementing the Elliptic Curve Method (ECM) on Special-Purpose Hardware. Ralf Zimmermann, Tim Güneysu, Christof Paar. Horst Görtz Institute for IT-Security, Ruhr-University Bochum

  2. Outline • Introduction • Background and Arithmetic on Modern FPGAs • Overview of Elliptic Curve Method (ECM) • Initial Implementation of ECM (Phase 1) • Optimization and Implementation of ECM (Phase 1 & 2) • Results • Conclusions


  4. Design Goals and Decisions. Factoring example (figure): n = 7626668401080283463, p = 2349834551, q = 3245619313. Usage of the Elliptic Curve Method (ECM) in co-factorization. Goals of this work group: • ECM implementation for numbers of small bit length (up to 200 bits) • Implement both Phase 1 and Phase 2 • Usable on COPACOBANA for massive parallelism. Four individual phases of this work package: • Determine the best platform for ECM • Redesign COPACOBANA architecture for target platform • Implement ECM on the target platform • Optimize ECM on the target platform

  5. Selection Process • Task: Find the best platform for ECM (≠ CPUs)! • Based on three (predefined) platforms: • FPGAs: Spartan-3/Virtex-4 • Digital Signal Processors (DSP): TI C6713 • Smartcards with PK-Accelerator: Siemens SL88 • Winner of this competition: Virtex-4 FPGAs

  6. Design Goals and Decisions II Use of Elliptic Curves in Montgomery Form • Efficient formulas for hardware • Computation/storage of y-coordinate can be omitted External inputs provided by the host PC • Initial values: k (Phase 1), prime table (Phase 2) • Each unit: curve parameters, modified moduli • GCD on Host PC


  8. General FPGA Architecture • Configurability: by mesh of programmable elements • Configurable Logic Blocks (CLB) • Modern FPGAs contain tens of thousands of CLBs • Connection via interconnect (switching matrix) • Modern FPGAs contain hardcores • Dedicated memory elements • Arithmetic hardcores, e.g., to accelerate integer multiplication and addition • Embedded PowerPC processors • High-speed I/O Transceivers

  9. Generic FPGA Structure (simplified). Figure labels: long lines, switch matrix, input/output, Configurable Logic Block

  10. Embedded Memory Elements in Virtex-4 • 18kbit storage element (BlockRAM) • 400 MHz with connected output register • Flexible BRAM configuration • Dual-Port storage • Single-port storage • RAM or ROM • Cascading possible

  11. Digital Signal Processing (DSP) Hardcores • Fast signed 18x18-bit multiplication and signed 48-bit addition/subtraction (400 MHz) • Integrated pipeline register • Controllable by an OPMODE signal • DSP elements can be cascaded

  12. Simplified Architecture of Virtex-4 FPGAs. Figure labels: CLB, PowerPC (optional), DSP elements, Block RAM elements. Location and column-wise alignment of elements are important for place and route of FPGA designs!

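  13. Basic Modular Addition. Input: modulus M, operands A < M, B < M. Output: A+B (mod M). Algorithm: S = A+B; S' = A+B-M; if (b = 1) then return S, else return S', where b is the borrow of the subtraction. Example (decimal digits): A = 2354, B = 4875, M = 6888. S = 2354 + 4875 = 7229 (carries 0 1 1 0). S' = 7229 - 6888 = 0341 (borrows 0 1 1 0), final borrow b = 0, so S' = 0341 is returned.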
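
The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the hardware design from the talk; the function name `mod_add` is hypothetical, and the borrow is modelled as "the subtraction went negative".

```python
def mod_add(a, b, m):
    """Return (a + b) mod m for 0 <= a, b < m with one add and one subtract."""
    s = a + b                            # S  = A + B
    s_prime = s - m                      # S' = A + B - M
    borrow = 1 if s_prime < 0 else 0     # b = 1 when the subtraction underflows
    return s if borrow else s_prime      # b = 1: S was already reduced


# Slide example: A = 2354, B = 4875, M = 6888
print(mod_add(2354, 4875, 6888))  # 341
```

Both candidates S and S' are computed unconditionally and one is selected afterwards, which mirrors the "compute both, then choose via the borrow" structure of the slide.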

  14. DSP-supported Modular Addition. Figure: DSP-based adder datapath. Recoverable labels: operand word a_j (34 bit), words b_j and m_j split into bits 0..16 and 17..33 (18-bit inputs), PCOUT, CIN, MUX, MODE_SEL, output word s_j. Example values as on the previous slide: A = 2354, B = 4875, M = 6888, S = 7229, S' = 0341.

  15. Basic Modular Multiplication. Modular multiplication with quotient pipelining (Orup), operating on blocks of k bits. Input: modulus-dependent parameter M''(M, k, d), operands A, B. Output: S ≡ A · B · R^-1 (mod M). Algorithm (n is the number of rounds): for i = 0 to n do: q_i = S_i (mod 2^k); S_{i+1} = S_i / 2^k + q_i · M'' + b_i · A; return S_{n+1}. Figure labels: blocks a_0..a_n of A, b_0..b_n of B and m_0..m_n of M''; block selection of S_{i,0}; shift by k bits of S_{i,1}..S_{i,n}; scalar multiplication; accumulation.
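
The recurrence on this slide can be tried out in software. The sketch below covers only the simplest case of Orup's method, delay d = 0, where M'' = (M~ + 1) / 2^k with M~ = (-M^-1 mod 2^k) · M; the slide does not state which d the hardware uses, so that choice is an assumption. The function names are hypothetical, and the result is congruent to A · B · 2^(-k·n) (mod M) but not fully reduced.

```python
def orup_parameter(m, k):
    """M'' for delay d = 0 (assumption): (M~ + 1) / 2^k, M~ = (-M^-1 mod 2^k) * M."""
    r = 1 << k
    # M~ is congruent to -1 mod 2^k; requires odd M and Python 3.8+ for pow(m, -1, r)
    m_tilde = ((-pow(m, -1, r)) % r) * m
    return (m_tilde + 1) >> k


def montgomery_multiply(a, b, m, k, n):
    """Run rounds i = 0..n of S_{i+1} = S_i / 2^k + q_i * M'' + b_i * A."""
    m_pp = orup_parameter(m, k)
    mask = (1 << k) - 1
    s = 0
    for i in range(n + 1):
        q = s & mask                       # q_i = S_i mod 2^k
        b_i = (b >> (k * i)) & mask        # i-th k-bit block of B
        s = (s >> k) + q * m_pp + b_i * a  # shift, then accumulate both products
    return s                               # congruent to A*B*2^(-k*n) mod M


# 17-bit blocks as on the DSP slides; n = 9 rounds covers operands below 2^170
m = (1 << 151) - 1                         # any odd modulus works for this check
a, b = 123456789123456789, 987654321987654321
s = montgomery_multiply(a, b, m, k=17, n=9)
print(s % m == a * b * pow(1 << (17 * 9), -1, m) % m)
```

Because q_i is taken directly from the low block of S_i, no multiplication is needed to obtain the quotient digit, which is what makes the scheme attractive for pipelined multipliers.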


  17. Introduction to Elliptic Curve Method • Phase 1 and 2 of the elliptic curve method • Phase 1 was originally proposed by Lenstra • Idea: Pollard's p-1 method adapted for elliptic curves • Brent and Montgomery extended Phase 1 by a continuation (Phase 2) • Cost-intensive operations in Phase 1 & 2 • Phase 1 is basically a scalar multiplication • Gaj et al. described an algorithm for the standard continuation (CHES 2006) • Standard continuation in hardware • Precomputations • Scalar multiplications jQ • Storing primes in a table • Main computations • Point addition • Accumulation of product d
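
Since Phase 1 is essentially one scalar multiplication kP, it may help to see how such a scalar can be built. The sketch below uses the common textbook choice, k = product of the largest prime powers not exceeding B1; the slides only say that k is supplied by the host PC, so this construction and the function names are assumptions.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes, returns all primes <= limit (limit >= 1)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]


def phase1_scalar(b1):
    """k = product over primes p <= B1 of the largest power p^e <= B1."""
    k = 1
    for p in primes_up_to(b1):
        pe = p
        while pe * p <= b1:
            pe *= p
        k *= pe
    return k


print(phase1_scalar(10))  # 2520 = 8 * 9 * 5 * 7
```

With this choice, kP is the point at infinity modulo a prime factor p of n whenever the order of P on the curve over F_p divides k, which is the event Phase 1 tries to detect.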


  19. Example: ECM Operations • Arithmetic unit comprises 2 multipliers and 1 addition/subtraction component • Example: concurrent point doubling/point addition in 6 steps • Implementation of Montgomery Ladder for ECC point multiplication
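
The Montgomery ladder mentioned here can be sketched in software with x-only arithmetic on a curve By² = x³ + Ax² + x, using the standard projective doubling and differential-addition formulas. This is not the scheduling used on the FPGA; the curve selection (small integers A with starting x = 2), the use of lcm(1..B1) as scalar, and all function names are simplifying assumptions for illustration.

```python
from math import gcd


def xdbl(x, z, a24, n):
    """x-only doubling; a24 = (A + 2) / 4 mod n."""
    s = (x + z) ** 2 % n
    d = (x - z) ** 2 % n
    t = (s - d) % n                       # t = 4xz
    return s * d % n, t * (d + a24 * t) % n


def xadd(xp, zp, xq, zq, xd, zd, n):
    """Differential addition: needs the difference P - Q = (xd : zd)."""
    u = (xp - zp) * (xq + zq) % n
    v = (xp + zp) * (xq - zq) % n
    return zd * (u + v) ** 2 % n, xd * (u - v) ** 2 % n


def ladder(k, x, a24, n):
    """Return (X : Z) of [k](x : 1) for k >= 1; R1 - R0 = P holds in every step."""
    r0, r1 = (x % n, 1), xdbl(x, 1, a24, n)
    for bit in bin(k)[3:]:                # skip '0b' and the leading 1 bit
        if bit == "1":
            r0, r1 = xadd(*r0, *r1, x, 1, n), xdbl(*r1, a24, n)
        else:
            r0, r1 = xdbl(*r0, a24, n), xadd(*r0, *r1, x, 1, n)
    return r0


def phase1(n, a, x, b1):
    """One Phase 1 attempt on odd n; returns gcd(Z, n), which may be 1 or n."""
    k = 1
    for i in range(2, b1 + 1):            # k = lcm(1..B1), an assumption
        k = k * i // gcd(k, i)
    a24 = (a + 2) * pow(4, -1, n) % n     # Python 3.8+
    _, z = ladder(k, x, a24, n)
    return gcd(z, n)


n = 1009 * 100003
factors = {g for a in range(3, 60) for g in [phase1(n, a, 2, 100)] if 1 < g < n}
print(factors)
```

Each ladder step performs one doubling and one differential addition regardless of the bit value, which is why the two operations can run concurrently on a unit with two multipliers.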

  20. Initial Project Implementation • Initial project implemented only Phase 1 • Pros: • Proof-of-concept implementation • DSP implementation of multiplier and addition/subtraction • 8 units calculating Phase 1 • Cons: • Very complex instructions (size and routing) • Addition works with a different bit width than multiplication • Special way to feed the multiplier with data • Design reaches the DSP limit • Extension to Phase 2 might not be possible

  21. Initial ECM System Architecture (figure labels: ECM system, ECM unit) • 8 independent units • Supports up to 151-bit numbers • Each unit computes a different parameter set

  22. Initial ECM Systems on COPACOBANA • Host computer: performs (simple) precomputations • COPACOBANA: performs cost-intensive operations • Each module consists of 8 Virtex-4 FPGAs • For more information about COPACOBANA, please see the talk tomorrow

  23. Initial Multiplication with DSPs • 151-bit multiplier using 10 DSP elements • Special structure: a_j || a_{j+5} and m_j || m_{j+5} • Output: s_j after 66 clock cycles in 17-bit blocks

  24. Initial Data Storage (figure labels: storage block I, storage block II, storage block III) • BRAM 0 to 5 contain input values for the multipliers • BRAM 6/7 contain input values for the addition/subtraction unit • BRAM 8 contains moduli for multiplication and addition/subtraction • Multipliers transfer 17-bit resulting blocks to storage block I (BRAM 0-2), or storage block II (BRAM 3-5), and/or storage block III (BRAM 6-7) • Addition/subtraction transfers 34-bit resulting blocks to one of storage blocks I/II/III. Too complex! Routing kills the frequency → max. 100 / 400 MHz


  26. Optimization of Addition/Subtraction • 17-bit internal bus width • No conversion between add/sub (34 bit) and multiplication (17 bit) • Less space in fabric when buffering bus signals • Better routing possible • Changed code of modular addition/subtraction algorithm • Maximized clock frequency under optimal conditions • Removed output buffers and distributed RAM

  27. Optimization of Multiplication • Previous approach to multiplying: • Use n DSPs to multiply n*17-bit values (parallel) → (n+1) * 6 cycles • New approach: • Use 3 DSPs to multiply n*17-bit values (sequential) → 6 + (n+2)*n cycles • Here n = 10: reduce DSPs by 7 at the cost of 60 additional clock cycles. Figure: DSP usage over time (DSP index 0 to 10; cycle marks at 0, 66, 126)
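
The two cycle-count formulas on this slide are easy to compare numerically. The following lines simply evaluate them; the function names are hypothetical.

```python
def cycles_parallel(n):
    """Previous approach: n DSPs in parallel, (n + 1) * 6 cycles."""
    return (n + 1) * 6


def cycles_sequential(n):
    """New approach: 3 DSPs used sequentially, 6 + (n + 2) * n cycles."""
    return 6 + (n + 2) * n


n = 10
print(cycles_parallel(n), cycles_sequential(n))    # 66 126
print(cycles_sequential(n) - cycles_parallel(n))   # 60
print(n - 3)                                       # 7 DSPs saved
```

The sequential variant grows quadratically in n while the parallel one grows linearly, so the trade of 7 DSPs for 60 cycles is specific to n = 10.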

  28. Optimization of Multiplication • Multiplication results (table not included in this transcript) • Algorithmic-Logic Unit (ALU) results (table not included in this transcript) • Promising results for the basic structure!

  29. Optimization of Memory Usage • Instructions • Reduced bit width by a factor > 2.5 • Instructions for all ECM operations: 1 BRAM • ECM instructions merged with the ECM parameter k and the prime table • Memory usage per unit • ALU: 6 BRAMs • ECM unit: 1 BRAM. 1 – Project 1 uses global instructions for Phase 1, using 17 BRAMs for 8 units. 2 – Project 2 uses 2 BRAMs, one for ALU instructions and one for ECM instructions, for both phases.

  30. Implementation of ECM Phase 2 • Different clock domains • 100 MHz for ECM finite state machine (FSM) • 200 MHz for algorithmic-logic unit (ALU) • Workspace RAM: ALU playground • 32 cells, each containing 2 blocks • Each block stores 16 values of 17 bits • → 17,408 bits (equals 1 BRAM). Figure labels (RAM layout): Phase 2 precomputation; Phase 2; P1/P2 temporary result storage; point addition/doubling results; P2 precomputation temporary points

  31. Optimizing Phase 2 Precalculation • From Gaj et al. (CHES 2006): • Calculate the set JS of integers j • Use the j-th multiples of Q0

  32. Optimizing Phase 2 Precalculation II • Suggested parameter D = 210 for hardware • Which multiples of Q0 are needed? • JS(D) = {1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103} • Calculation suggested in [1] calculates: JS(D) = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105} • Are there more needed? • 210*Q0 and MMIN*D*Q0 • But: Point addition on elliptic curves? • To calculate (Px : : Pz) + (Qx : : Qz) efficiently, we need the coordinates ([P-Q]x : : [P-Q]z) • → What to do with R = R + Q?
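
The first set on this slide matches the integers 1 ≤ j ≤ D/2 that are coprime to D. Assuming that definition of JS(D), the short sketch below reproduces the 24 entries for D = 210 and contrasts them with the 53 odd multiples listed second; the function name `js` is hypothetical.

```python
from math import gcd


def js(d):
    """Integers 1 <= j <= d/2 with gcd(j, d) = 1 (assumed definition of JS(D))."""
    return [j for j in range(1, d // 2 + 1) if gcd(j, d) == 1]


D = 210
needed = js(D)
all_odd = list(range(1, D // 2 + 1, 2))   # 1, 3, 5, ..., 105

print(needed)
print(len(needed), len(all_odd))          # 24 53
```

Only 24 of the 53 odd multiples are actually required as jQ0, so computing every odd multiple spends more than half of the precomputed points on values that are never used.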

  33. Optimizing Phase 2 Precalculation III. What about MMIN * D? • MMIN = round(B1 / D) [math] • Suggested B1 = 960 → MMIN = 5 • MMIN*D = 1050. What about [R-Q]? • Start: R = 5*D*Q0, Q = D*Q0 • Point addition needs: [R-Q] = 4*D*Q0 = 840*Q0 • Iteration increases factor by one: 5*D*Q0, 6*D*Q0, … Calculations suggested: • 1 D + 52 A + {210, 1050}*Q0 [(8 + 11) AD using Montgomery ladder] • 840*Q0 never mentioned [10 AD using Montgomery ladder] • → not implemented and/or inefficient
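
To see why MMIN = 5 and the set JS(210) suffice, one can write every prime p with B1 < p ≤ B2 as p = m·D ± j. The sketch below assumes the usual standard-continuation decomposition with m = round(p / D) and uses B2 = 57600 from the results slides; the helper names are hypothetical.

```python
from math import gcd


def primes_in(lo, hi):
    """Primes p with lo < p <= hi (simple sieve)."""
    sieve = bytearray([1]) * (hi + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(hi ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, hi + 1, p)))
    return [p for p in range(lo + 1, hi + 1) if sieve[p]]


def split_prime(p, d):
    """Write p = m*d + j or p = m*d - j with m = round(p / d); return (m, j)."""
    m = (p + d // 2) // d
    return m, abs(p - m * d)


B1, B2, D = 960, 57600, 210
M_MIN = (B1 + D // 2) // D                # round(960 / 210) = 5
JS = {j for j in range(1, D // 2 + 1) if gcd(j, D) == 1}

pairs = [split_prime(p, D) for p in primes_in(B1, B2)]
print(M_MIN, M_MIN * D, (M_MIN - 1) * D)  # 5 1050 840
print(all(j in JS for _, j in pairs))     # every prime is covered by some j in JS
print(min(m for m, _ in pairs))           # smallest m that occurs
```

The first prime above B1, 967, is 1050 - 83, so the iteration over R = m·D·Q0 indeed has to start at m = 5, and its first differential addition R + Q needs R - Q = 840·Q0 as stated on the slide.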

  34. Implementation of Phase 2 Precalculation • Use a chain of point additions / doublings to calculate multiples • Number of operations: 30 A + 7 D • Number of points in RAM: 27 + 2 temporary • → reduce number of operations drastically • → calculate every point needed for addition • → calculate every point needed for Phase 2

  35. Implementation of Phase 2 Precalculation II • Parameters might change – what now? • Change of B1 results in a change of MMIN • MMIN = 0 not possible • MMIN = 1: first R + Q changes to doubling Q • MMIN > 1 is implementable using this strategy • Required space in RAM unchanged • Required space in fabric unchanged • Varies the number of operations in precomputation


  37. Results of the new ECM implementation. Figure: timeline of one run (data transfer, Phase 1, Phase 2, data transfer). Results for ECM Phase 1 + 2 (B1 = 960 and B2 = 57600)

  38. Results for ECM Phase 1 + 2. ECM Phase 1 + 2 using B1 = 960, B2 = 57600 [1723 bit k]: 69,120 ECM calculations / second on COPACOBANA¹. ¹ COPACOBANA using 4 FPGA modules (= 32 FPGAs)

  39. Comparison (ECM Phase 1). Comparing ECM Phase 1 implementations using B1 = 960, B2 = 57000 [1323 bit k]. ¹ Cost for 1 FPGA: www.em.avnet.com (September 2009)


  41. Conclusions • Novel and complete implementation • Implementation results: implementation of phase 2 on FPGA – not just estimates • Optimization results: at least twice as effective as the documented result • Generic and scalable system • Architecture can be easily optimized for (n*17) – 2 bit with 2 <= n <= 14 [implemented: n = 9] • only small changes in fabric/resources, but exchange instruction ROM • Highly parallel architecture for co-factorization • Multiple ECM-units per FPGA (24 units supporting 151-bit on Virtex-4 SX 35) • Multiple FPGAs on the COPACOBANA cluster (128 Virtex-4 SX 35)

  42. Thank you for your attention! Any Questions? Ralf Zimmermann, Tim Güneysu, Christof Paar

  43. Redesign of FPGA Module for ECM. Original: 6x Spartan-3 XC3S1000. Redesign: 8x Virtex-4 XC4VSX35

  44. COPACOBANA with Virtex-4 Devices
