Workshop on Factorization, CITS Bochum, 11-12 September 2009. Implementing the Elliptic Curve Method (ECM) on Special-Purpose Hardware. Ralf Zimmermann, Tim Güneysu, Christof Paar. Horst Görtz Institute for IT-Security, Ruhr-University Bochum.
Outline • Introduction • Background and Arithmetic on Modern FPGAs • Overview of Elliptic Curve Method (ECM) • Initial Implementation of ECM (Phase 1) • Optimization and Implementation of ECM (Phase 1 & 2) • Results • Conclusions
Design Goals and Decisions [Figure: example factorization n = 7626668401080283463 = p · q with p = 2349834551 and q = 3245619313] Usage of the Elliptic Curve Method (ECM) in co-factorization Goals of this work group • ECM implementation for small numbers (up to 200 bits) • Implement both Phase 1 and Phase 2 • Usable on COPACOBANA for massive parallelism Four individual phases of this work package • Determine the best platform for ECM • Redesign the COPACOBANA architecture for the target platform • Implement ECM on the target platform • Optimize ECM on the target platform
Selection Process • Task: find the best platform for ECM (≠ CPUs)! • Based on three (predefined) platforms: • FPGAs: Spartan-3/Virtex-4 • Digital Signal Processors (DSP): TI C6713 • Smartcards with PK-Accelerator: Siemens SL88 Winner of this competition: Virtex-4 FPGAs
Design Goals and Decisions II Use of Elliptic Curves in Montgomery Form • Efficient formulas for hardware • Computation/storage of y-coordinate can be omitted External inputs provided by the host PC • Initial values: k (Phase 1), prime table (Phase 2) • Each unit: curve parameters, modified moduli • GCD on Host PC
General FPGA Architecture • Configurability: by mesh of programmable elements • Configurable Logic Blocks (CLB) • Modern FPGAs contain tens of thousands of CLBs • Connection via interconnect (switching matrix) • Modern FPGAs contain hardcores • Dedicated memory elements • Arithmetic hardcores, e.g., to accelerate integer multiplication and addition • Embedded PowerPC processors • High-speed I/O Transceivers
Generic FPGA Structure (simplified) [Figure: labeled elements are long lines, switch matrix, input/output blocks, and configurable logic blocks]
Embedded Memory Elements in Virtex-4 • 18kbit storage element (BlockRAM) • 400 MHz with connected output register • Flexible BRAM configuration • Dual-Port storage • Single-port storage • RAM or ROM • Cascading possible
Digital Signal Processing (DSP) Hardcores • Fast signed 18x18-bit multiplication and signed 48-bit addition/subtraction (400 MHz) • Integrated pipeline register • Controllable by an OPMODE signal • DSP elements can be cascaded
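As a rough software model of how wide operands can be mapped onto such hardcores, the sketch below splits integers into unsigned 17-bit limbs (which fit the signed 18-bit multiplier inputs) and accumulates the partial products column by column. The function names and the column-wise schedule are illustrative assumptions, not the scheduling used in the presented design.

```python
LIMB_BITS = 17                      # unsigned 17-bit limbs fit a signed 18-bit input
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 17-bit limbs."""
    assert 0 <= x < 1 << (LIMB_BITS * n)
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 17-bit limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product of two limb vectors; returns len(a) + len(b) limbs."""
    n, m = len(a), len(b)
    out, acc = [], 0
    for col in range(n + m - 1):
        # Sum all 17x17-bit partial products that land in this column.
        for i in range(max(0, col - m + 1), min(n, col + 1)):
            acc += a[i] * b[col - i]
        assert acc < 1 << 47        # stays within a signed 48-bit accumulator
        out.append(acc & LIMB_MASK)
        acc >>= LIMB_BITS           # carry into the next column
    out.append(acc)
    return out
```

With 9 limbs per operand, a 151-bit value fits with two bits to spare, and each column sum stays far below the 48-bit accumulator limit.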
Simplified Architecture of Virtex-4 FPGAs [Figure: floorplan showing CLBs, an optional PowerPC core, DSP elements, and Block RAM elements] Location and column-wise alignment of elements are important for place and route of FPGA designs!
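Basic Modular Addition
Input: modulus M; operands A, B with 0 ≤ A, B < M
Output: A + B (mod M)
S = A + B
S' = A + B - M, with final borrow b
If (b = 1) then Return S Else Return S'
Example (decimal digits): A = 2354, B = 4875, M = 6888. S = 2354 + 4875 = 7229 (carries 0 1 1 0); S' = 7229 - 6888 = 0341 (borrows 0 1 1 0). The final borrow is b = 0, so the result is S' = 341.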
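The selection between S and S' can be sketched in a few lines of Python. This is a minimal software model, assuming both operands are already reduced modulo M; the function name is hypothetical, and the borrow is modeled as the sign of the subtraction result.

```python
def mod_add(a, b, m):
    """Return (a + b) mod m for 0 <= a, b < m without any division."""
    assert 0 <= a < m and 0 <= b < m
    s = a + b                  # S  = A + B
    s_prime = s - m            # S' = A + B - M
    borrow = 1 if s_prime < 0 else 0
    # A borrow means A + B was already smaller than M, so S is the result.
    return s if borrow else s_prime
```

For the worked example, `mod_add(2354, 4875, 6888)` returns 341.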
DSP-supported Modular Addition [Figure: DSP datapath processing 34-bit blocks a_j, with b_j and m_j split into 17-bit halves (bits 0..16 and 17..33); labeled signals include PCOUT, CIN, a multiplexer (MUX) controlled by MODE_SEL, and the output block s_j. The figure reuses the example A = 2354, B = 4875, M = 6888 with S = 7229 and S' = 0341.]
Basic Modular Multiplication: Modular Multiplication with Quotient Pipelining (Orup)
Input: modulus-dependent parameter $M''(M, k, d)$; operands $A$, $B$, processed in blocks of $k$ bits
Output: $S \equiv A \cdot B \cdot R^{-1} \pmod{M}$
// n is the number of rounds
for i = 0 to n do
$q_i = S_i \bmod 2^k$
$S_{i+1} = S_i / 2^k + q_i M'' + b_i A$ (integer division, i.e. a shift by $k$ bits)
Return $S_{n+1}$
[Figure: operands $A = (a_0, \dots, a_n)$, $B = (b_0, \dots, b_n)$ and $M'' = (m_0, \dots, m_n)$ in blocks of $k$ bits; each round selects block $S_{i,0}$, performs the scalar multiplications and the accumulation over the blocks $S_{i,j}$, and shifts the result by $k$ bits.]
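The recurrence can be checked with a small Python model. This is a sketch under stated assumptions: it uses the variant without quotient delay (d = 0), which matches the recurrence as written on the slide, an odd modulus, and R = 2^(k·n); the function names are hypothetical, and the result is congruent to A·B·R^-1 but not fully reduced.

```python
def orup_params(m, k):
    """Precompute M'' for an odd modulus m and block width k (delay d = 0)."""
    assert m % 2 == 1
    r = 1 << k
    m_prime = (-pow(m, -1, r)) % r      # M' = -M^-1 mod 2^k (Python 3.8+)
    m_tilde = m_prime * m               # multiple of M, congruent to -1 mod 2^k
    return (m_tilde + 1) >> k           # M'' = (M~ + 1) / 2^k, an exact division


def orup_mul(a, b, m, k, n):
    """Return S with S = a * b * 2^(-k*n) (mod m); S is not fully reduced."""
    assert 0 <= b < 1 << (k * n)        # so the extra block b_n is zero
    m2 = orup_params(m, k)
    mask = (1 << k) - 1
    s = 0
    for i in range(n + 1):
        q = s & mask                    # q_i = S_i mod 2^k
        b_i = (b >> (k * i)) & mask     # i-th k-bit block of B
        s = (s >> k) + q * m2 + b_i * a # S_{i+1} = S_i / 2^k + q_i M'' + b_i A
    return s
```

Each round only needs the low k bits of the running sum to form q_i, which is what makes the scheme attractive for block-wise multipliers.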
Introduction to the Elliptic Curve Method • Phase 1 and 2 of the elliptic curve method • Phase 1 was originally proposed by Lenstra • Idea: Pollard's p-1 method adapted for elliptic curves • Brent and Montgomery extended Phase 1 by a continuation (Phase 2) • Cost-intensive operations in Phase 1 & 2 • Phase 1 is basically a scalar multiplication • Gaj et al. described an algorithm for the standard continuation (CHES 2006) • Standard continuation in hardware • Precomputations • Scalar multiplications jQ • Storing primes in a table • Main computations • Point addition • Accumulation of product d
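To make the Phase 1 scalar concrete, the sketch below builds k in the usual textbook way, as the product of the largest prime powers not exceeding B1 (equivalently lcm(1, ..., B1)). That this is exactly how the host computes k in this work is an assumption, and the function names are hypothetical.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit (limit >= 1)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            # Clear every multiple of p starting at p*p.
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]


def phase1_scalar(b1):
    """Product over primes p <= b1 of the largest power of p that is <= b1."""
    k = 1
    for p in primes_up_to(b1):
        pe = p
        while pe * p <= b1:
            pe *= p
        k *= pe
    return k
```

Phase 1 then computes Q = kP on the curve; a factor of the modulus shows up when the z-coordinate of Q shares a non-trivial gcd with it.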
Example: ECM Operations • Arithmetic unit comprises 2 multipliers and 1 addition/subtraction component • Example: concurrent point doubling/point addition in 6 steps • Implementation of Montgomery Ladder for ECC point multiplication
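The ladder itself can be sketched with the standard x-only formulas for curves in Montgomery form. This is a software illustration, not the 6-step hardware schedule from the slide: points are (X : Z) pairs, `a24` stands for (A + 2)/4 modulo n, and all function names are assumptions. Each loop iteration performs one doubling and one differential addition, which is the pair of operations the arithmetic unit executes concurrently.

```python
def xdbl(P, a24, n):
    """x-only doubling on a Montgomery curve, with a24 = (A + 2) / 4 mod n."""
    X, Z = P
    s = (X + Z) * (X + Z) % n
    d = (X - Z) * (X - Z) % n
    t = (s - d) % n                      # equals 4*X*Z
    return (s * d % n, t * (d + a24 * t) % n)


def xadd(P, Q, D, n):
    """x-only differential addition: x(P + Q) from x(P), x(Q) and D = x(P - Q)."""
    (XP, ZP), (XQ, ZQ), (XD, ZD) = P, Q, D
    u = (XP - ZP) * (XQ + ZQ) % n
    v = (XP + ZP) * (XQ - ZQ) % n
    return (ZD * (u + v) ** 2 % n, XD * (u - v) ** 2 % n)


def ladder(k, P, a24, n):
    """Montgomery ladder: return the (X : Z) coordinates of k*P for k >= 1."""
    assert k >= 1
    R0, R1 = P, xdbl(P, a24, n)          # invariant: R1 - R0 = P
    for bit in bin(k)[3:]:               # skip '0b' and the leading 1 bit
        if bit == "0":
            R0, R1 = xdbl(R0, a24, n), xadd(R0, R1, P, n)
        else:
            R0, R1 = xadd(R0, R1, P, n), xdbl(R1, a24, n)
    return R0
```

Because the difference R1 - R0 stays equal to P throughout, the y-coordinate is never needed, which is why its computation and storage can be omitted.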
Initial Project Implementation • Initial project implemented only Phase 1 • Pros: • Proof-of-concept implementation • DSP implementation of multiplier and addition/subtraction • 8 units calculating Phase 1 • Cons: • Very complex instructions (size and routing) • Addition works with a different bit width than multiplication • Special way to feed the multiplier with data • Design reaches the DSP limit • Extension to Phase 2 might not be possible
Initial ECM System Architecture [Figure: ECM system containing the ECM units] • 8 independent units • Supports up to 151-bit numbers • Each unit computes a different parameter set
Initial ECM Systems on COPACOBANA • Host computer: performs (simple) precomputations • COPACOBANA: performs cost-intensive operations • Each module consists of 8 Virtex-4 FPGAs • For more information about COPACOBANA, please see the talk tomorrow
Initial Multiplication with DSPs • 151-bit multiplier using 10 DSP elements • Special structure: a_j || a_{j+5} and m_j || m_{j+5} • Output: s_j after 66 clock cycles in 17-bit blocks
Initial Data Storage [Figure: storage blocks I, II, and III] • BRAM 0 to 5 contain input values for the multipliers • BRAM 6/7 contain input values for the addition/subtraction unit • BRAM 8 contains moduli for multiplication and addition/subtraction • Multipliers transfer 17-bit result blocks to storage block I (BRAM 0-2), storage block II (BRAM 3-5), and/or storage block III (BRAM 6-7) • Addition/subtraction transfers 34-bit result blocks to one of storage blocks I/II/III Too complex! Routing kills the frequency: max. 100 of 400 MHz
Optimization of Addition/Subtraction • 17-bit internal bus width • No conversion between add/sub (34 bit) and multiplication (17 bit) • Less space in fabric when buffering bus signals • Better routing possible • Changed code of modular addition/subtraction algorithm • Maximized clock frequency under optimal conditions • Removed output buffers and distributed RAM
Optimization of Multiplication • Previous approach to multiplying: • Use n DSPs to multiply n*17-bit values (parallel): (n+1) * 6 cycles • New approach: • Use 3 DSPs to multiply n*17-bit values (sequential): 6 + (n+2)*n cycles • Here n = 10: reduces the DSP count by 7 at the cost of 60 additional clock cycles (66 vs. 126 cycles) [Figure: DSP utilization (DSP 0 to 10) over clock cycles 0, 66, and 126]
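The trade-off follows directly from the two cycle formulas. The small sketch below simply evaluates them; the function names are hypothetical.

```python
def cycles_parallel(n):
    """Previous approach: n DSPs working in parallel."""
    return (n + 1) * 6


def cycles_sequential(n):
    """New approach: 3 DSPs processing the blocks sequentially."""
    return 6 + (n + 2) * n


def dsp_savings(n, new_dsps=3):
    """Return (DSPs saved, extra clock cycles) when switching approaches."""
    return n - new_dsps, cycles_sequential(n) - cycles_parallel(n)
```

For n = 10 this gives 66 and 126 cycles, i.e. 7 fewer DSPs for 60 additional cycles per multiplication.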
Optimization of Multiplication • Multiplication results: [table not preserved] • Arithmetic-Logic Unit (ALU) results: [table not preserved] Promising results for the basic structure!
Optimization of Memory Usage • Instructions • Reduced bit width by a factor > 2.5 • Instructions for all ECM operations: 1 BRAM • ECM instructions merged with ECM parameters k and prime table • Memory usage per unit • ALU: 6 BRAMs • ECM unit: 1 BRAM Footnotes: (1) Project 1 uses global instructions for Phase 1, using 17 BRAMs for 8 units. (2) Project 2 uses 2 BRAMs, one for the ALU and one for ECM instructions, for both phases.
Implementation of ECM Phase 2 • Different clock domains • 100 MHz for ECM finite state machine (FSM) • 200 MHz for arithmetic-logic unit (ALU) • Workspace RAM: ALU playground • 32 cells, each containing 2 blocks • Each block stores 16 values of 17 bits • 32 × 2 × 16 × 17 = 17,408 bits (equals 1 BRAM) [Figure: workspace RAM layout with regions for Phase 2 precomputation, Phase 1/Phase 2 temporary result storage, point addition/doubling results, and temporary points of the Phase 2 precomputation]
Optimizing Phase 2 Precalculation • From Gaj et al. (CHES 2006): • Calculate the set JS of integers j • Use the multiples jQ0 of Q0
Optimizing Phase 2 Precalculation II • Suggested parameter D = 210 for hardware • Which multiples of Q0 are needed? • JS(D) = {1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103} • The calculation suggested in [1] computes: JS(D) = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105} • Are more multiples needed? • 210*Q0 and MMIN*D*Q0 • But: point addition on elliptic curves? • To calculate (Px : : Pz) + (Qx : : Qz) efficiently, we need the coordinates ([P-Q]x : : [P-Q]z) • What to do with R = R + Q?
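The two candidate sets can be reproduced with a short sketch. It assumes JS(D) is the set of integers 1 ≤ j ≤ D/2 that are coprime to D, which matches the 24 values listed for D = 210; the function names are hypothetical.

```python
from math import gcd


def js(d):
    """Integers 1 <= j <= d/2 with gcd(j, d) = 1 (assumed definition of JS)."""
    return [j for j in range(1, d // 2 + 1) if gcd(j, d) == 1]


def odd_multiples(d):
    """All odd j up to d/2, i.e. the larger set computed by the naive chain."""
    return list(range(1, d // 2 + 1, 2))
```

For D = 210 the reduced set has 24 entries, while the chain over all odd multiples touches 53 points, more than twice as many as Phase 2 actually uses.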
Optimizing Phase 2 Precalculation III What about MMIN * D? • MMIN = round(B1 / D) [math] • Suggested B1 = 960 ⇒ MMIN = 5 • MMIN*D = 1050 What about [R-Q]? • Start: R = 5*D*Q0, Q = D*Q0 • Point addition needs: [R-Q] = 4*D*Q0 = 840*Q0 • Iteration increases the factor by one: 5*D*Q0, 6*D*Q0, … Calculations suggested: • 1 D + 52 A + {210, 1050}*Q0 [(8 + 11) AD using Montgomery ladder] • 840*Q0 never mentioned [10 AD using Montgomery ladder] • not implemented and/or inefficient
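Why the iteration starts at MMIN = 5 can be illustrated by writing every prime between B1 and B2 as p = m·D ± j, which is the usual idea behind the standard continuation. The sketch below is an illustration under that assumption, with hypothetical function names and a simple trial-division primality test in place of the stored prime table.

```python
def is_prime(x):
    """Trial division; sufficient for the small bounds used here."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


def phase2_pairs(b1, b2, d):
    """For each prime b1 < p <= b2, return (p, m, sign, j) with p = m*d + sign*j."""
    pairs = []
    for p in range(b1 + 1, b2 + 1):
        if not is_prime(p):
            continue
        m, r = divmod(p, d)
        if r > d // 2:
            pairs.append((p, m + 1, -1, d - r))   # p = (m + 1)*d - (d - r)
        else:
            pairs.append((p, m, +1, r))           # p = m*d + r
    return pairs
```

With B1 = 960, B2 = 57600 and D = 210, the smallest multiplier that occurs is m = 5 (for example 967 = 5·210 - 83), in line with MMIN = round(960 / 210) = 5.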
Implementation of Phase 2 Precalculation • Use a chain of point additions / doublings to calculate multiples • Number of operations: 30 A + 7 D • Number of points in RAM: 27 + 2 temporary • Reduces the number of operations drastically • Calculates every point needed for addition • Calculates every point needed for Phase 2
Implementation of Phase 2 Precalculation II • Parameters might change – what now? • Change of B1 results in a change of MMIN • MMIN = 0 not possible • MMIN = 1: first R + Q changes to doubling Q • MMIN > 1 is implementable using this strategy • Required space in RAM unchanged • Required space in fabric unchanged • Varies the number of operations in precomputation
Results of the new ECM implementation [Figure: timeline with the stages data transfer, Phase 1, Phase 2, data transfer] Results for ECM Phase 1 + 2 (B1 = 960 and B2 = 57600)
Results for ECM Phase 1 + 2 • ECM Phase 1 + 2 using B1 = 960, B2 = 57600 [1723 bit k] • 69,120 ECM calculations / second on COPACOBANA (1) • (1) COPACOBANA using 4 FPGA modules (= 32 FPGAs)
Comparison (ECM Phase 1) • Comparing ECM Phase 1 implementations using B1 = 960, B2 = 57000 [1323 bit k] • (1) Cost for 1 FPGA: www.em.avnet.com (September 2009)
Conclusions • Novel and complete implementation • Implementation results: implementation of phase 2 on FPGA – not just estimates • Optimization results: at least twice as effective as the documented result • Generic and scalable system • Architecture can be easily optimized for (n*17) – 2 bit with 2 <= n <= 14 [implemented: n = 9] • only small changes in fabric/resources, but exchange instruction ROM • Highly parallel architecture for co-factorization • Multiple ECM-units per FPGA (24 units supporting 151-bit on Virtex-4 SX 35) • Multiple FPGAs on the COPACOBANA cluster (128 Virtex-4 SX 35)
Thank you for your attention! Any questions? Ralf Zimmermann, Tim Güneysu, Christof Paar
Redesign of FPGA Module for ECM • Original: 6 × Spartan-3 XC3S1000 • Redesign: 8 × Virtex-4 XC4VSX35