Workshop on Factorization, CITS Bochum, 11-12 September 2009

Implementing the Elliptic Curve Method (ECM) on Special-Purpose Hardware Ralf Zimmermann, Tim Güneysu, Christof Paar Horst Görtz Institute for IT-Security, Ruhr-University Bochum

Outline Introduction Background and Arithmetic on Modern FPGAs Overview of Elliptic Curve Method (ECM) Initial Implementation of ECM (Phase 1) Optimization and Implementation of ECM (Phase 1 & 2) Results Conclusions

Design Goals and Decisions [Figure: Factor n = 7626668401080283463, p = 2349834551, q = 3245619313] Usage of Elliptic Curve Method (ECM) in co-factorization Goals of this work group • ECM implementation for numbers of small bit length (up to 200 bits) • Implement both Phases 1 and 2 • Usable on COPACOBANA for massive parallelism Four Individual Phases of this Work Package • Determine the best platform for ECM • Redesign COPACOBANA architecture for target platform • Implement ECM on the target platform • Optimize ECM on the target platform

Selection Process • Task: Find the best platform for ECM (≠ CPUs)! • Based on three (predefined) platforms: • FPGAs: Spartan-3/Virtex-4 • Digital Signal Processors (DSP): TI C6713 • Smartcards with PK-Accelerator: Siemens SL88 Winner of this competition: Virtex-4 FPGAs

Design Goals and Decisions II Use of Elliptic Curves in Montgomery Form • Efficient formulas for hardware • Computation/storage of y-coordinate can be omitted External inputs provided by the host PC • Initial values: k (Phase 1), prime table (Phase 2) • Each unit: curve parameters, modified moduli • GCD on Host PC

General FPGA Architecture • Configurability: by mesh of programmable elements • Configurable Logic Blocks (CLB) • Modern FPGAs contain tens of thousands of CLBs • Connection via interconnect (switching matrix) • Modern FPGAs contain hardcores • Dedicated memory elements • Arithmetic hardcores, e.g., to accelerate integer multiplication and addition • Embedded PowerPC processors • High-speed I/O Transceivers

Generic FPGA Structure (simplified) [Figure labels: Long lines, Switch matrix, Input/output, Configurable Logic Block]

Embedded Memory Elements in Virtex-4 • 18kbit storage element (BlockRAM) • 400 MHz with connected output register • Flexible BRAM configuration • Dual-Port storage • Single-port storage • RAM or ROM • Cascading possible

Digital Signal Processing (DSP) Hardcores • Fast signed 18x18-bit multiplication and signed 48-bit addition/subtraction (400 MHz) • Integrated pipeline register • Controllable by an OPMODE signal • DSP elements can be cascaded

Simplified Architecture of Virtex-4 FPGAs [Figure labels: CLB, PowerPC (optional), DSP elements, Block RAM elements] Location and column-wise alignment of elements are important for place and route of FPGA designs!

Basic Modular Addition
Input: Modulus M, operands A < M, B < M
Output: A + B (mod M)
S = A + B
S' = A + B - M
If (b = 1) then Return S Else Return S', where b is the borrow of the subtraction
Example: A = 2354, B = 4875, M = 6888
S = 2354 + 4875 = 7229 (carries 0 1 1 0)
S' = 7229 - 6888 = 0341 (borrows 0 1 1 0, final borrow b = 0)
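The addition rule on this slide can be sketched in a few lines of Python. This is only an illustration of the algorithm as stated, assuming both operands are already reduced (A, B < M); the function name `mod_add` is hypothetical and not taken from the talk.

```python
def mod_add(a, b, m):
    """Modular addition with one trial subtraction, assuming a < m and b < m."""
    s = a + b              # S  = A + B
    s_prime = s - m        # S' = A + B - M
    borrow = 1 if s_prime < 0 else 0   # b = 1 when the subtraction underflows
    return s if borrow == 1 else s_prime


# Example values from the slide: A = 2354, B = 4875, M = 6888
result = mod_add(2354, 4875, 6888)
```

In hardware both S and S' are computed and the borrow only drives the final selection, which is why the rule is written as a multiplexer choice rather than a conditional subtraction.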

DSP-supported Modular Addition [Figure: DSP datapath; labels: a_j, b_j and m_j in blocks j,0..16 and j,17..33, CIN, PCOUT, MUX with MODE_SEL, 34-bit output s_j] Example values: A = 2354, B = 4875, M = 6888, S = 7229, S' = 0341

Basic Modular Multiplication
Modular Multiplication with Quotient Pipelining (Orup)
Input: Modulus-dependent parameter M''(M, k, d), operands A, B
Output: S ≡ A · B · R^-1 (mod M)
// n is the number of rounds
for i = 0 to n do
  q_i = S_i (mod 2^k)
  S_(i+1) = S_i / 2^k + q_i M'' + b_i A
Return S_(n+1)
[Figure: A, B, M'' and S_i are split into blocks of k bits (a_0..a_n, b_0..b_n, m_0..m_n, S_i,0..S_i,n); labels: block selection of S_i,0, scalar multiplication, accumulation, shift by k bits]
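The recurrence above can be tried out with Python integers. The sketch below is not the hardware design from the talk; it assumes a pipelining delay of d = 0, integer division for S_i / 2^k, an odd modulus M, an operand B < 2^(k·n) (so that b_n = 0), and M'' = ((-M^-1 mod 2^k) · M + 1) / 2^k. Under these assumptions R = 2^(k·n). The name `orup_mul` is hypothetical.

```python
def orup_mul(a, b, m, k, n):
    """Return S with S ≡ a*b*2^(-k*n) (mod m); S is not fully reduced.

    Assumes m is odd, b < 2^(k*n), and pipelining delay d = 0.
    """
    radix = 1 << k
    m_prime = (-pow(m, -1, radix)) % radix   # -M^-1 mod 2^k (Python 3.8+)
    m_tilde = m_prime * m                    # congruent to -1 mod 2^k
    m2 = (m_tilde + 1) >> k                  # M'' as assumed in the lead-in
    s = 0
    for i in range(n + 1):                   # i = 0 .. n
        q = s & (radix - 1)                  # q_i = S_i mod 2^k
        b_i = (b >> (k * i)) & (radix - 1)   # k-bit block of B (zero for i = n)
        s = (s >> k) + q * m2 + b_i * a      # S_(i+1)
    return s


# Example with 17-bit blocks and n = 9 blocks (153 bits of operand space)
M = (1 << 151) - 19
A = pow(3, 90, M)
B = pow(5, 60, M)
S = orup_mul(A, B, M, 17, 9)
```

The quotient digit q_i depends only on the low block of S_i, so no multiplication is needed to determine it; that is the property the block diagram on the slide exploits.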

Introduction to Elliptic Curve Method • Phase 1 and 2 of the elliptic curve method • Phase 1 was originally proposed by Lenstra • Idea: Pollard's p-1 method adapted for elliptic curves • Brent and Montgomery extended Phase 1 by a continuation (Phase 2) • Cost-intensive operations in Phase 1 & 2 • Phase 1 is basically a scalar multiplication • Gaj et al. described an algorithm for the standard continuation (CHES 2006) • Standard continuation in hardware • Precomputations • Scalar multiplications jQ • Storing primes in a table • Main computations • Point addition • Accumulation of product d

Example: ECM Operations • Arithmetic unit consists of 2 multipliers and 1 addition/subtraction component • Example: concurrent point doubling/point addition in 6 steps • Implementation of Montgomery Ladder for ECC point multiplication
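A software sketch of the Montgomery ladder on a curve in Montgomery form, using only the X and Z coordinates, may help to see why one doubling and one addition are paired in every step. It uses the standard x-only formulas with the curve constant a24 = (A + 2) / 4; the function names, the test modulus, and the curve constant are illustrative assumptions, and the scheduling onto two multipliers from the slide is not modelled.

```python
def xdbl(P, a24, n):
    """x-only doubling on a Montgomery curve, a24 = (A + 2) / 4 mod n."""
    X, Z = P
    s = (X + Z) ** 2 % n
    d = (X - Z) ** 2 % n
    t = (s - d) % n                       # equals 4*X*Z
    return (s * d % n, t * (d + a24 * t) % n)


def xadd(P, Q, diff, n):
    """x-only addition P + Q, given the coordinates of the difference P - Q."""
    X1, Z1 = P
    X2, Z2 = Q
    Xd, Zd = diff
    u = (X1 - Z1) * (X2 + Z2) % n
    v = (X1 + Z1) * (X2 - Z2) % n
    return (Zd * (u + v) ** 2 % n, Xd * (u - v) ** 2 % n)


def ladder(k, P, a24, n):
    """Compute (X : Z) of k*P for k >= 1; R1 - R0 = P holds in every step."""
    R0, R1 = P, xdbl(P, a24, n)
    for bit in bin(k)[3:]:                # bits below the leading one
        if bit == "1":
            R0, R1 = xadd(R0, R1, P, n), xdbl(R1, a24, n)
        else:
            R0, R1 = xdbl(R0, a24, n), xadd(R0, R1, P, n)
    return R0


# Illustrative parameters (assumptions): prime modulus 2^61 - 1, A = 6, x = 5
p = 2 ** 61 - 1
a24 = 2
P0 = (5, 1)
X77, Z77 = ladder(77, P0, a24, p)
```

In ECM Phase 1 the same ladder would run modulo the composite n with the scalar k, and the host would then take the GCD of the resulting Z coordinate with n.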

Initial Project Implementation • Initial project implemented only Phase 1 • Pros: • Proof-of-concept implementation • DSP implementation of multiplier and addition/subtraction • 8 units calculating Phase 1 • Cons: • Very complex instructions (size and routing) • Addition works with a different bit width than multiplication • Special way to feed the multiplier with data • Design reaches the DSP limit • Extension to Phase 2 might not be possible

Initial ECM System Architecture [Figure labels: ECM system, ECM unit] • 8 independent units • Supports up to 151-bit numbers • Each unit computes a different parameter set

Initial ECM Systems on COPACOBANA • Host computer: performs (simple) precomputations • COPACOBANA: performs cost-intensive operations • Each module consists of 8 Virtex-4 FPGAs • For more information about COPACOBANA, please see talk tomorrow

Initial Multiplication with DSPs • 151-bit multiplier using 10 DSP elements • Special structure: a_j || a_(j+5) and m_j || m_(j+5) • Output: s_j after 66 clock cycles in 17-bit blocks
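To make the 17-bit block structure concrete, the following sketch splits operands into 17-bit limbs and multiplies them block by block. It is a plain schoolbook product, not the interleaved structure of the slide, and it assumes that 17-bit unsigned blocks are used because they fit the signed 18x18-bit DSP multiplier, with partial products accumulated in the 48-bit adder. All names are hypothetical.

```python
LIMB_BITS = 17
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split x into n little-endian blocks of 17 bits."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]


def limb_multiply(a, b, n):
    """Schoolbook product of two n-limb numbers from 17x17-bit partial products."""
    a_l, b_l = to_limbs(a, n), to_limbs(b, n)
    columns = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            columns[i + j] += a_l[i] * b_l[j]   # one DSP multiply-accumulate
    # Every column sum must fit the assumed 48-bit accumulator.
    assert all(c < (1 << 48) for c in columns)
    result = 0
    for idx, c in enumerate(columns):
        result += c << (LIMB_BITS * idx)        # carry propagation
    return result


a = (1 << 151) - 1
b = 0x123456789ABCDEF0123456789ABCDEF012345
product = limb_multiply(a, b, 9)
```

With 9 limbs a column holds at most 9 products of 34 bits each, so the 48-bit accumulator has ample headroom.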

Initial Data Storage [Figure labels: storage block I, storage block II, storage block III] • BRAM 0 to 5 contain input values for the multipliers • BRAM 6/7 contain input values for the addition/subtraction unit • BRAM 8 contains the moduli for multiplication and addition/subtraction • Multipliers transfer 17-bit result blocks to storage block I (BRAM 0-2), storage block II (BRAM 3-5), and/or storage block III (BRAM 6-7) • Addition/subtraction transfers a 34-bit result block to one of the storage blocks I/II/III Too complex! Routing kills the frequency → max. 100 MHz out of a possible 400 MHz

Optimization of Addition/Subtraction • 17-bit internal bus width • No conversion between add/sub (34 bit) and multiplication (17 bit) • Less space in fabric when buffering bus signals • Better routing possible • Changed code of modular addition/subtraction algorithm • Maximized clock frequency under optimal conditions • Removed output buffers and distributed RAM

Optimization of Multiplication • Previous approach to multiplying: • Use n DSPs to multiply n*17-bit values (parallel) → (n+1) * 6 cycles • New approach: • Use 3 DSPs to multiply n*17-bit values (sequential) → 6 + (n+2)*n cycles • Here n = 10: reduce DSPs by 7 at the cost of 60 clock cycles [Figure: DSP usage (0 to 10) over clock cycles (0, 66, 126)]
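The trade-off stated on this slide follows directly from the two cycle formulas. A small sketch, with hypothetical function names, reproduces the numbers for n = 10:

```python
def cycles_parallel(n):
    """Previous approach: n DSPs in parallel, (n + 1) * 6 cycles."""
    return (n + 1) * 6


def cycles_sequential(n):
    """New approach: 3 DSPs used sequentially, 6 + (n + 2) * n cycles."""
    return 6 + (n + 2) * n


n = 10
old_cycles = cycles_parallel(n)        # 66
new_cycles = cycles_sequential(n)      # 126
extra_cycles = new_cycles - old_cycles # 60
dsps_saved = n - 3                     # 7
```

The sequential variant grows quadratically in n, so the saving in DSP elements is paid for with more cycles as the operand width increases.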

Optimization of Multiplication • Multiplication results: • Arithmetic-Logic Unit (ALU) results: Promising results for the basic structure!

Optimization of Memory Usage • Instructions • Reduced bit width by factor > 2.5 • Instructions for all ECM operations: 1 BRAM • ECM instructions merged with ECM parameters k and prime table • Memory usage per unit • ALU: 6 BRAMs • ECM unit: 1 BRAM 1 – Project 1 uses global instructions for Phase 1 using 17 BRAMs for 8 units 2 – Project 2 uses 2 BRAMs, one for ALU, one for ECM instructions for both phases.

Implementation of ECM Phase 2 • Different clock domains • 100 MHz for ECM finite state machine (FSM) • 200 MHz for arithmetic-logic unit (ALU) • Workspace RAM: ALU playground • 32 cells, each containing 2 blocks • Each block stores 16 values of 17 bits → 17,408 bits (equals 1 BRAM) [Figure labels: Phase 2 precomputation; Phase 2; P1/P2 temporary result storage; point addition / doubling results; P2 pre temporary points]
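The workspace size quoted on this slide can be checked with a short calculation. The constant names are hypothetical; the 18 kbit BlockRAM capacity is taken as 18 * 1024 bits.

```python
CELLS = 32             # cells in the workspace RAM
BLOCKS_PER_CELL = 2    # blocks per cell
VALUES_PER_BLOCK = 16  # values per block
VALUE_BITS = 17        # bits per value

workspace_bits = CELLS * BLOCKS_PER_CELL * VALUES_PER_BLOCK * VALUE_BITS
bram_bits = 18 * 1024  # one 18 kbit BlockRAM

fits_in_one_bram = workspace_bits <= bram_bits
```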

Optimizing Phase 2 Precalculation • From Gaj et al. (CHES 2006): • Calculate the set JS of integers j • Use the multiples j*Q0

Optimizing Phase 2 Precalculation II • Suggested parameter D = 210 for hardware • Which multiples of Q0 are needed? • JS(D) = {1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103} • Calculation suggested in [1] calculates: JS(D) = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105} • Are more multiples needed? • 210*Q0 and MMIN*D*Q0 • But: point addition on elliptic curves? • To calculate (Px : : Pz) + (Qx : : Qz) efficiently, we need the coordinates ([P-Q]x : : [P-Q]z) → What to do with R = R + Q?
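The first set on this slide matches the integers j with 1 <= j <= D/2 that are coprime to D. Assuming that this is the intended definition of JS(D), which the slide does not state explicitly, the set can be generated as follows (the function name `js` is hypothetical):

```python
from math import gcd


def js(d):
    """Integers j in [1, d/2] with gcd(j, d) = 1 (assumed definition of JS(D))."""
    return [j for j in range(1, d // 2 + 1) if gcd(j, d) == 1]


needed = js(210)
odd_multiples = list(range(1, 106, 2))   # what the calculation from [1] produces
```

For D = 210 this gives 24 values, compared with 53 odd multiples produced by the calculation from [1].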

Optimizing Phase 2 Precalculation III What about MMIN * D? • MMIN = round(B1 / D) • Suggested B1 = 960 → MMIN = 5 • MMIN*D = 1050 What about [R-Q]? • Start: R = 5*D*Q0, Q = D*Q0 • Point addition needs: [R-Q] = 4*D*Q0 = 840*Q0 • Iteration increases the factor by one: 5*D*Q0, 6*D*Q0, … Calculations suggested: • 1 D + 52 A + {210, 1050}*Q0 [(8 + 11) AD using Montgomery ladder] • 840*Q0 never mentioned [10 AD using Montgomery ladder] → not implemented and/or inefficient
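The numbers on this slide can be reproduced with a short calculation. The sketch assumes that "AD" denotes one combined addition-and-doubling ladder step per bit of the scalar, which is an interpretation that happens to match the quoted counts of 8, 11 and 10; the function names are hypothetical.

```python
def m_min(b1, d):
    """MMIN = round(B1 / D), using integer arithmetic (round half up)."""
    return (2 * b1 + d) // (2 * d)


def ladder_steps(k):
    """Assumed cost model: one add-and-double step per bit of the scalar k."""
    return k.bit_length()


D = 210
B1 = 960
MMIN = m_min(B1, D)              # 5
first_r = MMIN * D               # 1050, scalar of the starting point R
first_diff = (MMIN - 1) * D      # 840, scalar of R - Q needed for R + Q

cost_d = ladder_steps(D)                 # 210*Q0
cost_r = ladder_steps(first_r)           # 1050*Q0
cost_diff = ladder_steps(first_diff)     # 840*Q0
```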

Implementation of Phase 2 Precalculation • Use a chain of point additions / doublings to calculate multiples • Number of operations: 30 A + 7 D • Number of points in RAM: 27 + 2 temporary → reduces the number of operations drastically → calculates every point needed for addition → calculates every point needed for Phase 2

Implementation of Phase 2 Precalculation II • Parameters might change – what now? • Change of B1 results in a change of MMIN • MMIN = 0 not possible • MMIN = 1: first R + Q changes to doubling Q • MMIN > 1 is implementable using this strategy • Required space in RAM unchanged • Required space in fabric unchanged • Varies the number of operations in precomputation

Results of the new ECM implementation [Figure labels: Data transfer, Phase 1, Phase 2, Data transfer] Results for ECM Phase 1 + 2 (B1 = 960 and B2 = 57600)

Results for ECM Phase 1 + 2 ECM Phase 1 + 2 using B1 = 960, B2 = 57600 [1723 bit k] 69,120 ECM calculations / second on COPACOBANA¹ ¹ COPACOBANA using 4 FPGA modules (= 32 FPGAs)

Comparison (ECM Phase 1) Comparing ECM Phase 1 Implementations using B1 = 960, B2 = 57000 [1323 bit k] 1 Cost for 1 FPGA: www.em.avnet.com (September 2009)

Conclusions • Novel and complete implementation • Implementation results: implementation of phase 2 on FPGA – not just estimates • Optimization results: at least twice as effective as the documented result • Generic and scalable system • Architecture can be easily optimized for (n*17) – 2 bit with 2 <= n <= 14 [implemented: n = 9] • only small changes in fabric/resources, but exchange instruction ROM • Highly parallel architecture for co-factorization • Multiple ECM-units per FPGA (24 units supporting 151-bit on Virtex-4 SX 35) • Multiple FPGAs on the COPACOBANA cluster (128 Virtex-4 SX 35)

Thank you for your attention! Any Questions? Ralf Zimmermann, Tim Güneysu, Christof Paar

Redesign of FPGA Module for ECM Original: 6 x Spartan-3 XC3S1000 Redesign: 8 x Virtex-4 XC4VSX35

COPACOBANA with Virtex-4 Devices
