Optimum Implementation of Elliptic Curve Cryptosystems on the SRC-6E Reconfigurable Computer
Nghi Nguyen (1), Kris Gaj (1), David Caliga (2), Tarek El-Ghazawi (3)
(1) George Mason University, (2) SRC Computers, (3) The George Washington University
What is a reconfigurable computer?
[Block diagram: a microprocessor system (several μPs with memory, an interface, and I/O) next to a reconfigurable processor system (several FPGAs with memory, an interface, and I/O).]
Characteristic Features
• close integration of the microprocessor system and the FPGA system
• integrated programming environment
• programming does not require hardware expertise
• suitable for a wide range of applications
• permits run-time reconfiguration of the FPGA system
SRC vs. FPGA Accelerator Boards: Programming
[Diagram comparing how the software and hardware parts are programmed on FPGA boards and on SRC. Labels: HLL, HDL, graphical data flow diagram.]
Run-Time Reconfiguration in SRC
Program in C or Fortran:
• Main program: … Function_1(a, d, e) … Function_2(d, e, f) …
• Function_1: Macro_1(a, b, c), Macro_2(b, d), Macro_2(c, e)
• Function_2: Macro_3(s, t), Macro_1(n, b), Macro_4(t, k)
FPGA contents after the Function_1 call:
[Diagram: Macro_1 takes a and produces b and c; one Macro_2 instance maps b to d and a second Macro_2 instance maps c to e.]
Elliptic Curve Cryptosystems
• public key (asymmetric) cryptosystems
• first true alternative to RSA
• keys several times shorter than in RSA
• fast and compact implementations, in particular in hardware
• a family of cryptosystems, instead of a single cryptosystem
Three Classes of Elliptic Curves
[Diagram: elliptic curves built over $K = GF(p)$ or over $K = GF(2^m)$; curves over $GF(2^m)$ use either a normal basis representation or a polynomial basis representation. Annotations on the three classes: arithmetic operations present in many libraries; fast in hardware; compact in hardware.]
Secure m: m = 155 .. 512
Our m: m = 233
ECC Hierarchy
• High-level functions: kP
• Medium-level functions: 2P, P+Q
• Low-level functions: MUL, INV, XOR
Basic operations of Elliptic Curve Cryptosystems (1)
Basic operations in the Galois field $GF(2^m)$:
• addition and subtraction (both reduce to XOR): $x + y$, $x - y$
• multiplication: $x \cdot y$
• inversion: $x^{-1}$
Basic operations on points of an elliptic curve over $GF(2^m)$:
• addition of points: P + Q
• doubling a point: 2P
where $P = (x_P, y_P)$, $Q = (x_Q, y_Q)$
Basic operations of Elliptic Curve Cryptosystems (2)
Complex operations on points of an elliptic curve over $GF(2^m)$:
• scalar multiplication: kP = P + P + … + P (k times)
• double scalar multiplication: kP + lQ
Addition, P+Q: R = P + Q
• $P = (x_P, y_P)$, $Q = (x_Q, y_Q)$, $R = (x_R, y_R)$
• $x_R = \lambda^2 + \lambda + x_P + x_Q + a_2$
• $y_R = \lambda (x_P + x_R) + x_R + y_P$
• where $\lambda = (y_P + y_Q)(x_P + x_Q)^{-1}$
• Number of field operations: 3 multiplications, 1 inversion
Doubling, 2P: R = 2P
• $P = (x_P, y_P)$, $R = (x_R, y_R)$
• $x_R = a_6 (x_P^{-1})^2 + x_P^2$
• $y_R = x_P^2 + (x_P + y_P x_P^{-1}) x_R + x_R$
• Number of field operations: 5 multiplications, 1 inversion
$a_2$, $a_6$: coefficients of the curve. Squarings are counted as multiplications.
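The addition and doubling formulas can be checked numerically on a small field. The sketch below is only an illustration and makes several assumptions that are not on the slide: the curve has the usual binary form $y^2 + xy = x^3 + a_2 x^2 + a_6$, the field is the toy field $GF(2^4)$ with $P(x) = x^4 + x + 1$ instead of $m = 233$, the coefficients `A2` and `A6` are arbitrary, and inversion uses repeated squaring rather than the hardware inverter. All function names are hypothetical.

```python
M = 4                 # toy field size (assumption; the slides use m = 233)
POLY = 0b10011        # x^4 + x + 1, irreducible over GF(2)
A2, A6 = 1, 1         # arbitrary toy coefficients; A6 must be nonzero

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M), reduced modulo POLY."""
    c = 0
    for i in range(M - 1, -1, -1):
        c <<= 1
        if (b >> i) & 1:
            c ^= a
        if (c >> M) & 1:
            c ^= POLY
    return c

def gf_inv(a):
    """Inverse as a^(2^M - 2), built from M - 1 squarings."""
    result = 1
    for _ in range(M - 1):
        a = gf_mul(a, a)
        result = gf_mul(result, a)
    return result

def on_curve(pt):
    """True if pt satisfies y^2 + xy = x^3 + A2*x^2 + A6 (None = infinity)."""
    if pt is None:
        return True
    x, y = pt
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A2, x2) ^ A6

def point_double(p):
    """R = 2P using the doubling formulas; x_P = 0 gives infinity."""
    if p is None or p[0] == 0:
        return None
    x, y = p
    x_inv = gf_inv(x)
    x_sq = gf_mul(x, x)
    xr = gf_mul(A6, gf_mul(x_inv, x_inv)) ^ x_sq
    yr = x_sq ^ gf_mul(x ^ gf_mul(y, x_inv), xr) ^ xr
    return (xr, yr)

def point_add(p, q):
    """R = P + Q using the addition formulas, with the special cases handled."""
    if p is None:
        return q
    if q is None:
        return p
    if p == q:
        return point_double(p)
    if p[0] == q[0]:          # same x, different y: Q = -P
        return None
    lam = gf_mul(p[1] ^ q[1], gf_inv(p[0] ^ q[0]))
    xr = gf_mul(lam, lam) ^ lam ^ p[0] ^ q[0] ^ A2
    yr = gf_mul(lam, p[0] ^ xr) ^ xr ^ p[1]
    return (xr, yr)

# All affine points of the toy curve, found by brute force.
POINTS = [(x, y) for x in range(1 << M) for y in range(1 << M) if on_curve((x, y))]
```

Every sum or double of two entries of `POINTS` should again satisfy `on_curve`, which is a quick way to catch a sign or index slip in the formulas.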
Scalar Multiplication - kP
R = kP = P + P + … + P (k times)
$k = (k_{m-1}, k_{m-2}, \ldots, k_1, k_0)_2$
R = O
S = P
for i = 0 to m-1
    if k_i = 1 then
        R = R + S
    end if
    S = 2S
end for
return R
Within one iteration, R = R + S and S = 2S can be performed in parallel.
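The loop above is the right-to-left double-and-add method. A minimal Python sketch follows; the function name `scalar_mul` and the callback-style interface are assumptions made for the example. To keep it self-contained, the usage lines run it on integers modulo 1009 under addition as a stand-in group, not on elliptic curve points.

```python
def scalar_mul(k, p, m, add, double, identity):
    """Compute kP by scanning the m bits of k from k_0 up to k_{m-1}."""
    r = identity          # R = O
    s = p                 # S = P
    for i in range(m):
        if (k >> i) & 1:  # k_i = 1
            r = add(r, s)
        s = double(s)     # independent of the addition above
    return r

# Stand-in group for illustration only: integers mod 1009 under addition.
N = 1009
result = scalar_mul(200, 7, 8,
                    add=lambda a, b: (a + b) % N,
                    double=lambda a: (2 * a) % N,
                    identity=0)
# In this group kP is just k * P mod N, so result equals (200 * 7) % 1009.
```

Because `add` reads the value of `s` from before the doubling, the two operations in one iteration have no data dependency on each other, which is what allows them to run side by side in hardware.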
Investigated Partitioning Schemes

SRC Program Partitioning
• C function for μP: μP system, HLL
• C function for MAP: FPGA system, HLL
• VHDL macro: FPGA system, HDL
H00 Partitioning (μP Software Only)
• C function for μP: H (kP)
• C function for MAP: 0
• VHDL macro: 0

00H Partitioning (VHDL Only)
• C function for μP: 0
• C function for MAP: 0
• VHDL macro: H (kP)

HML Partitioning
• C function for μP: H (kP)
• C function for MAP: M (2P, P+Q)
• VHDL macro: L (INV, XOR, MUL)

0HL Partitioning
• C function for μP: 0
• C function for MAP: H (kP, with P+Q and 2P)
• VHDL macro: L (INV, XOR, MUL)

0HM Partitioning
• C function for μP: 0
• C function for MAP: H (kP)
• VHDL macro: M (P+Q, 2P)
$GF(2^m)$ Multiplier
Input: $A, B \in GF(2^m)$
Output: $C = A \cdot B \bmod P$
C = 0
for i = m-1 to 0 do
    C = (C << 1) + A * b_i
    C = C + c_m * P
end for
return C
[Block diagram: m-bit Input A, Input B, and Constant P; register B shifted left by one bit per step; AND gates selecting A and P; register C shifted left by one bit per step; m-bit Result.]
m+1 clock cycles per multiplication
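The shift-and-add loop can be modelled in a few lines of software. This is a sketch for illustration, not the VHDL macro from the talk; the name `gf2m_mul` and the use of Python integers as bit vectors are assumptions. The usage line works in $GF(2^8)$ with $P(x) = x^8 + x^4 + x^3 + x + 1$, chosen only because products in that field are easy to check by hand.

```python
def gf2m_mul(a, b, poly, m):
    """MSB-first bit-serial multiplication: C = A * B mod P in GF(2^m).

    a, b : field elements as integers below 2**m
    poly : the reduction polynomial P, including its x^m term
    """
    c = 0
    for i in range(m - 1, -1, -1):
        c <<= 1                 # C = C << 1
        if (b >> i) & 1:        # + A * b_i
            c ^= a
        if (c >> m) & 1:        # + c_m * P clears the overflow bit
            c ^= poly
    return c

product = gf2m_mul(0x57, 0x83, 0x11B, 8)   # 0xC1
```

The loop body runs m times, one bit of B per pass, which matches the slide's count of roughly one clock cycle per bit.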
$GF(2^m)$ Inverter
Modified Almost Inverse Algorithm
Input: $A \in GF(2^m)$
Output: $C = A^{-1} \bmod P$
Y = A, D = P, B = 0, Z = 1
loop
    while y_0 = 0 do
        Y = Y >> 1
        Z = (Z + z_0 * P) >> 1
    end while
    if (Y = 1) return Z
    if (D > Y) then
        D <=> Y, B <=> Z
    Y = Y + D, Z = Z + B
end loop
[Block diagram: Input A and Constant P load registers Y and D; registers Z and B start at 1 and 0; swapping logic between Y/D and between Z/B; right shifts of Y and Z inside the while loop; m-bit Result taken from Z.]
Time of inversion is input-dependent: typically 3 to 4 times m, on average
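A software model of the inverter loop is sketched below, under the assumption that the comparison D > Y can be done on the integer values of the bit vectors. The names `gf2m_inv` and `gf2m_mul` are hypothetical; the multiplier is included only so that the result can be checked, and the example field $GF(2^8)$ with $P(x) = x^8 + x^4 + x^3 + x + 1$ is again just a convenient test case.

```python
def gf2m_inv(a, poly):
    """Invert a nonzero element of GF(2^m); poly includes its x^m term.

    Loop invariants: Z * A = Y (mod P) and B * A = D (mod P).
    """
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    y, d, b, z = a, poly, 0, 1
    while True:
        while y & 1 == 0:        # y_0 = 0: divide Y and Z by x
            y >>= 1
            if z & 1:            # make Z divisible by x first
                z ^= poly
            z >>= 1
        if y == 1:
            return z
        if d > y:                # keep the larger value in Y
            y, d = d, y
            z, b = b, z
        y ^= d                   # Y = Y + D (both odd, so Y becomes even)
        z ^= b                   # Z = Z + B


def gf2m_mul(a, b, poly, m):
    """Shift-and-add multiplication, used here only to verify inverses."""
    c = 0
    for i in range(m - 1, -1, -1):
        c <<= 1
        if (b >> i) & 1:
            c ^= a
        if (c >> m) & 1:
            c ^= poly
    return c


inverse = gf2m_inv(0x53, 0x11B)   # 0xCA
```

The number of passes through the loops depends on the bit pattern of the input, which is why the slide describes the inversion time as input-dependent.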
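Unrolled Implementation Approach Using Two FPGA Devices
[Block diagram: kP, 2P, P+Q, and I/O blocks spread over FPGA1 and FPGA2, with eight MUL units and two INV units.]

Iterative Implementation Approach Using Two FPGA Devices
[Block diagram: kP, 2P, P+Q, and I/O blocks spread over FPGA1 and FPGA2, with three MUL units and two INV units.]

Iterative Implementation Approach Using One FPGA Device
[Block diagram: kP, P+Q, 2P, and I/O blocks with three MUL units and two INV units; FPGA1 and FPGA2 are both drawn.]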
Results

Timing Measurements
[Timeline diagram. Labels: .c file, .mc file, MAP function. Phases shown: MAP Alloc., MAP Free, FPGA Configure, DMA Data In, FPGA Computation, DMA Data Out. Measured intervals: End-to-End time (HW), End-to-End time (SW), MAP Allocation time, MAP Release time, Configuration time.]
End-to-End Latency for Different Partitioning Approaches
[Chart; only the data label 101,145 survives in the extracted text.]

FPGA Resource Usage for Different Partitioning Approaches
[Chart.]
Conclusions
• Elliptic Curve Cryptosystem implementation is challenging for reconfigurable computers because of
    • optimization for latency rather than throughput
    • limited amount of parallelism
• Speed-up of 8 to 9 times over a highly optimized microprocessor implementation demonstrated using four different algorithm partitioning schemes
    • 0HL iterative 2-chip
    • 0HL unrolled 2-chip
    • 0HM 2-chip
    • 00H 1-chip
Conclusions – cont.
Clear trade-offs: resources, timing, ease of programming

Conclusions – cont.
Assuming focus on: resources, timing, ease of programming
Conclusions – cont.
The best implementation approach: 0HL partitioning scheme, 2-chip, unrolled
• C function for μP: 0
• C function for MAP: H (kP, with P+Q and 2P)
• VHDL macro: L (INV, XOR, MUL)
Only 8% increase in the execution time compared to pure VHDL