Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization
Sashisu Bajracharya, MS CpE Candidate
Master’s Thesis Defense
Advisor: Dr. Kris Gaj
George Mason University
Outline
1. RSA and Factoring with the Number Field Sieve (NFS)
2. Matrix Step of NFS
3. Basic Mesh Routing
4. Improved Mesh Routing
5. FPGA Array
6. SRC-6e Reconfigurable Computer
7. Results and Conclusions
8. Summary & Conclusions
RSA: the Major Public Key Cryptosystem
Public key: (e, N); Private key: (d, P, Q)
N = P·Q, where P and Q are large prime factors
[figure: Alice encrypts with the public key, Bob decrypts with the private key, communicating over the network]
RSA was developed by Ron Rivest, Adi Shamir & Leonard Adleman in 1977
Applications of RSA
• Secure WWW: SSL between browser and web server (95% of e-commerce)
• Secure e-mail: S/MIME, PGP between Alice and Bob
How hard is it to break RSA?
Largest number factored: RSA-576, 576 bits (Dec 2003)
1881 9881292060 7963838697 2394616504 3980716356 3379417382 7007633564 2298885971 5234665485 3190606065 0474304531 7388011303 3967161996 9232120573 4031879550 6569962213 0516875930 7650257059
Resources & effort: 100 workstations from 8 different sites around the world, 3 months
Recommended key sizes for RSA
• Old standard: 512 bits (155 decimal digits), individual users; broken in 1999
• New standard:
  - Individual users: 768 bits
  - Organizations (short term): 1024 bits
  - Organizations (long term): 2048 bits
Estimated difficulty of factoring a 1024-bit number (RSA Security, Inc.): 342 million PCs with 500 MHz and 170 GB RAM, running for 1 year
Our Task
Determine how hard it is to break RSA by factoring large keys using reconfigurable hardware: the SRC-6e Reconfigurable Computer and a generic array of FPGAs
Best Algorithm to Factor: the NUMBER FIELD SIEVE
Complexity: sub-exponential time and memory
N = number to factor, k = number of bits of N
• Exponential function: $e^k$
• Sub-exponential function: $e^{k^{1/3} (\ln k)^{2/3}}$
• Polynomial function: $a \cdot k^m$
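A quick back-of-the-envelope check (my own arithmetic, not from the original slide) shows the gap at k = 1024: $k^{1/3} = 1024^{1/3} \approx 10.08$ and $(\ln 1024)^{2/3} \approx 6.93^{2/3} \approx 3.63$, so the sub-exponential exponent is only $10.08 \cdot 3.63 \approx 36.6$, versus $k = 1024$ for the exponential function: roughly $e^{37}$ against $e^{1024}$.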
Number Field Sieve (NFS) Steps
1. Polynomial Selection
2. Sieving
3. Matrix (Linear Algebra)
4. Square Root
Computationally intensive steps: Sieving and Matrix (Linear Algebra)
Hardware Architectures of NFS proposed to date
• Daniel Bernstein (Univ. of Illinois, Chicago): Mesh Sorting approach for the Matrix step (Fall 2001)
• Adi Shamir & Eran Tromer (Weizmann Institute, Israel): Mesh Routing for the Matrix step (AsiaCrypt 2002); TWIRL for Sieving (Crypto 2003, AsiaCrypt 2003)
The mesh method improves the asymptotic complexity of NFS, but to date there are only analytical estimates: no real implementation, no concrete numbers.
Focus of Research: the Matrix (Linear Algebra) step
My objective: bring this mesh algorithm to a practical hardware implementation and concrete numbers
My Objective
• Detailed design in RTL code of the mesh algorithm
• Synthesis and implementation results for an array of Virtex FPGAs and the SRC-6e Reconfigurable Computer
Function of the Matrix Step
Find linear dependencies among columns of the large sparse matrix over GF(2) obtained after the sieving step: find columns $c_{i_1}, c_{i_2}, \ldots, c_{i_L}$ such that
$c_{i_1} \oplus c_{i_2} \oplus \cdots \oplus c_{i_L} = 0$
D = number of matrix columns (or rows): about $10^6$ for 512-bit, $10^7$ for 1024-bit
Block Wiedemann Algorithm for the Matrix Step
1) Uses multiple matrix-vector multiplications of the sparse matrix A with K random vectors: $A v_i, A^2 v_i, \ldots, A^k v_i$, where $k = 2D/K$
2) Post-computation leading to the determination of linear dependencies among the columns of matrix A
Most time-consuming operation: $A_{[D \times D]} \cdot v_{[D \times 1]}$
Mesh-based hardware circuits, proposed by Bernstein and Shamir-Tromer, decrease the time and cost of the matrix-vector multiplications
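To make the core operation concrete, here is a minimal software sketch (my own illustration, not the thesis RTL): the K random vectors are bit-packed so that one integer per coordinate carries one bit of each vector, and the sparse matrix is stored as per-column lists of non-zero row indices. Names and toy sizes are assumptions.

```python
# Minimal software sketch (not the thesis RTL) of the Block Wiedemann core:
# k = 2*D/K iterations of y = A*v over GF(2), with the K vectors bit-packed
# so one Python int per coordinate carries one bit from each of the K vectors.
import random

def matvec_gf2(sparse_cols, v):
    """y = A*v over GF(2). sparse_cols[j] lists the non-zero row indices of
    column j; multiplication is XOR-accumulation of v[j] into those rows."""
    y = [0] * len(v)
    for j, rows in enumerate(sparse_cols):
        for i in rows:
            y[i] ^= v[j]          # XOR replaces addition in GF(2)
    return y

D, K, d = 16, 8, 3                # toy sizes; the real D is ~10**6 (512-bit)
sparse_cols = [random.sample(range(D), d) for _ in range(D)]
v = [random.getrandbits(K) for _ in range(D)]   # K packed random vectors

k = 2 * D // K                    # number of iterations, per the slide
for _ in range(k):
    v = matvec_gf2(sparse_cols, v)   # A*v, A^2*v, ..., A^k*v
```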
Two Architectures for Matrix-Vector Multiplication
• Mesh Sorting (Bernstein): based on recursive sorting; does one multiplication at a time; large area
• Mesh Routing (Shamir-Tromer): based on routing; does K multiplications at a time; compact area (handles larger matrix sizes)
Matrix-Vector Multiplication
[figure: sparse matrix A multiplied by vector v over GF(2), giving A·v]
Mesh Routing
m × m mesh, where $m = \sqrt{D}$
[figure: the D columns of A and the D bits of v mapped onto the m × m mesh; each cell holds the non-zero row indices of one column and one vector bit]
d = maximum number of non-zero entries over all columns
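For concreteness (arithmetic added here, not on the original slide): with $D \approx 10^6$ for a 512-bit matrix, $m = \sqrt{D} = \sqrt{10^6} = 1000$, i.e. a 1000 × 1000 array of cells; for 1024-bit, $D \approx 10^7$ gives $m \approx 3163$.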
Routing in the Mesh
[figure: snapshot of packets moving through the mesh toward their target cells]
Each time a packet arrives at its target cell, the packet's vector bit is XORed with the partial result bit held at the target cell.
Mesh Routing
[figure: final mesh state after routing]
At the end of routing, the mesh contains the result of the multiplication.
Mesh Routing with K Parallel Multiplications (example for K = 2)
[figure: two vectors v1 and v2 routed through the mesh simultaneously; each packet and each cell carry K = 2 bits]
Clockwise Transposition Routing
• At each step a cell holds one packet and receives one packet from a neighbor for a compare-exchange
• An exchange is done only if it reduces the distance to target of the farthest-traveling packet
Clockwise Transposition Routing
[figure: mesh cells with compare-exchange directions]
The four iterations are repeated.
Types of Packets
1) Valid packet
2) Invalid packet (a packet becomes invalid once it has reached its destination)
Compare-Exchange Cases (four cases for a cell)
a) Both packets valid (may need to exchange)
b) Current packet invalid, incoming new packet valid (may need to exchange)
c) Current packet valid, incoming new packet invalid (may need to annihilate)
d) Current packet invalid, incoming new packet invalid (no action)
[figure: example packet pairs for each case]
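The four cases above suggest the following decision logic. This is a hedged software sketch of one compare-exchange step, not the thesis cell design: the Packet fields, the Manhattan-distance metric, and the omission of the annihilate case (c) are my simplifications.

```python
# Hedged sketch of one compare-exchange decision between two neighboring
# cells, following the slide's cases; the annihilate case is omitted.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    dest_row: int
    dest_col: int
    payload: int          # K result bits traveling with the packet

def dist(p: Optional[Packet], row: int, col: int) -> int:
    """Manhattan distance from cell (row, col) to p's target; an invalid
    (None) packet has no distance left to cover."""
    if p is None:
        return 0
    return abs(p.dest_row - row) + abs(p.dest_col - col)

def compare_exchange(cur, new, here, there):
    """Swap the packets held at cells `here` and `there` only if that
    reduces the distance of the farthest-traveling packet
    (the clockwise transposition rule)."""
    if cur is None and new is None:          # case d): no action
        return cur, new
    keep = max(dist(cur, *here), dist(new, *there))
    swap = max(dist(new, *here), dist(cur, *there))
    return (new, cur) if swap < keep else (cur, new)

# Example: packets at cells (2,2) and (2,3); no swap, since swapping
# would lengthen the farthest packet's remaining path.
a, b = compare_exchange(Packet(0, 0, 1), Packet(3, 3, 0),
                        here=(2, 2), there=(2, 3))
```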
Basic Mesh Routing Design
• Each cell of the mesh handles one column of matrix A
• K = 1 or K = 32, where K = number of vectors multiplied by matrix A concurrently
• Total routing takes $d \cdot 4 \cdot m$ compare-exchange operations
Basic Loading and Unloading Design
[figure: non-zero matrix entries and the vector are loaded into the mesh; the result vector is unloaded]
Parallel Loading & Unloading Design
[figure: non-zero matrix entries, vector, and result vector transferred through parallel ports]
Restricted by the number of I/O pins available.
Basic Cell Design for Basic Mesh Routing
[figure: cell datapath with LUT-RAM holding the partial-result bits R[i], packet register P[i], control unit (CU), comparator, and Check-Dest logic; control and status signals include en, decode, equal, eq_packet, en_equal, exchange, annihilate, row/col, oper, and the coordinate inputs r, c]
Comparator Design
[figure: comparator compares the row/col fields of the current packet (s1), the new packet (s2), and the cell's coordinate; control signal logic derives en_equal, exchange, eq_packet, and annihilate from the s1/s2 comparison results]
Improved Mesh Routing Design
• Each cell of the mesh handles p columns of matrix A
• Compact area => handles larger matrix sizes
• Total routing takes $p \cdot d \cdot 4 \cdot m$ compare-exchange steps
• Proposed for cost reduction
Mesh Cell Design for Improved Mesh Routing
[figure: same structure as the basic cell, with the LUT-RAM addressed by addr1/addr2 so that one cell stores the state for p columns; control and status signals as in the basic cell design]
Target FPGA Devices: Xilinx Virtex II
• XC2V8000: 46,592 CLB slices; 93,184 LUTs (Look-Up Tables); 93,184 FFs (flip-flops)
• XC2V6000: 33,792 CLB slices; 67,584 LUTs; 67,584 FFs
[figure: Virtex II layout with CLB slices, Block RAMs, 18 x 18 multipliers, and I/O blocks]
Synthesis Results for one Virtex II XC2V8000 using the Basic Mesh Routing Design
K = number of concurrent matrix-vector multiplications
Time for K multiplications = $d \cdot 4 \cdot m \cdot$ (clock period)
Speedup vs. Software Implementation
Reference optimized SW implementation: PC, Pentium IV, 2.768 GHz, 1 GB RAM
Distributed Computation (Geiselmann, Steinwandt)
Partition A into sub-matrices and v into sub-vectors:
$$\begin{pmatrix} A_{1,1} & A_{1,2} & A_{1,3} \\ A_{2,1} & A_{2,2} & A_{2,3} \\ A_{3,1} & A_{3,2} & A_{3,3} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = \begin{pmatrix} A'_1 \\ A'_2 \\ A'_3 \end{pmatrix}$$
e.g., $A_{1,1} v_1 + A_{1,2} v_2 + A_{1,3} v_3 = A'_1$
1) The FPGA array performs a single sub-matrix by sub-vector multiplication
2) The FPGA array is reused for the next sub-computation
512-bit & 1024-bit performance is evaluated for different numbers of FPGAs connected in a two-dimensional square array
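A minimal sketch of this blocked computation (dense toy blocks stand in for the real sparse sub-matrices; function and variable names are illustrative assumptions): each block-row of the result is the XOR-accumulation of sub-matrix by sub-vector products, which is exactly what allows one FPGA array to be reused per sub-product.

```python
# Minimal sketch of the distributed (blocked) matrix-vector product over
# GF(2). Each block-row result is the XOR of sub-matrix x sub-vector
# products; dense toy blocks stand in for the real sparse sub-matrices.
import random

def block_matvec_gf2(A_blocks, v_blocks):
    """A_blocks[i][j] is an s x s 0/1 matrix; v_blocks[j] is an s-bit vector."""
    result = []
    for row_of_blocks in A_blocks:
        acc = [0] * len(v_blocks[0])
        for A_ij, v_j in zip(row_of_blocks, v_blocks):
            # one "FPGA-array pass": sub-matrix times sub-vector
            for r, A_row in enumerate(A_ij):
                bit = 0
                for a, v in zip(A_row, v_j):
                    bit ^= a & v
                acc[r] ^= bit            # XOR-accumulate partial results
        result.append(acc)
    return result

s, blocks = 4, 3                         # toy sizes
A = [[[[random.randint(0, 1) for _ in range(s)] for _ in range(s)]
      for _ in range(blocks)] for _ in range(blocks)]
v = [[random.randint(0, 1) for _ in range(s)] for _ in range(blocks)]
Av = block_matvec_gf2(A, v)              # A'_i = XOR_j of A_{i,j} * v_j
```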
512-bit Performance with one chip & multiple chips connected in a mesh for Basic Mesh Routing
D = number of columns in matrix A
m = mesh dimension
n = number of times the multiplications are repeated = $D^2 / m^4$
$T_K$ = routing time for K multiplications in the mesh = $d \cdot 4 \cdot m \cdot$ (clock period)
$T_{Load}$ = time for loading and unloading for K multiplications
$T_{Total}$ = total time for the Matrix step = $(3D/K) \cdot n \cdot (T_K + T_{Load})$
1024-bit Performance with one chip & multiple chips connected in a mesh for Basic Mesh Routing
D = number of columns in matrix A
m = mesh dimension
n = number of times the multiplications are repeated = $D^2 / m^4$
$T_K$ = routing time for K multiplications in the mesh = $d \cdot 4 \cdot m \cdot$ (clock period)
$T_{Load}$ = time for loading and unloading for K multiplications
$T_{Total}$ = total time for the Matrix step = $(3D/K) \cdot n \cdot (T_K + T_{Load})$
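The $T_{Total}$ formula on these two slides can be packaged as a small calculator; a sketch with placeholder inputs (the numbers below are made up, not the thesis results):

```python
# The T_Total formula from the two slides above as a small calculator.
# All numeric inputs below are placeholders, not results from the thesis.
def basic_mesh_matrix_step_time(D, m, d, K, clock_period, t_load):
    """Total Matrix-step time for Basic Mesh Routing:
    T_Total = (3*D/K) * n * (T_K + T_Load), with n = D**2 / m**4
    and T_K = d * 4 * m * clock_period."""
    n = D**2 / m**4              # repetitions of the sub-multiplication
    t_k = d * 4 * m * clock_period
    return (3 * D / K) * n * (t_k + t_load)

# Example with made-up parameters (D ~ 10**6 for 512-bit):
t = basic_mesh_matrix_step_time(D=10**6, m=1000, d=32, K=32,
                                clock_period=10e-9, t_load=1e-3)
print(f"{t:.2e} seconds")
```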
Analysis & Conclusion
$m_1$ = mesh size in one Virtex II chip; with F FPGAs the mesh grows to $m = m_1 \sqrt{F}$, so $n \propto 1/F^2$ while $T_K \propto \sqrt{F}$, and $T_{Total}$ shrinks roughly as $F^{-3/2}$
• Polynomial speedup with the number of FPGAs
• Speedup approximately proportional to $(\#\mathrm{FPGA})^{3/2}$
Synthesis Results on one Virtex II XC2V8000 for the Improved Mesh Routing Design
Time for K multiplications = $p \cdot d \cdot 4 \cdot m \cdot$ (clock period)
512-bit Performance with one chip & multiple chips connected in a mesh for Improved Mesh Routing
D = number of columns in matrix A
p = number of columns handled in one cell = 16
n = number of times the sub-multiplications are repeated = $D^2 / (m^2 p)^2$
$T_K$ = routing time for K multiplications in the mesh = $p \cdot d \cdot 4 \cdot m \cdot$ (clock period)
$T_{Load}$ = time for loading and unloading for K multiplications
$T_{Total}$ = total time for the Matrix step = $(3D/K) \cdot n \cdot (T_K + T_{Load})$
1024-bit Performance with one chip & multiple chips connected in a mesh for Improved Mesh Routing
D = number of columns in matrix A
p = number of columns handled in one cell = 16
n = number of times the sub-multiplications are repeated = $D^2 / (m^2 p)^2$
$T_K$ = routing time for K multiplications in the mesh = $p \cdot d \cdot 4 \cdot m \cdot$ (clock period)
$T_{Load}$ = time for loading and unloading for K multiplications
$T_{Total}$ = total time for the Matrix step = $(3D/K) \cdot n \cdot (T_K + T_{Load})$
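The improved-routing model differs from the basic one only in $n$ and $T_K$; a matching sketch, again with placeholder numbers rather than thesis results:

```python
# Improved Mesh Routing variant of the earlier calculator: each cell
# handles p columns, which changes n and T_K. Numbers are placeholders.
def improved_mesh_matrix_step_time(D, m, p, d, K, clock_period, t_load):
    """T_Total = (3*D/K) * n * (T_K + T_Load), with n = D**2 / (m**2 * p)**2
    and T_K = p * d * 4 * m * clock_period."""
    n = D**2 / (m**2 * p)**2
    t_k = p * d * 4 * m * clock_period
    return (3 * D / K) * n * (t_k + t_load)

t = improved_mesh_matrix_step_time(D=10**6, m=250, p=16, d=32, K=32,
                                   clock_period=10e-9, t_load=1e-3)
print(f"{t:.2e} seconds")
```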
Comparison of Basic & Improved Mesh Routing performance vs. the number of FPGAs
[charts: 512-bit and 1024-bit panels]
Speedup of Improved over Basic Mesh Routing vs. the number of Virtex II FPGAs
[charts: 512-bit and 1024-bit panels]