AES Microcode Implementation in IXP2400 and a Study of Reconfigurable Crypto Unit
Piyush Ranjan Satapathy
CS203B Class Project Presentation
Road Map
• AES Algorithm Overview
• IXP2400 Platform: A Quick Look
• Microcode: Overview
• Implementation of AES
• Experimental Results
• Reconfigurable Crypto Unit of Intel IXP2850
Algorithm Overview
• Designed by Daemen and Rijmen and selected by NIST as the Advanced Encryption Standard
• Originally called Rijndael
• Symmetric-key block cipher built on byte substitution
• Replacement for DES
• Successful field testing since inception
• Three bit-modes: 128-, 192-, and 256-bit keys
• State defined as a 4x4 array of 16 bytes
• Key size is either 16, 24, or 32 bytes
• A byte is treated as a polynomial over GF(2), i.e. an element of the Galois field GF(2^8)
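To make the last bullet concrete, here is a minimal Python sketch of the byte-as-polynomial view. The helper names `poly_str` and `gf_mul` are illustrative and do not come from the slides; the reduction polynomial is the one fixed by the AES standard.

```python
AES_MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES reduction polynomial


def poly_str(byte):
    """Render a byte as the GF(2) polynomial whose coefficients are its bits."""
    terms = []
    for power in range(7, -1, -1):
        if (byte >> power) & 1:
            terms.append("1" if power == 0 else "x" if power == 1 else f"x^{power}")
    return " + ".join(terms) if terms else "0"


def gf_mul(a, b):
    """Multiply two bytes as polynomials, reducing modulo AES_MODULUS."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1                  # multiply a by x
        if a & 0x100:            # degree reached 8: reduce
            a ^= AES_MODULUS
        b >>= 1
    return result
```

For example, `poly_str(0x57)` renders the byte 0x57 as a polynomial, and `gf_mul(0x57, 0x83)` multiplies two field elements.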
Stages of AES Algorithm
[Figure: detailed view of round n. The result from round n-1 passes through ByteSub, ShiftRow, MixColumn, and AddRoundKey (with round key Kn), and the output is passed to round n+1.]
• Each round performs the following operations:
• Non-linear Layer: No linear relationship between the input and output of a round
• Linear Mixing Layer: Guarantees high diffusion over multiple rounds
• Very small correlation between the bytes of the round input and the bytes of the output
• Key Addition Layer: Bytes of the input are simply XORed with the expanded round key
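The round structure above can be sketched in Python for the 128-bit key case. This is an illustrative reference model under stated assumptions, not the microcode from the project: the state is held as a flat list of 16 bytes in column order, all function names are chosen here for illustration, and the final round omits MixColumn as the AES standard specifies.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8) (left shift, conditional XOR)."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """S-box entry = affine transform of the multiplicative inverse."""
    sbox = []
    for value in range(256):
        inv = 0
        if value:
            inv = 1
            for _ in range(254):          # value^254 is the inverse of value
                inv = gf_mul(inv, value)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Expand a 16-byte key into 11 round keys of 16 bytes each."""
    words = [list(key[i:i + 4]) for i in range(0, 16, 4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]  # RotWord, SubWord
            temp[0] ^= rcon
            rcon = xtime(rcon)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [sum(words[4 * r:4 * r + 4], []) for r in range(11)]


def sub_bytes(s):
    return [SBOX[b] for b in s]


def shift_rows(s):
    # Byte (row r, column c) lives at index 4*c + r; row r rotates left by r.
    return [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += [xtime(a[r]) ^ xtime(a[(r + 1) % 4]) ^ a[(r + 1) % 4]
                ^ a[(r + 2) % 4] ^ a[(r + 3) % 4] for r in range(4)]
    return out


def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]


def encrypt_block(block, key):
    """Encrypt one 16-byte block with a 16-byte key (10 rounds)."""
    round_keys = expand_key(key)
    state = add_round_key(list(block), round_keys[0])
    for rnd in range(1, 10):
        state = mix_columns(shift_rows(sub_bytes(state)))
        state = add_round_key(state, round_keys[rnd])
    state = add_round_key(shift_rows(sub_bytes(state)), round_keys[10])
    return bytes(state)
```

A model like this is mainly useful for checking a hardware or microcode implementation against the published FIPS-197 test vectors; it makes no attempt at speed.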
1. SubBytes Function
• Multiplicative inverse in GF(2^8) followed by an affine transformation
• Direct implementation is complex
• Easily performed by a 16 x 16 (256-entry) LUT ROM
• Simple byte substitution
• Combinational logic
Each byte at the input of a round undergoes a non-linear byte substitution through the substitution box (S-box).
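As a sketch of why the lookup table is the easy route, the following Python builds the 256-entry table directly from the definition (inverse, then affine transform). The names `gf_inv`, `sbox_entry`, and `SBOX` are illustrative; a real implementation would simply store the resulting table in ROM or local memory.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):      # a^254 == a^-1 because a^255 == 1
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(byte):
    """Affine transformation applied to the inverse of the input byte."""
    inv = gf_inv(byte)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# The 16 x 16 lookup table: high nibble selects the row, low nibble the column.
SBOX = [sbox_entry(b) for b in range(256)]


def sub_bytes(state):
    """Apply the table to every byte of a 16-byte state."""
    return [SBOX[b] for b in state]
```

Once the table exists, SubBytes costs one lookup per byte, which is the trade-off the slide points at.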
2. Shift Row
• Shifting done only on the bottom three rows of the State
• Left rotate for encryption
• Right rotate for decryption
Depending on the block length, each row of the block is cyclically shifted by a fixed offset; for the 128-bit AES block, rows 1, 2, and 3 shift by 1, 2, and 3 bytes.
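A minimal Python sketch of the row rotation for the 128-bit block follows. It assumes the State is stored as a list of four rows of four bytes, which is a layout chosen here for readability rather than one taken from the slides.

```python
def shift_rows(state):
    """Encryption: rotate row r left by r positions (row 0 is untouched)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Decryption: rotate row r right by r positions."""
    return [row[4 - r:] + row[:4 - r] for r, row in enumerate(state)]


state = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
shifted = shift_rows(state)
restored = inv_shift_rows(shifted)
```

Because the two functions rotate by the same amounts in opposite directions, `restored` equals the original `state`.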
3. MixColumns Function
• Matrix multiplication in GF(2^8)
• MixColumns functionality resides primarily in the controller and instruction memory
• A series of conditional XOR and left shift operations
Each column is multiplied by the fixed polynomial c(x) = {03}x^3 + {01}x^2 + {01}x + {02}, modulo x^4 + 1. This corresponds to the matrix multiplication b(x) = c(x) a(x):
| b0 |   | 02 03 01 01 |   | a0 |
| b1 | = | 01 02 03 01 | x | a1 |
| b2 |   | 01 01 02 03 |   | a2 |
| b3 |   | 03 01 01 02 |   | a3 |
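The "conditional XOR and left shift" bullet can be illustrated with a short Python sketch. Multiplication by {02} is one left shift plus a conditional XOR with the reduction polynomial, and multiplication by {03} is that result XORed with the original byte. The names `xtime` and `mix_column` are illustrative.

```python
def xtime(b):
    """Multiply by {02}: shift left, then XOR with 0x11B if bit 8 was set."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def mul3(b):
    """Multiply by {03} = {02} + {01}."""
    return xtime(b) ^ b


def mix_column(col):
    """Multiply one 4-byte column by the fixed MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,   # row 02 03 01 01
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,   # row 01 02 03 01
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),   # row 01 01 02 03
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),   # row 03 01 01 02
    ]
```

No general multiplier is needed, which is why the operation maps well onto a processor with only shift and XOR instructions.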
4. Key Expansion and Addition
• Key expansion is performed before both the encrypt and decrypt processes
• Byte values from the Key are read and manipulated into the RoundKey
• Expansion is a series of SubBytes and XOR operations with RCON ROM values and the Key
• Key addition performs an XOR operation between the State and the RoundKey
• Key addition is the only function that is its own inverse, so it needs no separate inverse function
Each word of the State is simply XORed with the expanded round key.
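The following Python sketch shows both halves for a 128-bit key: expansion of the key into 44 four-byte words, and the XOR of a 16-byte state with one round key. Function names and the byte layout are assumptions made for this illustration; the S-box is rebuilt from its definition so the block is self-contained.

```python
def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    sbox = []
    for value in range(256):
        inv = 0
        if value:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, value)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key_128(key):
    """Return the 44 round-key words (4 bytes each) for a 16-byte key."""
    words = [list(key[i:i + 4]) for i in range(0, 16, 4)]
    rcon = 1                                   # first RCON value
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]         # rotate the word by one byte
            temp = [SBOX[b] for b in temp]     # SubBytes on each byte
            temp[0] ^= rcon                    # XOR with the RCON value
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return words


def add_round_key(state, round_words):
    """XOR a 16-byte state with the four words of one round key."""
    key_bytes = [b for word in round_words for b in word]
    return bytes(s ^ k for s, k in zip(state, key_bytes))
```

Applying `add_round_key` twice with the same round key returns the original state, which is why decryption can reuse the same routine.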
IXP2400 Platform: A Quick Look
• Achieves high processing performance
• Offers programming flexibility
• Cheaper than an ASIC
Microcode Overview
• alu[dest1, a, +, b]: ALU addition of a and b, with the result stored in dest1
• alu[dest2, dest1, -, c]: ALU subtraction
• move(reg1, reg2): Register-to-register move; both are GPRs
• immed[reg, 0x0020]: Immediate value assignment to a register
• local_csr_wr[ACTIVE_LM_ADDR_0, 0x0]: Local memory indexing with index 0
• .begin … endm: Macro begin and end
• .if … .endif: Conditional (if) block
• xbuf_alloc($$state, 4, read): Buffer allocation in DRAM transfer registers
• .reg gen_register $sram_reg $$dram_reg: Register declaration
• .sig sram_sig dram_sig: Signal declaration
• .while … .endw: While loop
• #for round[1,2,3,4,5,6,7,8,9,10] … #endloop: For loop
• alu_shf[index, --, B, s0, >>24]: ALU shift; passes operand B (s0) shifted right by 24 bits
• scratch[read, $T, index, 0, 1], ctx_swap[sram_sig]: Scratch read instruction; the context swaps out until sram_sig arrives
• ld_field_w_clr[t1, 1000, $T]: Writes the byte selected by mask 1000 from $T into t1 and clears the other bytes
• dram[write, $$out[0], dst_addr, 0, 2], sig_done[dram_sig]: DRAM write; dram_sig is raised on completion
• ctx_arb[dram_sig], ctx_arb[kill]: Signaling; wait for dram_sig, or kill the context
Implementation Setup
• Environmental Setup:
• Intel IXP 4.1
• 600-MHz ME configurations
• 200-MHz SRAMs
• 150-MHz RDRAMs
• Executed with multiple threads
• Executed on different microengines
Experimental Results (1)
[Figures: SRAM utilization; ME utilization (%)]
Experimental Results (2)
[Figures: two panels, each titled "Throughput Performance Across Threads in 1 ME"]
3DES Core
• 2 cores per crypto unit
• Takes a 192-bit key: (56-bit + 8-bit parity) x 3 keys
• Operates on 8-byte blocks
• Result is written to ME transfer registers or a TBUF element
• Result can be passed to the SHA-1 unit for hashing
[Figure: security processing, pipelining, and interleaving using three wires and one core; multiple keys and IVs]
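The key layout in the second bullet can be sketched in Python. This assumes the usual DES convention that the least significant bit of each key byte is an odd-parity bit; the helper names are hypothetical and are not part of the IXP2850 crypto API.

```python
def split_3des_key(key):
    """Split a 24-byte (192-bit) key into three 8-byte DES keys."""
    if len(key) != 24:
        raise ValueError("3DES key must be 24 bytes")
    return key[0:8], key[8:16], key[16:24]


def has_odd_parity(byte):
    """True if the byte contains an odd number of 1 bits."""
    return bin(byte).count("1") % 2 == 1


def fix_parity(key):
    """Set the low bit of every byte so that each byte has odd parity."""
    out = bytearray()
    for b in key:
        upper = b & 0xFE                        # the 7 key bits
        parity_bit = 0 if has_odd_parity(upper) else 1
        out.append(upper | parity_bit)
    return bytes(out)


def effective_key_bits(key):
    """Key bits left once the one parity bit per byte is discounted."""
    return len(key) * 7


key = fix_parity(bytes(range(24)))
k1, k2, k3 = split_3des_key(key)
```

Each of `k1`, `k2`, and `k3` carries 56 key bits plus 8 parity bits, matching the slide's breakdown of the 192-bit key.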
AES Core
• All AES key sizes are supported: 128, 192, or 256 bits
• Both encryption and decryption are supported
• Operates on 16-byte blocks
[Figure: AES key scheduler]
SHA-1 Core
• 2 SHA-1 cores per crypto unit
• Operates on 64-byte blocks
• Data is loaded from Input RAM or the crypto cores into the SHA-1 buffer
• Can operate on unmodified packet data or on the ciphered packet data
• Operates on a 512-bit block size and has a data buffer to accumulate the ciphered data
• This gives the flexibility to run SHA-1, AES, and 3DES at different rates
[Figure: SHA-1 critical path analysis]
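Since the core consumes fixed 64-byte (512-bit) blocks, a message has to be padded and split before hashing. The Python sketch below shows the standard SHA-1 padding rule; the function names are illustrative, and on the IXP2850 this step would be handled by whatever feeds the SHA-1 buffer.

```python
BLOCK_BYTES = 64  # 512-bit SHA-1 block


def sha1_pad(message):
    """Append 0x80, zero bytes, and the 64-bit big-endian bit length."""
    bit_length = len(message) * 8
    padded = message + b"\x80"
    # Zero-fill so that 8 bytes remain in the last block for the length.
    padded += b"\x00" * ((56 - len(padded) % BLOCK_BYTES) % BLOCK_BYTES)
    padded += bit_length.to_bytes(8, "big")
    return padded


def split_blocks(padded):
    """Cut the padded message into 64-byte blocks."""
    return [padded[i:i + BLOCK_BYTES] for i in range(0, len(padded), BLOCK_BYTES)]


blocks = split_blocks(sha1_pad(b"abc"))
```

A 55-byte message still fits in one block, while a 56-byte message spills into a second block because the length field no longer fits.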
Some of the Crypto Commands
• crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig): Perform the write and wait for it to complete
• crypto_load_iv($$iv[0], 1, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, iv_sig): Load the IV data
• crypto_load_key($$key[0], 3, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, key_sig): Load the key
• crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_3DES, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig): Run the cipher (here 3DES encryption without CBC)
Acknowledgement
• Yan Luo
• Chris Baron
• http://cnscenter.future.co.kr/resource/rsc-center/presentation/intel/spring2003/S03USCPTS92_OS.pdf (for some slides)
• Mel Tsai, UC Berkeley (for some slides)
• Thomas Sodon et al., EE, College of New Jersey
• Zhangxi Tan et al., Tsinghua University