High Throughput AES
Alireza Hodjat, IVGroup
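The AES Algorithm
[Figure: block diagram of the algorithm. Data path: Key Addition (key input kn), then rounds of Substitution, Shift Row, Mix Column, and Key Addition (round key ki); the last round has Substitution, Shift Row, and Key Addition only. Key schedule path alongside each round: Key Sch_Sub, Key Sch_rt, Key Sch_xor.]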
The Highest Possible Throughput
• The choice of 128-bit key only
• Completely unrolled loop
• Pipelined
  • Between each round (Outer-round)
  • Inside each round (Inner-round)
• This causes huge area consumption.
Area Optimization
• Area optimization inside each round
• Two different techniques:
  • Resource sharing
  • Re-timing
• Break the critical path and perform the algorithm in multiple clock cycles
• Critical path: Substitution
• Area-delay trade-off
Sbox Area-Delay Trade-off
• Direct Implementation: Look-up table
• Indirect Implementation: GF(2^4)
  • Wolkerstorfer Design
  • Patrick’s codes

Sbox area-delay trade-off for FPGA
Design Type | Critical path | Area | Re-timing
Direct, No-Pipeline | 4.05 ns | 136 LUTs | No
Indirect, No-Pipeline | 10.41 ns | 94 LUTs | No
Direct, One stage pipeline | 3.91 ns | 136 LUTs | Yes, 2 pipe stages
Indirect, Three stage pipeline | 5.95 ns | 90 LUTs | Yes, 3 pipe stages
Direct, No-pipeline, Using Block RAM | 4.87 ns | 0 LUTs | No

Sbox area-delay trade-off for ASIC
Design Type | Critical path | Area | Re-timing
Direct, No-Pipeline | 1.19 ns | 2.086 Kgates | No
Indirect, No-Pipeline | 3.67 ns | 1.167 Kgates | No
Direct, One stage pipeline | 0.78 ns | 3.51 Kgates | Yes, 2 pipe stages
Indirect, Three stage pipeline | 1.11 ns | 1.65 Kgates | Yes, 3 pipe stages
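The two Sbox styles can be contrasted in software with a minimal sketch. It rests on assumptions: the function names are illustrative, and the computed variant inverts directly in GF(2^8) by exponentiation. It does not use the GF(2^4) composite-field decomposition that the indirect hardware designs are built on. It only shows that the Sbox can either be computed from field arithmetic or read from a 256-entry table.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # reduce by the AES polynomial
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by AES convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 == a^-1 because a^255 == 1
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_computed(a):
    """Computed Sbox: field inversion followed by the AES affine transform."""
    inv = gf_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# Direct style: precompute once, then every substitution is one table read.
SBOX_TABLE = [sbox_computed(x) for x in range(256)]


def sbox_direct(a):
    return SBOX_TABLE[a]
```

In hardware the table corresponds to LUTs or Block RAM, while the computed form trades a longer critical path for less area, which is the trade-off shown in the tables above.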
AES Encrypt Datapath
[Figure: the 128-bit state, drawn as four columns (1 to 4) of four bytes, passes through 16 Sbox units (S), four Mix Column units (M), and 16 byte-wide key additions (+).]
Key Scheduling Datapath
[Figure: the 128-bit key, drawn as four columns (1 to 4) of four bytes, passes through four Sbox units (S) and byte-wide XORs (+).]
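The key scheduling steps (Key Sch_rt, Key Sch_Sub, Key Sch_xor) can be written out as a small software model. This is a sketch of the standard AES-128 key expansion, not of the hardware datapath in the slides. The function names are illustrative, and the Sbox is computed from field arithmetic only to keep the example self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


def sbox_value(a):
    """AES Sbox entry: inversion in GF(2^8), then the affine transform."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):  # a^254 is the inverse of a
            inv = gf_mul(inv, a)
    out = inv
    for n in range(1, 5):
        out ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_value(x) for x in range(256)]
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def expand_key_128(key):
    """Return the 11 round keys (16 bytes each) for a 128-bit key."""
    assert len(key) == 16
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]          # Key Sch_rt: rotate the word
            temp = [SBOX[b] for b in temp]      # Key Sch_Sub: four Sbox lookups
            temp[0] ^= RCON[i // 4 - 1]         # round constant
        # Key Sch_xor: combine with the word four positions back
        words.append([w ^ t for w, t in zip(words[i - 4], temp)])
    return [bytes(sum(words[4 * r:4 * r + 4], [])) for r in range(11)]
```

Only one word in four goes through the Sbox, which matches the four S units in the key scheduling datapath.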
Design 1: Straight Forward
[Figure: one round built from 16 Sbox units (S), four Mix Column units (M), and 16 byte-wide key additions (+). The pipeline stages of the round take 1 cycle each.]
Design 2: Use re-timing for Sbox
[Figure: one round as in Design 1, but each of the 16 Sbox units is split into two re-timed halves (32 S blocks in total), followed by four Mix Column units (M) and 16 byte-wide key additions (+). The pipeline stages of the round take 1 cycle each.]
Design 3: Use resource sharing
[Figure: one round with four shared Sbox units (S-A, S-B, S-C, S-D), one Mix Column unit (M), and four byte-wide key additions (+). Each stage takes 4 cycles.]
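Resource sharing can be pictured in software as one Mix Column unit that is reused for the four columns of the state, one column per cycle. The sketch below is a plain model of the standard Mix Column operation, not the hardware design itself, and the names `mix_column` and `mix_columns_shared` are illustrative.

```python
def xtime(a):
    """Multiply a byte by 2 in GF(2^8) modulo the AES polynomial."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mix_column(col):
    """Apply the AES Mix Column matrix (2 3 1 1 / 1 2 3 1 / ...) to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns_shared(columns):
    """Model a single shared unit: each loop iteration stands for one clock cycle."""
    results = []
    for col in columns:  # four iterations, i.e. four cycles for one 128-bit state
        results.append(mix_column(col))
    return results
```

A fully parallel design would instantiate `mix_column` four times and finish in one cycle; the shared version spends four cycles and a quarter of the logic.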
Design 4: Use resource sharing and re-timing for Sbox
[Figure: one round with four shared Sbox units, each split into two re-timed halves (S-A-1 to S-D-1, then S-A-2 to S-D-2), one Mix Column unit (M), and four byte-wide key additions (+). Each stage takes 5 cycles.]
Design 5: Resource sharing and pipelining and re-timing for Sbox
[Figure: one round with four pipeline stages of 1 cycle each: Sbox first half (S-A-1 to S-D-1), Sbox second half (S-A-2 to S-D-2), Mix Column, and four byte-wide key additions (+).]
Inner-Round Pipeline for Design 5
[Figure: timing diagram with time running downward. Columns 1 to 4 of the state enter the stages S1, S2, M, A on successive cycles, so that all four stages are busy once the pipeline has filled; the same pattern repeats for Round 1, Round 2, and so on.]
Performance Estimation
Design | #1 | #2 | #3 | #4 | #5
Clock per Sample | 1 | 1 | 4 | 5 | 4
Pipe stages per round | 4 stages | 4 stages | 3 stages | 4 stages | 4 stages
Total pipe stages | 4 × 10 stages | 4 × 10 stages | 3 × 10 stages | 4 × 10 stages | 4 × 10 stages
Latency | 4 × 10 cycles | 4 × 10 cycles | 4 × 3 × 10 cycles | 5 × 3 × 10 cycles | (4 × 10) + 4 cycles
FPGA Throughput (200 MHz) | 25.6 Gbit/s | 25.6 Gbit/s | 6.4 Gbit/s | 6.4 Gbit/s | 6.4 Gbit/s
ASIC Critical path | 1.5 ns (650 MHz) | 1 ns (1 GHz) | 1.5 ns (650 MHz) | 1 ns (1 GHz) | 1 ns (1 GHz)
Estimated Area | Less than 500 Kgates | Less than 900 Kgates | Less than 150 Kgates | Less than 300 Kgates | Less than 250 Kgates
ASIC Throughput | 83.2 Gbit/s (128 × 650 MHz) | 128 Gbit/s (128 × 1 GHz) | 20.8 Gbit/s (128 × 650 MHz / 4) | 25.6 Gbit/s (128 × 1 GHz / 5) | 32 Gbit/s (128 × 1 GHz / 4)
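The ASIC throughput figures follow from one formula: 128 bits per block, times the clock frequency, divided by the clocks per sample. A short sketch reproduces them; the function and dictionary names are illustrative, and the clock values are the rounded frequencies quoted in the table.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_gbps(clock_ghz, clocks_per_sample):
    """Throughput in Gbit/s for one 128-bit block every `clocks_per_sample` cycles."""
    return BLOCK_BITS * clock_ghz / clocks_per_sample


# design number -> (ASIC clock in GHz, clocks per sample), taken from the table
ASIC_DESIGNS = {
    1: (0.65, 1),
    2: (1.0, 1),
    3: (0.65, 4),
    4: (1.0, 5),
    5: (1.0, 4),
}

ASIC_THROUGHPUT = {
    design: throughput_gbps(clock, cps)
    for design, (clock, cps) in ASIC_DESIGNS.items()
}

if __name__ == "__main__":
    for design, gbps in ASIC_THROUGHPUT.items():
        print(f"Design #{design}: {gbps:.1f} Gbit/s")
```

The same function with a 0.2 GHz clock gives the 200 MHz FPGA figure for the one-clock-per-sample designs, 128 × 0.2 = 25.6 Gbit/s.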