TIE Extensions for Cryptographic Acceleration
Charles-Henri Gros, Alan Keefer, Ankur Singla
Agenda
• Introduction
• Survey of Existing Architectures
• Xtensa+ Crypto Processor
• Rijndael Algorithm (AES final selection)
• RC6, IDEA, and DES
• Performance
• Trade-off Analysis
• Conclusion
Introduction
• Commercial networking applications require flexible, high-throughput secure connectivity
• Encryption and decryption algorithms are computation-intensive
• Multi-session applications place a significant load on embedded processors
• Embedded systems need performance while optimizing power and area
• Our study: a survey of existing architectures, an analysis of Xtensa as an alternative, and a performance and trade-off analysis for embedded use
Survey of Existing Architectures
• Three categories
  • Specialized Crypto Processors
  • Reconfigurable Architectures
  • Full Hardware Implementation (ASICs/FPGAs)
• High variation in architecture complexity
• Performance vs. area tradeoff
• Suitability for embedded applications
Specialized Crypto Processors
• A few VLIW architectures, e.g. CryptoManiac
• Instruction combining: instruction words are combined to exploit ILP
• Crypto arithmetic unit(s): multiple XORs, GF multiplication/addition, lookup-table substitution, and permutation
• Coarse configurability of the datapath
• Mostly lacking SIMD support
• Performance is typically 2x to 6x that of general-purpose processors
Reconfigurable Architectures
• Numerous reconfigurable processor architectures: PipeRench, MorphoSys, COBRA, and GARP
• Functional units that provide all crypto arithmetic: multiple XORs, GF multiplication/addition, modulo multiplication
• Reconfigurable interconnection network to change functional-unit connectivity dynamically
• VLIW instructions
• Reconfiguration registers
• Suitable for block ciphers
• High variability in performance gain relative to general-purpose processors
Full Hardware Implementation
• High-performance implementations targeted at ASICs/FPGAs
• DES: 12 Gbps on a Virtex-E XCV300E
• AES: 18 Gbps on an ASIC using the TSMC 0.18 µm process
• Lacking flexibility and crypto modes
• Memory- and area-efficient
• Latency is typically only the DMA transfer of data to the hardware unit
• Needs an additional processor for the control path
Xtensa+ Crypto Architecture
• Custom extensions to the Xtensa processor using the TIE framework
• Addition of a generic Key Schedule Register File and instructions to support all crypto algorithms studied
• Addition of multiple on-chip SRAMs (in addition to 4 Data-RAMs) to the Xtensa processor
  • Currently implemented using the Table construct in TIE
  • Hacked the Verilog code generated by the TIE compiler to instantiate multiple RAM models (implemented using a multi-dimensional array) for viability analysis
• Addition of 4 State Registers and 4 Next State Registers generic to all algorithms studied
• Possible future extensions include multi-session key storage and fast retrieval support
AES Overview
• AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption
• Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits
• Designed to be efficient in both hardware and software across a variety of platforms
• 10, 12, or 14 rounds depending on key size
• A 128-bit round key is used for each round
  • Round keys can be pre-computed and cached for future encryptions
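The relationship between key size, round count, and round-key storage can be sketched in a few lines. This is only an illustration, not part of the original slides: the function name `aes_params` is hypothetical, and the rule rounds = key_bits / 32 + 6 is the standard AES relation that yields the 10, 12, and 14 rounds listed above. One extra round key is counted for the initial key XOR.

```python
def aes_params(key_bits: int) -> dict:
    """Return the round count and round-key storage for an AES key size.

    Hypothetical helper for illustration only.
    """
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    rounds = key_bits // 32 + 6      # 10, 12, or 14 rounds
    round_keys = rounds + 1          # one extra key for the initial XOR
    return {
        "rounds": rounds,
        "round_keys": round_keys,
        "round_key_bytes": round_keys * 16,  # each round key is 128 bits
    }


for bits in (128, 192, 256):
    print(bits, aes_params(bits))
```

The `round_key_bytes` figure gives a rough idea of how much key-schedule storage a cached, pre-computed schedule would need per session.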
AES Implementation Abstraction
• Each round consists of a lookup, a byte-level permutation, a finite-field multiplication, and a key XOR
• The lookup and multiplication can be combined into four separate 8x32 lookup tables (8-bit index, 32-bit entry), so each round is 16 lookups and 16 XORs
• Decryption is essentially the same, but with different tables and a different key schedule
TIE Implementation
• Our implementation does all 16 lookups in parallel, requiring 16 SRAMs
• x0, x1, x2, x3 represent the round state (each 32 bits), k0, k1, k2, k3 are the current round key, and Tij are the T-boxes, where i is a duplication index and j is the T-box index
• Each T-box is indexed by the low 8 bits of its shifted argument, and the right-hand sides use the state from the previous round
• Each round is then:
  x0 = T00[x0] ^ T01[x1>>8] ^ T02[x2>>16] ^ T03[x3>>24] ^ k0
  x1 = T10[x1] ^ T11[x2>>8] ^ T12[x3>>16] ^ T13[x0>>24] ^ k1
  x2 = T20[x2] ^ T21[x3>>8] ^ T22[x0>>16] ^ T23[x1>>24] ^ k2
  x3 = T30[x3] ^ T31[x0>>8] ^ T32[x1>>16] ^ T33[x2>>24] ^ k3
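The T-box round above can be modeled in software. The following is a minimal sketch, not the authors' TIE code. It assumes AES-128 and little-endian 32-bit words (state row 0 in the low byte), and it uses a single copy of each table `T[j]` because software does not need the duplication index i. All function names are hypothetical. Each round builds the next state from the old one, which mirrors the State and Next State registers.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF

def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def byte(x: int, i: int) -> int:
    """Extract byte i (0 = least significant) of a 32-bit word."""
    return (x >> (8 * i)) & 0xFF

def make_sbox() -> list:
    """Build the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else next(b for b in range(1, 256) if gf_mul(a, b) == 1)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox

SBOX = make_sbox()

# T[0] folds the S-box and one MixColumns column into a 256 x 32-bit table;
# T[1..3] are byte rotations of T[0].
T0 = [gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24) for s in SBOX]
T = [[rotl32(t, 8 * j) for t in T0] for j in range(4)]

def expand_key_128(key: bytes) -> list:
    """Pre-compute the 44 round-key words for a 128-bit key."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "little") for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = rotl32(t, 24)  # RotWord in this little-endian layout
            t = sum(SBOX[byte(t, j)] << (8 * j) for j in range(4)) ^ rcon
            rcon = gf_mul(rcon, 2)
        w.append(w[i - 4] ^ t)
    return w

def encrypt_block(block: bytes, w: list) -> bytes:
    """Encrypt one 16-byte block using four lookups per state word per round."""
    # Initial key XOR.
    x = [int.from_bytes(block[4 * c:4 * c + 4], "little") ^ w[c] for c in range(4)]
    # Nine full rounds: 16 lookups and 16 XORs each.
    for r in range(1, 10):
        x = [T[0][byte(x[c], 0)] ^ T[1][byte(x[(c + 1) % 4], 1)]
             ^ T[2][byte(x[(c + 2) % 4], 2)] ^ T[3][byte(x[(c + 3) % 4], 3)]
             ^ w[4 * r + c] for c in range(4)]
    # Final round has no MixColumns, so it uses the plain S-box.
    x = [sum(SBOX[byte(x[(c + j) % 4], j)] << (8 * j) for j in range(4)) ^ w[40 + c]
         for c in range(4)]
    return b"".join(v.to_bytes(4, "little") for v in x)

key = bytes(range(16))
plaintext = bytes.fromhex("00112233445566778899aabbccddeeff")
print(encrypt_block(plaintext, expand_key_128(key)).hex())
```

The key and plaintext in the usage lines are the AES-128 example from FIPS-197 Appendix C.1, whose published ciphertext is 69c4e0d86a7b0430d8cdb78070b4c55a, so the printed value can be checked against it. In hardware the 16 lookups of one round happen in parallel; here they are simply evaluated one after another.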
Other Ciphers Implemented
• DES (Data Encryption Standard)
  • 64-bit block, 56-bit key, 16 rounds, Feistel network
  • Eight 6x4 S-boxes, XORs, and bit-level permutations
  • Can’t really be done efficiently in software
  • TIE implementation required 1 instruction per round
• IDEA (International Data Encryption Algorithm)
  • 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers
  • 4 multiplications mod 2^16 + 1, 4 additions mod 2^16, 6 XORs
  • Each round is highly sequential, so difficult to parallelize
  • TIE implementation required 7 instructions per round
• RC6
  • Same block and key sizes as AES, 20 rounds, iterated
  • Multiplication mod 2^32, XORs, rotations, addition mod 2^32
  • TIE implementation required 2 instructions per round
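To make the IDEA operation counts concrete, one IDEA round can be sketched as below. This is a plain software illustration of the standard round structure, not the seven TIE instructions from the slides. The names `idea_mul`, `idea_add`, and `idea_round` are hypothetical. In the multiplication mod 2^16 + 1, the 16-bit value 0 stands for 2^16.

```python
MASK16 = 0xFFFF

def idea_mul(a: int, b: int) -> int:
    """Multiply modulo 2^16 + 1, where the value 0 represents 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return (a * b % 0x10001) & MASK16

def idea_add(a: int, b: int) -> int:
    """Add modulo 2^16."""
    return (a + b) & MASK16

def idea_round(x, k):
    """Apply one IDEA round to four 16-bit words x using six subkeys k."""
    x1, x2, x3, x4 = x
    k1, k2, k3, k4, k5, k6 = k
    # Key mixing: 2 multiplications and 2 additions.
    y1 = idea_mul(x1, k1)
    y2 = idea_add(x2, k2)
    y3 = idea_add(x3, k3)
    y4 = idea_mul(x4, k4)
    # Multiply-add structure: 2 multiplications, 2 additions, 2 XORs.
    # Each step depends on the previous one, hence the sequential nature.
    t0 = idea_mul(y1 ^ y3, k5)
    t1 = idea_mul(idea_add(y2 ^ y4, t0), k6)
    t2 = idea_add(t0, t1)
    # Output mixing: 4 XORs, with the two middle words swapped.
    return (y1 ^ t1, y3 ^ t1, y2 ^ t2, y4 ^ t2)

print(idea_round((1, 2, 3, 4), (1, 0, 0, 1, 1, 1)))
```

The subkeys in the usage line are arbitrary small values chosen for readability, not a real key schedule. The chain t0, t1, t2 is where the data dependencies sit, which is consistent with the slide's remark that a round is hard to parallelize.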
AES Performance in Xtensa+
• Performance of the TIE extensions approaches that of non-pipelined ASICs
• Total of 31 run-time instructions per data block
  • 1 initial XOR instruction
  • 1 instruction per round computation (10 total)
  • 20 cycles for load and store of the 128-bit data block
• Generally an order of magnitude better than pure software
• Also faster than reconfigurable hardware or a specialized VLIW processor
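The per-block cost above can be turned into a rough throughput estimate. The sketch below is a back-of-the-envelope model, not a measured result. It assumes every instruction takes one cycle with no stalls, and the clock frequency is a hypothetical input, since the slides do not state one.

```python
BLOCK_BITS = 128  # AES block size

def cycles_per_block(rounds: int = 10, load_store_cycles: int = 20) -> int:
    """Initial XOR + one instruction per round + load/store cycles."""
    return 1 + rounds + load_store_cycles

def throughput_mbps(clock_mhz: float, rounds: int = 10,
                    load_store_cycles: int = 20) -> float:
    """Estimated throughput in Mbps, assuming one cycle per instruction."""
    return clock_mhz * BLOCK_BITS / cycles_per_block(rounds, load_store_cycles)

print(cycles_per_block())  # cycles for one AES-128 block
for mhz in (100.0, 200.0):  # hypothetical clock rates, not from the slides
    print(mhz, "MHz:", round(throughput_mbps(mhz), 1), "Mbps")
```

Under this model, load and store account for 20 of the 31 cycles, so the data movement rather than the round computation bounds the estimate.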
Design Tradeoffs
• Flexibility
  • Algorithm changes
  • New algorithms
  • New encryption modes
  • Implementation bugs
• Time to market
  • Closer to software development time
  • Can choose which parts to accelerate
Conclusion
• Custom Xtensa (TIE) instructions provide flexibility, performance, and power efficiency (Mbps/mW) that all fall somewhere between an ASIC and a VLIW or software-based solution
• Suitable for most embedded applications, such as 802.11i
• Using Xtensa for cryptography is a good choice if:
  • You don’t need the absolute highest throughput
  • You don’t need absolute flexibility
  • You need a control processor anyway
  • The algorithms needed are known ahead of time