
TIE Extensions for Cryptographic Acceleration

TIE Extensions for Cryptographic Acceleration. Charles-Henri Gros Alan Keefer Ankur Singla. Agenda. Introduction Survey of Existing Architectures Xtensa+ Crypto Processor Rijndael Algorithm (AES final selection) RC6, IDEA, and DES Performance Trade-off Analysis Conclusion.


Presentation Transcript


  1. TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla

  2. Agenda • Introduction • Survey of Existing Architectures • Xtensa+ Crypto Processor • Rijndael Algorithm (AES final selection) • RC6, IDEA, and DES • Performance • Trade-off Analysis • Conclusion

  3. Introduction • Commercial networking applications require flexible, high-throughput secure connectivity • Encryption/decryption algorithms are computation-intensive • Multi-session applications place a significant load on embedded processors • Embedded systems need performance while optimizing power and area • Our study: existing architectures, analysis of Xtensa as an alternative, and performance analysis and trade-offs for embedded systems

  4. Survey of Existing Architectures • Three categories • Specialized Crypto Processors • Reconfigurable Architectures • Full Hardware Implementation (ASICs/FPGAs) • High Variation in architecture complexity • Performance vs Area tradeoff • Suitability for Embedded Applications

  5. Specialized Crypto Processors • Few VLIW architectures - CryptoManiac • Instruction Combining – Instruction Word combining to exploit ILP • Crypto Arithmetic Unit(s) – multiple XORs, GF multiplication/addition, lookup table substitution, and permutation • Coarse configurability of datapath • Mostly lacking SIMD support • Performance is typically 2x to 6x that of general processors

  6. Reconfigurable Architectures • Numerous reconfigurable processor architectures – PipeRench, MorphoSys, COBRA, and GARP • Functional Units that provide all crypto arithmetic - multiple XORs, GF multiplication/addition, modulo multiplication • Reconfigurable Interconnection Network to provide dynamic change to functional unit connectivity • VLIW Instructions • Reconfiguration Registers • Suitable for Block Ciphers • High Variability in Performance increase w.r.t Processors

  7. Full Hardware Implementation • High performance implementations targeted to ASICs/FPGAs • DES – 12 Gbps on Virtex-E XCV300E • AES – 18 Gbps on ASIC using TSMC 0.18 µm process • Lacking flexibility and crypto-modes • Memory and Area efficient • Typical latency only in DMA of data to Hardware unit • Need additional processor for control path

  8. Xtensa+ Crypto Architecture • Custom Extensions to Xtensa Processor using the TIE framework • Addition of Generic Key Schedule Register File and Instructions to support all Crypto Algorithms studied • Addition of multiple on-chip SRAMs (in addition to 4 Data-RAMs) to the Xtensa processor • Currently Implemented using Table construct in TIE • Hacked TIE Compiler generated Verilog Code to instantiate multiple RAM models (implemented using multi-dimensional array) for viability analysis • Addition of 4 State Registers and 4 Next State Registers generic to all algorithms studied • Possible future extensions to include multi-session key storage and fast retrieval support

  9. AES Overview • AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption • Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits • Designed to be efficient in both hardware and software across a variety of platforms • 10, 12, or 14 rounds depending on key size • 128-bit round key used for each round • Round keys can be pre-computed and cached for future encryptions
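
The relationship between key size, round count, and the storage needed to cache pre-computed round keys can be sketched as follows. This is a minimal illustration based on the AES specification (rounds = key words + 6, plus one extra round key for the initial XOR); the function name `aes_schedule_size` is hypothetical and does not come from the presentation.

```python
BLOCK_BITS = 128  # the AES block size is fixed


def aes_schedule_size(key_bits):
    """Return (rounds, round_keys, schedule_bytes) for an AES key size."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    rounds = key_bits // 32 + 6        # 10, 12, or 14 rounds
    round_keys = rounds + 1            # one extra key for the initial XOR
    schedule_bytes = round_keys * BLOCK_BITS // 8
    return rounds, round_keys, schedule_bytes


for bits in (128, 192, 256):
    print(bits, aes_schedule_size(bits))
```

The byte counts give a rough idea of how much key-schedule storage a cached session would occupy.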

  10. AES Implementation Abstraction • Each round consists of a lookup, byte-level permutation, finite field multiplication, and key XOR • Lookup and multiplication can be combined into four separate 8x32 lookup tables, so each round is 16 lookups and 16 XORs • Decryption is essentially the same, but with different tables and a different key schedule
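
To illustrate how the S-box lookup and the finite field multiplication fold into four 8x32 tables, the sketch below derives the AES S-box and builds the tables in software. It is an illustration only, not the authors' TIE code: the helper names (`gf_mul`, `build_sbox`, `build_tboxes`) are hypothetical, and the little-endian packing (row 0 of a column in the low byte) is an assumption chosen for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11b."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox():
    """Derive the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):       # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        y = inv
        for s in range(1, 5):          # XOR of four byte rotations
            y ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


def rotl32(w, n):
    """Rotate a 32-bit word left by n bits (0 <= n < 32)."""
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF


def build_tboxes():
    """Return [T0, T1, T2, T3]; each maps a byte to a 32-bit column word."""
    sbox = build_sbox()
    t0 = []
    for a in range(256):
        s = sbox[a]
        # MixColumns coefficients (2, 1, 1, 3) for an input byte in row 0
        t0.append(gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24))
    # Tables for rows 1..3 are byte rotations of T0
    return [[rotl32(w, 8 * j) for w in t0] for j in range(4)]


T = build_tboxes()
# One output column = four lookups XORed together
column = T[0][0x19] ^ T[1][0xF4] ^ T[2][0x8D] ^ T[3][0x08]
print(hex(column))
```

Because the four tables are rotations of one another, a software implementation can trade table storage for rotate instructions; a fully parallel hardware lookup has no such option and needs a separate copy per lookup.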

  11. TIE Implementation • Our implementation does all 16 lookups in parallel, requiring 16 SRAMs • x0, x1, x2, x3 represent the round state (each 32 bits), k0, k1, k2, k3 are the words of the current round key, and Tij are the T-boxes, where i is a duplication index and j is the T-box index • Each table is indexed by the low 8 bits of its shifted operand, and all right-hand sides use the state from the previous round • Each round is then: x0 = T00[x0]^T01[x1>>8]^T02[x2>>16]^T03[x3>>24] ^ k0; x1 = T10[x1]^T11[x2>>8]^T12[x3>>16]^T13[x0>>24] ^ k1; x2 = T20[x2]^T21[x3>>8]^T22[x0>>16]^T23[x1>>24] ^ k2; x3 = T30[x3]^T31[x0>>8]^T32[x1>>16]^T33[x2>>24] ^ k3
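
A software model of one such round, with one round-key word per state word as AES requires, might look like the sketch below. It is a sequential stand-in for what the slide describes as 16 parallel SRAM lookups, not the TIE implementation itself. The helper names are hypothetical, a single set of four tables stands in for the four duplicated copies, and the state is assumed to be packed as little-endian column words. The sample input is the round-1 state and round key from the worked example in FIPS-197 Appendix B.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11b."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_tboxes():
    """Build T0..T3 (S-box combined with MixColumns), little-endian packed."""
    tables = [[], [], [], []]
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):       # x^254 is the field inverse
                inv = gf_mul(inv, x)
        s = inv ^ 0x63
        for r in range(1, 5):          # affine map of the S-box
            s ^= ((inv << r) | (inv >> (8 - r))) & 0xFF
        w = gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24)
        for j in range(4):             # Tj is T0 rotated left by 8*j bits
            tables[j].append(((w << 8 * j) | (w >> (32 - 8 * j))) & 0xFFFFFFFF)
    return tables


def aes_round(x, k, T):
    """One AES encryption round: 16 table lookups and 16 XORs."""
    nxt = []
    for c in range(4):
        w = k[c]                       # round-key word for this column
        for j in range(4):
            # byte j of state word (c + j) mod 4 selects the T-box entry
            w ^= T[j][(x[(c + j) % 4] >> (8 * j)) & 0xFF]
        nxt.append(w)
    return nxt                         # written only after all reads


def to_words(data):
    return [int.from_bytes(data[i:i + 4], "little") for i in range(0, 16, 4)]


def to_bytes(words):
    return b"".join(w.to_bytes(4, "little") for w in words)


T = build_tboxes()
state = to_words(bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808"))
rkey = to_words(bytes.fromhex("a0fafe1788542cb123a339392a6c7605"))
result = to_bytes(aes_round(state, rkey, T))
print(result.hex())
```

Collecting the results in a separate list before returning mirrors the role of the next-state registers: every lookup reads the old state, and the new state replaces it only once the round is complete.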

  12. Other Ciphers Implemented • DES (Data Encryption Standard) • 64-bit block, 56-bit key, 16 rounds, Feistel network • Eight 6x4 S-boxes, XORs, and bit-level permutations • Can’t really be done efficiently in software • TIE Implementation required 1 Instruction per round • IDEA (International Data Encryption Algorithm) • 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers • 4 multiplications mod 2^16 + 1, 4 adds mod 2^16, 6 XORs • Each round is highly sequential, so difficult to parallelize • TIE Implementation required 7 Instructions per round • RC6 • Same block and key modes as AES, 20 rounds, iterated • Multiplication mod 2^32, XORs, rotations, addition mod 2^32 • TIE Implementation required 2 Instructions per round
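
The arithmetic primitives named for IDEA and RC6 can be sketched in software as follows. This is only a reference model of the operations, written from the published cipher definitions, and says nothing about how the TIE instructions are partitioned. The function names are hypothetical, and the round-key words `s0` and `s1` are placeholders supplied by the caller.

```python
MASK32 = 0xFFFFFFFF


def idea_mul(a, b):
    """IDEA multiplication modulo 2^16 + 1; the value 0 stands for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % 0x10001) & 0xFFFF   # a result of 2^16 maps back to 0


def rotl32(w, n):
    """Rotate a 32-bit word left; only the low 5 bits of n are used."""
    n &= 31
    return ((w << n) | (w >> (32 - n))) & MASK32


def rc6_round(a, b, c, d, s0, s1):
    """One RC6 encryption round on four 32-bit words."""
    t = rotl32((b * (2 * b + 1)) & MASK32, 5)   # multiplication mod 2^32
    u = rotl32((d * (2 * d + 1)) & MASK32, 5)
    a = (rotl32(a ^ t, u) + s0) & MASK32        # data-dependent rotation
    c = (rotl32(c ^ u, t) + s1) & MASK32
    return b, c, d, a                           # word rotation for next round


print(idea_mul(2, 3))
print(rc6_round(0, 1, 0, 1, 5, 7))
```

In the RC6 round the two halves (computing `t` and `a`, computing `u` and `c`) depend on each other only through the rotation amounts, whereas IDEA's multiplications feed one another in sequence, which is consistent with the slide's remark that IDEA is harder to parallelize.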

  13. AES Performance in Xtensa+ • Performance of TIE extensions approaches performance of non-pipelined ASICs • Total of 31 run-time instructions per data-block • Initial EXOR Instruction • 1 Instruction per round computation (10 total) • 20 Cycles for Load and Store of 128-bit Data Blocks • Generally an order of magnitude better than pure software • Also faster than reconfigurable hardware or a specialized VLIW processor
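
The per-block count above adds up as 1 + 10 + 20 = 31, and it can be turned into a throughput estimate as sketched below. The sketch assumes one cycle per instruction and uses a hypothetical 200 MHz clock purely for illustration; neither figure is stated in the presentation, and the function names are made up for this example.

```python
def cycles_per_block(rounds=10, load_store_cycles=20):
    """Initial XOR + one instruction per round + load/store cycles."""
    return 1 + rounds + load_store_cycles


def throughput_mbps(clock_mhz, cycles, block_bits=128):
    """Bits per block times blocks per microsecond gives Mbps."""
    return block_bits * clock_mhz / cycles


cycles = cycles_per_block()
print(cycles)
# 200 MHz is an assumed clock, not a figure from the slides
print(round(throughput_mbps(200, cycles), 1))
```

With 20 of the 31 cycles spent moving data, the load and store path, not the round computation, dominates the per-block cost in this model.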

  14. Mbps of Throughput

  15. Cycles Per Block

  16. Design Tradeoffs • Flexibility • Algorithm changes • New algorithms • New encryption modes • Implementation bugs • Time to Market • Closer to software development time • Can choose which parts to accelerate

  17. Power vs. Performance: Mbps/mW

  18. Conclusion • Xtensa instructions provide flexibility, performance, and Mbps/mW all somewhere between an ASIC and a VLIW or Software-based solution • Suitable for most Embedded Applications like 802.11i, etc. • Using Xtensa for cryptography is a good choice if: • You don’t need absolute throughput • You don’t need absolute flexibility • You need a control processor anyway • The algorithms needed are known ahead of time
