Concurrent Error Detection Architectures for Symmetric Block Ciphers
Ramesh Karri, Kaijie Wu, Y. Kim and P. Mishra
CAD Lab, Department of Electrical Engineering, Polytechnic University
(ramesh@india.poly.edu, kwu03@utopia.poly.edu, ykim01@utopia.poly.edu, pmishr01@utopia.poly.edu)
Purpose • Investigate systematic approaches to low-cost, low-latency Concurrent Error Detection (CED) schemes for symmetric block ciphers
Outline
• Describe symmetric block ciphers
  • structure
  • operation
  • inverse properties at various levels
• Discuss Advanced Encryption Standard (AES) candidate algorithms
• Review hardware and time redundancy based CED approaches
• Present encryption-decryption-inversion based CED approach
• Implementation results
• Case study
• Conclusion
Symmetric Block Ciphers • Basic iterative looping structure • Optional pre- and post-processing • Each round uses multiple operations and round key(s) • Decryption consists of applying the same or inverse operations in reverse order
128-Bit Symmetric Block Cipher
[Figure: encryption and decryption data paths. Encryption: 128-bit plain text, pre round, rounds 1 to r (each applying operations 1 to i with round keys 1 to r), post round, 128-bit cipher text. Decryption: the same stages with rounds taken in the order r to 1.]
Operations used in 128-bit Symmetric Block Ciphers
• RC6: multiplication (mod 2^32), 5-bit rotation, exclusive or, variable rotation, + (mod 2^32) (key)
• Rijndael: s-box, fixed rotation, multiplication (GF(2^8)), exclusive or (key)
• Serpent: exclusive or (key), s-box, exclusive or (linear transform)
• Twofish: s-box, multiplication (GF(2^8)), + (mod 2^32), + (mod 2^32) (key), exclusive or, 1-bit rotation
• MARS: s-box, + and × (mod 2^32) (key), exclusive or, variable rotation, 5-bit and 13-bit rotation, - (mod 2^32)
• s-box: non-linear bit-wise substitution; GF: Galois field
Hardware architecture for Symmetric Block Ciphers
• An encryption device consists of:
  • an encryption module
  • a decryption module
  • a key RAM
  • an input and an output data port
• Only one operation, either encryption or decryption, is performed at a time, so a single key RAM and a single set of I/O data ports suffice
CED architectures for Symmetric Block Ciphers - Motivation • Fault-based attacks use radiation or some other external source to introduce errors into an encryption device • Then inputs are applied to the faulty device and its outputs observed to obtain the keys stored in the device or discover the implementation structure • CED followed by suppression of output on fault is an effective approach against such fault-based side-channel cryptanalysis
CED architectures for Symmetric Block Ciphers
• Straightforward duplication of basic hardware followed by comparison
  • Minimum detection latency
  • Duplicating the basic architecture costs 100% area overhead
• Re-computation using the basic hardware followed by comparison
  • Minimum area overhead
  • 100% time overhead
  • Only transient-fault detection capability
• Proposed approach: exploits the inverse relationship between encryption and decryption
  • 40% area overhead
  • Low fault-detection latency
Encryption-decryption Inverse relationship
[Figure: encryption and decryption data paths side by side. Encryption: plain text, pre round, rounds 1 to r (operations 1 to i, round keys 1 to r), post round, cipher text. Decryption: the same stages with rounds taken in the order r to 1.]
• Plain text = Decryption(Encryption(plain text, key), key)
• Cipher text = Encryption(Decryption(cipher text, key), key)
• True at the algorithm level, round level and operation level
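The inverse relationship can be checked on a toy example. The sketch below is not one of the AES candidates: it assumes a hypothetical 32-bit cipher built from three operations of the kind listed earlier (key XOR, addition mod 2^32, 5-bit rotation), and the names `OPERATIONS`, `enc_round`, `dec_round` and `ROUND_KEYS` are illustrative only.

```python
MASK = 0xFFFFFFFF  # toy cipher works on 32-bit words, not 128-bit blocks


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


# Each operation is a (forward, inverse) pair taking (word, round_key).
OPERATIONS = [
    (lambda x, k: x ^ k, lambda y, k: y ^ k),                    # key XOR
    (lambda x, k: (x + k) & MASK, lambda y, k: (y - k) & MASK),  # + (mod 2^32)
    (lambda x, k: rotl(x, 5), lambda y, k: rotr(y, 5)),          # 5-bit rotation
]


def enc_round(x, k):
    """Apply the operations in order."""
    for forward, _ in OPERATIONS:
        x = forward(x, k)
    return x


def dec_round(y, k):
    """Apply the inverse operations in reverse order."""
    for _, inverse in reversed(OPERATIONS):
        y = inverse(y, k)
    return y


def encrypt(plain, round_keys):
    for k in round_keys:
        plain = enc_round(plain, k)
    return plain


def decrypt(cipher, round_keys):
    for k in reversed(round_keys):
        cipher = dec_round(cipher, k)
    return cipher


ROUND_KEYS = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4]  # arbitrary toy keys
x = 0x12345678
k0 = ROUND_KEYS[0]

# Operation level: each inverse undoes its forward operation.
assert all(inv(fwd(x, k0), k0) == x for fwd, inv in OPERATIONS)
# Round level: a decryption round undoes the matching encryption round.
assert dec_round(enc_round(x, k0), k0) == x
# Algorithm level, in both directions.
assert decrypt(encrypt(x, ROUND_KEYS), ROUND_KEYS) == x
assert encrypt(decrypt(x, ROUND_KEYS), ROUND_KEYS) == x
```

The three CED approaches differ only in which of these three checks is performed in hardware.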
Approach 1: Algorithm Level CED
• Output of encryption is fed to decryption; the result is compared with the original plain text
• Low area overhead: a 128-bit register, four (2:1) 128-bit multiplexers and a 128-bit comparator
• 100% performance penalty (time for encryption = 2 × number of rounds (r) × cycles per round (n))
• Large fault detection latency (2 × number of rounds (r) × cycles per round (n))
[Figure: plain text and a register feed rounds 1 to r of the encryption module and rounds 1 to r of the decryption module; a comparator, a random value and the path to the output data port complete the datapath.]
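A minimal software sketch of algorithm-level CED follows. It assumes a hypothetical 32-bit toy cipher rather than a real 128-bit one, models a fault as a bit flip injected after a chosen round (`fault_round` and `fault_mask` are illustrative parameters), and returns `None` to stand for suppressing the output.

```python
MASK = 0xFFFFFFFF  # toy 32-bit state instead of a 128-bit block


def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def enc_round(x, k):
    """Toy round: key XOR, 5-bit rotation, key addition mod 2^32."""
    return (rotl(x ^ k, 5) + k) & MASK


def dec_round(y, k):
    """Inverse operations applied in reverse order."""
    return rotr((y - k) & MASK, 5) ^ k


def encrypt(plain, keys, fault_round=None, fault_mask=0):
    """Encrypt; optionally flip bits of the state after round fault_round."""
    state = plain
    for i, k in enumerate(keys):
        state = enc_round(state, k)
        if i == fault_round:
            state ^= fault_mask  # injected fault (assumption for illustration)
    return state


def decrypt(cipher, keys):
    state = cipher
    for k in reversed(keys):
        state = dec_round(state, k)
    return state


def encrypt_with_algorithm_level_ced(plain, keys, fault_round=None, fault_mask=0):
    saved_plain = plain                      # role of the register
    cipher = encrypt(plain, keys, fault_round, fault_mask)
    recovered = decrypt(cipher, keys)        # full decryption pass: 100% time penalty
    if recovered != saved_plain:             # role of the comparator
        return None                          # suppress the faulty output
    return cipher


KEYS = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]
P = 0x12345678
assert encrypt_with_algorithm_level_ced(P, KEYS) == encrypt(P, KEYS)
assert encrypt_with_algorithm_level_ced(P, KEYS, fault_round=1, fault_mask=0x10) is None
```

Because every round is a bijection, any non-zero bit flip in the encryption state changes the recovered plain text, so the comparison catches it, but only after the complete encryption and decryption have run.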
Approach 2: Round Level CED
• Output of an encryption round is fed to the corresponding decryption round and the result is compared with the input to the encryption round
• Larger area overhead: a 128-bit register, two (3:1) 128-bit multiplexers, two (2:1) 128-bit multiplexers and a 128-bit comparator
• Encryption and decryption for CED can be carried out concurrently
• Lower performance penalty (time for encryption = number of cycles for encryption + number of cycles for one round of decryption)
• Additional delays in the critical path, leading to a slower clock
• Low fault detection latency (2 × cycles per round)
[Figure: plain text and a register feed ENC rounds 1 to n producing the cipher; each encryption round output is passed to the matching DEC round (n to 1) and a comparator; a random value and the path to the output data port complete the datapath.]
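Round-level checking can be sketched in the same way. The code below again assumes a hypothetical 32-bit toy round function, not a real cipher, and runs the check sequentially; in hardware the decryption round for round i would run while encryption round i+1 is in progress. The function name and the `fault_round`/`fault_mask` parameters are illustrative.

```python
MASK = 0xFFFFFFFF  # toy 32-bit state instead of a 128-bit block


def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def enc_round(x, k):
    return (rotl(x ^ k, 5) + k) & MASK


def dec_round(y, k):
    return rotr((y - k) & MASK, 5) ^ k


def encrypt_with_round_level_ced(plain, keys, fault_round=None, fault_mask=0):
    """Return (cipher, None) on success or (None, round_index) on detection."""
    state = plain
    for i, k in enumerate(keys):
        round_input = state                  # register holding the round input
        state = enc_round(state, k)
        if i == fault_round:
            state ^= fault_mask              # injected fault (assumption for illustration)
        if dec_round(state, k) != round_input:   # comparator after the inverse round
            return None, i                   # detected within the faulty round
    return state, None


KEYS = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]
P = 0x12345678

cipher, detected_at = encrypt_with_round_level_ced(P, KEYS)
assert detected_at is None and cipher is not None
assert encrypt_with_round_level_ced(P, KEYS, fault_round=2, fault_mask=0x1) == (None, 2)
```

The fault is reported in the round where it occurs rather than at the end of the whole encryption, which is the latency advantage over the algorithm-level scheme.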
Approach 3: Operation Level CED
• Output of each operation in the encryption module is fed to the corresponding inverse operation in the decryption module, and the result is compared with the input to the encryption operation
• Largest area overhead of the three: multiple 128-bit registers, 128-bit multiplexers and 128-bit comparators with complex inter-connections
• Encryption and decryption for CED can be carried out concurrently
• Lowest performance penalty (time for encryption = number of cycles for encryption + number of cycles for one operation of decryption)
• Lowest fault detection latency (2 × cycles per operation of encryption/decryption)
• Maximum delay in the critical path, leading to the slowest clock
[Figure: encryption round r (operations 1 to m) paired with decryption round n-r+1 (operations m to 1); registers hold the plain text and intermediate cipher, and comparators check each operation against its inverse.]
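At operation level the same check is applied to every operation. The sketch below assumes three hypothetical 32-bit operations with known inverses; the operation names and the `fault_at` parameter are illustrative and do not come from any of the ciphers discussed.

```python
MASK = 0xFFFFFFFF  # toy 32-bit state instead of a 128-bit block


def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


# (name, forward, inverse) triples; each takes (word, round_key).
OPERATIONS = [
    ("key-xor", lambda x, k: x ^ k, lambda y, k: y ^ k),
    ("key-add", lambda x, k: (x + k) & MASK, lambda y, k: (y - k) & MASK),
    ("rot5", lambda x, k: rotl(x, 5), lambda y, k: rotr(y, 5)),
]


def encrypt_with_operation_level_ced(plain, keys, fault_at=None, fault_mask=0):
    """Return (cipher, None) or (None, (round_index, operation_name))."""
    state = plain
    for r, k in enumerate(keys):
        for name, forward, inverse in OPERATIONS:
            op_input = state                 # register holding the operation input
            state = forward(state, k)
            if fault_at == (r, name):
                state ^= fault_mask          # injected fault (assumption for illustration)
            if inverse(state, k) != op_input:    # one comparator per operation
                return None, (r, name)
    return state, None


KEYS = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4]
P = 0x12345678

cipher, detected_at = encrypt_with_operation_level_ced(P, KEYS)
assert detected_at is None
result = encrypt_with_operation_level_ced(P, KEYS, fault_at=(1, "key-add"), fault_mask=0x4)
assert result == (None, (1, "key-add"))
```

Each operation needs its own saved input and comparison, which mirrors the extra registers and comparators noted in the area overhead bullet.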
Comparison of 128-Bit Symmetric Block Ciphers
(Enc = encryption cycles; Latency = fault detection latency in cycles)
Algorithm   W/o CED   Algorithm-level CED   Round-level CED   Operation-level CED
            Enc       Enc     Latency       Enc    Latency    Enc    Latency
RC6         42        84      84            44     4          43     2
Mars        32        64      64            34     4          33     2
Serpent     64        128     128           66     4          65     2
Twofish     34        68      68            36     4          35     2
Rijndael    44        88      88            48     8          45     2
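The cycle counts in this comparison follow from the formulas given for the three approaches. The sketch below recomputes them; the cycles-per-round values are an assumption inferred from the round-level figures (Serpent's 2 cycles per round is stated in the case study), and one cycle per operation is likewise inferred from the operation-level figures.

```python
# Encryption cycles without CED, taken from the comparison table.
BASE_CYCLES = {"RC6": 42, "Mars": 32, "Serpent": 64, "Twofish": 34, "Rijndael": 44}

# Assumed cycles per round, inferred from the round-level column.
CYCLES_PER_ROUND = {"RC6": 2, "Mars": 2, "Serpent": 2, "Twofish": 2, "Rijndael": 4}

# Assumed cycles per operation, inferred from the operation-level column.
CYCLES_PER_OPERATION = 1


def ced_cycles(cipher):
    """Return encryption cycles and detection latency for each CED level."""
    base = BASE_CYCLES[cipher]
    n = CYCLES_PER_ROUND[cipher]
    return {
        # Full encryption followed by full decryption.
        "algorithm": {"enc": 2 * base, "latency": 2 * base},
        # Encryption plus one trailing decryption round; check spans two rounds.
        "round": {"enc": base + n, "latency": 2 * n},
        # Encryption plus one trailing inverse operation.
        "operation": {
            "enc": base + CYCLES_PER_OPERATION,
            "latency": 2 * CYCLES_PER_OPERATION,
        },
    }


for name in BASE_CYCLES:
    print(name, ced_cycles(name))
```

Under these assumptions the function reproduces every CED entry in the table, for example 48 encryption cycles and a latency of 8 for round-level Rijndael.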
FPGA Implementation and Validation • Xilinx Virtex device, XCV1000BG560-6 • VHDL modeling • Functional verification: Modeltech’s Modelsim VHDL simulator • Synthesis: Synplify • Place and route: Xilinx Foundation PAR tool
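Implementation Metrics
• Area = number of Virtex slices used
  • Each Virtex slice = two lookup tables
  • Each lookup table can implement a 4-input, 1-output logic function
• Throughput = (128 bits × maximum clock frequency) / (number of cycles per encryption)
• Performance degradation = 1 - (throughput with CED) / (throughput without CED)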
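The reported throughput figures are consistent with throughput = block size × clock frequency / encryption cycles, and with degradation measured relative to the design without CED. The sketch below assumes those two definitions and checks them against a few reported values; the function names are illustrative.

```python
BLOCK_BITS = 128  # block size of the ciphers considered


def throughput_mbps(freq_mhz, cycles_per_block, block_bits=BLOCK_BITS):
    """Throughput in Mbps for one block every cycles_per_block cycles."""
    return block_bits * freq_mhz / cycles_per_block


def degradation_pct(throughput_without_ced, throughput_with_ced):
    """Performance degradation as a percentage of the baseline throughput."""
    return 100.0 * (1.0 - throughput_with_ced / throughput_without_ced)


# RC6 without CED: 23.99 MHz, 42 cycles per block (reported: 73.11 Mbps).
rc6_base = throughput_mbps(23.99, 42)

# Serpent without CED: 28.64 MHz, 64 cycles (reported: 57.28 Mbps).
serpent_base = throughput_mbps(28.64, 64)

# Serpent with algorithm-level CED: 30.37 MHz, 128 cycles (reported: 30.37 Mbps).
serpent_alg = throughput_mbps(30.37, 128)

print(round(rc6_base, 2), round(serpent_base, 2), round(serpent_alg, 2))
print(round(degradation_pct(serpent_base, serpent_alg), 2))
```

Note that algorithm-level CED roughly halves throughput even when the clock frequency barely changes, because the cycle count per block doubles.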
Implementation Results: Area Overhead
(Area in number of slices; overhead in % relative to w/o CED)
Algorithm   w/o CED   Algorithm level     Round level         Operation level
            Area      Area    Overhead    Area    Overhead    Area    Overhead
Rijndael    3973      4806    20.97       5100    28.37       *       *
RC6         2397      3028    26.3        3153    31.5        3337    39.20
Twofish     3262      3474    6.49        3467    6.28        **      **
Serpent     8073      9376    16.14       9659    19.15       9974    23.55
* Result not available
** Result not applicable
• Increasing granularity of CED increases area overhead
• The reduction in fault detection latency per unit of added area is larger when moving from algorithm level to round level than when moving from round level to operation level CED
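The overhead percentages are relative to the slice count without CED. A short sketch, assuming overhead = (area with CED - area without CED) / area without CED, checked against the Rijndael row (3973 slices without CED, 4806 with CED) and two Serpent entries:

```python
def area_overhead_pct(slices_without_ced, slices_with_ced):
    """Area overhead of a CED design as a percentage of the baseline area."""
    return 100.0 * (slices_with_ced - slices_without_ced) / slices_without_ced


# Rijndael: 3973 slices without CED, 4806 with CED (reported overhead: 20.97%).
print(round(area_overhead_pct(3973, 4806), 2))

# Serpent: 8073 slices without CED, 9376 and 9974 with CED
# (reported overheads: 16.14% and 23.55%).
print(round(area_overhead_pct(8073, 9376), 2))
print(round(area_overhead_pct(8073, 9974), 2))
```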
Implementation Results: Clock period degradation
(Max freq in MHz; Dgrd = degradation in %)
Algorithm   w/o CED    Algorithm level     Round level         Operation level
            Max freq   Max freq   Dgrd     Max freq   Dgrd     Max freq   Dgrd
Rijndael    46.93      36.44      -22.35   36.06      -23.17   *          *
RC6         23.99      21.76      -9.30    20.74      -13.54   16.87      -29.70
Twofish     20.16      18.98      -5.85    19.07      -5.41    **         **
Serpent     28.64      30.37      6.04     26.27      -8.08    26.759     -6.56
* Result not available
** Result not applicable
• Increasing granularity of CED decreases clock frequency
• The reduction in fault detection latency per unit of lost clock frequency is larger when moving from algorithm level to round level than when moving from round level to operation level CED
Implementation Results: Performance degradation
(Throughput in Mbps; Dgrd = degradation in %)
Algorithm   w/o CED      Algorithm level       Round level           Operation level
            Throughput   Throughput   Dgrd     Throughput   Dgrd     Throughput   Dgrd
Rijndael    136.53       53.04        -61.15   96.16        -29.57   *            *
RC6         73.11        33.16        -54.64   60.33        -17.48   50.22        -31.31
Twofish     75.90        35.73        -52.92   67.80        -10.67   **           **
Serpent     57.28        30.37        -46.98   50.95        -11.05   52.69        -8.01
* Result not available
** Result not applicable
• Throughput with CED is highest at round level or operation level; algorithm-level CED roughly halves it
• The reduction in fault detection latency per unit of lost throughput is larger when moving from algorithm level to round level than when moving from round level to operation level CED
Case Study: 128-bit Serpent Cipher
• 32 rounds, each using one round key, except for the last one, which uses two; plus a pre- and post-processing step
• Operations in a round of encryption: Key-Xor, non-linear byte substitution (S-Box), linear transform
• Operations in a round of decryption: inverse linear transform, inverse non-linear byte substitution (S-Box^-1), Key-Xor
• In our implementation:
  • Round keys are generated and stored in a key RAM (128 × 33 bits)
  • One round of encryption consumes 2 cycles
  • The entire encryption process consumes 2 × 32 = 64 cycles
Serpent: Algorithm Level CED
• Area overhead: a 128-bit register, three (2:1) 128-bit multiplexers and a 128-bit comparator
• Fault detection latency: 64 cycles for encryption + 64 cycles for decryption = 128 cycles
[Figure: plain text, pre-whitening and a register feed rounds 1 to 32 of the encryption module and rounds 1 to 32 of the decryption module, followed by post-whitening and a comparator.]
Serpent: Round Level CED
• Area overhead: a 128-bit register, two (3:1) 128-bit multiplexers, a (2:1) 128-bit multiplexer and a 128-bit comparator
• Fault detection latency: 2 cycles/round for encryption + 2 cycles/round for decryption = 4 cycles
[Figure: plain text, pre-whitening and a register feed the encryption round (Key-Xor, S-Box, Linear Transform); its output feeds the decryption round (Inverse Linear Transform, S-Box^-1, Key-Xor); the decryption round output goes to a comparator; post-whitening follows the encryption round.]
Serpent: Operation Level CED
• Area overhead: multiple 128-bit registers, 128-bit multiplexers and 128-bit comparators with complex inter-connections
• Fault detection latency: 2 cycles
[Figure: plain text, pre-whitening and the decryption round output feed the encryption round (Key-Xor, S-Box, Linear Transform); registers hold each operation's input, and each operation is paired with its inverse (Key-Xor, S-Box^-1, Inverse Linear Transform) and a comparator.]
Conclusions
• Hardware redundancy based approach requires more than 100% area overhead
• Time redundancy based approach requires more than 100% time overhead
• Proposed CED approach provides better fault detection latency with significantly smaller area overhead (about 40% at most)
• As we move from high level to low level CED, we get:
  • better fault detection latency,
  • lower clock frequency, and
  • higher area overhead
• Round level CED balances this trade-off better than algorithm level and operation level CED techniques and hence is a better choice