Running OpenSSL Crypto Algorithms in SimpleScalar
Piyush Ranjan Satapathy
Department of Computer Science & Engineering
University of California, Riverside
Outline
• What are crypto algorithms?
• Why do we need to run them on SimpleScalar?
• Is there any previous work on this?
• Introducing OpenSSL 0.9.7e
• Introducing SimpleScalar version 2.0
• Selecting the crypto algorithms from OpenSSL
• Simulation settings and parameters
• Results & discussion
• An interesting comparison
• Demo
• Conclusion
• Acknowledgement and references
• Q&A
CS213: "Parallel processing Architecture" By Dr Laxmi Narayan Bhuyan (Winter 2005) University of California Riverside
What Are Crypto Algorithms?
Algorithms meant for network security:
1. Authentication
2. Secrecy
3. Nonrepudiation
4. Integrity control
Kinds of crypto algorithms that address the above:
1. Public-key algorithms (e.g. RSA, DSS, LUC...)
2. Secret-key algorithms (e.g. AES, DES, RC4, SEAL…)
3. Cryptographic hash functions (e.g. MD5, SHA1…)
4. Random number generators (e.g. PGP, Noiz, SSH…)
Secret-key algorithms:
1. Block ciphers (e.g. IDEA, DES, AES, Blowfish…)
2. Stream ciphers (e.g. RC4, SEAL, A5)
Why Run on SimpleScalar?
Architectural analysis for crypto algorithms
- To arrive at a good network processor design, we need an architectural analysis of the crypto algorithms at cycle-level accuracy.
SimpleScalar
- Easy to simulate: fast, flexible and accurate simulation.
- SimpleScalar provides cycle-level simulation of a MIPS-like processor.
- Parallel programming is not a concern here; otherwise Simics could have been used.
Previous Work on Architectural Analysis of Crypto Algorithms:
• Analysis using widely available crypto algorithms (referred to as “Average” here) by Haiyong Xie et al.
• Analysis using SPECint & CommBench
• Performance of SSL crypto algorithms (Li Zhao et al.)
• But no architectural analysis of the OpenSSL crypto algorithms.
OpenSSL has become the standard benchmark for crypto engines, so an architectural analysis of these algorithms helps in understanding the needs of modern network processors that deal with cryptography.
Introducing OpenSSL 0.9.7e
- Widely used open-source implementation of crypto algorithms (I used the most recent version).
- OpenSSL is a cryptography toolkit that implements the Secure Sockets Layer (SSL v2/v3) and Transport Layer Security (TLS v1) network protocols and the related cryptography standards they require.
- The openssl program is a command-line tool for using the various cryptography functions of OpenSSL's crypto library from the shell. It can be used for:
  o Creation of RSA, DH and DSA key parameters
  o Creation of X.509 certificates, CSRs and CRLs
  o Calculation of message digests
  o Encryption and decryption with ciphers
  o SSL/TLS client and server tests
  o Handling of S/MIME signed or encrypted mail
- I used the library to port the crypto algorithms to SimpleScalar.
Introducing SimpleScalar 2.0
Compiling: sslittle-na-sstrix-gcc foo.c -o foo
Running: sim-outorder foo
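A run of sim-outorder ends with a list of statistics, one per line, in the form "name value # description". The sketch below shows one way to pull those readings into a dictionary. The sample text and its numbers are made up for illustration, and parse_stats is a hypothetical helper, not part of SimpleScalar.

```python
import re

# Illustrative sample only: the values are invented, not measured results.
SAMPLE = """\
sim_num_insn                  200000 # total number of instructions committed
sim_cycle                     160000 # total simulation time in cycles
sim_IPC                       1.2500 # instructions per cycle
"""

# A statistic line is assumed to look like: <name> <number> # <description>
STAT_LINE = re.compile(r"^(\w[\w.:]*)\s+(-?\d+(?:\.\d+)?)\s+#")


def parse_stats(text):
    """Return {statistic name: value as float} for every matching line."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    # IPC can be cross-checked as committed instructions / cycles.
    print(stats["sim_IPC"], stats["sim_num_insn"] / stats["sim_cycle"])
```

In practice the simulator output would be captured to a file and passed to parse_stats in place of SAMPLE.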
Selecting OpenSSL Crypto Algorithms:
Secret-key block ciphers
- AES (key length: 128 bits; block size: 16 bytes)
- DES (key length: 56 bits, stored as 64 bits with parity; block size: 8 bytes)
- 3DES (key length: 168 bits; block size: 8 bytes)
- IDEA (key length: 128 bits; block size: 8 bytes)
Stream cipher
- RC4 (key length: 128 bits)
Hash functions
- MD5 (block size: 512 bits; digest size: 128 bits)
- SHA1 (block size: 512 bits; digest size: 160 bits)
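The hash parameters listed above can be checked quickly on a host machine. This is a minimal sketch using Python's standard hashlib module rather than the OpenSSL C library that was ported to SimpleScalar; hash_sizes_bits is a hypothetical helper name.

```python
import hashlib


def hash_sizes_bits(name):
    """Return (block size, digest size) in bits for a hashlib algorithm."""
    h = hashlib.new(name)
    # hashlib reports both sizes in bytes, so convert to bits.
    return h.block_size * 8, h.digest_size * 8


if __name__ == "__main__":
    for name in ("md5", "sha1"):
        block_bits, digest_bits = hash_sizes_bits(name)
        print(f"{name}: block {block_bits} bits, digest {digest_bits} bits")
```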
Simulation Settings & Parameters
Settings:
- A separate module was written for each algorithm using the crypto library.
- Each module was compiled with the SimpleScalar port of gcc 2.7.2.3, and the binary was run on the simulator with a file as input.
- Input file length varies from 1 byte to 256 KB.
- Most readings were taken with a 1-byte input file.
- Different SimpleScalar parameters were changed on the command line and the readings observed.
Parameters used:
Results & Discussions: (1)
1. Instruction set characteristics:
- Comparison with Average, SPECint & CommBench
- “Average” represents Li’s work
- SSLcrypto represents the average over all the OpenSSL algorithms I considered.
Observation:
* The SSLcrypto algorithms make a significant number of memory references (~40%).
* Arithmetic computation is intensive, but less so than in Average.
Results & Discussions: (2,3)
2. Comparison of instruction mix:
- Plotted the instruction mix for all the block ciphers, stream ciphers and hash functions.
Observation:
- DES and 3DES have a high share of memory references.
- IDEA has a significant share of branches.
3. Cycles per byte of computation:
- 3DES takes more cycles because it processes the data 3 times with 3 different keys.
- Block ciphers require more cycles than stream ciphers and hash functions.
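Cycles per byte is simply the simulated cycle count divided by the input length. The sketch below shows that arithmetic and a rough first-order estimate for 3DES as three DES passes. All numbers are hypothetical placeholders, not results from the slides, and the function names are illustrative.

```python
def cycles_per_byte(sim_cycles, input_bytes):
    """Simulated cycles spent per byte of input."""
    if input_bytes <= 0:
        raise ValueError("input length must be positive")
    return sim_cycles / input_bytes


def triple_des_estimate(des_cycles_per_byte):
    """First-order estimate: 3DES runs the DES core three times per block.

    Assumption: key setup and loop overhead are ignored.
    """
    return 3 * des_cycles_per_byte


if __name__ == "__main__":
    # Hypothetical reading: 512,000 cycles for a 1 KB input file.
    des_cpb = cycles_per_byte(512_000, 1024)
    print("DES cycles/byte:", des_cpb)
    print("3DES estimate:", triple_des_estimate(des_cpb))
```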
Results & Discussions: (4,5)
4. IPC vs. ALU:
- IPC increases by 26%, 37% and 40% for block, stream and hash algorithms respectively when the number of ALUs increases from 1 to 2.
- IPC increases by 6%, 10% and 5% when the number of ALUs increases from 2 to 4.
- With more than 4 ALUs, the number of instructions executed per cycle increases by less than 1%.
5. IPC vs. IFQ size:
- IPC increases by 26%, 37% and 40% for block, stream and hash algorithms respectively when the size of the instruction fetch queue changes from 1 to 2.
- IPC increases by 6%, 10% and 5% when the IFQ size changes from 2 to 4.
- Beyond that, IPC changes by no more than 2%.
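The step-wise gains above can be compounded to see the total gain when going from 1 to 4 ALUs (or IFQ entries). This sketch assumes each percentage is a relative IPC increase over the previous configuration; the slides do not say so explicitly, so treat that as an assumption.

```python
# Step gains in percent, taken from the slide: (1 -> 2, 2 -> 4).
STEP_GAINS = {
    "block": (26, 6),
    "stream": (37, 10),
    "hash": (40, 5),
}


def cumulative_gain(step_percentages):
    """Compound successive relative gains into one overall percentage."""
    factor = 1.0
    for pct in step_percentages:
        factor *= 1 + pct / 100
    return (factor - 1) * 100


if __name__ == "__main__":
    for kind, steps in STEP_GAINS.items():
        print(f"{kind}: {cumulative_gain(steps):.1f}% from 1 to 4")
```

Under that assumption the overall gains come to roughly 34% for block ciphers, 51% for the stream cipher and 47% for hash functions.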
Results & Discussions: (6)
6. IPC vs. ILP:
- ILP 4 means 4 ALUs and an IFQ size of 4 (both change together).
- An ILP of 4 is enough to reach the best instructions-per-cycle value.
Results & Discussions: (7)
7. Branch prediction hit rate:
- Bimodal and combined predictors give a better hit rate.
- The 2-level (2lev) predictor gives almost as good a hit rate.
- Simple always-taken or always-not-taken prediction does not do well.
- So more complex branch prediction schemes need to be considered.
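To illustrate why a bimodal predictor beats static prediction, here is a minimal sketch of a table of 2-bit saturating counters run over a synthetic branch trace. It is a toy model, not SimpleScalar's implementation; the trace, table size and addresses are all made up for illustration.

```python
def synthetic_trace(outer=100, inner=8):
    """Hypothetical trace: a loop branch taken inner-1 times then not taken,
    followed by a branch that is never taken."""
    trace = []
    for _ in range(outer):
        for i in range(inner):
            trace.append((0x400, i < inner - 1))
        trace.append((0x408, False))
    return trace


def bimodal_hit_rate(trace, table_bits=4):
    """Predict with 2-bit saturating counters indexed by the branch address."""
    size = 1 << table_bits
    counters = [2] * size  # start in the "weakly taken" state
    hits = 0
    for pc, taken in trace:
        index = (pc >> 2) & (size - 1)
        predicted_taken = counters[index] >= 2
        hits += predicted_taken == taken
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        if taken:
            counters[index] = min(3, counters[index] + 1)
        else:
            counters[index] = max(0, counters[index] - 1)
    return hits / len(trace)


def static_hit_rate(trace, predict_taken):
    """Hit rate of always predicting taken (or always not taken)."""
    return sum(taken == predict_taken for _, taken in trace) / len(trace)


if __name__ == "__main__":
    trace = synthetic_trace()
    print("bimodal  :", round(bimodal_hit_rate(trace), 3))
    print("taken    :", round(static_hit_rate(trace, True), 3))
    print("not taken:", round(static_hit_rate(trace, False), 3))
```

On this toy trace the bimodal table learns the never-taken branch after one miss, which neither static policy can do for both branches at once.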
Results & Discussions: (8,9)
8. L1 instruction cache size behavior:
- Cache size varied with the line size fixed at 64 bytes, 4-way set associativity and LRU (l) replacement.
- 128 KB is enough to reach the best performance level.
9. L1 instruction cache line size:
- Cache line size varied with the cache size fixed at 256 KB, 4-way set associativity and LRU (l) replacement.
- A 32-byte line size is enough to reach the lowest possible miss rate.
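SimpleScalar describes a cache on the command line as name:nsets:bsize:assoc:repl, so a cache size has to be converted into a number of sets before each run. The sketch below does that conversion for a sweep of L1 instruction cache sizes. The list of sizes is hypothetical (the slides do not give the exact sweep points), and cache_config is an illustrative helper, not part of the tool set.

```python
def cache_config(name, size_bytes, line_bytes, assoc, repl="l"):
    """Build a cache spec string of the form name:nsets:bsize:assoc:repl.

    nsets = cache size / (line size * associativity) and must be a power of two.
    """
    nsets, remainder = divmod(size_bytes, line_bytes * assoc)
    if remainder or nsets < 1 or nsets & (nsets - 1):
        raise ValueError("size must give a power-of-two number of sets")
    return f"{name}:{nsets}:{line_bytes}:{assoc}:{repl}"


if __name__ == "__main__":
    # Hypothetical sweep: 64-byte lines, 4-way, LRU ("l"), as on the slide.
    for size_kb in (8, 16, 32, 64, 128, 256):
        spec = cache_config("il1", size_kb * 1024, 64, 4)
        print(f"{size_kb:>3} KB -> -cache:il1 {spec}")
```

For example, a 128 KB cache with 64-byte lines and 4-way associativity has 131072 / (64 × 4) = 512 sets.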
Results & Discussions: (10,11)
10. L1 instruction cache set associativity behavior:
- Set associativity varied with the cache size fixed at 256 KB, a 32-byte line size and LRU (l) replacement.
- 2-way set associativity is enough to reach a miss rate lower than 5%.
11. L1 instruction cache replacement policy behavior:
- Replacement policy varied with the cache size fixed at 256 KB, a 32-byte line size and 4-way set associativity.
- LRU and FIFO give the same performance, so either one can be chosen.
Results & Discussions: (12,13)
12. L1 data cache size behavior:
- Cache size varied with the line size fixed at 64 bytes, 1-way set associativity and LRU (l) replacement.
- 32 KB is enough to reach the best performance level.
13. L1 data cache line size:
- Cache line size varied with the cache size fixed at 256 KB, 1-way set associativity and LRU (l) replacement.
- A 32-byte line size is enough to reach the lowest possible miss rate.
Results & Discussions: (14,15)
14. L1 data cache set associativity behavior:
- Set associativity varied with the cache size fixed at 256 KB, a 32-byte line size and LRU (l) replacement.
- 2-way set associativity is enough for block and stream ciphers, while hash functions need 4-way.
15. L1 Instruction Cache replacement policy behavior:
- Replacement policy varied with the cache size fixed at 256 KB, a 32-byte line size and 4-way set associativity.
- LRU and FIFO give the same performance, so either one can be chosen.
Results & Discussions: (16,17)
16. L1 Data Cache behavior:
- Cache size varied with the line size fixed at 64 bytes, 1-way set associativity and LRU (l) replacement.
- 512 KB is enough to reach the best performance level.
17. L1 Instruction Cache replacement policy behavior:
- Replacement policy varied with the cache size fixed at 512 KB, a 64-byte line size and 4-way set associativity.
- LRU and FIFO give the same performance, so either one can be chosen.
An Interesting Comparison:
Demo Time …………
Conclusion:
For better architectural performance, crypto engines using the OpenSSL crypto algorithms should have:
* 128 KB L1 instruction cache
* 32 KB L1 data cache
* 512 KB unified L2 (UL2) cache
* 2-way set associativity
* LRU (l) replacement policy
* ILP of 8
* Advanced branch prediction schemes
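As a rough sketch, the recommended configuration could be expressed as a sim-outorder command line like the one assembled below. Several details are assumptions rather than statements from the slides: 32-byte lines for the L1 caches and 64-byte lines for the L2, "ILP of 8" read as 8 integer ALUs plus an IFQ size of 8, the combined predictor standing in for "advanced branch prediction", and foo as a placeholder binary name.

```python
import shlex


def sets(size_kb, line_bytes, assoc):
    """Number of sets = cache size / (line size * associativity)."""
    return size_kb * 1024 // (line_bytes * assoc)


def recommended_command(binary="foo"):
    """Assemble a sim-outorder argument list for the concluding configuration."""
    return [
        "sim-outorder",
        "-fetch:ifqsize", "8",          # assumed reading of "ILP of 8"
        "-res:ialu", "8",
        "-bpred", "comb",               # assumed "advanced" predictor
        "-cache:il1", f"il1:{sets(128, 32, 2)}:32:2:l",  # 128 KB, 2-way, LRU
        "-cache:dl1", f"dl1:{sets(32, 32, 2)}:32:2:l",   # 32 KB, 2-way, LRU
        "-cache:dl2", f"ul2:{sets(512, 64, 2)}:64:2:l",  # 512 KB unified L2
        binary,
    ]


if __name__ == "__main__":
    print(shlex.join(recommended_command()))
```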
Acknowledgement & References:
• A big thanks to Li Zhao
• References:
- SimpleScalar Tool Set, http://www.simplescalar.com
- OpenSSL, http://www.openssl.org
- Haiyong Xie et al., "Architectural Analysis of Cryptographic Applications for Network Processors"
- Li Zhao, Ravi Iyer, Srihari Makineni, Laxmi Bhuyan, "Anatomy and Performance of SSL Processing"
Q&A ????