SHA-3 Candidate Evaluation. FPGA Benchmarking - Phase 1. 14 Round-2 SHA-3 Candidates implemented by 33 graduate students following the same design methodology (the same function implemented independently by 2-3 students) Uniform Input/Output Interface Uniform Generic Testbench
FPGA Benchmarking - Phase 1
• 14 Round-2 SHA-3 candidates implemented by 33 graduate students following the same design methodology (the same function implemented independently by 2-3 students)
• Uniform Input/Output Interface
• Uniform Generic Testbench
• Optimization for maximum throughput to cost ratio
• Benchmarking on multiple FPGA platforms from Xilinx and Altera using ATHENa
• Comparing vs. optimized implementations of SHA-1 & SHA-2
• Compressing all results into one single ranking
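Division into Datapath and Controller
[Block diagram: the Datapath (Execution Unit) takes Data Inputs and produces Data Outputs; the Controller (Control Unit) takes Control & Status Inputs and produces Control & Status Outputs; the Controller drives the Datapath with Control Signals and receives Status Signals from it.]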
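Design Methodology
[Flow diagram: Specification and Interface lead to the Execution Unit and the Control Unit; the Execution Unit is captured as a Block diagram and the Control Unit as an Algorithmic State Machine, each then translated into VHDL code.]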
Steps of the Design Process (1)
Given
• Specification
• Interface
Completed
• Pseudocode
• Detailed block diagram of the Datapath
• Interface with the division into the Datapath and the Controller
• Timing and area analysis, architectural-level optimizations
• RTL VHDL code of the Datapath, and corresponding Testbenches
Steps of the Design Process (2)
Remaining to be done
• ASM chart of the Controller
• RTL VHDL code of the Controller and the corresponding testbench
• Integration of the Datapath and the Controller
• Testing using uniform generic testbench (developed by Ice)
• Source Code Optimizations
• Performance characterization using ATHENa
• Documentation and final report
FPGA Benchmarking - Phase 2
• extending source codes to cover all hash function variants
• padding in hardware
• applying additional architectural optimizations
• extended benchmarking (Actel FPGAs, multiple tools, adaptive optimization strategies, etc.)
• reconciling differences with other available rankings
• preparing the codes for ASIC evaluation
Single Ranking (1)
• Select several representative FPGA platforms with significantly different properties, e.g.:
  • vendor: Xilinx vs. Altera
  • process: 90 nm vs. 65 nm
  • LUT size: 4-input vs. 6-input
  • optimization: low-cost vs. high-performance
• Use ATHENa to characterize all SHA-3 candidates and SHA-2 on these platforms in terms of the target performance metrics (e.g., throughput/area ratio)
Single Ranking (2)
• Calculate the ratio of SHA-3 candidate performance to SHA-2 performance (for the same security level)
• Calculate the geometric mean of these ratios over multiple platforms
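The two ranking steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the evaluation scripts used in the project: the function names, platform labels, and all numbers are hypothetical, and the metric is assumed to be one where larger is better (e.g., throughput/area).

```python
from math import prod


def normalized_ratios(candidate, sha2):
    """Per-platform ratio: candidate metric / SHA-2 metric (same security level)."""
    return {platform: candidate[platform] / sha2[platform] for platform in candidate}


def geometric_mean(values):
    """Geometric mean of a collection of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def overall_score(candidate, sha2):
    """Single-number score: geometric mean of the per-platform ratios."""
    return geometric_mean(normalized_ratios(candidate, sha2).values())


# Hypothetical throughput/area values, for illustration only.
sha2_metric = {"platform_a": 2.0, "platform_b": 4.0}
candidate_metric = {"platform_a": 4.0, "platform_b": 2.0}

score = overall_score(candidate_metric, sha2_metric)
print(score)
```

In this made-up case the candidate is twice as good as SHA-2 on one platform and half as good on the other, so the geometric mean is 1: the two platforms cancel out, which is the behavior that makes the geometric mean a natural choice for averaging ratios.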
The common ground is vague
• Hardware performance: cycles per block, cycles per byte, Latency (cycles), Latency (ns), Throughput for long messages, Throughput for short messages, Throughput at 100 kHz, Clock Frequency, Clock Period, Critical Path Delay, Modexp/s, PointMul/s
• Hardware cost: Slices, Slices Occupied, LUTs, 4-input LUTs, 6-input LUTs, FFs, Gate Equivalent (GE), Size on ASIC, DSP Blocks, BRAMs, Number of Cores, CLB, MUL, XOR, NOT, AND
• Hardware efficiency: Hardware performance / Hardware cost
Our Favorite Hardware Performance Metrics
• Mbit/s for Throughput
• ns for Latency
These units allow for easy cross-comparison among implementations in software (microprocessors), FPGAs (various vendors), and ASICs (various libraries)
But how to define and measure throughput and latency for hash functions?
Time to hash N blocks of message:
$$\mathrm{Htime}(N, T_{CLK}) = \text{Initialization Time}(T_{CLK}) + N \cdot \text{Block Processing Time}(T_{CLK}) + \text{Finalization Time}(T_{CLK})$$
Latency = time to hash ONE block of message:
$$\text{Latency} = \mathrm{Htime}(1, T_{CLK}) = \text{Initialization Time}(T_{CLK}) + \text{Block Processing Time}(T_{CLK}) + \text{Finalization Time}(T_{CLK})$$
Throughput (for long messages):
$$\text{Throughput} = \frac{\text{Block size}}{\mathrm{Htime}(N+1, T_{CLK}) - \mathrm{Htime}(N, T_{CLK})} = \frac{\text{Block size}}{\text{Block Processing Time}(T_{CLK})}$$
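But how to define and measure throughput and latency for hash functions?
$$\text{Initialization Time}(T_{CLK}) = \mathrm{cycles}_I \cdot T_{CLK}$$
$$\text{Block Processing Time}(T_{CLK}) = \mathrm{cycles}_P \cdot T_{CLK}$$
$$\text{Finalization Time}(T_{CLK}) = \mathrm{cycles}_F \cdot T_{CLK}$$
Where the parameters come from:
• T_CLK: from place & route report (or experiment)
• Block size: from specification
• cycles_I, cycles_P, cycles_F: from analysis of block diagram and/or functional simulation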
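As a quick check of the definitions above, the following Python sketch evaluates Htime, latency, and long-message throughput from the cycle counts and the clock period. The function names and the numeric values (a 512-bit block, 4 ns clock period, 2/16/2 cycles) are hypothetical and chosen only to make the arithmetic easy to follow; they do not describe any particular SHA-3 candidate.

```python
def htime_ns(n_blocks, t_clk_ns, cycles_i, cycles_p, cycles_f):
    """Time to hash n_blocks blocks, in ns: (cycles_I + N * cycles_P + cycles_F) * T_CLK."""
    return (cycles_i + n_blocks * cycles_p + cycles_f) * t_clk_ns


def latency_ns(t_clk_ns, cycles_i, cycles_p, cycles_f):
    """Latency = time to hash ONE block = Htime(1, T_CLK)."""
    return htime_ns(1, t_clk_ns, cycles_i, cycles_p, cycles_f)


def throughput_mbit_s(block_size_bits, t_clk_ns, cycles_p):
    """Long-message throughput = block size / block processing time.

    bits per ns equals Gbit/s, so multiply by 1000 to report Mbit/s.
    """
    return block_size_bits / (cycles_p * t_clk_ns) * 1000.0


# Hypothetical parameters, for illustration only.
BLOCK_SIZE_BITS = 512                   # from the specification
T_CLK_NS = 4.0                          # from the place & route report
CYCLES_I, CYCLES_P, CYCLES_F = 2, 16, 2  # from block diagram / simulation

lat = latency_ns(T_CLK_NS, CYCLES_I, CYCLES_P, CYCLES_F)
thr = throughput_mbit_s(BLOCK_SIZE_BITS, T_CLK_NS, CYCLES_P)
print(f"latency = {lat} ns, throughput = {thr} Mbit/s")
```

Note that the initialization and finalization cycles affect latency but drop out of the long-message throughput, because they cancel in Htime(N+1) - Htime(N).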
How to compare hardware speed vs. software speed?
eBASH reports (http://bench.cr.yp.to/results-hash.html)
• In graphs: Time(n) = time in clock cycles vs. message size in bytes for n-byte messages, with n = 0, 1, 2, 3, …, 2048, 4096
• In tables: performance in cycles/byte for n = 8, 64, 576, 1536, 4096, long msg
$$\text{Performance for long message} = \frac{\mathrm{Time}(4096) - \mathrm{Time}(2048)}{2048}$$
How to compare hardware speed vs. software speed?
$$\text{Throughput [Gbit/s]} = \frac{8\ \text{bits/byte} \cdot \text{clock frequency [GHz]}}{\text{Performance for long message [cycles/byte]}}$$
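The conversion from eBASH-style cycle counts to a throughput figure can be sketched as follows. The function names are made up for this example, and the cycle counts and clock frequency are hypothetical values, not measurements taken from eBASH.

```python
def long_message_cycles_per_byte(time_4096, time_2048):
    """Long-message performance: (Time(4096) - Time(2048)) / 2048, in cycles/byte."""
    return (time_4096 - time_2048) / 2048


def software_throughput_gbit_s(cycles_per_byte, clock_ghz):
    """Throughput [Gbit/s] = 8 bits/byte * clock frequency [GHz] / (cycles/byte)."""
    return 8.0 * clock_ghz / cycles_per_byte


# Hypothetical cycle counts and clock frequency, for illustration only.
TIME_2048 = 22000   # cycles to hash a 2048-byte message
TIME_4096 = 42480   # cycles to hash a 4096-byte message
CLOCK_GHZ = 2.5

cpb = long_message_cycles_per_byte(TIME_4096, TIME_2048)
sw_throughput = software_throughput_gbit_s(cpb, CLOCK_GHZ)
print(f"{cpb} cycles/byte -> {sw_throughput} Gbit/s")
```

Taking the difference between the 4096-byte and 2048-byte timings removes the fixed per-message overhead, in the same way that initialization and finalization drop out of the hardware long-message throughput.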
How to measure hardware cost in FPGAs?
1. Stand-alone cryptographic core on FPGA
   Cost of the smallest FPGA that can fit the core.
   Unit: USD [FPGA vendors would need to publish the MSRP (manufacturer's suggested retail price) of their chips, which is not very likely], or size of the chip in mm² (easy to obtain)
2. Part of an FPGA System-on-Chip
   Vector: (CLB slices, BRAMs, MULs, DSP units) for Xilinx; (LEs, memory bits, PLLs, MULs, DSP units) for Altera
3. FPGA prototype of an ASIC implementation
   Force the implementation to use only reconfigurable logic (no DSPs or multipliers, distributed memory instead of BRAM). Use CLB slices as the metric [LEs for Altera].
How to measure hardware cost in ASICs?
1. Stand-alone cryptographic core
   Cost = f(die area, pin count)
   Tables/formulas available from semiconductor foundries
2. Part of an ASIC System-on-Chip
   Cost ~ circuit area
   Units: μm² or GE (gate equivalent) = size of a NAND2 cell
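Since one GE is defined as the area of a NAND2 cell, converting a circuit area to gate equivalents is a single division. The sketch below assumes both areas are given in μm² for the same standard-cell library; the function name and the numbers are hypothetical.

```python
def gate_equivalents(circuit_area_um2, nand2_area_um2):
    """Circuit size in GE: circuit area divided by the area of one NAND2 cell."""
    return circuit_area_um2 / nand2_area_um2


# Hypothetical areas in square micrometres, for illustration only.
CIRCUIT_AREA_UM2 = 50000.0
NAND2_AREA_UM2 = 5.0

size_ge = gate_equivalents(CIRCUIT_AREA_UM2, NAND2_AREA_UM2)
print(f"{size_ge} GE")
```

Because the NAND2 area depends on the library and process, a GE count is mainly useful for comparing designs across technologies, while μm² is tied to one specific library.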