An Encryption-Enabled Network Protocol Accelerator Steffen Peter, Mario Zessack, Frank Vater, Goran Panic, Horst Frankenfeldt, and Michael Methfessel
Outline • Motivation • TCP • General Hardware Design • Cryptographic Accelerators • Implementation • Conclusions
Motivation (figure: a wireless sensor network of tiny sensor nodes with a cluster head, connected to the Internet) • Standard TCP • High data rates • Security • Low energy
Motivation • Increasing amount of data • even in mobile and ubiquitous scenarios • need for good transport performance • Low cost • Small silicon area • Energy efficient • Need for security • Secrecy, Integrity, Reliability • Support of standard protocols • TCP (transport) • AES (data encryption) • ECC/ECDSA (signature, key agreement) Is dedicated hardware the solution?
TCP • Standard transport protocol of the Internet • Connection-oriented protocol • Three-way handshake • Complicated connection tear-down • Basic data integrity mechanism • Checksum • Error correction mechanisms • Fast retransmit, slow (re-)start, many others • Flow control • Buffer and congestion control • No actual security mechanisms
TCP – Profiling Results (figure: profiling breakdown for the transmit and receive paths)
TCP Profiling – Implications • Copying data consumes most time and energy • Reduce copy operations as much as possible • Protocol handling needs only one fifth of the total computation • Is it worth hard-wiring the TCP state machines in hardware? • Trade-off: performance vs. flexibility • Checksum is the most expensive computation • The obvious dedicated hardware unit • How to integrate it in the data flow? • Memory allocation needs more than 5 percent of the time • Can a dedicated unit help here?
TCP Profiling – Our Answers • One-copy architecture • Data is copied directly from the peripheral handler to the right memory location (assigned by the CPU) • During this one copy operation, other functional blocks (checksum, encryption) listen on the bus and do their work • MIPS CPU performs the complicated (but low-effort) TCP logic • Connection build-up/tear-down, error handling, congestion control • Software handling allows protocol variations and debugging • Dedicated checksum block • Checksum block computes the checksum during the one copy operation • No dedicated memory manager unit • A hard-wired memory manager would reduce memory-allocation time by 77 percent (from 5% to about 1% of the total) • BUT high hardware costs (300 flip-flops) and lack of flexibility
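The one-copy idea can be illustrated in software. The following is a minimal sketch, assuming the standard 16-bit ones'-complement Internet checksum used by TCP; the function name `copy_with_checksum` and the byte-array "memory" are hypothetical and only model what the hardware checksum block does while it listens on the bus.

```python
def copy_with_checksum(src: bytes, dst: bytearray, offset: int) -> int:
    """Copy src into dst at offset and return the Internet checksum of src.

    Hypothetical software model: the copy and the checksum accumulation
    happen in the same pass, so the payload is touched only once.
    """
    total = 0
    for i in range(0, len(src), 2):
        hi = src[i]
        has_lo = i + 1 < len(src)
        lo = src[i + 1] if has_lo else 0  # odd length: pad with a zero byte
        dst[offset + i] = hi              # the single copy operation
        if has_lo:
            dst[offset + i + 1] = lo
        total += (hi << 8) | lo           # accumulate 16-bit big-endian words
    while total >> 16:                    # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF                # ones' complement of the sum


packet = bytes.fromhex("0001f203f4f5f6f7")
memory = bytearray(16)                    # stands in for the packet SRAM
checksum = copy_with_checksum(packet, memory, 4)
print(hex(checksum))
```

In the chip the accumulation runs in a dedicated unit in parallel with the bus transfer, so it costs no extra CPU cycles; the loop above only shows that both results come out of a single pass over the data.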
General Design (figure: block diagram with host interface handler to the host, SRAM, CPU, system bus, register file (RF), checksum unit, and data I/O to MAC/PHY) • MIPS CPU handles complex protocols. • CPU never touches payload. • Internal 32 kByte SRAM stores packets. • AMBA bus connects system components. • Standard bus system allows modular approach. • Peripheral bus (APB) connects GPIO, UART, SPI ports. • System concept: interacting independent units. • Units exchange commands and status using the register file. • General-purpose formalism for command/status syntax.
General Design - Flow (figure: the block diagram of the general design, annotated with the path of an incoming packet from MAC/PHY to the host) • Data I/O on packet arrival: • Basic header processing • Select memory slot for packet • Packet received and checksum ok? • CPU (sleeps until wakeup): • Full header processing • Window update • Allocate memory slot for next packet • Signal application
Results (Performance and power consumption) • Simulation results: power split among the different entities. • Maximal data rate in pure software on MIPS is 20.7 Mbit/s. • Hardware accelerators reduce load on CPU and save 50% of power. • Maximal data rate with hardware accelerators is 40 Mbit/s.

Case  Rate      CPU     CPU   AMBA  Reg-  Card-  EPP/  Total
      (Mbit/s)  active  (mW)  bus   file  bus    UA    power
                              (mW)  (mW)  (mW)   (mW)  (mW)
SW    20.7      100%    60    14    7     4      4     89
HW    20.7      15%     9     14    7     4      12    46
HW    40.0      31%     18    14    7     4      12    55

• Measured power consumption is 2.5 times the simulated power. • Measured power includes pads. • Consumed power varies between production runs.
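As a rough cross-check of the simulated figures in the results above, the energy spent per transmitted bit can be derived from total power and data rate. This is a derived quantity, not one reported on the slide; the dictionary keys below are hypothetical labels.

```python
# Simulated total power (mW) and data rate (Mbit/s), copied from the results.
cases = {
    "SW @ 20.7": (89.0, 20.7),
    "HW @ 20.7": (46.0, 20.7),
    "HW @ 40.0": (55.0, 40.0),
}

# mW divided by Mbit/s gives nJ per bit (1e-3 W / 1e6 bit/s = 1e-9 J/bit).
energy_nj_per_bit = {name: p / r for name, (p, r) in cases.items()}

# Relative power saving of the accelerated design at the same data rate.
saving = 1.0 - 46.0 / 89.0

for name, e in energy_nj_per_bit.items():
    print(f"{name}: {e:.2f} nJ/bit")
print(f"power saving at 20.7 Mbit/s: {saving:.0%}")
```

Under these numbers the accelerated design needs roughly half the energy per bit at the same rate (about 2.2 instead of 4.3 nJ/bit), and about a third at 40 Mbit/s (about 1.4 nJ/bit), which agrees with the rounded 50% saving quoted on the slide.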
Cryptographic Accelerators • AES (Advanced Encryption Standard) • Symmetric block cipher • Suitable for low-power, high-throughput data streams • Standardized in November 2001 as FIPS 197 (NIST, National Institute of Standards and Technology, USA) • Input data length: 128 bit; key length: 128, 192, or 256 bit • Assumed to be secure for the next 70 years • ECC (Elliptic Curve Cryptography) • Asymmetric cryptography • Suitable for signatures and key establishment • Key length 160-571 bit (NIST standard)
Advanced Encryption Standard (AES) (figure: AES data path with 10 rounds; blocks: XOR, S-Box, ShiftRow, MixColumn, key calculation; inputs: key and data; output: data) • Huge design space • Sharing S-Boxes reduces performance but leads to smaller designs • Pipelining and parallelism boost performance – but cost area and energy
AES - Results • Throughput: ~52 Mbit/s at 33 MHz (includes input and output of the data blocks) • Size: 0.336 mm² in 0.25 µm CMOS (8,450 equivalent gates) • 70 clock cycles per 128-bit data block for en-/decryption • 72 times faster than the software implementation on MIPS (33 MHz), and it requires only 0.4% of the energy of the software solution
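The relation between the cycle count and the quoted throughput can be checked with a short calculation. This is a sketch based only on the figures above; the inferred I/O overhead is an estimate, not a number from the slides, and the variable names are hypothetical.

```python
CLOCK_HZ = 33e6          # core clock from the slide
BLOCK_BITS = 128         # AES block size
CYCLES_PER_BLOCK = 70    # cycles for one en-/decryption
QUOTED_RATE = 52e6       # approximate throughput including data I/O

# Throughput of the cipher core alone, ignoring data transfer.
core_rate = BLOCK_BITS * CLOCK_HZ / CYCLES_PER_BLOCK

# Cycles per block implied by the quoted rate; the difference to 70 is an
# estimate of the per-block input/output overhead.
effective_cycles = BLOCK_BITS * CLOCK_HZ / QUOTED_RATE
io_overhead_cycles = effective_cycles - CYCLES_PER_BLOCK

print(f"core only: {core_rate / 1e6:.1f} Mbit/s")
print(f"implied cycles per block incl. I/O: {effective_cycles:.0f}")
```

The core alone would reach about 60 Mbit/s; the quoted ~52 Mbit/s corresponds to roughly 81 cycles per block, that is, on the order of 11 additional cycles for moving each block in and out.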
Elliptic Curve Cryptography (ECC) • Asymmetric cryptography • Basis for many key exchange and signature algorithms (e.g., ECDSA) • Trapdoor: Elliptic Curve Point Multiplication (ECPM) • one can compute: Q = kP • it is infeasible to determine k for given Q and P • Higher security with shorter key lengths • about 1/10th of RSA's key size • Still, operations on elliptic curves are expensive • one 233-bit EC point multiplication needs: 1200 additions, 1500 multiplications, 800 squarings, 1 division (233 bit each, in the finite field)
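The point multiplication Q = kP can be sketched with the textbook double-and-add method. Note the assumptions: the chip works on a 233-bit curve, whereas this example uses a tiny toy curve over the prime field GF(17) (y² = x³ + 2x + 2, base point (5, 1)) purely for readability, and it is not the authors' hardware algorithm. All names are hypothetical; `pow(x, -1, m)` requires Python 3.8 or later.

```python
P_MOD = 17            # toy prime field, NOT the 233-bit field of the chip
A, B = 2, 2           # curve: y^2 = x^3 + A*x + B (mod P_MOD)
G = (5, 1)            # base point on the toy curve
INFINITY = None       # point at infinity (group identity)


def on_curve(pt):
    if pt is INFINITY:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def point_add(p1, p2):
    """Add two points using the affine chord-and-tangent formulas."""
    if p1 is INFINITY:
        return p2
    if p2 is INFINITY:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INFINITY                       # P + (-P)
    if p1 == p2:                              # tangent slope (doubling)
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:                                     # chord slope (addition)
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, pt):
    """Compute k*pt by scanning the bits of k from least significant."""
    result, addend = INFINITY, pt
    while k > 0:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)    # one doubling per bit of k
        k >>= 1
    return result


print(scalar_mult(2, G))
```

Each bit of k costs one doubling and, on average, half an addition, and every such step needs several field multiplications plus an inversion in affine coordinates. That is why the operation counts for a 233-bit multiplication run into the thousands and why a dedicated field multiplier pays off.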
ECC - Design • Asymmetric cryptography • Trapdoor: Elliptic Curve Point Multiplication • one can compute: Q = kP • it is infeasible to determine k for given Q and P • (figure: ECC design with three hardware blocks; the block labels are not recoverable) • Utilization: 15%, 95%, 50% • Area: 70%, 5%, 20%
ECC – Implementation Results • Time for one ECPM (233 bit): • MIPS: 410 ms • HW: 0.4 ms • Energy for one ECPM (233 bit): • MIPS: 16 mWs • HW: 0.03 mWs
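The improvement factors implied by these measurements follow from simple ratios. A short sketch, using only the numbers above (the variable names are hypothetical):

```python
# Time (ms) and energy (mWs) for one 233-bit point multiplication.
mips_time_ms, hw_time_ms = 410.0, 0.4
mips_energy_mws, hw_energy_mws = 16.0, 0.03

speedup = mips_time_ms / hw_time_ms            # about 1000x faster
energy_ratio = mips_energy_mws / hw_energy_mws  # about 500x less energy

print(f"speed-up: {speedup:.0f}x, energy reduction: {energy_ratio:.0f}x")
```

The hardware is thus about three orders of magnitude faster and needs between two and three orders of magnitude less energy per point multiplication.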
Implemented Chip (Design) (figure: block diagram of the chip) • Blocks shown: MIPS processor core with I-cache (16 kB), D-SPRAM (8 kB), and EJTAG (debug) • AMBA AHB bus • CardBus interface (master) to a Linux/Windows host • Bridges to the peripherals: UART, serial 1+2, GPIO • Flash • Memory controller (AHB slave) with internal SRAM (32 kB) • Data I/O control (master) with packet filter / checksum • Registers and control • ECC • AES / MD5 • EPP
Implemented Chip (Chip Photo) • Size: 7.3 x 7.4 mm (54 mm²) • Core: 44 mm² • Pads: 219 in QFP256 package • Transistors: 4.8 M in 0.25 µm • Packet SRAM: 32 kByte • Instruction cache: 16 kByte • Data scratchpad: 8 kByte
Implementation (Test Board) • Allows: • Testing the implementation in practice • Tests of interoperability • Performance tests • Energy measurements
Conclusions • Profiling of TCP/IP code identified bottlenecks • TCP checksum and copying use 90% of power. • High data rates need hardware accelerators. • Chip is a hardware solution for TCP/IP handling • Takes care of middle protocol layers efficiently. • AMBA-based bus as prototype for modular systems • Assemble different systems quickly. • Pre-tested components lead to reliability. • Cryptographic components allow security at low cost • Designs for AES and ECC improve performance and energy consumption for security operations by up to three orders of magnitude. • TCP chip creates basis for further developments • Extension to higher data rates (Gbit/s). • Use as component of complex single-chip systems.
Thank You Questions? peter@ihp-microelectronics.com