Superscalar Coprocessor for High-speed Curve-based Cryptography

K. Sakiyama, L. Batina, B. Preneel, I. Verbauwhede
Katholieke Universiteit Leuven / IBBT, Department of Electrical Engineering - ESAT/COSIC

Overview
• Introduction
• Curve-based Cryptography
• HW/SW Partitioning
• Superscalar Coprocessor
• Results
• Conclusions

Introduction: Motivation
• High-speed curve-based cryptography in HW/SW co-design
• How much instruction-level parallelism can we obtain from coprocessor instructions?
• Performance improvement for different operation forms in the datapath
  • AB+C mod P vs. A(B+D)+C mod P, where A, B, C, D, P are polynomials
• Performance comparison of three different curve-based cryptosystems
  • Which is fastest: ECC, HECC, or ECC over a composite field?
• Programmability and scalability
  • Programmable in order to support different cryptosystems?
  • Scalable in field sizes?

Introduction: Target Architecture
• Curve-based cryptography over binary fields
  • Hardware can be smaller and faster than for prime fields
  • ECC over a binary field, e.g. GF(2^163)
  • HECC of genus 2: field length can be shorter by a factor of 2, e.g. GF(2^83)
  • ECC over a composite field: field length can be shorter by a factor of 2, e.g. GF((2^83)^2)
• The datapath can be shared
  • Programmable coprocessor supporting three curve-based cryptosystems by defining coprocessor instruction(s)
  • (Coprocessor) instruction-level parallelism through a superscalar architecture

Curve-based Cryptography: HW/SW Partitioning (1)
• General hierarchy in a coprocessor for curve-based cryptography
  • Point/Divisor Multiplication (SW or HW controller)
  • Point/Divisor Addition, Point/Divisor Doubling (SW or HW controller)
  • Finite Field Addition, Finite Field Multiplication, Finite Field Inversion (HW datapath)

Curve-based Cryptography: Proposed Hierarchy (1)
• Single instruction for all finite field operations
• Fixed-cycle execution enables efficient implementation
• Conventional hierarchy
  • Point/Divisor Multiplication
  • Point/Divisor Addition, Point/Divisor Doubling
  • Finite Field Addition, Finite Field Multiplication, Finite Field Inversion
• Single-instruction (datapath) hierarchy
  • Point/Divisor Multiplication
  • Point/Divisor Addition, Point/Divisor Doubling, Finite Field Inversion
  • Finite Field Operation, e.g. AB+C mod P
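As a minimal sketch of how a single AB+C mod P instruction can cover all finite field operations, the Python model below treats that operation as the only primitive over GF(2^m). The names `malu`, `gf_add`, `gf_mul`, and `gf_inv`, and the toy field GF(2^4) with P = x^4 + x + 1, are illustrative assumptions and not the coprocessor's actual instruction set. Inversion is built here from Fermat's little theorem (a^(2^m - 2)), which is one common choice and may differ from the authors' method.

```python
def malu(a, b, c, p, m):
    """Return a*b + c mod p over GF(2^m); polynomials are ints used as bit masks.

    p must include its leading x^m term; a, b, c must have degree < m.
    """
    acc = 0
    for i in reversed(range(m)):      # scan b from its top bit down (Horner)
        acc <<= 1                     # multiply the accumulator by x
        if (acc >> m) & 1:            # degree reached m: subtract (XOR) p
            acc ^= p
        if (b >> i) & 1:
            acc ^= a                  # addition in GF(2) is XOR
    return acc ^ c


def gf_add(a, c, p, m):
    """Addition as a special case: a*1 + c."""
    return malu(a, 1, c, p, m)


def gf_mul(a, b, p, m):
    """Multiplication as a special case: a*b + 0."""
    return malu(a, b, 0, p, m)


def gf_inv(a, p, m):
    """Inverse of nonzero a as a^(2^m - 2), using only gf_mul (assumed method)."""
    result, sq = 1, a
    for _ in range(m - 1):
        sq = gf_mul(sq, sq, p, m)         # sq = a^(2^i) for i = 1 .. m-1
        result = gf_mul(result, sq, p, m)  # accumulate a^(2 + 4 + ... + 2^(m-1))
    return result


# Toy field for illustration only: GF(2^4) with P = x^4 + x + 1.
P, M = 0b10011, 4
x_times_x3 = gf_mul(0b0010, 0b1000, P, M)   # x * x^3 = x^4 = x + 1
```

Because every operation reduces to the same fixed-length loop, each call takes the same number of iterations regardless of operand values, which mirrors the fixed-cycle execution noted on the slide.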

Curve-based Cryptography: Modular Arithmetic Logic Unit (MALU)
• (a) Building block: regular XOR chains
• (b) Scalable in digit size (d) and field size (k) by interconnecting several building blocks
• We use MALU83 (n=83, d=12) as the building block
• 2xMALU83 can be configured as 1xMALU163
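The digit-serial idea behind the MALU can be sketched in Python as follows: the multiplier is consumed d bits at a time, most significant digit first, so a field of size m needs ceil(m/d) steps. This is a behavioural sketch only. The function names, the most-significant-digit-first order, the optional `d_poly` operand used to model A(B+D)+C, and the assumption of one digit per step are all illustrative, and the test field is GF(2^8) with the well-known polynomial x^8 + x^4 + x^3 + x + 1 rather than the 83-bit field of the slides.

```python
from math import ceil


def clmul(a, b):
    """Carry-less product of a and b, i.e. multiplication in GF(2)[x]."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_poly(v, p, m):
    """Reduce polynomial v modulo p, where p has degree m."""
    for i in range(v.bit_length() - 1, m - 1, -1):
        if (v >> i) & 1:
            v ^= p << (i - m)         # cancel the term of degree i
    return v


def malu_digit_serial(a, b, c, p, m, d, d_poly=0):
    """Return (a*(b + d_poly) + c mod p, number of digit steps).

    With d_poly == 0 this is AB+C mod P; otherwise it models A(B+D)+C mod P.
    """
    b ^= d_poly                       # B + D is a plain XOR over GF(2)
    steps = ceil(m / d)
    mask = (1 << d) - 1
    acc = 0
    for s in reversed(range(steps)):  # most significant digit first
        digit = (b >> (s * d)) & mask
        acc = reduce_poly((acc << d) ^ clmul(a, digit), p, m)
    return acc ^ c, steps


# Illustration in GF(2^8) with P = x^8 + x^4 + x^3 + x + 1 and digit size 3.
P8 = 0x11B
product, steps8 = malu_digit_serial(0x57, 0x83, 0, P8, 8, 3)

# Under the one-digit-per-step assumption, k = 83 and d = 12 give this many steps.
steps83 = ceil(83 / 12)
```

One A(B+D)+C call replaces two chained AB+C calls (AB+C followed by AD plus the previous result), which is where the fused form can save instructions.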

HW/SW Partitioning: TYPE I, smallest implementation (baseline)
• Main CPU side: Main CPU, SRAM, Program ROM, Memory Mapped I/O
• Interface: Instruction Bus (32-bit instructions), Data Bus (32-bit data)
• Coprocessor: IBC, DBC, MALU83

HW/SW Partitioning: TYPE II, TYPE I + m-code RAM
• Main CPU side: Main CPU, SRAM, Program ROM, Memory Mapped I/O
• Interface: Instruction Bus (32-bit instructions), Data Bus (32-bit data)
• Coprocessor: IBC, FSM, m-code RAM, DBC, MALU83

HW/SW Partitioning: TYPE III, TYPE I + Coprocessor Memory
• Main CPU side: Main CPU, SRAM, Program ROM, Memory Mapped I/O
• Interface: Instruction Bus (32-bit instructions), Data Bus (32-bit data)
• Coprocessor: IBC, DBC, MALU83, Coprocessor Memory

HW/SW Partitioning: TYPE IV, TYPE I + Coprocessor Memory & m-code RAM
• Main CPU side: Main CPU, SRAM, Program ROM, Memory Mapped I/O
• Interface: Instruction Bus (32-bit instructions), Data Bus (32-bit data)
• Coprocessor: IBC, FSM, m-code RAM, DBC, MALU83, Coprocessor Memory

HW/SW Partitioning: Co-design flow with GEZEL
• C/C++ codes for PKCs
• Partitioning of functions
• C/C++ codes & H/W behavior blocks w/interface
• ARM (SW): C/C++ codes w/physical memory map -> cross compile -> program codes
• Co-processor (HW): GEZEL FDL codes -> synthesis -> VHDL codes
• Cycle-true sim. (GEZEL)

HW/SW Partitioning: Result, Vertical Exploration of System
• HECC performance for different HW/SW partitioning (performance: point/divisor multiplication)

Superscalar Coprocessor: Proposed Hierarchy (2)
• Multiple Modular Arithmetic Logic Units (MALUs) in the coprocessor
• Single MALU
  • Point/Divisor Multiplication
  • Point/Divisor Addition, Point/Divisor Doubling, Finite Field Inversion
  • One Finite Field Operation, e.g. AB+C mod P
• Multiple MALUs
  • Point/Divisor Multiplication
  • Point/Divisor Addition, Point/Divisor Doubling, Finite Field Inversion
  • Several Finite Field Operations side by side, each e.g. AB+C mod P

Superscalar Coprocessor: Parallel Processing Architecture (TYPE IV-based)
• Main CPU side: Main CPU, SRAM, Program ROM, Memory Mapped I/O
• Interface: Instruction Bus (32-bit instructions), Data Bus (32-bit data), Buffer Full
• Coprocessor: IBC, FSM, m-code RAM, DBC, IQB, 4 x MALU83, Coprocessor Memory

Superscalar Coprocessor: Horizontal Exploration of System
• Performance of ECC and HECC

Results: Performance of ECC over GF(2^83)
• Fastest of the three
• 1.8x speed-up by 2-way superscaling (ILPDP=6) with A(B+D)+C
• Further improvement is still possible by adding MALUs
• Chart compares AB+C and A(B+D)+C

Results: Performance of HECC over GF(2^83)
• Faster than ECC over a composite field
• 2.7x speed-up by 4-way superscaling (ILPDP=5) with A(B+D)+C
• Diminishing improvement as the number of MALUs increases
• Chart compares AB+C and A(B+D)+C

Results: Performance of ECC over GF((2^83)^2)
• Slowest of the three
• 2.5x speed-up by 4-way superscaling (ILPDP=6) with A(B+D)+C
• Diminishing improvement as the number of MALUs increases
• Chart compares AB+C and A(B+D)+C

Results: Comparison of ECC/HECC implementations on FPGAs
[11] T. Wollinger, PhD thesis, 2004.
[13] G. Orlando and C. Paar, CHES 2000.
[14] N. Gura et al., CHES 2002.
[29] N. A. Saqib et al., International Journal of Embedded Systems, 2005.

Conclusions
• Performance improvement / comparison
  • ECC was improved by a factor of 1.8 (2-way)
  • HECC (genus 2) was improved by a factor of 2.7 (4-way)
  • ECC over a composite field was improved by a factor of 2.5 (4-way)
  • A(B+D)+C offers better performance than AB+C
  • ECC is the fastest in this case study
• Programmability & flexibility
  • Supports three different curve-based cryptosystems over a binary field
  • Arbitrary irreducible polynomial
  • Field size up to 332 bits by using 4xMALU83

Thank you!

Parallel issue of instructions: Case of using 4 MALUs
• IF/D: Instruction Fetch & Decode
• R_: Read operands (dependent on the type of operation)
• EX: Execution (dependent on MALU configuration, k & d)
• W_: Write (dependent on # of instructions issued in parallel)

Parallel issue of instructions: Out-of-order Execution
• Check RAW (Read After Write) dependencies for in-order/out-of-order execution
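The RAW check can be illustrated with a small Python sketch that groups queued instructions into parallel issue slots for a given number of MALUs. Everything here is an assumption made for illustration: the `Instr` record, the register names, the greedy grouping policy, and the simplification that every result is available before the next group issues. Only RAW hazards are checked, as on the slide; the sketch sidesteps WAR and WAW hazards by assuming each destination register is written exactly once.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction


def has_raw(later, earlier):
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dest in later.srcs


def schedule(program, n_malus, out_of_order):
    """Return issue groups (lists of program indices), one group per issue step."""
    pending = list(range(len(program)))
    groups = []
    while pending:
        group, blocked = [], []
        for idx in pending:
            ins = program[idx]
            # A RAW hazard exists against anything issued in this group
            # or any earlier instruction that has not issued yet.
            conflict = any(has_raw(ins, program[j]) for j in group + blocked)
            if len(group) < n_malus and not conflict:
                group.append(idx)
            else:
                blocked.append(idx)
                if not out_of_order:
                    break          # in-order issue stops at the first stall
        groups.append(group)
        pending = [i for i in pending if i not in group]
    return groups


# Hypothetical sequence of AB+C style operations (register names are made up).
program = [
    Instr("t1", ("a", "b", "c")),
    Instr("t2", ("t1", "d", "e")),   # RAW on t1
    Instr("t3", ("f", "g", "h")),
    Instr("t4", ("t3", "a", "b")),   # RAW on t3
    Instr("t5", ("t2", "t4", "c")),  # RAW on t2 and t4
]

in_order = schedule(program, 2, out_of_order=False)
reordered = schedule(program, 2, out_of_order=True)
```

In this toy sequence, in-order issue on two MALUs needs four issue steps while out-of-order issue needs three, because independent instructions can fill slots left idle by stalled ones.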
