
Unified Architectures for Efficient and Compact Crypto-Processing

Erkay Savaş, Sabancı University



  1. Unified Architectures for Efficient and Compact Crypto-Processing
     Erkay Savaş, Sabancı University

  2. Outline
     • Research Motivation
     • Public Key Cryptography
     • Unified Arithmetic
     • High-Radix Multiplication
     • Dual-Radix Multiplication
     • Support for GF(3^n) Arithmetic
     • Implementation Results
     • Future Research

  3. Motivation
     • Compatibility
       • support for fast arithmetic in different finite fields and groups
     • Saving in area
     • Improved {time × area} metric
     • Algorithm agility
       • NTRU and ECC

  4. Public Key Cryptography (PKC)
     • Each user has a pair of keys:
       • Private key: known only to the owner
       • Public key: known to everyone in the system, with assurance
     • Encryption:
       • Performed with the public key of the receiver
     • Decryption:
       • Only the receiver can decrypt the message, using his or her private key

  5. Public Key Cryptography in Use
     • RSA, Rabin's scheme
       • Integer factorization, square roots modulo a composite number
     • Discrete logarithm based algorithms
       • Diffie-Hellman key exchange, ElGamal
     • Elliptic curve DH key exchange, ECDSA
       • Discrete logarithm over elliptic curves
     • IBE
       • Pairings over elliptic curve points

  6. RSA
     • Most popular PKC
     • Invented by Rivest, Shamir and Adleman in 1977 at MIT
     • Its patent expired in 2000
     • Based on the integer factorization problem
     • Each user has a public and private key pair

  7. RSA Encryption & Decryption
     • Encryption is done with the public key
       • y = x^e mod n, where x, y < n
     • Decryption is done with the private key
       • x = y^d mod n
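
The two formulas on this slide can be tried out with a toy key. The sketch below is only an illustration under stated assumptions: the primes 61 and 53 and the exponent 17 are hypothetical textbook-sized values that do not come from the slides, and no padding is applied, so it is not a secure scheme.

```python
# Toy RSA parameters: far too small for real use, chosen only for readability.
p, q = 61, 53              # secret primes (assumed example values)
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
e = 17                     # public exponent, coprime to phi
d = pow(e, -1, phi)        # private exponent via modular inverse (Python 3.8+)


def rsa_encrypt(x: int) -> int:
    """y = x^e mod n, with x < n."""
    return pow(x, e, n)


def rsa_decrypt(y: int) -> int:
    """x = y^d mod n."""
    return pow(y, d, n)


message = 65
ciphertext = rsa_encrypt(message)
recovered = rsa_decrypt(ciphertext)
```

Both directions are a single modular exponentiation, which is why modular multiplication dominates the cost of RSA.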

  8. DL Based Cryptosystems
     • Fundamental operation
       • g^x mod p, where x, g < p and g is a primitive element
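
As a small illustration of how g^x mod p is used, here is a Diffie-Hellman exchange with toy numbers. The values p = 23 and g = 5 are assumptions made for this sketch (5 generates all nonzero residues modulo 23); real systems use much larger parameters.

```python
import secrets

p = 23   # small prime modulus (assumed toy value)
g = 5    # primitive element modulo 23


def dl_exp(base: int, exponent: int) -> int:
    """The fundamental operation: base^exponent mod p."""
    return pow(base, exponent, p)


# Each party picks a secret exponent in 1..p-2 and publishes g^x mod p.
alice_secret = secrets.randbelow(p - 2) + 1
bob_secret = secrets.randbelow(p - 2) + 1
alice_public = dl_exp(g, alice_secret)
bob_public = dl_exp(g, bob_secret)

# Both sides arrive at g^(xy) mod p without ever sending their secrets.
shared_by_alice = dl_exp(bob_public, alice_secret)
shared_by_bob = dl_exp(alice_public, bob_secret)
```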

  9. Elliptic Curve Cryptography 1/2
     • Emerging public key cryptography standard for constrained devices
     • A 160-bit ECC key is equivalent in cryptographic strength to 1024-bit RSA
     • 313-bit ECC is equivalent to 4096-bit RSA
     • Elliptic curves, as algebraic/geometric entities, have been studied extensively for the past 150 years
     • Rich and deep theory suitable for cryptography
     • First proposed for cryptographic use in 1985, independently by Neal Koblitz and Victor Miller

  10. Elliptic Curve Cryptography 2/2
     • Dominant fundamental operation
       • Multiplication in GF(q), where q = p^k and p is prime
     • Alternatives
       • GF(p): k = 1
       • GF(2^k): p = 2
       • GF(p^k)
       • GF(3^k): p = 3

  11. Identity Based Encryption (IBE)
     • The public key can be any string
       • e-mail address, name, etc.
     • No need for certificates
     • Anonymity is achieved
       • users can choose any public key without revealing their ID
       • users can easily change their public key

  12. IBE – Bilinear Mapping
     • e(xP, yQ) = e(P, Q)^(xy) = e(yP, xQ) = g
       • g is an element of (an extension of) the underlying field
     • Bilinear mappings over elliptic curves
       • Weil pairing
       • Tate pairing
     • Resource consuming
     • Most efficient bilinear mappings
       • defined on curves over GF(3^k)

  13. An Introduction to Unified Arithmetic
     • Three types of finite fields are heavily used
       • Prime fields, GF(p)
       • Binary extension fields, GF(2^k)
       • Ternary extension fields, GF(3^k) (recently, due to IBE schemes)
     • These finite fields have dissimilar properties
     • They are usually implemented separately on specialized hardware

  14. Unified Arithmetic
     • A unified hardware design methodology requires
       • A single (unified) datapath
       • A single (unified) control
       • Insignificant overhead in area
       • Insignificant overhead in time complexity (e.g. critical path delay)
       • A good {time × area} metric

  15. Unified Arithmetic (GF(p) + GF(2^k))
     • A unified hardware design methodology for both fields is possible because:
       • the elements of either field are represented using almost the same data structures in digital systems
       • the algorithms for basic arithmetic operations in both fields have structural similarities (i.e. the steps of the algorithms are almost identical)
     • Hence unified arithmetic is possible

  16. Finite Field Operations in ECC
     • Addition in GF(p) and GF(2^k)
       • Relatively inexpensive in area and time complexity
     • Multiplicative inversion in GF(p) and GF(2^k)
       • Prohibitively expensive in terms of time
       • Some inversions can be avoided
     • Multiplication in GF(p) and GF(2^k)
       • Expensive in terms of time and area
       • Usually the most important operation
       • Our focus

  17. Montgomery Multiplication
     • Very efficient way of doing multiplication in GF(p) and GF(2^k) (now also in GF(3^k))
     • Faster (replaces division by shifts)
     • Suitable for unified design
     • Suitable for scalable design
     • Highly parallel
     • Suitable for pipelining

  18. Montgomery Multiplication
     • Definition
       • Given a, b ∈ GF(p), MonMul(a, b) = a·b·R^(-1) mod p, where R = 2^k mod p and k = ⌈log2 p⌉
     • Algorithm
       c := 0
       for i = 0 to k-1
           c := c + a_i · b
           c := (c + c_0 · p) / 2
       if c ≥ p then c := c - p   (final subtraction)
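
The radix-2 algorithm on this slide translates almost line by line into software. The following is a minimal sketch, assuming an odd modulus p and operands already reduced below p; the function name `mon_mul` and the example modulus are illustrative choices, not taken from the slides.

```python
def mon_mul(a: int, b: int, p: int) -> int:
    """Bit-serial Montgomery product a*b*2^(-k) mod p for odd p and a, b < p."""
    k = p.bit_length()            # k = ceil(log2 p) when p is not a power of two
    c = 0
    for i in range(k):
        a_i = (a >> i) & 1        # scan the multiplier one bit per iteration
        c += a_i * b
        c = (c + (c & 1) * p) >> 1  # add p if c is odd, then halve exactly
    if c >= p:                    # final subtraction; c is always below 2p here
        c -= p
    return c


# Example with an assumed 32-bit prime modulus.
P_EXAMPLE = 0xFFFFFFFB            # 2^32 - 5
R_EXAMPLE = pow(2, P_EXAMPLE.bit_length(), P_EXAMPLE)
product = mon_mul(123456789, 987654321, P_EXAMPLE)
```

The only division is by 2, which becomes a right shift; that is the property the previous slide refers to as replacing division by shifts.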

  19. Algorithm for GF(2^k)
     • Input: a(x), b(x) ∈ GF(2^k), p(x) and k
     • Output: c(x) = a(x)·b(x)·x^(-k) mod p(x), with c(x) ∈ GF(2^k)
     • Algorithm
       c(x) := 0
       for i = 0 to k-1
           c(x) := c(x) ⊕ a_i · b(x)
           c(x) := (c(x) ⊕ c_0 · p(x)) / x
     • No final subtraction
     • Note that c/2 and c(x)/x are implemented in an identical way in SW and HW
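
The binary-field version can be sketched the same way, with polynomials stored as integer bit masks so that ⊕ becomes XOR and division by x becomes a right shift. The helper `gf2_mul_ref` and the choice of the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) are assumptions added here only to have something to check against.

```python
def mon_mul_gf2(a: int, b: int, p: int, k: int) -> int:
    """Montgomery product a(x)*b(x)*x^(-k) mod p(x); bit i is the coefficient of x^i."""
    c = 0
    for i in range(k):
        if (a >> i) & 1:
            c ^= b          # c(x) := c(x) + a_i * b(x), addition is XOR
        if c & 1:
            c ^= p          # clear the constant term so division by x is exact
        c >>= 1             # c(x) := c(x) / x
    return c                # degree is already below k: no final subtraction


def gf2_mul_ref(a: int, b: int, p: int, k: int) -> int:
    """Plain shift-and-add multiplication modulo p(x), used only as a reference."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= p
    return result


AES_POLY, K = 0x11B, 8
X_TO_K = AES_POLY ^ (1 << K)   # x^k mod p(x)
mont = mon_mul_gf2(0x57, 0x83, AES_POLY, K)
```

Multiplying `mont` by x^k modulo p(x) gives back the ordinary field product, which is how the sketch can be checked.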

  20. Representation
     • Addition
       • Atomic operation: multiplication is performed as repeated addition
     • Unified addition
       • most efficient when carry-save representation is used for elements of GF(p)
     • Carry-save representation
       • an integer is represented as the sum of two other integers
       • x = x_s + x_c (sum and carry parts, respectively)
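
A short sketch of carry-save addition may help here. It models each bit position as an independent full adder acting on Python integers; the function name and the sample operands are hypothetical.

```python
def carry_save_add(xs: int, xc: int, y: int) -> tuple[int, int]:
    """Add y to the carry-save number (xs, xc) without propagating carries.

    Every bit position works as an independent full adder, so the delay
    does not grow with the operand length.
    """
    new_sum = xs ^ xc ^ y                                # bitwise sum
    new_carry = ((xs & xc) | (xs & y) | (xc & y)) << 1   # carries move one position left
    return new_sum, new_carry


# Accumulate several operands, resolving carries only once at the end.
xs, xc = 0, 0
for operand in (0b1011, 0b0110, 0b1111, 0b0001):
    xs, xc = carry_save_add(xs, xc, operand)
total = xs + xc    # single carry-propagate addition
```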

  21. Scalability
     • The original Montgomery multiplication algorithm performs full-precision integer additions
       • Not scalable
     • Instead,
       • long integers are divided into words
       • additions of words are handled separately on word adders
     • The choice of word length depends on the precision, area and speed requirements
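
To make the word-based idea concrete, the sketch below repeats the radix-2 Montgomery loop but touches the operands only one w-bit word at a time. It is a sequential software model under assumptions of its own (the name `mon_mul_words`, the word count `n_words`, and processing all words in a single loop); it does not attempt to reproduce the pipelined hardware organization of the slides.

```python
def mon_mul_words(a: int, b: int, p: int, w: int = 32) -> int:
    """Radix-2 Montgomery product a*b*2^(-k) mod p using only w-bit word additions."""
    k = p.bit_length()
    n_words = -(-(k + 2) // w)    # intermediate sums stay below 4p < 2^(k+2)
    mask = (1 << w) - 1

    def split(x: int) -> list[int]:
        return [(x >> (w * j)) & mask for j in range(n_words)]

    B, P, C = split(b), split(p), [0] * n_words
    for i in range(k):
        a_i = (a >> i) & 1
        q = (C[0] + a_i * B[0]) & 1          # parity decides whether p is added
        carry = 0
        for j in range(n_words):             # word-by-word addition with carry
            t = C[j] + a_i * B[j] + q * P[j] + carry
            C[j] = t & mask
            carry = t >> w
        for j in range(n_words):             # one-bit right shift across words
            incoming = C[j + 1] & 1 if j + 1 < n_words else 0
            C[j] = (C[j] >> 1) | (incoming << (w - 1))
    c = sum(word << (w * j) for j, word in enumerate(C))
    return c - p if c >= p else c


M127 = 2**127 - 1                  # assumed example modulus (a Mersenne prime)
word_product = mon_mul_words(3**70, 5**50, M127, w=32)
```

Changing `w` changes only the number of words per operand, not the algorithm, which is the sense in which the design scales.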

  22. Word-Based Multiplication
     [Figure: two processing units, PU_i (multiplier bit a_i) and PU_{i+1} (multiplier bit a_{i+1}), operating on words b(j), p(j), c(j) and b(j+1), p(j+1), c(j+1); the bits of c(j) are labeled c(j)_0, c(j)_1, ..., c(j)_{w-1}]

  23. Dependency Graph

  24. Processing Unit (PU) with w = 2
     [Figure: four dual-field adders sharing the FSEL control input; outputs C1(j) and C0(j)]

  25. Dual-Field Adder (DFA) 1/2
     • Almost identical to a full adder (FA)
     • Difference
       • it has an additional (control) input, FSEL, which suppresses the carry output of the adder when set to logic 0
     • Namely, when FSEL = 0 the adder operates in GF(2^k); otherwise it is a regular FA

  26. DFA 2/2
     [Figure: dual-field adder cell with inputs A, B, C and FSEL, and outputs S and Cout]
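
The behaviour described for the dual-field adder can be modelled at the bit level. This is a sketch under one assumption not stated on the slides: the sum output is taken to be the three-input XOR in both modes, and FSEL simply gates the majority (carry) function.

```python
def dual_field_adder(a: int, b: int, c: int, fsel: int) -> tuple[int, int]:
    """One-bit dual-field adder: returns (S, Cout) for input bits A, B, C and FSEL.

    FSEL = 1: regular full adder (integer / GF(p) mode).
    FSEL = 0: carry output suppressed, so addition is bitwise XOR (GF(2^k) mode).
    """
    s = a ^ b ^ c
    cout = fsel & ((a & b) | (a & c) | (b & c))
    return s, cout


# Truth table for both modes, keyed by (FSEL, A, B, C).
table = {
    (fsel, a, b, c): dual_field_adder(a, b, c, fsel)
    for fsel in (0, 1) for a in (0, 1) for b in (0, 1) for c in (0, 1)
}
```

Because the two modes differ by a single AND gate on the carry path, the area overhead of the unified cell over a plain full adder is small.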

  27. Pipeline Organization with Two PUs
     [Figure: blocks SR-a, RAM-a, RAM-b, RAM-p, PU-1, PU-2 and SR-C]
     • s: the number of PUs

  28. Total Computation Time (in clock cycles)
     • w: word size, k: precision, e := k/w, s: the number of PUs

  29. Example Execution Times
     • Example: k = 1024, w = 32
       • s = 17 → T = 2105
       • s = 15 → T = 2305
       • s = 10 → T = 3415
       • s = 1 → T = 33792
     • Example: k = 2048, w = 32
       • s = 33 → T = 4221
       • s = 30 → T = 4543
       • s = 10 → T = 13343
       • s = 1 → T = 133120
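
The cycle-count formula itself did not survive in this transcript. The sketch below therefore uses an assumed two-case expression of the kind usually given for a pipeline of s processing units: one case when the pipeline stalls waiting for words, one when it does not. With this assumption the code reproduces seven of the eight figures listed above; for k = 1024, s = 10 it yields 3417, two cycles more than the 3415 on the slide.

```python
def total_cycles(k: int, w: int, s: int) -> int:
    """Assumed clock-cycle count for a k-bit multiplication with s PUs and w-bit words."""
    e = k // w                      # number of words, e := k/w as on the slide
    passes = -(-k // s)             # ceil(k / s): times the operand crosses the pipeline
    if e + 1 > 2 * s:
        # More words than the pipeline can absorb: each pass costs e + 1 cycles.
        return passes * (e + 1) + 2 * (s - 1)
    # Enough PUs: each pass costs 2s cycles, plus the time to drain the last words.
    return passes * 2 * s + e - 1


examples = {
    (1024, 17): total_cycles(1024, 32, 17),
    (1024, 15): total_cycles(1024, 32, 15),
    (1024, 1): total_cycles(1024, 32, 1),
    (2048, 33): total_cycles(2048, 32, 33),
    (2048, 30): total_cycles(2048, 32, 30),
    (2048, 10): total_cycles(2048, 32, 10),
    (2048, 1): total_cycles(2048, 32, 1),
}
```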

  30. Comparison to the Single-Field (GF(p)) Design
     • w: word size
     • 1.2 µm CMOS technology

  31. Design Alternatives
     • Higher radix
       • The original design is radix 2
       • Namely, the multiplier is scanned one bit per clock cycle
       • It is possible to scan two or more bits of the multiplier a
       • Radix-4: two bits
       • Radix-8: three bits
     • More complex design: lower clock frequency, larger area
     • Fewer clock cycles → faster execution of multiplication
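
The effect of a higher radix on the iteration count can be seen in a software sketch that scans r multiplier bits per step. This is a generic radix-2^r Montgomery loop written for illustration, with assumed names (`mon_mul_radix`, `p_neg_inv`); it is not a model of the hardware multipliers compared on these slides.

```python
def mon_mul_radix(a: int, b: int, p: int, r: int = 2) -> tuple[int, int]:
    """Radix-2^r Montgomery multiplication for odd p and a, b < p.

    Returns (a*b*2^(-r*n) mod p, n) where n is the number of iterations.
    """
    k = p.bit_length()
    n = -(-k // r)                            # ceil(k / r) digits of the multiplier
    base = 1 << r
    p_neg_inv = (-pow(p, -1, base)) % base    # -p^(-1) mod 2^r (Python 3.8+)
    c = 0
    for i in range(n):
        digit = (a >> (r * i)) & (base - 1)   # scan r bits of a at once
        c += digit * b
        q = (c * p_neg_inv) % base            # multiple of p that clears the low r bits
        c = (c + q * p) >> r                  # exact division by 2^r
    return (c - p if c >= p else c), n


P32 = 0xFFFFFFFB                              # assumed example modulus, 2^32 - 5
results = {r: mon_mul_radix(123456789, 987654321, P32, r) for r in (1, 2, 3)}
```

For this 32-bit modulus the loop runs 32, 16 and 11 times for radix 2, 4 and 8 respectively, at the price of a multi-bit quotient digit and digit-by-operand products in each step.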

  32. Comparison
     • Higher radix vs. single radix
     • Metric
       • area × time
     • For small total area (i.e. < 10,000 equivalent NAND gates), the performance of radix-2 and radix-8 is comparable
     • The radix-8 multiplier outperforms the radix-2 multiplier by more than a factor of 3 when the total area is around 25,000 NAND gates

  33. Dual-Radix Multiplier
     [Figure: MUX-1, MUX-2, selection logic and a 3x2 dual-field adder]
     • Radix-2 for GF(p) and radix-4 for GF(2^k)

  34. Dual-Radix Multiplier
     • Three multipliers
       • A1: GF(p)-only multiplier
       • A2: single-radix unified multiplier (with precomputation)
       • A3: dual-radix multiplier
     • Performance (area × time)
       • A3 performs slightly worse than A1 and A2 (by 7% to 19%) in GF(p) mode
       • A3 outperforms A2 by 38% to 46% in GF(2^k) mode

  35. Unified Arithmetic?
     • Unified multiplier
       • carry-save adders are used in the multiplier
     • Other arithmetic operations, such as subtraction and comparison (both essential in inversion), are not easy to perform in carry-save representation

  36. New Redundant Representation
     • Recall:
       • Carry-save representation
       • X = x_s + x_c
     • New redundant representation
       • Redundant signed digit (RSD) representation
       • X = x_p - x_n
     • Subtraction is equivalent to addition
       • X - Y = (x_p - x_n) - (y_p - y_n) = (x_p - x_n) + (y_n - y_p)
     • Comparison is relatively easy
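
The subtraction identity above can be checked with a tiny model in which a value is stored as a pair (x_p, x_n). This is only an arithmetic illustration with hypothetical function names: the component additions below use ordinary Python integers, whereas a hardware RSD adder would combine digits without long carry chains.

```python
RSD = tuple[int, int]   # (positive part, negative part), value = xp - xn


def rsd_value(x: RSD) -> int:
    xp, xn = x
    return xp - xn


def rsd_negate(x: RSD) -> RSD:
    """Negation just swaps the two components."""
    xp, xn = x
    return (xn, xp)


def rsd_add(x: RSD, y: RSD) -> RSD:
    """Component-wise addition: (xp + yp) - (xn + yn)."""
    return (x[0] + y[0], x[1] + y[1])


def rsd_sub(x: RSD, y: RSD) -> RSD:
    """X - Y = (xp - xn) + (yn - yp): subtraction is addition of the swapped operand."""
    return rsd_add(x, rsd_negate(y))


x, y = (25, 4), (9, 30)          # X = 21, Y = -21
difference = rsd_sub(x, y)
```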

  37. RSD
     • All previous multipliers require a reverse transformation to non-redundant form after each multiplication
       • An ECC computation involves thousands of multiplications
     • With RSD, all computation can be done in RSD form without any reverse transformation
       • a single transformation is necessary only if the result is needed in non-redundant form

  38. Support for GF(3^n) Arithmetic
     • RSD lends itself to a unified arithmetic architecture that efficiently supports GF(3^n) arithmetic

  39. Analysis
     • A1: GF(p)-only architecture
     • A2: GF(2^k)-only architecture
     • A3: GF(3^n)-only architecture
     • A4: Unified architecture (GF(p) + GF(2^k))
     • A5: Unified architecture (GF(p) + GF(2^k) + GF(3^n))
     • A1 + A2: Hypothetical architecture that has separate datapaths for GF(p) and GF(2^k)

  40. Analysis
     • Metric: area × time
     • A4 over A1 + A2: 7.94%
     • A5 over A1 + A2 + A3: 33.54%
     • A5 over A4 + A3: 28.36%

  41. Implementation Results
     • 2.38 GHz, 0.13 µm CMOS
     • 4 PUs → ~11,000 NAND gates; 8 PUs → ~15,000 NAND gates

  42. Research Directions
     • Embed the unified architectures into common general-purpose processors
     • Unified inversion using RSD
     • Unified architectures for other PKC

  43. Ending…
     • Questions
     • Contact
       • Erkay Savaş
       • erkays@sabanciuniv.edu
       • http://people.sabanciuniv.edu/~erkays
