
Explaining The Gap Between ASIC and Custom Power: A Custom Perspective

Andrew Chang, Cadence Design Systems*; William J. Dally, Computer Systems Laboratory, Stanford University. * Work done while author was at Stanford.



Presentation Transcript


  1. Explaining The Gap Between ASIC and Custom Power: A Custom Perspective • Andrew Chang, Cadence Design Systems* • William J. Dally, Computer Systems Laboratory, Stanford University • * Work done while author was at Stanford

  2. Design Tradeoffs: Power vs. Performance • (1) Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom • [Figure: power vs. performance plot with moves 1, 2, and 3 marked; axes: Power and Performance]

  3. Design Tradeoffs: Power vs. Performance • (1) Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom • (2) Trade Performance for Power: Larger Range w/ Custom • [Figure: power vs. performance plot with moves 1, 2, and 3 marked; axes: Power and Performance]

  4. Design Tradeoffs: Power vs. Performance • (1) Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom • (2) Trade Performance for Power: Larger Range w/ Custom • (3) Move to Different Power vs. Performance Curve: More Architectural Choice with Custom • [Figure: power vs. performance plot with moves 1, 2, and 3 marked; axes: Power and Performance]

  5. Dynamic Power Dissipation • $P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$ • Reduce $V_{dd}$: static, dynamic, voltage islands, power gating • Reduce $\alpha$ and/or $f$: clock gating, block enables, bus encoding, glitch identification and elimination • Reduce $E_{circuit}$: engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
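
The dynamic power relation on this slide can be sketched in a few lines of Python. The operating point below (capacitance, activity factor, supply, clock) is hypothetical and chosen only to show how each knob scales the result; none of these values come from the talk.

```python
def dynamic_power(alpha, c_farads, vdd, f_hz):
    """Dynamic power in watts: P_dyn = alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * f_hz

# Hypothetical operating point (assumption, not from the slides):
# 1 nF total switched capacitance, activity 0.1, 1.8 V, 100 MHz.
baseline = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=1.8, f_hz=100e6)

# Halving Vdd cuts power by 4x because of the quadratic term.
half_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=0.9, f_hz=100e6)

# Halving the activity factor (e.g. via clock gating) cuts power by 2x.
half_activity = dynamic_power(alpha=0.05, c_farads=1e-9, vdd=1.8, f_hz=100e6)

print(f"baseline:      {baseline * 1e3:.1f} mW")
print(f"half Vdd:      {half_vdd * 1e3:.1f} mW")
print(f"half activity: {half_activity * 1e3:.1f} mW")
```

The quadratic dependence on supply voltage is why voltage reduction heads the list of techniques on the slide.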

  6. Static Power Dissipation • $P_{static} = V_{dd} (I_{sub} + I_{ox})$ • $I_{sub} = K_1 W e^{-V_t / (n V_\theta)} (1 - e^{-V_{gs} / V_\theta})$ • $I_{ox} = K_2 W (V_{gs} / t_{ox})^2 e^{-\alpha t_{ox} / V_{gs}}$ • With $K_1$, $K_2$, $n$, and $\alpha$ experimentally determined • Reduce $V_{dd}$: static, dynamic, voltage islands, power gating • Increase effective $V_t$: substituting high-threshold devices, transistor stacking, static and active body bias • Reduce effective $W$: reduce number and size of devices in design
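
A minimal sketch of the static power model on this slide, written as plain Python functions. The fitting constants are experimentally determined per the slide, so the values used here (unit `k1`, `n = 1.5`, a thermal voltage of 25.9 mV, and the two threshold voltages) are placeholders assumed purely for illustration.

```python
import math

def subthreshold_current(k1, w, vt, n, v_theta, vgs):
    """I_sub = K1 * W * exp(-Vt / (n * Vtheta)) * (1 - exp(-Vgs / Vtheta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))

def gate_oxide_current(k2, w, vgs, tox, alpha):
    """I_ox = K2 * W * (Vgs / tox)^2 * exp(-alpha * tox / Vgs)."""
    return k2 * w * (vgs / tox) ** 2 * math.exp(-alpha * tox / vgs)

def static_power(vdd, i_sub, i_ox):
    """P_static = Vdd * (I_sub + I_ox)."""
    return vdd * (i_sub + i_ox)

# Placeholder constants (assumptions): only the ratio below is meaningful.
low_vt = subthreshold_current(k1=1.0, w=1.0, vt=0.3, n=1.5, v_theta=0.0259, vgs=1.0)
high_vt = subthreshold_current(k1=1.0, w=1.0, vt=0.4, n=1.5, v_theta=0.0259, vgs=1.0)

# Raising Vt by 100 mV reduces subthreshold leakage exponentially.
leakage_ratio = low_vt / high_vt
print(f"leakage reduction from +100 mV Vt: {leakage_ratio:.1f}x")
```

Under these assumed constants the reduction is roughly an order of magnitude, which illustrates why substituting high-threshold devices appears among the listed techniques.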

  7. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • 0.18um CMOS 10kHz chip w/ 640K T’s

  8. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • Vdd (typ) = 3.3V, Vdd (min) = 1.1V • 0.18um CMOS 10kHz chip w/ 640K T’s • Vdd (max) = 1.8V, Vdd (min) = 0.18V

  9. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • Vdd (typ) = 3.3V, Vdd (min) = 1.1V • Power = 845mW • 0.18um CMOS 10kHz chip w/ 640K T’s • Vdd (max) = 1.8V, Vdd (min) = 0.18V • Power = 1.6mW

  10. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  11. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  12. Defining Ebit • $E_{bit} = C_{bit} V_{dd}^2$ • $C_{bit} = 4 \times 2\,\mathrm{fF/\mu m} \times W_{min}$ • Energy needed to write a 1-bit SRAM cell • Approximates minimum useful capacitance • The ratio of Ebit to the energy for a range of circuits remains largely constant with technology scaling
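
The Ebit definition above can be turned into a small normalization helper. This is a sketch under stated assumptions: the function names are invented for illustration, and the minimum width (0.25 um) and supply (1.8 V) in the example are hypothetical values, not figures from the talk.

```python
FF_PER_UM = 2e-15  # 2 fF per micron of gate width, from the slide

def ebit_joules(w_min_um, vdd):
    """E_bit = C_bit * Vdd^2, with C_bit = 4 * 2 fF/um * W_min."""
    c_bit = 4 * FF_PER_UM * w_min_um
    return c_bit * vdd ** 2

def energy_in_ebits(energy_joules, w_min_um, vdd):
    """Express a measured energy as a multiple of E_bit for that process."""
    return energy_joules / ebit_joules(w_min_um, vdd)

# Hypothetical process point (assumption): W_min = 0.25 um, Vdd = 1.8 V.
e_bit = ebit_joules(w_min_um=0.25, vdd=1.8)
print(f"E_bit = {e_bit * 1e15:.2f} fJ")

# A hypothetical 1 nJ operation expressed in Ebits for the same process.
print(f"1 nJ = {energy_in_ebits(1e-9, 0.25, 1.8):.0f} Ebits")
```

Dividing a design's energy by the Ebit of its own process and supply is what lets designs in different technologies be compared on one scale.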

  13. Technology Scaling for Ebit • 0.5 μm technology: 18 χ², 58 μm² • 0.18 μm technology: 18 χ², 5.7 μm² • χ is a normalized unit of distance equal to the M1 pitch

  14. Technology Scaling for NAND2 • χ is a normalized unit of distance equal to the M1 pitch • [Figure: NAND2 cell with inputs A and B and output YN; dimensions 4χ = 2.24 μm and 8χ = 4.48 μm]
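
The two dimensions quoted on this slide are enough to recover the normalized unit for that process. The sketch below assumes the 4-unit and 8-unit figures are the two sides of the NAND2 cell, which the transcript does not state explicitly.

```python
# Dimensions quoted on the slide, in microns.
FOUR_UNITS_UM = 2.24
EIGHT_UNITS_UM = 4.48

# One normalized unit (the M1 pitch) in microns.
chi_um = FOUR_UNITS_UM / 4

# Both quoted dimensions should agree on the same unit length.
consistent = abs(EIGHT_UNITS_UM / 8 - chi_um) < 1e-12

# Assumption: the cell is 4 units by 8 units, i.e. 32 square units.
nand2_area_units = 4 * 8
nand2_area_um2 = nand2_area_units * chi_um ** 2

print(f"M1 pitch: {chi_um:.2f} um, consistent: {consistent}")
print(f"NAND2 area: {nand2_area_units} square units = {nand2_area_um2:.2f} um^2")
```

Expressing area in squared pitch units rather than square microns is what allows the same cell to be compared across process generations.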

  15. Applying Ebit

  16. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  17. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  18. Effect of Architecture • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors • NVIDIA GeForceFX: Design Style: ASIC, 400MHz, 125M Transistors

  19. Effect of Architecture • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts • NVIDIA GeForceFX: Design Style: ASIC, 400MHz, 125M Transistors, ~20 Watts

  20. Effect of Architecture • ASIC Architecture: 6x Efficiency • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts: 5GFlops & 5 Gbs • NVIDIA GeForceFX: Design Style: ASIC, 400MHz, 125M Transistors, ~20 Watts: 10GFlops & 13 GBs

  21. Custom Circuits: 9x (7x) Efficiency • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts: 5GFlops & 5 Gbs, Vdd = 1.3V • NVIDIA GeForceFX: Design Style: Custom, 400MHz, 125M Transistors, ~3 Watts: 10GFlops & 13 GBs, Vdd = 0.65V

  22. Combined Architecture and Circuits: 40x+ Improvement, but 1.5 Years vs. 3+ Years • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts: 5GFlops & 5 Gbs, Vdd = 1.3V • NVIDIA GeForceFX: Design Style: Custom, 400MHz, 125M Transistors, ~3 Watts: 10GFlops & 13 GBs, Vdd = 0.65V
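
The efficiency multipliers on slides 20 to 22 can be checked with simple GFlops-per-watt arithmetic. This sketch assumes the 2600MHz, ~60 W, 5 GFlops figures belong to the Pentium-4 and the 400MHz, 10 GFlops figures to the GeForceFX (at ~20 W as built and ~3 W in the custom-circuit case); the transcript's flattened columns leave that pairing implicit.

```python
def gflops_per_watt(gflops, watts):
    """Compute throughput per unit power."""
    return gflops / watts

# Figures quoted on the slides (pairing with chips is an assumption).
pentium4 = gflops_per_watt(gflops=5.0, watts=60.0)
geforce_asic = gflops_per_watt(gflops=10.0, watts=20.0)
geforce_custom = gflops_per_watt(gflops=10.0, watts=3.0)

architecture_gain = geforce_asic / pentium4      # slide 20
circuit_gain = geforce_custom / geforce_asic     # slide 21
combined_gain = geforce_custom / pentium4        # slide 22

print(f"architecture: {architecture_gain:.1f}x")
print(f"circuits:     {circuit_gain:.1f}x")
print(f"combined:     {combined_gain:.1f}x")
```

On these numbers the architecture gain is 6x and the combined gain is 40x, matching the slide titles. The circuit gain comes out near 6.7x, which agrees with the parenthesized 7x; the basis for the 9x figure is not recoverable from the transcript.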

  23. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  24. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  25. ASIC vs. Custom • ASIC Methods: provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity • Custom Methods: offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity • Exploits Design Structure • Exploits Circuit Techniques

  26. Custom Methods Emphasize Fine-Grain Manual Control + Custom Library

  27. Custom Methods Emphasize Fine-Grain Manual Control + Custom Library • Operation and Performance Characterized for the Specific Case

  28. ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library

  29. ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library • Operation and Performance Characterized for the Typical/Generic Case

  30. ASIC Focus on 100K+ Gates: Lost Opportunities to Exploit Structure • Designs reuse similar basic building blocks • Building blocks: 1-10K gates, not 100K+ gates • 64-bit adder: 1K gates • 64x64 RF: 2K gates • 64x64 multiplier: 20K gates • Opportunities to exploit these structures are lost when the design is viewed in large chunks

  31. Different Architectures, Similar Building Blocks • 1998 “MAP” 64b Microprocessor, 5M T’s (MIT/Stanford): XCVRS, Bus, EX, RF, SRAM • 2002 “Imagine” 32b Stream Processor, 22M T’s (Stanford): XCVRS, Bus, EX, RF, SRAM • [Figure: die floorplans; visible block labels include Bank 0, Bank 1, LTLB, EMI, MEMORY SWITCH, NIF/ROUTER, CLUSTER SWITCH, CLST 0-2]

  32. Significant Structure Exists Within 100K-gates • 1998 “MAP” 64b Microprocessor, 5M T’s (MIT/Stanford): XCVRS, Bus, EX, RF, SRAM • 2002 “Imagine” 32b Stream Processor, 22M T’s (Stanford): XCVRS, Bus, EX, RF, SRAM • [Figure: die floorplans; visible block labels include Bank 0, Bank 1, LTLB, EMI, MEMORY SWITCH, NIF/ROUTER, CLUSTER SWITCH, CLST 0-2]

  33. Energy of 100K-gate Equivalent • ASIC (N2) = 1400K Ebits (typ) • Custom Logic = 424K Ebits* • SRAM (small) = 1085K Ebits • SRAM (med) = 155K Ebits • SRAM (large) = 50K Ebits *Based on data extracted from Intel McKinley

  34. Exploiting Circuit Techniques • Custom circuits more efficient • Reduced parasitics • 1.7x circuit techniques and flops • 1.4x libraries • 1.4x due to engineering interconnects • Subthreshold Circuits • Low Performance but ultra-low power • Requires Architecture, Gates, Memories, CAD Tools
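
One way to read the three factors on this slide is as independent gains that multiply. That reading is an assumption, since the transcript does not say how they combine, but the product can be compared against the ASIC-to-custom-logic energy ratio quoted on slide 33.

```python
# Per-technique factors quoted on slide 34.
circuit_techniques_and_flops = 1.7
libraries = 1.4
engineered_interconnect = 1.4

# Assumption: the factors are independent and therefore multiply.
combined_factor = circuit_techniques_and_flops * libraries * engineered_interconnect

# Energy of a 100K-gate equivalent, in thousands of Ebits (slide 33).
asic_kebits = 1400.0
custom_logic_kebits = 424.0
measured_ratio = asic_kebits / custom_logic_kebits

print(f"product of factors: {combined_factor:.2f}x")
print(f"ASIC / custom ratio: {measured_ratio:.2f}x")
```

Both values come out close to 3.3x, so the multiplicative reading is at least consistent with the 100K-gate energy figures.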

  35. Relating Power to Performance: CV/I, $I_{dsat}$, $t_{FO4}$ • $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$ • $t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)

  36. Relating Power to Performance: Relating $V_{dd}$ and $V_t$ to $t_{FO4}$ • $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$ • $t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)

  37. Relating Power to Performance: Correlation to Reported Foundry Data • $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$ • $t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)
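
The delay model on these slides can be sketched as two functions, which makes the cost of lowering the supply visible. The device parameters below (`k3`, `leff`, `tox`, `ceff`, and a 0.4 V threshold) are hypothetical placeholders, and the sketch assumes the gate is driven at Vgs = Vdd.

```python
def idsat(k3, leff, tox, vgs, vt):
    """I_dsat = K3 * Leff^-0.5 * tox^-0.8 * (Vgs - Vt)^1.25, valid for Vgs > Vt."""
    if vgs <= vt:
        raise ValueError("model only applies above threshold (Vgs > Vt)")
    return k3 * leff ** -0.5 * tox ** -0.8 * (vgs - vt) ** 1.25

def t_fo4(ceff, vdd, i_dsat, k4=13.5):
    """t_FO4 = K4 * Ceff * Vdd / I_dsat."""
    return k4 * ceff * vdd / i_dsat

def fo4_at(vdd, vt=0.4, k3=1.0, leff=1.0, tox=1.0, ceff=1.0):
    """FO4 delay in arbitrary units, assuming Vgs = Vdd (placeholder constants)."""
    return t_fo4(ceff, vdd, idsat(k3, leff, tox, vgs=vdd, vt=vt))

# Relative slowdown when the supply is halved from 1.8 V to 0.9 V.
slowdown = fo4_at(0.9) / fo4_at(1.8)
print(f"FO4 delay grows by {slowdown:.2f}x when Vdd drops from 1.8 V to 0.9 V")
```

Because the placeholder constants cancel in the ratio, only the assumed threshold voltage affects the slowdown; with 0.4 V it comes out at roughly 1.8x, against the 4x dynamic power saving from the same supply reduction.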

  38. Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

  39. Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

  40. Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

  41. Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory) • 130nm uP assumes 80% Dynamic and 20% Static • 90nm uP assumes 50% Dynamic and 50% Static

  42. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  43. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  44. 16b 1024 point FFT • Generally, k N log N operations (complex multiplies) with pre-computation • Radix-2, Radix-4, etc. implementations • Decimation in time and/or decimation in frequency
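
To make the k N log N count concrete, the sketch below tallies twiddle multiplications for textbook radix-2 and radix-4 decompositions of a power-of-two FFT. These are the standard upper-bound counts, which include trivial multiplies by 1 and -j; they are not claimed to be the counts used by any of the implementations in the talk.

```python
def _log2_exact(n):
    """Return log2(n) for a power of two, otherwise raise."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1

def radix2_complex_multiplies(n):
    """log2(n) stages of n/2 butterflies, one twiddle multiply each."""
    return (n // 2) * _log2_exact(n)

def radix4_complex_multiplies(n):
    """log4(n) stages of n/4 butterflies, three twiddle multiplies each."""
    stages2 = _log2_exact(n)
    if stages2 % 2:
        raise ValueError("n must be a power of four")
    return 3 * (n // 4) * (stages2 // 2)

n = 1024
print(f"radix-2: {radix2_complex_multiplies(n)} complex multiplies")
print(f"radix-4: {radix4_complex_multiplies(n)} complex multiplies")
```

For N = 1024 this gives 5120 multiplies for radix-2 and 3840 for radix-4, which shows how the choice of radix changes the constant k.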

  45. Range of Implementations • MIT FFT (2005): 0.18um CMOS, 628K T’s, 10kHz; architecture and subthreshold circuits, 180mV operation • Spiffee (1999): 0.7um CMOS, 460K T’s, 173MHz; cached FFT architecture and algorithm, 1.1V operation • SA-1100 (1999): 0.35um CMOS, 2.6M T’s, 74MHz; commercial embedded processor, custom circuits, 1.5V operation • Imagine (2003): 0.15um CMOS, 22M T’s, 232MHz; streaming media processor, tiled standard cells, 1.2V operation • Stratix IS25F627C8 (2005): 0.13um CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz; commercial FPGA co-processor • Intel P4 (2003): 0.13um CMOS, 3GHz, SSE; commercial general purpose processor, custom circuits, 1.5V operation • TI ‘C6416 (2003): 0.13um CMOS, 720MHz; commercial digital signal processor

  46. Ebit Energy 16b 1024 point FFT

  47. Ebit Energy 16b 1024 point FFT

  48. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • Vdd (typ) = 3.3V, Vdd (min) = 1.1V • Power = 845mW • 0.18um CMOS 10kHz chip w/ 640K T’s • Vdd (max) = 1.8V, Vdd (min) = 0.18V • Power = 1.6mW

  49. Which Design Is More Efficient? Depends on the Metric! • 0.7um CMOS 173MHz chip w/ 460K T’s • Vdd (typ) = 3.3V, Vdd (min) = 1.1V • Power = 845mW • EDP 143x better • 0.18um CMOS 10kHz chip w/ 640K T’s • Vdd (max) = 1.8V, Vdd (min) = 0.18V • Power = 1.6mW • Absolute energy 6x better
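
The two metrics on this slide, absolute energy and energy-delay product (EDP), can rank the same pair of designs in opposite orders. The sketch below uses two entirely hypothetical designs to show the effect; the transcript does not give the per-FFT run times needed to reproduce the 143x and 6x figures.

```python
def energy_joules(power_watts, time_seconds):
    """Energy per task: E = P * t."""
    return power_watts * time_seconds

def energy_delay_product(power_watts, time_seconds):
    """EDP = E * t = P * t^2, which penalizes slow designs more heavily."""
    return energy_joules(power_watts, time_seconds) * time_seconds

# Hypothetical designs (assumptions, not the chips on the slide):
# "fast" burns 1 W for 1 ms per task, "slow" burns 10 mW for 50 ms.
fast_energy = energy_joules(1.0, 1e-3)
slow_energy = energy_joules(0.01, 50e-3)
fast_edp = energy_delay_product(1.0, 1e-3)
slow_edp = energy_delay_product(0.01, 50e-3)

print(f"energy: slow design is {fast_energy / slow_energy:.1f}x better")
print(f"EDP:    fast design is {slow_edp / fast_edp:.1f}x better")
```

In this made-up case the slow design uses half the energy per task while the fast design has a 25x better EDP, the same kind of split the slide reports for the two FFT chips.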
