Digital Integrated Circuits A Design Perspective

Digital Integrated Circuits A Design Perspective. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Modified and integrated by Davide Bertozzi. Design Methodologies: Standard cell design. Impact of Implementation Choices. Three orders of magnitude Higher efficiency.


Presentation Transcript


  1. Digital Integrated Circuits: A Design Perspective. Jan M. Rabaey, Anantha Chandrakasan, Borivoje Nikolic. Modified and integrated by Davide Bertozzi. Design Methodologies: Standard cell design

  2. Impact of Implementation Choices. Energy efficiency (in MOPS/mW, 0.25um CMOS process) versus flexibility (or programmability), listed from no flexibility to fully flexible: • Hardwired components: 100-1000 • Configurable/Parameterizable: 10-100 • Domain-specific processor (e.g. DSP): 1-10 • Embedded microprocessor: 0.1-1. Hardwired components reach three orders of magnitude higher efficiency than an embedded microprocessor. Providing programmability adds overhead to the implementation, but it brings: • Late binding • Re-use across multiple applications • Software upgrade. Flexibility comes at a cost in terms of power and performance.

  3. Mapping Computation to... General-Purpose Computing (host processor): • Coarse-grain parallelism • Strong memory consistency models • OK with few large independent threads. Throughput Computing (GPGPUs): • Leading core count • Single-Instruction Multiple-Data (SIMD) • OK with thousands of small threads (better if almost identical) to expose massive HW multithreading. HW accelerators: • Highest GOPS/W • No flexibility • (lower yield). Cluster-based, many-core programmable accelerators: • Parallel threads heavily dependent on local data content • Many, truly independent parallel computations • Branch divergence between threads

  4. Heterogeneous Parallel Computing • There is not only one perfect mapping solution! • Architectural heterogeneity and many-cores are THE design paradigm for embedded SoCs (e.g., HSA initiative) • This is today the way to pursue high performance with energy efficiency, that is, high performance-per-watt. Diagram blocks: host (multi-core heterog.) processor, general-purpose programmable accelerator, hardware accelerators, graphics, FPGA-like fabric, intermediate fabric, DMA engine, high-speed I/O, DRAM memory controller, cache-coherent interconnect, top-level NoC, two 2nd-level NoCs

  5. Implementation Choices: Digital Circuit Implementation Approaches • Custom • Semicustom - Cell-based: Standard Cells, Compiled Cells, Macro Cells - Array-based: Pre-diffused (Gate Arrays), Pre-wired (FPGAs). Design for high performance/high density: handcrafted full custom design. Design for fast time-to-market: design automation techniques.

  6. The Custom Approach – early days. Intel 4004 microprocessor (108 kHz, 2300 transistors, 10um; courtesy Intel). When performance or design density is critical, handcrafting circuit topology and physical design seems to be the only option, despite • high cost • long time-to-market. It is OK when • custom blocks can be reused • cost can be amortized over a large volume (e.g., uP, memories) • cost is not the primary design criterion (e.g., supercomputers)

  7. Transition to Automation and Regular Structures. Die photos (courtesy Intel): Intel 4004 ('71), Intel 8080, Intel 8085, Intel 80286, Intel 80486. Evolution of full custom design: • Replication of the same custom-designed block multiple times (e.g., memories) • Composition of different custom-designed blocks with a regular composition pattern. In both cases, regularity enables deployment of automation.

  8. Pentium 4 processor • Almost all parts were designed automatically – composing custom blocks together in a regular way (semicustom design with a library of cells) • Performance critical modules (PLL, clock buffers) were still designed manually • Basic automation levels even for full-custom design: - layout editors, DRC.

  9. Full-custom design • Complete control over transistor and interconnect dimensions (within design rule constraints) • Design rules: - Minimum spacing between metal lines (varies per layer) - Line width - Transistor channel length • Circuit designers create application-specific building blocks • The technology provider (foundry) provides SPICE/HSPICE transistor models and parasitic extraction tools • Models are used to drive transistor sizing/layout constraints • Continual verification of the design as it becomes more defined • PRO: produces an optimized design (density, power, performance) • CON: time-consuming, error-prone, highest NRE

  10. Layout editors. Example: Magic Layout Editor (UC Berkeley). Stick diagram of inverter: • Dimensionless layout entities • Only topology is important • Final layout generated by “compaction” program

  11. Cell-based semicustom design (CBD). Predefined and custom-designed cells are instantiated multiple times and interconnected to yield a given logic function (metaphor: a mosaic, in which cells are composed into a logic function). ADVANTAGES • cuts down on design time and costs • reduces implementation effort by REUSING a library of cells for different designs • cells need to be designed and verified once for a given technology node • cell reuse amortizes the cost of their full-custom design. DRAWBACKS • Reduced integration density and performance • No design fine-tuning (i.e., transistor-level) allowed. CBD approaches are categorized based on the granularity of library elements.

  12. Standard cells. Standardizes the design entry level at the logic gate. Based on a library of standard pre-designed, pre-verified cells: • basic logic functions (NOR, NOT, NAND, XOR,..) • complex functions (basic MUX, decoders, adder, comparator,..) • storage elements (DFF, SR latches, ...) • special cells (e.g., brute-force synchronizers; tie-high; tie-low) • logic cell variants to cover a wide range of fan-in/fan-out conditions (e.g., 4:1 mux vs 2:1 mux; different transistor sizing) • specialization for other parameters: supply voltage, threshold voltage, corner cases. Foundries or even fabless companies (in partnership with foundries) provide libraries of standard cells for semicustom design with tens or hundreds of cells.

  13. Standard cell layout methodology. Strong restrictions on the layout allow high levels of automation (e.g. automatic layout generation). • Cells are arranged in rows of standard cells (all cells must have the same height) • Routing channel requirements are reduced by the presence of more interconnect layers • Standard cells can be intermixed with other layout design approaches for those modules which do not adapt to the logic cell paradigm (e.g., highly regular structures, or modules with more stringent performance requirements)

  14. Standard cell layout methodology • A standard cell library is complemented by an I/O cell library • I/O circuits are analog in nature, and analog delays are not easy to predict/model • IC designers are faced with interfacing to a growing diversity of standards and parts • Memory, I/O, graphics, networking • Standards: DDR, SDRAM, PCI, USB,.. • Different signaling methods: LVDS, CML,… • Circuits for latchup, ESD, isolation,.. • Ground and power pins (many). Figure labels: I/O pad ring, I/O cell, routing, cell

  15. I/O library and packaging options • Low-cost packaging with low pin count: peripheral pads for bond wires allow the I/O cell circuitry to be placed in alignment with the pads, leading to simple logical, electrical and physical structures • High-cost packaging with high pin count: creating a grid array of pads for flip chip mounting allows for easier alignment in the packaging, but may cause the routing to and from the I/O circuitry to become very complex

  16. Standard Cell – Early Example • Large area overhead for the interconnects • Feedthrough cells • Large routing channels • Adding more metal layers reduces the requirements on routing channels [Brodersen92]

  17. Standard Cell – The New Generation. Design in a 7-metal-layer technology. Cell structure hidden under interconnect layers • Density: 90% • Small area overhead for interconnects

  18. Standard cells. Designing a standard cell library is time-consuming, although the cost is amortized over a large number of designs • Today it is common practice to have several cell versions - number of inputs - transistor sizing for different capacitive loads (driving strength) - pullup/pulldown ratios - technology: Vth, Vdd, technology corner cases • Non-trivial choice of the mix of logic cells - small library with most cells having limited fan-ins? - large library with many versions of the same cell? - conservative large driving capabilities lead to power/area overhead • Technology libraries are broadly differentiated based on the target design goal (low-power vs. high-performance). Synthesis tools choose the appropriate cell version in the library based on speed/area/power constraints.

  19. Standard cell structure • PMOS transistors close to the VDD rail • NMOS transistors close to the ground rail • Intra-cell wiring for signals • Cell mirroring enables sharing of power and ground rails. Figure: two arrangements of cell rows, one with a routing channel between rows and one with no routing channels, where mirrored cells share the VDD and GND rails. Figure labels: VDD, GND, signals, M2, M3, mirrored cell

  20. Inverter standard cell layout. Figure labels: power rail, p-MOS diffusions, N-well, n-MOS diffusions, ground rail

  21. Design rules • The feature size f is the minimum spacing between drain and source (min. poly width) • Design rules are expressed in terms of λ = f/2 • A wiring track is the space required for a wire • E.g., 4 λ width + 4 λ spacing from neighbor = 8 λ pitch • The rule applies to transistors as well
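The λ-based arithmetic on this slide can be sketched in a few lines of Python. The 0.25 um feature size is only an illustrative assumption, and the function names are hypothetical.

```python
def lambda_from_feature_size(f_um: float) -> float:
    """Lambda is defined on the slide as half the feature size f."""
    return f_um / 2.0


def wire_pitch_lambda(width_lambda: float, spacing_lambda: float) -> float:
    """Pitch of a wiring track = wire width + spacing to the neighbor."""
    return width_lambda + spacing_lambda


# Assumed feature size for illustration only.
feature_size_um = 0.25
lam_um = lambda_from_feature_size(feature_size_um)

# Slide example: 4 lambda width + 4 lambda spacing = 8 lambda pitch.
pitch_lambda = wire_pitch_lambda(4, 4)
pitch_um = pitch_lambda * lam_um

print(f"lambda = {lam_um} um, pitch = {pitch_lambda} lambda = {pitch_um} um")
```

Expressing rules in λ keeps them independent of the absolute feature size: only `feature_size_um` changes when the same rule set is applied to a different process.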

  22. Cell height. Example: cell height of 12 metal tracks (a metal track is the M1 pitch); rails ~10. • Tall cells (11 or 12 metal tracks) support more complex routing and larger driving strength transistors and are typically tuned for performance, but may exhibit higher leakage power. • Short cells (7 or 8 metal tracks) are optimized for area efficiency, but are generally designed with smaller, lower driving strength transistors, so they are less appropriate for high-speed designs. • Standard height cells (9 or 10 tracks) are an intermediate trade-off. Figure labels: VDD, N well, In, Out, GND, cell boundary

  23. Standard Cell – Example in 0.18 um. 3-input NAND cell (from ST Microelectronics), with input signals wired through PolySi. Datasheet parameters: C = load capacitance, T = input rise/fall time. • 5 cell versions - C from 0.18 to 0.72 pF - area from 16.4 to 32.8 um^2 • Not just performance, but also energy given in datasheet • Low power library; high Vth transistors against leakage. Library cell documentation is critical, although time-intensive. Figure labels: power rail, ground rail

  24. Routing tracks • Tracks form a grid for routing. • Spacing between tracks is the center-to-center distance between wires. • Track spacing depends on the wire layer used. • Different layers are (generally) used for horizontal and vertical wires. • Horizontal and vertical wires can be routed relatively independently. • Cell pins are placed at intersections between vertical and horizontal tracks. • Pin placement dictates the complexity of the routing problem. Figure labels: horizontal track, vertical track, wire, two rows of cells separated by a routing channel

  25. Left-edge algorithm • Basic channel routing algorithm. • Assumes one horizontal segment per net. • Sweep pins from left to right: • assign horizontal segment to lowest available track.
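A minimal Python sketch of the left-edge idea follows. It assumes each net is already reduced to one horizontal interval `(left, right)` of column indices. The net names and coordinates are hypothetical, and vertical constraints between aligned pins are deliberately ignored.

```python
def left_edge(intervals):
    """Assign each net's single horizontal segment to a track.

    intervals: dict mapping net name -> (left_column, right_column).
    Returns a dict mapping net name -> track index (0 is the lowest track).
    """
    # Sweep nets in order of their left edge (name breaks ties deterministically).
    order = sorted(intervals, key=lambda net: (intervals[net][0], net))
    track_right_end = []  # rightmost occupied column of each track so far
    assignment = {}
    for net in order:
        left, right = intervals[net]
        for track, right_end in enumerate(track_right_end):
            if right_end < left:  # lowest track that is free at this column
                track_right_end[track] = right
                assignment[net] = track
                break
        else:
            # No existing track is free: open a new one above the others.
            track_right_end.append(right)
            assignment[net] = len(track_right_end) - 1
    return assignment


# Hypothetical channel: A spans columns 0-3, B spans 1-5, C spans 4-6.
tracks = left_edge({"A": (0, 3), "B": (1, 5), "C": (4, 6)})
print(tracks)
```

In this example net C reuses the track of net A because A ends before C starts, so two tracks suffice for three nets.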

  26. Example A B B C A B C

  27. Limitations of left-edge algorithm • Some combinations of nets require more than one horizontal segment per net. Figure: aligned pins A B and B A, marked with a question mark

  28. Vertical constraints • Aligned pins form vertical constraints. • Wire to lower pin must be on lower track; wire to upper pin must be above lower pin’s wire. A B B A
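The vertical constraints can be sketched as a directed graph in Python, under the assumption that the channel is described by two lists of pin names per column (`top` and `bottom`, both hypothetical names). An edge `(t, b)` means the wire of net `t` must lie above the wire of net `b`. A cycle in this graph means no single-segment assignment exists, which is the case a dogleg addresses.

```python
def vertical_constraints(top, bottom):
    """Return the set of (upper_net, lower_net) constraints from aligned pins."""
    edges = set()
    for upper, lower in zip(top, bottom):
        if upper is not None and lower is not None and upper != lower:
            edges.add((upper, lower))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the vertical constraint graph."""
    graph = {}
    for upper, lower in edges:
        graph.setdefault(upper, set()).add(lower)
        graph.setdefault(lower, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {net: WHITE for net in graph}

    def visit(net):
        color[net] = GRAY
        for nxt in graph[net]:
            if color[nxt] == GRAY:  # back edge: constraints contradict each other
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[net] = BLACK
        return False

    return any(color[net] == WHITE and visit(net) for net in graph)


# Assumed reading of the slide figure: top pins A B, bottom pins B A.
cyclic = vertical_constraints(["A", "B"], ["B", "A"])
print(cyclic, has_cycle(cyclic))
```

With top pins A B over bottom pins B A, net A must be above B in the first column and below B in the second, so the graph is cyclic.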

  29. Dogleg wire • A dogleg wire has more than one horizontal segment. A B B A • But requires an additional metal layer!

  30. Technology library – Architectural effects. Power-Length/Speed trade-off for NoC link design (65nm LP-LVT NoC links synthesized in isolation) • Short and/or slow-clocked links don’t pose any problem • Long and/or high-speed links force routing tools to infer a large number of buffering gates, increasing power. From: A. Pullini, F. Angiolini, S. Murali, D. Atienza, G. De Micheli, and L. Benini, "Bringing NoCs to 65nm," IEEE Micro, vol. 27, no. 5, pp. 75-85, September/October 2007.

  31. One Technology Library? A single technology library no longer exists for standard cell design • An aggressively low-power library (LP-HVT) infers buffers with lower size and speed, resulting in much tighter constraints on operating frequency or length (i.e., link feasibility) • The spread increases as technology scales down • We need to pick the right library for specific design constraints. Figure panels: 65nm LP-LVT, 65nm LP-HVT

  32. Mixed-Library Design. During logic synthesis, it is possible to link different technology libraries at the same time to span the performance-power trade-off for the design at hand. Case study: • 4x4 2D mesh NoC • 65nm library variants: - Low-Vth (fast) - High-Vth (low-power) - Mixed-Vth (multiple Vths), either aiming for max performance or aiming for a power-perf. trade-off • Clock gating always enabled except for Low-Vth, to avoid performance penalties. Handle with care: when you link more libraries, you increase mask complexity and fabrication cost, since the manufacturing steps for the transistors are different.

  33. Mixed-Vth • Gates on the critical path should come from the fastest library (Low-Vth) • Gates on non-critical paths should come from the low-power library (High-Vth). Figure: example paths with delays of 10ns and 5ns
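The assignment rule can be illustrated with a deliberately simplified Python sketch. It assumes path delays are fixed and known up front, whereas a real synthesis tool re-times the design after every cell swap. The gate names, the 10 ns clock period, and the function name are hypothetical.

```python
def assign_libraries(paths, period_ns):
    """Pick a library per gate from the worst slack of any path through it.

    paths: list of (path_delay_ns, [gate names on the path]).
    Gates with no positive slack are treated as critical and get Low-Vth
    cells; all other gates get High-Vth cells.
    """
    worst_slack = {}
    for delay_ns, gates in paths:
        slack = period_ns - delay_ns
        for gate in gates:
            worst_slack[gate] = min(slack, worst_slack.get(gate, slack))
    return {gate: ("LVT" if slack <= 0 else "HVT")
            for gate, slack in worst_slack.items()}


# Two paths as in the slide figure (10 ns and 5 ns), sharing gate g2.
choice = assign_libraries([(10.0, ["g1", "g2"]), (5.0, ["g2", "g3"])],
                          period_ns=10.0)
print(choice)
```

Gate g2 sits on both paths and therefore inherits the zero slack of the 10 ns path, while g3 only sees the 5 ns path and can use the low-power library.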

  34. Power-Performance Trade-Off. There is almost an order of magnitude difference in the power/performance ratios achievable by LVth and HVth libraries.
      Library variant  | HVth    | MVthA   | MVthB   | LVth
      Frequency target | Max     | 300 MHz | Max     | Max
      Clock gating     | Enabled | Enabled | Enabled | Disabled
      Frequency (MHz)  | 142     | 300     | 714     | 952
      Bandwidth (GB/s) | 27      | 57      | 137     | 183
      Power (mW)       | 11      | 25      | 88      | 145

  35. Power-Performance Trade-off. Mixed Vth is attractive: • Approaches LVth performance at a lower power • Approaches HVth performance at almost the same power efficiency (GB/s over mW)
      Library variant  | HVth    | MVthA   | MVthB   | LVth
      Frequency target | Max     | 300 MHz | Max     | Max
      Clock gating     | Enabled | Enabled | Enabled | Disabled
      Frequency (MHz)  | 142     | 300     | 714     | 952
      Bandwidth (GB/s) | 27      | 57      | 137     | 183
      Power (mW)       | 11      | 25      | 88      | 145
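The power efficiency figure of merit named on this slide (GB/s over mW) can be computed directly from the table. The short Python sketch below uses only the table values. The dictionary layout and variable names are illustrative.

```python
# (frequency in MHz, bandwidth in GB/s, power in mW), copied from the table.
variants = {
    "HVth":  (142, 27, 11),
    "MVthA": (300, 57, 25),
    "MVthB": (714, 137, 88),
    "LVth":  (952, 183, 145),
}

# Power efficiency as defined on the slide: bandwidth divided by power.
efficiency = {name: bw / power for name, (_freq, bw, power) in variants.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} GB/s per mW")
```

The ratios come out at roughly 2.45 (HVth), 2.28 (MVthA), 1.56 (MVthB) and 1.26 (LVth) GB/s per mW, so MVthA roughly doubles the HVth bandwidth while staying close to its efficiency.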

  36. Semicustom Design Flow: Design Capture -> Logic Synthesis -> Pre-Layout Simulation -> Floorplanning -> Placement -> Routing -> Circuit Extraction -> Post-Layout Simulation -> GDSII file (tape-out to silicon foundry for mask generation). Design iteration loops back from the simulation steps. Abstraction levels along the flow: behavioral (HDL), structural (RTL), physical

  37. Semicustom design flow • Design capture: schematics, block diagrams, HDLs, imported IPs • Logic synthesis: from HDL into a gate-level netlist, combined with the netlist of reused or generated macros • PreLayout Simulation: (grossly) estimated parasitics and layout parameters; performance analysis • Floorplanning: chip outlay creation based on estimated module sizes, early design of clock and power distribution networks • Placement: precise positioning of cells within blocks

  38. Semicustom design flow • Routing: interconnects between cells and blocks • Extraction: chip model from actual physical layout and parasitics • PostLayout Simulation: check functionality and correctness of the circuit in the presence of layout parasitics; performance AND power analysis • Tape out: binary file generation in GDSII format, containing the information needed for mask generation, sent to the silicon foundry

  39. Integrating Logic synthesis with Physical Design (Physical Synthesis). Inputs: RTL, (timing) constraints, macromodules, fixed netlists. Step 1: logic synthesis with first-order place-and-route, producing a netlist with place-and-route info. Step 2: place-and-route optimization, producing an accurate place-and-route that meets timing constraints. • Exponential increase of design tool complexity and run-time

  40. Design synthesis

  41. Logic synthesis

  42. Design Environment • The process parameters • Technology library • Operating conditions (PVT) • I/O port attributes • Driving strength of input ports • Capacitive Loading of output ports • Design rule constraints • max_transition, max_fanout, max_capacitance • Statistical wire-load model • wirelength=f(fanout) • Resistance/Capacitance/Area-per-unit-length given • pre-layout static timing analysis
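A statistical wire-load model of the kind listed above (wirelength = f(fanout), with capacitance and resistance given per unit length) can be sketched in Python as follows. The table entries, slope and per-unit values are made-up illustrative numbers, not data from any real technology library, and the function names are hypothetical.

```python
def estimated_wire_length(fanout, fanout_length, slope):
    """Look up the estimated net length for a given fanout.

    fanout_length: dict fanout -> length, assumed to list every fanout from 1
    up to its largest entry. Beyond the table, the length is extrapolated
    linearly with the given slope (length units per additional fanout).
    """
    max_fanout = max(fanout_length)
    if fanout <= max_fanout:
        return fanout_length[fanout]
    return fanout_length[max_fanout] + slope * (fanout - max_fanout)


def estimated_wire_rc(fanout, fanout_length, slope, r_per_unit, c_per_unit):
    """Return (resistance, capacitance) of the net from its estimated length."""
    length = estimated_wire_length(fanout, fanout_length, slope)
    return length * r_per_unit, length * c_per_unit


# Hypothetical model: lengths for fanout 1..3, then 1.5 units per extra fanout.
table = {1: 1.0, 2: 2.5, 3: 4.0}
length_f5 = estimated_wire_length(5, table, slope=1.5)
r_f5, c_f5 = estimated_wire_rc(5, table, slope=1.5, r_per_unit=0.25, c_per_unit=0.5)
print(length_f5, r_f5, c_f5)
```

Because no layout exists before placement, pre-layout static timing analysis has to rely on such fanout-based estimates rather than on extracted parasitics.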

  43. Input and output delay constraints These parameters may have a tremendous impact on driving strength of boundary cells and power consumption of the design as a whole

  44. Design constraints • Clock signal specification • Period • Duty cycle • Transition time • Skew • Delay specifications • Maximum delays • Minimum delays • Timing exceptions • Multicycle paths • False paths • Path grouping • E.g., for multi-clock designs. To search for the maximum speed of the design, an unachievably short clock period (e.g., 0.1ns) can be given as a constraint; the minimum feasible period can then be derived from the amount of timing violation.
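Deriving the minimum period from the amount of violation is simple arithmetic, sketched here in Python. The worst-slack value of -1.9 ns is a hypothetical tool report, not a figure from the slides, and the function name is illustrative.

```python
def min_period_ns(target_period_ns, worst_slack_ns):
    """Minimum feasible period = constrained period minus the worst slack.

    A violated constraint shows up as negative slack, so subtracting it
    lengthens the period by exactly the amount of the violation.
    """
    return target_period_ns - worst_slack_ns


# Over-constrain the clock to 0.1 ns, as suggested on the slide, and assume
# the timing report shows a worst slack of -1.9 ns (hypothetical number).
period = min_period_ns(0.1, -1.9)
max_frequency_mhz = 1000.0 / period

print(f"min period = {period:.2f} ns, max frequency = {max_frequency_mhz:.0f} MHz")
```

The same formula holds when the constraint is met: a positive slack simply shortens the achievable period.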

  45. Design constraints • Clock signal specification • Period • Duty cycle • Transition time • Skew • Delay specifications • Maximum delays • Minimum delays • Timing exceptions • Multicycle paths • False paths • Path grouping • E.g., for multi-clock designs • Enforce absolute constraints • Extract timing of paths • Enforce minimum delay requirements on bundling paths • Are bundling constraints fulfilled?

  46. Design constraints • Clock signal specification • Period • Duty cycle • Transition time • Skew • Delay specifications • Maximum delays • Minimum delays • Timing exceptions • Multicycle paths • False paths • Path grouping • E.g., for multi-clock designs set_multicycle_path -from U1 -to U5

  47. Performance-Area/Power trade-off during logic synthesis • LET US COMPARE SEVERAL ADDER IMPLEMENTATIONS • WHILE RELAXING TARGET CLOCK SPEED FOR SYNTHESIS • As the target clock period increases, new adder architectures come progressively into play (see lower side of bars in the plots). • As the period is further increased, adders’ slack is exploited for power optimizations (RTL netlist transformations, insertion of HVT cells), therefore adders do not show slacks for a certain time window • After a certain period, RTL netlists of adders cannot be power-optimized any more, and they start having slacks (upper side of the bars in the plots)

  48. Area-Power for 32 bit adders • Let us sweep a range of target clock periods • Maximum data introduction rate • Synthesis tool optimizes adder slack for power. • The “new entry” adder for a given target period is never the most power efficient • Higher area always means higher power

  49. Floorplanning Typical issues the floorplanning tool copes with: • does the design fit the chip budgeted area? • estimates area of major units and defines their relative placement based on some objective function • estimates wire lengths and wiring congestion, although more advanced cost functions can be considered: Having high communication traffic (thick lines) spread over short (up) or long (bottom) links is likely to heavily affect the power required for data transmission. Best IR drop solutions spread out the hot spot across a large part of the floorplan, instead of concentrating it in a specific region.

  50. Placement • Placement: assign cells to positions on the chip, such that no two cells overlap with each other (legalization), and some cost function (e.g., projected wirelength) is optimized • Considers: wirelength, routability/channel density, power, timing,....
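One common way to estimate projected wirelength during placement is the half-perimeter wirelength (HPWL) of each net's bounding box. The slide does not name a specific estimate, so HPWL is an assumption here, as are the cell names and coordinates in this Python sketch.

```python
def hpwl(pins):
    """Half-perimeter of the bounding box around a net's pin positions."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(placement, nets):
    """Sum of HPWL over all nets.

    placement: dict cell name -> (x, y) position.
    nets: list of nets, each a list of the cell names it connects.
    """
    return sum(hpwl([placement[cell] for cell in net]) for net in nets)


# Hypothetical placement of three cells and three nets.
placement = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = [["a", "b"], ["a", "b", "c"], ["b", "c"]]
cost = total_wirelength(placement, nets)
print(cost)
```

A placer would evaluate such a cost for candidate cell positions and keep the moves that reduce it, subject to the no-overlap (legalization) requirement.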
