The Scaling Challenge: Can Correct-by-Construction Design Help?

Prashant Saxena, Noel Menezes, Pasquale Cocchini, Desmond Kirkpatrick. Intel Labs (CAD Research), Hillsboro, OR. International Symposium on Physical Design, Monterey, CA, Apr 16, 2003.

Repeaters, which are already a full-chip headache, will become critical at the block level also

Outline • Some scaling experiments • Spice simulations • Implications for post-RTL design • Correct-by-Construction (CbC) design • What’s the promise? What’s missing?

A Scaling Primer • Process scaling: • Devices shrink 0.7x, delay 0.7x • Wires shrink 0.7x • R/µm increases 2x, C/µm unchanged • So, (delay/scaled µm) increases 1.4x • Block area often stays the same • # cells and # nets double • Wiring histogram shape invariant
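The 2x figures in the primer follow from simple geometry. A minimal sketch, assuming an ideal 0.7x linear shrink of both wire width and wire thickness and an unchanged block area; the variable names are illustrative and not from the slides:

```python
# First-order scaling arithmetic for one process generation.
# Assumption: ideal linear shrink S applied to wire width, wire thickness,
# and cell dimensions; block area is held constant.
S = 0.7  # linear shrink factor per generation (from the primer)

# Wire resistance per unit length is inversely proportional to the
# cross-section (width x thickness), which shrinks by S * S = 0.49.
wire_cross_section_scale = S * S
r_per_length_scale = 1.0 / wire_cross_section_scale

# Each cell occupies S * S of its former area, so a block of unchanged
# area holds about 1 / (S * S) times as many cells (and nets).
cell_area_scale = S * S
cells_per_block_scale = 1.0 / cell_area_scale

print(f"R per unit length: {r_per_length_scale:.2f}x")
print(f"Cells per block:   {cells_per_block_scale:.2f}x")
```

Both ratios come out at roughly 2x, which is where the "R increases 2x" and "# cells doubles" bullets come from under these assumptions.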

Critical Repeater Lengths • Optimally sized uniformly for min delay • Min distance at which inserting a repeater speeds up the line • Scaling of 0.57x, in line with scaling theory • “Ideally shrunk” circuit requires additional repeaters (0.7x vs 0.57x)
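The 0.57x value can be sanity-checked against first-order theory. A minimal sketch, assuming the textbook proportionality in which the critical (optimal) repeater spacing varies as sqrt(device delay / (r * c)), with r and c the wire resistance and capacitance per unit length. The function name and the choice of model are assumptions for illustration; the slide's number comes from simulation, not from this formula.

```python
import math

def critical_length_scale(device_delay_scale, r_scale, c_scale):
    """Scale factor of the critical repeater length per generation.

    Assumes l_crit is proportional to sqrt(tau_device / (r * c)),
    a standard first-order result for uniformly repeated RC lines.
    """
    return math.sqrt(device_delay_scale / (r_scale * c_scale))

# Inputs taken from the scaling primer: device delay 0.7x,
# R per unit length 2x, C per unit length unchanged.
scale = critical_length_scale(device_delay_scale=0.7, r_scale=2.0, c_scale=1.0)
print(f"Critical repeater length scales by about {scale:.2f}x")
```

Under these assumptions the model gives about 0.59x, close to the 0.57x reported on the slide. Since wires only shrink 0.7x, an ideally shrunk net crosses more critical lengths and so needs more repeaters.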

Critical Sequential Lengths • Optimized for max distance in one clock period • Scaling of 0.43x • Assumes: • 2x frequency scaling, 5GHz on 90nm • Ignores setup, hold, skew • “Ideally shrunk” circuit: • Requires much new wire pipelining (0.7x vs 0.43x) • Ratio of regular to clocked repeaters decreasing 0.75x
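A rough cross-check of the 0.43x value is possible with first-order arithmetic. A minimal sketch, assuming that the delay per unit length of an optimally repeated wire varies as sqrt(r * c * device delay) and that the distance covered in one clock period is the period divided by that delay; setup, hold, and skew are ignored, as on the slide. The function name and the model are assumptions for illustration.

```python
import math

def sequential_length_scale(freq_scale, device_delay_scale, r_scale, c_scale):
    """Scale factor of the distance reachable in one clock period.

    Assumes delay per unit length of an optimally repeated wire is
    proportional to sqrt(r * c * tau_device).
    """
    period_scale = 1.0 / freq_scale
    delay_per_length_scale = math.sqrt(r_scale * c_scale * device_delay_scale)
    return period_scale / delay_per_length_scale

# Inputs from the slides: 2x frequency, device delay 0.7x,
# R per unit length 2x, C per unit length unchanged.
reach = sequential_length_scale(2.0, 0.7, 2.0, 1.0)
print(f"Critical sequential length scales by about {reach:.2f}x")

# Ratio of the two figures quoted on the slides (0.43x and 0.57x).
print(f"0.43 / 0.57 = {0.43 / 0.57:.2f}")
```

The model gives about 0.42x, near the slide's 0.43x. The ratio 0.43/0.57 is about 0.75, which appears consistent with the 0.75x quoted for the ratio of regular to clocked repeaters.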

[Figure: Block Wiring Histogram and Critical Repeater Lengths. Y axis: # wires (90nm), log scale; X axis: normalized wirelength. Legend: critical lengths for metal layers M3 and M6 at 90nm, 65nm, 45nm, 32nm.] Critical lengths migrating rapidly to the left… (zoomed view coming up)

[Figure: Block Wiring Histogram: Zoomed View. Y axis: # wires (90nm), log scale; X axis: normalized wirelength. Legend: critical repeater lengths for metal layers M3 and M6 at 90nm, 65nm, 45nm, 32nm.] Increasingly steep slope of curve (log scale) => # impacted nets exploding!

[Figure: Block Wiring Histogram and Critical Sequential Lengths. Panel title: PSC/bus1p Wiring Histogram, Critical Sequential Distances. Y axis: # wires (90nm), log scale; X axis: normalized wirelength. Legend: metal layers M3 and M6 at 90nm, 65nm, 45nm, 32nm.] # pipelined nets growing from negligible (90nm) to substantial (32nm)

Repeated Block-level Nets • Ever-increasing percentage of block-level nets requires repeaters • Even the rate of growth is accelerating! • …especially for clocked repeaters

Total Repeater Count • Ever-increasing fractions of total cell count will be repeaters • 70% in 32nm (and this omits FC repeaters within block!) • Total repeater count is independent of frequency scaling assumptions
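How a shrinking critical length turns into a growing repeater count can be sketched from a wiring histogram. A minimal illustration, assuming a simple rule of one repeater per full critical length along each net. The histogram values, the starting critical length, and the function name are hypothetical and are not the data behind the slides.

```python
def count_repeaters(histogram, critical_length):
    """Total repeaters for a wiring histogram.

    histogram maps wirelength -> number of nets of that length.
    Assumed rule: a net of length L gets floor(L / critical_length)
    repeaters, i.e. one per full critical length.
    """
    return sum(nets * int(length // critical_length)
               for length, nets in histogram.items())

# Hypothetical histogram in normalized wirelength units: many short
# nets, few long ones (illustrative numbers only).
hist = {1: 50000, 5: 5000, 10: 800, 20: 100, 40: 10}

at_20 = count_repeaters(hist, 20.0)  # longer critical length
at_10 = count_repeaters(hist, 10.0)  # critical length halved
print(at_20, at_10)
```

With this made-up histogram, halving the critical length raises the count from 120 to 1040. The count grows much faster than the critical length shrinks because the critical length moves into the densely populated part of the histogram.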

So, what’s changing? • Interconnects scaling worse than devices … in spite of optimal (re-)buffering • # repeaters increasing exponentially • Interconnect repeaters will comprise a significant fraction of cells in a block • Even block-level nets will need to be pipelined

Implications on Synthesis • Literal/Gate count and fanout metrics misleading • Major delay contribution from communication • Fanouts often isolated by repeaters • Area often wire-limited • Sizing often determined by (predictable) repeater load • Pre-layout sizing wasted

Implications on Synthesis • Less logic per pipeline stage • Combinational synthesis: max benefit shrinking • Synthesis across sequential boundaries • Methodological support for retiming

Implications on Synthesis • Bandwidth ceiling • Hard to move data around for computation • Logic replication • Encourage low fans • Dense encodings • Distribution of computation across channel

Implications on Layout • Routing • Must understand repeater insertion • Fine power grid => templated routing? • Placement with repeaters • Intra-block nets: # repeaters depends on routing • OTH routes: fixed obstructions • Add buffering into placement core … as opposed to ECO postprocessing

Implications on Layout • Latency-constrained placement • march sub-optimality • Hard constraint per stage (unlike delay) OR • Post-RTL latency optimization • Methodological nightmare • Delay insensitive design?

Implications on FC Assembly • What if we reduce block area to avoid wire effects? • Many of the new physical synthesis problems go away • BUT # blocks triples! (and block assembly is the hardest part of chip design!) • Flat assembly (fragmentation of paths across blocks) OR • Increased hierarchy (lack of visibility across hierarchy levels)

The CbC Link • Process scaling => worsening predictability • Predictability => CbC design • But current CbC approaches are too rigid • Can we still apply them?

Principles of CbC Design • More predictability • Reduced estimation error improves high-level optimizations • Break the design-verification loop • Sequence of small, guaranteed-correct transformations • No unexpected deterioration of secondary metrics • Avoid micro-engineering • Design productivity gap

Abstract Fabrics • Structural fabrics: too resource-intensive, e.g. DWF: 50% routing tracks • Use algorithmic fabrics instead • Prune to subspace with desirable CbC properties, e.g. non-uniform power grid using “min power pitch” (ISPD’02); guaranteed throughput bus design (ICCAD’02) • CbC rules-of-thumb, e.g. bound on max adjacent runs of signals • Performance with predictability

CbC Block Construction • “Vertical” partitioning and successive refinement • Coarse layout of unsynthesized design • Successive refinement of “vertical” partitions • Critical partitions first • Different partitions exist at different levels of refinement • Hierarchical engines • Enables early repeater prediction [Figure labels, levels of refinement: RTL; Synth/mapped netlist; Placed/buffered netlist; GR/track-assigned layout]

CbC Full Chip Assembly • Latency prediction for full-chip interconnects • Preferential routing for performance-critical nets • Flip-flop staging on non-critical nets • Performance prediction with cycle latency ranges • Block area mis-prediction tolerance • Move blocks without re-implementation • Global communication grids
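One way to picture "performance prediction with cycle latency ranges" is to bound the latency of a full-chip interconnect from a range of possible route lengths. A minimal sketch, assuming each pipeline stage covers at most one critical sequential length; the function name, parameters, and numbers are hypothetical and not from the slides.

```python
import math

def latency_range(min_length, max_length, reach_per_cycle):
    """Return (lowest, highest) cycle latency for a route.

    Assumes the route length may fall anywhere in
    [min_length, max_length] (for example, because block areas or
    positions are not yet final) and that one clock cycle covers at
    most reach_per_cycle of wire. All lengths share the same unit.
    """
    lowest = max(1, math.ceil(min_length / reach_per_cycle))
    highest = max(1, math.ceil(max_length / reach_per_cycle))
    return lowest, highest

# Hypothetical route: between 3.2 and 4.5 length units, with one
# length unit reachable per clock cycle.
cycles = latency_range(3.2, 4.5, 1.0)
print(cycles)
```

Here the route is predicted to take 4 to 5 cycles. Under this reading, a block can move without re-implementation as long as the design tolerates any latency inside the predicted range.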

Summing Up • Repeaters becoming critical at the block level • Most post-RTL design problems changing fundamentally • Combination of algorithmic and methodological advances required • CbC approaches viable, but at the abstract level • Current structural fabrics too resource intensive • Achieve predictability through algorithmic fabrics

Backup Slides

PIE (Process Independent Exploration) Models • To provide an easier way to study interconnect structures and their trends in future CMOS processes • To be used in place of fudged process files • Analytical models directly correlating to device and interconnect physics • Device models based on BSIM3 equations including major 2nd order effects • Accurate mobility and velocity saturation models, DIBL and channel length modulation approximation • Continuous from weak to strong inversion • Interconnect models with 2D fringe capacitance approximation • Scattering not accounted for • Entire process expressed by small set of physically meaningful process parameters (e.g. Tox, Vth, kild, etc.) in PEF (Process Exploration File) files • 16 for devices • 6 each metal layer • Test cases simulated as SPICE netlists • PIE models implemented as behavioral sources • Calibrated against existing process files
