CS137:Electronic Design Automation Day 8: January 27, 2006 Cellular Placement
Today • Problem • Parallelism • Cellular Automata Idea • Details • Avoid Local Minima • Update locations • Results • Directions • Primary Sources: Wrighton & DeHon FPGA 2003; Wrighton MS Thesis 2003
Placement • Problem: Pick locations for all building blocks • minimizing energy, delay, area • really: • minimize wire length • minimize channel density • surrogates: • Minimizing squared wire length • Minimize bounding box
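The two surrogates can be sketched in a few lines of Python (the function names and the point-list representation are illustrative, not from the lecture):

```python
def hpwl(points):
    """Half-perimeter wirelength: the bounding-box surrogate for one net.
    `points` is a list of (x, y) terminal locations."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def squared_wirelength(src, sinks):
    """Sum of squared source-to-sink distances: a smooth surrogate."""
    sx, sy = src
    return sum((sx - x) ** 2 + (sy - y) ** 2 for x, y in sinks)
```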
Parallelism • What parallelism exists in placement? • Evaluate costs of prospective moves • One set to many prospective locations • Many moves, each to a single location • Perform moves
Cellular Automata • Basic idea: regular array of identical cells with nearest-neighbor communication
CA Model • On each cycle: • Each cell exchanges values with neighbors • Updates state/value based on own state and that of neighbors • E.g. Conway’s LIFE
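Conway's LIFE makes the model concrete; a minimal sketch, assuming a set-of-live-cells representation:

```python
from collections import Counter

def life_step(live):
    """One synchronous CA update: every cell looks only at its 8 neighbors.
    `live` is a set of (x, y) coordinates of live cells."""
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Birth on exactly 3 live neighbors; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}
```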
Cellular Automata • Physical Advantages: • No long wires • Area linear in number of nodes • Minimum delay → small cycle time • Good scaling properties
System Architecture Taxonomy (Subject to continuing refinement and embellishment)
CA Placement • Can we perform placement in a CA?
Mapping • Each cell is a physical placement location • State is a logical node assigned to the cell • Assume: • Cell knows own location • State knows location of connected nodes
Costs • Assume: • Cell knows own location • State knows location of connected nodes • Cell computes: its cost at that location
Moves • Two adjacent cells can exchange graph nodes
Moves • Evaluate goodness of proposed swap • Each cell considers impact of its graph node being in the other cell • Keep if swap reduces cost
Move Costs • Only really need to evaluate the delta cost • Cost: (src.x − sink.x)^2 • Moving sink: d/dx = −2(src.x − sink.x) • Delta move cost is linear in distance
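A tiny sketch of the delta computation (hypothetical names; x-dimension only):

```python
def delta_cost_move_sink(src_x, sink_x, new_sink_x):
    """Change in squared wirelength when a sink moves along x.
    Only the delta is needed to judge a swap, not the full cost."""
    return (src_x - new_sink_x) ** 2 - (src_x - sink_x) ** 2
```

For a unit move the delta works out to −2(src.x − sink.x) ± 1, i.e., linear in the current distance.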
Parallel Swaps • Pair up and perform N/2 swaps in parallel
Movement • Alternate pairings with N, S, E, W neighbors → nodes can move in any direction
Basic Idea • Pair up PEs • Compute impact of swaps in parallel • Perform swaps in parallel • Repeat until converge
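A serial sketch of one swap phase; the grid/cost representation is assumed, not from the slides:

```python
def swap_phase(grid, cost, pairs):
    """Greedy variant of one parallel-swap phase, evaluated serially.
    grid: dict mapping cell -> graph node
    cost(node, cell): node's wire cost if placed at cell
    pairs: disjoint pairs of adjacent cells (so all swaps could run in parallel)"""
    for a, b in pairs:
        before = cost(grid[a], a) + cost(grid[b], b)
        after = cost(grid[a], b) + cost(grid[b], a)
        if after < before:  # keep the swap only if it reduces cost
            grid[a], grid[b] = grid[b], grid[a]
```

In hardware each pair evaluates and swaps simultaneously; the serial loop stands in for that.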
Problems/Details • Greedy swaps → stuck in local minima? • How to update locations of neighbors? • …they are moving, too
Avoid Greedy • Insert randomness in swaps • Simulated Annealing • Shake up system to get out of local minima • Swap if • Randomly decide to swap • OR beneficial to swap • Change swap thresholds over time
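The swap rule above is the standard Metropolis acceptance criterion; a minimal sketch:

```python
import math
import random

def accept_swap(delta, temperature, rng=random.random):
    """Always accept improving swaps; accept worsening swaps with
    probability exp(-delta / T), which shrinks as T cools."""
    if delta <= 0:
        return True
    if temperature <= 0:
        return False
    return rng() < math.exp(-delta / temperature)
```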
Range Limiting • Eguro, Hauck, & Sharma DAC 2005
Local Swaps Only • Assume there’s an ideal location • Each node takes a biased Random Walk away from minimum cost location • Gives node a distribution function around the minimum cost location • If wander into a better “minimum cost” home, then wanders around new centerpoint • Decreasing temperature restricts effective radius of walk
Local Swap Random Walk • Decreasing temperature restricts effective radius of walk
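One way to model the temperature-limited walk is a 1-D toy, with an assumed mapping from temperature to the probability of stepping away from the node's minimum-cost home:

```python
import random

def walk_step(pos, center, temperature, rng):
    """One step of a biased random walk: bias toward `center`.
    Higher temperature -> more likely to step away, so the walk's
    effective radius around `center` grows with T (assumed model)."""
    toward = -1 if pos > center else 1
    p_away = min(0.5, temperature)  # illustrative temperature mapping
    if rng.random() < p_away:
        return pos - toward
    return pos + toward
```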
How update locations? • Broadcast? • Pipelined Ring? • Send to neighbors? • Routing network? • Tree? • For whom? • Everyone? Only things moved? Only things moved a lot?
Simple Solution: Ring • Drop value in ring • Shift around entire array • Everyone listens for updates
Simple Solution: Ring • Weakness? • Serial • N cycles to complete • N/2 swaps in O(1) • Then O(N) to update?
Simple Solution: Ring • Linear update bad • Idea: allow staleness • Things move slowly • Estimate of position not that bad… • …and continued operation will correct…
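The ring itself is just a shift register of location records; a sketch using an assumed (node, location) representation:

```python
from collections import deque

def ring_update(ring, new_record=None):
    """One cycle of the update ring: a cell may drop a (node, location)
    record into its local slot, then the whole ring shifts one step.
    Every cell sees every record within N cycles, so any position a
    cell reads may be up to N cycles stale."""
    if new_record is not None:
        ring[0] = new_record
    ring.rotate(1)
    return ring
```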
Algorithm Update Locations
Algorithm Try Moves
Iso-Quality • Pick the point on the iso-quality curve that minimizes time
FPGA Implementation • Virtex E (180nm) • 10ns cycle (100MHz) • 150 cycles for 4-phase swap • (~40 cycles/swap) • 400 LUTs / Placement Engine • Compared against: 2.2GHz Intel Xeon (512KB L2)
Scaling • Processor cycles: O(N^(4/3)) • VPR • Systolic cycles: • O(N^(1/2)) – assumes geometric refinement; O(N^(1/2)) update • O(N^(5/6)) – mesh sort, same number of swaps as VPR (N^(4/3) / N^(1/2))
Scaling • Also includes technology scaling
Variations • Update Schemes • Cost Functions • Larger bins than PEs
Update Scheme: Tree • Build reduce tree (H-Tree) • Route to root in O(N^(1/2)) time • Route from root to leaves in O(N^(1/2)) time • Pipeline • Same bandwidth as ring (1/cycle) • But less staleness (only O(N^(1/2)))
Reducing Broadcast (Idea 1) • Don’t update things that haven’t moved (much) • …or things that move and move back before broadcast • Keep track of staleness • How far moved from last broadcast • Give priority to stalest data • Max staleness wins at each tree stage • Break ties with randomness
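One tree stage of the staleness arbitration might look like this (hypothetical (node, staleness) records):

```python
import random

def tree_arbitrate(left, right, rng):
    """Forward the record with the larger staleness (distance moved
    since last broadcast); break ties randomly."""
    if left[1] != right[1]:
        return left if left[1] > right[1] else right
    return left if rng.random() < 0.5 else right
```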
Reducing Broadcast (Idea 2) • Update locally • Don’t need to know if someone far away moved by 1 square • …but need to know if near neighbor did • Multigrid/multiscale scheme • Only alert nodes in same subtree • When change subtrees at a level, alert all nodes underneath
Update Scheme: Mesh Route • Can route a permutation in O(N^(1/2)) time on a mesh • Build mesh switching • Make O(N) swaps • Then take O(N^(1/2)) time moving/updating • Becomes full simulated annealing • i.e., not just local swaps
Cost Functions • Bounding box → 2-phase update • Phase 1: alert source to location of all sinks • Phase 2: source communicates bbox extents to all sinks
Timing • Linear Update: • Topological ordering of netlist • Use tree to distribute updates • Send updates in netlist order • get delay in one pass • Mesh: • Compute directly with dataflow-style spreading activation • Wait for all inputs; then send output
Node Bins • Keep more than one graph node per PE • Local swap of one node from each PE's node set each step • One with largest benefit? • Randomly select based on cost/benefit? • Like rejectionless annealing
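The benefit-weighted selection could be sketched as follows (names and the clamping of negative benefits are assumptions):

```python
import random

def pick_swap_candidate(bin_nodes, benefit, rng):
    """Choose which node in a PE's bin to offer for the local swap,
    weighting each node by its (clamped) swap benefit -- in the spirit
    of rejectionless annealing's move selection."""
    weights = [max(benefit(n), 0.0) + 1e-9 for n in bin_nodes]
    total = sum(weights)
    r = rng.random() * total
    for node, w in zip(bin_nodes, weights):
        r -= w
        if r <= 0:
            return node
    return bin_nodes[-1]
```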
Admin • Parallel Prefix familiarity? • Due today: literature review • There is class on Monday