CS137: Electronic Design Automation. Day 8: January 27, 2006 Cellular Placement. Problem Parallelism Cellular Automata Idea Details Avoid Local Minima Update locations Results Directions. Primary Sources Wrighton&DeHon FPGA2003 Wrighton MS Thesis 2003. Today. Placement.
CS137:Electronic Design Automation Day 8: January 27, 2006 Cellular Placement
Today • Problem • Parallelism • Cellular Automata Idea • Details • Avoid Local Minima • Update locations • Results • Directions • Primary Sources • Wrighton & DeHon, FPGA 2003 • Wrighton, MS Thesis, 2003
Placement • Problem: Pick locations for all building blocks • minimizing energy, delay, area • really: • minimize wire length • minimize channel density • surrogates: • Minimizing squared wire length • Minimize bounding box
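The two surrogate costs named above can be sketched for a single net as follows. This is a minimal illustration, not the lecture's code: the function names are hypothetical, pins are assumed to be integer `(x, y)` grid coordinates, and the half-perimeter is assumed as the bounding-box measure (the slide only says "bounding box").

```python
def squared_wire_length(src, sinks):
    """Sum of squared Euclidean distances from a net's source to each sink."""
    sx, sy = src
    return sum((sx - x) ** 2 + (sy - y) ** 2 for x, y in sinks)


def half_perimeter_bbox(pins):
    """Half-perimeter of the bounding box enclosing all pins of a net.

    Assumption: half-perimeter is used as the bounding-box cost.
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


# Hypothetical three-pin net: one source, two sinks.
net_src = (0, 0)
net_sinks = [(3, 4), (1, 0)]
sq = squared_wire_length(net_src, net_sinks)     # (9 + 16) + 1
bb = half_perimeter_bbox([net_src] + net_sinks)  # 3 + 4
```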
Parallelism • What parallelism exists in placement? • Evaluate costs of prospective moves • One set to many prospective locations • Many moves each to single location • Perform moves
Cellular Automata • Basic idea: regular array of identical cells with nearest-neighbor communication
CA Model • On each cycle: • Each cell exchanges values with neighbors • Updates state/value based on own state and that of neighbors • E.g. Conway’s LIFE
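The per-cycle update rule can be illustrated with Conway's LIFE, the example the slide names. This sketch assumes a wrap-around (toroidal) grid and uses the standard eight-cell neighborhood; `life_step` is a hypothetical name.

```python
def life_step(grid):
    """One synchronous CA update of Conway's Life on a wrap-around grid."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Each cell reads only the values of its eight nearest neighbors.
            n = sum(grid[(r + dr) % rows][(c + dc) % cols]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            # New state depends on own state plus the neighbor count.
            new[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return new


# A vertical "blinker" turns horizontal after one step and back after two.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0]]
after = life_step(blinker)
```

Every cell computes its next value from the same old grid, which is what makes the update synchronous and therefore parallel across cells.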
Cellular Automata • Physical Advantage: • No long wires • Area linear in number of nodes • Minimum delay → small cycle time • Good scaling properties
System Architecture Taxonomy (Subject to continuing refinement and embellishment)
CA Placement • Can we perform placement in a CA?
Mapping • Each cell is a physical placement location • State is a logical node assigned to the cell • Assume: • Cell knows own location • State knows location of connected nodes
Costs • Assume: • Cell knows own location • State knows location of connected nodes • Cell computes: its cost at that location
Moves • Two adjacent cells can exchange graph nodes
Moves • Evaluate goodness of proposed swap • Each cell considers impact of its graph node being in the other cell • Keep if swap reduces cost
Move Costs • Only really need to evaluate delta cost • Cost: $(src.x - sink.x)^2$ • Moving sink • $d/dx = -2\,(src.x - sink.x)$ • Delta move cost is linear in distance
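As a worked check of the claim that the delta cost is linear in distance, write $d = src.x - sink.x$ (a shorthand introduced here, not in the slides) and move the sink one cell toward larger $x$:

```latex
\begin{align*}
C(d)   &= d^2 \\
\Delta C &= C(d - 1) - C(d) = (d - 1)^2 - d^2 = -2d + 1
\end{align*}
```

The exact one-cell delta, $-2d + 1$, is linear in $d$ and matches the derivative $-2d$ up to a constant, so a cell only needs the signed distance to each connected node, not the full squared cost, to judge a swap.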
Parallel Swaps • Pair up and perform N/2 swaps in parallel
Movement • Alternate pairings with N, S, E, W neighbors → move in any direction
Basic Idea • Pair up PEs • Compute impact of swaps in parallel • Perform swaps in parallel • Repeat until converge
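The loop above can be sketched in software as follows. This is a simplified model under stated assumptions, not the hardware algorithm from the primary sources: every cell holds exactly one node, the cost is squared wire length, all swap decisions in a phase are made from one snapshot of locations, and the names `swap_gain`, `pairs`, and `sweep` are hypothetical.

```python
def swap_gain(a, b, nets, loc):
    """Cost reduction from exchanging nodes a and b (positive is good)."""
    def local_cost(where):
        total = 0
        for n in (a, b):
            for m in nets.get(n, ()):
                (x0, y0), (x1, y1) = where(n), where(m)
                total += (x0 - x1) ** 2 + (y0 - y1) ** 2
        return total

    swapped = {a: loc[b], b: loc[a]}
    before = local_cost(lambda n: loc[n])
    after = local_cost(lambda n: swapped.get(n, loc[n]))
    return before - after


def pairs(rows, cols, phase):
    """Disjoint neighbor pairs for one of four alternating phases."""
    offset = phase % 2
    if phase < 2:  # horizontal pairings
        return [((r, c), (r, c + 1))
                for r in range(rows) for c in range(offset, cols - 1, 2)]
    return [((r, c), (r + 1, c))  # vertical pairings
            for r in range(offset, rows - 1, 2) for c in range(cols)]


def sweep(grid, nets):
    """Run the four pairing phases once; return the number of swaps made."""
    rows, cols = len(grid), len(grid[0])
    swaps = 0
    for phase in range(4):
        # Snapshot of locations: all pairs decide "in parallel" from it.
        loc = {grid[r][c]: (r, c) for r in range(rows) for c in range(cols)}
        chosen = [(p, q) for p, q in pairs(rows, cols, phase)
                  if swap_gain(grid[p[0]][p[1]], grid[q[0]][q[1]], nets, loc) > 0]
        for (r0, c0), (r1, c1) in chosen:
            grid[r0][c0], grid[r1][c1] = grid[r1][c1], grid[r0][c0]
        swaps += len(chosen)
    return swaps


# Hypothetical 1x4 array where only nodes a and d are connected.
grid = [["a", "b", "c", "d"]]
nets = {"a": ["d"], "d": ["a"]}
first = sweep(grid, nets)   # a and d each move one cell toward the other
second = sweep(grid, nets)  # nothing left to improve
```

Because every pair decides from the same snapshot, simultaneous swaps are not guaranteed to lower the total cost when combined, so a real loop would bound the number of sweeps rather than rely on convergence.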
Problems/Details • Greedy swaps → local minima? • How to update location of neighbors? • …they are moving, too
Avoid Greedy • Insert randomness in swaps • Simulated Annealing • Shake up system to get out of local minima • Swap if • Randomly decide to swap • OR beneficial to swap • Change swap thresholds over time
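A minimal sketch of the swap rule as the slide states it: swap if beneficial, or if a random draw says so, with the threshold changing over time. The slide does not give the schedule, so the geometric decay below is an assumption, as are the names `should_swap` and `threshold_schedule`. This is not the standard Metropolis criterion, which would also weigh how bad the move is.

```python
import random


def should_swap(gain, threshold, rng):
    """Swap when beneficial, or when a random draw falls below the threshold."""
    return gain > 0 or rng.random() < threshold


def threshold_schedule(start, decay, steps):
    """Assumed geometric schedule: threshold shrinks by `decay` each step."""
    return [start * decay ** k for k in range(steps)]


rng = random.Random(0)
schedule = threshold_schedule(0.5, 0.5, 3)
always_good = should_swap(1.0, 0.0, rng)   # beneficial swaps are always taken
never_bad = should_swap(-1.0, 0.0, rng)    # threshold 0: no random swaps
forced_bad = should_swap(-1.0, 1.0, rng)   # threshold 1: always swap
```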
Range Limiting • Eguro, Hauck, & Sharma, DAC 2005
Local Swaps Only • Assume there’s an ideal location • Each node takes a biased Random Walk away from minimum cost location • Gives node a distribution function around the minimum cost location • If wander into a better “minimum cost” home, then wanders around new centerpoint • Decreasing temperature restricts effective radius of walk
Local Swap Random Walk • Decreasing temperature restricts effective radius of walk
How update locations? • Broadcast? • Pipelined Ring? • Send to neighbors? • Routing network? • Tree? • For whom? • Everyone? Only things moved? Only things moved a lot?
Simple Solution: Ring • Drop value in ring • Shift around entire array • Everyone listens for updates
Simple Solution: Ring • Weakness? • Serial • N cycles to complete • N/2 swaps in O(1) • Then O(N) to update?
Simple Solution: Ring • Linear update bad • Idea: allow staleness • Things move slowly • Estimate of position not that bad… • …and continued operation will correct…
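The staleness idea can be sketched with a simplified model: one node's location is published per cycle in round-robin order (matching the ring's bandwidth of one update per cycle), and each listener keeps a possibly outdated `view`. Propagation delay around the ring is ignored here, and all names are hypothetical.

```python
def ring_update(view, actual, cycle):
    """Publish one node's true location per cycle, round-robin."""
    names = sorted(actual)
    node = names[cycle % len(names)]
    view[node] = actual[node]
    return node


def max_error(view, actual):
    """Largest Manhattan error between believed and true locations."""
    return max(abs(view[n][0] - actual[n][0]) + abs(view[n][1] - actual[n][1])
               for n in actual)


actual = {"a": (0, 0), "b": (0, 1), "c": (1, 0)}
view = dict(actual)                                  # everyone starts up to date
actual["a"], actual["b"] = actual["b"], actual["a"]  # a and b swap cells
worst_before = max_error(view, actual)               # view is stale by one cell
ring_update(view, actual, 0)                         # publishes a
ring_update(view, actual, 1)                         # publishes b
worst_after = max_error(view, actual)
```

Since each move is a swap with an adjacent cell, a stale estimate is off by only a small distance per swap, which is why placement can continue while updates circulate.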
Algorithm Update Locations
Algorithm Try Moves
Iso-Quality • Pick point on iso-quality curve that minimizes time
FPGA Implementation • Virtex E (180nm) • 10ns cycle (100MHz) • 150 cycles for 4-phase swap • (~40 cycles/swap) • 400 LUTs / Placement Engine • Comparing • 2.2GHz Intel Xeon (L2 512KB)
Scaling • Processor cycles $O(N^{4/3})$ • VPR • Systolic cycles • $O(N^{1/2})$ – assume geometric refinement; $O(N^{1/2})$ update • $O(N^{5/6})$ – mesh sort, same number of swaps as VPR ($N^{4/3} / N^{1/2}$)
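The exponent arithmetic behind the $O(N^{5/6})$ figure is:

```latex
\[
\frac{N^{4/3}}{N^{1/2}} = N^{4/3 - 1/2} = N^{8/6 - 3/6} = N^{5/6}
\]
```

One reading that gives the same result (an interpretation, not stated on the slide): $N^{4/3}$ total swaps at $O(N)$ swaps per parallel step is $N^{1/3}$ steps, and each step pays an $O(N^{1/2})$ mesh update, so $N^{1/3} \cdot N^{1/2} = N^{5/6}$.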
Scaling Also includes technology scaling
Variations • Update Schemes • Cost Functions • Larger bins than PEs
Update Scheme: Tree • Build Reduce Tree (H-Tree) • Route to root in $O(N^{1/2})$ time • Route from root to leaves in $O(N^{1/2})$ time • Pipeline • Same bandwidth as Ring (1/cycle) • But less staleness (only $O(N^{1/2})$)
Reducing Broadcast (Idea 1) • Don’t update things that haven’t moved (much) • …or things that move and move back before broadcast • Keep track of staleness • How far moved from last broadcast • Give priority to stalest data • Max staleness wins at each tree stage • Break ties with randomness
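The priority rule can be sketched as a tournament over a binary reduce tree: at each stage the staler of two entries advances, with ties broken at random. This assumes a non-empty table mapping each node to its staleness (distance moved since its last broadcast); `pick_update` is a hypothetical name.

```python
import random


def pick_update(staleness, rng):
    """Tournament over a binary reduce tree: the stalest entry wins each stage."""
    level = list(staleness.items())
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            if a[1] == b[1]:
                nxt.append(rng.choice([a, b]))  # break ties with randomness
            else:
                nxt.append(a if a[1] > b[1] else b)
        if len(level) % 2:
            nxt.append(level[-1])               # odd entry advances unopposed
        level = nxt
    return level[0][0]


rng = random.Random(1)
winner = pick_update({"a": 0, "b": 3, "c": 1, "d": 2}, rng)
tied = pick_update({"a": 2, "b": 2}, rng)
```

A node that moves and then moves back has staleness 0 again, so it never wins a broadcast slot, which covers the "move and move back" case on the slide.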
Reducing Broadcast (Idea 2) • Update locally • Don’t need to know if someone far away moved by 1 square • …but need to know if near neighbor did • Multigrid/multiscale scheme • Only alert nodes in same subtree • When change subtrees at a level, alert all nodes underneath
Update Scheme: Mesh Route • Can route a permutation in $O(N^{1/2})$ time on a mesh • Build mesh switching • Make $O(N)$ swaps • Then take $O(N^{1/2})$ time moving/updating • Becomes full simulated annealing • i.e. not just local swaps
Cost Functions • Bounding Box → 2-phase update • Phase 1: alert source to location of all sinks • Phase 2: source communicates bbox extents to all sinks
Timing • Linear Update: • Topological ordering of netlist • Use tree to distribute updates • Send updates in netlist order • get delay in one pass • Mesh: • Compute directly with dataflow-style spreading activation • Wait for all inputs; then send output
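The "delay in one pass" point can be sketched in software: if nodes are visited in topological order, every input arrival time is known before a node is processed. This is an illustration only; the netlist, the per-node delays, and the name `arrival_times` are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def arrival_times(fanin, delay):
    """Single pass in topological order: arrival = latest input + own delay."""
    arrival = {}
    for node in TopologicalSorter(fanin).static_order():
        # All predecessors appear earlier in the order, so their times exist.
        ready = max((arrival[p] for p in fanin.get(node, ())), default=0)
        arrival[node] = ready + delay[node]
    return arrival


# Hypothetical netlist: a and b feed c, and c feeds d.
fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
delay = {"a": 1, "b": 2, "c": 1, "d": 3}
times = arrival_times(fanin, delay)
```

The mesh variant on the slide reaches the same values without a global order: each node waits for all of its inputs and then sends its own output.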
Node Bins • Keep more than one graph node per PE • Local swap of one node from each PE node set each step • One with largest benefit? • Randomly select based on cost/benefit? • Like rejectionless annealing
Admin • Parallel Prefix familiarity? • Due today: literature review • There is class on Monday