CS137: Electronic Design Automation
Day 2: January 6, 2006
Spatial Routing
Today • Idea • Challenges • Path Selection • Victimization • Allocation • Methodology • Quality, Timing • Parallelism • Mesh • FPGA Implementation
Global/Detail (from CS137a: Day 22) • With limited switching (e.g. FPGA) • can represent the routing graph exactly
Pathfinder Review • Key step: find shortest path from src to sink • Mark links by usage • Used links cost the most • Shortest-path search tries to avoid them • Negotiated Congestion with History • Increase cost of congested nodes • Adaptive cost makes historically congested nodes expensive, so later searches try to avoid them
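As a refresher, here is a minimal sketch of how negotiated congestion prices a routing resource. The cost formula and weighting factors are generic Pathfinder-style assumptions, not the specific constants used in the course.

```python
def node_cost(base_cost, occupancy, capacity, history, p_fac, h_fac):
    """Pathfinder-style negotiated-congestion cost for one routing resource.

    base_cost : intrinsic cost (e.g. delay) of using this node
    occupancy : nets currently routed through the node
    capacity  : nets the node can legally carry
    history   : accumulated over-use from earlier routing iterations
    p_fac     : present-sharing weight (typically grows each iteration)
    h_fac     : history weight (makes chronically congested nodes expensive)
    """
    present = 1 + p_fac * max(0, occupancy + 1 - capacity)
    return (base_cost + h_fac * history) * present


def update_history(history, occupancy, capacity):
    """After each iteration, remember how over-subscribed the node was."""
    return history + max(0, occupancy - capacity)
```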
Slow? • Why is routing slow? • Each route: • search all possible paths from source to sink • Number of paths expands as distance² • Graph of the network is MBs large • Large, complicated data structure to walk • Won't all fit in cache • Number of nets = Number of edges • Perform many iterations to converge
Parallelism? • Search all paths in parallel for a single route • Search routes for multiple nets in parallel • Don’t overlap • Overlap?
Initial Key Ideas • Augment existing static network structure to route itself • Use hardware to exploit parallelism in routing • Search all paths in parallel • Route multiple nets in parallel • Avoid walking irregular graph • Specialized/pipelined hardware at each switch • Hardware can perform a route trial in 10s of cycles vs. 10K-100K cycles for software
Hardware Route Search in Action
Path Search Hardware • Idea: existing paths are already allocated • Drive a one into all search paths • All free paths pass the one up
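A software analogue of the search step, purely illustrative: the "drive a one into search paths" operation amounts to asking which candidate paths contain no already-allocated switch.

```python
def free_search_paths(candidate_paths, allocated_switches):
    """Return the candidate paths a search 'one' would propagate through.

    candidate_paths    : list of paths, each a list of switch ids
    allocated_switches : set of switch ids already claimed by existing routes
    In hardware all paths are probed in parallel in one pass; here we just
    filter out any path that touches an allocated switch.
    """
    return [p for p in candidate_paths
            if not any(sw in allocated_switches for sw in p)]
```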
Challenges • How to select among paths? • What if there are no free paths? • Can we work without Pathfinder's history? • How to handle fanout? • How to handle allocation and victimization?
Select Among Paths? • Easy: randomly • Use a PRNG at the crossover (xover) switchbox • Otherwise, need to represent costs…
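The slides leave the PRNG unspecified; one cheap hardware option would be an LFSR per crossover switchbox. A sketch, assuming a standard 16-bit maximal-length LFSR (taps 16, 14, 13, 11):

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << 15)) & 0xFFFF


def pick_random_path(free_paths, state):
    """Use the LFSR state to pick one of the free paths at the crossover."""
    state = lfsr16_step(state)
    return free_paths[state % len(free_paths)], state
```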
No Paths? • Try stealing a path (rip-up): victimize an existing path • Which one? • Randomly select a victim • History-free Pathfinder suggests: • CountCost: the victim whose path shares the fewest links with other routes • CountNet: the victim which intersects the fewest distinct existing nets
CountNet vs. CountCost • Example (see figure): CountCost = 6 • CountNet = 1
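To make the two policies concrete, here is an illustrative software rendering; the per-switch usage map is an assumed data structure, not how the hardware tracks it.

```python
def count_cost(path, usage):
    """CountCost: total occupancies (shared-link count) along the candidate path.
    usage[switch] = set of nets currently routed through that switch."""
    return sum(len(usage.get(sw, ())) for sw in path)


def count_net(path, usage):
    """CountNet: number of distinct existing nets the candidate path intersects."""
    nets = set()
    for sw in path:
        nets.update(usage.get(sw, ()))
    return len(nets)


def pick_victim(candidate_paths, usage, policy=count_net):
    """Victimize the candidate path that disturbs the least under the chosen policy."""
    return min(candidate_paths, key=lambda p: policy(p, usage))
```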
Implement Counting? • Idea: delay congested signals • Free paths are not delayed • The least congested signal arrives at the xover first
CountNet Approximation • Keeping track of which net uses each switch would require much more state and complexity • Approximate CountNet by delaying only at conflicting switches
Implement CountNet Approximation • Allow the search signal to pass without delay if it agrees with the existing switch setting
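A rough software model of the delay trick behind the CountNet approximation: only switches already set in a conflicting way add delay, so the first search signal to reach the crossover is (approximately) the one crossing the fewest conflicts. The one-cycle-per-switch baseline is an assumption.

```python
def arrival_time(path, switch_setting, needed_setting):
    """Cycles for a search signal to traverse `path` under the CountNet approximation.

    switch_setting[sw] : current setting of switch sw, or None if the switch is free
    needed_setting[sw] : setting this path would require at switch sw
    Free switches, and switches whose setting already agrees, pass the signal
    without penalty; a conflicting switch delays it by one extra cycle.
    """
    cycles = len(path)  # assumed baseline: one cycle per switch
    for sw in path:
        current = switch_setting.get(sw)
        if current is not None and current != needed_setting[sw]:
            cycles += 1  # conflict: delay the search signal
    return cycles
```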
Cost Is Max of Sides • Also note: the actual cost is max(src→xover, sink→xover) rather than the sum
Algorithm Comparison – Random Netlist (plot: Total Channels vs. HSRA Array Size)
How to Improve? • Apologize for lack of history? • Exploit fast route time: try multiple starts and exploit randomness • Like multiple starts of FM
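Multiple starts are just a restart loop around the randomized router, keeping the best result; `route_once` and its return values are illustrative names.

```python
def best_of_n_starts(route_once, netlist, n_starts, base_seed=1):
    """Run the randomized router n_starts times and keep the cheapest solution.

    route_once(netlist, seed) -> (solution, channels_used) is assumed; in the
    hardware router, re-seeding the per-switchbox PRNGs plays the same role.
    """
    best_solution, best_channels = None, None
    for i in range(n_starts):
        solution, channels = route_once(netlist, base_seed + i)
        if best_channels is None or channels < best_channels:
            best_solution, best_channels = solution, channels
    return best_solution, best_channels
```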
CountNet, best of 20 starts (plot).
Hypergraphs (Fanout)
• Sequentially route each two-point net, trying to re-use as much as possible from existing allocated paths
• Add a state bit at every switch
  • Set when the switch is allocated during the current net's search
  • Cleared when we begin to route a new net
• Order the destinations associated with a single source
• For each destination:
  • Search from the sink as before (only from the sink)
  • If, at a switch, the state bit is set and the sink side is congestion free, we have found an available path
  • Otherwise, drive ones into all available source paths and allocate a new path, like a standard route search
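A software sketch of the state-bit reuse for multi-terminal nets; the helper names and the ordering of destinations are assumptions for illustration.

```python
def route_fanout_net(sinks, search_from_sink, allocate):
    """Route one multi-terminal net, reusing switches already claimed for this net.

    search_from_sink(sink, in_current_net) -> path is assumed to search from the
    sink only, and may terminate early at a switch whose state bit is set
    (i.e. already part of this net's tree) when the sink side is congestion free.
    """
    in_current_net = set()      # the per-switch state bit, cleared for each new net
    for sink in sinks:          # destinations of this source, in some chosen order
        path = search_from_sink(sink, in_current_net)
        allocate(path)                  # claim the (possibly partial) new path
        in_current_net.update(path)     # set state bits so later sinks can reuse it
    return in_current_net               # the switches forming this net's route tree
```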
High-Fanout Nets • Victimizing a high-fanout net causes considerable re-route work • Might want to penalize victimizing high-fanout nets • CountNetFanout? • Requires more state… expensive… • Simple hack: lock high-fanout nets against victimization • What counts as high fanout? >10?
So Far • All about quality • …haven't dealt with all the performance details • Had a basis for confidence in performance • Wanted to make sure the approach was worthwhile first
Hardware Allocation • Idea: send a one down the selected path

Add all nets to R
While nets in R > 0 and routeTrial < RTmax
  For each unrouted net
    Find all possible routes
    If found possible routes
      Randomly select and allocate a route
    Else
      Select a route to victimize and allocate the route
  Endfor
  Adjust R
Endwhile
With Victimization

Add all nets to R
While nets in R > 0 and routeTrial < RTmax
  For each unrouted net
    Find all possible routes
    If found possible routes
      Randomly select and allocate a route
    Else
      Randomly select a route to victimize and allocate the route
  Endfor
  Adjust R
Endwhile
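The same loop in Python-flavored form, with the victimization bookkeeping spelled out; all helper names are illustrative, not the actual implementation.

```python
import random

def route_all(nets, candidate_paths, is_free, occupants, allocate, rip_up, rt_max):
    """Allocation loop with random victimization (a sketch).

    candidate_paths(net) -> all topologically possible paths for the net
    is_free(path)        -> True if no switch on the path is already occupied
    occupants(path)      -> nets currently using switches on the path
    allocate(net, path)  -> claim the path for the net
    rip_up(net)          -> release a victim net's resources
    """
    unrouted = set(nets)           # R in the slide's pseudocode
    trial = 0
    while unrouted and trial < rt_max:
        trial += 1
        for net in list(unrouted):
            paths = candidate_paths(net)
            free = [p for p in paths if is_free(p)]
            if free:
                chosen = random.choice(free)
            else:
                chosen = random.choice(paths)   # randomly pick a route to victimize
                for victim in occupants(chosen):
                    rip_up(victim)
                    unrouted.add(victim)        # victims go back into R
            allocate(net, chosen)
            unrouted.discard(net)
    return unrouted    # any nets left here failed to route within rt_max trials
```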
Analysis Methodology • Sequential version that does effectively the same thing (perhaps inefficiently) • Count key operations/variables • Number of net searches • Number of victims • Timing model for key operations • Calculate Performance under various timing assumptions
Timing Models • Hardware timing: • Tpath = length of path ≈ log(N) • Tallocate ≈ Tpath • Tvictim ≈ 4·Tpath • Software timing: • Tallocate ≈ Npaths_sw·(Tm + Tc + Twb + Ta) • Tvictim ≈ Npaths_sw·(Tm + Tc) + V·Talloc • Tm = main memory reference; Tc = cache reference; Twb = write buffer; Ta = bit allocation; V = number of victims
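Plugging illustrative numbers into the two models; the per-reference cycle counts (Tm, Tc, Twb, Ta) below are assumptions chosen only to show how the comparison is made, not measurements from the slides.

```python
from math import log2

def hw_times(n_endpoints):
    """Hardware model: search propagates over a path of length ~ log(N) cycles."""
    t_path = log2(n_endpoints)
    return {"t_path": t_path, "t_allocate": t_path, "t_victim": 4 * t_path}


def sw_times(n_paths_sw, victims=0, t_m=100, t_c=2, t_wb=4, t_a=1):
    """Software model from the slide with assumed per-reference costs (in cycles)."""
    t_allocate = n_paths_sw * (t_m + t_c + t_wb + t_a)
    t_victim = n_paths_sw * (t_m + t_c) + victims * t_allocate
    return {"t_allocate": t_allocate, "t_victim": t_victim}


# Example: a 1024-endpoint array vs. a software search touching 500 path nodes.
print(hw_times(1024))            # t_path = 10 cycles, t_victim = 40 cycles
print(sw_times(500, victims=1))  # tens of thousands of cycles
```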
Route Time • Ntry – number of route starts • NRT – number of path searches • NRO – number of rip-ups • NFO – number of fanout searches • NFOA – number of fanout allocations
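One plausible way these counts combine with the timing model above; the exact expression used for the analysis is not shown on the slide, so treat this purely as an illustration of the bookkeeping.

```python
def estimated_route_time(n_rt, n_ro, n_fo, n_foa, t_path, t_allocate, t_victim):
    """Illustrative total: each path search costs ~t_path + t_allocate, each
    rip-up ~t_victim, each fanout search ~t_path, and each fanout allocation
    ~t_allocate.  (Assumed combination, not the slide's exact formula.)"""
    return (n_rt * (t_path + t_allocate)
            + n_ro * t_victim
            + n_fo * t_path
            + n_foa * t_allocate)
```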
Making Comparisons • There is a quality/time tradeoff • Want to compare at iso-quality
More Parallelism • Only exploiting parallelism in path search • Subtrees are independent • Route root • Then route next two channels in parallel • Then route next 4…
Still Not Exploiting • Multiple path searches in parallel that overlap routing resources…
Extension to Mesh Networks • No well-defined crossover point • The path back to the source is not implied directly by the topology of the routing network • Paths have different lengths, and non-minimal-length paths may be important components of a good solution
Mesh Approach • Single-ended search from the source • Larger delay on congestion allows non-minimal-length paths • Breadcrumb approach: leave state in switches pointing back to the source
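A sketch of the single-ended, congestion-delayed search with breadcrumbs; the grid representation, delay rule, and shortest-path formulation are assumptions for illustration.

```python
import heapq

def mesh_search(source, sink, neighbors, congestion_delay):
    """Single-ended search from source over a mesh.

    neighbors(node)        -> adjacent switch nodes
    congestion_delay(node) -> extra cycles if the node is congested (0 if free);
    this is what lets non-minimal-length paths win when the short ones are busy.
    Breadcrumbs: came_from records, at each reached switch, the neighbor that
    points back toward the source, so the path can be traced once the sink is hit.
    """
    dist = {source: 0}
    came_from = {source: None}
    frontier = [(0, source)]
    while frontier:
        t, node = heapq.heappop(frontier)
        if node == sink:
            break
        if t > dist[node]:
            continue                      # stale queue entry
        for nxt in neighbors(node):
            nt = t + 1 + congestion_delay(nxt)
            if nt < dist.get(nxt, float("inf")):
                dist[nxt] = nt
                came_from[nxt] = node     # leave a breadcrumb back to the source
                heapq.heappush(frontier, (nt, nxt))
    if sink not in came_from:
        return None                       # no path reachable
    path, node = [], sink
    while node is not None:               # trace breadcrumbs from sink to source
        path.append(node)
        node = came_from[node]
    return list(reversed(path))
```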
Extension to Mesh Networks – Results (simulator too slow to run larger arrays)
BFT FPGA Implementation • 21 4-LUTs to implement switch logic + 9 4-LUTs to manage PRNG/allocation = 30 4-LUTs per T-switch • 13/3 switches per PE per domain → 130 4-LUTs per PE per domain • C = 10 → 1300 4-LUTs per PE
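The resource arithmetic from the slide, spelled out (the numbers are the slide's own; only the variable names are illustrative):

```python
# BFT FPGA resource estimate, step by step
luts_per_t_switch = 21 + 9            # switch logic + PRNG/allocation = 30 4-LUTs
switches_per_pe_per_domain = 13 / 3   # ~4.33 T-switches per PE per domain
luts_per_pe_per_domain = luts_per_t_switch * switches_per_pe_per_domain  # = 130
domains = 10                          # C = 10
luts_per_pe = luts_per_pe_per_domain * domains                           # = 1300
```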