ECE 506 Reconfigurable Computing
http://www.ece.arizona.edu/~ece506
Lecture 6: Clustering
Ali Akoglu
Before Placement: Clustering
• Intra-cluster connections: fast
• Inter-cluster connections: slow
• Need to pack BLEs
• Goals:
  • Reduce stress on routing
  • Take advantage of local fast interconnect
  • Reduce inter-cluster wiring
  • Minimize the critical path (timing-driven)
• How do we do this?
  • Take advantage of the cluster architecture
  • Tradeoffs
Basic Clustering (Betz)
• How many distinct inputs should be provided to a cluster of N 4-LUTs?
• How many 4-LUTs should be included in a cluster to create the most area-efficient logic block?
Basic Clustering (Betz)
• Flow (a greedy sketch of this loop appears below):
  • Iterate until all BLEs are consumed
  • Start a new cluster by selecting a seed BLE (the currently unclustered BLE with the most used inputs)
  • Add the BLE that shares the most inputs with the current cluster, to minimize the number of inputs that must be routed to each cluster
  • Keep adding until either the cluster is full or its input pins are used up
• Hill climbing, if some cluster BLEs remain unused:
  • Add another BLE even if the cluster input count temporarily overflows
  • If the input count is not eventually reduced, revert to the best choice from before hill climbing
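A minimal sketch of this greedy flow, not the actual VPack implementation: each BLE is described only by the set of input nets it uses, the names (`pack_vpack`, cluster capacity `N`, input limit `I`) are illustrative, and the hill-climbing step is omitted for brevity.

```python
def pack_vpack(bles, N, I):
    """Greedy VPack-style packing (sketch).

    bles: dict mapping BLE name -> set of input net names it uses.
    N:    maximum number of BLEs per cluster.
    I:    maximum number of distinct inputs per cluster.
    Returns a list of clusters, each a list of BLE names.
    """
    unclustered = dict(bles)
    clusters = []

    while unclustered:
        # Seed: the unclustered BLE with the most used inputs.
        seed = max(unclustered, key=lambda b: len(unclustered[b]))
        cluster = [seed]
        cluster_inputs = set(unclustered.pop(seed))

        while len(cluster) < N and unclustered:
            # Candidate sharing the most inputs with the current cluster.
            best = max(unclustered,
                       key=lambda b: len(unclustered[b] & cluster_inputs))
            new_inputs = cluster_inputs | unclustered[best]
            if len(new_inputs) > I:
                break  # cluster input pins used up
            cluster.append(best)
            cluster_inputs = new_inputs
            del unclustered[best]

        clusters.append(cluster)
    return clusters
```

The hill-climbing variant described above would temporarily allow the input count to exceed `I` and revert the extra additions if the overflow is never recovered; a fuller implementation would also count nets driven by BLEs already inside the cluster as shared.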
Number of Inputs per Cluster
• Lots of opportunities for input sharing in large clusters (Betz, CICC'99)
• Reducing inputs reduces the size of the device and makes it faster
• Most FPGA devices (Xilinx, Lucent) have 4 BLEs per cluster, with more inputs than actually needed
Architecture Modeling
• Tri-state buffer and pass transistor distribution
• Cluster size vs. routing resources (tile size)
• Transistor and buffer scaling based on segment length
• Flexibility of switches (is Fc = W a waste for large cluster sizes?)
Timing-Driven Clustering – T-VPack
• Optimization goals of VPack:
  • Pack each cluster to its capacity
  • Minimize the number of clusters
  • Minimize the number of inputs per cluster
  • Reduce the number of external connections
Timing-Driven Clustering – T-VPack
• Optimization goal of T-VPack:
  • Minimize the number of external connections on the critical path
• Why?
  • External connections have higher delay than internal connections
  • Reducing the number of external nets on the critical path reduces delay (see the simple delay model below)
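The argument can be stated with a simple delay model; the symbols below are introduced here for illustration and are not from the slides. If the critical path crosses n_int intra-cluster and n_ext inter-cluster connections, then

```latex
T_{\text{path}} \;\approx\; \sum_{\text{BLEs on path}} T_{\text{logic}}
  \;+\; n_{\text{int}}\, T_{\text{intra}}
  \;+\; n_{\text{ext}}\, T_{\text{inter}},
\qquad T_{\text{intra}} \ll T_{\text{inter}} .
```

Absorbing one external connection of the critical path into a cluster converts an inter-cluster term into an intra-cluster one, saving roughly T_inter - T_intra of path delay.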
Timing-Driven Clustering – T-VPack
• First stage:
  • Identify connections that are on the critical path
• Second stage:
  • Pack BLEs sequentially along the critical path
  • Recompute the criticality of the remaining BLEs
Slack and Criticality Calculation (worked example; annotated circuit figure omitted)
• A small circuit with primary inputs PI1-PI3, primary outputs PO1-PO3, and per-connection delays is annotated step by step.
• Forward pass: arrival times are propagated from the primary inputs (arrival time 0) toward the outputs; the latest arrival at a primary output, 22, is the critical-path delay.
• Backward pass: required times are propagated from the outputs (required time 22) back toward the inputs.
• Slack = required time - arrival time at each node.
• The connections with zero slack form the critical path.
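A compact sketch of these two passes, assuming the circuit is an acyclic connection graph with per-connection delays; the node names, toy netlist, and function names below are made up for illustration.

```python
from collections import defaultdict

def topo_sort(succ, nodes):
    """Kahn's algorithm; assumes the connection graph is acyclic."""
    indeg = {n: 0 for n in nodes}
    for u in nodes:
        for v, _ in succ[u]:
            indeg[v] += 1
    order, ready = [], [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order

def compute_slacks(edges):
    """edges: dict mapping (driver, sink) -> connection delay.
    Returns per-node arrival time, required time, and slack."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for (u, v), d in edges.items():
        succ[u].append((v, d))
        pred[v].append((u, d))
        nodes.update((u, v))
    order = topo_sort(succ, nodes)

    # Forward pass: arrival time = latest time a signal reaches a node,
    # starting from 0 at the primary inputs.
    arrival = {n: 0 for n in nodes}
    for u in order:
        for v, d in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + d)

    # Backward pass: required times, starting every node at the
    # critical-path delay and tightening from outputs back to inputs.
    t_crit = max(arrival.values())
    required = {n: t_crit for n in nodes}
    for v in reversed(order):
        for u, d in pred[v]:
            required[u] = min(required[u], required[v] - d)

    slack = {n: required[n] - arrival[n] for n in nodes}
    return arrival, required, slack

# Toy netlist (made up, not the circuit in the figure): zero-slack
# nodes lie on the critical path.
edges = {("PI1", "a"): 1, ("PI2", "a"): 1, ("a", "b"): 4,
         ("b", "PO1"): 5, ("b", "PO2"): 3}
arrival, required, slack = compute_slacks(edges)
critical_nodes = [n for n, s in slack.items() if s == 0]
```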
Timing-Driven Clustering – T-VPack
• The cost metric now considers both connectivity and timing criticality
• Perform a criticality analysis at the beginning, treating all wires as inter-cluster
• Determine the "base" criticality of each BLE
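The two terms are combined into a single attraction function; a hedged sketch of its general form (the exact normalization used by the tool may differ) is:

```latex
\mathrm{Attraction}(B) \;=\; \alpha \cdot \mathrm{Crit}(B)
  \;+\; (1-\alpha)\,\frac{\bigl|\mathrm{Nets}(B) \cap \mathrm{Nets}(C)\bigr|}{G}
```

Here B is the candidate BLE, C the current cluster, Crit(B) is derived from the slack of B's connections (lower slack, higher criticality), G is a normalization constant, and the weight α trades off timing criticality against input sharing.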
How to break ties?
• Initially, many paths may have the same number of BLEs
• Include a "tie-breaking" term in the performance cost function
Results for T-VPack versus VPack
• Why does the gap between VPack and T-VPack increase as N increases?
Results for T-VPack versus VPack
• T-VPack prefers to cluster a BLE with BLEs in its fan-in or fan-out
• VPack favors input sharing
• T-VPack completely absorbs many low-fanout nets
• Fewer nets to route!
Results for T-VPack versus VPack
• Why does the area-delay product show an increasing trend beyond a cluster size of 10?
Results for T-VPack versus VPack
• Increased number of nets that are completely absorbed by T-VPack
• Area-delay product:
  • Cluster sizes of 7-10 are the best choice (36-34% better than N=1)
  • N=7 vs. N=1: 30% less delay, 8% less area (see the quick check below)
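A quick arithmetic check of how the two headline numbers combine into the quoted area-delay improvement, using only the figures stated on the slide:

```latex
\frac{(A \cdot D)_{N=7}}{(A \cdot D)_{N=1}}
  \;\approx\; 0.92 \times 0.70 \;=\; 0.644
\quad\Longrightarrow\quad \text{about a } 36\% \text{ lower area-delay product.}
```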
Results for T-VPack: Delay
• Why do we see a circuit speedup?
Results for T-VPack: Delay
• Intra-cluster connections: fast; inter-cluster connections: slow
• As N increases:
  • The number of internal connections on the critical path increases
  • The number of external connections on the critical path decreases
Why are inter-cluster connections becoming faster?
• Reduction in the number of external connections (internal connections are faster)
• External connections on the critical path are becoming faster
• Reduction in routing requirements