220 likes | 228 Views
Conjoining Soft-Core FPGA Processors. David Sheldon a , Rakesh Kumar b , Frank Vahid a* , Dean Tullsen b , Roman Lysecky c a Department of Computer Science and Engineering University of California, Riverside * Also with the Center for Embedded Computer Systems at UC Irvine
E N D
Conjoining Soft-Core FPGA Processors David Sheldona, Rakesh Kumarb, Frank Vahida*, Dean Tullsenb , Roman Lyseckyc aDepartment of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine bDepartment of Computer Science and Engineering University of California, San Diego cDepartment of Electrical and Computer Engineering University of Arizona This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx
FPGA Soft Core Processors HDL Description • Soft-core Processor • HDL description • Flexible implementation • FPGA or ASIC • Technology independent FPGA ASIC Spartan 3 Virtex 2 Virtex 4 David Sheldon, UC Riverside
FPGA FPGA Soft Core Processors • Soft Core Processors can have configurable options • Datapath units • Cache • Bus architecture • Current commercial FPGA Soft-Core Processors • Xilinx Microblaze • Altera Nios μP FPU MAC Cache David Sheldon, UC Riverside
Conjoined FPU unit Conjoinment Overview • Add necessary units to both processors Application 1 Application 2 Base micro-processor Base micro-processor FPU FPU FPU FPU FPU • Conjoin the FPU Unit “Conjoining” David Sheldon, UC Riverside
Conjoinment Background • Conjoinment proposed for multicore desktop processing (Kumar 2004) • Reduces size with reasonable performance overhead • e.g., cache conjoinment overhead: 1%-13% ICache Sharing DCache Sharing David Sheldon, UC Riverside
size perf ? Outline • Conjoinment for soft-core FPGA processors • Area savings • Performance overhead • Tuning heuristic for two configurable soft-cores with conjoin option David Sheldon, UC Riverside
Barrel Shifter Divider Area Savings • Significant potential area savings • Limitations • Does not consider multiplexing costs • Due to absence of FPGA synthesis tools supporting conjoinment • But good potential justifies further investigation Multiplier 32% Base MicroBlaze 23% 6% FPU 4% Unit Size Multiplier 1331 Barrel Shifter 228 Divider 122 FPU 2738 David Sheldon, UC Riverside
size perf ? Outline • Conjoinment for soft-core FPGA processors • Area savings • Performance overhead • Tuning heuristic for two configurable soft-cores with conjoin option David Sheldon, UC Riverside
trace1 trace2 Access stall Contention stall Performance Overhead • No simulator exists for conjoined processors • We developed our own • Trace-based conjoined processor simulator • Simulation uses pessimistic performance assumptions • Kumar's techniques can improve • Simulator outputs contention information • Final cycles can be compared to unconjoined to determine performance overhead app1 app2 Xilinx simulator brev Conj. simulator bitmnp David Sheldon, UC Riverside
brev bitmnp Performance Overhead • Speedup: Application time on optimally configured processor / avg. app. time on base processor • Compared configuration with conjoinment versus without • Performance overhead usually small, averaged just 4.2% • Overhead caused by access delays and contention of the hardware units 2.4% 17% David Sheldon, UC Riverside
size perf ? Outline • Conjoinment for soft-core FPGA processors • Area savings • Performance overhead • Tuning heuristic for two configurable soft-cores with conjoin option David Sheldon, UC Riverside
Multiplier Multiplier Multiplier Barrel Shifter Divider Tuning Heuristic • 5 choices per unit • e.g., FPU – no unit, 1 only, 2 only, 1 & 2, and conjoined • 4 units 54 = 625 possible configurations • Simulation: ~30 minutes per configuration • Need search heuristic to tune Base MicroBlaze 2 Base MicroBlaze 1 NO FPU FPU 2 NO FPU FPU 1 FPU conjoined David Sheldon, UC Riverside
Synthesis Synthesis FPU Barrel Shifter Multiplier Divider FPU App perf perf perf perf size size size size Base MicroBlaze MicroBlaze Map to 0-1 Knapsack Problem Creating the model BS FPU MUL DIV Perf increment 1.1 0.9 1.2 1.0 Size increment 1.4 2.7 1.8 1.1 Perf/Size 0.96 0.34 0.63 0.93 David Sheldon, UC Riverside
Map to 0-1 Knapsack Problem • First consider tuning without conjoinment • Problem of instantiating units to limited FPGA size can be mapped to the 0-1 knapsack problem • Add items, each with weight and benefit, to weight-constrained knapsack such that profit maximized FPU 2 FPU 1 MUL 2 Items: MUL 1 2 2 1 1 Weights: 1331 228 121 1331 228 121 2738 2738 Benefits: 0.08 0.62 0.00 0.22 0.76 0.00 0.00 0.00 MUL 1 Base MicroBlaze Base MicroBlaze FPU 1 Note: Mapping inexact – weights/benefits not strictly additive MUL 2 Available FPGA Knapsack David Sheldon, UC Riverside
Disjunctively Constrained Knapsack • Problem: If conjoined unit included, can't also include standalone unit • Solution: Map to disjunctively-constrained 0-1 knapsack • Yanada T., “Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem”, 2002 • Prohibits specific item pairs from being in the knapsack • ILP solution, running time is pseudo polynomial FPU 2 FPU 1 MUL 2 Items: MUL 1 2 2 1 1 FPU C MUL C C C Base MicroBlaze Base MicroBlaze Available FPGA Knapsack David Sheldon, UC Riverside
Disjunctively Constrained Knapsack FPU 2 • Conjoined benefits shows a small decrease in benefit from the unconjoined unit FPU 1 MUL 2 Items: MUL 1 MUL 1 2 2 1 1 Weights: 1331 228 121 1331 228 121 2738 2738 Benefits: 0.08 0.62 0 0.22 0.76 0 0 0 FPU C MUL C MUL C C C Weights: 1331 228 121 2738 Benefits 1: 0.06 0.54 0 0 Benefits 2: 0.21 0.71 0 0 • Conjoined units provide benefits to both processors Base MicroBlaze Base MicroBlaze Available FPGA Knapsack David Sheldon, UC Riverside
Disjunctively Constrained Knapsack • Running Time • Modeling • 5 Synthesis runs for each Processor • At most 4 runs of the conjoined Simulator • Disjunctively Constrained 0-1 Knapsack • NP-complete problem • Solved with a heuristic • Heuristic takes < 1 min David Sheldon, UC Riverside
Results • Data gathered for the Xilinx Microblaze Soft-core Processor • 10 EEMBC and Powerstone benchmarks • aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk • Obtained results for all possible pairwise conjoinment • We only show conjoinment data when both applications use unit • To avoid making conjoinment appear better than it is David Sheldon, UC Riverside
Results Knapsack approach finds near-optimal in most cases David Sheldon, UC Riverside
Results • Knapsack heuristic finds near-optimal in most cases (versus exhaustive with conjoinment) • Runs in seconds • One example had sub-optimal results (2.9 times slower) • Performance overhead due to conjoinment just a few percent on average David Sheldon, UC Riverside
Results • On average the knapsack approach yields the same size as the exhaustive with conjoinment • Average size savings of 16% David Sheldon, UC Riverside
Conclusions • Conjoining two soft-core FPGA processors reduces average size by 16% • Performance overhead just a few percent in most cases • Disjunctively constrained 0-1 knapsack approach finds near-optimal in most cases • But could be improved for some examples • Future • Consider multiplexing size and delay overheads • Apply Kumar's advanced conjoining techniques to reduce overheads David Sheldon, UC Riverside