Application-Specific Customization of Parameterized FPGA Soft-Core Processors
David Sheldon (a), Rakesh Kumar (b), Roman Lysecky (c), Frank Vahid (a*), Dean Tullsen (b)
(a) Department of Computer Science and Engineering, University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.
FPGA Soft-Core Processors
• Soft-core processor
• HDL description
• Flexible implementation
• FPGA or ASIC
• Technology independent
[Figure: one HDL description targeting either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]
David Sheldon, UC Riverside
FPGA Soft-Core Processors
• Soft-core processors can have configurable options
  • Datapath units
  • Cache
  • Bus architecture
• Current commercial FPGA soft-core processors
  • Xilinx MicroBlaze
  • Altera Nios
[Figure: an FPGA containing a μP with optional FPU, MAC, and cache]
Goal
• Goal: Tune an FPGA soft-core microprocessor for a given application
[Figure: parameter values and the μP pass through synthesis to give a configured μP on the FPGA; the application on the configured μP is measured for time and size]
MicroBlaze: Xilinx FPGA Soft-Core
• Instantiatable units around the base MicroBlaze: barrel shifter, divider, multiplier, FPU, cache
• Significant tradeoffs
• Instantiating all units is not necessarily the fastest configuration, because added units can lengthen the critical path
Problem
• Need fast exploration
• Synthesis runs can take an hour
• This talk
  • Two approaches
    • Approach 1: Using traditional CAD techniques
    • Approach 2: Synthesis-in-the-loop
  • Results
[Figure: exploration loop in which parameter values and the μP pass through synthesis (~20-60 mins) to give a configured μP]
Constraints on Configurations
• Size constraints may prevent use of all possible units
[Figure: MicroBlaze units (multiplier, barrel shifter, FPU, divider, cache) compared against a max-area limit]
Approach 1: Traditional CAD Techniques
• Create a model of the problem
• Solve model with extensive search heuristics
• We will model this problem as a 0-1 knapsack problem
[Figure: a "create model" step (slow, includes synthesis) produces a model; an "exploration" step on the model (fast, considers 1000s of configurations) picks units such as FPU, multiplier, and cache within the max area]
Approach 1: Traditional CAD Techniques
Creating the model
[Figure: the base MicroBlaze and the MicroBlaze with each single unit added (barrel shifter, FPU, multiplier, divider, cache) are synthesized and run with the application to measure perf and size]

                 BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80
Approach 1: Traditional CAD Techniques
• 0-1 knapsack model
  • Object’s benefit = Unit’s performance increment / size increment
  • Object’s weight = Unit’s size
  • Knapsack’s size constraint = FPGA size constraint
Approach 1: Traditional CAD Techniques
• Solved the 0-1 knapsack problem using established methods
  • Toth, P., Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing, 1980
• Running time
  • 6 MicroBlaze configuration synthesis runs to create the model
  • O(n*p) to solve the model
    • n is the number of factors
    • p is the available area
  • Solving is negligible (seconds) compared to synthesis runtimes (~hour)
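The knapsack formulation above can be sketched as a standard dynamic program. This is a minimal illustration, not the authors' tool: the per-unit numbers are the example values from the slides, while the `SCALE` quantization, the function name `knapsack`, and the area budget of 4.5 used below are assumptions made so the example runs.

```python
from typing import Dict, List, Tuple

# (benefit, weight) per unit, using the slides' example values:
# benefit = Perf/Size ratio, weight = size increment.
UNITS: Dict[str, Tuple[float, float]] = {
    "BS": (0.96, 1.4),
    "FPU": (0.34, 2.7),
    "MUL": (0.63, 1.8),
    "DIV": (0.93, 1.1),
    "CACHE": (0.80, 1.6),
}

SCALE = 10  # assumption: sizes are quantized to tenths so weights are integers


def knapsack(units: Dict[str, Tuple[float, float]],
             max_area: float) -> Tuple[float, List[str]]:
    """0-1 knapsack by dynamic programming, O(n * p) for n units and area p."""
    names = list(units)
    capacity = round(max_area * SCALE)
    weights = [round(units[name][1] * SCALE) for name in names]
    benefits = [units[name][0] for name in names]

    # best[i][c] = highest total benefit using the first i units within area c
    best = [[0.0] * (capacity + 1) for _ in range(len(names) + 1)]
    for i in range(1, len(names) + 1):
        w, b = weights[i - 1], benefits[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: leave unit i out
            if w <= c and best[i - 1][c - w] + b > best[i][c]:
                best[i][c] = best[i - 1][c - w] + b  # option 2: include it

    # Walk the table backwards to recover which units were included.
    chosen: List[str] = []
    c = capacity
    for i in range(len(names), 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(names[i - 1])
            c -= weights[i - 1]
    chosen.reverse()
    return best[len(names)][capacity], chosen


total_benefit, selected = knapsack(UNITS, max_area=4.5)  # hypothetical budget
print(selected, round(total_benefit, 2))
```

With the hypothetical budget of 4.5, the sketch selects BS, DIV, and CACHE. Note that it sums per-unit benefits, which is exactly the multi-unit estimation the model relies on.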
Approach 1: Traditional CAD Techniques
• Problems
  • 100’s of target FPGAs
    • Different hard core resources (multiplier, block RAM)
  • Model approach estimates size and performance for two or more units
    • MUL speedup 1.3, DIV speedup 1.6: the model estimates MUL+DIV speedup as 1.9
    • May really be 1.7
  • Model inaccuracies may be large
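The MUL+DIV example can be reproduced with a few lines of arithmetic. The slides do not state how the model combines units; the sketch below assumes each unit's gain over 1.0 is simply added, which is consistent with 1.3 and 1.6 giving 1.9. The function name is hypothetical.

```python
from typing import Iterable


def additive_speedup_estimate(speedups: Iterable[float]) -> float:
    """Assumed combination rule: sum each unit's individual gain over 1.0."""
    return 1.0 + sum(s - 1.0 for s in speedups)


estimate = additive_speedup_estimate([1.3, 1.6])  # MUL and DIV from the slide
measured = 1.7                                    # the slide's "may really be" value
relative_error = (estimate - measured) / measured

print(f"estimated {estimate:.2f}, measured {measured:.2f}, "
      f"overestimate {relative_error:.1%}")
```

Under this assumption the model overestimates the combined speedup by roughly 12%, the kind of multi-unit inaccuracy the slide warns about.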
Approach 2: Synthesis-in-the-Loop
• Problem with traditional CAD approach
  • 100’s of target FPGAs
  • Model approach estimates size and performance for two or more units
  • Model inaccuracies may be large
• Solution: synthesis in the loop
  • No abstract model
  • Guided by actual size and performance data
  • But slow: can only explore a few configurations
[Figure: the traditional flow (create model, then explore the model) next to synthesis-in-the-loop exploration, where each step runs synthesis (size) and execution (perf) and takes 10’s of minutes]
Approach 2: Synthesis-in-the-Loop
• First pre-analyze units to guide heuristic
• Same calculations as when creating model for knapsack
[Figure: barrel shifter, floating point, multiplier, cache, and divider each measured for perf and size]
Approach 2: Synthesis-in-the-Loop
• Build “impact-ordered tree” structure
• Tree is specific to given application

Application-specific impact ordering (sort by Perf/Size):

Before sort      BS     FPU    MUL    DIV    CACHE
Perf/Size        0.96   0.34   0.63   0.93   0.80

After sort       BS     DIV    CACHE  MUL    FPU
Perf/Size        0.96   0.93   0.80   0.63   0.34
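The impact ordering amounts to a descending sort on the Perf/Size ratio. A minimal sketch using the slide's example values (the variable names are illustrative):

```python
# Perf/Size ratios from the slide's example table.
perf_per_size = {"BS": 0.96, "FPU": 0.34, "MUL": 0.63, "DIV": 0.93, "CACHE": 0.80}

# Highest-impact unit first.
impact_order = sorted(perf_per_size, key=perf_per_size.get, reverse=True)
print(impact_order)  # ['BS', 'DIV', 'CACHE', 'MUL', 'FPU']
```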
Approach 2: Synthesis-in-the-Loop
• Run tree-based search heuristic
[Figure: synthesis-in-the-loop exploration walks the impact-ordered tree; at each unit it branches to "include" or "not include" based on synthesis (size) and execution (perf)]

Unit     Perf/Size   Useful
BS       0.96        Yes
DIV      0.93        No
CACHE    0.80        No
MUL      0.63        Yes
FPU      0.34        No
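One way to picture the tree walk is a single pass over the impact-ordered units, with one synthesis-and-execute run per unit. The slides do not spell out the acceptance rule, so the sketch below assumes a unit is kept only if the resulting configuration fits the area limit and runs faster than the best configuration so far. The `evaluate` callback stands in for a real synthesis and execution run, and every number in `toy_evaluate` is made up for illustration.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Evaluation = Tuple[float, float]  # (performance, size) of one configuration


def impact_ordered_search(
    order: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], Evaluation],
    base_perf: float,
    max_area: float,
) -> Tuple[FrozenSet[str], int]:
    """Visit units in impact order; keep a unit only if it fits and helps."""
    included: FrozenSet[str] = frozenset()
    best_perf = base_perf
    runs = 0
    for unit in order:
        candidate = included | {unit}
        perf, size = evaluate(candidate)  # one synthesis + execution run
        runs += 1
        if size <= max_area and perf > best_perf:  # assumed acceptance rule
            included, best_perf = candidate, perf
    return included, runs


# Toy stand-in for synthesis + execution; all values below are hypothetical.
TOY_GAIN = {"BS": 0.3, "DIV": 0.0, "CACHE": -0.05, "MUL": 0.2, "FPU": -0.1}
TOY_SIZE = 0.2  # hypothetical size added by each unit


def toy_evaluate(config: FrozenSet[str]) -> Evaluation:
    perf = 1.0 + sum(TOY_GAIN[u] for u in config)
    size = 1.0 + TOY_SIZE * len(config)
    return perf, size


chosen, runs = impact_ordered_search(
    ["BS", "DIV", "CACHE", "MUL", "FPU"], toy_evaluate,
    base_perf=1.0, max_area=2.0,
)
print(sorted(chosen), runs)
```

With the toy numbers the walk keeps BS and MUL, mirroring the "Useful" column, and uses five exploration runs, one per unit.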
Comparison of Approaches
• Approach 1: Traditional CAD
  • 6 synthesis runs to build model
  • O(np) knapsack solution
  • Examines thousands of configurations during exploration
• Approach 2: Synthesis in the loop
  • 11 synthesis runs (6 pre-analysis, 5 exploration)
  • Examines (at most) 5 configurations during exploration
Results
• 10 EEMBC and Powerstone benchmarks
  • aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
• Average results shown, on Virtex 2 Pro, for particular size constraint
• Application-specific impact-ordered tree approach yields near-optimal results in acceptable tool runtime
• Knapsack sub-optimality due to multi-unit estimation inaccuracy
[Plot: tool run time (min) and speedup for Exhaustive, App-Spec, and Knapsack]
Results
• Obtained results for six different size constraints
• Results shown for a second size constraint
• Similar findings for all six constraints
[Plot: tool run time (min) and speedup for Exhaustive, App-Spec, and Knapsack]
Results
• Also ran for different FPGA
  • Xilinx Spartan2
  • Similar findings
[Plot: tool run time (min) and speedup for Exhaustive, App-Spec, and Knapsack]
Conclusions
• Synthesis-in-the-loop approach outperformed traditional CAD approach
  • Better results
  • Slightly longer runtime
• Application-specific impact-ordered tree heuristic served well for synthesis-in-the-loop approach
• Future
  • Extend for highly-configurable soft-core processors, and for multiple processors competing for and/or sharing resources
Questions?