Energy-Efficient Design of Kernel Applications for FPGAs Through Domain-Specific Modeling. Seonil Choi, Ronald Scrofano, and Viktor K. Prasanna, University of Southern California. MAPLD 2002, September 2002. Funded by the DARPA Power-aware Computing and Communications program.
Outline • Motivation • Design Methodology • Example Matrix Multiplication Designs • Results • MILAN
FPGAs: Current Trends • Large FPGAs (40M+ gates) • Embedded multipliers, processors • Military and commercial systems using FPGAs • Digital Signal Processing: matrix operations, FFT, window operations, filtering • Image processing • Internet • Performance metrics • Energy, Latency, and Area
Mapping Kernel Applications onto FPGAs • FPGAs lack a fixed structure comparable to that of general-purpose processors • Too fine-grained to model at a high level • Very large design space • Many degrees of freedom • Cannot simulate all designs at a low level • Energy-efficient designs • Analyze efficiency early in the design cycle • Analyze the effect of algorithm changes • Consider energy efficiency vs. area and latency • Energy consumed by configurable logic blocks and routing • Look-up tables, flip-flops, registers, RAM • Various-length interconnects
Design Methodology: Overview • Use domain-specific modeling • Explore the design space at a high level • Verify chosen designs at a low level • Select a set of designs • Steps: 1. Domain Selection 2. Domain-Specific Modeling 3. Tradeoff Analysis and Manual Design Space Exploration 4. Low-Level Simulation of Candidate Designs [Figure: Kernel Application, steps 1 to 4, Energy-Efficient Design]
Domain • A family of architectures and algorithms for a given kernel application • E.g. matrix multiplication on a linear array • Fixes the architecture of the FPGA • FPGA is too fine-grained to model at a high level • No fixed structure comparable to that of a general-purpose processor • Difficult to model at a high level • Domain imposes high-level structure • Facilitates high-level modeling and high-level performance analysis [Figure labels: Architecture, FPGA]
1. Domain Selection • Choose domains by analyzing algorithms and architectures for a given kernel • Tradeoffs in Energy, Area, Latency [Figure: Kernel; Various Architecture Families; Domain 1, Domain 2, ..., Domain n; for each domain: Domain-Specific Modeling, System-wide Energy Function, Design Space Exploration and Optimizations]
2. Domain-Specific Modeling (1) • High-level model • Model parameters are specific to the domain • Identify only those parameters that make a significant impact on energy consumption • Others need not be studied • Design is abstracted to allow easier (but coarse) tradeoff analysis and design space exploration • Benefit: Rapid evaluation of architectures and algorithms without low-level simulation • Identify candidate designs that meet requirements [Figure layers: Domain-Specific Model (parameterized); Domain (fixed architecture); FPGA (flexible architecture)]
Domain-Specific Modeling (2) [Figure: modeling flow. Labels in order: Domain; Components (RModules, Interconnects); Component-specific parameters (n, pe, f, sa); Function Estimation; Component-specific power function; Component power state matrices; System-wide energy function; Specific design in the domain; System-wide energy]
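As a rough illustration of how component-specific power functions and a power state matrix could combine into a system-wide energy function, the Python sketch below sums power times time over every component and power state. The component names, the power functions, the cycle counts, and the choice of precision as the only parameter are hypothetical placeholders, not values or formulas from the presentation.

```python
from typing import Callable, Dict

# Hypothetical power function type: power in mW as a function of precision (bits).
PowerFn = Callable[[int], float]

def system_energy(
    power_fns: Dict[str, Dict[str, PowerFn]],
    cycles_in_state: Dict[str, Dict[str, int]],
    precision: int,
    clock_hz: float,
) -> float:
    """Return energy in millijoules: sum of power * time over components and states."""
    energy_mj = 0.0
    for component, states in cycles_in_state.items():
        for state, cycles in states.items():
            power_mw = power_fns[component][state](precision)
            seconds = cycles / clock_hz
            energy_mj += power_mw * seconds  # mW * s = mJ
    return energy_mj

# Made-up power functions, one per component and power state.
power_fns = {
    "mac": {"on": lambda bits: 2.0 * bits, "off": lambda bits: 0.0},
    "cache": {"on": lambda bits: 0.5 * bits, "off": lambda bits: 0.0},
}
# Made-up power state matrix: cycles each component spends in each state.
cycles = {
    "mac": {"on": 1000, "off": 0},
    "cache": {"on": 500, "off": 500},
}
total = system_energy(power_fns, cycles, precision=8, clock_hz=1000.0)
print(total)
```

Changing a model parameter (here only `precision`) and re-evaluating the function is what makes high-level tradeoff analysis cheap compared with re-running low-level simulation.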
Estimation of Power Functions (3) • Using sample implementations • VHDL coding, simulation, measuring power using XPower • Estimation method • Generate random input vectors for estimation • Repeat experiments for statistical significance [Figure labels: Architecture, parameters with ranges of a component; VHDL code for sample designs; MILAN Model Interpreters; Low-level Simulators (XPower, ModelSim, ...); Power estimates; Power function builder (curve fitting ...); Component-specific power function]
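The "power function builder (curve fitting ...)" step might look something like the sketch below, which fits a straight line to power measurements from a few sample designs. The linear model, the use of precision as the independent variable, and the data points are all assumptions made for illustration; the slides do not say which curve family or parameters were used.

```python
from typing import List, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up sample points: precision in bits vs. measured power in mW.
precisions = [4.0, 8.0, 12.0, 16.0]
powers_mw = [9.0, 17.0, 25.0, 33.0]
slope, intercept = fit_line(precisions, powers_mw)

def power_fn(precision: float) -> float:
    """Estimated component power (mW) for a precision that was not simulated."""
    return slope * precision + intercept

print(slope, intercept, power_fn(10.0))
```

Once fitted, `power_fn` stands in for low-level simulation of that component anywhere inside the sampled parameter range.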
Low-Level Simulation of Components (4) • Accurate power estimates for RModules and Interconnects • Randomly generated test input waveforms • Switching activity is a consideration • Results can be reused [Figure: tool flow. Labels: Component VHDL; VHDL File; Xilinx XST Synthesis; Netlist; Xilinx Place&Route; .ncd file; .ncd VHDL; Waveforms; ModelSim; .vcd file; XPower; Power]
3. Tradeoff Analysis and Manual Design Space Exploration • Vary model parameters to see the effect on performance. • Analyze tradeoffs • Weed out designs that are not promising
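One way to "weed out" unpromising designs is to keep only those that no other design beats on every metric. The sketch below applies such a Pareto filter to a handful of candidates; the design names and their energy, latency, and area estimates are invented for illustration, and the presentation describes this step as manual, so this is only one possible formalization.

```python
from typing import List, NamedTuple

class Design(NamedTuple):
    name: str
    energy: float   # hypothetical high-level estimates, arbitrary units
    latency: float
    area: float

def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    no_worse = a.energy <= b.energy and a.latency <= b.latency and a.area <= b.area
    better = a.energy < b.energy or a.latency < b.latency or a.area < b.area
    return no_worse and better

def pareto_filter(designs: List[Design]) -> List[Design]:
    """Drop every design that some other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]

candidates = [
    Design("A", energy=10.0, latency=100.0, area=1.0),
    Design("B", energy=8.0, latency=50.0, area=4.0),
    Design("C", energy=12.0, latency=60.0, area=5.0),  # worse than B everywhere
    Design("D", energy=6.0, latency=25.0, area=8.0),
    Design("E", energy=9.0, latency=30.0, area=9.0),   # worse than D everywhere
]
promising = pareto_filter(candidates)
print([d.name for d in promising])
```

The surviving designs are the ones worth passing on to low-level simulation.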
4. Low-Level Simulation of Candidate Designs • Verify high-level estimation of energy and area for a design • Select the best design within the range of the estimation error among candidate designs • Similar to low-level simulation of components [Figure: tool flow. Labels: Candidate Designs; VHDL; VHDL File; Xilinx XST Synthesis; Netlist; Xilinx Place&Route; .ncd file; .ncd VHDL; Waveforms; ModelSim; .vcd file; XPower; Energy]
Example Problem: Matrix Multiplication • Multiply two n × n matrices as efficiently, in terms of energy, as possible • No hard area or latency constraints • Area and latency considered, but no specific constraints • Why matrix multiplication? • Fundamental to many applications in DSP • LU Decomposition • CFAR detection requires matrix-vector multiplication
Optimized Design from Xilinx [xilinx.com] • Provides baseline for comparison • 3 × 3 block matrix multiplication • Low area
Design 1: Uniprocessor Architecture • Same area as Xilinx design • Block matrix multiplication • Single processing element (PE) • Model Parameters: cache size, precision, power states [Figure labels: FPGA; PE; Matrix Entries; Cache; MAC]
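For readers unfamiliar with block matrix multiplication, the Python sketch below shows the general technique in software: the matrices are processed one block at a time, and each innermost step is a single multiply-accumulate (MAC), as a lone processing element would perform. The loop order and block size are assumptions; this is not a model of the actual hardware design, whose details the slides do not give.

```python
from typing import List

Matrix = List[List[int]]

def block_matmul(a: Matrix, b: Matrix, block: int) -> Matrix:
    """Multiply two n x n matrices block by block using one MAC per step."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # Outer three loops pick the blocks; a small cache would hold one block each.
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + block, n)):
                            acc += a[i][k] * b[k][j]  # one multiply-accumulate
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
unchanged = block_matmul(a, identity, block=2)
squared = block_matmul(a, a, block=2)
print(unchanged)
print(squared[0])
```

The block size plays the role of the cache-size model parameter: larger blocks need more on-chip storage but reuse each fetched entry more often.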
Design 2: Linear Array Architecture • Low latency • Array of processing elements (PEs) • Model parameters: number of PEs, precision, power states
Experimental Procedure for Low-Level Simulations • Code designs in VHDL • Synthesize • Place and route • Simulate with test input waveforms • Measure power dissipation with XPower
Low-Level Simulation for Statistical Analysis of Energy Dissipation • Energy depends on the switching activity of the input data • Simulation for statistical significance • 50 randomly generated sets of input matrices • Comparison with Xilinx design for 3 × 3 matrix multiplication • A confidence interval gives the range around the experimental value in which the true value lies with the given confidence level • With 95% confidence, our design consumes 32% less energy compared to the Xilinx design [Figure legend: Linear Array; Xilinx; Difference]
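A confidence interval of this kind can be computed as in the sketch below, which uses the normal approximation for the mean of 50 samples. The energy readings are synthetic (drawn from a seeded random generator), and the normal approximation is an assumption; the slides do not state which interval method was used.

```python
import random
import statistics
from typing import List, Tuple

def mean_confidence_interval(
    samples: List[float], confidence: float = 0.95
) -> Tuple[float, float]:
    """Normal-approximation confidence interval for the mean of the samples."""
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / len(samples) ** 0.5
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean - z * std_err, mean + z * std_err

# Synthetic stand-in for 50 energy measurements, one per random input set.
rng = random.Random(0)
energies = [rng.gauss(100.0, 5.0) for _ in range(50)]
low, high = mean_confidence_interval(energies)
print(low, high)
```

Running the same calculation on the per-input energy difference between two designs gives an interval for the savings, which is the form of the claim made on the slide.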
Accuracy of the High-Level Model • Xilinx ISE 4.1i and XPower are used to measure the system-wide energy • Xilinx Virtex-II XC2V1500 device is used
MILAN Objectives • MILAN is a model-based, extensible simulation framework • It provides a unified environment capable of: • modeling a large class of embedded systems and applications • driving design space exploration tools for rapid evaluation of a large design space • seamlessly integrating different widely used simulators into a single framework for hierarchical simulation • enabling rapid evaluation of different performance metrics such as energy, latency, and throughput
Design Flow Using MILAN [Figure: design flow. Labels: Application (Task Graph); Hardware Resources; Application Model; Resource Model; Constraints; Generic Modeling Environment (GME 2000); Design Space; Design Space Exploration (analytical technique); Offline Estimates; High-level Perf. Estimator; Identify a set of designs; Hierarchical Simulation: Instruction Level Simulator, Cycle Accurate Simulator, RT-level Simulator; Final Design; axes: Accuracy, Level of abstraction]
Concluding Remarks • Design methodology based on domain-specific modeling • High-level energy estimation • High-level tradeoff analysis • Matrix multiplication example • Improved energy-efficiency compared to Xilinx design (baseline) • Further References • “Energy-Efficient Matrix Multiplication on FPGAs” (FPL 2002) • “Energy Efficiency of FPGAs and Programmable Processors for Matrix Multiplication” (Manuscript)