A Power Efficient Image Convolution Engine for Field Programmable Gate Arrays
Matthew French, mfrench@isi.edu
University of Southern California, Information Sciences Institute
3811 North Fairfax Dr, Suite 200, Arlington, VA 22203
Xilinx FPGA Power Trend
[Figure: Figure of Merit vs. Xilinx Family]
• Number of Logic Blocks and Maximum Operating Frequency both loosely track Moore’s Law
• Voltage Reduction is Slower
• Resulting Power Increase is Exponential!
Power Sensitive Applications
• Need to consider power as a first-class design constraint
• SRAM-based FPGA Quiescent Power is based on total circuit size
• Dynamic Power depends on
  • Toggle Rates (Data Dependent)
  • Components Used
  • Routing
• Actual Quiescent and Dynamic Power are not known until the Circuit is Placed and Routed
• For high accuracy, further simulation is necessary on the timing model
• Tools do timing-driven placement and routing
• So how does one design for low power?
Virtex-II Component Power Profile
• Derive micro-architecture feature capacitances from
  • Xilinx Power Estimation Spreadsheets
  • XPower Designs
  • Power Monitoring Testbed
  • Shang, Kaviani, Bathala, “Dynamic Power Consumption in Virtex-II FPGA Family,” FPGA ’02
• Only trying to establish relative capacitances
  • Models too imprecise to be exact
• Derive Low-Power Design Strategy
  • Minimize Multipliers
  • Use Shortest Interconnect
Traditional Image Convolution
[Figure: Tap Mask × Input Data = Partial Products, summed into the Output pixel]
• Slide Tap Mask Over Image
• Multiply each pixel
• Sum all Partial Products
• Resulting in new Filtered Pixel
• Operations
  • 9 Multiplies & 9 Additions / Output Pixel
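The sliding-mask procedure on this slide can be sketched in a few lines of Python. This is only a software illustration of the arithmetic, not the hardware design; the function name `convolve3x3`, the list-of-rows image layout, and the choice to skip border pixels are assumptions made for the example.

```python
def convolve3x3(image, mask):
    """Direct 3x3 filtering of `image` (a list of rows) with a 3x3 `mask`.

    Each output pixel costs 9 multiplies and 9 accumulations.
    Only interior pixels are produced (no border padding), and the mask
    is applied without flipping, as in a sliding tap-mask picture.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    # One partial product per tap, added into the running sum.
                    acc += mask[i][j] * image[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    all_ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(convolve3x3(image, all_ones))  # [[45]]
```

The nine multiplications in the inner loops are what the straightforward hardware version maps onto nine parallel multipliers.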
Straightforward Implementation
• 3x3 Kernel = 9 parallel multipliers
• Multipliers are resource limited in FPGAs
  • Virtex-E
    • Instantiated in configurable logic
    • XCV3200E: ~81 Multipliers Max
    • 9 Pixels in Parallel
  • Virtex-II
    • Embedded Multiplier Blocks
    • XC2V8000: 168 Multipliers
    • 18 Pixels in Parallel
• Adder Trees Relatively Cheap
  • 100’s of slices
  • XCV3200E: 32,000 slices
  • XC2V8000: 46,000 slices
• This also reflects Power Prioritization
Convolution Kernel Types: Closer Look
• Spatial Filtering
  • Blurring, Smoothing (Lowpass)
  • Sharpening (Highpass)
  • Noise Reduction
• Edge Detection
  • Derivative Filters
  • Roberts
  • Prewitt
  • Sobel
[Figure: example 3x3 tap masks. Smoothing Filter (all taps 1/9): 1 Unique Tap Value. Sharpening Filter (taps -1 and 8): 2 Unique Tap Values. Edge Detection Filter: 2 Unique Tap Values. Prewitt Basis (taps -1, 0, +1): 3 Unique Tap Values. Sobel Basis (taps -2, -1, 0, +1, +2): 5 Unique Tap Values.]
• Filter Tap Values Reused Often
• Can We Exploit This?
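The unique-tap counts quoted on this slide can be checked with a short script. The kernel layouts below are the common textbook forms of the smoothing, sharpening, Prewitt, and Sobel masks, filled in as an assumption because the slide's own matrices did not survive extraction; the names `KERNELS` and `unique_taps` are illustrative.

```python
from fractions import Fraction

NINTH = Fraction(1, 9)

# Textbook 3x3 masks (assumed layouts, not recovered from the slide).
KERNELS = {
    "smoothing": [[NINTH] * 3 for _ in range(3)],
    "sharpening": [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],
    "prewitt": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "sobel": [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
}


def unique_taps(mask):
    """Count distinct tap values, treating sign and zero as distinct values."""
    return len({tap for row in mask for tap in row})


if __name__ == "__main__":
    for name, mask in KERNELS.items():
        print(name, unique_taps(mask))
```

Under this counting rule the four masks give 1, 2, 3, and 5 unique taps, matching the figures on the slide.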
1-D Symmetric FIR Filter Lessons
• Telecommunication and Radar Communities
  • Exploit symmetric Filters: C(k) = C(K-(k+1))
  • Reorder Additions Before Multiplication
  • 1/2 the Multipliers Necessary
• Can We Exploit 2-D Symmetry?
  • Tap Values Reprogrammable
  • Tap Symmetry Reprogrammable
  • Minimize Multipliers
  • Leverage Large Amount of Configurable Logic Blocks
• Benefits of Increased Parallelism
  • Higher Throughput
  • More Efficient Power Utilization Over Time
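The 1-D trick referred to here, adding the two samples that share a coefficient before multiplying, can be sketched as follows. The function names `fir_direct` and `fir_symmetric` are hypothetical, and the code assumes the coefficients satisfy the slide's symmetry condition C(k) = C(K-(k+1)).

```python
def fir_direct(x, c):
    """Direct-form FIR: K multiplies per output sample."""
    K = len(c)
    return [sum(c[k] * x[n - k] for k in range(K))
            for n in range(K - 1, len(x))]


def fir_symmetric(x, c):
    """Symmetric FIR with pre-addition: about K/2 multiplies per output.

    Assumes c[k] == c[K - 1 - k] for every k.
    """
    K = len(c)
    half = K // 2
    out = []
    for n in range(K - 1, len(x)):
        acc = 0
        for k in range(half):
            # Add the two samples that share coefficient c[k], then multiply once.
            acc += c[k] * (x[n - k] + x[n - (K - 1 - k)])
        if K % 2:
            # Odd length: the centre tap has no partner.
            acc += c[half] * x[n - half]
        out.append(acc)
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7]
    coeffs = [1, 2, 3, 2, 1]
    print(fir_direct(samples, coeffs) == fir_symmetric(samples, coeffs))  # True
```

Both functions return the same output; the second simply moves additions ahead of the multiplications so fewer multipliers are needed.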
Key Ideas
• Number of Active Multipliers Varies with Tap Mask
• Turn off unused Multipliers to lower power
• Or, use unused Multipliers to process next pixel
  • Requires parallel memory accesses
  • Higher throughput
  • Finish sooner, then sleep device
  • Lower Clock Rate
• Adder Tree layers before and after multiply vary with number of Multipliers per pixel
• Input Data must be routable to each multiplier
• Will multiplier savings outweigh extra routing, multiplexing, and larger circuit quiescent power?
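As a rough software analogue of these ideas, one output pixel can be computed by first summing the input pixels that share a tap value and then multiplying each sum once. This is a sketch under the assumption that one multiplier is used per unique tap value (zero included, as in the tap counts quoted in this deck); the name `grouped_output_pixel` is illustrative, not taken from the design.

```python
def grouped_output_pixel(window, mask):
    """Compute one filtered pixel from a 3x3 `window` and 3x3 `mask`.

    Returns (pixel, multiplies_used). Pixels under equal tap values are
    added first (adder tree before the multiply), then each group sum is
    multiplied by its tap once and the products are summed (adder tree
    after the multiply).
    """
    group_sums = {}
    for i in range(3):
        for j in range(3):
            tap = mask[i][j]
            group_sums[tap] = group_sums.get(tap, 0) + window[i][j]
    pixel = sum(tap * total for tap, total in group_sums.items())
    return pixel, len(group_sums)


if __name__ == "__main__":
    window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    sobel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # textbook Sobel mask
    print(grouped_output_pixel(window, sobel))  # (8, 5)
```

The result equals the direct nine-multiply sum, but the multiplier count drops to the number of unique taps, which is what frees multipliers to be switched off or reassigned to the next pixel.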
Adaptive Convolution Kernel Sizing
• Implementing Multiple Pixel Version
• How Many Multipliers to Use?
  • Multiple of 9
  • Size that is easy to place and allows for TMR growth
• 18 Multipliers Per Kernel
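To see what 18 multipliers per kernel buys, the sketch below tabulates how many output pixels could be computed per pass for each unique-tap count. It assumes one multiplier per unique tap per pixel and no sharing of multipliers across pixels; the slides do not state this formula, and `pixels_per_pass` is a hypothetical helper.

```python
MULTIPLIERS_PER_KERNEL = 18  # figure quoted on the slide


def pixels_per_pass(unique_taps, multipliers=MULTIPLIERS_PER_KERNEL):
    """Pixels computable in parallel, assuming one multiplier per unique tap."""
    if not 1 <= unique_taps <= 9:
        raise ValueError("a 3x3 mask has between 1 and 9 unique taps")
    return multipliers // unique_taps


if __name__ == "__main__":
    for taps in range(1, 10):
        print(taps, pixels_per_pass(taps))
```

Under this assumption, masks with 7 or more unique taps yield 2 pixels per pass, the same as a fixed design that spends 9 of the 18 multipliers on each pixel. That is at least consistent with the crossover at 7 taps reported later in the deck, although the slides do not spell out this reasoning.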
Kernel Block Diagram
[Figure: block diagram. Labeled blocks: Register Delay Bank, Data Mux, Common Tap Mux, Adder Tree, multipliers (M), Tap Mask State Machine. Labeled signals: Input Row 0 through Input Row 19, Output 0 through Output 17, Number of Unique Taps, Tap Value.]
• Group Data Values with Common Taps
• Dynamically Adjust Multiplier Position within Adder Tree
Implementation Comparison
• Quiescent Power 35% Higher
Total Energy Comparison
• For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power
What is hard?
• Poor Tool Support for Power Design
• Analyzing Power Trade-offs can be complex & time consuming
  • Have to have fully routed and simulated designs to compare approaches
• Router is optimized for throughput, not power
• Finding all Chip Enables to Disable
  • For each of several different multiplexer settings
• Secondary Power Effects
• Can also use Relative Placement Macros to “help” Router
  • Finding where can be time consuming
Analysis
• For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power
  • Crossover point at 7 taps is an implementation limitation of using 18 multipliers in kernel
• Quiescent Power
  • Not much larger considering extra circuitry
  • 18 Adder Trees, 16 Block RAMs
• Dynamic Power Consumption
  • Observed to vary by +50% within one circuit from one place and route to another, even using same settings
  • Average of 3 routes used for each circuit
• For Systems Where Parallelizing Input Data Stream Is Difficult
  • Disabling extra Multipliers is best approach
  • Power savings expected to be less
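The trade-off described above can be illustrated with a toy energy model. Only the 35% quiescent-power increase and the 18-multiplier kernel come from the slides; the wattages, the frame time, the assumption that dynamic power is equal in both designs, and the assumption that the device sleeps at zero power once a frame is finished are all hypothetical placeholders, not measured results.

```python
BASE_QUIESCENT_W = 1.0  # hypothetical baseline quiescent power
DYNAMIC_W = 2.0         # hypothetical dynamic power, assumed equal in both designs
BASE_TIME_S = 1.0       # hypothetical baseline frame time at 2 pixels per pass


def total_energy(quiescent_w, dynamic_w, active_time_s):
    """Energy in joules if power is drawn only while the device is active."""
    return (quiescent_w + dynamic_w) * active_time_s


def adaptive_energy(unique_taps):
    """Energy of the adaptive kernel under the toy model.

    Assumes 18 // unique_taps pixels per pass and 35% higher quiescent power.
    """
    pixels = 18 // unique_taps
    active_time = BASE_TIME_S * 2 / pixels
    return total_energy(1.35 * BASE_QUIESCENT_W, DYNAMIC_W, active_time)


BASELINE_ENERGY = total_energy(BASE_QUIESCENT_W, DYNAMIC_W, BASE_TIME_S)

if __name__ == "__main__":
    for taps in range(1, 10):
        verdict = "saves" if adaptive_energy(taps) < BASELINE_ENERGY else "costs"
        print(taps, round(adaptive_energy(taps), 3), verdict)
```

With these placeholder numbers the adaptive design uses less energy for 1 to 6 unique taps and more for 7 to 9, because from 7 taps onward it no longer finishes sooner yet still pays the higher quiescent power. The real crossover depends on measured power figures that this sketch does not attempt to reproduce.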
Conclusions
• Substantial Power Savings can be Achieved by Making Power a First-Class Design Constraint
• Knowledge of Underlying Resource Capacitance is a Key Foundation
• Re-use Power-Critical Components
• Routing Can Be Influenced to Yield Lower Power
  • Over-constrain timing on power sensitive nets
  • Use Relative Placement Macros (RPMs)