
A Power Efficient Image Convolution Engine for Field Programmable Gate Arrays


Presentation Transcript


1. A Power Efficient Image Convolution Engine for Field Programmable Gate Arrays
Matthew French, mfrench@isi.edu
University of Southern California, Information Sciences Institute
3811 North Fairfax Dr, Suite 200, Arlington, VA 22203

2. Xilinx FPGA Power Trend
[Chart: figure of merit vs. Xilinx family]
• Number of logic blocks and maximum operating frequency both loosely track Moore's Law
• Voltage reduction is slower
• The resulting power increase is exponential!

3. Power Sensitive Applications
• Need to consider power as a first-class design constraint
• SRAM-based FPGA quiescent power depends on total circuit size
• Dynamic power depends on:
  • Toggle rates (data dependent)
  • Components used
  • Routing
• Actual quiescent and dynamic power are not known until the circuit is placed and routed
  • For high accuracy, further simulation on the timing model is necessary
• Tools do timing-driven placement and routing
• So how does one design for low power?
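
Since quiescent power tracks circuit size while dynamic power tracks toggle rates, switched capacitance, voltage, and clock frequency, power planning can start from the standard first-order CMOS power relation. The sketch below is a minimal illustration of that relation, not the authors' model; the struct fields and the example values plugged in are assumptions.

```c
#include <stdio.h>

/* First-order dynamic power model (assumed form, for illustration only):
 * each net contributes alpha * C * V^2 * f, where alpha is its toggle rate,
 * C its switched capacitance (component + routing), V the core voltage and
 * f the clock frequency. Quiescent power is roughly proportional to the
 * configured circuit size and is not modeled here. */
typedef struct {
    double alpha;   /* toggle rate (0..1), data dependent */
    double cap_f;   /* switched capacitance in farads     */
} net_t;

double dynamic_power_w(const net_t *nets, int n, double vdd, double freq_hz)
{
    double p = 0.0;
    for (int i = 0; i < n; i++)
        p += nets[i].alpha * nets[i].cap_f * vdd * vdd * freq_hz;
    return p;
}

int main(void)
{
    net_t nets[] = { {0.25, 20e-12}, {0.10, 35e-12} };   /* illustrative values */
    printf("P_dyn ~ %.3f mW\n",
           1e3 * dynamic_power_w(nets, 2, 1.5, 100e6));   /* 1.5 V core, 100 MHz */
    return 0;
}
```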

4. Virtex-II Component Power Profile
• Derive micro-architecture feature capacitances from:
  • Xilinx power estimation spreadsheets
  • XPower designs
  • A power monitoring testbed
  • Shang, Kaviani, Bathala, "Dynamic Power Consumption in Virtex-II FPGA Family," FPGA '02
• Only trying to establish relative capacitances
  • The models are too imprecise to be exact
• Derived low-power design strategy:
  • Minimize multipliers
  • Use the shortest interconnect

5. Traditional Image Convolution
[Diagram: tap mask × input data = partial products, summed to produce the output]
• Slide the tap mask over the image
• Multiply each pixel by its tap
• Sum all partial products
• The result is a new filtered pixel
• Operations: 9 multiplies and 9 additions per output pixel
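
A minimal software sketch of the traditional scheme described above, to fix the operation count; names and border handling are illustrative only.

```c
#include <stdint.h>

/* Naive 3x3 convolution: for every output pixel, slide the tap mask over the
 * image, multiply each of the 9 covered pixels by its tap, and sum the
 * partial products (9 multiplies and 9 additions per output pixel).
 * Border handling and fixed-point scaling are omitted for brevity. */
void convolve3x3(const uint8_t *in, int16_t *out,
                 int width, int height, const int16_t tap[3][3])
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int32_t acc = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += tap[ky + 1][kx + 1] * in[(y + ky) * width + (x + kx)];
            out[y * width + x] = (int16_t)acc;   /* one new filtered pixel */
        }
    }
}
```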

6. Straightforward Implementation
• A 3x3 kernel needs 9 parallel multipliers
• Multipliers are the limiting resource in FPGAs
  • Virtex-E: instanced in configurable logic
    • XCV3200E: ~81 multipliers max, so 9 pixels in parallel
  • Virtex-II: embedded multiplier blocks
    • XC2V8000: 168 multipliers, so 18 pixels in parallel
• Adder trees are relatively cheap (hundreds of slices)
  • XCV3200E: 32,000 slices
  • XC2V8000: 46,000 slices
• This also reflects the power prioritization
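
For contrast with the shared-tap datapath introduced later, this is a sketch of the per-pixel computation the straightforward implementation instantiates; the function name is an assumption.

```c
#include <stdint.h>

/* Straightforward per-pixel datapath: a dedicated bank of 9 multipliers
 * (the 9 explicit products below) followed by an adder tree. In hardware
 * the whole expression evaluates in parallel each cycle, and the multiplier
 * count per pixel stays at 9 no matter how few distinct tap values the
 * mask actually contains. */
static inline int32_t kernel9(const uint8_t p[9], const int16_t t[9])
{
    return (int32_t)t[0]*p[0] + t[1]*p[1] + t[2]*p[2]
         + t[3]*p[3] + t[4]*p[4] + t[5]*p[5]
         + t[6]*p[6] + t[7]*p[7] + t[8]*p[8];
}
```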

7. Convolution Kernel Types: A Closer Look
• Spatial filtering
  • Blurring, smoothing (lowpass)
  • Sharpening (highpass)
  • Noise reduction
  • Edge detection
• Derivative filters
  • Roberts
  • Prewitt
  • Sobel
[Figure: example 3x3 masks and their unique tap counts: smoothing (all 1/9, 1 unique value), sharpening and edge detection (2 unique values each), Prewitt basis (-1/0/+1, 3 unique values), Sobel basis (-2/-1/0/+1/+2, 5 unique values)]
• Filter tap values are reused often
• Can we exploit this?
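
A small sketch of the observation the slide makes: counting distinct tap values in a mask shows how much reuse there is to exploit. The helper and example masks below are illustrative.

```c
#include <stdio.h>

/* Count the distinct tap values in a 3x3 mask. A smoothing mask (all 1/9,
 * scaled to integers here) has 1 unique tap; a Sobel basis contains 5
 * distinct signed values (3 if sign is folded into the adder tree).
 * Fewer unique taps means fewer multipliers once pixels sharing a tap
 * are summed first. */
int unique_taps(const int tap[9])
{
    int uniq[9], n = 0;
    for (int i = 0; i < 9; i++) {
        int seen = 0;
        for (int j = 0; j < n; j++)
            if (uniq[j] == tap[i]) { seen = 1; break; }
        if (!seen) uniq[n++] = tap[i];
    }
    return n;
}

int main(void)
{
    const int smooth[9] = { 1, 1, 1,  1, 1, 1,  1, 1, 1 };
    const int sobel[9]  = {-1, 0, 1, -2, 0, 2, -1, 0, 1 };
    printf("smoothing: %d, sobel: %d unique taps\n",
           unique_taps(smooth), unique_taps(sobel));
    return 0;
}
```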

8. 1-D Symmetric FIR Filter Lessons
• The telecommunication and radar communities exploit symmetric filters: C(k) = C(K-(k+1))
  • Reorder the additions before the multiplication
  • Only 1/2 the multipliers are necessary (see the sketch below)
• Can we exploit 2-D symmetry?
  • Tap values reprogrammable
  • Tap symmetry reprogrammable
  • Minimize multipliers
  • Leverage the large amount of configurable logic blocks
• Benefits of increased parallelism:
  • Higher throughput
  • More efficient power utilization over time
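
A minimal sketch of the 1-D add-before-multiply trick, assuming an even tap count; the function name is illustrative.

```c
/* Symmetric FIR: when c[k] == c[K-(k+1)], the two samples that share a
 * coefficient are added first and multiplied once, halving the number of
 * multipliers (K/2 multiplies instead of K). Assumes K is even. */
long fir_symmetric(const int *x, const int *c, int K)
{
    long acc = 0;
    for (int k = 0; k < K / 2; k++)
        acc += (long)c[k] * (x[k] + x[K - (k + 1)]);   /* add, then one multiply */
    return acc;
}
```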

9. Key Ideas
• The number of active multipliers varies with the tap mask
  • Turn off unused multipliers for lower power
  • Or use the unused multipliers to process the next pixel (sketched below)
    • Requires parallel memory accesses
    • Higher throughput
    • Finish sooner and sleep the device
    • Lower clock rate
• The adder tree layers before and after the multiply vary with the number of multipliers per pixel
• Input data must be routable to each multiplier
• Will the multiplier savings outweigh the extra routing, multiplexing, and larger circuit quiescent power?
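
A software analogue of the 2-D shared-tap idea, sketched under assumptions: the grouping table and names below are illustrative, not the authors' datapath.

```c
#include <stdint.h>

/* Shared-tap convolution of one 3x3 window: pixels whose positions share a
 * tap value are summed first (the adder tree before the multipliers), then
 * one multiplier per unique tap scales each group sum, and a final adder
 * tree combines the products. With few unique taps, the spare multipliers
 * can be disabled or used to start the next output pixel. */
int32_t convolve_shared_taps(const uint8_t win[9],   /* 3x3 input window        */
                             const int16_t uniq[],    /* unique tap values       */
                             const int8_t  group[9],  /* pixel -> unique-tap idx */
                             int n_uniq)
{
    int32_t group_sum[9] = {0};
    for (int i = 0; i < 9; i++)            /* adder tree before the multipliers */
        group_sum[group[i]] += win[i];

    int32_t acc = 0;
    for (int u = 0; u < n_uniq; u++)       /* one multiply per unique tap value */
        acc += (int32_t)uniq[u] * group_sum[u];
    return acc;
}
```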

10. Adaptive Convolution Kernel Sizing
• Implementing the multiple-pixel version: how many multipliers to use?
  • A multiple of 9
  • A size that is easy to place and allows for TMR growth
• Result: 18 multipliers per kernel (see the sketch below)
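
A hedged arithmetic sketch of how the fixed multiplier budget relates to parallelism; the simple division below is an assumption about the scheduling, not a statement of the authors' exact scheme.

```c
/* Assumed relation between the multiplier budget and parallelism: with 18
 * multipliers per kernel and one multiplier needed per unique tap value,
 * roughly 18 / n_unique_taps output pixels can be computed at once;
 * leftover multipliers can be clock-disabled to save power. */
int pixels_in_parallel(int n_multipliers, int n_unique_taps)
{
    return n_multipliers / n_unique_taps;   /* e.g. 18 / 2 = 9, 18 / 9 = 2 */
}
```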

11. Kernel Block Diagram
[Block diagram: input rows 0-19 feed a register delay bank; a common-tap mux and data mux group data values that share a tap value; the multipliers feed output adder trees producing outputs 0-17; a tap mask state machine supplies each tap value and the number of unique taps, and dynamically adjusts each multiplier's position within the adder tree]

12. Implementation Comparison
[Chart: power comparison against the straightforward implementation; the proposed engine's quiescent power is 35% higher]

13. Total Energy Comparison
• For higher tap commonality, the shorter dynamic power consumption window overcomes the higher quiescent power (illustrated in the sketch below)
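
To make the energy argument concrete: energy is power integrated over active time, so a design that burns more quiescent power but finishes sooner can still come out ahead. The helper below is a minimal illustration of that accounting; the function name and any numbers plugged into it are assumptions, not measured results.

```c
/* Minimal energy accounting: total energy = (quiescent + dynamic power) x
 * active processing time. The shared-tap engine pays roughly 35% more
 * quiescent power, but with high tap commonality it computes more pixels
 * per cycle, shrinking the active window enough that total energy drops. */
double total_energy_j(double p_quiescent_w, double p_dynamic_w, double t_active_s)
{
    return (p_quiescent_w + p_dynamic_w) * t_active_s;
}
```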

14. What Is Hard?
• Poor tool support for power-aware design
  • Analyzing power trade-offs can be complex and time consuming
  • Designs must be fully routed and simulated before approaches can be compared
  • The router is optimized for throughput, not power
• Finding all chip enables to disable
  • For each of several different multiplexer settings
• Secondary power effects
  • Relative placement macros can also be used to "help" the router
  • Finding where can be time consuming

15. Analysis
• For higher tap commonality, the shorter dynamic power consumption window overcomes the higher quiescent power
  • The crossover point at 7 taps is an implementation limitation of using 18 multipliers per kernel
• Quiescent power
  • Not much larger considering the extra circuitry: 18 adder trees, 16 block RAMs
• Dynamic power consumption
  • Observed to vary by as much as 50% within one circuit from one place-and-route run to another, even with the same settings
  • An average of 3 routes was used for each circuit
• For systems where parallelizing the input data stream is difficult
  • Disabling the extra multipliers is the best approach
  • Power savings are expected to be less

16. Conclusions
• Substantial power savings can be achieved by making power a first-class design constraint
• Knowledge of the underlying resource capacitances is a key foundation
• Re-use power-critical components
• Routing can be influenced to yield lower power
  • Over-constrain timing on power-sensitive nets
  • Use relative placement macros (RPMs)
