A Power Efficient Image Convolution Engine for Field Programmable Gate Arrays

Matthew French
mfrench@isi.edu
University of Southern California, Information Sciences Institute
3811 North Fairfax Dr, Suite 200
Arlington, VA 22203

Xilinx FPGA Power Trend
[Figure: Figure of Merit by Xilinx Family]
• Number of Logic Blocks & Maximum Operating Frequency both loosely track Moore’s Law
• Voltage Reduction is Slower
• Resulting Power Increase is Exponential!

Power Sensitive Applications
• Need to consider power as a first-class design constraint
• SRAM-based FPGA Quiescent Power is based on total circuit size
• Dynamic Power depends on
  • Toggle Rates (Data Dependent)
  • Components Used
  • Routing
• Actual Quiescent and Dynamic Power not known until Circuit is Placed and Routed
  • For high accuracy, further simulation necessary on timing model
• Tools do timing-driven placement and routing
• So how does one design for low power?
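The dependence of dynamic power on toggle rate, components, and routing can be illustrated with the standard CMOS switching-power estimate, P = α · C · V² · f. The sketch below is not taken from the slides; the function name and all numeric values are hypothetical and only show how each factor scales the result.

```python
def dynamic_power_watts(toggle_rate, capacitance_farads, voltage_volts, clock_hz):
    """Standard CMOS switching-power estimate: P = alpha * C * V^2 * f.

    toggle_rate is the fraction of clock cycles on which the node switches
    (data dependent); capacitance lumps together the components and routing
    being driven.
    """
    return toggle_rate * capacitance_farads * voltage_volts ** 2 * clock_hz


# Hypothetical operating point, chosen only for illustration.
base = dynamic_power_watts(0.25, 1e-9, 1.5, 100e6)

# Halving the toggle rate halves the dynamic power.
half_toggle = dynamic_power_watts(0.125, 1e-9, 1.5, 100e6)

# Lowering the supply from 1.5 V to 1.2 V scales power by (1.2 / 1.5)^2.
lower_voltage = dynamic_power_watts(0.25, 1e-9, 1.2, 100e6)

print(half_toggle / base, lower_voltage / base)
```

Because capacitance enters linearly, shortening interconnect or removing a high-capacitance component reduces power in direct proportion, which is why relative capacitances are enough to rank design options.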

Virtex-II Component Power Profile
• Derive micro-architecture feature capacitances from
  • Xilinx Power Estimation Spreadsheets
  • XPower Designs
  • Power Monitoring Testbed
  • Shang, Kaviani, Bathala, “Dynamic Power Consumption in Virtex-II FPGA Family,” FPGA ’02
• Only trying to establish relative capacitances
  • Models too imprecise to be exact
• Derive Low-Power Design Strategy
  • Minimize Multipliers
  • Use Shortest Interconnect

Traditional Image Convolution
[Figure: Tap Mask × Input Data = Partial Products, summed into Output]
• Slide Tap Mask Over Image
• Multiply each pixel
• Sum all Partial Products
• Resulting in new Filtered Pixel
• Operations
  • 9 Multiplies & 9 Additions / Output Pixel
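The steps on this slide can be sketched in software as follows. This is a minimal reference model, not the hardware design: the function name `convolve3x3` is hypothetical, the mask is applied without flipping (as the slide describes), and only positions where the mask fits entirely inside the image are computed.

```python
def convolve3x3(image, mask):
    """Slide a 3x3 tap mask over `image` (a list of rows of numbers).

    Each output pixel costs 9 multiplies and 9 additions (accumulating
    into a running sum that starts at zero).
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    # One partial product per tap, summed into the new pixel.
                    acc += mask[i][j] * image[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Sharpening mask applied to a linear ramp: a highpass filter whose taps
# sum to zero gives zero response on a perfectly linear image.
sharpen = [[-1, -1, -1],
           [-1,  8, -1],
           [-1, -1, -1]]
ramp = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
filtered = convolve3x3(ramp, sharpen)
print(filtered)
```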

Straightforward Implementation
• 3x3 Kernel = 9 parallel multipliers
• Multipliers are resource limited in FPGAs
  • Virtex-E
    • Instantiated in configurable logic
    • XCV3200E: ~81 Multipliers Max
    • 9 Pixels in Parallel
  • Virtex-II
    • Embedded Multiplier Blocks
    • XC2V8000: 168 Multipliers
    • 18 Pixels in Parallel
• Adder Trees Relatively Cheap
  • 100’s of slices
  • XCV3200E: 32,000 slices
  • XC2V8000: 46,000 slices
• This also reflects Power Prioritization

Convolution Kernel Types: Closer Look
• Spatial Filtering
  • Blurring, Smoothing (Lowpass)
  • Sharpening (Highpass)
  • Noise Reduction
  • Edge Detection
• Derivative Filters
  • Roberts
  • Prewitt
  • Sobel

Smoothing Filter (1 Unique Tap Value)
  1/9  1/9  1/9
  1/9  1/9  1/9
  1/9  1/9  1/9

Sharpening Filter (2 Unique Tap Values)
  -1  -1  -1
  -1   8  -1
  -1  -1  -1

Edge Detection Filter (2 Unique Tap Values): taps of +1 and -1

Prewitt Basis (3 Unique Tap Values)
  -1   0  +1      -1  -1  -1
  -1   0  +1       0   0   0
  -1   0  +1      +1  +1  +1

Sobel Basis (5 Unique Tap Values)
  -1   0  +1      -1  -2  -1
  -2   0  +2       0   0   0
  -1   0  +1      +1  +2  +1

• Filter Tap Values Reused Often
• Can We Exploit This?
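The unique-tap counts quoted on this slide can be checked with a few lines of code. The sketch below uses the standard textbook forms of the smoothing, sharpening, Prewitt, and Sobel masks; the helper name `unique_taps` and the dictionary `KERNELS` are hypothetical. Note that a zero tap is counted as a unique value, matching the slide's counts of 3 for Prewitt and 5 for Sobel.

```python
from fractions import Fraction

NINTH = Fraction(1, 9)

# Standard textbook masks (one orientation each for Prewitt and Sobel).
KERNELS = {
    "smoothing":  [[NINTH, NINTH, NINTH],
                   [NINTH, NINTH, NINTH],
                   [NINTH, NINTH, NINTH]],
    "sharpening": [[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]],
    "prewitt":    [[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]],
    "sobel":      [[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]],
}


def unique_taps(mask):
    """Count the distinct tap values in a mask (zero counts as a value)."""
    return len({tap for row in mask for tap in row})


counts = {name: unique_taps(mask) for name, mask in KERNELS.items()}
print(counts)
```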

1-D Symmetric FIR Filter Lessons
• Telecommunication and Radar Communities
  • Exploit Symmetric Filters: C(k) = C(K-(k+1))
  • Reorder Additions Before Multiplication
  • Only 1/2 the Multipliers Necessary
• Can We Exploit 2-D Symmetry?
  • Tap Values Reprogrammable
  • Tap Symmetry Reprogrammable
  • Minimize Multipliers
  • Leverage Large Amount of Configurable Logic Blocks
• Benefits of Increased Parallelism
  • Higher Throughput
  • More Efficient Power Utilization Over Time
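The 1-D trick of reordering additions before multiplication can be sketched as below, assuming a K-tap filter whose coefficients satisfy C(k) = C(K-(k+1)). The function names are hypothetical, and the multiply counter exists only to make the savings visible.

```python
def fir_direct(coeffs, window):
    """Direct form: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_symmetric(coeffs, window):
    """Pre-add samples that share a coefficient, then multiply once per pair.

    Returns (result, number_of_multiplies).
    """
    K = len(coeffs)
    # The symmetry condition from the slide: C(k) = C(K-(k+1)).
    assert all(coeffs[k] == coeffs[K - (k + 1)] for k in range(K))
    acc = 0
    multiplies = 0
    for k in range(K // 2):
        acc += coeffs[k] * (window[k] + window[K - (k + 1)])
        multiplies += 1
    if K % 2:
        # An odd-length filter has an unpaired middle tap.
        acc += coeffs[K // 2] * window[K // 2]
        multiplies += 1
    return acc, multiplies


coeffs = [1, 2, 3, 3, 2, 1]
window = [4, 0, 5, 1, 2, 7]
print(fir_direct(coeffs, window), fir_symmetric(coeffs, window))
```

For the 6-tap example, the symmetric form needs 3 multiplies instead of 6 and produces the same result.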

Key Ideas
• Number of Active Multipliers Varies with Tap Mask
  • Turn off unused Multipliers for lower power
  • Or, use unused Multipliers to process next pixel
    • Requires parallel memory accesses
    • Higher throughput
    • Finish sooner, then sleep device
    • Lower Clock Rate
• Adder Tree layers before and after multiply vary with number of Multipliers per pixel
• Input Data must be able to be routed to each multiplier
• Will multiplier savings outweigh extra routing, multiplexing, larger circuit quiescent power?
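The central idea, grouping input pixels that share a tap value, summing them first, and multiplying once per unique tap, can be modeled in software as follows. This is a behavioral sketch only, not the hardware data path; the function name `convolve_pixel_grouped` is hypothetical, and zero-valued taps are counted as a multiplier to match the unique-tap counts used in the slides.

```python
from collections import defaultdict


def convolve_pixel_grouped(window, mask):
    """Compute one output pixel from a 3x3 window.

    Pixels under equal tap values are added together before the multiply,
    so the number of multiplies equals the number of unique taps.
    Returns (result, number_of_multiplies).
    """
    groups = defaultdict(int)
    for i in range(3):
        for j in range(3):
            # Pre-multiply adder stage: accumulate pixels per tap value.
            groups[mask[i][j]] += window[i][j]
    # One multiply per unique tap, then the post-multiply adder stage.
    result = sum(tap * total for tap, total in groups.items())
    return result, len(groups)


sobel = [[-1, 0, 1],
         [-2, 0, 2],
         [-1, 0, 1]]
window = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
print(convolve_pixel_grouped(window, sobel))
```

For the Sobel mask this uses 5 multiplies per pixel instead of 9; the remaining multipliers could then be disabled or assigned to the next pixel, as the slide proposes.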

Adaptive Convolution Kernel Sizing
• Implementing Multiple Pixel Version
• How Many Multipliers to Use?
  • Multiple of 9
  • Size that is easy to place and allows for TMR (triple modular redundancy) growth
• 18 Multipliers Per Kernel
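A rough feel for how an 18-multiplier kernel trades unique taps for throughput is given below. The model is an assumption, not something stated in the slides: it supposes each output pixel needs one multiplier per unique tap and that only whole pixels can be processed per clock cycle. The function name `pixels_per_cycle` is hypothetical.

```python
def pixels_per_cycle(unique_taps, multipliers=18):
    """Assumed model: one multiplier per unique tap per pixel, whole pixels only."""
    return multipliers // unique_taps


# Throughput for masks with 1 through 9 unique tap values.
throughput = [pixels_per_cycle(u) for u in range(1, 10)]
print(throughput)
```

Under this assumption, masks with 7 or more unique taps yield 2 pixels per cycle, the same as a conventional design that spends 9 multipliers on each pixel. That is consistent with, though not a confirmation of, the crossover at 7 taps noted on the Analysis slide.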

Kernel Block Diagram
[Figure: kernel block diagram. Labeled blocks: Register Delay Bank, Data Mux, Common Tap Mux, Adder Tree, multipliers (M), Output Adder Tree, Tap Mask State Machine. Inputs: Input Row 0 through Input Row 19, Tap Value, Number of Unique Taps. Outputs: Output 0 through Output 17.]
• Group Data Values with Common Taps
• Dynamically Adjust Multiplier Position within Adder Tree

Implementation Comparison
• Quiescent Power 35% Higher

Total Energy Comparison
• For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power

What is hard?
• Poor Tool Support for Power Design
• Analyzing Power Trade-offs can be complex & time consuming
  • Have to have fully routed and simulated designs to compare approaches
• Router is optimized for throughput, not power
• Finding all Chip Enables to Disable
  • For each of several different multiplexer settings
• Secondary Power Effects
• Can also use Relative Placement Macros to “help” Router
  • Finding where can be time consuming

Analysis
• For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power
  • Crossover point at 7 taps is an implementation limitation of using 18 multipliers in kernel
• Quiescent Power
  • Not much larger considering extra circuitry
  • 18 Adder Trees, 16 Block RAMs
• Dynamic Power Consumption
  • Observed to vary by +50% within one circuit from one place and route to another, even using same settings
  • Average of 3 routes used for each circuit
• For Systems Where Parallelizing Input Data Stream Is Difficult
  • Disabling extra Multipliers is best approach
  • Power savings expected to be less
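The trade-off between higher quiescent power and a shorter dynamic-power window can be illustrated with a simple energy model. Everything here is an assumption except the 35% quiescent-power increase quoted in the slides: the wattages, the frame period, the idea that dynamic power is the same in both designs, and the function names are all hypothetical.

```python
def frame_energy_joules(quiescent_w, dynamic_w, active_s, period_s):
    """Quiescent power is paid for the whole period; dynamic power only while active."""
    return quiescent_w * period_s + dynamic_w * active_s


PERIOD_S = 1.0          # hypothetical frame period
BASE_QUIESCENT_W = 1.0  # hypothetical baseline quiescent power
BASE_DYNAMIC_W = 2.0    # hypothetical dynamic power while processing

# Baseline design is busy for the entire frame period.
baseline = frame_energy_joules(BASE_QUIESCENT_W, BASE_DYNAMIC_W, PERIOD_S, PERIOD_S)


def adaptive_energy(speedup):
    """Adaptive design: 35% more quiescent power, active time divided by `speedup`."""
    return frame_energy_joules(1.35 * BASE_QUIESCENT_W, BASE_DYNAMIC_W,
                               PERIOD_S / speedup, PERIOD_S)


for speedup in (1.0, 1.5, 2.0, 3.0):
    print(speedup, adaptive_energy(speedup), baseline)
```

With no speedup the adaptive design loses because of its extra quiescent power; once tap commonality shortens the active window enough, the saved dynamic energy outweighs that overhead. Where the crossover falls depends entirely on the real power figures, which this sketch does not attempt to reproduce.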

Conclusions
• Substantial Power Savings can be Achieved by Making Power a First-Class Design Constraint
• Knowledge of Underlying Resource Capacitance a Key Foundation
• Re-use Power-Critical Components
• Routing Can Be Influenced to Yield Lower Power
  • Over-constrain timing on power sensitive nets
  • Use Relative Placement Macros (RPMs)
