DSP Algorithms on FPGA, Part II: Digital Image Processing
Content
• Overview of image processing and FPGAs
• Algorithm to FPGA Mapping Flow
• Nested Loop Algorithms and MODG
• Example: Motion Estimation
• Conclusion and Future Trends
Video signal in different formats
Format  Resolution (pixels)  Frame rate (f/s)  Pixel rate (Mp/s)
PAL     720*576              25                10.4
NTSC    720*480              29.97             10.4
HDTV    1920*1080            30.0              62.2
Common delivery forms:
• Analog (cable)
• USB
• Firewire
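The pixel rates listed for the video formats above follow from width × height × frame rate. A minimal sketch that reproduces the arithmetic (the function and dictionary names are illustrative, not from the slides):

```python
# Pixel rate = width * height * frames per second, reported in megapixels/s.
# The format figures are the ones quoted on the slide.
FORMATS = {
    "PAL": (720, 576, 25.0),
    "NTSC": (720, 480, 29.97),
    "HDTV": (1920, 1080, 30.0),
}


def pixel_rate_mps(width, height, fps):
    """Return the pixel rate in megapixels per second (1 Mp = 1e6 pixels)."""
    return width * height * fps / 1e6


# Round to one decimal place, as on the slide.
rates = {name: round(pixel_rate_mps(*fmt), 1) for name, fmt in FORMATS.items()}

for name, rate in rates.items():
    print(f"{name}: {rate} Mp/s")
```

These rates are the sustained input bandwidth that any pixel-level processing array has to keep up with.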
Image Processing Characteristics
• Need to maximize the available logic by supporting N-D multiple configurable devices
• For example: image
Challenges
How to...?
• Appropriately partition algorithms between hardware and software
• Exploit spatial and temporal parallelism
• Integrate the configurable computer into the software framework
• Select a suitable configuration strategy
How shall we deal with these challenges?
Why SRAM-Based FPGAs? (Pros)
• Higher logic/storage capacity
  * Fast carry chain for adders/subtractors
  * Built-in XOR gates/LUT
  * Array of bit-parallel multipliers
  * Fast and local storage: array of SRAM blocks
  * Interconnect supports: three-state buffers/LUT
• Equivalent to fine-grained reconfigurable hardware
  * Finer-grained pipelining can help preserve the performance at low power supply voltage
• More mature CMOS manufacturing technology
The Matrix Multiplication MODG
• A number of different execution orders can carry out the same algorithm.
Nested Do Loop Algorithms and Inter-Iteration Dependence Graph
Do i=1 to M
  Do j=1 to N
    c[i,j]=0;
    Do k=1 to K
      c[i,j]= c[i,j]+a[i,k]*b[k,j];
    EndDo k
  EndDo j
EndDo i
Dependence vectors
• da = (i,j,k)^T = (0,1,0)^T
• db = (i,j,k)^T = (1,0,0)^T
• dc = (i,j,k)^T = (0,0,1)^T
• Index Space J3 = {(i,j,k)^T : 1 ≤ i,j,k ≤ 3} (M=N=K=3)
• Inter-Iteration Data Dependence graph (DG)
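The dependence vectors can be checked mechanically. The sketch below runs the same loop nest in Python and, for each variable, collects the step between consecutive iterations that touch the same array element. The function names are illustrative, and the enumeration is only one simple way to derive the vectors for this particular loop nest.

```python
from itertools import product


def matmul_loop_nest(a, b):
    """Direct transcription of the nested Do loop (0-based indices)."""
    M, K, N = len(a), len(b), len(b[0])
    c = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                c[i][j] += a[i][k] * b[k][j]
    return c


def dependence_vectors(size=3):
    """Enumerate the index space 1 <= i,j,k <= size and derive, per variable,
    the step between consecutive iterations that use the same element."""
    touched = {"a": {}, "b": {}, "c": {}}
    for i, j, k in product(range(1, size + 1), repeat=3):
        touched["a"].setdefault((i, k), []).append((i, j, k))  # a[i,k]
        touched["b"].setdefault((k, j), []).append((i, j, k))  # b[k,j]
        touched["c"].setdefault((i, j), []).append((i, j, k))  # c[i,j]
    result = {}
    for name, groups in touched.items():
        steps = set()
        for points in groups.values():
            points.sort()
            for p, q in zip(points, points[1:]):
                steps.add(tuple(y - x for x, y in zip(p, q)))
        result[name] = steps
    return result


vectors = dependence_vectors(3)
print(vectors)
```

For M=N=K=3 each variable yields a single vector: a[i,k] is reused along j, b[k,j] along i, and c[i,j] is accumulated along k.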
Systolic Mapping (space-time) of Matrix Multiplication
[Figure: 3-D DG (Dependence Graph) mapped to a 2-D Processor Array; figure labels: s, P]
Systolic Mapping of Matrix Multiplication, cont.
[Figure: flow of a11..a33, b11..b33 and the partial sums C11..C33 through the processor array for M=N=K=3; the C inputs are initialized to 0]
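A space-time mapping of this kind can be sketched in a few lines. The slides do not give the actual schedule and projection, so the code assumes one common choice: iteration (i,j,k) runs on processing element (i,j), that is, a projection along k, at time step t = i + j + k. Both choices and all names below are assumptions for illustration only.

```python
def systolic_matmul(a, b, s=(1, 1, 1)):
    """Simulate a 2-D processor array for c = a * b.

    Assumed mapping (not taken from the slides): PE = (i, j),
    time step t = s . (i, j, k) with schedule vector s.
    Returns the product and the number of time steps used.
    """
    M, K, N = len(a), len(b), len(b[0])

    # Causality check: every dependence vector must advance time.
    deps = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]
    assert all(sum(si * di for si, di in zip(s, d)) > 0 for d in deps)

    # Assign every iteration of the index space to a time step.
    schedule = {}
    for i in range(M):
        for j in range(N):
            for k in range(K):
                t = s[0] * i + s[1] * j + s[2] * k
                schedule.setdefault(t, []).append((i, j, k))

    c = [[0] * N for _ in range(M)]
    steps = 0
    for t in sorted(schedule):
        busy = set()
        for i, j, k in schedule[t]:
            # Each PE performs at most one multiply-accumulate per step.
            assert (i, j) not in busy
            busy.add((i, j))
            c[i][j] += a[i][k] * b[k][j]
        steps += 1
    return c, steps


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
c, steps = systolic_matmul(a, b)
print(c, steps)
```

Under these assumptions the 27 iterations of the 3×3×3 index space finish in M+N+K-2 = 7 time steps on 9 processing elements, and each element only needs data from its immediate neighbours.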
Why Is Space-Time Mapping Suitable for FPGAs?
• It bridges nested Do loop signal/image processing algorithms and the processor array implementation.
• The space-time array matches the modular and regular FPGA structure.
• The localized/pipelined interprocessor links can overcome the long programmable interconnect delay.
• The size of the configuration storage can be significantly reduced because the processing elements and interconnect structure are almost identical.
Problems with Existing Design Methodologies/Tools
• The dependence graphs of many other algorithms are not uniform and must be predetermined by human designers.
• Existing methodologies
  * cannot handle these complex algorithms
  * use unrealistic cost functions (metrics)
• No built-in features of FPGAs have been incorporated.
• Longer interconnect delay in deep submicron CMOS technology
• Much lower hardware utilization due to programmable interconnect delay in FPGAs
There is another problem: speed.
What is Intra-PE Pipelining?
• The interconnect delay of FPGAs results in an even longer clock period.
• To enhance the overall throughput, intra-iteration parallelism must be exploited.
• A simple vector dot product array
• It can be observed that the utilization of each operator is increased.
• Of course, the control mechanism is more complex.
Tech done example
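The effect can be illustrated with a small cycle-level model of one dot-product PE. This is a simplified sketch, not the microarchitecture from the slides: it assumes a two-stage PE (multiplier, then accumulator) with a register between the stages, and the delay figures passed to `total_time_ns` are hypothetical placeholders.

```python
def pipelined_dot(x, y):
    """Two-stage PE model: stage 1 multiplies, stage 2 adds the product
    computed in the previous cycle. Returns (result, cycles)."""
    acc = 0
    product_reg = None  # pipeline register between multiplier and adder
    cycles = 0
    for n in range(len(x) + 1):  # one extra cycle drains the pipeline
        # Stage 2: accumulate the product latched in the previous cycle.
        if product_reg is not None:
            acc += product_reg
        # Stage 1: compute the next product (both stages work in parallel).
        product_reg = x[n] * y[n] if n < len(x) else None
        cycles += 1
    return acc, cycles


def total_time_ns(length, t_mul, t_add, pipelined):
    """Execution time for a dot product of the given length.

    Unpipelined: the clock period covers multiplier + adder in series.
    Pipelined:   the clock period is set by the slower stage alone.
    """
    if pipelined:
        return (length + 1) * max(t_mul, t_add)
    return length * (t_mul + t_add)


result, cycles = pipelined_dot([1, 2, 3], [4, 5, 6])
print(result, cycles)

# Hypothetical stage delays in ns, chosen only to show the trend.
slow = total_time_ns(16, t_mul=15, t_add=10, pipelined=False)
fast = total_time_ns(16, t_mul=15, t_add=10, pipelined=True)
print(slow, fast)
```

In this model the pipelined PE needs one extra cycle to drain, but the multiplier and adder are both busy in almost every cycle and the clock period shrinks to the slower stage, at the cost of the extra register and control.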
Examples of Nested Do Loop Algorithms
• Motion estimation
  * One of the most time-consuming operations (tasks) in digital video compression
• Stereo matching
  * Used to build a disparity map for 3D robot/computer navigation
• Matrix/Vector Multiplication
  * FFT, DCT, 2D/3D graphics, etc.
• 2D Linear Transform/Operations
  * 2D FFT, 2D DCT, etc.
Illustration of Full Search Block Matching Motion Estimation (6-level nested Do loop)
Motion vector = (m,n)
Example: A Simpler PE Microarchitecture
MAD(m,n) = MAD(m,n) + |x(hN+i,vN+j) - y(hN+i+m-p,vN+j+n-p)|
• Xilinx Core Generator System
• Critical path delay = 25 ns (based on Xilinx Virtex data)
• 1,500-2,000 equivalent gate count
• Critical path (blue line) can be shortened further by Intra-PE pipelining
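The MAD accumulation above is the innermost statement of the six-level loop nest (h, v, m, n, i, j). A minimal software sketch of full search block matching follows. It assumes that x is the current frame and y the reference frame, that frames are lists of rows indexed as in the formula, that N is the block size and p the search range, and that candidates falling outside the frame are skipped. The function name and the boundary rule are assumptions, not taken from the slides.

```python
def full_search(x, y, N, p):
    """Return {(h, v): (best MAD, (m, n))} for every N x N block of x.

    (m, n) ranges over 0..2p, so the displacement is (m - p, n - p).
    """
    H, W = len(x), len(x[0])
    vectors = {}
    for h in range(H // N):                      # loop 1: block row
        for v in range(W // N):                  # loop 2: block column
            best = None
            for m in range(2 * p + 1):           # loop 3: search offset
                for n in range(2 * p + 1):       # loop 4: search offset
                    r0, c0 = h * N + m - p, v * N + n - p
                    # Assumption: ignore candidates outside the frame.
                    if r0 < 0 or c0 < 0 or r0 + N > H or c0 + N > W:
                        continue
                    mad = 0
                    for i in range(N):           # loop 5: pixel in block
                        for j in range(N):       # loop 6: pixel in block
                            mad += abs(x[h * N + i][v * N + j]
                                       - y[r0 + i][c0 + j])
                    if best is None or mad < best[0]:
                        best = (mad, (m, n))
            vectors[(h, v)] = best
    return vectors


# Small synthetic check: an 8x8 reference frame with distinct pixel values.
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [row[:] for row in ref]
for r in range(4):
    for c in range(4):
        cur[r][c] = ref[r + 1][c + 1]  # block (0,0) displaced by (1, 1)

motion = full_search(cur, ref, N=4, p=1)
print(motion[(0, 0)], motion[(1, 1)])
```

Every (m, n) candidate is independent of the others, which is what makes this loop nest a natural fit for an array of identical processing elements.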
Significance of the Contributions
• The MODG representation for nested Do loop algorithms
  * The actual execution is not constrained to any predetermined order.
  * keeps track of every variable instance so that there is no redundant memory access, which saves I/O, bandwidth, and power consumption.
  * can be automated using memory .
• Without the MODG,
  * motion estimation and many other nested Do loop algorithms can be written as many different DGs,
  * a human must be involved to formulate a DG, and
  * the built-in ROM/RAM of the FPGA may not be exploited.
Significance of the Contributions, cont.
• Space-Time mapping for the MODG can be applied to
  * any SRAM-based FPGA (architecture constraints and practical cost functions)
  * any coarse-grained architecture
• Intra-PE pipelining
  * enhances/preserves the throughput rate in low-power mode.
Conclusion
• Users demand more communication/multimedia processing capabilities on resource-limited Internet appliances.
• Reconfigurable SOC is the ultimate solution for designing the challenging low-power/high-performance platform.
• Its success depends on the embedded high-density FPGA core as reconfigurable (programmable) accelerating hardware.
• As technology (supply voltage) scales down, logic (transistors) is virtually free, while the interconnect becomes the bottleneck and the main consumer of power.
• Parallel execution of nested Do loop algorithms by an array of localized processing elements at a moderate clock frequency is a viable solution.
• It can balance the three main issues: design time, power consumption, and performance.
Future Trends
• Memory (storage) organization should be investigated, because sustaining such high throughput requires multiple reads per clock cycle.
• The control mechanism of the entire array is one of the aspects that will determine its success.
• A given MODG may need to be partitioned so that the resulting array fits the on-chip reconfigurable FPGA core.