Quadratic Programming Solver for Image Deblurring Engine

Quadratic Programming Solver for Image Deblurring Engine. Rahul Rithe, Michael Price, Massachusetts Institute of Technology. Image Deblurring. Blur Kernel. For image deblurring, the solution is constrained to be non-negative: l = 0, u = +∞. Algorithm. Cauchy Point Computation:


Presentation Transcript


  1. Quadratic Programming Solver for Image Deblurring Engine
     Rahul Rithe, Michael Price
     Massachusetts Institute of Technology

  2. Image Deblurring
     • [Figure: blur kernel]
     • For image deblurring, the solution is constrained to be non-negative
     • l = 0, u = +∞
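The slide does not spell out the objective. Assuming the standard bound-constrained quadratic program whose gradient is the (Ax – b) shown on the following slides (with A taken to be symmetric, which is an assumption here), the problem can be written as:

```latex
\min_{x}\; f(x) = \tfrac{1}{2}\, x^{\mathsf T} A x - b^{\mathsf T} x
\quad \text{subject to} \quad l \le x \le u,
\qquad \nabla f(x) = A x - b .
```

With l = 0 and u = +∞ the constraint reduces to x ≥ 0, i.e. every entry of the deblurred solution must be non-negative.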

  3. Algorithm
     • Cauchy Point Computation: first local minimum along the gradient direction projected onto the search space
     • Gradient: (Ax – b)
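A minimal Python sketch of a Cauchy point search, assuming the objective (1/2)xᵀAx – bᵀx with bounds lower ≤ x ≤ upper. It follows the textbook projected-gradient-path procedure rather than the authors' hardware design; the names `cauchy_point`, `matvec` and `dot` are hypothetical.

```python
import math

def matvec(A, v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cauchy_point(A, b, x, lower, upper):
    """First local minimum of 0.5*x'Ax - b'x along the projected gradient path."""
    n = len(x)
    g = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # gradient Ax - b at the start
    # t[i]: step length at which coordinate i reaches a bound when moving along -g
    t = []
    for i in range(n):
        if g[i] > 0 and lower[i] > -math.inf:
            t.append((x[i] - lower[i]) / g[i])
        elif g[i] < 0 and upper[i] < math.inf:
            t.append((x[i] - upper[i]) / g[i])
        else:
            t.append(math.inf)
    # Sorted, distinct, finite breakpoints (the slides use a merge sort for this step)
    breakpoints = sorted({ti for ti in t if 0 < ti < math.inf})
    x_cur, t_prev = [float(xi) for xi in x], 0.0
    for t_next in breakpoints + [math.inf]:
        # Coordinates that already hit a bound stay fixed on this segment
        p = [-g[i] if t[i] > t_prev else 0.0 for i in range(n)]
        if not any(p):
            break
        g_cur = [ax - bi for ax, bi in zip(matvec(A, x_cur), b)]
        slope = dot(g_cur, p)              # first derivative along the segment
        curvature = dot(p, matvec(A, p))   # second derivative along the segment
        if slope >= 0:
            break                          # minimum is at the start of this segment
        if curvature > 0 and -slope / curvature < t_next - t_prev:
            dt = -slope / curvature        # minimum lies inside this segment
            x_cur = [xi + dt * pi for xi, pi in zip(x_cur, p)]
            break
        if t_next == math.inf:
            break                          # unbounded direction; stop at current point
        dt = t_next - t_prev               # otherwise advance to the next breakpoint
        x_cur = [min(max(xi + dt * pi, lo), hi)
                 for xi, pi, lo, hi in zip(x_cur, p, lower, upper)]
        t_prev = t_next
    return x_cur
```

Note that this sketch recomputes the gradient with a full matrix/vector product on every segment, which corresponds to the direct implementation that the optimization slides aim to avoid.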

  4. Optimizations: Dimension Reduction
     • Ignore the dimensions that have active constraints by holding their solution at zero until the next outer iteration
     • If all but 100 constraints are active: 100×100 matrix/vector operations instead of 1000×1000
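The dimension reduction above can be sketched in Python as follows. This assumes that a constraint counts as active when x_i sits at the lower bound 0; the slide does not give the exact activity test, and the function names `reduce_problem` and `expand_solution` are hypothetical.

```python
def reduce_problem(A, b, x, tol=0.0):
    """Keep only the free (non-active) dimensions of the system.

    Assumption: dimension i is active when x[i] <= tol, i.e. it sits at the
    lower bound 0. Active dimensions are held at zero, so they contribute
    nothing to A x and can simply be dropped from A and b.
    """
    free = [i for i, xi in enumerate(x) if xi > tol]
    A_free = [[A[i][j] for j in free] for i in free]
    b_free = [b[i] for i in free]
    return free, A_free, b_free

def expand_solution(free, x_free, n):
    """Scatter a reduced solution back to full size; active entries stay 0."""
    x = [0.0] * n
    for i, xi in zip(free, x_free):
        x[i] = xi
    return x
```

With 1000 unknowns of which 900 are held at zero, `A_free` is the 100×100 block mentioned on the slide.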

  5. Optimizations: Incremental Update
     • Incrementally update the matrix/vector product in the Cauchy Point (CP) step
     • Incrementally update the gradient throughout both the CP and Conjugate Gradient (CG) steps, based on incremental changes to x
     • At the end of each CG refinement, recalculate the cost using the updated gradients
     • Avoids explicit computation of the Ax product in every outer iteration
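The identity behind the incremental update is that if x changes by Δx, the gradient g = Ax – b changes by AΔx, which costs only one column of A per changed coordinate. A small Python sketch, with hypothetical names `full_gradient` and `update_gradient`:

```python
def full_gradient(A, x, b):
    """Reference computation: g = A x - b (one full matrix/vector product)."""
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]

def update_gradient(A, g, changes):
    """Update g in place after x[j] += delta for each (j, delta) in changes.

    Each changed coordinate adds delta times column j of A to g, so the
    cost scales with the number of changed coordinates, not with the
    full size of A.
    """
    for j, delta in changes.items():
        for i in range(len(g)):
            g[i] += A[i][j] * delta
    return g
```

When only a few entries of x move between steps, this replaces a full product with a handful of column updates.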

  6. Optimizations: Performance Improvement
     • N outer iterations, with M1 breakpoints checked for CP and M2 CG iterations per outer iteration
     • Direct implementation: N(3+M1+M2) matrix/vector multiplications
     • Optimized implementation: 1+N(2+M2) matrix/vector multiplications
     • Optimized implementation typically achieves ~50% performance improvement
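The two operation counts can be compared directly. The values N = 10, M1 = 5, M2 = 5 in the usage note are illustrative only and do not come from the presentation.

```python
def direct_matvecs(n_outer, m1, m2):
    """Matrix/vector multiplications in the direct implementation: N(3+M1+M2)."""
    return n_outer * (3 + m1 + m2)

def optimized_matvecs(n_outer, m2):
    """Matrix/vector multiplications in the optimized implementation: 1+N(2+M2)."""
    return 1 + n_outer * (2 + m2)

def saving(n_outer, m1, m2):
    """Fraction of matrix/vector multiplications removed by the optimization."""
    return 1 - optimized_matvecs(n_outer, m2) / direct_matvecs(n_outer, m1, m2)
```

For example, `direct_matvecs(10, 5, 5)` is 130 and `optimized_matvecs(10, 5)` is 71, a saving of roughly 45% in multiplication count for these made-up inputs. The optimized count no longer depends on M1, so the saving grows with the number of breakpoints checked.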

  7. Architecture
     • A, b, x stored in DRAM
     • On-chip SRAMs used for temporary variables
     • Single-precision floating-point arithmetic
     • Iterative execution of CP and CG
     • Use non-concurrency of CP and CG to share SRAMs
     • Control logic determines resource access
     • Memory controller connects the design to external DDR2 memory

  8. Matrix Multiplier
     • Multiplication in chunks of m:
     • m elements of A are fetched per clock cycle from DRAM
     • One element of x, b can be accessed per clock cycle from SRAM
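One possible software model of chunked multiplication is sketched below. It assumes that each step pairs m elements from one column of A with a single element of x, which is only one reading of the slide; the actual datapath is not described in the transcript, and `matvec_chunked` is a hypothetical name.

```python
def matvec_chunked(A, x, m):
    """Compute y = A x, consuming m elements of A per modeled clock cycle.

    Assumed schedule: for each element x[j] (one SRAM read), column j of A
    is streamed in chunks of m rows, and each chunk performs m
    multiply-accumulates into the matching entries of y.
    """
    n_rows = len(A)
    y = [0.0] * n_rows
    for j, xj in enumerate(x):                          # one element of x per step
        for r0 in range(0, n_rows, m):                  # one chunk of column j
            for r in range(r0, min(r0 + m, n_rows)):    # m parallel MACs in hardware
                y[r] += A[r][j] * xj
    return y
```

The result is independent of m; the chunk size only changes how many modeled cycles are needed.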

  9. Matrix Multiplier
     • Active Columns
       • Check if any columns in a group of m columns are active
       • Skip over the group if no active columns
     • Active Rows
       • Check if any rows in a group of m rows are active
       • Skip over the group if no active rows
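The group-skipping logic can be modeled in Python as below. This sketch assumes that "active" on this slide marks rows and columns that still take part in the product; the boolean masks `row_active` and `col_active` and the function name `masked_matvec` are hypothetical.

```python
def masked_matvec(A, x, row_active, col_active, m):
    """Compute the active part of y = A x, skipping whole groups of m.

    Returns (y, groups_processed). Entries of y for inactive rows stay 0.
    A group of m rows or m columns is skipped when none of its members
    is active, so no elements of A are touched for that group.
    """
    n_rows, n_cols = len(A), len(x)
    y = [0.0] * n_rows
    groups_processed = 0
    for r0 in range(0, n_rows, m):
        rows = range(r0, min(r0 + m, n_rows))
        if not any(row_active[r] for r in rows):
            continue                                  # skip inactive row group
        for c0 in range(0, n_cols, m):
            cols = range(c0, min(c0 + m, n_cols))
            if not any(col_active[c] for c in cols):
                continue                              # skip inactive column group
            groups_processed += 1
            for r in rows:
                if row_active[r]:
                    y[r] += sum(A[r][c] * x[c] for c in cols if col_active[c])
    return y, groups_processed
```

In a 4×4 example with m = 2, one active row and two active columns in the same group, only one of the four m×m blocks is processed.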

  10. Matrix Multiplier

  11. Sort
     • Cauchy Point Computation requires sorting an array of breakpoints
     • Sort implemented using merge sort
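For reference, a plain top-down merge sort in Python, applied to a list of breakpoints. This is a software illustration of the algorithm named on the slide, not a description of the hardware sorter.

```python
def merge_sort(values):
    """Return a new list with the values in ascending order (stable)."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left, right = merge_sort(values[:mid]), merge_sort(values[mid:])
    merged, i, j = [], 0, 0
    # Repeatedly take the smaller head element of the two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged
```

Merge sort performs O(n log n) comparisons regardless of input order and reads each half sequentially.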

  12. Main Modules
     • The control logic in both the CP and CG modules is implemented as FSMs that sequence the external operators
     • Each state corresponds to a discrete step of the algorithm
     • Each step evaluates as many operations as possible concurrently
     • [Figure: Conjugate Gradient architecture]

  13. FPGA Implementation
     • Virtex-5 LX110T
     • QP Solver design integrated with DDR2 memory using a Request/Response interface
     • Integrated with SCE-MI to communicate between a processor and the FPGA
     • Verified in simulation
     • Performance after synthesis: 51.3 MHz
     • [Figure: resource utilization during placement]

  14. FPGA Implementation
     • Kintex-7 K325T
     • QP Solver design integrated with DDR3 memory using a Request/Response interface
     • Integrated with USB interface to communicate between a processor and the FPGA
     • Performance after synthesis: 67.2 MHz

  15. FPGA Implementation
     • Kintex-7 K325T
     • QP Solver design integrated with DDR3 memory using a Request/Response interface
     • Integrated with USB interface to communicate between a processor and the FPGA
     • Performance after synthesis: 67.2 MHz
     • [Figure: resource utilization after synthesis]
     • [Figure: resource utilization after placement]

  16. Results
     • [Figure: synthetic problem of size 256]
     • [Figure: real problem of size 361 from image deblurring]

  17. Results
     • FPGA implementation is faster for larger problem sizes

  18. Conclusions
     • QP Solver module designed and implemented on Kintex-7 FPGA
     • Optimized the implementation to reduce matrix/vector multiplications
     • Maximized concurrent execution of processing steps
     • FPGA implementation verified to be functional for problem sizes ranging from 16 to 361
     Acknowledgements: Priyanka Raina, Richard Uhler, Myron King, Prof. Arvind