Quadratic Programming Solver for Image Deblurring Engine Rahul Rithe, Michael Price Massachusetts Institute of Technology
Image Deblurring Blur Kernel • For image deblurring, the solution is constrained to be non-negative • l = 0, u = +∞
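The bound-constrained problem can be written out as a worked equation. This is a sketch that assumes the quadratic cost implied by the gradient Ax – b shown on the following slides, with A symmetric; the cost is only determined up to an additive constant, and the slides do not state it explicitly.

```latex
\min_{x}\; q(x) = \tfrac{1}{2}\, x^{\mathsf T} A x - b^{\mathsf T} x
\quad \text{subject to} \quad l \le x \le u,
\qquad l = 0,\; u = +\infty,
\qquad \nabla q(x) = A x - b
```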
Algorithm • Cauchy Point (CP) Computation: first local minimum along the gradient projected onto the search space • Gradient: (Ax – b)
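The Cauchy point search can be sketched in software as follows. This is a minimal pure-Python illustration of the textbook procedure for the bounds l = 0, u = +∞, not the hardware design from the slides; it assumes the cost 0.5 x'Ax - b'x, and the names `matvec` and `cauchy_point` are hypothetical. It recomputes the gradient on every segment, which corresponds to a direct (unoptimized) implementation.

```python
import math

def matvec(A, v):
    """Dense matrix/vector product for a list-of-lists matrix."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def cauchy_point(A, b, x):
    """First local minimum of q(x) = 0.5 x'Ax - b'x along the projected
    steepest-descent path max(x - t*g, 0), for the constraint x >= 0."""
    n = len(x)
    g = [ax - bi for ax, bi in zip(matvec(A, x), b)]
    # Breakpoint t_i: step length at which coordinate i reaches the bound 0.
    t_break = [x[i] / g[i] if g[i] > 0 else math.inf for i in range(n)]
    # Coordinates already at the bound and pushed outward do not move.
    d = [-g[i] if t_break[i] > 0 else 0.0 for i in range(n)]
    breakpoints = sorted({t for t in t_break if 0 < t < math.inf})

    x_cur, t_prev = list(x), 0.0
    for t_next in breakpoints + [math.inf]:
        g_cur = [ax - bi for ax, bi in zip(matvec(A, x_cur), b)]
        Ad = matvec(A, d)
        slope = sum(gi * di for gi, di in zip(g_cur, d))
        curv = sum(di * adi for di, adi in zip(d, Ad))
        if slope >= 0:
            return x_cur  # cost no longer decreases: minimum at segment start
        if curv > 0:
            dt = -slope / curv
            if dt < t_next - t_prev:
                # Minimum lies strictly inside this segment.
                return [xi + dt * di for xi, di in zip(x_cur, d)]
        elif t_next == math.inf:
            raise ValueError("cost is unbounded along the projected path")
        # Advance to the breakpoint and freeze the coordinates that hit 0.
        step = t_next - t_prev
        x_cur = [xi + step * di for xi, di in zip(x_cur, d)]
        for i in range(n):
            if t_break[i] <= t_next and d[i] != 0.0:
                x_cur[i], d[i] = 0.0, 0.0
        t_prev = t_next
    return x_cur
```

For example, with A = [[2, 0], [0, 2]], b = [2, -2] and starting point x = [0, 1], the path hits the bound on the second coordinate at t = 0.25 and the search ends at [1, 0].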
Optimizations • Dimension Reduction • Ignore the dimensions that have active constraints by holding their solution at zero until the next outer iteration • Example: if all but 100 constraints of a size-1000 problem are active, matrix/vector operations are 100×100 instead of 1000×1000
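The dimension reduction above can be illustrated with a small sketch. It assumes that a constraint counts as active exactly when the corresponding entry of x sits at the bound 0, so the remaining (free) entries form a smaller system; the function names are hypothetical. Because the held entries are zero, they contribute nothing to the reduced right-hand side.

```python
def free_set(x):
    """Indices whose bound constraint is inactive (assumed: x_i > 0)."""
    return [i for i, xi in enumerate(x) if xi > 0.0]

def reduce_problem(A, b, free):
    """Sub-matrix and sub-vector restricted to the free indices.
    Entries held at zero drop out, so b needs no correction term."""
    A_ff = [[A[i][j] for j in free] for i in free]
    b_f = [b[i] for i in free]
    return A_ff, b_f

def expand(x_f, free, n):
    """Scatter a reduced solution back to full length, zeros elsewhere."""
    x = [0.0] * n
    for k, i in enumerate(free):
        x[i] = x_f[k]
    return x
```

With x = [0, 1.5, 2] the free set is [1, 2], so a 3×3 problem shrinks to 2×2 for the rest of the outer iteration.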
Optimizations • Incremental Update • Incrementally update matrix/vector product in CP • Incrementally update gradient throughout both CP and CG (conjugate gradient) steps, based on incremental changes to x • At the end of each CG refinement, recalculate cost using updated gradients • Avoids explicit computation of Ax product every outer iteration
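A minimal sketch of the incremental gradient update: when x moves by alpha*d and the product Ad is already available from the step-length calculation, the new gradient follows from g + alpha*Ad without forming Ax again. The names `matvec` and `take_step` are hypothetical, and the sketch assumes the gradient Ax - b from the slides.

```python
def matvec(A, v):
    """Dense matrix/vector product for a list-of-lists matrix."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def take_step(x, g, d, Ad, alpha):
    """Move x by alpha*d and update the gradient g = Ax - b in place of
    a fresh product: A(x + alpha*d) - b = g + alpha*(A d)."""
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    g_new = [gi + alpha * adi for gi, adi in zip(g, Ad)]
    return x_new, g_new

# Usage: Ad is computed once and reused for both the step and the update.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [2.0, 1.0]
g = [ax - bi for ax, bi in zip(matvec(A, x), b)]
d = [-1.0, 0.5]
x_new, g_new = take_step(x, g, d, matvec(A, d), 0.5)
```

The updated gradient agrees with a full recomputation of Ax - b at the new point, up to floating-point rounding.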
Optimizations • Performance Improvement • N outer iterations with M1 breakpoints checked for CP and M2 CG iterations per outer iteration • Direct implementation: N(3 + M1 + M2) matrix/vector multiplications • Optimized implementation: 1 + N(2 + M2) matrix/vector multiplications • Optimized implementation typically achieves ~50% performance improvement
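The two operation counts can be compared directly. The sketch below just evaluates the formulas from the slide; the values N = 10, M1 = 5, M2 = 5 are hypothetical and chosen only to illustrate the arithmetic.

```python
def direct_count(N, M1, M2):
    """Matrix/vector multiplications in the direct implementation."""
    return N * (3 + M1 + M2)

def optimized_count(N, M1, M2):
    """Matrix/vector multiplications in the optimized implementation."""
    return 1 + N * (2 + M2)

# Hypothetical iteration counts, for illustration only.
N, M1, M2 = 10, 5, 5
print(direct_count(N, M1, M2))     # 130
print(optimized_count(N, M1, M2))  # 71
```

For these values the optimized count is about 55% of the direct count. The saving grows with M1, since the optimized count does not depend on the number of breakpoints checked.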
Architecture • A, b, x stored in DRAM • On-chip SRAMs used for temporary variables • Single-precision floating point arithmetic • Iterative execution of CP and CG • Use non-concurrency of CP and CG to share SRAMs • Control logic determines resource access • Memory controller connects the design to external DDR2 memory
Matrix Multiplier • Multiplication in chunks of m: • m elements of A are fetched per clock cycle from DRAM • One element of x, b can be accessed per clock cycle from SRAM
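A software analogue of the chunked multiplication is sketched below. It assumes one plausible reading of the slide: each cycle pairs m elements of A (m rows of one column) with a single element of x, so m multiply-accumulates proceed together. The actual fetch order in the hardware is not stated, and `matvec_chunked` is a hypothetical name.

```python
def matvec_chunked(A, x, m):
    """Compute y = A x by processing groups of m rows at a time.
    For each group, every column j uses one element x[j] and the
    m matrix elements A[i][j] of that group."""
    n_rows = len(A)
    y = [0.0] * n_rows
    for r0 in range(0, n_rows, m):
        rows = range(r0, min(r0 + m, n_rows))
        for j, xj in enumerate(x):
            # In hardware these m multiply-accumulates would run in parallel.
            for i in rows:
                y[i] += A[i][j] * xj
    return y
```

The result does not depend on m; only the grouping of the work changes.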
Matrix Multiplier • Active Columns • Check if any columns in a group of m columns are active • Skip over the group if no active columns • Active Rows • Check if any rows in a group of m rows are active • Skip over the group if no active rows
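The group-skipping check can be sketched as a small helper. It assumes that "active" on this slide means a row or column that still takes part in the computation, which appears to be the opposite sense of an active constraint in the dimension-reduction slide; the mask `in_use` and the name `groups_to_process` are hypothetical.

```python
def groups_to_process(in_use, m):
    """Start indices of the groups of m consecutive rows (or columns)
    that contain at least one entry still in use. Groups with no
    such entry are skipped entirely."""
    return [s for s in range(0, len(in_use), m) if any(in_use[s:s + m])]

# Usage: 8 columns in groups of 4, only column 4 still in use.
mask = [False, False, False, False, True, False, False, False]
print(groups_to_process(mask, 4))  # [4]
```

Only the second group is fetched here, so half of the matrix traffic for this pass is avoided.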
Sort • Cauchy Point Computation requires sorting an array of breakpoints • Sort implemented using merge sort
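For reference, a plain recursive merge sort over a list of breakpoints looks like this. It is a generic software sketch; the slides do not describe how the hardware sorter is organized, and `merge_sort` is a hypothetical name.

```python
def merge_sort(values):
    """Return a new list with the values in ascending order."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])
    # Merge the two sorted halves.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Merge sort reads and writes its data sequentially, with no random access pattern.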
Main Modules • The control logic in both the CP and CG modules is implemented as FSMs that sequence the external operators • Each state corresponds to a discrete step of the algorithm • Each step evaluates as many operations as possible concurrently Conjugate Gradient Architecture
FPGA Implementation • Virtex-5 LX110T • QP Solver design integrated with DDR2 memory using a Request/Response interface • Integrated with Sce-Mi to communicate between a processor and the FPGA • Verified in simulation • Performance after synthesis: 51.3 MHz Resource utilization during placement
FPGA Implementation • Kintex-7 K325T • QP Solver design integrated with DDR3 memory using a Request/Response interface • Integrated with USB interface to communicate between a processor and the FPGA • Performance after synthesis: 67.2 MHz Resource utilization after placement Resource utilization after synthesis
Results • Synthetic problem of size 256 • Real problem of size 361 from image deblurring
Results FPGA implementation is faster for larger problem sizes
Conclusions • QP Solver module designed and implemented on Kintex-7 FPGA • Optimized the implementation to reduce matrix/vector multiplications • Maximized concurrent execution of processing steps • FPGA implementation verified to be functional for problem sizes ranging from 16 to 361
Acknowledgements • Priyanka Raina, Richard Uhler, Myron King, Prof. Arvind