Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD. September 7, 2007 PaCT-2007, Pereslavl-Zalessky. Yusaku Yamamoto, Takeshi Fukaya, Takashi Uneyama, Masami Takata, Kinji Kimura, Masashi Iwasaki and Yoshimasa Nakamura.
Outline
• Introduction
• The CSX600 floating-point accelerator
• Optimization of the rectangular SVD algorithm for the CSX600
• Performance evaluation
• Conclusion
Singular value decomposition of rectangular matrices
$A = U S V^T$, where $m \gg n$ and
• $A$: $m \times n$ dense
• $U$: $m \times n$ orthogonal (orthonormal columns)
• $V$: $n \times n$ orthogonal
• $S$: $n \times n$ diagonal
Example: $m = 10^5$, $n = 5000$
Applications
• Image processing
• Electronic structure calculation
  • Filter Diagonalization Method
• Information retrieval
  • Latent Semantic Indexing
• Statistical computations
  • PCA, ICA and Least Squares
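As a small illustration of the shapes involved, the sketch below computes a thin SVD with NumPy on the host. The sizes and the use of `numpy.linalg.svd` are assumptions chosen for illustration; this is not the authors' CSX600 code.

```python
import numpy as np

m, n = 2000, 50                      # illustrative sizes with m >> n
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))

# full_matrices=False requests the thin factorization, with U of size m x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)                       # n x n diagonal matrix of singular values

print(U.shape, S.shape, Vt.shape)
print(np.allclose(U @ S @ Vt, A))    # checks A = U S V^T up to rounding
```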
Floating-point accelerators
• ClearSpeed CSX600
  • 1+96 processor cores
  • 48GFLOPS (double precision)
• Cell
  • 1+8 processor cores
  • 256GFLOPS (single precision)
• GRAPE-DR
  • 512 processor cores
  • 512GFLOPS (single precision)
  • 256GFLOPS (double precision)
These accelerators reach very high GFLOPS values thanks to a large number of cores, but their performance is limited by relatively low memory bandwidth.
Use of the Level-3 BLAS (matrix multiplication)
• Matrix multiplication $C := C + AB$
  • Amount of data: $O(N^2)$
  • Computational work: $O(N^3)$
  • The amount of data is $O(1/N)$ of the computational work.
  • By using the cache memory effectively, the effect of low memory bandwidth can be mitigated.
• For matrix-vector multiplication ($y := y + Ax$), both the amount of data and the computational work are $O(N^2)$.
We can exploit the potential performance of the CSX600 by reorganizing the algorithm to use matrix multiplications efficiently.
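The ratio of work to data can be made concrete with a rough count. The sketch below assumes square $N \times N$ operands and counts each matrix or vector element once; the function names are illustrative, not part of any library.

```python
def gemm_intensity(n: int) -> float:
    """Flops per element touched for C := C + A B with n-by-n matrices."""
    flops = 2 * n**3          # n^3 multiply-add pairs
    data = 3 * n**2           # elements of A, B and C
    return flops / data

def gemv_intensity(n: int) -> float:
    """Flops per element touched for y := y + A x with an n-by-n matrix."""
    flops = 2 * n**2          # n^2 multiply-add pairs
    data = n**2 + 2 * n       # elements of A, x and y
    return flops / data

for n in (100, 1000, 10000):
    print(n, round(gemm_intensity(n), 1), round(gemv_intensity(n), 3))
```

The matrix-multiplication ratio grows linearly with $N$, while the matrix-vector ratio stays below 2, which is why only the former can hide a slow memory system.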
Objective of this study
• Accelerate the SVD of rectangular matrices using the CSX600 processor.
• To exploit the potential of the CSX600, we reorganize the existing algorithm so that matrix multiplications can be used efficiently.
• Evaluate the performance and identify the technical problems that must be solved to improve it further.
Architecture and performance of the CSX600
• The CSX600 chip
  • One main processor
  • 96 floating-point processors
    • 64 bits
    • 2 flops / cycle
    • 128B register files
    • 6KB SRAM
  • Operates at 250MHz
  • Peak performance: 48GFLOPS
• ClearSpeed Advance board
  • Two CSX600 processors
  • 1GB DRAM
  • Connected to a host PC via the PCI-X bus
  • Peak performance: 96GFLOPS
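The peak figures follow directly from the numbers on this slide. The short calculation below is only arithmetic on those values; the function name is illustrative.

```python
def peak_gflops(processors: int, flops_per_cycle: int, clock_mhz: int) -> float:
    """Peak rate in GFLOPS = processors x flops per cycle x clock (MHz) / 1000."""
    return processors * flops_per_cycle * clock_mhz / 1000

chip = peak_gflops(processors=96, flops_per_cycle=2, clock_mhz=250)
board = 2 * chip                     # the Advance board carries two CSX600 chips
print(chip, board)
```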
Software environments for the CSX600
• Software Development Kit
  • Compiler: parallel programming with the Cn language
  • Debugger
  • Simulator
• CSXL library
  • Basic Linear Algebra Subprograms (BLAS) for the ClearSpeed Advance board
  • The library transfers the input data from the main memory to the board, performs the computation and returns the data to the main memory.
  • Sustained performance: 50GFLOPS with DGEMM (dense matrix-matrix multiplication)
• CSFFT library
Slide annotation: "We use this in this study."
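Performance of the CSXL DGEMM
[Figure: two plots of DGEMM performance (MFLOPS) for $C := C + AB$, where $A$ is $m \times k$, $B$ is $k \times n$ and $C$ is $m \times n$. One panel: $m = k = 450$, $1000 \le n \le 6000$, plotted against $n$. Other panel: $k = 450$, $1000 \le m = n \le 6000$, plotted against $n, m$.]
At least two of the three size parameters ($m$, $n$ and $k$) must be large to obtain good performance.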
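A measurement of this kind can be sketched on the host as follows. This uses NumPy's matrix product rather than the CSXL DGEMM, so the numbers it prints say nothing about the CSX600; the helper names and the choice of taking the best of a few repeats are assumptions.

```python
import time
import numpy as np

def dgemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in C := C + A B (A: m x k, B: k x n)."""
    return 2 * m * n * k

def measure_mflops(m: int, n: int, k: int, repeats: int = 3) -> float:
    """Time C += A @ B on the host and convert the best run to MFLOPS."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, k))
    B = rng.standard_normal((k, n))
    C = np.zeros((m, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        C += A @ B
        best = min(best, time.perf_counter() - start)
    return dgemm_flops(m, n, k) / best / 1e6

# One point of the first panel: m = k = 450, n = 1000.
print(measure_mflops(450, 1000, 450))
```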
Optimization of the rectangular SVD algorithm for the CSX600
Algorithm for rectangular SVD
• QR decomposition: $A = QR$ ($Q$: $m \times n$, $R$: $n \times n$)
• Bidiagonalization: $R = U_1 B V_1^T$ ($B$: $n \times n$)
• SVD of the bidiagonal matrix: $B = U_2 S V_2^T$
• Inverse transformation: $U' = U_1 U_2$ and $V = V_1 V_2$, so that $R = U' S V^T$
• Multiplication by Q: $U = QU'$, so that $A = U S V^T$
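The overall structure can be sketched with NumPy on the host. Note the assumptions: `numpy.linalg.svd` applied to $R$ fuses the bidiagonalization, the bidiagonal SVD and the inverse transformation into one call, and it does not use the Integrable SVD; `rectangular_svd` is an illustrative name.

```python
import numpy as np

def rectangular_svd(A):
    """SVD of a tall m-by-n matrix via a preliminary QR decomposition."""
    Q, R = np.linalg.qr(A, mode="reduced")   # A = Q R, Q: m x n, R: n x n
    U_prime, s, Vt = np.linalg.svd(R)        # R = U' S V^T (three steps fused)
    U = Q @ U_prime                          # multiplication by Q: U = Q U'
    return U, s, Vt

rng = np.random.default_rng(0)
A = rng.uniform(-0.5, 0.5, size=(2000, 50))
U, s, Vt = rectangular_svd(A)
print(np.allclose((U * s) @ Vt, A))          # checks A = U S V^T up to rounding
```

Only the first and last steps touch matrices with $m$ rows; everything in between works on the small $n \times n$ matrix $R$.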
Computational work of each part
When $m \gg n$ (e.g., $m = 100000$, $n = 5000$):
• QR decomposition ($A = QR$): $2mn^2$
• Bidiagonalization ($R = U_1 B V_1^T$): $(8/3)n^3$
• SVD of the bidiagonal matrix ($B = U_2 S V_2^T$): $O(n^2)$ to $O(n^3)$
• Inverse transformation ($U' = U_1 U_2$, $V = V_1 V_2$): $2n^3$ to $4n^3$
• Multiplication by Q ($U = QU'$): $4mn^2$
The QR decomposition and the multiplication by Q account for most of the computational work.
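Plugging the example sizes into these operation counts shows how strongly the two $mn^2$ terms dominate. The sketch below takes $4n^3$, the larger of the two figures quoted for the inverse transformation, and leaves out the bidiagonal SVD because the slide gives only its order; both choices are assumptions.

```python
def svd_work(m: int, n: int) -> dict:
    """Operation counts of the parts with explicit constants on the slide."""
    return {
        "QR decomposition": 2 * m * n**2,
        "Bidiagonalization": 8 * n**3 / 3,
        "Inverse transformation": 4 * n**3,   # assumption: larger quoted value
        "Multiplication by Q": 4 * m * n**2,
    }

work = svd_work(100000, 5000)
total = sum(work.values())
for name, flops in work.items():
    print(f"{name:24s} {flops:.2e} {100 * flops / total:5.1f}%")

share = (work["QR decomposition"] + work["Multiplication by Q"]) / total
print(f"QR + multiplication by Q: {100 * share:.1f}% of the counted work")
```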
Optimization of each part
Parts accelerated with the CSX600
• QR decomposition: $A = QR$
• Multiplication by Q: $U = QU'$
For these parts, we reorganize the algorithms to use matrix multiplications and accelerate the matrix multiplications with the CSXL BLAS.
Parts executed on the host only
• Bidiagonalization ($R = U_1 B V_1^T$): LAPACK DGEBRD
• SVD of the bidiagonal matrix ($B = U_2 S V_2^T$): Integrable SVD
• Inverse transformation ($U' = U_1 U_2$, $V = V_1 V_2$): LAPACK DORMBR
QR decomposition of A
Upper triangularization by Householder transformations:
$H_n \cdots H_2 H_1 A = A^{(n)}$
$A = H_1 H_2 \cdots H_n A^{(n)} = QR$
$A \to A^{(1)} \to A^{(2)} \to \cdots \to A^{(n)} = R$
where $H_1 A = (I - t_1 y_1 y_1^T) A = A^{(1)}$.
Each transformation is a level-2 BLAS operation, so the CSXL cannot be used.
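A minimal unblocked version of this procedure is sketched below in NumPy. Each step applies $H_j = I - t_j y_j y_j^T$ as a matrix-vector product followed by a rank-1 update, which is the level-2 pattern referred to above. The sign convention for $y_j$ and the function name are assumptions, not taken from the slides.

```python
import numpy as np

def householder_qr(A):
    """Return (Y, t, R) with H_n ... H_1 A = [R; 0], H_j = I - t_j y_j y_j^T."""
    A = np.array(A, dtype=float)              # work on a copy
    m, n = A.shape
    Y = np.zeros((m, n))
    t = np.zeros(n)
    for j in range(n):
        x = A[j:, j]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue                          # column already zero: H_j = I
        y = x.copy()
        y[0] += np.copysign(normx, x[0])      # avoids cancellation
        t[j] = 2.0 / (y @ y)
        Y[j:, j] = y
        # Level-2 work: y^T A (matrix-vector) and a rank-1 update.
        A[j:, j:] -= t[j] * np.outer(y, y @ A[j:, j:])
    return Y, t, np.triu(A[:n, :])

rng = np.random.default_rng(1)
A0 = rng.uniform(-0.5, 0.5, size=(30, 6))
Y, t, R = householder_qr(A0)

Q = np.eye(30)                                # Q = H_1 H_2 ... H_n
for j in range(6):
    Q = Q @ (np.eye(30) - t[j] * np.outer(Y[:, j], Y[:, j]))
print(np.allclose(Q[:, :6] @ R, A0))          # checks A = QR up to rounding
```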
Aggregating the Householder transformations
Blocking technique:
$H_n \cdots H_2 H_1 = (I - t_n y_n y_n^T) \cdots (I - t_2 y_2 y_2^T)(I - t_1 y_1 y_1^T) = I - Y_n T_n Y_n^T$
where $Y_n = [\, y_1 \mid y_2 \mid \cdots \mid y_n \,]$ ($m \times n$ matrix) and $T_n$ is an $n \times n$ lower triangular matrix.
Multiple Householder transformations can be aggregated and carried out by matrix multiplications, which can be accelerated with the CSXL.
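One way to build the triangular factor is the column-by-column recurrence used for the compact WY representation, sketched below. Appending $y_{k}$ adds the row $[\,-t_k\, y_k^T Y_{k-1} T_{k-1},\; t_k\,]$ to $T$. The recurrence and the name `build_T` are assumptions about how $T_n$ might be formed; the vectors here are random, since the identity holds for any reflectors.

```python
import numpy as np

def build_T(Y, t):
    """Lower triangular T with H_n ... H_1 = I - Y T Y^T, H_k = I - t_k y_k y_k^T."""
    n = Y.shape[1]
    T = np.zeros((n, n))
    for k in range(n):
        T[k, k] = t[k]
        if k > 0:
            # New row: -t_k * (y_k^T Y_{k-1}) T_{k-1}
            T[k, :k] = -t[k] * (Y[:, k] @ Y[:, :k]) @ T[:k, :k]
    return T

rng = np.random.default_rng(0)
m, n = 40, 5
Y = rng.standard_normal((m, n))
t = 2.0 / np.sum(Y * Y, axis=0)               # makes each H_k a reflector

P = np.eye(m)
for k in range(n):
    H = np.eye(m) - t[k] * np.outer(Y[:, k], Y[:, k])
    P = H @ P                                 # P = H_k ... H_1

T = build_T(Y, t)
A = rng.standard_normal((m, 8))
blocked = A - Y @ (T @ (Y.T @ A))             # three matrix multiplications
print(np.allclose(P @ A, blocked))
```

The blocked form touches $A$ only through matrix-matrix products, which is what makes it suitable for a level-3 BLAS.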
Blocking strategies for QR decomposition
Comparison of three blocking strategies ($L$: blocking size, $1 \le L \le n/2$)
• Block QR requires the smallest amount of work, but some of the work is done with the level-2 BLAS. The size of matrix multiplication is rather small.
• Recursive QR requires the largest amount of work, but all in the level-3 BLAS. The size of matrix multiplication is large.
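The slides do not spell out the recursive algorithm. The sketch below follows the well-known recursive formulation of Elmroth and Gustavson, which may differ in detail from the authors' implementation: the matrix is split into two column halves, the left half is factored recursively, the aggregated transformation is applied to the right half by matrix multiplications, and the two triangular factors are merged. All names are illustrative.

```python
import numpy as np

def _householder_column(x):
    """Return (y, t, alpha) with (I - t y y^T) x = alpha * e_1."""
    y = np.array(x, dtype=float)
    normx = np.linalg.norm(y)
    if normx == 0.0:
        return y, 0.0, 0.0
    alpha = -np.copysign(normx, y[0])
    y[0] -= alpha
    return y, 2.0 / (y @ y), alpha

def recursive_qr(A):
    """Return (Y, T, R) with (I - Y T Y^T) A = [R; 0], T lower triangular."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    if n == 1:
        y, t, alpha = _householder_column(A[:, 0])
        return y.reshape(m, 1), np.array([[t]]), np.array([[alpha]])
    n1 = n // 2
    Y1, T1, R11 = recursive_qr(A[:, :n1])                # left half
    A2 = A[:, n1:] - Y1 @ (T1 @ (Y1.T @ A[:, n1:]))      # update right half
    Y2_low, T2, R22 = recursive_qr(A2[n1:, :])           # trailing block
    Y2 = np.vstack([np.zeros((n1, n - n1)), Y2_low])
    T = np.zeros((n, n))
    T[:n1, :n1] = T1
    T[n1:, n1:] = T2
    T[n1:, :n1] = -T2 @ (Y2.T @ Y1) @ T1                 # merge the two factors
    R = np.zeros((n, n))
    R[:n1, :n1] = R11
    R[:n1, n1:] = A2[:n1, :]
    R[n1:, n1:] = R22
    return np.hstack([Y1, Y2]), T, R

rng = np.random.default_rng(2)
A = rng.uniform(-0.5, 0.5, size=(40, 7))
Y, T, R = recursive_qr(A)
Qt = np.eye(40) - Y @ T @ Y.T
print(np.allclose(Qt @ A, np.vstack([R, np.zeros((33, 7))])))
```

Apart from the single-column base case, all the work is in matrix products whose inner dimension grows to $n/2$ at the top level of the recursion.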
Numerical experiments
Computational environments
• Xeon 3.2GHz, 8GB memory
• Intel Fortran -O3 + Intel Math Kernel Library
• ClearSpeed Advance board
Problem
• SVD of an $m$ by $n$ matrix whose elements are random numbers in $[-0.5, 0.5]$
• $10000 \le m \le 100000$, $1000 \le n \le 4000$
Experiments
• Performance comparison of the three QR decomposition algorithms on the ClearSpeed board
• Speedup effect of the whole SVD with the ClearSpeed board
• Evaluation of accuracy
Performance of three QR decomposition algorithms
[Figure: computational time (sec) of Block QR, Recursive QR and Extended recursive QR for $m = 100000$, $n = 4000$.]
Performance of three QR decomposition algorithms
[Figure: computational time (sec) of Block QR, Recursive QR and Extended recursive QR for $m = 10000$, $n = 4000$.]
Speedup of the whole SVD with the CSX600
[Figure: computational time (sec) of LAPACK DGESDD and our code for $m = 10000$, $n = 1000$ ($m:n = 10:1$) and for $m = 100000$, $n = 4000$ ($m:n = 25:1$). Speedup labels on the chart: x 1.8, x 3.1, x 4, x 1.3, x 1.2.]
Speedup effect as a function of matrix size
[Figure: speedup of our code with recursive QR as a function of $m$ and $n$.]
Speedup = (time with the PC only) / (time with the PC + CSX600)
Evaluation of accuracy
[Figure: two plots against $n$ = 1000, 2000, 3000, with a legend by $m$.]
• Orthogonality of the left singular vectors: $\|U^T U - I\|_F$
• Residual: $\|U S V^T - A\|_F$
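Both accuracy measures are straightforward to compute. The sketch below evaluates them for a host-side NumPy SVD of a random matrix with entries in $[-0.5, 0.5]$; the matrix size is much smaller than in the experiments, and `svd_accuracy` is an illustrative name.

```python
import numpy as np

def svd_accuracy(A, U, s, Vt):
    """Return (||U^T U - I||_F, ||U S V^T - A||_F)."""
    n = U.shape[1]
    orthogonality = np.linalg.norm(U.T @ U - np.eye(n), "fro")
    residual = np.linalg.norm((U * s) @ Vt - A, "fro")
    return orthogonality, residual

rng = np.random.default_rng(0)
A = rng.uniform(-0.5, 0.5, size=(1000, 100))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
orthogonality, residual = svd_accuracy(A, U, s, Vt)
print(orthogonality, residual)
```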
Summary and future work
Summary
• We showed how to accelerate the rectangular SVD algorithm with the CSX600 floating-point accelerator.
• By modifying the algorithm to use large matrix multiplications, we obtained a speedup of up to 4 times over the LAPACK code on the 3.2GHz Xeon.
Future work
• Further improve the performance by optimizing the bidiagonalization and inverse transformation parts.
• Evaluate the performance on other accelerators such as the GRAPE-DR.
• Apply the approach to other matrix computations.