Efficient Independent Component Analysis on a GPU
Rui Ramalho, Pedro Tomás, Leonel Sousa
Frontiers of GPU Computing 2010
Outline
• Motivation
• Independent Component Analysis
• FastICA Algorithm
• Experimental Results
• Conclusions
Blind Source Separation
• Blind Source Separation (BSS) is a signal processing technique that separates a set of source signals from a set of mixed signals.
• Little is known about the original signals or the mixing process, only that the original signals are uncorrelated.
Blind Source Separation: Cocktail Party
• A classical example of blind source separation is the cocktail party problem.
• A number of people are talking simultaneously in a crowded room (at a cocktail party).
• Despite all the noise and cross-talk, a human brain has little difficulty following a single conversation.
• Machines have to rely on blind source separation.
Blind Source Separation: Applications
• Blind Source Separation has also been used in several other domains:
• EEG/MEG measurements (each sensor picks up a mixture of brain electrical activity, and BSS can be used to separate and identify the underlying sources).
• Denoising images (by treating the noise as an independent source, it is possible to separate it from the image's original components).
• Financial analysis (BSS can be used to uncover hidden factors in financial data).
Independent Component Analysis
• Independent Component Analysis (ICA) is a special case of Blind Source Separation.
• The mixed signal's sources are assumed to be statistically independent (BSS only assumes the sources are statistically uncorrelated).
Independent Component Analysis: ICA Model
• Under the ICA model, the observed variables x are assumed to be a linear combination of several independent sources/signals s:
  x = A s
• The objective of ICA is to find the matrix W that inverts the mixing operation performed by the matrix A, without knowledge of A or s:
  y = W x ≈ s
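The mixing model above can be sketched in a few lines of NumPy. The sources, mixing matrix, and sample count below are hypothetical illustrations, not the paper's data; the unmixing matrix is computed from the known A only to make the model concrete (in real ICA, A is unknown):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two hypothetical independent, non-Gaussian sources (rows of s)
s = np.vstack([
    np.sign(rng.standard_normal(n)) * rng.exponential(1.0, n),  # super-Gaussian
    rng.uniform(-1.0, 1.0, n),                                  # sub-Gaussian
])

A = np.array([[1.0, 0.5],
              [0.3, 1.0]])   # mixing matrix (unknown in practice)
x = A @ s                    # observed mixtures: x = A s

# ICA seeks W such that y = W x recovers s (up to scale and permutation).
# Here we cheat and invert A directly, purely to illustrate the model:
W = np.linalg.inv(A)
y = W @ x
```

With the ideal W, the recovered y matches s exactly; FastICA's job is to find such a W from x alone.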
Independent Component Analysis: Measuring Statistical Independence
• One of the ways of measuring statistical independence is through negentropy:
  J(y) = H(y_gauss) − H(y)
• H(y) is the differential information entropy of y:
  H(y) = −∫ p(y) log p(y) dy
• In practice J(y) needs to be estimated. The estimator used by FastICA is:
  J(y) ∝ [ E{G(y)} − E{G(ν)} ]²
• G is a nonquadratic nonlinear function
• ν is a Gaussian variable of zero mean and unit variance
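The estimator can be sketched directly from the formula. This is a minimal NumPy illustration (function names and the Monte Carlo reference for E{G(ν)} are my own choices, not the paper's), using the common contrast G(u) = log cosh(u):

```python
import numpy as np

def negentropy_estimate(y, G=lambda u: np.log(np.cosh(u)), n_mc=100_000, seed=0):
    """J(y) ~ (E{G(y)} - E{G(nu)})^2 for zero-mean, unit-variance y,
    with nu a standard Gaussian reference (estimated here by sampling)."""
    rng = np.random.default_rng(seed)
    nu = rng.standard_normal(n_mc)
    return (G(y).mean() - G(nu).mean()) ** 2

rng = np.random.default_rng(1)
g = rng.standard_normal(50_000)                    # Gaussian: negentropy ~ 0
u = rng.uniform(-np.sqrt(3), np.sqrt(3), 50_000)   # uniform, unit variance: > 0
```

A Gaussian signal should score near zero, while any non-Gaussian signal of the same variance scores higher, which is exactly why maximizing this estimate pulls the algorithm toward independent components.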
FastICA Algorithm
• The procedure for computing the independent components can be divided into three stages:
• Pre-processing: allows a number of simplifications of the FastICA algorithm.
• Weight vector computation: the FastICA algorithm itself.
• Decorrelation: prevents the algorithm from repeatedly converging to the same solutions.
FastICA Algorithm: Preprocessing & Weight Vector Computation
• Preprocessing includes general tasks such as centering, whitening, or filtering the data.
• Each weight vector is computed with the fixed-point iteration:
  w⁺ = E{x g(wᵀx)} − E{g′(wᵀx)} w,  w ← w⁺ / ‖w⁺‖
• g is the derivative of the nonlinear contrast function G
• This algorithm can be modified to compute all the ICs simultaneously (a symmetric approach).
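The one-unit fixed-point iteration can be sketched on the CPU as follows. This is a NumPy reference version, not the paper's CUBLAS implementation; the demo data, mixing matrix, and convergence threshold are illustrative assumptions:

```python
import numpy as np

def fastica_one_unit(x, max_iter=200, tol=1e-8, seed=0):
    """One-unit FastICA fixed-point iteration on whitened data x (d x n),
    with g = tanh, the derivative of the contrast G(u) = log cosh(u)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(x.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ x
        g, dg = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        # w+ = E{x g(w'x)} - E{g'(w'x)} w, then renormalize
        w_new = (x * g).mean(axis=1) - dg.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:   # converged (up to sign)
            return w_new
        w = w_new
    return w

# Demo on synthetic data (hypothetical, not the paper's test set):
rng = np.random.default_rng(1)
s = np.vstack([rng.uniform(-1, 1, 20_000), np.sign(rng.standard_normal(20_000))])
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s
x -= x.mean(axis=1, keepdims=True)            # centering (preprocessing)
lam, E = np.linalg.eigh(np.cov(x))
x_white = (E / np.sqrt(lam)) @ E.T @ x        # whitening (preprocessing)
w = fastica_one_unit(x_white)
```

After convergence, the projection w·x_white should line up with one of the original sources up to sign and scale.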
FastICA Algorithm: GPU Implementation
• The preprocessing stage is generally inexpensive and was implemented on the CPU.
• The FastICA algorithm is composed mostly of matrix operations that can be efficiently implemented using CUBLAS.
• The computations of the non-linear functions g and g′ have no dependencies, so they map directly to element-wise GPU kernels.
• The expected values are computed using hierarchical (tree-based) additions, storing the intermediate results in the GPU's shared memory.
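The hierarchical reduction can be emulated on the CPU to show its structure. This is a serial NumPy sketch of the idea only; the real kernel runs the pairwise additions of each step in parallel across threads, with the working array in shared memory:

```python
import numpy as np

def tree_mean(v):
    """Emulate a GPU-style hierarchical reduction for computing E{.}:
    each step adds the upper half of the active range into the lower
    half, halving the range, until one partial sum remains."""
    v = np.asarray(v, dtype=np.float64).copy()
    n = len(v)
    size = 1
    while size < n:
        size *= 2
    v = np.concatenate([v, np.zeros(size - n)])  # pad to a power of two
    while size > 1:
        half = size // 2
        v[:half] += v[half:size]                 # one parallel reduction step
        size = half
    return v[0] / n
```

The tree shape is what matters: log2(N) dependent steps instead of N, which is why the GPU version keeps the intermediates in fast shared memory.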
FastICA Algorithm: Decorrelation
• To keep the estimated weight vectors from converging to the same results, they need to be decorrelated.
• Deflation approach: after estimating p independent components, subtract the projections of the previous p components from the (p+1)-th estimate, then renormalize:
  w_{p+1} ← w_{p+1} − Σ_{j=1..p} (w_{p+1}ᵀ w_j) w_j
• An alternative is to apply a symmetric decorrelation after every iteration:
  W ← (W Wᵀ)^{−1/2} W
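The deflation step is a Gram-Schmidt projection and fits in a few lines. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def deflate(w, W_prev):
    """Gram-Schmidt deflation: subtract from the new estimate w the
    projections onto the already-estimated components (rows of W_prev),
    then renormalize."""
    for wj in W_prev:
        w = w - (w @ wj) * wj
    return w / np.linalg.norm(w)
```

Applied after each new component converges, this forces the next estimate into the orthogonal complement of the components found so far.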
Decorrelation: The Tricky Bit
• The computation of (W Wᵀ)^{−1/2} is complex and can be done using the eigendecomposition of W Wᵀ.
• This can be done with already available CPU-based high-performance libraries (LAPACK).
• Alternatively, the eigenvalues can be computed directly on the GPU.
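The eigendecomposition route can be sketched in NumPy (whose `eigh` wraps the same LAPACK symmetric eigensolvers the slide refers to); the function name is illustrative:

```python
import numpy as np

def symmetric_decorrelate(W):
    """Compute (W W^T)^(-1/2) W via the eigendecomposition of the
    symmetric matrix W W^T = E diag(lam) E^T, so that
    (W W^T)^(-1/2) = E diag(lam^(-1/2)) E^T."""
    lam, E = np.linalg.eigh(W @ W.T)
    inv_sqrt = (E / np.sqrt(lam)) @ E.T    # E diag(1/sqrt(lam)) E^T
    return inv_sqrt @ W
```

The result has orthonormal rows, which is exactly the decorrelation property the symmetric approach needs after every iteration.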
Jacobi Eigenvalue Algorithm
• The Jacobi eigenvalue algorithm successively applies Jacobi rotations to annihilate the off-diagonal elements of a given symmetric matrix A:
  A′ = Jᵀ A J
• J is a Jacobi rotation matrix: an identity matrix except for the entries J_pp = J_qq = c and J_pq = −J_qp = s
• c = cos(θ), s = sin(θ), with θ chosen to zero the off-diagonal element A_pq
Jacobi Eigenvalue Algorithm
• Each Jacobi rotation only changes two rows and two columns of the matrix A. By carefully choosing the order of the rotations, up to N/2 non-conflicting rotations can be performed simultaneously.
• The matrix J is very sparse, making CUBLAS (which targets dense operations) unsuitable for this algorithm.
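A serial sketch of the cyclic Jacobi method makes the rotation explicit. This NumPy version applies rotations one at a time for clarity; the GPU variant described above would instead apply up to N/2 disjoint (p, q) rotations of a sweep in parallel. The sweep count and tolerance are illustrative choices:

```python
import numpy as np

def jacobi_eigenvalues(A, sweeps=20):
    """Cyclic Jacobi: repeatedly zero off-diagonal entries of a symmetric
    matrix with plane rotations A' = J^T A J; the diagonal converges to
    the eigenvalues."""
    A = np.array(A, dtype=np.float64)
    n = A.shape[0]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-15:
                    continue
                # Angle that annihilates A[p, q]: tan(2t) = 2A_pq/(A_qq - A_pp)
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J               # one Jacobi rotation
    return np.sort(np.diag(A))
```

Since each rotation touches only rows/columns p and q, rotations on disjoint index pairs commute in effect, which is what enables the N/2-way parallelism.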
Decorrelation: Iterative Algorithm
• Another alternative to the eigenvalue problem is to avoid its computation altogether: an iterative algorithm (Algorithm 4) converges to the decorrelation expression presented earlier.
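The slide's Algorithm 4 is not reproduced in this transcript; the iteration commonly used for this purpose in the FastICA literature (Hyvärinen's iterative symmetric orthogonalization, shown here as an assumption about what Algorithm 4 contains) converges to (W Wᵀ)^{−1/2} W using only matrix multiplications, which is why it maps so well to CUBLAS:

```python
import numpy as np

def iterative_decorrelate(W, max_iter=100, tol=1e-12):
    """Iterative symmetric orthogonalization: after normalizing by the
    spectral norm, W <- (3/2) W - (1/2) W W^T W drives every singular
    value of W to 1, i.e. converges to (W W^T)^(-1/2) W."""
    W = W / np.sqrt(np.linalg.norm(W @ W.T, ord=2))
    for _ in range(max_iter):
        W_new = 1.5 * W - 0.5 * (W @ W.T @ W)
        if np.linalg.norm(W_new - W) < tol:
            return W_new
        W = W_new
    return W
```

Every step is a dense matrix product, so unlike the Jacobi approach this formulation never leaves CUBLAS territory.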
Decorrelation: Comparison of Decorrelation Algorithms
• Experimental results show that the proposed GPU-based Jacobi eigenvalue algorithm is outperformed by a CPU-based LAPACK eigenvalue algorithm using Multiple Relatively Robust Representations (MRRR).
• However, avoiding the explicit computation of the eigenvalues remains the fastest approach.
Experimental Results: Experimental Setup
• A hyperbolic tangent was chosen as a typical non-linear function g.
• The iterative decorrelation algorithm that avoids the explicit computation of the eigenvalues is used in the decorrelation step.
Experimental Results: Single-Core CPU vs. GPU
• The accelerated portion of the algorithm (the main loop) is sped up by up to 110x when estimating 256 ICs with 10 000 samples.
• As the accelerated portion gets faster, the influence of the unaccelerated part of the algorithm (the preprocessing stage) grows, noticeably reducing the global speedup.
Experimental Results: Single-Core CPU vs. GPU
• The accelerated loop ceases to be the bottleneck.
• The additional penalty of transferring data to and from the GPU is negligible.
Experimental Results: Multicore CPU vs. GPU
• The parallelized GPU algorithm was also tested on a more powerful GeForce GTX 285, with 240 cores. This implementation was compared with a CPU-based implementation on an Intel Core 2 Quad Q9950 (@2.83 GHz) using Intel's high-performance MKL library.
• It was possible to attain a speedup of around 12x.
Conclusions
• By using a GPU it was possible to speed up the FastICA algorithm by 55x when estimating 256 ICs with 1000 samples each, in comparison with a serial version running on a single core of a CPU.
• These results can be further improved, as the current bottleneck lies in the preprocessing stage, which is still executed on the CPU.