Parallel Out-of-Core "Tall Skinny" QR
James Demmel, Mark Hoemmen, et al.
{demmel, mhoemmen}@eecs.berkeley.edu
EECS (Electrical Engineering and Computer Sciences), UC Berkeley
Berkeley Par Lab (Parallel Computing Laboratory)
Why out-of-core?
• Can solve huge problems without a huge machine
• Workstation as a cluster alternative:
  • On a cluster, job queue wait counts as "run time"
  • Instant turnaround for the compile-run-debug cycle
  • Full(-speed) access to graphical development tools
  • Easier to get a grant for a < $5,000 workstation than for a > $10M cluster!
• Frees precious, power-hungry DRAM for other tasks
• Great for embedded hardware: cell phones, robots, …

Why parallel out-of-core?
• Less compute time per block frees processor time for other tasks
• Total amount of I/O is ideally independent of the number of cores
• With enough flops per datum, reads and writes can be hidden behind computation

"Tall Skinny" QR (TSQR) algorithm
• For matrices with many more rows than columns (m >> n)
• Algorithm:
  • Divide the matrix into block rows
  • R: a reduction over the blocks, with QR factorization as the reduction operator
  • Q: stored implicitly in the reduction tree
• Can use any reduction tree:
  • A binary tree minimizes the number of messages in the parallel case
  • A linear tree minimizes bandwidth costs in the sequential (out-of-core) case
  • Other trees optimize for other cases
• Possible parallelism:
  • In the reduction tree (one processor per local factorization)
  • In the local factorizations (e.g., multithreaded BLAS)
(Minimal sketches of the linear-tree and binary-tree reductions appear after the Conclusions below.)

Some TSQR reduction trees
[Figure] Top left: TSQR on a binary tree with 4 processors. Top right: TSQR on a linear tree with 1 processor. Bottom: TSQR on a "hybrid" tree with 4 processors. In all three diagrams, steps progress from left to right. The input matrix A is divided into blocks of rows. At each step, the blocks in core are highlighted in grey. Each grey box indicates a QR factorization; multiple blocks in a factorization are stacked. Arrows show data dependencies.

Applications of TSQR algorithm
• Orthogonalization step in iterative methods for sparse systems
  • A bottleneck for new Krylov methods we are developing
• Block column QR factorization
  • For QR of matrices stored in any 2-D block cyclic format
  • Currently a bottleneck in ScaLAPACK
• Solving linear systems
  • More flops, but less data motion, than LU
  • Works for sparse and dense systems
• Solving least-squares optimization problems

Implementation and benchmark
• Linear tree, with multithreaded BLAS and LAPACK (Intel MKL)
• Fixed the number of blocks at 10
• Varied the number of threads (only 2 cores available here, on an Itanium 2)
• Varied the number of rows (1000 to 10000) and columns (100 to 1000) per block
• Compared with standard (in-core) multithreaded QR, which:
  • Read the input matrix from disk using the same blocks as TSQR
  • Wrote the Q factor to disk, just like TSQR
(A sketch of such an out-of-core harness also follows below.)

Preliminary results
[Figure: performance of out-of-core TSQR versus standard in-core QR]

What does out-of-core want?
• For performance:
  • Non-blocking I/O operations
  • A QoS guarantee on disk bandwidth per core
• For tuning:
  • Knowing the ideal disk bandwidth per core
  • Being able to measure the actual disk bandwidth used per core
  • Awareness of I/O buffering implementation details
• For ease of coding:
  • Automatic "(un)pickling" ((un)packing of structures)

Conclusions
• Out-of-core TSQR competes favorably with standard QR, if the blocks are large enough and "square enough"
• Some benefit from multithreaded BLAS, but not much
  • Suggests we should try a hybrid tree
• Painful development process:
  • Must serialize data manually
  • How do we measure disk bandwidth per core?
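
To make the TSQR reduction concrete, here is a minimal sequential (linear-tree) sketch in Python with NumPy. It is only an illustration of the reduction described above, not the authors' implementation (which used LAPACK and Intel MKL); the name tsqr_linear and its arguments are chosen for this sketch, and it returns only R, discarding the implicitly stored Q factors for brevity.

```python
import numpy as np

def tsqr_linear(blocks):
    """Linear-tree TSQR: fold QR factorization over the row blocks.

    blocks -- iterable of row blocks of a tall-skinny matrix A
              (all with the same number of columns n).
    Returns the n-by-n R factor of A. The Q factors of the
    intermediate factorizations, which TSQR stores implicitly in
    the reduction tree, are discarded here for brevity.
    """
    it = iter(blocks)
    _, R = np.linalg.qr(next(it))          # factor the first block
    for block in it:
        # Stack the running R on top of the next block and refactor:
        # QR is the reduction operator on the linear tree.
        _, R = np.linalg.qr(np.vstack([R, block]))
    return R

# Example: a 100000 x 50 matrix split into 10 row blocks.
A = np.random.randn(100_000, 50)
R = tsqr_linear(np.array_split(A, 10))
# R matches the R of a direct QR of A up to the signs of its rows.
```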
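
The parallel case runs the same reduction over a binary tree: each processor factors its own block, then pairs of small R factors are stacked and refactored at each level. The sketch below (again an illustration, not the poster's code) runs the tree serially just to show the data flow; in the parallel algorithm each tree level is one round of messages.

```python
import numpy as np

def tsqr_binary(blocks):
    """Binary-tree TSQR reduction, run serially here; in the parallel
    algorithm each leaf and inner-node factorization would run on its
    own processor, one tree level per communication round."""
    # Leaves: local QR of each block; keep only the small R factors.
    Rs = [np.linalg.qr(b)[1] for b in blocks]
    # Inner nodes: stack R factors pairwise and refactor until one remains.
    while len(Rs) > 1:
        paired = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]
                  for i in range(0, len(Rs) - 1, 2)]
        if len(Rs) % 2:            # odd R out is passed up unchanged
            paired.append(Rs[-1])
        Rs = paired
    return Rs[0]
```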
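
Finally, a sketch of the out-of-core pattern itself, in the spirit of the benchmark: keep only one row block (plus the small running R) in core, streaming blocks from disk. The memory-mapped file, the rows_per_block parameter, the file name A.bin, and the timing harness are all assumptions made for this illustration; the poster's benchmark used its own blocked reader and also wrote the Q factor back to disk, which this sketch omits.

```python
import time
import numpy as np

def tsqr_out_of_core(path, m, n, rows_per_block, dtype=np.float64):
    """Linear-tree TSQR over an m-by-n matrix stored row-major on disk.

    Only one row block (plus the n-by-n running R) is in core at a
    time. Reads are synchronous here; the poster argues for
    non-blocking I/O so they could be overlapped with the QR flops.
    """
    A = np.memmap(path, dtype=dtype, mode='r', shape=(m, n))
    R = None
    for start in range(0, m, rows_per_block):
        block = np.asarray(A[start:start + rows_per_block])  # read one block into core
        stacked = block if R is None else np.vstack([R, block])
        _, R = np.linalg.qr(stacked)
    return R

# Toy comparison against an in-core QR that reads the whole matrix at once.
m, n = 100_000, 200
np.random.randn(m, n).tofile("A.bin")
t0 = time.perf_counter()
R_ooc = tsqr_out_of_core("A.bin", m, n, rows_per_block=m // 10)  # 10 blocks
t1 = time.perf_counter()
R_std = np.linalg.qr(np.fromfile("A.bin", dtype=np.float64).reshape(m, n))[1]
t2 = time.perf_counter()
print(f"out-of-core TSQR: {t1 - t0:.2f}s, in-core QR: {t2 - t1:.2f}s")
```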