Applying Data Copy To Improve Memory Performance of General Array Computations
Qing Yi, University of Texas at San Antonio
Improving Memory Performance
[Diagram: memory hierarchy for two processors; per-processor Registers and Cache are the targets of locality optimizations, shared Memory the target of parallelization, and the network connection between machines the target of communication optimizations]
Compiler Optimizations For Locality
• Computation optimizations
  • Loop blocking, fusion, unroll-and-jam, interchange, unrolling (see the blocking sketch after this list)
  • Rearrange computations for better spatial and temporal locality
• Data-layout optimizations
  • Static layout transformation: a single layout for the entire application
    • No runtime overhead, but a tradeoff between different layout choices
  • Dynamic layout transformation: a different layout for each computation phase
    • Flexible, but can be expensive
• Combining computation and data transformations
  • Static layout transformation: transform the layout first, then the computation
  • Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
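To make the computation-optimization bullet concrete, here is a minimal loop-blocking sketch (my illustration, not from the slides) applied to the C = C + alpha*A*B kernel used throughout this talk; the tile size BS is a hypothetical parameter that would be tuned to the cache size:

    /* Blocked matrix multiplication: the j and k loops are tiled so
       that the touched blocks of A, B, and C stay in cache across
       iterations of the inner loops. */
    #define BS 64
    void matmul_blocked(int m, int n, int l, double alpha,
                        const double *A, const double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < l; kk += BS)
                for (int j = jj; j < (jj + BS < n ? jj + BS : n); ++j)
                    for (int k = kk; k < (kk + BS < l ? kk + BS : l); ++k)
                        for (int i = 0; i < m; ++i)
                            C[i + j*m] += alpha * A[i + k*m] * B[k + j*l];
    }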
Array Copying --- Related Work
• Dynamic layout transformation for arrays
  • Copy arrays into local buffers before the computation
  • Copy modified local buffers back to the arrays afterwards
• Previous work
  • Lam, Rothberg and Wolf; Temam, Granston and Jalby
    • Copy arrays after loop blocking
    • Assume arrays are accessed through affine expressions
    • Array copy is always safe under this assumption
  • Static data layout transformations
    • A single layout throughout the application --- always safe
  • Optimizing irregular applications
    • Data access patterns not known until runtime
    • Dynamic layout transformation --- through libraries
  • Scalar replacement
    • Equivalent to copying single array elements into scalars (see the sketch after this list)
    • Carr and Kennedy: applied to inner loops
• Question: how expensive is array copy? How difficult is it?
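A concrete scalar-replacement sketch (my illustration of the Carr-Kennedy idea, not code from the slides): after interchanging the loops so that k is innermost, C[i+j*m] is invariant in the inner loop and can be kept in a scalar, i.e. in a register:

    void matmul_scalar_replaced(int m, int n, int l, double alpha,
                                const double *A, const double *B, double *C)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                double c = C[i + j*m];      /* copy the element into a scalar */
                for (int k = 0; k < l; ++k)
                    c += alpha * A[i + k*m] * B[k + j*l];
                C[i + j*m] = c;             /* copy the scalar back */
            }
    }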
What Is New
• Array copy for arbitrary loop computations
  • A stand-alone optimization, independent of blocking
  • Works for computations with non-affine array accesses
• General array copy algorithm
  • Works on arbitrarily shaped loops
  • Automatically inserts copy operations to ensure safety
  • Heuristics to reduce buffer size and copy cost
  • Specializes to scalar replacement when buffers have constant size
• Applications
  • Improve cache and register locality
  • Both combined with blocking and without blocking
• Future work
  • Communication and parallelization
  • Empirical tuning of optimizations
  • An interface that allows different application heuristics
Apply Array Copy: Matrix Multiplication

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

• Step 1: build the dependence graph
  • True, output, anti and input deps between array accesses
  • Is each dependence precise?
[Dependence graph: nodes C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], B[k+j*l]]
Apply Array Copy: Matrix Multiplication (cont.)

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

• Steps 2-3: separate array references connected by imprecise deps
  • Impose an order on all array references: C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
  • Remove all back edges
  • Apply typed fusion (a grouping sketch follows below)
[Dependence graph: nodes C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], B[k+j*l]]
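The grouping step can be pictured with a minimal union-find sketch (my simplification; the actual pass applies the typed-fusion algorithm to the ordered graph, which this does not reproduce). References joined by precise dependences are merged into one copy group, so both C references share a buffer while A and B each get their own:

    #include <stdio.h>

    enum { C_READ, A_REF, B_REF, C_WRITE, NREFS };
    static const char *name[NREFS] =
        { "C[i+j*m](R)", "A[i+k*m]", "B[k+j*l]", "C[i+j*m](W)" };

    static int parent[NREFS];
    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    static void unite(int a, int b) { parent[find(a)] = find(b); }

    int main(void)
    {
        for (int i = 0; i < NREFS; ++i) parent[i] = i;
        /* The two C references have identical subscripts, so the
           dependence between them is precise: merge their groups.
           A and B have only self input deps and stay separate. */
        unite(C_READ, C_WRITE);
        for (int i = 0; i < NREFS; ++i)
            printf("%-12s -> group %d\n", name[i], find(i));
        return 0;
    }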
Array Copy with Imprecise Dependences

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

• Array references connected by imprecise deps
  • Cannot precisely determine a mapping between their subscripts
  • May refer to the same location on some iterations but not on others
  • Not safe to copy them into a single buffer (see the demo below)
  • Apply the typed-fusion algorithm to separate them
[Dependence graph: C[f(i,j,m)] --imprecise--> C[i+j*m]; A[i+k*m]; B[k+j*l]]
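A hypothetical demo of why such copying is unsafe (the names a, b, idx are mine, standing in for an unanalyzable subscript like f(i,j,m)): if the read reference is buffered before the loop while the write may alias it, iterations where the write hits a later-read location see a stale value:

    #include <stdio.h>

    int main(void)
    {
        int a[4] = {1, 2, 3, 4}, b[4] = {1, 2, 3, 4};
        int idx[4] = {1, 1, 3, 3};   /* unknown until runtime */

        /* in place: each write is immediately visible to later reads */
        for (int i = 0; i < 4; ++i) a[idx[i]] = a[i] + 10;

        /* reads buffered up front: writes to aliased locations are missed */
        int rbuf[4] = {1, 2, 3, 4};
        for (int i = 0; i < 4; ++i) b[idx[i]] = rbuf[i] + 10;

        for (int i = 0; i < 4; ++i)
            printf("a[%d]=%2d  b[%d]=%2d%s\n", i, a[i], i, b[i],
                   a[i] != b[i] ? "   <-- mismatch" : "");
        return 0;
    }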
Profitability of Applying Array Copy

    // location to copy A: outside the j loop
    for (j=0; j<n; ++j)        // location to copy C
      for (k=0; k<l; ++k)      // location to copy B
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

• Each buffer should be copied at most twice (splitting groups)
• Determine the outermost location to perform the copy
  • No interference with other groups
  • Enforce a size limit on the buffer (constant size => scalar replacement)
• Ensure reuse of copied elements
  • Lower the copy position if no extra reuse is gained
[Diagram: dependence graph with copy locations annotated at the j, k and i loop levels]
Array Copy Result: Matrix Multiplication

    Copy A[0:m, 0:l] to A_buf;
    for (j=0; j<n; ++j) {
      Copy C[0:m, j*m] to C_buf;
      for (k=0; k<l; ++k) {
        Copy B[k+j*l] to B_buf;
        for (i=0; i<m; ++i)
          C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m] * B_buf;
      }
      Copy C_buf[0:m] back to C[0:m, j*m];
    }

• Dimensionality of the buffer is enforced by command-line options
• Can be applied to arbitrarily shaped loop structures
• Can be applied independently or after blocking
(A compilable C rendering of this transformed kernel follows below.)
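One possible rendering of the transformed kernel in plain C (an illustrative sketch; the malloc-based buffers and the function name are my assumptions, not the tool's generated code):

    #include <stdlib.h>
    #include <string.h>

    /* A is buffered once outside all loops, one column of C per j
       iteration, and one element of B per k iteration (the buffer
       for B degenerates to a scalar). */
    void matmul_copied(int m, int n, int l, double alpha,
                       const double *A, const double *B, double *C)
    {
        double *A_buf = malloc((size_t)m * l * sizeof *A_buf);
        double *C_buf = malloc((size_t)m * sizeof *C_buf);
        memcpy(A_buf, A, (size_t)m * l * sizeof *A_buf);

        for (int j = 0; j < n; ++j) {
            memcpy(C_buf, C + (size_t)j * m, (size_t)m * sizeof *C_buf);
            for (int k = 0; k < l; ++k) {
                double B_buf = B[k + j*l];
                for (int i = 0; i < m; ++i)
                    C_buf[i] += alpha * A_buf[i + k*m] * B_buf;
            }
            memcpy(C + (size_t)j * m, C_buf, (size_t)m * sizeof *C_buf);
        }
        free(A_buf);
        free(C_buf);
    }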
Experiments
• Implementation
  • Loop transformation framework by Yi, Kennedy and Adve
  • ROSE compiler infrastructure at LLNL (Quinlan et al)
• Benchmarks
  • dgemm (matrix multiplication, LAPACK)
  • dgetrf (LU factorization with partial pivoting, LAPACK)
  • tomcatv (mesh generation with Thompson solver, SPEC95)
• Machines
  • A Dell PC with two 2.2GHz Intel Xeon processors: 512KB cache per processor, 2GB memory
  • An SGI workstation with a 195MHz R10000 processor: 32KB 2-way L1, 1MB 4-way L2, 256MB memory
  • A single 8-way P655+ node of an IBM terascale machine: 32KB 2-way L1, 0.75MB 4-way L2, 16MB 8-way L3, 16GB memory
Summary
• When should we apply scalar replacement?
  • Profitable unless register pressure is too high
  • 3-12% improvement observed, with no overhead
• When should we apply array copy?
  • When regular conflict misses occur
  • When prefetching of array elements is needed
  • 10-40% improvement observed; 0.5-8% overhead when not beneficial
• The optimizations are not beneficial on the IBM node
  • Neither blocking nor 2-dim copying is profitable there
  • Integer operation overhead is too high; to be investigated further