Apex-Map Status
Erich Strohmaier and Hongzhang Shan
Apex-Map generator
• Benchmark code will be generated based on the following performance parameters:
• PARALLEL: N / Y
• PARALLEL LANGUAGE: MPI / SHMEM / UPC / CAF
• ACCESS PATTERN: RANDOM / STRIDE
• SPATIAL LOCALITY (L): [1, M]; Default: {1, 4, 16, …, 65536}
• CONCURRENCY (I): [1, X]; Default: 1024
• TEMPORAL LOCALITY (a): [0, 1]; Default: {1.0, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001}
• MEMORY SIZE (M): Default: 67,108,864 words = 512 MB / process
• REGISTER PRESSURE (R): [1, X]; Default: 1
• COMPUTATIONAL INTENSITY (CI): [1, X]; Default: 1
• ACCESS MODE: FUSED / NESTED
• RESULTS: SCALAR / ARRAY (left-hand side of statement)
• REPEAT TIMES: 100
• WARMUP TIMES: 10
• CPU MHZ: 1900
• PLATFORM: BASSI
• VERSION: 1.6
• STRIDE: X
• X: any positive integer
NAS CG (one stream)
Source Code:
==========
DO j = 1, lastrow-firstrow+1
   sum = 0.d0
   DO k = rowstr(j), rowstr(j+1)-1
      sum = sum + a(k)*p(colidx(k))
   ENDDO
   w(j) = sum
ENDDO
One-Stream Approach: use a single Apex-Map stream to simulate the performance behavior of NAS CG. The temporal locality currently has to be determined by experiment.
Performance Prediction for CG (using one stream)
The results indicate that Apex-Map with a single stream can simulate the performance of CG across the different data sets, using temporal locality values in the range 0.01 to 0.03 (exception: data set S on Jacquard).
NAS CG (two streams)
Source Code:
==========
DO j = 1, lastrow-firstrow+1
   sum = 0.d0
   DO k = rowstr(j), rowstr(j+1)-1
      sum = sum + a(k)*p(colidx(k))
   ENDDO
   w(j) = sum
ENDDO
Two-Stream Approach: the arrays a and p are treated as separate streams.
Perf. of CG = 1 / (1/Perf_stream1 + 1/Perf_stream2)
Performance Prediction for CG (using two streams)
With the two-stream approach, the predicted performance matches very well on Jacquard. On Franklin, however, only the large data sets match well.
GUPS
Source Code:
==========
for (i = 0; i < NUPDATE; i++) {
   ran = (ran << 1) ^ (((s64int) ran < 0) ? POLY : 0);
   Table[ran & (TableSize-1)] ^= ran;
}
Results match well!
Matrix-Mul (stride)
Source Code:
==========
for (i = 0; i < N; i++) {
   for (j = 0; j < K; j++) {
      tmp = 0;
      for (k = 0; k < M; k++) {
         tmp += a[i*M+k] * b[k*K+j];
      }
      c[i*K+j] = tmp;
   }
}
There are two choices for Apex-Map:
1. Use a random stream
2. Use a stride stream
Performance Prediction for Matrix-Mul (stride)
1. The stride stream matches well.
2. There is a big performance gap between Matrix-Mul and Apex-Map when the random stream is used.
Matrix-Mul (vector)
Source Code:
==========
for (i = 0; i < N; i++)
   for (k = 0; k < M; k++)
      for (j = 0; j < K; j++)
         c[i*K+j] += a[i*M+k] * b[k*K+j];
On Franklin, performance matches well when the temporal locality is 0.02. On Jacquard, the match is not close (compiler inefficiency for the Apex-Map kernels?)
NBODY
Source Code (Loop Body):
===================
SUBVEC(p->position, bod->position, diff)
DOTPROD(diff, diff, distSq)
distSq += SOFTSQ
dist = sqrt(distSq)
factor = p->mass/dist
bod->phi -= factor
factor = factor / distSq
MULTVEC(diff, factor, extraAcc)
ADDVEC(bod->acc, extraAcc, bod->acc)
• FDIV and FSQRT are implemented differently across platforms, which affects the computation of MF/s and the Computational Intensity (CI):
• use a test program to determine the ratios between fdiv, fsqrt, and fadd to decide the CI for Apex-Map
• use the number of loops executed per second as the performance metric instead of MF/s
Performance Prediction for Nbody
The Apex-Map results match Nbody well on Franklin, but show a big difference on Jacquard.
STREAM
Source Code:
==========
for (i = 0; i < N; i++) c[i] = a[i];
for (i = 0; i < N; i++) b[i] = s*c[i];
for (i = 0; i < N; i++) c[i] = a[i]+b[i];
for (i = 0; i < N; i++) a[i] = b[i]+s*c[i];
Big performance differences are due to:
1. Static vs. dynamic memory allocation
2. Kernel implementation details
STREAM: Static vs. Dynamic

Static:
        .text
        .align 16
        .globl tuned_STREAM_Copy
tuned_STREAM_Copy:
..Dcfb4:
        subq $8,%rsp
..Dcfi4:
        ## lineno: 0
..EN5:
        ## lineno: 395
        movl $c+0,%edi
        movl $a+0,%esi
        movl $1048576,%edx
        .p2align 4,,1
        call __c_mcopy8
        ## lineno: 396
        addq $8,%rsp
        ret

Dynamic:
        .text
        .align 16
        .globl tuned_STREAM_Copy
tuned_STREAM_Copy:
..Dcfb4:
        ## lineno: 0
..EN5:
        ## lineno: 402
        xorl %ecx,%ecx
        movl $524288,%edx
        movl $8,%eax
        .align 16
.LB2164:
        ## lineno: 402
        movq a(%rip),%rsi
        movq c(%rip),%r8
        decl %edx
        movq (%rsi,%rcx),%rdi
        movq %rdi,(%r8,%rcx)
        addq $16,%rcx
        movq (%rsi,%rax),%r9
        movq %r9,(%r8,%rax)
        addq $16,%rax
        testl %edx,%edx
        jg .LB2164
        ## lineno: 403
        ret

Different code is generated for static and dynamic allocation (may cause a 50% performance difference): the static version compiles to a call to the optimized __c_mcopy8 copy routine, while the dynamic version gets an explicit copy loop that reloads the array base pointers on every iteration.
Random Nested (R=1, CI=1)

Array:
for (i = 0; i < times; i++) {
   index_length = B / L;
   initIndexArray(index_length);
   CLOCK(time1);
   for (j = 0; j < index_length; j++) {
      for (k = 0; k < L; k++) {
         W0[j*L+k] = W0[j*L+k] + c0*(data[ind0[j]+k]);
      }
   }
   CLOCK(time2);
}

Scalar:
for (i = 0; i < times; i++) {
   index_length = B / L;
   initIndexArray(index_length);
   CLOCK(time1);
   for (j = 0; j < index_length; j++) {
      for (k = 0; k < L; k++) {
         W0 = W0 + c0*(data[ind0[j]+k]);
      }
   }
   CLOCK(time2);
}

initIndexArray(length):
for (i = 0; i < length; i++) {
   ind0[i] = getIndex(0) * L;
}

How many loads and stores should be counted?
Random Fused (R=1, CI=1)

Array:
for (i = 0; i < times; i++) {
   index_length = B;
   initIndexArray(index_length);
   CLOCK(time1);
   for (j = 0; j < index_length; j++) {
      W0[j] = W0[j] + c0*(data[ind0[j]]);
   }
   CLOCK(time2);
}

Scalar:
for (i = 0; i < times; i++) {
   index_length = B;
   initIndexArray(index_length);
   CLOCK(time1);
   for (j = 0; j < index_length; j++) {
      W0 = W0 + c0*(data[ind0[j]]);
   }
   CLOCK(time2);
}

initIndexArray(length):
for (i = 0; i < length; i += L) {
   ind0[i] = getIndex(0) * L;
   for (j = 1; (j < L) && (i+j < length); j++) {
      ind0[i+j] = ind0[i] + j;
   }
}
Random Nested Scalar (R)

R=1:
for (j = 0; j < index_length; j++) {
   for (k = 0; k < L; k++) {
      W0 = W0 + c0*(data[ind0[j]+k]);
   }
}
initIndexArray(length):
for (i = 0; i < length; i++) {
   ind0[i] = getIndex(0) * L;
}

R=2:
for (j = 0; j < index_length; j++) {
   for (k = 0; k < L; k++) {
      W0 = W0 + c0*(data[ind0[j]+k]);
      W1 = W1 + c1*(data[ind1[j]+k]);
   }
}
initIndexArray(length):
for (i = 0; i < length; i++) {
   ind0[i] = getIndex(0) * L;
   ind1[i] = getIndex(1) * L;
}
Random Nested Scalar (CI)

R=1, CI=1:
for (j = 0; j < index_length; j++) {
   for (k = 0; k < L; k++) {
      W0 = W0 + c0*(data[ind0[j]+k]);
   }
}

R=1, CI=2:
for (j = 0; j < index_length; j++) {
   for (k = 0; k < L; k++) {
      W0 = W0 + c0*(data[ind0[j]+k] + c0*(data[ind0[j]+k]));
   }
}

R=2, CI=4:
for (j = 0; j < index_length; j++) {
   for (k = 0; k < L; k++) {
      W0 = W0 + c0*(data[ind0[j]+k] + c0*(data[ind1[j]+k] + c0*(data[ind0[j]+k] + c0*(data[ind1[j]+k]))));
      W1 = W1 + c1*(data[ind1[j]+k] + c1*(data[ind0[j]+k] + c1*(data[ind1[j]+k] + c1*(data[ind0[j]+k]))));
   }
}
Random Nested Scalar (R, CI)

R=3, CI=3:
for (i = 0; i < times; i++) {
   index_length = B / L;
   initIndexArray(index_length);
   CLOCK(time1);
   for (j = 0; j < index_length; j++) {
      for (k = 0; k < L; k++) {
         W0 = W0 + c0*(data[ind0[j]+k] + c0*(data[ind2[j]+k] + c0*(data[ind1[j]+k])));
         W1 = W1 + c1*(data[ind1[j]+k] + c1*(data[ind0[j]+k] + c1*(data[ind2[j]+k])));
         W2 = W2 + c2*(data[ind2[j]+k] + c2*(data[ind1[j]+k] + c2*(data[ind0[j]+k])));
      }
   }
   CLOCK(time2);
}