
SP 3D Running Average Implementation SSE + OpenMP Benchmarking on different platforms



Presentation Transcript


  1. SP 3D Running Average Implementation SSE + OpenMP Benchmarking on different platforms Dr. Zvi Danovich, Senior Application Engineer January 2008

  2. Agenda • What is 3D Running Average (RA)? • From 1D to 3D RA implementation • Basic SSE technique: AoS <=> SoA transforms • 1D RA 4-lines SSE implementation • 2nd dimension completion • 3rd dimension completion • Adding OpenMP, benchmarking, conclusions

  3. 3D Running Average (RA) – what is it? • 3D RA is computed for each voxel V as the normalized sum over a k×k×k cube (k is odd) located “around” the given voxel: V = (1/k³)·∑v, where the sum runs over the source voxels v of the cube. • In other words, 3D RA can be considered as a 3D convolution with a kernel whose components are all equal to 1/k³.

  4. Agenda • What is 3D Running Average (RA)? • From 1D to 3D RA implementation • Basic SSE technique: AoS <=> SoA transforms • 1D RA 4-lines SSE implementation • 2nd dimension completion • 3rd dimension completion • Adding OpenMP, benchmarking, conclusions

  5. 1D Running Average (RA) • Unlike 1D convolution, 1D RA can be computed with O(1) complexity per output using the following approach: • Prolog: compute the sum S of the first k voxels: S = ∑(v)0,k-1 • Main step: to compute the next sum S+1, the first member of the previous sum (v0) is subtracted and the next component (vk) is added: S+1 = ∑(v)1,k = S – v0 + vk
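The prolog/main-step scheme above can be sketched in plain C (scalar form, no SSE; the function name and signature are illustrative, and boundary handling for the line ends is omitted):

```c
/* Sketch: O(1)-per-output 1D running average of n values with
   window width k. Produces n - k + 1 results. */
void running_avg_1d(const float *v, float *out, int n, int k)
{
    float s = 0.0f;
    for (int i = 0; i < k; ++i)          /* prolog: sum the first k voxels */
        s += v[i];
    out[0] = s / k;
    for (int i = 1; i + k - 1 < n; ++i) {/* main step: add next, subtract first */
        s += v[i + k - 1] - v[i - 1];
        out[i] = s / k;
    }
}
```

Each output costs one addition and one subtraction, independent of k — which is exactly why RA beats a straight convolution here.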

  6. Extending 1D Running Average toward 2D • Given a slice (plane) with all lines (Li) already 1D-averaged, we can extend the averaging to 2D by the same approach: • Prolog: compute the sum S of the first k lines: S = ∑(L)0,k-1 • Main step: to compute the next sum S+1, the first line of the previous sum (L0) is subtracted and the next line (Lk) is added: S+1 = ∑(L)1,k = S – L0 + Lk

  7. Extending 2D Running Average toward 3D • Given a stack of planes with all planes (Pi) already 2D-averaged, we can extend the averaging to 3D by the same approach: • Prolog: compute the sum S of the first k planes: S = ∑(P)0,k-1 • Main step: to compute the next sum S+1, the first plane of the previous sum (P0) is subtracted and the next plane (Pk) is added: S+1 = ∑(P)1,k = S – P0 + Pk

  8. Agenda • What is 3D Running Average (RA)? • From 1D to 3D RA implementation • Basic SSE technique: AoS <=> SoA transforms • 1D RA 4-lines SSE implementation • 2nd dimension completion • 3rd dimension completion • Adding OpenMP, benchmarking, conclusions

  9. Array of Structures (AoS) => Structure of Arrays (SoA) Why should we transform it to vectorize 1D Running Average? • The original “natural” serial data layout (AoS) is NOT enabled for SSE: the voxels of each line are consecutive in memory, so the recurrence S = ∑(v)0,k-1, S+1 = ∑(v)1,k = S – v0 + vk processes one line at a time. • The “transposed” layout (SoA) IS enabled for SSE: packing element i of four lines L0–L3 into one SSE register mi turns the recurrence into Ms = ∑(m)0,k-1, Ms+1 = ∑(m)1,k = Ms – m0 + mk, processing all four lines at once. • How can it be transformed?

  10. Array of Structures (AoS) => Structure of Arrays (SoA) • Presented below: transposition of 4 quads from 4 original lines into 4 SSE registers of x, y, z, w. • Takes 12 SSE operations per 16 components. • Each pair of original lines (L0/L1, L2/L3) is combined by loadlo/loadhi into the intermediate registers xy10, zw10, xy32, zw32; the final SSE registers are then produced by shuffle(xy10, xy32, (2,0,2,0)), shuffle(xy10, xy32, (3,1,3,1)), shuffle(zw10, zw32, (2,0,2,0)) and shuffle(zw10, zw32, (3,1,3,1)).
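Assuming SSE intrinsics, the 12-operation transpose described above can be sketched as follows (the function name and argument layout are illustrative; the one-time setzero is not counted among the 8 loads + 4 shuffles):

```c
#include <immintrin.h>

/* Sketch: gather one quad from each of 4 source lines (AoS) into
   4 SoA registers, c0..c3 holding components 0..3 of all lines. */
static void aos_to_soa(const float *l0, const float *l1,
                       const float *l2, const float *l3,
                       __m128 *c0, __m128 *c1, __m128 *c2, __m128 *c3)
{
    __m128 z = _mm_setzero_ps();
    /* loadlo/loadhi: pack the halves of two lines per intermediate reg */
    __m128 xy10 = _mm_loadh_pi(_mm_loadl_pi(z, (const __m64 *)l0),
                               (const __m64 *)l1);       /* l0[0] l0[1] l1[0] l1[1] */
    __m128 zw10 = _mm_loadh_pi(_mm_loadl_pi(z, (const __m64 *)(l0 + 2)),
                               (const __m64 *)(l1 + 2)); /* l0[2] l0[3] l1[2] l1[3] */
    __m128 xy32 = _mm_loadh_pi(_mm_loadl_pi(z, (const __m64 *)l2),
                               (const __m64 *)l3);
    __m128 zw32 = _mm_loadh_pi(_mm_loadl_pi(z, (const __m64 *)(l2 + 2)),
                               (const __m64 *)(l3 + 2));
    /* 4 shuffles produce the final SoA registers */
    *c0 = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(2, 0, 2, 0));
    *c1 = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(3, 1, 3, 1));
    *c2 = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(2, 0, 2, 0));
    *c3 = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(3, 1, 3, 1));
}
```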

  11. Array of Structures (AoS) <= Structure of Arrays (SoA) • Presented below: (inverse) transposition of the 4 x, y, z, w SSE registers into 4 memory locations. • Takes 12 SSE operations per 16 components. • unpack_lo and unpack_hi combine the original registers into the intermediates xy10, zw10, xy32, zw32; shuffle(xy10, zw10, …) + store and shuffle(xy32, zw32, …) + store then write each reconstructed line Li back to memory.
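The inverse transform can be sketched the same way (again with an illustrative signature): 4 unpacks, 4 shuffles and 4 stores, i.e. 12 SSE operations per 16 components:

```c
#include <immintrin.h>

/* Sketch: scatter 4 SoA registers (c0..c3 = components 0..3 of all
   lines) back into 4 AoS memory quads, one per line. */
static void soa_to_aos(__m128 c0, __m128 c1, __m128 c2, __m128 c3,
                       float *l0, float *l1, float *l2, float *l3)
{
    __m128 xy10 = _mm_unpacklo_ps(c0, c1);  /* l0[0] l0[1] l1[0] l1[1] */
    __m128 zw10 = _mm_unpacklo_ps(c2, c3);  /* l0[2] l0[3] l1[2] l1[3] */
    __m128 xy32 = _mm_unpackhi_ps(c0, c1);
    __m128 zw32 = _mm_unpackhi_ps(c2, c3);
    /* shuffle + store rebuilds each line's quad */
    _mm_storeu_ps(l0, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(l1, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(3, 2, 3, 2)));
    _mm_storeu_ps(l2, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(l3, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(3, 2, 3, 2)));
}
```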

  12. Agenda • What is 3D Running Average (RA)? • From 1D to 3D RA implementation • Basic SSE technique: AoS <=> SoA transforms • 1D RA 4-lines SSE implementation • 2nd dimension completion • 3rd dimension completion • Adding OpenMP, benchmarking, conclusions

  13. 1D Running Average 4-lines SSE implementation (width – 11) Cyclic SSE array buffer • The AoS=>SoA transform loads 4 SSE registers (one QUAD) at a time. RA with width 11 needs to maintain 12 registers together; they can fit in 3 QUADs of registers, but can crawl across 4 QUADs as well. • So, 16 registers (4 QUADs) must be allocated and used in a cyclic way – when the last QUAD is freed, it is reloaded by AoS=>SoA with the next QUAD of values.

  14. 1D Running Average 4-lines SSE implementation (width – 11) Prolog • Load 12 SSE registers by AoS=>SoA • Sum up (accumulate) the first 5 • 4 times: (sum up the next, save the result in SSE registers – SoA form) • Save a QUAD of results to memory by AoS<=SoA • 2 times: (sum up the next, save the result in SSE registers – SoA form) • 1 time: sum up the next, subtract the very first, save the result in an SSE register • Here all 12 loaded QUADs are used (5+4+2+1), and the 3 last result registers are NOT saved yet.

  15. 1D Running Average 4-lines SSE implementation (width – 11) Main step & epilog • Main step • Load 4 SSE registers by AoS=>SoA, reusing the 4 “last” registers of the cyclic buffer • Sum up the next, subtract (next-11), save the result in an SSE register – it will be the 4th pending result • Save a QUAD of results to memory by AoS<=SoA • 3 times: (sum up the next, subtract the first, save the result in an SSE register) • During the step: 4 new SSE registers are loaded, 4 results (3 old and 1 new) are saved to memory, 3 result registers are NOT saved, and the registers holding the oldest values are freed after the step. • Epilog • For the 5 last results, ONLY the subtraction is done.
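Stripped of the cyclic-buffer bookkeeping and the epilog, the per-register arithmetic of the 4-lines scheme can be sketched as follows (illustrative names; the input is assumed to be already AoS=>SoA-transposed, so quad[i] holds element i of all four lines, one per SSE lane):

```c
#include <immintrin.h>

/* Sketch: 1D running average of 4 lines at once, one SSE lane per
   line; n elements per line, window width k (11 in the deck). */
static void ra1d_4lines(const __m128 *quad, __m128 *out, int n, int k)
{
    __m128 kk  = _mm_set1_ps((float)k);
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < k; ++i)              /* prolog: first window */
        sum = _mm_add_ps(sum, quad[i]);
    out[0] = _mm_div_ps(sum, kk);
    for (int i = 1; i + k - 1 < n; ++i) {    /* main step, all 4 lines at once */
        sum = _mm_add_ps(sum, quad[i + k - 1]);
        sum = _mm_sub_ps(sum, quad[i - 1]);
        out[i] = _mm_div_ps(sum, kk);
    }
}
```

One add and one subtract per SSE register advance the window for four lines simultaneously, which is where the ~4x SSE/serial speed-up reported later comes from.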

  16. Agenda • What is 3D Running Average (RA)? • From 1D to 3D RA implementation • Basic SSE technique: AoS <=> SoA transforms • 1D RA 4-lines SSE implementation • 2nd dimension completion • 3rd dimension completion • Adding OpenMP, benchmarking, conclusions

  17. 2nd dimension completion 2D RA: based on the 4-lines 1D SSE implementation – prolog • The logical flow of 2D RA (an in-place routine) is very similar to the 1D RA 4-lines implementation. To hold intermediate 1D RA lines we use 16 working lines – the analog of the 16 SSE registers. • Prolog • Compute 12 1D RA lines by 3 calls to the 1D RA 4-lines routine • Sum up (accumulate) the first 5 in working memory • 6 times: (sum up the next line, save the result in its final place) • 1 time: sum up the next line, subtract the first line, save the result in its final place • Here all 12 1D RA lines are used (5+6+1), yielding the resulting 2D Running Average lines L0–L6.

  18. 2nd dimension completion 2D RA: based on the 4-lines 1D SSE implementation – main step & epilog • Main step • Compute 4 1D RA lines by calling the 1D RA 4-lines routine, outputting into the 4 “last” lines of the working-lines cyclic buffer • 4 times: sum up the next, subtract (next-11), save the result in its final place; the 4 oldest working lines are freed after the step • Epilog • For the 5 last results, ONLY the subtraction is done • Important cache-related note: a typical line length is ~400 floats => 1.6 KB, therefore the cyclic buffer of 16 lines is ~26 KB => less than the 32 KB L1 cache. Most of the data manipulation is done in the L1 cache!
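The 2nd-dimension pass can be sketched in scalar form (hypothetical names; the SSE 4-lines routine, the 16-line cyclic buffer and the epilog are omitted). Each row of tmp is assumed to already hold its 1D running average; output row y then averages tmp rows y..y+k-1, maintained in O(1) per row by adding the next row and subtracting the first:

```c
#include <stdlib.h>

/* Sketch: complete the 2nd dimension over an h-by-w plane of
   1D-averaged rows, window width k. Writes h - k + 1 output rows. */
void ra2d_complete(const float *tmp, float *out, int w, int h, int k)
{
    float *acc = calloc(w, sizeof *acc);   /* per-column running sums */
    for (int y = 0; y < k; ++y)            /* prolog: sum the first k rows */
        for (int x = 0; x < w; ++x)
            acc[x] += tmp[y * w + x];
    for (int x = 0; x < w; ++x)
        out[x] = acc[x] / k;
    for (int y = 1; y + k - 1 < h; ++y) {  /* main step per output row */
        for (int x = 0; x < w; ++x) {
            acc[x] += tmp[(y + k - 1) * w + x] - tmp[(y - 1) * w + x];
            out[y * w + x] = acc[x] / k;
        }
    }
    free(acc);
}
```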

  19. Agenda • What is 3D Running Average (RA)? • From 1D to 3D RA implementation • Basic SSE technique: AoS <=> SoA transforms • 1D RA 4-lines SSE implementation • 2nd dimension completion • 3rd dimension completion • Adding OpenMP, benchmarking, conclusions

  20. 3rd dimension completion • The 3rd-dimension (in-place) computation is done after the 2D computation has been completed for the whole stack of images (planes). • It is straightforward, as it is fully independent of the previously computed 2D results – in contrast to the 2D computation, which includes the 1D computation as an internal part. • In general, its logical flow is very similar to the 2D one. The important difference is that (because of the in-place operation) the results are first saved into a cyclic buffer of 12 working lines, and each result is copied to its final place only after the corresponding source line has been used for subtraction.
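The copy-after-subtraction idea can be illustrated with a scalar, in-place 1D sketch (illustrative code, not the deck's SSE routine). In this scalar version a single pending value suffices, because a[i-1] is freed exactly one step after its result was computed; the deck's version needs a 12-line cyclic buffer because it produces and frees whole lines in QUAD granularity:

```c
/* Sketch: in-place running average over n values, window width k
   (requires n >= k). A result overwrites a[i] only after a[i] has
   been subtracted from the running sum. */
void ra_inplace(float *a, int n, int k)
{
    float s = 0.0f, pending;
    for (int i = 0; i < k; ++i)            /* prolog: first window */
        s += a[i];
    pending = s / k;                       /* result 0, not yet stored */
    for (int i = 1; i + k - 1 < n; ++i) {  /* main step */
        s += a[i + k - 1] - a[i - 1];      /* a[i-1] subtracted: now free */
        a[i - 1] = pending;                /* safe to overwrite it */
        pending = s / k;
    }
    a[n - k] = pending;                    /* flush the last pending result */
}
```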

  21. Agenda • What is 3D Running Average (RA)? • From 1D to 3D RA implementation • Basic SSE technique: AoS <=> SoA transforms • 1D RA 4-lines SSE implementation • 2nd dimension completion • 3rd dimension completion • Adding OpenMP, benchmarking, conclusions

  22. Parallelizing by OpenMP and benchmarking • To parallelize the above algorithm with OpenMP, 16 working lines are allocated for each thread. • Using OpenMP is straightforward for 2 loops: (1) calling the 2D RA routine for each plane in the stack, and (2) calling the routine that computes a “stack” of 3D RA lines – the loop in the “y” direction (explained on the appropriate foil). • Benchmarking results for several platforms: • Conclusions: • The SSE/serial speed-up for Penryn/Merom is ~4x, 30% better than for the “old” Pentium-M (2.5x). • The absolute SSE run time on Merom (12-15 msec) is 2-2.5x better than on Pentium-M (32 msec), and on Penryn (9.4 msec) it is >3x better. • OpenMP scalability is very low; it seems that performance is restricted by the FSB speed.
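The plane-level parallelization can be sketched as follows (hypothetical names: ra2d_plane is a scalar stand-in for the deck's 2D routine, and each thread allocates its own scratch line, mirroring the per-thread working-lines allocation above). Without OpenMP support the pragma is simply ignored and the loop runs serially:

```c
#include <stdlib.h>

/* Stand-in for the deck's 2D RA routine: 1D running average along x
   for every row of one plane, in place, using `work` as scratch. */
static void ra2d_plane(float *plane, float *work, int w, int h, int k)
{
    for (int y = 0; y < h; ++y) {
        float *row = plane + (size_t)y * w, s = 0.0f;
        for (int i = 0; i < k; ++i) s += row[i];
        for (int i = 0; i + k - 1 < w; ++i) {
            work[i] = s / k;
            if (i + k < w) s += row[i + k] - row[i];
        }
        for (int i = 0; i + k - 1 < w; ++i) row[i] = work[i];
    }
}

/* Each plane is independent, so the plane loop parallelizes with a
   single pragma; scratch is allocated per iteration, i.e. per thread. */
void ra_stack(float *stack, int w, int h, int nplanes, int k)
{
    #pragma omp parallel for
    for (int p = 0; p < nplanes; ++p) {
        float *work = malloc((size_t)w * sizeof *work);
        ra2d_plane(stack + (size_t)p * w * h, work, w, h, k);
        free(work);
    }
}
```

Keeping the scratch allocation inside the loop body is what makes the loop trivially thread-safe: no two iterations share writable state.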
