Tuesday, September 19, 2006

"The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around."
- Numerical Recipes, C Edition
Reference Material
• Lectures 1 & 2
  • “Parallel Computer Architecture” by David Culler et al., Chapter 1.
  • “Sourcebook of Parallel Computing” by Jack Dongarra et al., Chapters 1 and 2.
  • “Introduction to Parallel Computing” by Grama et al., Chapter 1 and Chapter 2 §2.4.
  • www.top500.org
• Lecture 3
  • “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.3.
  • “Introduction to Parallel Computing”, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
• Lectures 4 & 5
  • “Techniques for Optimizing Applications” by Garg et al., Chapter 9.
  • “Software Optimizations for High Performance Computing” by Wadleigh et al., Chapter 5.
  • “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.1-2.2.
Software Optimizations • Optimize serial code before parallelizing it.
Loop Unrolling

Original loop:

    do i=1,n
      A(i)=B(i)
    enddo

Unrolled by 4 (assumption: n is divisible by 4):

    do i=1,n,4
      A(i)=B(i)
      A(i+1)=B(i+1)
      A(i+2)=B(i+2)
      A(i+3)=B(i+3)
    enddo

• Reduces overhead of index increment and conditional check.
• Helps pipelining hide instruction latencies.
• Some compilers allow users to specify unrolling depth.
• Avoid excessive unrolling: register pressure / spills can hurt performance.
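The unrolled loop above assumes n is divisible by 4. A minimal C sketch of the same copy that drops this assumption by adding a cleanup loop is shown here; the function name `copy_unrolled4` and the use of `double` arrays are illustrative choices, not part of the original slides.

```c
#include <assert.h>
#include <stddef.h>

/* Copy b into a, unrolled by 4. A cleanup loop handles the n % 4
   leftover elements, so n need not be divisible by 4.
   (Hypothetical helper written for illustration.) */
void copy_unrolled4(double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    size_t nend = n - (n % 4);   /* largest multiple of 4 that is <= n */
    size_t i;

    for (i = 0; i < nend; i += 4) {
        a[i]     = b[i];
        a[i + 1] = b[i + 1];
        a[i + 2] = b[i + 2];
        a[i + 3] = b[i + 3];
    }
    for (; i < n; i++)           /* remainder: at most 3 iterations */
        a[i] = b[i];
}
```

Calling `copy_unrolled4(a, b, 7)` runs one unrolled iteration for elements 0..3 and three cleanup iterations for elements 4..6.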
Loop Unrolling: unroll outer loop by 2

    do j=1 to N
      do i = 1 to N
        Z[i,j]=Z[i,j]+X[i]*Y[j]
      enddo
    enddo
Loop Unrolling: outer loop unrolled by 2 (assumes N is even)

    do j=1 to N step 2
      do i = 1 to N
        Z[i,j]=Z[i,j]+X[i]*Y[j]
        Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]
      enddo
    enddo

The number of load operations can be reduced: each X[i] is loaded once and used for two columns, so there are half as many loads of X.
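The outer-loop unrolling can be sketched in C as follows. This is only an illustration under stated assumptions: Z is stored column-major in a flat array as `z[i + j*n]` so that i stays the unit-stride index as in the pseudocode, the name `rank1_update_unroll2` is hypothetical, and an extra cleanup step handles odd n.

```c
#include <assert.h>
#include <stddef.h>

/* Z(i,j) += X(i) * Y(j) with the outer (j) loop unrolled by 2.
   Z is assumed column-major: element (i,j) lives at z[i + j*n]. */
void rank1_update_unroll2(double *z, const double *x, const double *y, size_t n)
{
    assert(n == 0 || (z != NULL && x != NULL && y != NULL));

    size_t jend = n - (n % 2);        /* largest even number <= n */
    size_t j;

    for (j = 0; j < jend; j += 2) {
        const double y0 = y[j], y1 = y[j + 1];
        double *z0 = z + j * n;       /* column j     */
        double *z1 = z + (j + 1) * n; /* column j + 1 */
        for (size_t i = 0; i < n; i++) {
            const double xi = x[i];   /* loaded once, used for two columns */
            z0[i] += xi * y0;
            z1[i] += xi * y1;
        }
    }
    if (j < n) {                      /* leftover column when n is odd */
        const double yj = y[j];
        double *zj = z + j * n;
        for (size_t i = 0; i < n; i++)
            zj[i] += x[i] * yj;
    }
}
```

Holding `x[i]` in a local makes the reuse explicit; without it the compiler may have to reload `x[i]` because it cannot always prove that `z` and `x` do not overlap.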
Loop Fusion
• Beneficial in loop-intensive programs.
• Decreases index calculation overhead.
• Can also help instruction-level parallelism.
• Beneficial if the same data structures are used in different loops.
Loop Fusion

Before:

    for (i=0; i<n; i++)
      temp[i] = x[i]*y[i];
    for (i=0; i<n; i++)
      z[i] = w[i]+temp[i];

After fusion:

    for (i=0; i<n; i++)
      z[i] = x[i]*y[i]+w[i];

Check for register pressure before fusing.
Loop Fission
• Conditional statements can hurt pipelining.
• Split the loop into two: one with the conditional statements and one without.
• The compiler can apply optimizations such as unrolling to the condition-free loop.
• Beneficial for fat loops that may lead to register spills.
Loop Fission

Before:

    for (i=0;i<nodes;i++) {
      a[i] = a[i]*small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime*ratinpmt);
      temp1[i] = dtime*relaxn;
      if(temp1[i] > hgreat) {
        temp1[i]=1;
      }
    }

After fission:

    for (i=0;i<nodes;i++) {
      a[i] = a[i]*small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime*ratinpmt);
      temp1[i] = dtime*relaxn;
    }
    for (i=0;i<nodes;i++) {
      if(temp1[i] > hgreat) {
        temp1[i]=1;
      }
    }
Reductions

    for (i=0; i<n; i++) {
      sum += x[i];
    }

Normally a single register would be used for the reduction variable, so each addition has to wait for the previous one. How can the floating point instruction latency be hidden?
Reductions: use several independent partial sums

    sum1 = sum2 = sum3 = sum4 = 0.0;
    nend = (n>>2)<<2;              /* n rounded down to a multiple of 4 */
    for (i=0; i<nend; i+=4) {
      sum1 += x[i];
      sum2 += x[i+1];
      sum3 += x[i+2];
      sum4 += x[i+3];
    }
    sumx = sum1 + sum2 + sum3 + sum4;
    for (i=nend; i<n; i++)         /* leftover elements */
      sumx += x[i];
a**0.5 vs sqrt(a)
• A dedicated square-root call is usually cheaper than a general exponentiation.
• Appropriate include files (e.g. math.h) can help the compiler generate faster code.
The time to access memory has not kept pace with CPU clock speeds.
• Performance of a program can be suboptimal because the data needed for an operation have not been delivered from memory to registers by the time the processor is ready to use them.
• Wasted CPU cycles: CPU starvation.
Ability of the memory system to feed data to the processor
• Memory latency
• Memory bandwidth
Effect of Memory Latency
• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS
• Dot product of two vectors: peak speed of computation?
  • Each floating point operation needs one word fetched from memory, and every fetch takes 100 ns.
  • One floating point operation every 100 ns, i.e. a speed of 10 MFLOPS.
Effect of Memory Latency: Introduce Cache
• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns
• Memory block: 1 word
• Cache: 32 KB with 1 ns latency
• Multiply two matrices A and B of 32x32 words with result in C. (Note: the previous example had no data reuse.)
• Assume ideal cache placement and enough capacity to hold A, B and C.
Effect of Memory Latency: Introduce Cache
• Multiply two matrices A and B of 32x32 words with result in C.
• 32x32 = 1K words per matrix.
• Total operations and total time taken?
• Fetching the two input matrices: 2K words * 100 ns ≈ 200 µs.
• Multiplying two n x n matrices requires 2n^3 operations = 2 * 32^3 = 64K operations.
• At 4 operations per cycle we need 64K/4 = 16K cycles ≈ 16 µs.
• Total time ≈ 200 + 16 = 216 µs.
• Computation rate ≈ 64K operations / 216 µs ≈ 303 MFLOPS.
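The arithmetic of this example can be reproduced with a small C helper. This is a sketch of the slide's simplified cost model only (fetch every input word once at DRAM latency, then compute at the peak issue rate, with no overlap between the two); the function name `effective_mflops` and its parameters are assumptions made for illustration.

```c
#include <assert.h>

/* Effective rate in MFLOPS under the simple model used above:
   total time = words_fetched * dram_latency_ns
              + (flops / flops_per_cycle) * cycle_ns,
   with no overlap of fetch and compute. (Hypothetical helper.) */
double effective_mflops(double words_fetched, double dram_latency_ns,
                        double flops, double flops_per_cycle, double cycle_ns)
{
    assert(flops > 0.0 && flops_per_cycle > 0.0 && cycle_ns > 0.0);

    double fetch_ns   = words_fetched * dram_latency_ns;
    double compute_ns = flops / flops_per_cycle * cycle_ns;

    /* flops per ns equals GFLOPS; scale by 1000 to get MFLOPS */
    return flops / (fetch_ns + compute_ns) * 1000.0;
}
```

For n = 32, `effective_mflops(2.0*32*32, 100.0, 2.0*32*32*32, 4.0, 1.0)` gives about 296 MFLOPS when K is taken as exactly 1024. The 303 MFLOPS figure above comes from rounding the fetch time to 200 µs and the compute time to 16 µs.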
Effect of Memory Bandwidth
• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns
• Memory block: 4 words
• Cache: 32 KB with 1 ns latency
• Dot product example again
• Bandwidth increased 4-fold
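The slide does not state the resulting rate. A worked estimate under the same simple model as the earlier dot product example, assuming each 100 ns access now delivers a block of 4 consecutive vector elements and that the vectors are laid out contiguously, might look like this:

```latex
\[
\underbrace{2 \times 100\,\text{ns}}_{\text{one 4-word block from each vector}}
\;\Rightarrow\; 4\ \text{multiply-adds} = 8\ \text{FLOPs},
\qquad
\frac{8\ \text{FLOPs}}{200\,\text{ns}}
= \frac{1\ \text{FLOP}}{25\,\text{ns}}
= 40\ \text{MFLOPS}
\]
```

This is four times the 10 MFLOPS obtained with 1-word blocks, which matches the 4-fold bandwidth increase.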
Reduce cache misses.
• Spatial locality
• Temporal locality
Impact of strided access

    for (i=0; i<1000; i++) {
      column_sum[i] = 0.0;
      for (j=0; j<1000; j++)
        column_sum[i] += b[j][i];
    }
Eliminating strided access

    for (i=0; i<1000; i++)
      column_sum[i] = 0.0;
    for (j=0; j<1000; j++)
      for (i=0; i<1000; i++)
        column_sum[i] += b[j][i];

Assumption: vector column_sum is retained in the cache.
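The two loop orders can be compared side by side in a self-contained C sketch. The function names and the flat row-major layout `b[j*n + i]` are assumptions for illustration; the slides use a fixed 1000 x 1000 array `b[j][i]`.

```c
#include <assert.h>
#include <stddef.h>

/* b is an n-by-n row-major matrix: element (j, i) lives at b[j*n + i]. */

/* Strided version: for each column i, walk down the rows (stride n words). */
void column_sums_strided(double *column_sum, const double *b, size_t n)
{
    assert(n == 0 || (column_sum != NULL && b != NULL));
    for (size_t i = 0; i < n; i++) {
        column_sum[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            column_sum[i] += b[j * n + i];
    }
}

/* Unit-stride version: sweep each row left to right and accumulate
   into column_sum, which is assumed to stay in cache. */
void column_sums_unit_stride(double *column_sum, const double *b, size_t n)
{
    assert(n == 0 || (column_sum != NULL && b != NULL));
    for (size_t i = 0; i < n; i++)
        column_sum[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            column_sum[i] += b[j * n + i];
}
```

Both versions add the elements of each column in the same order (increasing j), so they produce identical results; only the order in which memory is touched differs.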
    do i = 1, N
      do j = 1, N
        A[i] = A[i] + B[j]
      enddo
    enddo

• N is large, so B[j] cannot remain in cache until it is used again in another iteration of the outer loop.
• Little reuse between touches.
• How many cache misses for A and B?