
Tuesday, September 19, 2006


Presentation Transcript


  1. Tuesday, September 19, 2006 The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around. - Numerical Recipes, C Edition

  2. Reference Material
     Lectures 1 & 2
     • “Parallel Computer Architecture” by David Culler et al., Chapter 1.
     • “Sourcebook of Parallel Computing” by Jack Dongarra et al., Chapters 1 and 2.
     • “Introduction to Parallel Computing” by Grama et al., Chapter 1 and Chapter 2 §2.4.
     • www.top500.org
     Lecture 3
     • “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.3
     • Introduction to Parallel Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
     Lectures 4 & 5
     • “Techniques for Optimizing Applications” by Garg et al., Chapter 9
     • “Software Optimizations for High Performance Computing” by Wadleigh et al., Chapter 5
     • “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.1-2.2

  3. Software Optimizations • Optimize serial code before parallelizing it.

  4. Loop Unrolling
     Original loop:
       do i=1,n
         A(i)=B(i)
       enddo
     Unrolled by 4 (assumption: n is divisible by 4; see the cleanup-loop sketch below):
       do i=1,n,4
         A(i)=B(i)
         A(i+1)=B(i+1)
         A(i+2)=B(i+2)
         A(i+3)=B(i+3)
       enddo
     • Some compilers allow users to specify the unrolling depth.
     • Avoid excessive unrolling: register pressure / spills can hurt performance.
     • Pipelining to hide instruction latencies.
     • Reduces the overhead of index increment and conditional check.
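When n is not a multiple of 4, the same transformation still applies if a short cleanup loop handles the leftover iterations. A minimal C sketch of this (the function and array names are illustrative, not from the slides):

     void copy_unrolled(double *a, const double *b, int n)
     {
         int i;
         int nend = n - (n % 4);          /* largest multiple of 4 <= n */

         /* Main loop, unrolled by 4: fewer index increments and branch
            checks per element copied. */
         for (i = 0; i < nend; i += 4) {
             a[i]     = b[i];
             a[i + 1] = b[i + 1];
             a[i + 2] = b[i + 2];
             a[i + 3] = b[i + 3];
         }

         /* Cleanup loop for the remaining 0-3 elements. */
         for (; i < n; i++)
             a[i] = b[i];
     }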

  5. Loop Unrolling
     do j=1 to N
       do i = 1 to N
         Z[i,j]=Z[i,j]+X[i]*Y[j]
       enddo
     enddo
     Unroll the outer loop by 2.

  6. Loop Unrolling
     Original:
       do j=1 to N
         do i = 1 to N
           Z[i,j]=Z[i,j]+X[i]*Y[j]
         enddo
       enddo
     Outer loop unrolled by 2:
       do j=1 to N step 2
         do i = 1 to N
           Z[i,j]=Z[i,j]+X[i]*Y[j]
           Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]
         enddo
       enddo

  7. Loop Unrolling
     Original:
       do j=1 to N
         do i = 1 to N
           Z[i,j]=Z[i,j]+X[i]*Y[j]
         enddo
       enddo
     Outer loop unrolled by 2:
       do j=1 to N step 2
         do i = 1 to N
           Z[i,j]=Z[i,j]+X[i]*Y[j]
           Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]
         enddo
       enddo
     The number of load operations is reduced, e.g. half as many loads of X: each X[i] fetched in the inner loop now serves two updates (see the C sketch below).
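A C rendering of the unrolled loop makes the load reuse explicit by holding X[i] in a local variable. This is a sketch under stated assumptions: n is even (otherwise a cleanup pass over the last column is needed), and the indexing follows the slide's Z[i,j]; in Fortran's column-major layout the inner i loop is the stride-1 direction.

     /* Outer-loop unrolling by 2 for Z[i][j] += X[i] * Y[j]. */
     void rank1_update_unrolled(int n, double Z[n][n],
                                const double X[n], const double Y[n])
     {
         for (int j = 0; j < n; j += 2) {     /* assumes n is even */
             double yj  = Y[j];
             double yj1 = Y[j + 1];
             for (int i = 0; i < n; i++) {
                 double xi = X[i];            /* loaded once, used twice */
                 Z[i][j]     += xi * yj;
                 Z[i][j + 1] += xi * yj1;
             }
         }
     }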

  8. Loop Fusion
     • Beneficial in loop-intensive programs.
     • Decreases index calculation overhead.
     • Can also help instruction-level parallelism.
     • Beneficial if the same data structures are used in different loops.

  9. Loop Fusion
     for (i=0; i<n; i++)
         temp[i] = x[i]*y[i];
     for (i=0; i<n; i++)
         z[i] = w[i]+temp[i];

  10. Loop Fusion
      Before:
        for (i=0; i<n; i++)
            temp[i] = x[i]*y[i];
        for (i=0; i<n; i++)
            z[i] = w[i]+temp[i];
      Fused:
        for (i=0; i<n; i++)
            z[i] = x[i]*y[i]+w[i];
      Check for register pressure before fusing.

  11. Loop Fission
      • Conditional statements can hurt pipelining.
      • Split the loop into two: one with the conditional statements and one without.
      • The compiler can then apply optimizations such as unrolling to the condition-free loop.
      • Also beneficial for fat loops that may lead to register spills.

  12. Loop Fission
      for (i=0; i<nodes; i++) {
          a[i] = a[i]*small;
          dtime = a[i] + b[i];
          dtime = fabs(dtime*ratinpmt);
          temp1[i] = dtime*relaxn;
          if (temp1[i] > hgreat) {
              temp1[i] = 1;
          }
      }

  13. Loop Fission
      Before:
        for (i=0; i<nodes; i++) {
            a[i] = a[i]*small;
            dtime = a[i] + b[i];
            dtime = fabs(dtime*ratinpmt);
            temp1[i] = dtime*relaxn;
            if (temp1[i] > hgreat) {
                temp1[i] = 1;
            }
        }
      After fission:
        for (i=0; i<nodes; i++) {
            a[i] = a[i]*small;
            dtime = a[i] + b[i];
            dtime = fabs(dtime*ratinpmt);
            temp1[i] = dtime*relaxn;
        }
        for (i=0; i<nodes; i++) {
            if (temp1[i] > hgreat) {
                temp1[i] = 1;
            }
        }

  14. Reductions
      for (i=0; i<n; i++) {
          sum += x[i];
      }
      Normally a single register would be used for the reduction variable.
      How can the floating-point instruction latency be hidden?

  15. Reductions
      Original:
        for (i=0; i<n; i++) {
            sum += x[i];
        }
      With four partial sums:
        sum1 = sum2 = sum3 = sum4 = 0.0;
        nend = (n>>2)<<2;
        for (i=0; i<nend; i+=4) {
            sum1 += x[i];
            sum2 += x[i+1];
            sum3 += x[i+2];
            sum4 += x[i+3];
        }
        sumx = sum1 + sum2 + sum3 + sum4;
        for (i=nend; i<n; i++)
            sumx += x[i];
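The four partial sums form independent dependency chains, so a new addition can start while earlier ones are still completing, which hides the floating-point add latency. One caveat worth noting: regrouping floating-point additions can change the rounding slightly, which is why compilers generally only make this transformation themselves under "fast math"-style options. A self-contained sketch (the function name is illustrative):

     #include <stddef.h>

     /* Sum with four independent accumulators to hide floating-point add
        latency.  The result may differ in the last bits from a simple
        left-to-right sum because the additions are regrouped. */
     double sum4(const double *x, size_t n)
     {
         double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
         size_t i, nend = n & ~(size_t)3;     /* same as (n>>2)<<2 */

         for (i = 0; i < nend; i += 4) {
             s1 += x[i];
             s2 += x[i + 1];
             s3 += x[i + 2];
             s4 += x[i + 3];
         }

         double s = s1 + s2 + s3 + s4;
         for (; i < n; i++)                   /* last 0-3 elements */
             s += x[i];
         return s;
     }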

  16. a**0.5 vs sqrt(a)

  17. a**0.5 vs sqrt(a)
      • Including the appropriate header files can help the compiler generate faster code, e.g. math.h (see the sketch below).
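The slide's a**0.5 is Fortran notation; the C analogue is pow(a, 0.5). A small sketch of the two forms, assuming a modern optimizing compiler (many of which also rewrite pow(a, 0.5) to sqrt(a) when that is safe):

     #include <math.h>   /* declares sqrt() and pow(); with the prototypes
                            visible the compiler can treat them as builtins
                            and often emit a hardware square root directly */

     double root_pow(double a)  { return pow(a, 0.5); }  /* general exponentiation */
     double root_sqrt(double a) { return sqrt(a); }      /* usually the faster,
                                                            more direct form */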

  18. The time to access memory has not kept pace with CPU clock speeds.
      • A program's performance can be suboptimal because the data needed for its operations is not delivered from memory to registers by the time the processor is ready to use it.
      • The wasted CPU cycles are referred to as CPU starvation.

  19. The ability of the memory system to feed data to the processor is characterized by:
      • Memory latency
      • Memory bandwidth

  20. Effect of Memory Latency
      • 1 GHz processor (1 ns clock)
      • Capable of executing 4 instructions in each 1 ns cycle
      • DRAM with latency 100 ns
      • Cache block size: 1 word
      • Peak processor rating?

  21. Effect of Memory Latency
      • 1 GHz processor (1 ns clock)
      • Capable of executing 4 instructions in each 1 ns cycle
      • DRAM with latency 100 ns (no caches)
      • Memory block: 1 word
      • Peak processor rating: 4 GFLOPS

  22. Effect of Memory Latency
      • 1 GHz processor (1 ns clock)
      • Capable of executing 4 instructions in each 1 ns cycle
      • DRAM with latency 100 ns (no caches)
      • Memory block: 1 word
      • Peak processor rating: 4 GFLOPS
      • Dot product of two vectors
      • Peak speed of computation?

  23. Effect of Memory Latency
      • 1 GHz processor (1 ns clock)
      • Capable of executing 4 instructions in each 1 ns cycle
      • DRAM with latency 100 ns (no caches)
      • Memory block: 1 word
      • Peak processor rating: 4 GFLOPS
      • Dot product of two vectors
      • Peak speed of computation? One floating-point operation every 100 ns, i.e. a speed of 10 MFLOPS.
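A sketch of the kernel being analyzed. Under the slide's assumptions (no cache, every operand fetched from 100 ns DRAM), each iteration performs 2 FLOPs (one multiply, one add) but also needs 2 fetches, so the processor completes roughly one FLOP per 100 ns fetch, about 10 MFLOPS, no matter how high the 4 GFLOPS peak is.

     /* Dot product used in the latency example.  With no cache, every
        x[i] and y[i] access pays the full 100 ns DRAM latency, so the
        loop is memory-latency bound rather than compute bound. */
     double dot(const double *x, const double *y, int n)
     {
         double sum = 0.0;
         for (int i = 0; i < n; i++)
             sum += x[i] * y[i];
         return sum;
     }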

  24. Effect of Memory Latency: Introduce Cache
      • 1 GHz processor (1 ns clock)
      • Capable of executing 4 instructions in each 1 ns cycle
      • DRAM with latency 100 ns
      • Memory block: 1 word
      • Cache: 32 KB with 1 ns latency
      • Multiply two matrices A and B of 32x32 words with the result in C. (Note: the previous example had no data reuse.)
      • Assume ideal cache placement and enough capacity to hold A, B, and C.

  25. Effect of Memory Latency: Introduce Cache
      • Multiply two matrices A and B of 32x32 words with the result in C
      • 32x32 = 1K words per matrix
      • Total operations and total time taken?

  26. Effect of Memory Latency: Introduce Cache
      • Multiply two matrices A and B of 32x32 words with the result in C
      • 32x32 = 1K words per matrix
      • Total operations and total time taken?
      • The two input matrices = 2K words to fetch
      • Multiplying two matrices requires 2n³ operations

  27. Effect of Memory Latency: Introduce Cache
      • Multiply two matrices A and B of 32x32 words with the result in C
      • 32x32 = 1K words per matrix
      • Fetching the two input matrices = 2K words requires 2K * 100 ns = 200 µs
      • Multiplying the two matrices requires 2n³ = 2*32³ = 64K operations
      • At 4 operations per cycle this takes 64K/4 = 16K cycles = 16 µs
      • Total time = 200 µs + 16 µs = 216 µs
      • Computation rate = 64K operations / 216 µs ≈ 303 MFLOPS
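The kernel for this example is sketched below. The key difference from the dot product is reuse: once an element of A or B has been brought into the cache it is used 32 times, so the 200 µs of fetch time is amortized over all 64K operations. The sketch assumes the slide's ideal cache placement, with all three 32x32 matrices resident at once (3K words, which fits in the 32 KB cache).

     #define N 32

     /* 32x32 matrix multiply, C = A * B.  With all three matrices cache
        resident, each element of A and B is fetched from DRAM once and
        then reused N times from the cache. */
     void matmul32(double C[N][N], double A[N][N], double B[N][N])
     {
         for (int i = 0; i < N; i++)
             for (int j = 0; j < N; j++) {
                 double sum = 0.0;
                 for (int k = 0; k < N; k++)
                     sum += A[i][k] * B[k][j];   /* 2 FLOPs per iteration */
                 C[i][j] = sum;
             }
     }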

  28. Effect of Memory Bandwidth
      • 1 GHz processor (1 ns clock)
      • Capable of executing 4 instructions in each 1 ns cycle
      • DRAM with latency 100 ns
      • Memory block: 4 words
      • Cache: 32 KB with 1 ns latency
      • Dot product example again
      • Bandwidth increased 4-fold
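A rough estimate of what the wider memory block buys, in the spirit of the earlier latency arithmetic (the figure below is filled in here as an assumption, not stated on the slide): each 100 ns access now delivers 4 consecutive words, so two accesses, one for a block of x and one for a block of y, feed four iterations of the dot product, i.e. 8 FLOPs per 200 ns, or roughly 40 MFLOPS.

     #include <stdio.h>

     int main(void)
     {
         /* Parameters taken from the slide. */
         double access_ns       = 100.0;  /* latency of one memory access */
         double words_per_block = 4.0;    /* words delivered per access   */

         /* The dot product streams two arrays, so 2 block fetches cover 4
            iterations: 4 multiplies + 4 adds = 8 FLOPs every 200 ns. */
         double flops   = 2.0 * words_per_block;
         double time_ns = 2.0 * access_ns;

         printf("estimated rate: ~%.0f MFLOPS\n", flops / time_ns * 1e3);
         return 0;
     }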

  29. Reduce cache misses by exploiting:
      • Spatial locality
      • Temporal locality

  30. Impact of strided access
      for (i=0; i<1000; i++) {
          column_sum[i] = 0.0;
          for (j=0; j<1000; j++)
              column_sum[i] += b[j][i];
      }

  31. Eliminating strided access
      for (i=0; i<1000; i++)
          column_sum[i] = 0.0;
      for (j=0; j<1000; j++)
          for (i=0; i<1000; i++)
              column_sum[i] += b[j][i];
      Assumption: the vector column_sum is retained in the cache.

  32. do i = 1, N
        do j = 1, N
          A[i] = A[i] + B[j]
        enddo
      enddo
      N is large, so B[j] cannot remain in the cache until it is used again in another iteration of the outer loop: little reuse between touches.
      How many cache misses for A and B?
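One way to reason about the question (a rough estimate under assumptions not stated on the slide: a cache line holds L words and the cache is far smaller than B): A[i] can be kept in a register for the whole inner loop, so A is streamed through once, costing about N/L misses in total; B is streamed through once per outer iteration and is evicted before it can be reused, costing about N/L misses per outer pass, roughly N²/L overall. A C sketch of the loop with that accounting in comments:

     /* A[i] += sum of B[j].  Miss estimate (L = words per cache line,
        cache much smaller than B):
          - a[] is streamed once, with a[i] held in a register during the
            inner loop:                     ~N/L misses in total.
          - b[] is streamed once per outer iteration with no reuse across
            iterations:                     ~N/L misses per outer pass,
                                            ~N*N/L misses in total. */
     void sum_into_a(int n, double *a, const double *b)
     {
         for (int i = 0; i < n; i++) {
             double ai = a[i];
             for (int j = 0; j < n; j++)
                 ai += b[j];
             a[i] = ai;
         }
     }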
