Computer Aided Hand Tuning
Antoine Monsifrot, François Bodin
CAPS Team
June 2001
Overview
• Why CBR driven code tuning?
• Approach
• System overview
• Tuning cases
• Examples
• Conclusion
Introduction
• Execution speed depends
    • on the code structure
    • on the processor architecture
• Compiler optimizations frequently fail; compilers
    • are unable to analyze the programs (aliasing, ...)
    • must preserve program semantics
    • have little knowledge of the application or the target architecture
    • ignore most of the existing libraries
CBR Driven Code Tuning?
• Case-based reasoning
    • no knowledge formalization needed
    • 4 main operations: identification, retrieval, reuse, retention
• Defining a tuning case
    • abstracting loop performance properties
• User interaction
A Tuning Case
• A goal and a target machine
• A program transformation
• A set of indices
    • data about the code that indicates the optimisation opportunity
    • abstraction of code properties
• High probability of recognising a code structure we know how to optimise
    • compilers need to be conservative
Abstract performance indices
• Based on
    • execution time
    • code properties: data locality, parallelism, floating point operations, libraries
• Abstractions
    • data accesses
    • data dependencies
    • arithmetic expressions
    • code patterns
Static Indices
• Loop nest structure: depth, gotos, function calls
• Array accesses: access strides
• Expression patterns: div/div, power, sparse accesses, ...
• Loop patterns: Blas, LU, Jacobi, SOR
• Parallelism: data dependencies

Dynamic Indices
• Execution time and frequency: etime, tcov

Example loop nest:

      do k = 1,npts
        do j = 2,npts
          a(j,k) = a(j-1,k) + a(j,k)**2
          if (a(j,k) .eq. 0) then
            goto 4
          endif
          a(j,k) = a(j,k) + 1
 4        a(j,k) = a(j-1,k) / a(j,k)
        enddo
      enddo
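As a rough illustration of one static index, the access stride, the sketch below computes the distance in array elements between two consecutive innermost-loop iterations of an affine reference to a column-major (Fortran-style) two-dimensional array. The `AffineAccess` type and the `inner_stride` function are hypothetical names chosen for this example; the slides do not say how the system represents or computes strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a 2-D array reference a(row, col) whose
 * subscripts are affine in the innermost loop index i:
 *   row = row_coef * i + ...,   col = col_coef * i + ...
 * Only the coefficients of i matter for the stride. */
typedef struct {
    long row_coef;     /* coefficient of i in the first subscript    */
    long col_coef;     /* coefficient of i in the second subscript   */
    long leading_dim;  /* number of rows of the array (column-major) */
} AffineAccess;

/* Distance, in array elements, between the locations touched by two
 * consecutive iterations of the innermost loop (column-major layout). */
long inner_stride(const AffineAccess *acc)
{
    assert(acc != NULL);
    return acc->row_coef + acc->col_coef * acc->leading_dim;
}
```

For the reference `a(j,k)` in the example loop nest, where `j` is the innermost index, the coefficients are 1 and 0, so the stride is 1 regardless of the array size. A reference such as `a(k,j)` in the same loop would have a stride equal to the leading dimension.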
Computing Cases
• For each loop, all cases are checked

      char *ComputeCase1(Indices[]){ …}
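The idea of checking every case against every loop can be sketched as a table of case functions applied to the indices of one loop. This is only an illustration under stated assumptions: the slide shows the signature `char *ComputeCase1(Indices[])` with its body elided, so the `Indices` fields, the two case functions, and `check_all_cases` below are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the indices computed for one loop nest. */
typedef struct {
    int depth;         /* loop nest depth                      */
    int has_goto;      /* 1 if the body contains a goto        */
    int is_parallel;   /* 1 if no loop-carried data dependence */
    int matches_blas;  /* 1 if a Blas-like pattern was matched */
} Indices;

/* A case inspects the indices and returns the name of the suggested
 * transformation, or NULL when the case does not apply. */
typedef const char *(*CaseFn)(const Indices *);

static const char *case_parallel_loop(const Indices *idx)
{
    return (idx->is_parallel && !idx->has_goto) ? "parallelize loop" : NULL;
}

static const char *case_blas_call(const Indices *idx)
{
    return (idx->matches_blas && idx->depth >= 3) ? "replace by Blas call" : NULL;
}

/* Illustrative case table; a real system would hold many more entries. */
static const CaseFn all_cases[] = { case_parallel_loop, case_blas_call };

/* Check every case against one loop. Stores at most max suggestions in
 * out and returns the number of suggestions stored. */
size_t check_all_cases(const Indices *idx, const char **out, size_t max)
{
    size_t found = 0;
    assert(idx != NULL && out != NULL);
    for (size_t c = 0; c < sizeof all_cases / sizeof all_cases[0]; ++c) {
        const char *suggestion = all_cases[c](idx);
        if (suggestion != NULL && found < max)
            out[found++] = suggestion;
    }
    return found;
}
```

Calling `check_all_cases` once per loop nest gives the list of suggestions to present to the user. Returning NULL for "case does not apply" keeps each case function independent of the others, so adding a case only means adding a table entry.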
Cases Example: Tiling for TLB
Indices, grouped by the transformation they point to:
• tiling
    • affine loop
    • line array accesses
    • column array accesses
• distribution
    • no perfect loop nest
    • large body
• skewing
    • no negative component in dependence vectors
    • uniform dependencies
Composite cases named on the slide: distribution + tiling, skewing + tiling
Loop Benchmark
• 64 loop nests
• Example loop nest (figures shown on the slide: 3.3 Mflop, 54.1 Mflop):

      DO 3200 I = 1,NSIZE2
        DO 3170 J = 1,NSIZE1
          IF (B2(J,I) .EQ. 0.0) GO TO 3130
          A2(J,I) = C2(J,I)*B2(J,I)
          GO TO 3170
 3130     CONTINUE
          B2(J,I) = C2(J,I)*A2(J,I)
 3170   CONTINUE
 3200 CONTINUE

• 44 are compiler friendly
• 40 are improved by KAP
• 13 do not exhibit a case
• 12 exhibit a case
    • 5 parallel loops not parallelized by KAP
    • 1 sorted else if
    • 1 condition on loop index
    • 3 loop nests with loops to merge
    • 2 matrix multiply
http://www.netlib.org/benchmark/parallel
An Application Example: DeFT
• A real application: Gaussian Density Functional Program
• 75863 lines of Fortran code (comments included)
• Two main routines:
    • gridwork: 47.5%, 1015 lines
    • x_annihilate: 29.7%, 269 lines
http://www.ccl.net/cca/software/SOURCES/FORTRAN/DeFT/index.shtml
DeFT Examples
Examples of cases found:

Parallel loop

      do 1012 i=1,ihits
        ii=iwkvec(i)
        …...
        do 1012 j=1,ihits
          jj=iwkvec(j)
          …...
          do 1015 k=1,npts
 1015     wf(k,ii)=wf(k,ii)+factor*fv(k,jj)
          if((nfunctional.gt.0).and.(ipart.eq.0)) then
            do 1016 k=1,npts
              wfx(k,ii)=wfx(k,ii)+factor*fvx(k,jj)
              wfy(k,ii)=wfy(k,ii)+factor*fvy(k,jj)
 1016       wfz(k,ii)=wfz(k,ii)+factor*fvz(k,jj)
          endif
 1012 continue

Matrix Multiplication (Blas)

      do 1029 k = 1,n
        ...
        do 1029 j = istart(myid+1),iend(myid+1)
          do 1029 i = 1,n
 1029 overlap(i,j) = overlap(i,j) + coeff(i,k)*coeff(j,k)

Fusion

      do 1011 k=istart(myid+1),iend(myid+1)
 1011 veci(k)=coeff(k,i)
      do 1012 k=istart(myid+1),iend(myid+1)
 1012 vecj(k)=coeff(k,j)
      do 1013 k=istart(myid+1),iend(myid+1)
 1013 coeff(k,i)=coeff(k,i)+s(i)*(vecj(k)-tau(i)*veci(k))
      do 1014 k=istart(myid+1),iend(myid+1)
 1014 coeff(k,j)=coeff(k,j)-s(i)*(veci(k)+tau(i)*vecj(k))
      do 1015 k=istart(myid+1),iend(myid+1)
 1015 veci(k)=smat(k,i)
      do 1016 k=istart(myid+1),iend(myid+1)
 1016 vecj(k)=smat(k,j)

Execution times on a 4-processor SGI Onyx:
• Sequential: 121s
• KAP: 140s
• CAHT: 85s
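To make the Fusion case concrete, the sketch below transcribes the six loops into C and places a fused version next to them. It is an illustration, not the code produced by the tool: the function names, the 0-based half-open range `[lo, hi)` standing in for `istart(myid+1)..iend(myid+1)`, and the `AT` macro are assumptions, and the arrays are assumed not to overlap in memory. Fusion is legal here because every statement reads and writes only element `k` of each array, so no value flows between different iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (k, col) of an array with n rows, 0-based. */
#define AT(a, n, k, col) ((a)[(size_t)(col) * (n) + (k)])

/* Six separate loops over the same range, as in the Fortran source. */
void rotate_unfused(double *coeff, const double *smat,
                    double *veci, double *vecj,
                    const double *s, const double *tau,
                    size_t n, size_t lo, size_t hi, size_t i, size_t j)
{
    size_t k;
    assert(lo <= hi && hi <= n);
    for (k = lo; k < hi; ++k) veci[k] = AT(coeff, n, k, i);
    for (k = lo; k < hi; ++k) vecj[k] = AT(coeff, n, k, j);
    for (k = lo; k < hi; ++k)
        AT(coeff, n, k, i) += s[i] * (vecj[k] - tau[i] * veci[k]);
    for (k = lo; k < hi; ++k)
        AT(coeff, n, k, j) -= s[i] * (veci[k] + tau[i] * vecj[k]);
    for (k = lo; k < hi; ++k) veci[k] = AT(smat, n, k, i);
    for (k = lo; k < hi; ++k) vecj[k] = AT(smat, n, k, j);
}

/* Same statements in the same order, merged into a single loop. */
void rotate_fused(double *coeff, const double *smat,
                  double *veci, double *vecj,
                  const double *s, const double *tau,
                  size_t n, size_t lo, size_t hi, size_t i, size_t j)
{
    assert(lo <= hi && hi <= n);
    for (size_t k = lo; k < hi; ++k) {
        veci[k] = AT(coeff, n, k, i);
        vecj[k] = AT(coeff, n, k, j);
        AT(coeff, n, k, i) += s[i] * (vecj[k] - tau[i] * veci[k]);
        AT(coeff, n, k, j) -= s[i] * (veci[k] + tau[i] * vecj[k]);
        veci[k] = AT(smat, n, k, i);
        vecj[k] = AT(smat, n, k, j);
    }
}
```

The fused version makes one pass over the index range instead of six and reuses `veci[k]` and `vecj[k]` right after they are written. Whether that is faster on a given machine is something to measure, not assume.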
Conclusion
• Case-based reasoning provides a promising framework for code tuning
• Tuning the cases may be difficult
    • take into account the compiler (e.g. unrolling)
• Integration of dynamic data and assembly code properties
• Learning techniques for case tuning