19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois. A Proposal of Operation History Management System for Source-to-Source Optimization of HPC Programs Yasushi Negishi, Hiroki Murata and Takao Moriyama Deep Computing, Tokyo Research Laboratory, IBM Research.
Outline of this Presentation • Proposal of an algorithm for managing operation history of source-to-source optimization. • Prototype system with new user interface for managing operation history explicitly.
Background • Improvement in single-processor performance is stalling, and supercomputer architectures are becoming more complex. • Architecture-specific optimizations are needed to exploit the various network and processor architectures and achieve reasonable performance. • Application areas for numerical simulation continue to expand. • We need to solve performance issues more effectively and more easily. Source-to-source optimization tools are becoming important. • Automatic conversion (a.k.a. refactoring) for optimization • Supports typical architecture-specific and application-specific performance optimization patterns. • Reduces programmer time and human error by supporting routine but troublesome optimizations.
Typical Source-to-Source Optimization Steps
Optimization steps are combinations of automatic conversion and manual editing.
• Strength reduction
  • Replace a costly operation with an equivalent but less expensive operation.
  • E.g. x = r ** (-1) → x = 1 / r
  • Steps
    • Modify the code to use the less expensive operation by manual editing.
• Loop unrolling & SIMDization
  • Use SIMD instructions if the compiler does not generate optimal SIMD instructions in a loop.
  • E.g.
        x(i) = a(i) + b(i) * c(i)
        x(i+1) = a(i+1) + b(i+1) * c(i+1)
    →
        x(i) = FPMADD(a(i), b(i), c(i))
  • Steps
    • Unroll the loop by automatic conversion, specifying the range and the unroll factor.
    • Modify the unrolled loop body with in-line assembly code for SIMD by manual editing.
• Loop tiling (a.k.a. loop blocking, strip mine and interchange)
  • Change the loop structure to increase memory access locality and the cache hit ratio.
  • E.g.
        for (i=0; i<N; i++)
          for (j=0; j<N; j++)
            c[i] = c[i] + a[i][j]*b[j];
    →
        for (i=0; i<N; i+=Bi)
          for (j=0; j<N; j+=Bj)
            for (ii=i; ii<min(i+Bi,N); ii++)
              for (jj=j; jj<min(j+Bj,N); jj++)
                c[ii] = c[ii] + a[ii][jj]*b[jj];
  • Steps
    • Modify the loop by automatic conversion, specifying the range and the blocking factors.
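The loop tiling example above can be checked with a small sketch. The following Python code is only an illustration under stated assumptions: the function names `matvec_plain` and `matvec_tiled` and the blocking factors `bi` and `bj` are hypothetical and do not come from the slides. It shows that the tiled loop nest visits the same (i, j) pairs as the plain one, so both produce the same vector c.

```python
def matvec_plain(a, b, n):
    """Untiled reference loop nest: c[i] += a[i][j] * b[j]."""
    c = [0.0] * n
    for i in range(n):
        for j in range(n):
            c[i] += a[i][j] * b[j]
    return c


def matvec_tiled(a, b, n, bi, bj):
    """Same computation with the i and j loops split into bi x bj tiles."""
    c = [0.0] * n
    for i in range(0, n, bi):
        for j in range(0, n, bj):
            # min() keeps the last, possibly partial, tile inside the array.
            for ii in range(i, min(i + bi, n)):
                for jj in range(j, min(j + bj, n)):
                    c[ii] += a[ii][jj] * b[jj]
    return c


# Small example where n is not a multiple of either blocking factor.
n = 7
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [float(j + 1) for j in range(n)]
same = matvec_plain(a, b, n) == matvec_tiled(a, b, n, 3, 2)
```

For each row the tiled version still adds the j terms in ascending order, so the two results agree exactly here, not just approximately.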
“Reapplication Conflict” • Because optimization work proceeds by trial and error, it is sometimes necessary to undo a past operation, or to insert or change a past operation, even when a single user manages the code. We call the conflict this causes for a single user a “Reapplication Conflict”. • A system supporting source-to-source optimization should handle this conflict correctly.
Issues of Existing Version Management Systems Handling “Reapplication Conflict” • Because optimization work proceeds by trial and error, it is sometimes necessary to undo a past operation, or to insert or change a past operation, even when a single user manages the code. • We call the conflict this causes for a single user a “Reapplication Conflict”. • The system should handle this conflict correctly. • Existing version management systems handle conflicts with the algorithm of the “patch” command or a similar one. • But the patch algorithm has an issue. • For modifications made by manual editing, the patch algorithm works fine: it applies the difference produced by an operation to a different base code, adjusting the target range. • For modifications made by automatic conversion, the patch algorithm may generate unexpected results. The following slides show a scenario in which an existing system does not work as expected.
Example Scenario of “Reapplication Conflict” (original)
The original code is checked out. History: Original.
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      do i = 1, n
        x(i) = i * sin(i / (pi * 4.0d0))
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + x(i) ** (-1)
      enddo
      t2 = rtc() - s
      s = rtc()
      do i = 2, n
        b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0)
      enddo
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Example Scenario of “Reapplication Conflict” (Step 1)
Step 1: Do loop-invariant code motion by manual editing (operation A), and check it in. History: Original → A.
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      fourpi = pi * 4.0d0
      do i = 1, n
        x(i) = i * sin(i / fourpi)
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + x(i) ** (-1)
      enddo
      t2 = rtc() - s
      s = rtc()
      do i = 2, n
        b = b + ((x(i) + a) / fourpi + 1.0d0)
      enddo
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Example Scenario of “Reapplication Conflict” (Step 2)
Step 2: Do strength reduction by manual editing (operation B), and check it in. History: Original → A → B.
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      fourpi = pi * 4.0d0
      do i = 1, n
        x(i) = i * sin(i / fourpi)
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + 1.0d0 / x(i)
      enddo
      t2 = rtc() - s
      s = rtc()
      do i = 2, n
        b = b + ((x(i) + a) / fourpi + 1.0d0)
      enddo
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Example Scenario of “Reapplication Conflict” (Step 3)
Step 3: Do loop unrolling by automatic conversion (operation C), and check it in. History: Original → A → B → C.
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      fourpi = pi * 4.0d0
      do i = 1, n
        x(i) = i * sin(i / fourpi)
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + 1.0d0 / x(i)
      enddo
      t2 = rtc() - s
      s = rtc()
      do i = 2, n, 4
        b = b + ((x(i) + a) / fourpi + 1.0d0)
        b = b + ((x(i+1) + a) / fourpi + 1.0d0)
        b = b + ((x(i+2) + a) / fourpi + 1.0d0)
        b = b + ((x(i+3) + a) / fourpi + 1.0d0)
      enddo
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Example Scenario of “Reapplication Conflict” (Step 4) Step 4: Compile and execute the code, and analyze the effect of each optimization. The code is unchanged from Step 3 (history: Original → A → B → C). The results are as follows. • Optimization A: not effective (N.G.) • Optimization B: effective (O.K.) • Optimization C: effective (O.K.)
Example Scenario of “Reapplication Conflict” (Step 5) Step 5: Undo optimization A with the “patch” command. The code before the undo is unchanged from Step 3 (history: Original → A → B → C). • Target of optimization A: the declaration and assignment of fourpi, and the lines where fourpi replaced (pi * 4.0d0), namely the x(i) assignment in the first loop and the b = b + ((x(i) + a) / fourpi + 1.0d0) line in the third loop. • Not a target of optimization A, but influenced: the x(i+1), x(i+2) and x(i+3) lines generated in the third loop by the unrolling of optimization C.
Example Scenario of “Reapplication Conflict” (Final Results)
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      do i = 1, n
        x(i) = i * sin(i / (pi * 4.0d0))
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + 1 / x(i)
      enddo
      t2 = rtc() - s
      s = rtc()
      do i = 2, n, 4
        b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0)
        b = b + ((x(i+1) + a) / fourpi + 1.0d0)
        b = b + ((x(i+2) + a) / fourpi + 1.0d0)
        b = b + ((x(i+3) + a) / fourpi + 1.0d0)
      enddo
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Problem: The wrong line is unrolled! The x(i+1), x(i+2) and x(i+3) lines still refer to fourpi, which is no longer declared or assigned. This happens because “patch” does not actually apply the automatic conversion operation again; it only applies the textual difference that the conversion produced. A system for managing automatic conversion operations is needed. It must:
• Adjust the target range.
• Actually apply the automatic operation again.
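The failure above can be reproduced in miniature. The following Python sketch is not the real “patch” algorithm: the hypothetical helper `reverse_line_edits` matches whole lines only, whereas patch also uses line positions and context. It is enough to show the point of the slide: a textual undo touches only the lines recorded in the difference of operation A, so the copies created later by unrolling keep `fourpi`.

```python
def reverse_line_edits(code, edits):
    """Undo recorded (old, new) line edits by swapping each new line back.

    Lines that do not appear in the recorded difference are left untouched,
    which is how a purely textual undo behaves.
    """
    back = {new: old for old, new in edits}
    return [back.get(line, line) for line in code]


# Third loop after operation C (unrolled by 4), as in Step 3.
unrolled = [
    "do i = 2, n, 4",
    "b = b + ((x(i) + a) / fourpi + 1.0d0)",
    "b = b + ((x(i+1) + a) / fourpi + 1.0d0)",
    "b = b + ((x(i+2) + a) / fourpi + 1.0d0)",
    "b = b + ((x(i+3) + a) / fourpi + 1.0d0)",
    "enddo",
]

# The part of operation A's difference that concerns this loop.
edits_a = [
    ("b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0)",
     "b = b + ((x(i) + a) / fourpi + 1.0d0)"),
]

after_undo = reverse_line_edits(unrolled, edits_a)
for line in after_undo:
    print(line)
```

Only the x(i) line is reverted to `(pi * 4.0d0)`; the three generated lines still use `fourpi`, matching the final result shown on the slide.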
Proposed Algorithm for saving/applying automatic operations
• Manual editing is handled by the patch algorithm.
  • Saving an operation: original code + modified code → context difference file → operation log.
  • Applying a saved operation: modified code + context difference file from the operation log → patch algorithm → optimized results on the modified code.
• Automatic conversion is handled by our proposed algorithm.
  • Saving an operation: specify the range on the original code → pseudo change file → context difference file; specify the conversion ID and arguments. The operation log holds the context difference file, the conversion ID and the arguments.
  • Applying a saved operation: modified code + context difference file from the operation log → patch algorithm → pseudo change file → apply the automatic conversion with the saved conversion ID and arguments → optimization results.
Scenario of Proposed Algorithm to Save Automatic Operations
Algorithm for saving operation history. The code before the operation is the result of Step 2 of the example scenario (operations A and B applied).
Step 1: Generate a pseudo change file by inserting special lines that specify the range for the automatic operation.
Pseudo change file:
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      fourpi = pi * 4.0d0
      do i = 1, n
        x(i) = i * sin(i / fourpi)
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + 1.0d0 / x(i)
      enddo
      t2 = rtc() - s
      s = rtc()
$BEGIN
      do i = 2, n
        b = b + ((x(i) + a) / fourpi + 1.0d0)
      enddo
$END
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Step 2: Create a context difference file between the file before editing and the pseudo change file. By saving this context difference file, the range-adjustment algorithm of the “patch” command can be used to identify the target range of the automatic conversion.
Context difference file:
*** opeB.F Sat Jul 11 11:36:34 2009
--- opeC2.F Sun Jul 12 13:36:10 2009
***************
*** 19,27 ****
--- 19,29 ----
        enddo
        t2 = rtc() - s
        s = rtc()
+ $BEGIN
        do i = 2, n
          b = b + ((x(i) + a) / fourpi + 1.0d0)
        enddo
+ $END
        t3 = rtc() - s
        write(*,*) 'a=', a, 'b=', b
        write(*,*) 'time=', t1, t2, t3
Step 3: Save the identifier of the automatic conversion operation (e.g. “loop unrolling”), its parameter (e.g. “4”), and the context difference file as its operation log.
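The three saving steps above can be sketched in Python. This is a minimal illustration, not the prototype's code: the function names, the dictionary layout of the operation log and the `factor` argument name are assumptions. Only the `$BEGIN`/`$END` marker lines, the context difference file and the saved identifier and parameter come from the slides. The context diff is produced with the standard `difflib.context_diff` function.

```python
import difflib

BEGIN, END = "$BEGIN", "$END"


def make_pseudo_change(lines, first, last):
    """Step 1: copy the code and wrap lines[first..last] in marker lines."""
    return (lines[:first] + [BEGIN] + lines[first:last + 1]
            + [END] + lines[last + 1:])


def save_operation(lines, first, last, conversion_id, arguments):
    """Steps 2 and 3: diff the code against the pseudo change file and
    store the diff together with the conversion identifier and arguments."""
    pseudo = make_pseudo_change(lines, first, last)
    diff = list(difflib.context_diff(
        lines, pseudo, fromfile="before", tofile="pseudo", lineterm=""))
    return {
        "conversion_id": conversion_id,
        "arguments": arguments,
        "context_diff": diff,
    }


# Tail of the example program; the third loop is lines 3..5 (0-based).
sample = [
    "enddo",
    "t2 = rtc() - s",
    "s = rtc()",
    "do i = 2, n",
    "b = b + ((x(i) + a) / fourpi + 1.0d0)",
    "enddo",
    "t3 = rtc() - s",
]
log = save_operation(sample, 3, 5, "loop unrolling", {"factor": 4})
print("\n".join(log["context_diff"]))
```

The saved diff contains only the two inserted marker lines plus context, so it records where the operation applies without recording what the conversion produced.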
Scenario of Proposed Algorithm to Apply Automatic Operation (Step 1)
Algorithm for applying operation history on modified target code. The operation log holds the identifier of the automatic conversion (“loop unrolling”), its parameter (4), and the context difference file.
Target code:
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      do i = 1, n
        x(i) = i * sin(i / (pi * 4.0d0))
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + x(i) ** (-1)
      enddo
      t2 = rtc() - s
      s = rtc()
      do i = 2, n
        b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0)
      enddo
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Context difference file from the operation log:
*** opeB.F Sat Jul 11 11:36:34 2009
--- opeC2.F Sun Jul 12 13:36:10 2009
***************
*** 19,27 ****
--- 19,29 ----
        enddo
        t2 = rtc() - s
        s = rtc()
+ $BEGIN
        do i = 2, n
          b = b + ((x(i) + a) / fourpi + 1.0d0)
        enddo
+ $END
        t3 = rtc() - s
        write(*,*) 'a=', a, 'b=', b
        write(*,*) 'time=', t1, t2, t3
• Step 1: Apply the context difference file to the target program by using the algorithm of the “patch” command, which yields a pseudo change file for the target code. The following trials are made in order until the context matches:
  • Trial 1: Apply the history at the same position.
  • Trial 2: Ignore the starting and ending line numbers.
  • Trial 3: Ignore the outermost one line before/after the modification.
  • Trial 4: Ignore the outermost two lines before/after the modification.
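The four trials can be sketched as a progressively relaxed search. The following Python code is a simplified illustration and not the “patch” implementation: the function `locate_range` and its parameters are hypothetical, and it compares only the context lines outside the marked range, assuming the range keeps the same number of lines. Real patch also compares the context lines inside the hunk.

```python
def locate_range(target, before, after, length, start):
    """Return the index of the first line of the range in target, or None.

    before/after: context lines saved just outside the marked range.
    length: number of lines in the range.
    start: index of the range recorded when the operation was saved.
    """
    def fits(pos, skip):
        # Drop `skip` outermost context lines on each side before comparing.
        lead = before[skip:]
        tail = after[:len(after) - skip]
        lo, hi = pos - len(lead), pos + length
        if lo < 0 or hi + len(tail) > len(target):
            return False
        return target[lo:pos] == lead and target[hi:hi + len(tail)] == tail

    # Trial 1: the recorded position with the full context.
    if fits(start, 0):
        return start
    # Trials 2 to 4: any position, ignoring 0, 1, then 2 outermost lines.
    for skip in (0, 1, 2):
        for pos in range(len(target) + 1):
            if fits(pos, skip):
                return pos
    return None


# Target without the fourpi assignment, so the loop sits one line earlier
# than the position recorded at save time (6).
target = [
    "do i = 1, n",
    "a = a + 1.0d0 / x(i)",
    "enddo",
    "t2 = rtc() - s",
    "s = rtc()",
    "do i = 2, n",
    "b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0)",
    "enddo",
    "t3 = rtc() - s",
    "write(*,*) 'a=', a, 'b=', b",
    "write(*,*) 'time=', t1, t2, t3",
]
before = ["enddo", "t2 = rtc() - s", "s = rtc()"]
after = ["t3 = rtc() - s", "write(*,*) 'a=', a, 'b=', b",
         "write(*,*) 'time=', t1, t2, t3"]
found = locate_range(target, before, after, 3, 6)
```

In this example Trial 1 fails because the recorded position is off by one line, and Trial 2 finds the loop at index 5, where the marker lines would then be inserted.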
Scenario of Proposed Algorithm to Apply Automatic Operation (Step 2) Algorithm for applying operation history on modified target code. • Step 2: Redo the automatic conversion with its parameter saved in the operation log. Here the operation log holds the identifier “loop unrolling” and the parameter 4, so “loop unrolling” is redone with factor 4 on the loop enclosed by $BEGIN and $END in the pseudo change file.
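As a rough illustration of this step, the following Python sketch redoes a loop unrolling on the range marked in a pseudo change file. It is not the prototype's conversion, which works on a parsed program. The function `unroll_marked_loop` is hypothetical and relies on strong assumptions: exactly one do-loop between the markers, no explicit stride, the loop variable used only as a plain index, and a trip count divisible by the factor (no remainder loop is generated, as in the slides).

```python
import re


def unroll_marked_loop(lines, factor):
    """Unroll the single do-loop enclosed by $BEGIN/$END marker lines."""
    b = lines.index("$BEGIN")
    e = lines.index("$END")
    header, *body, footer = lines[b + 1:e]
    m = re.fullmatch(r"\s*do\s+(\w+)\s*=\s*(.+)", header)
    var, bounds = m.group(1), m.group(2)
    unrolled = [f"do {var} = {bounds}, {factor}"]
    for k in range(factor):
        for stmt in body:
            if k:
                # Shift every standalone use of the loop variable by k.
                stmt = re.sub(rf"\b{var}\b", f"{var}+{k}", stmt)
            unrolled.append(stmt)
    unrolled.append(footer)
    # The marker lines are dropped from the result.
    return lines[:b] + unrolled + lines[e + 1:]


pseudo = [
    "s = rtc()",
    "$BEGIN",
    "do i = 2, n",
    "b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0)",
    "enddo",
    "$END",
    "t3 = rtc() - s",
]
result = unroll_marked_loop(pseudo, 4)
for line in result:
    print(line)
```

Because the conversion runs on the current loop body, every generated copy uses `(pi * 4.0d0)`, which is the behavior a textual patch cannot provide.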
Proposed Algorithm to Apply Automatic Operation (Final Results)
      program sample
      implicit none
      integer i, n
      parameter(n=10000000)
      real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc
      a = 0
      b = 0
      pi = 3.14159265d0
      s = rtc()
      do i = 1, n
        x(i) = i * sin(i / (pi * 4.0d0))
      enddo
      t1 = rtc() - s
      s = rtc()
      do i = 1, n
        a = a + 1 / x(i)
      enddo
      t2 = rtc() - s
      s = rtc()
      do i = 2, n, 4
        b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0)
        b = b + ((x(i+1) + a) / (pi * 4.0d0) + 1.0d0)
        b = b + ((x(i+2) + a) / (pi * 4.0d0) + 1.0d0)
        b = b + ((x(i+3) + a) / (pi * 4.0d0) + 1.0d0)
      enddo
      t3 = rtc() - s
      write(*,*) 'a=', a, 'b=', b
      write(*,*) 'time=', t1, t2, t3
      end
Problem solved. The correct line is unrolled! The proposed system can reapply automatic conversion operations correctly.
Outline of this Presentation • Proposal of an algorithm for managing operation history of source-to-source optimization. • Prototype system with new user interface for managing operation history explicitly.
Prototype Implementation of the Proposed System • Implemented as an Eclipse plug-in module (the HPC refactoring module). • Works with the open source CDT (C) and Photran (Fortran) modules. • Uses the CDT/Photran C/Fortran parsers. • Transformation rules are either pre-defined or user-defined. [Architecture diagram: the HPC refactoring module on Eclipse, alongside the CDT module (C) and the Photran module (Fortran), with pre-defined and user-defined transformation rules.]
Proposal of user interface for operation history management system 1. The operation history is displayed as a sequence, and the user can select and modify any point of the source code. 2. The succeeding operations are automatically reapplied as needed to produce a new version according to the user’s instructions. 3. Operations are classified into the following three categories according to the status and necessity of reapplication, and are displayed in three colors. • Green: applied. • Yellow: reapplication not yet attempted. • Red: reapplication attempted, but failed. [Screenshot: source code tree view, source code view, operation history view, and information and console output view.]
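The three status categories can be modeled with a small replay loop. The following Python sketch is an assumption-laden illustration, not the prototype's logic: the names `Status` and `reapply` are hypothetical, operations are modeled as plain functions on a string, and the choice to stop at the first failure (leaving later operations untried) is an assumption that the slides do not state.

```python
from enum import Enum


class Status(Enum):
    APPLIED = "green"
    NOT_TRIED = "yellow"
    FAILED = "red"


def reapply(base, operations, disabled=()):
    """Replay (name, function) operations in order on base.

    Operations listed in `disabled` (for example, an undone one) are
    skipped. A function raises ValueError when it cannot be applied;
    replay then stops and later operations stay NOT_TRIED.
    """
    code = base
    status = {name: Status.NOT_TRIED for name, _ in operations}
    for name, apply_op in operations:
        if name in disabled:
            continue
        try:
            code = apply_op(code)
        except ValueError:
            status[name] = Status.FAILED
            break
        status[name] = Status.APPLIED
    return code, status


def unroll(code):
    # Stand-in for an automatic conversion whose target range must be found.
    if "loop" not in code:
        raise ValueError("target range not found")
    return code.replace("loop", "unrolled-loop")


history = [
    ("A", lambda code: code + " fourpi"),
    ("B", lambda code: code + " 1/x"),
    ("C", unroll),
]
# Undo operation A: replay B and C on the original code without A.
code, status = reapply("loop", history, disabled={"A"})
```

Here B and C end up green while the undone A stays yellow; if the base code did not contain the loop, C would turn red instead.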
Conclusion • We proposed an algorithm for managing the operation history of source-to-source optimization. • We presented a prototype system with a new user interface for managing the operation history explicitly.