Learn about optimizing scalar code for faster runtime on new computers, using aggressive compiler options, loop unrolling, subroutine inlining, and vendor-tuned code.
Parallel Computing Explained: Scalar Tuning • Slides prepared from the CI-Tutor courses at NCSA (http://ci-tutor.ncsa.uiuc.edu/) by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009
Agenda • 1 Parallel Computing Overview • 2 How to Parallelize a Code • 3 Porting Issues • 4 Scalar Tuning • 4.1 Aggressive Compiler Options • 4.2 Compiler Optimizations • 4.3 Vendor Tuned Code • 4.4 Further Information
Scalar Tuning • If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. • This chapter describes several techniques for doing so: • The use of the most aggressive compiler options • The improvement of loop unrolling • The use of subroutine inlining • The use of vendor-supplied tuned code • The detection of cache problems and their solutions are presented in the Cache Tuning chapter.
Aggressive Compiler Options • For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3. • -O0 turns off all optimizations. • -O1 and -O2 perform beneficial optimizations that will not affect the accuracy of results. • -O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.
Aggressive Compiler Options • Note that -O3 might carry out loop transformations that produce incorrect results in some codes. • It is recommended that you compare the answer obtained with -O3 against one obtained with a lower optimization level. • To enforce conformance to the IEEE arithmetic standard at different levels, -O3 can be combined with -OPT:IEEE_arithmetic=n (n = 1, 2, or 3) on the SGI Origin2000 and with -mp (or -mp1) on the Linux clusters. • On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.
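The recommendation to compare -O3 results with those of a lower optimization level can be automated. The following is a minimal sketch in C, assuming each build writes its results into an array of doubles; the helper names abs_d and max_rel_diff are hypothetical and not part of the original course material.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value written by hand so the sketch does not need the math library. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical helper: largest relative difference between two result
 * arrays, e.g. one produced by an -O2 build and one by an -O3 build. */
double max_rel_diff(const double *ref, const double *opt, size_t n)
{
    assert(ref != NULL && opt != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = abs_d(ref[i]);
        double b = abs_d(opt[i]);
        double scale = a > b ? a : b;
        if (scale == 0.0)
            continue; /* both values are zero, so they agree exactly */
        double diff = abs_d(ref[i] - opt[i]) / scale;
        if (diff > worst)
            worst = diff;
    }
    return worst;
}
```

A driver could load the output of each build and flag the run when max_rel_diff exceeds a tolerance. The right tolerance depends on the application and is left as an assumption here.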
Agenda • 1 Parallel Computing Overview • 2 How to Parallelize a Code • 3 Porting Issues • 4 Scalar Tuning • 4.1 Aggressive Compiler Options • 4.2 Compiler Optimizations • 4.2.1 Statement Level • 4.2.2 Block Level • 4.2.3 Routine Level • 4.2.4 Software Pipelining • 4.2.5 Loop Unrolling • 4.2.6 Subroutine Inlining • 4.2.7 Optimization Report • 4.2.8 Profile-guided Optimization (PGO) • 4.3 Vendor Tuned Code • 4.4 Further Information
Compiler Optimizations • The various compiler optimizations can be classified as follows: • Statement Level Optimizations • Block Level Optimizations • Routine Level Optimizations • Software Pipelining • Loop Unrolling • Subroutine Inlining • Each of these is described in the following sections.
Statement Level • Constant Folding • Replace simple arithmetic operations on constants with the precomputed result. • y = 5+7 becomes y = 12 • Short Circuiting • Avoid executing parts of conditional tests that are not necessary. • In if (I.eq.J .or. I.eq.K) expression, when I equals J the second test is skipped and the expression is computed immediately. • Register Assignment • Put frequently used variables in registers.
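Constant folding and short circuiting can be illustrated in C. In C the || operator short-circuits by language rule, whereas in Fortran skipping the second test is something the compiler may choose to do. All names in this sketch (second_test_calls, equal_counted, matches_either, folded_constant, demo_statement_level) are hypothetical and exist only to make the skipped test observable.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the second comparison is actually evaluated. */
int second_test_calls = 0;

static bool equal_counted(int a, int b)
{
    second_test_calls++;
    return a == b;
}

/* Mirrors "if (I.eq.J .or. I.eq.K)": when i == j is true, the second
 * comparison is never evaluated. */
bool matches_either(int i, int j, int k)
{
    return i == j || equal_counted(i, k);
}

/* Constant folding: a compiler may emit the value 12 directly instead
 * of an addition instruction. */
int folded_constant(void)
{
    return 5 + 7;
}

void demo_statement_level(void)
{
    second_test_calls = 0;
    assert(matches_either(3, 3, 7));   /* first test succeeds */
    assert(second_test_calls == 0);    /* second test was skipped */
    assert(matches_either(3, 4, 3));   /* first test fails, second runs */
    assert(second_test_calls == 1);
    assert(folded_constant() == 12);
}
```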
Block Level • Dead Code Elimination • Remove unreachable code and code that is never executed or used. • Instruction Scheduling • Reorder the instructions to improve memory pipelining.
Routine Level • Strength Reduction • Replace expressions in a loop with an expression that takes fewer cycles. • Common Subexpression Elimination • Expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression. • Constant Propagation • Compile-time replacement of variables with constants. • Loop Invariant Elimination • Expressions inside a loop that do not change with the do loop index are moved outside the loop.
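The routine-level transformations can be pictured by applying them by hand. The sketch below, in C, shows a "before" and "after" pair for common subexpression elimination combined with loop invariant removal, and another pair for strength reduction. The function names are hypothetical, and a real compiler performs these rewrites internally without changing the source.

```c
#include <assert.h>
#include <stddef.h>

/* As written: a * b appears twice and is recomputed in every iteration,
 * although it does not depend on the loop index. */
void scale_naive(double *out, const double *in, size_t n, double a, double b)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * (a * b) + (a * b);
}

/* After common subexpression elimination and moving the loop invariant
 * out of the loop: a * b is computed once. */
void scale_hoisted(double *out, const double *in, size_t n, double a, double b)
{
    assert(out != NULL && in != NULL);
    const double ab = a * b;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * ab + ab;
}

/* As written: one multiplication per iteration to form the index. */
long sum_strided_mul(const long *v, size_t count, size_t stride)
{
    assert(v != NULL);
    long sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += v[i * stride];
    return sum;
}

/* After strength reduction: the multiplication is replaced by a
 * running addition, which typically takes fewer cycles. */
long sum_strided_add(const long *v, size_t count, size_t stride)
{
    assert(v != NULL);
    long sum = 0;
    size_t idx = 0;
    for (size_t i = 0; i < count; i++) {
        sum += v[idx];
        idx += stride;
    }
    return sum;
}
```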
Software Pipelining • Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle. • Note: On the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.
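The mixing of operations from different iterations can be imitated at source level. The sketch below, in C, is a hand-written illustration under the assumption that x and y do not overlap: the loads for iteration i+1 are issued before the arithmetic and store for iteration i. The names axpy_plain and axpy_pipelined are hypothetical. A compiler does this at the instruction level, so the source form here only shows the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: load, compute, and store all belong to the
 * same iteration. The arrays x and y must not overlap. */
void axpy_plain(float *y, const float *x, float a, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hand-pipelined variant: each pass through the loop body loads the
 * operands of iteration i+1 while finishing the work of iteration i. */
void axpy_pipelined(float *y, const float *x, float a, size_t n)
{
    assert(x != NULL && y != NULL);
    if (n == 0)
        return;
    float xv = x[0];                /* prologue: loads for iteration 0 */
    float yv = y[0];
    for (size_t i = 0; i + 1 < n; i++) {
        float next_x = x[i + 1];    /* loads for iteration i+1 */
        float next_y = y[i + 1];
        y[i] = a * xv + yv;         /* compute and store for iteration i */
        xv = next_x;
        yv = next_y;
    }
    y[n - 1] = a * xv + yv;         /* epilogue: finish the last iteration */
}
```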
Loop Unrolling
• The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight-line code. An example of loop unrolling follows:

    Original Loop              Unrolled Loop
    do I = 1, 99               do I = 1, 99, 3
      c(I) = a(I) + b(I)         c(I) = a(I) + b(I)
    enddo                        c(I+1) = a(I+1) + b(I+1)
                                 c(I+2) = a(I+2) + b(I+2)
                               enddo

• There is a limit to the amount of unrolling that can take place because there are a limited number of registers.
• On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:
    f90 -O3 -OPT:unroll_times_max=12 ... prog.f
• On the IA32 Linux cluster, the corresponding flags are -unroll and -unroll0 for unrolling and no unrolling, respectively.
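The unrolling example above uses an iteration count (99) that is divisible by the unroll factor (3). When that is not guaranteed, the leftover iterations need a cleanup loop. The following C sketch shows one way to write this by hand; the function names add_plain and add_unrolled3 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration. */
void add_plain(double *c, const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Unrolled by 3: the stride becomes 3 and the body is replicated.
 * A cleanup loop handles the 0, 1, or 2 remaining elements. */
void add_unrolled3(double *c, const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    for (; i + 2 < n; i += 3) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
    }
    for (; i < n; i++)              /* cleanup for leftover iterations */
        c[i] = a[i] + b[i];
}
```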
Subroutine Inlining • Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself. • One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, inlining may be more efficient because it removes the call overhead paid on every iteration. • However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize.
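The effect of inlining can be shown by writing the inlined form by hand. In this C sketch the function names square_plus_one, sum_with_call, and sum_inlined are hypothetical. An optimizing compiler may already inline a small function like this on its own, so the second version only shows what the transformation produces.

```c
#include <assert.h>
#include <stddef.h>

/* Small routine called once per loop iteration. */
static double square_plus_one(double x)
{
    return x * x + 1.0;
}

/* Before inlining: every iteration performs a function call. */
double sum_with_call(const double *v, size_t n)
{
    assert(v != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += square_plus_one(v[i]);
    return sum;
}

/* After inlining: the call is replaced by the body of the routine,
 * leaving a loop with no calls in it. */
double sum_inlined(const double *v, size_t n)
{
    assert(v != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = v[i];
        sum += x * x + 1.0;
    }
    return sum;
}
```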
Subroutine Inlining
• On the SGI Origin2000 computer, there are several options to invoke inlining:
• Inline all routines except those specified to -INLINE:never:
    f90 -O3 -INLINE:all … prog.f
• Inline no routines except those specified to -INLINE:must:
    f90 -O3 -INLINE:none … prog.f
• Specify a list of routines to inline at every call:
    f90 -O3 -INLINE:must=subrname … prog.f
• Specify a list of routines never to inline:
    f90 -O3 -INLINE:never=subrname … prog.f
• On the Linux clusters, the following flags can invoke function inlining:
• -ip: inline function expansion for calls defined within the current source file
• -ipo: inline function expansion for calls defined in separate files
Optimization Report
• Intel 9.x and later compilers can generate reports that provide useful information on the optimizations done on different parts of your code.
• To write such an optimization report to a file named filename, add the flag -opt-report-file filename.
• If you have many source files to process and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,

.f.o:
	ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f

• creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".
Optimization Report • To help developers and performance analysts navigate the usually lengthy optimization reports, the NCSA program OptView provides an easy-to-use, intuitive interface that lets users browse their own source code, cross-referenced with the optimization reports. • OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or invoke optview using an absolute path name. You need to be using the X Window System and to have set your DISPLAY environment variable correctly for OptView to work. • OptView can provide a quick overview of which loops, in one source file or across multiple files, are highly optimized and which might need further work. For a detailed description of the use of OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/
Profile-guided Optimization (PGO) • Profile-guided optimization allows Intel compilers to use runtime information to make better decisions about function inlining and interprocedural optimizations and so generate faster code. Its methodology consists of three steps: an instrumented compilation, a run on representative data, and a recompilation that uses the collected profile.
Profile-guided Optimization (PGO)
• First, you do an instrumented compilation by adding the -prof-gen flag in the compile process:
    icc -prof-gen -c a1.c a2.c a3.c
    icc a1.o a2.o a3.o -lirc
• Then, you run the program with a representative set of data to generate the dynamic information files, which have the .dyn suffix.
• These files contain runtime information that lets the compiler do better function inlining and other optimizations.
• Finally, the code is recompiled with the -prof-use flag to use the runtime information:
    icc -prof-use -ipo -c a1.c a2.c a3.c
• Once the resulting object files are linked, a profile-guided optimized executable is generated.
Vendor Tuned Code • Vendor math libraries have codes that are optimized for their specific machine. • On the SGI Origin2000 platform, Complib.sgimath and SCSL are available. • On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.
Further Information • SGI IRIX man and www pages • man opt • man lno • man inline • man ipa • man perfex • Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/ • Linux clusters help and www pages • ifort/icc/icpc -help (Intel) • http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64) • http://perfsuite.ncsa.uiuc.edu/OptView/