Compiler Techniques for Single Processor Tuning

Compiler Techniques for Single Processor Tuning An introduction

Outline • Compiler is the primary tool of computer program optimization: • structure of the compiler and the compilation process • compiler optimizations • steering of the compilation process - compiler options • structure of the run time libraries and scientific libraries • computational domain and computation accuracy

The Compiler • Solving: • data dependencies • control flow dependencies • parallelization • compactification of code • optimal scheduling of the code Compilation process Intermediate representation • Compiler manages processor resources: • registers • integer/floating-point execution units • load/store/prefetch for data flow in/out of processor • The implementation details of processor and system architecture are build into the compiler • User Program (C/C++/Fortran, etc.) • high level representation • low level representation • Machine instructions

MIPSpro Compiler Components source Executable object Inter- Procedural Analyser Linker F77/f90 cc/CC driver Global optimizer Code generator Front-end (source to WHIRL format) Macro pre- processor Inter- Procedural Analyser Loop nest optimizer Parallel optimizer • There are no source-to-source optimizers or parallelizers • source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language); • same IR for different levels of representation • whirl2f and whirl2c translates back into Fortran or C from IRs • Inter-Procedural analyser requires final translation at link time

Compiler Optimizations • Global optimizer: • dead code elimination • copy propagation • loop normalisation • stride one loops • single induction variable • memory alias analysis • strength reduction • Inter-Procedural Analyser: • cross-file function inlining • dead function elimination • dead variable elimination • padding of variables in common blocks • inter-procedural constant propagation • Automatic parallelizer • loop level work distribution • Loop Nest optimizer: • loop unrolling (outer) • loop interchange • loop fusion/fission • loop blocking • memory prefetch • padding local variables • Code Generator: • software pipelining • inner loop unrolling • if-conversion • read/write optimization • recurrence breaking • instruction scheduling inside basic blocks

SGI Architecture, ABI, Languages • Instruction Set Architecture (ISA): • -mips4 (R1x000, R8000, R5000 processors) • -mips3 (R4400) • -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler) • ABI (Application Binary Interface): • -n32 (32 bit pointers, 4 byte integers, 4 byte real) • -64 (64 bit pointers, 4 byte integers, 4 byte real) • Languages: • C • C++ • Fortran 77 • Fortran 90 Variable C size[bit] F size[bit] -n32 -64 -n32 -64 char/character 8 8 8 8 short 16 16 int/integer 32 32 32 32 long 32 64 long long 64 64 logical 32 32 float/real 32 32 32 32 double 64 64 64 64 pointer 32 64

Options: ABI & ISA • Option Functionality • -n32 invoke the MIPSpro Compiler, use 32 bit addressing • -64 invoke the MIPSpro Compiler, use 64 bit addressing • -o32/-32 invoke the old ucode compiler, 32 bit addressing • -mips[1234] ISA; -mips[12] implies ucode compiler • There are two more ways to define the ABI and ISA: • environment variable “SGI_ABI” can be set to -n32 or -64 • the ABI/ISA/Processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by “COMPILER_DEFAULTS_PATH” environment variable. The file should contain a line like: -DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3 • There is a way to find which compiler flags were used: • dwarfdump -i file.o | grep DW_AT_producer

Optimization Levels • Compilation speed degrades with higher optimization • -O0 turn off all optimizations • -O1 only local optimizations • -O2 or -O extensive but conservative optimizations • -O3 aggressive optimizations, LNO, software pipelining • -ipa inter-procedural analysis (only at -O2 and -O3) • -apo automatic parallelization option (same as -pfa) • -g[0|3]debugging switch: -g0 forces -O0-g3 to debug with -O3

Options: Performance • Option Functionality • -r10000 Generate optimal instruction schedule for the R10000 proc • -r8000 Generate optimal instruction schedule for the R8000 proc • -O[0|1|2|3] Set optimization Level to 0, 1, 2, 3 • -Ofast=[ipXX] Select best optimization for the given architecture • -mp Enable multi-processing directives • -mpio Support I/O from a parallel region • -apo Invoke automatic parallelization option XX machine (output of thehinv -c processorcommand) 27 Origin2000 (all cpu frequencies and cache sizes) 35 Origin3000 (all cpu frequencies and cache sizes) optimizations may differ on the version of the compiler. Currently: -O3 -IPA-TARG:platform=ip27 -n32 -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed (thus -Ofast switch invokes the Interprocedural Analyser)

Options: Porting • Option Functionality • -d8/d16 Double precision variables as 8 or 16 bytes • -r8 Convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1) • -i8 Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1) • -static Local variables will be initialised in fixed locations on the heap • -col[72|120] Source line is 72 or 120 columns • -Dname Define name for the pre-processor • -Idir Define include directory dir • -alignN Assume alignment on the N=8,16,32,64,128 bit boundary • -version Show compiler version • -show Put the compiler in verbose mode: all switches are displayed • (1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit

Options: Debugging • Option Functionality • -g Disable optimization and keep all symbol tables • -DEBUG: the DEBUG group option (man DEBUG_GROUP): • check_div=nn=1 (default) check integer divide by zero • n=2 check integer overflow • n=3 check integer divide by zero and overflow • subscript_check (default ON) to check for subscripts out of range • C/C++: produces trap #8 • f77: aborts run and dumps core • f90: aborts run if setenv F90_BOUNDS_CHECK_ABORT • verbose_runtime (default OFF) to give source line number of failures • trap_uninitialized (default OFF) initialise all variables to 0xFFFA5A5 • when used as pointer - access violation • when used as fp values - NaN causes fp trap • Example: • f77 -n32 -mips4 -g file.f \ • -DEBUG:subscript_check:verbose_runtime=ON \ • -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON

Compilation Examples • 1. Produce executable a.out with default compilation options: f77 source.f • cc source.f • be aware of the defaults setting (e.g./etc/compiler.defaults) • same flags for Fortran and C • 2. Options for debugging: • f77/cc -o prog -n32 -g -static source.f • 2. Explicit setting of ABI/ISA/Processor, highest opt:f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f • 3. Detailed control of the optimization process with the • group options : • f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27 • -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...

Fine Tuning Compiler Actions Compiler performs many sophisticated optimizations on the source code under certain assumption about the program. Typically: • program data is large (does not fit into the cache) • program does not violate language standard • program is insensitive to roundoff errors • all data in the program is alias-ed, unless it can prove otherwise if one or more of these assumptions does not hold, compiler should be tuned to the program with the compiler options. Most important: • OPT for general optimizations assumptions • LNO for the Loop Nest optimizer options • IPA for the Inter-Procedural Analyser options • Additional options that help to tune the compiler properly: • TENV, TARG for the target machine and environment description • LIST, DEBUG for the listing and debugging options

Group Options • Compiler options can be set with the key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon separated; same group headings can be specified several times, the effects are cumulative: • E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc. • Group Heading Reference page Usage comments • -OPT:key=val cc(1) f77(1) opt(5) optimizations • -TENV:key=val cc(1) f77(1) Control target environment • -TARG:key=val cc(1) f77(1) Control target architecture • -FLIST/CLIST cc(1) f77(1) Listing control • -LIST:key=val cc(1) f77(1) Options to control listing • -DEBUG:key=val debug_group(5) Debugging options • -IPA:key=val ipa(5) Inter-Procedural Analyser control • -INLINE:key=val ipa(5) Procedure inliner control • -LNO:key=val lno(5) Loop Nest optimizer control • -MP:key=val cc(1) f77(1) parallelization control • -LANG: cc(1) f77(1) language compatibility features

Compiler man Pages • Primary man pages: • man f77(1) f90(1) cc(1) CC(1) ld(1) • some of the compiler option groups are rather large and deserve their own man pages • man opt(5) • man lno(5) • man ipa(5) • man DEBUG_GROUP(5) • man mp(3F) • man pe_environ(5) • man sigfpe(3C)

The Run-Time Library Structure *.a, *.so ucode-compiler nonshared/*.a *.a, *.so *.a, *.so Cmplrs/mongoose-compiler Cmplrs/mongoose-compiler nonshared/*.a nonshared/*.a *.a, *.so *.a, *.so nonshared/*.a nonshared/*.a *.a, *.so *.a, *.so mips3 mips3 nonshared/*.a nonshared/*.a mips4 mips4 o32 lib/ n32 lib32/ /usr lib64/ n64 Environment variables: LD_LIBRARY_PATH LD_LIBRARY32_PATH LD_LIBRARY64_PATH

The Scientific Libraries • Standard scientific libraries containing: • Basic Linear Algebra operations and algorithms: • BLAS1, BLAS2, BLAS3 (see man intro_blas1,_blas2,_blas3) • LAPACK (see man intro_lapack) • Fast Fourier Transformations (FFT): • 1D, 2D, 3D, multiple 1D transformations (see man intro_fft) • Convolutions (Signal Processing, e.g. man SIIR2D) • Sparse Solvers (see man solvers; man PSLDLT) • To use: • -lscs serial versions • -lscs_mp-mp for parallel versions • man intro_scsl for detailed description • -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions • man complib.sgimath for detailed desctiption

Computational Domain • Range of numbers (from /usr/include/limits.h): • FLT_DIG 6 /* decimal digits of precision of a float */ • FLT_MAX 3.40282347E+38F • FLT_MIN 1.17549435E-38F • DBL_DIG 15 /* decimal digits of precision of a double */ • DBL_MAX 1.7976931348623157E+308 • DBL_MIN 2.2250738585072014E-308 • LONGLONG_MIN -9223372036854775807LL-1LL • LONGLONG_MAX 9223372036854775807LL • ULONGLONG_MAX 18446744073709551615LLU • The extended precision (REAL*16) is available and supported by the compiler. But this mode of calculation is slow (by factor ~40)

Underflow and Denormal Numbers • When de-normalised numbers emerge in a computation (i.e. numbers x<DBL_MIN) they are flushed to zero by default: • will print zero. To force IEEE-754 gradual underflow it is necessary to manipulate status register on the R1x000 cpu. • Calling no_flush at the beginning of the program will print • 0.22250738585072014D-308 • Flush-to-zero property can lead to x-y=0, while xy . • Keeping de-normalised numbers in computations will avoid that condition, but will cause • fp exception, that must be processed in software. • It is a performance issue - not to manipulate the de-normalized numbers in calculations. #include <sys/fpu.h> void no_flush_() { union fpc_csr f; f.fc_word = get_fpc_csr(); f.fc_struct.flush = 0; set_fpc_csr(f.fc_word); } Program denorm real*8 a,b a = 2.2250738585072014D-308 b = a/10.0D0 write(6,10) b end

Overflow Example Flush to zero! • Program example that generates overflows and underflows: • Output with all exceptions ignored by default: • setenv TRAP_FPE „UNDERFL=TRACE; OVERFL=TRACE“ • will trap at Overflow and Underflow and produce traceback (Link-lfpe). Parameter (N=20) INCLUDE “/usr/include/limits.h” Real*8 A(N),B(N) complex*16 C(N) do I=1,N A(I) = (FLT_MAX/10)*I ! single precision range B(I) = (FLT_MIN*10)/I ! will fit into double enddo C = CMPLX(A,B) ! Standard requires passing from base precision: real*4 ! write (0,’(I3,2(2G22.15/))’) (I,A(I),B(I),C(I),I=1,N) Compile with: f77 -n32 -mips4 -O3 Note: Compilation with -r8 avoids the error. 10 0.340282347000000E+39 0.117549435000000E-37 A,B 0.340282346638529E+39 0.117549435082229E-37 Cr,Ci 11 0.374310581700000E+39 0.106863122727273E-37 Infinity 0.000000000000000 12 etc… Overflow!

Floating Point Exceptions • A fp status register flag is set when fpu is has an illegal condition: • division by zero • overflow • underflow • invalid • inexact • By default, all exceptions are ignored! • (e.g. for 1/0 NaN value is set and execution continues) • The status register can be programmed to raise a Floating Point Exception. • If an FPE occurs, the system can take a specified action: • abort • ignore the exception • repair the illegal condition • You can manipulate the status register to select action: • with calls to the FPE library, link with -lfpe • with environment variable TRAP_FPE • see man handle_sigfpes

Compiler-Generated Exceptions Do i=1,N if(a(i) .lt. eps) then x = x + 1/eps else x = x + 1/a(i) endif enddo #put eps in $f1 and (1/eps) in $f0 ldc1 $f5,-8($2) load a(i) c.lt.d $fcc0,$f5,$f1 if(a(i) < eps) recip.d $f2,$f5 1/a(i) movt.d $f2,$f0,$fcc0 y = 1/a or 1/eps add.d $f1,$f1,$f2 x = x+ y • compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4) • X = 0 no speculative code motion • X = 1 IEEE-754 underflow and inexact FPE disabled (default -O0 and -O2) • X = 2 all IEEE-754 exceptions disabled except 1/0 (default -O3) • X = 3 all IEEE-754 exceptions are disabled • X = 4 memory access exceptions are disabled • IF-conversion with conditional moves for Software Pipelining (with -O3): Removing IF(…) will cause divide by zero! In this case this exception must be ignored Note: transf applied already at -O3 & X=1!

IEEE_754 Compliance y_tmp = 1/y do i=1,n x = x + a(I)*y_tmp enddo Do i=1,n x = x + a(i)/y enddo • The MIPS4 instruction set contains IEEE-754 non-compliant instructions: • recip.s/d (reciprocal 1/x) instruction is accurate to 1 ulp • rsqrt.s/d (reciprocal-sqrt: 1/sqrt(x)) instruction to 2 ulp • -OPT:IEEE_arithmetic=X specify degree of non-compliance • and what to do with inf and NaN operands • X=1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1,2) • X=2 optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3) • X=3 any mathematically valid transformation is allowed, including recip & rsqrt instr. -O3 -OPT:IEEE_arithmetic=3 21 cycles/iteration; 4% peak 1 cycles/iteration; 100% peak Note: X=3 is required!

Rounding Accuracy do i=1,n,8 x0 = x0 + a(i) x1 = x1 + a(i+1) … enddo x = x0 + x1 + … do i=1,n x = x + a(i) enddo -O3 -OPT:roundoff=2 (default at -O3) • Rounding mode can be specified with -OPT:roundoff=X switch: • X = 0 no optimizations that affect fp behaviour (default at -O1 -O2) • X = 1 allows simple transformations with limited round-off and overflow differences • X = 2 allows reordering of reduction loops (default at -O3) • X = 3 any mathematically valid transformation is allowed With -O3 -OPT:roundoff=1 2 cycles/iter; 25% peak 1 cycles/iter; 50% peak Recommendation: Your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3

Summary • Compiler is the primary tool of program optimization • compilation is the process of lowering the code representation from high level to low, I.e. processor level • the MipsPro compiler targets the MIPS R1x000 processor and has build in the features of the processor and Origin2000 architecture • A large number of options exist to steer the compilation process • ABI, ISA and optimization options selections • setting of assumptions about the program behaviour • there are optimized and parallelized libraries of subroutines for scientific computation • when programming for a digital computer, it is important to remember the limitations due to limited validity range of the floating point calculations

Compiler Techniques for Single Processor Tuning

Compiler Techniques for Single Processor Tuning

Presentation Transcript

Compiler techniques for exposing ILP

Code Tuning Techniques

Compiler Techniques for Single Processor Tuning

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Advanced Compiler Techniques

Code Tuning Techniques

Compiler techniques for exposing ILP

Advanced Compiler Techniques

Single Cycle Processor Design

Optimizing Compiler for the Cell Processor

Compiler Techniques for ILP

Single Cycle Processor Design

Single Cycle Processor