Compiler Techniques for Single Processor Tuning An Introduction
The Compiler
• Compiler manages processor resources:
  • registers
  • integer/floating-point execution units
  • load/store/prefetch for data flow in/out of processor
• The implementation details of processor and system architecture are built into the compiler
• Solving:
  • data dependencies
  • control flow dependencies
  • parallelization
  • compactification of code
  • optimal scheduling of the code
• Compilation process (through the intermediate representation):
  • User Program (C/C++/Fortran, etc.)
  • high level representation
  • low level representation
  • Machine instructions
MIPSpro Compiler Components
[Figure: compilation flow from source through object to executable. Components shown: driver (f77/f90, cc/CC), macro pre-processor, front-end (source to WHIRL format), Inter-Procedural Analyzer, loop nest optimizer, parallel optimizer, global optimizer, code generator, linker.]
• There are no source-to-source optimizers or parallelizers
• Source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language):
  • same IR for different levels of representation
  • whirl2f and whirl2c translate back into Fortran or C from the IR
• The Inter-Procedural Analyzer requires final translation at link time
Compiler Optimizations
• Global Optimizer:
  • dead code elimination
  • copy propagation
  • loop normalization
  • stride one loops
  • single induction variable
  • memory alias analysis
  • strength reduction
• Inter-Procedural Analyzer:
  • cross-file function inlining
  • dead function elimination
  • dead variable elimination
  • padding of variables in common blocks
  • inter-procedural constant propagation
• Automatic Parallelizer:
  • loop level work distribution
• Loop Nest Optimizer:
  • loop unrolling (outer)
  • loop interchange
  • loop fusion/fission
  • loop blocking
  • memory prefetch
  • padding local variables
• Code Generator:
  • software pipelining
  • inner loop unrolling
  • if-conversion
  • read/write optimization
  • recurrence breaking
  • instruction scheduling inside basic blocks
SGI Architecture, ABI, Languages
• Instruction Set Architecture (ISA):
  • -mips4 (R1x000, R8000, R5000 processors)
  • -mips3 (R4400)
  • -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler)
• ABI (Application Binary Interface):
  • -n32 (32 bit pointers, 4 byte integers, 4 byte real)
  • -64 (64 bit pointers, 4 byte integers, 4 byte real)
• Languages:
  • C
  • C++
  • Fortran 77
  • Fortran 90
• Variable sizes:

  Variable          C size [bit]     Fortran size [bit]
                    -n32    -64      -n32    -64
  char/character      8       8        8       8
  short              16      16        -       -
  int/integer        32      32       32      32
  long               32      64        -       -
  long long          64      64        -       -
  logical             -       -       32      32
  float/real         32      32       32      32
  double             64      64       64      64
  pointer            32      64        -       -
Options: ABI & ISA
  Option        Functionality
  -n32          invoke the MIPSpro compiler, use 32 bit addressing
  -64           invoke the MIPSpro compiler, use 64 bit addressing
  -o32/-32      invoke the old ucode compiler, 32 bit addressing
  -mips[1234]   ISA; -mips[12] implies ucode compiler
• There are two more ways to define the ABI and ISA:
  • the environment variable SGI_ABI can be set to -n32 or -64
  • the ABI/ISA/processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by the COMPILER_DEFAULTS_PATH environment variable. The file should contain a line like:
      -DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3
• There is a way to find which compiler flags were used:
      dwarfdump -i file.o | grep DW_AT_producer
Optimization Levels
• Compilation speed degrades with higher optimization
  -O0          turn off all optimizations
  -O1          only local optimizations
  -O2 or -O    extensive but conservative optimizations
  -O3          aggressive optimizations, LNO, software pipelining
  -ipa         inter-procedural analysis (only at -O2 and -O3)
  -apo         automatic parallelization option (same as -pfa)
  -g[0|3]      debugging switch: -g0 forces -O0; -g3 to debug with -O3
Options: Performance
  Option          Functionality
  -r10000         Generate optimal instruction schedule for the R10000 processor
  -r8000          Generate optimal instruction schedule for the R8000 processor
  -O[0|1|2|3]     Set optimization level to 0, 1, 2, 3
  -Ofast[=ipXX]   Select best optimization for the given architecture
  -mp             Enable multi-processing directives
  -mpio           Support I/O from a parallel region
  -apo            Invoke automatic parallelization option
• XX is the machine type (output of the hinv -c processor command):
  XX    Machine
  27    Origin2000 (all cpu frequencies and cache sizes)
  35    Origin3000 (all cpu frequencies and cache sizes)
• The optimizations selected by -Ofast may differ with the version of the compiler. Currently:
      -O3 -IPA -TARG:platform=ip27 -n32 -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed
  (thus the -Ofast switch invokes the Inter-Procedural Analyzer)
Options: Porting
  Option        Functionality
  -d8/d16       Double precision variables as 8 or 16 bytes
  -r8           Convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1)
  -i8           Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1)
  -static       Local variables are allocated in fixed (static) memory locations
                (-static_threadprivate makes static variables private to each thread)
  -col[72|120]  Source line is 72 or 120 columns
  -Dname        Define name for the pre-processor
  -Idir         Define include directory dir
  -alignN       Assume alignment on the N=8,16,32,64,128 bit boundary
  -G0           Put all static data into indirect address area
  -xgot         Make big tables for static data and program addresses
  -multigot     Automatic choice of table sizes for static variables and addresses
  -version      Show compiler version
  -show         Put the compiler in verbose mode: all switches are displayed
• (1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit
Options: Debugging
  Option    Functionality
  -g        Disable optimization and keep all symbol tables
  -DEBUG:   the DEBUG group option (man DEBUG_GROUP):
  • check_div=n
      n=1 (default) check integer divide by zero
      n=2 check integer overflow
      n=3 check integer divide by zero and overflow
  • subscript_check (default ON) to check for subscripts out of range
      C/C++: produces trap #8
      f77: aborts run and dumps core
      f90: aborts run if the environment variable F90_BOUNDS_CHECK_ABORT is set (setenv F90_BOUNDS_CHECK_ABORT)
  • verbose_runtime (default OFF) to give source line number of failures
  • trap_uninitialized (default OFF) initialize all variables to 0xFFFA5A5A
      when used as a pointer: access violation
      when used as an fp value: NaN causes fp trap
• Example:
      f77 -n32 -mips4 -g file.f \
          -DEBUG:subscript_check:verbose_runtime=ON \
          -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON
Compilation Examples
• 1. Produce executable a.out with default compilation options:
      f77 source.f
      cc source.c
  • be aware of the defaults setting (e.g. /etc/compiler.defaults)
  • same flags for Fortran and C
• 2. Options for debugging:
      f77/cc -o prog -n32 -g -static source.f
• 3. Explicit setting of ABI/ISA/processor, highest optimization:
      f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f
• 4. Detailed control of the optimization process with the group options:
      f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27
          -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...
Fine Tuning Compiler Actions
• The compiler performs many sophisticated optimizations on the source code under certain assumptions about the program. Typically:
  • program data is large (does not fit into the cache)
  • program does not violate the language standard
  • program is insensitive to roundoff errors
  • all data in the program is aliased, unless it can be proved otherwise
• If one or more of these assumptions do not hold, the compiler should be tuned to the program with the compiler options. Most important:
  • OPT for general optimization assumptions
  • LNO for the Loop Nest Optimizer options
  • IPA for the Inter-Procedural Analyzer options
• Additional options that help to tune the compiler properly:
  • TENV, TARG for the target machine and environment description, e.g. -TENV:align_aggregates=x (bytes)
  • LIST, DEBUG for the listing and debugging options
Group Options
• Compiler options can be set with key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon separated; the same group heading can be specified several times, and the effects are cumulative:
  • E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc.

  Group Heading      Reference page         Usage comments
  -OPT:key=val       cc(1) f77(1) opt(5)    Optimizations
  -TENV:key=val      cc(1) f77(1)           Control target environment
  -TARG:key=val      cc(1) f77(1)           Control target architecture
  -FLIST/CLIST       cc(1) f77(1)           Listing control
  -LIST:key=val      cc(1) f77(1)           Options to control listing
  -DEBUG:key=val     debug_group(5)         Debugging options
  -IPA:key=val       ipa(5)                 Inter-Procedural Analyzer control
  -INLINE:key=val    ipa(5)                 Procedure inliner control
  -LNO:key=val       lno(5)                 Loop Nest Optimizer control
  -MP:key=val        cc(1) f77(1)           Parallelization control
  -LANG:             cc(1) f77(1)           Language compatibility features
  -CG:               cc(1) f77(1)           Code generation
  -WOPT:             cc(1) f77(1)           Global optimizer
Compiler man Pages
• Primary man pages:
  • man f77(1) f90(1) cc(1) CC(1) ld(1)
• Some of the compiler option groups are rather large and deserve their own man pages:
  • man opt(5)
  • man lno(5)
  • man ipa(5)
  • man DEBUG_GROUP(5)
  • man mp(3F)
  • man pe_environ(5)
  • man sigfpe(3C)
The Run-Time Library Structure
[Figure: library directory tree under /usr, one directory per ABI: lib/ (o32, ucode compiler), lib32/ (n32) and lib64/ (n64). Each holds *.a, *.so and nonshared/*.a; lib32/ and lib64/ additionally hold the Cmplrs/mongoose-compiler files and mips3/ and mips4/ subdirectories, each again with *.a, *.so and nonshared/*.a.]
• Environment variables: LD_LIBRARY_PATH, LD_LIBRARYN32_PATH, LD_LIBRARY64_PATH
The Scientific Libraries
• Standard scientific libraries containing:
  • Basic Linear Algebra operations and algorithms:
    • BLAS1, BLAS2, BLAS3 (see man intro_blas1, _blas2, _blas3)
    • LAPACK (see man intro_lapack)
  • Fast Fourier Transforms (FFT):
    • 1D, 2D, 3D, multiple 1D transforms (see man intro_fft)
  • Convolutions (signal processing, e.g. man SIIR2D)
  • Sparse solvers (see man solvers; man PSLDLT)
• To use:
  • -lscs for serial versions (-lscs_i8, -lscs_i8_mp for long integers)
  • -lscs_mp -mp for parallel versions
  • man intro_scsl for detailed description
  • -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions
  • man complib.sgimath for detailed description
Computational Domain
• Range of numbers (from /usr/include/limits.h):

  FLT_DIG          6     /* decimal digits of precision of a float */
  FLT_MAX          3.40282347E+38F
  FLT_MIN          1.17549435E-38F
  DBL_DIG          15    /* decimal digits of precision of a double */
  DBL_MAX          1.7976931348623157E+308
  DBL_MIN          2.2250738585072014E-308
  LONGLONG_MIN     -9223372036854775807LL-1LL
  LONGLONG_MAX     9223372036854775807LL
  ULONGLONG_MAX    18446744073709551615LLU

• The extended precision (REAL*16) is available and supported by the compiler, but this mode of calculation is slow (by a factor of ~40)
Underflow and Denormal Numbers
• When de-normalized numbers emerge in a computation (i.e. numbers with |x| < DBL_MIN) they are flushed to zero by default. The program

      Program denorm
      real*8 a,b
      a = 2.2250738585072014D-308
      b = a/10.0D0
      write(6,10) b
 10   format(D25.17)
      end

  will print zero.
• To force IEEE-754 gradual underflow it is necessary to manipulate the status register on the R1x000 cpu:

      #include <sys/fpu.h>
      void no_flush_() {
        union fpc_csr f;
        f.fc_word = get_fpc_csr();   /* read the fp control/status register */
        f.fc_struct.flush = 0;       /* clear the flush-to-zero bit */
        set_fpc_csr(f.fc_word);      /* write it back */
      }

• Calling no_flush at the beginning of the program will print 0.22250738585072014D-308
• The flush-to-zero property can lead to x-y=0 while x≠y.
• Keeping de-normalized numbers in computations avoids that condition, but causes an fp exception that must be processed in software.
• Avoiding de-normalized numbers in calculations is therefore a performance issue.
Overflow Example
• Program example that generates overflows and underflows:

      Parameter (N=20)
      INCLUDE "/usr/include/limits.h"
      Real*8 A(N),B(N)
      complex*16 C(N)
      do I=1,N
         A(I) = (FLT_MAX/10)*I   ! single precision range
         B(I) = (FLT_MIN*10)/I   ! will fit into double
      enddo
!     Standard requires passing from base precision: real*4 !
      C = CMPLX(A,B)
      write (0,'(I3,2(2G22.15/))') (I,A(I),B(I),C(I),I=1,N)

• Compile with: f77 -n32 -mips4 -O3
• Output with all exceptions ignored by default:

  10  0.340282347000000E+39  0.117549435000000E-37    A,B
      0.340282346638529E+39  0.117549435082229E-37    Cr,Ci
  11  0.374310581700000E+39  0.106863122727273E-37
      Infinity               0.000000000000000        Overflow! / Flush to zero!
  12  etc…

• Note: Compilation with -r8 avoids the error.
• setenv TRAP_FPE "UNDERFL=TRACE; OVERFL=TRACE" will trap at overflow and underflow and produce a traceback (link with -lfpe).
Floating Point Exceptions
• An fp status register flag is set when the fpu encounters an illegal condition:
  • division by zero
  • overflow
  • underflow
  • invalid
  • inexact
• By default, all exceptions are ignored!
  • (e.g. for 1/0 an Inf value is set, for 0/0 a NaN, and execution continues)
• The status register can be programmed to raise a Floating Point Exception (FPE).
• If an FPE occurs, the system can take a specified action:
  • abort
  • ignore the exception
  • repair the illegal condition
• You can manipulate the status register to select the action:
  • with calls to the FPE library, link with -lfpe
  • with the environment variable TRAP_FPE
  • see man handle_sigfpes
Compiler-Generated Exceptions
• Compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4)
  • X = 0 no speculative code motion
  • X = 1 IEEE-754 underflow and inexact FPE disabled (default at -O0 and -O2)
  • X = 2 all IEEE-754 exceptions disabled except 1/0 (default at -O3)
  • X = 3 all IEEE-754 exceptions are disabled
  • X = 4 memory access exceptions are disabled
• IF-conversion with conditional moves for Software Pipelining (with -O3):

      Do i=1,N
        if(a(i) .lt. eps) then
          x = x + 1/eps
        else
          x = x + 1/a(i)
        endif
      enddo

      #put eps in $f1 and (1/eps) in $f0
      ldc1    $f5,-8($2)        load a(i)
      c.lt.d  $fcc0,$f5,$f1     if(a(i) < eps)
      recip.d $f2,$f5           1/a(i)
      movt.d  $f2,$f0,$fcc0     y = 1/a or 1/eps
      add.d   $f1,$f1,$f2       x = x + y

• Removing IF(…) will cause divide by zero! In this case this exception must be ignored.
• Note: this transformation is already applied at -O3 with X=1!
IEEE-754 Compliance
• The MIPS4 instruction set contains IEEE-754 non-compliant instructions:
  • recip.s/d (reciprocal 1/x) instruction is accurate to 1 ulp
  • rsqrt.s/d (reciprocal-sqrt: 1/sqrt(x)) instruction is accurate to 2 ulp
• -OPT:IEEE_arithmetic=X specifies the degree of non-compliance and what to do with inf and NaN operands:
  • X = 1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1,2)
  • X = 2 optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3)
  • X = 3 any mathematically valid transformation is allowed, including recip & rsqrt instructions
• Example: the loop

      Do i=1,n
        x = x + a(i)/y
      enddo

  compiled with -O3 runs at 21 cycles/iteration (4% of peak). With -O3 -OPT:IEEE_arithmetic=3 it is transformed into

      y_tmp = 1/y
      do i=1,n
        x = x + a(i)*y_tmp
      enddo

  and runs at 1 cycle/iteration (100% of peak).
• Note: X=3 is required!
Rounding Accuracy
• The allowed round-off behaviour can be specified with the -OPT:roundoff=X switch:
  • X = 0 no optimizations that affect fp behaviour (default at -O1 -O2)
  • X = 1 allows simple transformations with limited round-off and overflow differences
  • X = 2 allows reordering of reduction loops (default at -O3)
  • X = 3 any mathematically valid transformation is allowed
• Example: the reduction loop

      do i=1,n
        x = x + a(i)
      enddo

  compiled with -O3 -OPT:roundoff=1 runs at 2 cycles/iter (25% of peak). With -O3 -OPT:roundoff=2 (default at -O3) it is reordered into partial sums

      do i=1,n,8
        x0 = x0 + a(i)
        x1 = x1 + a(i+1)
        …
      enddo
      x = x0 + x1 + …

  and runs at 1 cycle/iter (50% of peak).
• Recommendation: your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3
Summary
• The compiler is the primary tool of program optimization
• Compilation is the process of lowering the code representation from high level to low, i.e. processor level
• The MIPSpro compiler targets the MIPS R1x000 processor and has the features of the processor and the Origin architecture built in
• A large number of options exist to steer the compilation process:
  • ABI, ISA and optimization option selections
  • setting of assumptions about the program behaviour
• There are optimized and parallelized libraries of subroutines for scientific computation
• When programming for a digital computer, it is important to remember the limited validity range of floating point calculations