Compiler Ecosystem. Compiler Comparisons Table: Critical Features Supported by x86 Compilers.
Compiler Ecosystem Computation Products Group
Compiler Comparisons Table
Critical Features Supported by x86 Compilers
Intel CPUID Checks
How to determine if they exist in a binary
• The CPUID instruction reports:
  • Types of x86/x86-64 instructions supported (SSE, SSE2, SSE3)
  • Vendor of the processor ("GenuineIntel" or "AuthenticAMD")
• The runtime library environments of Intel's C and FORTRAN compilers check the "Vendor of Processor" and, on a non-Intel processor, run down an alternate code path that either:
  • segmentation faults because Intel doesn't support non-Intel processors
  • executes legacy code optimized for Pentium Pro, PII or PIII
• CPUID checks also exist in Intel's Math Kernel Library
  • applications calling FFTs or Linear Algebra strongly impacted
  • ISVs and customers must utilize ACML (likely a 2x performance boost)
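The difference between checking the vendor and checking the instruction set can be sketched in a few lines of Python. This is an illustration only: it decodes register values rather than executing CPUID, the register constants are the well-known leaf 0 values for each vendor, and the two `dispatch_*` functions are hypothetical stand-ins for a runtime library's code-path selection, not Intel's actual logic.

```python
import struct

# Register values CPUID leaf 0 returns on each vendor's parts (EBX, EDX, ECX).
INTEL_REGS = (0x756E6547, 0x49656E69, 0x6C65746E)
AMD_REGS = (0x68747541, 0x69746E65, 0x444D4163)


def vendor_string(ebx, edx, ecx):
    """Assemble the 12-character vendor string; the register order is EBX, EDX, ECX."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def sse_features(edx, ecx):
    """Decode SSE feature bits from CPUID leaf 1 (EDX bits 25 and 26, ECX bit 0)."""
    return {
        "sse": bool((edx >> 25) & 1),
        "sse2": bool((edx >> 26) & 1),
        "sse3": bool(ecx & 1),
    }


def dispatch_by_vendor(vendor):
    # Hypothetical vendor-gated dispatch: anything not Intel gets the legacy path.
    return "optimized" if vendor == "GenuineIntel" else "legacy"


def dispatch_by_features(features):
    # Hypothetical feature-gated dispatch: any SSE2-capable chip gets the fast path.
    return "optimized" if features["sse2"] else "legacy"
```

With these definitions, a processor that reports "AuthenticAMD" and sets the SSE2 bit lands on the legacy path under `dispatch_by_vendor` but on the optimized path under `dispatch_by_features`.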
Intel CPUID Checks
How to determine if they exist in a binary
• To check whether CPUID checks exist in a binary:
  • Dump all assembly instructions in the binary to a text file:
    objdump -d binary > binary.txt
  • Search binary.txt for lines containing cpuid instructions:
    grep cpuid binary.txt
  • The search prints the instruction address at the beginning of each line containing cpuid
  • The cpuid is located in a function called "IntelProcessorIdentificationFunction:"
  • Determine how many times that function is called in binary.txt:
    grep "IntelProcessorIdentificationFunction" binary.txt
Illustrating to ISVs and customers the practices employed by Intel at the user's inconvenience builds rapport and confidence between them and AMD
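The objdump and grep steps can also be combined so that each cpuid hit is reported together with the function that contains it. The sketch below is a minimal illustration that assumes the usual GNU `objdump -d` text layout (a `<name>:` label line followed by `address:` instruction lines); the `SAMPLE` text and the function name `check_vendor` are made up for the example.

```python
import re

# "0000000000401010 <name>:" marks the start of a function in objdump -d output.
LABEL_RE = re.compile(r"^[0-9a-fA-F]+ <([^>]+)>:\s*$")
# An instruction line starts with its address; capture it when the line holds cpuid.
CPUID_RE = re.compile(r"^\s*([0-9a-fA-F]+):.*\bcpuid\b")


def find_cpuid(disassembly):
    """Map each function name to the addresses of the cpuid instructions inside it."""
    hits = {}
    current = None
    for line in disassembly.splitlines():
        label = LABEL_RE.match(line)
        if label:
            current = label.group(1)
            continue
        insn = CPUID_RE.match(line)
        if insn:
            hits.setdefault(current, []).append(insn.group(1))
    return hits


# Made-up disassembly fragment in objdump's layout, used only to exercise the parser.
SAMPLE = """\
0000000000401000 <main>:
  401000:\t55                   \tpush   %rbp
  401001:\te8 0a 00 00 00       \tcallq  401010 <check_vendor>

0000000000401010 <check_vendor>:
  401010:\t31 c0                \txor    %eax,%eax
  401012:\t0f a2                \tcpuid
  401014:\tc3                   \tretq
"""
```

On a real binary, the text saved in binary.txt by the objdump step would be passed to `find_cpuid` in place of `SAMPLE`.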
Intel Compiler and MKL on Opteron
Threat Assessment of using Intel Compilers
• The compiler is a weapon: its maker can control the code generated and run upon their own chip and their competitor's
  • working with PGI and NAG we can address the performance and functionality issues of a customer by modifying the compiler or ACML
• CPUID checks: the Vendor ID is checked rather than instruction compatibility
• AMD platform issues are not supported unless reproducible on Intel platforms
• CPUID checks are placed into code because Intel doesn't trust users' intellect
  http://support.intel.com/support/performancetools/c/sb/cs-009787.htm
Issues on AMD platforms cannot be addressed and will not be reproducible, since we do not issue the same VENDOR ID in the CPUID instruction
ISVs and customers draw the conclusion that AMD platforms aren't dependable
Intel Compiler and MKL on Opteron
Threat Assessment of using Intel Compilers
• The AMD Core Math Library (ACML) cannot be linked with the Intel 8.1 AMD64 compiler; the only option is Intel's MKL
  • Opteron runs many Intel MKL routines at 25-75% of the rate at which it runs the counterpart ACML routines (ex: CFFT1D, CFFT2D, DGEMM, …)
  • ISVs and customers whose applications are performance bound by FFTs, BLAS or LAPACK are strongly impacted (ex: ANSYS performance increased 43% moving to 64-bit using ACML rather than MKL)
• Necessitates increasing the number of compilers and binaries required to support both AMD and Intel platforms
  • PGI creates both AMD (-tp k8-64) and Intel (-tp p7-64) tuned binaries
  • work done by AMD tuning the PGI compiler is also leveraged in Intel binaries
On LS-DYNA, the PGI 64-bit binary targeted towards Xeon with -tp p7-64 is faster than the Intel 8.1 binary by 4%
Intel Compiler and MKL on Opteron
Threat Assessment of using Intel Compilers
• Intel has stated at the link below that in the 8.1 Intel compilers the switches to target chips without SSE2 or SSE3 will no longer function
  http://support.intel.com/support/performancetools/c/sb/cs-009787.htm
• Opteron lacks SSE3 support until Jackhammer in Q2 '05
• The user will be unable to tell the compiler not to utilize SSE3 instructions
• ISVs and customers will have no solution for using binaries built by Intel compilers upon Opteron
• Occurrences such as this will continue every time Intel introduces a new instruction set for x86 based systems (SSE4?)
Users presently using the Intel compiler upon Opteron based systems, or ISVs supporting customers in a similar manner, will have no method of optimizing code for an AMD based system with the exception of compiling without optimization
Tuning Performance with Compilers
Maintaining Stability while Optimizing
• STEP 0: Build the application using the following procedure:
  • compile all files with the most aggressive optimization flags below:
    -tp k8-64 -fastsse
  • if compilation fails or the application doesn't run properly, turn off vectorization:
    -tp k8-64 -fast -Mscalarsse
  • if problems persist, drop to the lowest optimization levels:
    -tp k8-64 -O0 (or -O1)
• STEP 1: Profile the binary and determine the performance critical routines
• STEP 2: Repeat STEP 0 on the performance critical functions, one at a time, and run the binary after each step to check stability
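STEP 0 amounts to walking a list of flag sets from most to least aggressive and keeping the first one that both builds and runs correctly. A minimal sketch follows, assuming a caller-supplied `build_and_check` function (hypothetical; it would invoke the compiler and the application's own validation run):

```python
# Flag sets from the slide, most aggressive first.
PGI_FLAG_SETS = [
    ["-tp", "k8-64", "-fastsse"],
    ["-tp", "k8-64", "-fast", "-Mscalarsse"],
    ["-tp", "k8-64", "-O0"],
]


def first_stable_flags(flag_sets, build_and_check):
    """Return the first flag set for which build_and_check(flags) is true, else None.

    build_and_check is supplied by the caller: it should compile with the given
    flags, run the application, and report whether the results are acceptable.
    """
    for flags in flag_sets:
        if build_and_check(flags):
            return flags
    return None
```

For STEP 2, the same function can be applied per source file, with `build_and_check` rebuilding only that file and rerunning the binary.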
Tuning Memory IO Bandwidth
Optimizing large streaming operations
• 2 methods of writing to memory in x86/x86-64:
• traditional memory stores cause write allocates to cache
    mov %rax,[%rdi]
    movsd %xmm0,[%rdi]
    movapd %xmm0,[%rdi]
  • the cache line to be modified is read into cache
  • the cached copy is modified, then written to memory when newly loaded data evicts it
  • to write N bytes, 2N bytes of bandwidth are generated
• non-temporal stores bypass cache and write directly to memory
  • no write allocate to cache: to write N bytes, N bytes of bandwidth are generated
  • data is not backed up into cache, so do not use with often reused data
• Use only on functions which write at least L2/2 bytes of data (half the L2 cache size), which normally assures little cache reuse value
Group all eligible routines into a common file so as to simplify the compilation procedure. Enable non-temporal stores in the PGI compiler with the -Mnontemporal compiler option
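The bandwidth arithmetic and the L2/2 rule of thumb can be written down directly. The sketch below only restates the slide's numbers; the 1 MiB L2 size is an assumed example value, not something the slide specifies.

```python
def store_traffic_bytes(n_bytes, non_temporal):
    """Memory traffic generated by writing n_bytes.

    A write-allocating store first reads the data into cache and later writes
    it back (2N); a non-temporal store only writes it (N).
    """
    return n_bytes if non_temporal else 2 * n_bytes


def is_nontemporal_candidate(bytes_written, l2_bytes):
    """Apply the slide's rule: only routines writing at least L2/2 bytes qualify."""
    return bytes_written >= l2_bytes // 2


L2_BYTES = 1024 * 1024  # assumed L2 size for illustration (1 MiB)
```

Under that assumption, a routine that streams out 64 MiB qualifies and its write traffic drops from 128 MiB to 64 MiB, while a routine that writes 16 KiB does not qualify and should keep ordinary stores.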
PGI Compiler Flags
Optimization Flags
Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases:
• Most aggressive: -tp k8-64 -fastsse -Mipa=fast
  • enables instruction level tuning for Opteron, -O2 level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling
  • strongly recommended for any single precision source code
• Middle of the road: -tp k8-64 -fast -Mscalarsse
  • enables everything in the most aggressive set except vector code generation, which can reorder loops and generate slightly different results
  • a good substitute in double precision source bases, since Opteron has the same throughput on both scalar and vector code
• Least aggressive: -tp k8-64 -O0 (or -O1)
PGI Compiler Flags
Functionality Flags
• -mcmodel=medium
  • use if your application statically allocates a net sum of data structures greater than 2GB
• -Mlarge_arrays
  • use if any array in your application is greater than 2GB
• -KPIC
  • use when linking to shared object (dynamically linked) libraries
• -mp
  • process OpenMP/SGI directives/pragmas (build multi-threaded code)
• -Mconcur
  • attempt auto-parallelization of your code on SMP system with OpenMP
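The two size-related flags follow different rules: one depends on the total of statically allocated data, the other on the largest single array. A small sketch of that decision, assuming "2GB" means 2 GiB and that array sizes are already known in bytes (the function name and its parameters are hypothetical):

```python
TWO_GB = 2 * 1024**3  # assumption: the slide's "2GB" is read as 2 GiB


def pgi_size_flags(static_sizes, dynamic_sizes=()):
    """Pick PGI size flags from array sizes given in bytes.

    static_sizes: sizes of statically allocated data structures.
    dynamic_sizes: sizes of any other arrays in the application.
    """
    flags = []
    # Net sum of static data above 2GB calls for the medium memory model.
    if sum(static_sizes) > TWO_GB:
        flags.append("-mcmodel=medium")
    # Any single array above 2GB, static or not, calls for -Mlarge_arrays.
    if any(size > TWO_GB for size in list(static_sizes) + list(dynamic_sizes)):
        flags.append("-Mlarge_arrays")
    return flags
```

For example, two static arrays of 1.5 GiB and 1 GiB need only -mcmodel=medium, while a single 3 GiB dynamically allocated array needs only -Mlarge_arrays.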
Absoft Compiler Flags
Optimization Flags
Below are 3 different sets of recommended Absoft compiler flags for flag mining application source bases:
• Most aggressive: -O3
  • loop transformations, instruction preference tuning, cache tiling, & SIMD code generation (CG). Generally provides the best performance but may cause compilation failure or slow performance in some cases
  • strongly recommended for any single precision source code
• Middle of the road: -O2
  • enables most of the options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, & pipelining and unrolling
  • a good substitute in double precision source bases, since Opteron has the same throughput on both scalar and vector code
• Least aggressive: -O1
Absoft Compiler Flags
Functionality Flags
• -mcmodel=medium
  • use if your application statically allocates a net sum of data structures greater than 2GB
• -g77
  • enables full compatibility with g77 produced objects and libraries (must use this option to link to GNU ACML libraries)
• -fpic
  • use when linking to shared object (dynamically linked) libraries
• -safefp
  • performs certain floating point operations in a slower manner that avoids overflow, underflow and assures proper handling of NaNs
Pathscale Compiler Flags
Optimization Flags
• Most aggressive: -Ofast
  • Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
• Aggressive: -O3
  • optimizations for highest quality code enabled at cost of compile time
  • some of the generally beneficial optimizations included may hurt performance
• Reasonable: -O2
  • Extensive conservative optimizations
  • Optimizations almost always beneficial
  • Faster compile time
  • Avoids changes which affect floating point accuracy
Pathscale Compiler Flags
Functionality Flags
• -mcmodel=medium
  • use if static data structures are greater than 2GB
• -ffortran-bounds-check
  • (Fortran) check array bounds
• -shared
  • generate position independent code for calling shared object libraries
• Feedback Directed Optimization
  • STEP 0: Compile binary with -fb_create fbdata
  • STEP 1: Run the code to collect data
  • STEP 2: Recompile binary with -fb_opt fbdata
• -march=(opteron|athlon64|athlon64fx)
  • Optimize code for selected platform (Opteron is default)
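The three feedback-directed optimization steps map onto three commands. The sketch below only assembles the command lines and does not run them. It makes several assumptions: the compiler driver is invoked as `pathcc`, the instrumentation flag takes the profile name as a separate argument in the same way the -fb_opt step does, and the training run is simply the binary executed with no arguments. The source file and output names are placeholders.

```python
def fdo_commands(compiler, sources, output, profile="fbdata", opt="-O2"):
    """Return the command lines for the three FDO steps as argument lists."""
    # STEP 0: build an instrumented binary that records profile data.
    instrumented = [compiler, opt, "-fb_create", profile, "-o", output] + list(sources)
    # STEP 1: run the instrumented binary on representative input to collect data.
    training_run = ["./" + output]
    # STEP 2: rebuild, letting the compiler use the collected profile.
    optimized = [compiler, opt, "-fb_opt", profile, "-o", output] + list(sources)
    return [instrumented, training_run, optimized]
```

Each returned list can be handed to `subprocess.run` in order; the training input in STEP 1 should resemble production workloads, since the optimizer tunes for whatever the profile records.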
Microsoft Compiler Flags
Optimization Flags
• Recommended Flags: /O2 /Ob2 /GL /fp:fast
  • /O2 turns on several general optimizations & /Ob2 enables inline expansion
  • /GL enables inter-procedural optimizations
  • /fp:fast allows the compiler to use a fast floating point model
• Feedback Directed Optimization
  • STEP 0: Compile binary with /LTCG:PGI
  • STEP 1: Run the code to collect data
  • STEP 2: Recompile binary with /LTCG:PGO
• Turn off Buffer Overrun Checking
  • The compiler enables /GS by default to check for buffer overruns. Turning off checking by specifying /GS- may result in additional performance
Microsoft Compiler Flags
Functionality Flags
• /GR
  • enables run-time type information
• /GT
  • supports fiber safety for data allocated using static thread-local storage
• /Wp64
  • detects most 64-bit portability problems
• /LD
  • creates a dynamic-link library
• /Oa
  • assumes no aliasing
• /Ow
  • assumes aliasing across function calls but not inside functions
64-Bit Operating Systems
Recommendations and Status
• SUSE SLES 9 with latest Service Pack available
  • Has technology for supporting latest AMD processor features
  • Widest breadth of NUMA support, enabled by default
  • Oprofile system profiler installable as an RPM and modularized
  • complete support for static & dynamically linked 32-bit binaries
• Red Hat Enterprise Server 3.0 Service Pack 2 or later
  • NUMA feature support not as complete as that of SUSE SLES 9
  • Oprofile installable as an RPM, but installation is not modularized and may require a kernel rebuild if the RPM version isn't satisfactory
  • only SP 2 or later has complete 32-bit shared object library support (a requirement to run all 32-bit binaries in 64-bit)
  • POSIX threading library changed between 2.1 and 3.0, which may require users to rebuild applications