developing high performance hp-ux applications on the Intel® Itanium® processor
Hewlett-Packard
June 2003
Key Intel® Itanium® Processor Family Features • predication • speculation • support for modulo scheduling • rotating registers
predication
• allows instructions to be dynamically turned on or off using a predicate register value
• example:
    cmp.eq p1, p2 = r1, r2 ;;
    (p1) add r1 = r2, r4
    (p2) ld8.sa r7 = [r8], 8
• if p1 is true, the add is performed; otherwise it acts as a nop
• if p2 is true, the ld8 is performed; otherwise it acts as a nop
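The effect of predication can be mimicked in plain C. The sketch below is illustrative only and is not compiler output; the function names are hypothetical. `select_branchy` is the source form with a branch, and `select_predicated` imitates the if-converted form: one compare sets two complementary predicates, both guarded operations are issued, and only the one whose predicate is true takes effect.

```c
#include <assert.h>

/* Source form: a compare followed by a conditional branch. */
int select_branchy(int r1, int r2, int a, int b)
{
    if (r1 == r2)
        return a;
    return b;
}

/* If-converted form, imitating "cmp.eq p1, p2 = r1, r2".
 * p1 and p2 are complementary, so exactly one guarded value survives. */
int select_predicated(int r1, int r2, int a, int b)
{
    int p1 = (r1 == r2);     /* predicate guarding the "then" arm */
    int p2 = !p1;            /* complementary predicate for the "else" arm */
    return p1 * a + p2 * b;  /* the arm with a false predicate contributes 0 */
}

/* Self-check: both forms agree on either outcome of the compare. */
void predication_demo(void)
{
    assert(select_branchy(3, 3, 10, 20) == select_predicated(3, 3, 10, 20));
    assert(select_branchy(3, 4, 10, 20) == select_predicated(3, 4, 10, 20));
}
```

The predicated form has no branch to mispredict, at the cost of issuing both arms; that trade-off is why profile data helps the compiler decide when to apply it.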
speculation
control speculation
  original:
    (p1) br.cond
    ld8 r1 = [r2]
  transformed:
    ld8.s r1 = [r2]
    . . .
    (p1) br.cond
    . . .
    chk.s r1, recovery
data speculation
  original:
    st4 [r3] = r7
    ld8 r1 = [r2]
  transformed:
    ld8.a r1 = [r2]
    . . .
    st4 [r3] = r7
    . . .
    chk.a r1, recovery
modulo scheduling • overlapping execution of different loop iterations • [figure: three loop schedules compared: no software pipelining; traditional modulo scheduling (through unrolling); Itanium-based modulo scheduling (through register rotation and predication)]
hp-ux Itanium-based C++ development tools
Overview of C++ Tools for HP-UX (Open Source, Third Party, HP)
• Design: TogetherSoft ControlCenter, Magic Draw, Rational Rose, Object Domain
• Edit: Eclipse, VIM, Visual SlickEdit, CodeForge, Firebolt, Softbench, NetBeans, XEMACS, Bristol Tributary, WindRiver SNiFF+
• Compile: C/ANSI C, aC++, GCC, G++
• Build: Make, Make, Clearmake
• Debug: TotalView, HP WDB, GDB, DDD
• Manage: CVS, RCS, Clearcase
• Analyze: Parasoft CodeWizard, Parasoft Insure++, HP WDB, Rational Purify, Rational PureCoverage
• Optimize: HP Prospect, Rational Quantify, HP Caliper
• Support: DSPP
performance with reliability and ease-of-use
• testing strategies for bulletproof optimization: large apps, white box tests, random tests
• compiler quality designed in
• hot code: aggressively optimized, with debugger support for debug of optimized code
• cold code: lightly optimized, with full debug support
• caliper: transparent profiling, profile database
hp Caliper 1.0 hp Caliper is a suite of program analysis tools • release 1.0 contains three tools: • caliper/PMU: measure performance using Itanium-based PMU • caliper/PBO: generate feedback file for compiler PBO • caliper/gprof: get gprof-style information using PMU • developers can generate faster code with caliper/PBO and HP Itanium compilers • developers can measure performance on Itanium-based platforms with caliper/PMU and caliper/gprof
hp Caliper 2.1 (for HP-UX 11i 1.6 and later) • Full support for measuring multi-process applications with output files for each process. • Identifying and selecting one, some or all processes for measurement. • Saving performance results to data files for report generation. • Support for attaching and detaching processes. • Limiting PMU measurements to specific code regions. • Improved cgprof accuracy. • Various reporting changes and improvements, including cumulative percentages in text reports and changes to report link-time addresses. • Improved shared memory handling, memory and performance savings.
profile based optimization • a critical performance tool • application branching behavior is measured • this information is fed into the compiler to guide optimization • predication, speculation, code layout, code generation for switch statements, etc. • studies show that profiling nearly always pays off, even under slightly different workloads • new features for Itanium-based PBO • post-link & dynamic instrumentation • options and pragmas
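collecting profile information
• PA-RISC or Itanium-based: compiler +I produces an instrumented application
• Itanium-based: Caliper collects the profile from the application
• either path is run on training data sets to fill the profile database
• compiler -O +P reads the profile database and produces the optimized application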
developer-specified profile information
• #pragma estimated_frequency f
  Example:
    foo() {
      if( cond ) {
        #pragma estimated_frequency 0.8
        …
        for( … ) {
          #pragma estimated_frequency 4.0
          …
        }
      }
    }
• #pragma frequently_called symbol[,symbol]*
• #pragma rarely_called symbol[,symbol]*
levels of optimization • +O1 (default) • low-cost optimizations • instruction reordering • efficient instruction packing • no reordering of user-visible state updates • supports full debugging • performed under -g with no optimization explicitly specified • some limitations on where local variables may be modified from within the debugger
levels of optimization • -O (+O2) performs intraprocedural optimization, plus user-directed inlining (C++ only) • +O3 performs interprocedural optimization within a source file • +O4 performs cross-module optimization within a load module • interprocedural optimization includes inlining and cross-module analysis • +Ofast provides a combination of options which are valid for most applications: -O +Olibcalls +Onolimit +Ofltacc=relaxed +FPD +DSnative +Oshortdata
scope of compilation & optimization • -O, +O1, +O2: optimize functions, bind within compilation units • +O3: optimize & bind within compilation units • +O4: optimize & bind within load modules • the compiler has no visibility across load modules • improve performance by restructuring: ensure that frequent call paths are within the same load module, and ideally within the same source file • nesting of scopes: application > load modules > compilation units > functions
developer-guided optimization: inline assembly • semaphore operations: _Asm_cmpxchg, _Asm_xchg, _Asm_fetchadd • memory management: _Asm_lfetch, _Asm_fc • miscellaneous: _Asm_popcnt, _Asm_mux1, _Asm_mux2 • plus many more • fully integrated at the source level • operand expressions and target lvalues • fully integrated into optimization phase
developer-guided optimization: inline assembly • fences allow developer to constrain code motion • upward or downward • for specific instruction types (can specify more than one): • externally visible memory accesses • floating point operations • alu operations • system operations • call instructions • branch instructions • fences may be • specified as a standalone pseudo-assembly instruction • associated with an inline assembly instruction
developer-guided optimization: if-conversion & loop unrolling
• #pragma if_convert
  Example:
    foo() {
      for( ... ) {
        #pragma if_convert
        if( ... ) { ... }
        else if( ... ) { ... }
        else { ... }
      }
    }
• #pragma unroll_factor
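To make the unrolling half of this slide concrete, here is a hand-written sketch of what unrolling a loop by a factor of 4 amounts to. It is an illustration in portable C with hypothetical function names; it does not show the syntax or exact behavior of the unroll pragma itself.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one add and one loop-control check per element. */
long sum_rolled(const long *v, size_t n)
{
    size_t i;
    long s = 0;
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unrolled by 4: one loop-control check per four elements,
 * followed by a remainder loop for the last n % 4 elements. */
long sum_unrolled4(const long *v, size_t n)
{
    size_t i = 0;
    long s = 0;
    for (; i + 4 <= n; i += 4)
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)
        s += v[i];
    return s;
}

/* Self-check: both versions agree, including when n is not a multiple of 4. */
void unroll_demo(void)
{
    long v[6] = { 1, 2, 3, 4, 5, 6 };
    assert(sum_rolled(v, 6) == sum_unrolled4(v, 6));
    assert(sum_rolled(v, 3) == sum_unrolled4(v, 3));
}
```

A pragma lets the developer request this transformation without rewriting the loop by hand, which keeps the source readable.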
developer-guided optimization • +O[no]store_ordering • preserves program order for stores to memory locations that are possibly visible to another thread • note that this does not imply strong ordering • appropriate when volatile semantics are not required • ensures that state is consistent on signals, context switch, etc
overcoming performance limiters • the scope visible to the compiler determines the limits of optimization • the compiler must generally make conservative assumptions about • aliasing • which pointers may point to the same data • binding • in which load module a data or code reference will be resolved • exception behavior • floating point accuracy and precision requirements
aliasing • make local copies while ( ... ) p->foo += ... • use high levels of optimization to increase compiler visibility
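The "make local copies" advice can be sketched as follows. The struct and function names are hypothetical. In the first function the compiler may have to assume that `p->foo` and the elements of `v` could alias, so it reloads and stores `p->foo` on every iteration; the second keeps the running value in a local variable that can live in a register. The two are equivalent only when `v` does not in fact overlap `p->foo`.

```c
#include <assert.h>
#include <stddef.h>

struct counter { long foo; };

/* Updates p->foo through the pointer on every iteration. */
void accumulate_through_pointer(struct counter *p, const long *v, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        p->foo += v[i];
}

/* Works on a local copy and writes the result back once. */
void accumulate_with_local_copy(struct counter *p, const long *v, size_t n)
{
    size_t i;
    long foo = p->foo;      /* local copy: cannot alias v[] */
    for (i = 0; i < n; i++)
        foo += v[i];
    p->foo = foo;           /* single store after the loop */
}

/* Self-check: both forms give the same result on non-overlapping data. */
void local_copy_demo(void)
{
    long v[3] = { 1, 2, 3 };
    struct counter a = { 10 };
    struct counter b = { 10 };
    accumulate_through_pointer(&a, v, 3);
    accumulate_with_local_copy(&b, v, 3);
    assert(a.foo == b.foo);
}
```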
aliasing • +Otype_safety=[off|limited|ansi|strong] • asserts type safety within a compilation unit • pointers reference only their declared type except: • char * may point to anything (limited, ansi) • int fields of structs & unions may be referenced by an int * (limited, ansi) • unnamed objects are assumed to have unknown type (limited) • #pragma no_side_effects
aliasing (cont)
• +Onoparmsoverlap (Fortran-like semantics)
    copy( char *s1, char *s2 )
    {
      while ( *s2 != 0 )
        *s1++ = *s2++;
    }
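In standard C99, the `restrict` qualifier makes a comparable no-overlap promise for individual pointer parameters rather than for a whole compilation. The sketch below is a hypothetical variant of the slide's copy loop; whether a particular HP compiler release accepts `restrict` is an assumption here. Unlike the slide's loop, it also writes the terminating null byte.

```c
#include <assert.h>

/* restrict tells the compiler that dst and src never overlap, so loads
 * from src may be scheduled ahead of earlier stores to dst. Calling this
 * with overlapping buffers is undefined behavior. */
void copy_noalias(char *restrict dst, const char *restrict src)
{
    while (*src != 0)
        *dst++ = *src++;
    *dst = 0;   /* terminate the destination string */
}

/* Self-check on two clearly separate buffers. */
void restrict_demo(void)
{
    char out[4];
    copy_noalias(out, "abc");
    assert(out[0] == 'a' && out[1] == 'b' && out[2] == 'c' && out[3] == 0);
}
```

The qualifier keeps the promise local to the function that needs it, whereas a command-line option applies to every function in the files compiled with it.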
executables, shared libraries and symbol binding • shared libraries are fundamental to application architecture • but binding across shared libraries incurs additional cost • for any symbols not defined in the current compilation unit, the compiler must assume that they might be defined in a separate load module • data must be referenced through linkage table • function calls are indirect through the linkage table • and data pointer (gp) must be saved and restored around the call
compiler binding classes • default • if defined in the same compilation unit • bind directly • otherwise • global data items indirect through the linkage table • gp saved around calls • direct call assumed; linker inserts stub if needed • extern (-Bextern, +Oextern) • always go through linkage table, even if defined locally (expected to be preempted) • import stub emitted inline • gp saved around calls
compiler binding classes • protected (-Bprotected) • must be defined in the same load module (linker error if not) • global data items referenced directly • gp not saved around calls • direct call assumed • hidden (-Bhidden) • like protected, but not visible to other load modules
compiler binding classes • specifying the target load module • -exec when building an executable (a.out) • specifying binding types • -B[no]extern, -Bprotected, -Bhidden • +dumpextern filename • example • % cc -Wl,+dumpextern extFile *.o • % cc -exec -Bnoextern -Bextern:extFile *.o
floating point accuracy: +Ofltacc • specifies the level of floating point accuracy required: • +Ofltacc=strict (also +Ofltacc): disallows any optimizations that may change result values • +Ofltacc=default: allows contractions • e.g. fused multiply-add ( a = b * c + d → a = fma(b,c,d) ) • +Ofltacc=limited (Itanium-based) allows optimizations which may affect the generation and propagation of NaNs and the sign of zero • e.g. x*0.0 → 0.0 • +Ofltacc=relaxed (also +Onofltacc): also allows optimizations (such as reordering of expressions) that may change rounding error • e.g. a = b * c * d * e → a = (b * c) * (d * e) • for C and C++, this option must be given to enable the sum reduction optimization
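Why reordering needs an explicit opt-in can be shown with a small, hypothetical example: floating point addition is not associative, so the grouping of a sum changes the rounded result. The sketch assumes IEEE double arithmetic and a compilation mode that does not itself reassociate the expressions.

```c
#include <assert.h>

/* Left-to-right grouping: (a + b) + c */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Reassociated grouping: a + (b + c) */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* With a = 1e20, b = -1e20, c = 1.0:
 *   (a + b) + c = 0.0 + 1.0 = 1.0
 *   a + (b + c) = 1e20 + (-1e20) = 0.0, because 1.0 is lost when added to -1e20 */
void reassociation_demo(void)
{
    assert(sum_left(1e20, -1e20, 1.0) == 1.0);
    assert(sum_right(1e20, -1e20, 1.0) == 0.0);
}
```

A sum reduction splits one running total into several partial totals, which is exactly this kind of regrouping; that is why it is tied to the relaxed setting.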
floating point exceptions & flags • by default, the compiler assumes that applications • do not rely on precise floating point exceptions • do not query the value of floating point flags • conservative behavior can be requested with +Ofenvaccess • #pragma FLOAT_TRAPS_ON is equivalent to +Ofenvaccess (the PA definition is a bit more conservative)
other floating point options • +O[no]cxlimitedrange • equivalent to STDC CX_LIMITED_RANGE pragma • default is +Onocxlimitedrange • -fpeval=[float|double|extended] • -fpwidetypes • +O[no]libmerrno
data and pointer sizes • hp-ux supports both 32-bit and 64-bit data models • on both PA-RISC and Itanium-based • +DD32, +DD64 • in general, the 32-bit data model is more efficient • see “64-Bit Application Development for PA-RISC & Itanium” under “64-Bit Computing” on http://devresource.hp.com/devresource/Topics/Porting/Port.html • smaller data structures are generally more efficient • reduce size of Booleans and enums within objects • on Itanium-based solutions, for applications with less than 4 MB of global data, the +Oshortdata option improves performance • default with +Ofast on Itanium-based
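The advice to shrink Booleans and enums inside objects can be sketched with two hypothetical layouts of the same record. Exact sizes depend on the ABI and data model, so the code only checks that the compact layout is smaller.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, GREEN, BLUE };

/* Flags stored as int and the enum at its natural width. */
struct shape_wide {
    int visible;
    int dirty;
    enum color color;
    double area;
};

/* Flags and the enum value narrowed to one byte each, with the widest
 * member placed first so the small members share the trailing space. */
struct shape_compact {
    double area;
    unsigned char visible;
    unsigned char dirty;
    unsigned char color;    /* holds an enum color value */
};

/* Bytes saved per object by the compact layout. */
size_t bytes_saved_per_object(void)
{
    return sizeof(struct shape_wide) - sizeof(struct shape_compact);
}

/* Self-check: the narrowed field still round-trips the enum value. */
void layout_demo(void)
{
    struct shape_compact s;
    s.area = 2.5;
    s.visible = 1;
    s.dirty = 0;
    s.color = (unsigned char)BLUE;
    assert((enum color)s.color == BLUE);
    assert(sizeof(struct shape_compact) < sizeof(struct shape_wide));
}
```

For large arrays of such objects, the saving multiplies and more objects fit in each cache line.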
target-specific optimization • by default, the compiler will generate code which will run well on all current platforms • the +DS option specifies a target implementation • Itanium-based: +DSblended (default), +DSitanium, … • +DS Options do not affect compatibility with older systems
constants • in earlier PA-RISC compiler versions, constants and string literals were generally placed in process-private data • this is less efficient than placing them in read-only data, and prevents sharing • accommodates developers who modify string literals • in the latest PA-RISC C and C++ compiler releases, and on Itanium • +Olit=const is default for C • +Olit=all is default for C++ • the difference is limited to the treatment of string literals • for +Olit=const they are only placed in read-only data if they are in a context where const char * would be legal
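Code that still modifies string literals can be adapted with a pattern like the following sketch: read-only uses take a `const char *`, and any modification happens on a caller-owned copy. The function names are hypothetical, and the capitalization step assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>

/* Read-only use: the literal can live in shared read-only data. */
const char *greeting(void)
{
    return "hello";
}

/* Copies src into a caller-owned buffer, then capitalizes the first letter
 * of the copy. Returns 0 on success, -1 if the buffer is too small. */
int copy_and_capitalize(char *dst, size_t dst_size, const char *src)
{
    size_t i = 0;
    if (dst_size == 0)
        return -1;
    while (src[i] != 0) {
        if (i + 1 >= dst_size) {    /* no room for this char plus the terminator */
            dst[0] = 0;
            return -1;
        }
        dst[i] = src[i];
        i++;
    }
    dst[i] = 0;
    if (dst[0] >= 'a' && dst[0] <= 'z')
        dst[0] = (char)(dst[0] - 'a' + 'A');
    return 0;
}

/* Self-check: the literal itself is never written to. */
void literal_demo(void)
{
    char buf[8];
    assert(copy_and_capitalize(buf, sizeof buf, greeting()) == 0);
    assert(buf[0] == 'H' && buf[1] == 'e');
    assert(greeting()[0] == 'h');
}
```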
volatile • the C compiler supports four new type qualifiers to modify the volatile qualifier • __unordered • strong ordering is not required • __side_effect_free • (e.g. not to I/O space): the compiler is free to remove redundant references and/or to speculate loads • __synchronous • the value is not updated by another thread • __non_sequential • sequentiality need not be maintained relative to other memory references • the most important is __unordered
Statement of Source Compatibility Between PA-RISC and Itanium-based • As of HP-UX Release 11i v1.6 (Itanium-based), ISV and customer applications* that are supported on HP-UX 11i v1 on PA-RISC will compile and execute correctly on Itanium with no changes to the source code. * Applications must be well behaved and free of explicit dependencies on the PA-RISC architecture. See the following URL for a definition of a well-behaved application: http://devresource.hp.com/STK/hpux11i/exceptions.html
Ensuring a Smooth Migration to Itanium • The C++ compilers for PA-RISC and Itanium-based platforms share common front-end source code. • HP-UX header files and system APIs are based on shared source code for PA-RISC and Itanium. • HP-UX supports 32-bit applications on Itanium, so applications do not need to be ported to 64-bit. • All HP-UX compilers and libraries must undergo compatibility testing. • Incompatibilities are subject to a thorough review process, and are allowed only when necessary. Once allowed, they are documented as compatibility exceptions. • The Software Transition Kit (STK) can be used to scan an application's source code to look for portability problems. • The HP-UX operating system, its commands and libraries, and dozens of ISV applications, comprising tens of millions of lines of source code, have been compiled with the new Itanium-based compilers. Incompatibilities not already identified as exceptions have been treated as defects and fixed.
Compatibility Exceptions • K&R C is no longer supported. (A legacy C compiler is provided to minimize the impact of this change.) • Convex parallelization pragmas and library functions are no longer supported. (OpenMP is supported on both platforms as a replacement.) • Architecture-specific code, options, and pragmas must be modified for Itanium. These include PA-RISC assembly code, inline assembly operations, options and pragmas used for tuning code for the PA-RISC architecture and runtime, and calls to any system APIs that are supported only on PA-RISC. • The #pragma HP_ALIGN is no longer supported. (#pragma pack, which is common to GNU and Sun compilers, should be used instead.) • Floating-point operations may produce slightly different results (Itanium-based will usually give greater accuracy), and applications may observe differences in the treatment of NaNs, denorms, infinities, signed zeroes, exceptions, and flush-to-zero. • The use of any third-party library is subject to the availability and support of that library on Itanium and HP-UX 11i v2. Native Itanium-based code and compatibility-mode PA-RISC code cannot be mixed within a single program.
summary hp-ux Itanium-based compilers provide • performance • reliability • usability