Is Out-Of-Order Out Of Date?
IA-64’s parallel architecture will improve processor performance
William S. Worley Jr., HP Labs
Jerry Huck, IA-64 Architecture Lab
Slides by Selvin, Pascal, Pavel
The prelude to the IA-64
• The need for greater processing power is increasing
  • New innovative computing technologies
  • Traditional computing has increasing problem sizes
• Architecture design from the ground up to support ILP
  • Enables the compiler to express more parallelism (EPIC)
  • Reduces hardware cost of scheduling parallel instructions
• Current approaches
  • Legacy architectures were not designed primarily for high ILP
  • Non-architectural, principally OOO dynamic superscalar hardware
• IA-64
  • Growing market for high-performance 64-bit architecture
  • No existing Intel 64-bit binaries
Not just for ILP
• Better building block for high performance systems
• Multi-programming gives limited improvements
• Parallelism has to be improved at all levels in the system
• Solely hardware-based multithreading cannot compensate for lack of parallelism in the basic processing element
• SMT, CMP apply equally to RISC, CISC and EPIC
• Integrated hardware multithreading is orthogonal to EPIC
• Inter-thread interference in SMT processors
• Hardware resource utilization vs. complexity
• Transistors: PA-8000 re-order buffer = PA-7200
• Complexity scales quadratically for 1.5x or 2x increase in issue-width
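The quadratic-scaling claim can be made concrete with a little arithmetic. This is a minimal sketch, assuming complexity grows exactly with the square of the issue width (the slide says only "quadratically"); the function name `relative_complexity` is hypothetical.

```python
def relative_complexity(width_scale: float) -> float:
    """Relative scheduling-hardware complexity under an ASSUMED pure
    quadratic model: complexity is proportional to (issue width) ** 2."""
    return width_scale ** 2


# The two issue-width increases named on the slide.
for scale in (1.5, 2.0):
    print(f"{scale}x issue width -> {relative_complexity(scale):.2f}x complexity")
```

Under that assumption, a 1.5x wider machine costs 2.25x the complexity and a 2x wider machine costs 4x.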
Architecture vs. Implementation
• Speed of functional units is architecture independent
• Memory and data-cache hierarchy
  • Largely independent of the architecture
  • OOO RISC designs achieve better utilization
  • With additional cost, it is possible to realise better designs
  • IA-64 memory system balances cost and performance
• Cycle time of IA-64
  • IC process, number of registers, register ports, bypass network, number of cache ports
  • Critical path is found in functional units and bypass networks
  • IA-64 has higher utilization of this fundamental structure
IA-64 Parallelism Capabilities
• Predication:
  • fewer branches encountered
  • fewer mispredicted branches
  • more parallelism
• Larger register set:
  • new coding strategies (impossible with RISC)
  • more efficient than register renaming (RISC)
  • less data loss in the event of an interruption
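Predication can be illustrated with a small if-conversion example. The sketch below is only a software analogy, not IA-64 code: it assumes a compare that produces two complementary predicates, computes both arms, and keeps only the result whose predicate is true. All function and variable names are hypothetical.

```python
def abs_diff_branching(a: int, b: int) -> int:
    # Conventional form: a conditional branch that hardware must predict.
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a: int, b: int) -> int:
    # If-converted form: one compare yields two complementary predicates.
    p_then = a > b
    p_else = not p_then
    # Both arms are computed; neither depends on the other, so they
    # could issue in parallel.
    arm_then = a - b  # guarded by p_then
    arm_else = b - a  # guarded by p_else
    # Only the arm whose predicate is true contributes to the result
    # (True counts as 1 and False as 0 in Python arithmetic).
    return p_then * arm_then + p_else * arm_else


print(abs_diff_predicated(7, 3), abs_diff_predicated(3, 7))
```

The predicated version contains no branch to mispredict, at the price of executing both arms.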
IA-64 Parallelism Capabilities (2)
• Features to deal with memory latency:
  • earlier access to variables
  • not restricted to fixed hardware algorithms for:
    • correctly predicting execution path
    • triggering memory fetches
  • heuristics to identify speculative load candidates
  • compiler involved
  • control of the degree of speculation by the programmer
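The idea of compiler-controlled speculation can be sketched in software. The model below is a loose analogy to a speculative load followed by a later check: the load is hoisted above the test that guards it, a fault is deferred as a token instead of being raised, and recovery code runs only if the value is really needed. The names `NAT`, `speculative_load`, `check` and `guarded_read` are hypothetical, and the dictionary standing in for memory is an assumption of this sketch.

```python
NAT = object()  # token standing in for a deferred fault ("not a thing")


def speculative_load(memory: dict, address):
    """Load early; a bad address yields NAT instead of raising."""
    return memory.get(address, NAT)


def check(value, recovery):
    """At the original load position: redo the load only if speculation failed."""
    return recovery() if value is NAT else value


def guarded_read(memory: dict, ptr):
    # The load is hoisted above the None test, starting the access early.
    value = speculative_load(memory, ptr)
    if ptr is None:
        return 0  # the speculative result is simply discarded
    # Recovery performs the ordinary load, so a real fault still surfaces.
    return check(value, lambda: memory[ptr])


memory = {0x10: 42}
print(guarded_read(memory, 0x10), guarded_read(memory, None))
```

A fault on the hoisted load is harmless when the guarded path is not taken, which is what lets the compiler move the load earlier.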
IA-64 Parallelism Capabilities (3)
• Register Stack Engine (RSE):
  • increases the utilization of the register file
  • reduces the cost of procedure calls and returns
  • especially valuable for object-oriented code
  • straightforward hardware design
• Mechanisms to deliver instructions to the processor
  • eliminate effects of increased code size
  • modest design costs
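Why a register stack lowers call and return cost can be seen in a toy model. The sketch below is a simplified assumption, not the actual RSE design: it tracks whole frames, spills the oldest resident frame to a backing store only when the physical registers overflow, and fills a frame back only when a return reaches one that was spilled. The class and attribute names are hypothetical.

```python
class RegisterStack:
    """Toy frame-granularity model of a register stack (an assumption of
    this sketch; real hardware works on individual registers)."""

    def __init__(self, physical_regs: int):
        self.physical_regs = physical_regs
        self.frames = []          # frame sizes, oldest first
        self.first_resident = 0   # index of the oldest frame still in registers
        self.spilled = 0          # registers written to the backing store
        self.filled = 0           # registers read back from the backing store

    def _resident(self) -> int:
        return sum(self.frames[self.first_resident:])

    def call(self, frame_size: int) -> None:
        assert frame_size <= self.physical_regs
        self.frames.append(frame_size)
        # Spill oldest frames only when the new frame does not fit.
        while self._resident() > self.physical_regs:
            self.spilled += self.frames[self.first_resident]
            self.first_resident += 1

    def ret(self) -> None:
        self.frames.pop()
        # Fill the caller's frame only if it had been spilled.
        if self.frames and self.first_resident >= len(self.frames):
            self.first_resident = len(self.frames) - 1
            self.filled += self.frames[-1]


shallow = RegisterStack(physical_regs=96)
for _ in range(4):
    shallow.call(16)
for _ in range(4):
    shallow.ret()
print(shallow.spilled, shallow.filled)
```

In this model a shallow call chain that fits in the register file causes no memory traffic at all, whereas a conventional save/restore convention would touch memory on every call.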
Results
• Comparison between PA-RISC and IA-64
• 15 codes (encryption, decryption and keying for five AES algorithms)
• 8/15 IA-64 codes used more than 32 registers
• 6/15 IA-64 codes smaller
• 2/15 IA-64 codes 4 times smaller
• overall code size 27% larger (could have been reduced to 10%)
Compilers and IA-64
• IA-64 uses existing compiler techniques to exploit parallelism:
  • data prefetch
  • branch hints
  • loop unrolling
  • profile-based path instructions
  • other
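Loop unrolling, one of the techniques listed above, can be shown on a dot product. This is an illustrative sketch of the source-level transformation only; the unroll factor of 4 and the function names are assumptions, and a real compiler would apply it to machine code rather than Python.

```python
def dot(xs, ys):
    # Rolled loop: one multiply-add and one loop test per element.
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def dot_unrolled4(xs, ys):
    # Unrolled by 4 with independent accumulators, so the four
    # multiply-adds in the body do not depend on one another.
    n = len(xs)
    limit = n - n % 4
    acc0 = acc1 = acc2 = acc3 = 0.0
    i = 0
    while i < limit:
        acc0 += xs[i] * ys[i]
        acc1 += xs[i + 1] * ys[i + 1]
        acc2 += xs[i + 2] * ys[i + 2]
        acc3 += xs[i + 3] * ys[i + 3]
        i += 4
    total = acc0 + acc1 + acc2 + acc3
    # Remainder loop for the last n % 4 elements.
    for j in range(limit, n):
        total += xs[j] * ys[j]
    return total


values = [float(k) for k in range(1, 11)]
print(dot(values, values), dot_unrolled4(values, values))
```

The unrolled body exposes four independent operations per iteration, which is the kind of parallelism a wide-issue machine can exploit.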
Need for compiler support
• IA-64 does require well-prepared code (profiled, with branch hints, etc.) to achieve high performance, but this is also true for Out-of-Order processors.
• Lack of code profiling is equally harmful for both IA-64 and OOO architectures.
• With profiled code, IA-64 is superior to OOO, as proven by benchmark tests (specFP64)
Critical path instructions (e.g. long latency operations)
• OOO compilers don’t distinguish them, so such instructions often have a high execution cost
• IA-64 compilers must detect such instructions and make sure they start first (*)
* Cost of mispredicts is minimized by prefetches issued by the compiler
Dealing with cache misses
• Compiler contribution:
  • static code generation (i.e. fewer branches)
  • branch hints
• Hardware mechanisms:
  • sample instructions on timer ticks, get information about actual program flow (HP)
  • feed information on cache misses back to the program (Intel Itanium)
Dynamic prediction mechanisms (2)
• IA-64 has hint fields in most branch and memory instructions to allow the program to collect flow information and pass it to the processor.
• These features allow software to improve performance at run time, without recompilation.
Current and Future IA-64 implementations
• Initial implementation (as always) focuses on the most important architectural elements only.
• It uses the ideas of EPIC while providing compatibility with IA-32 and PA-RISC processors.
• Future implementations will deliver even more ILP
• Creators assure that the IA-64 architecture will not remain fixed