240 likes | 532 Views
IA-64 Microarchitecture --- Itanium Processor. Jun Feng Jun Xie Huafeng Lü. Outline. Introduction Pipeline Issue Performance Comparison Summary. Itanium Processor. First implementation of IA-64 Compiler based exploitation of ILP Also has many features of superscalar.
E N D
IA-64 Microarchitecture --- Itanium Processor Jun Feng Jun Xie Huafeng Lü
Outline • Introduction • Pipeline Issue • Performance Comparison • Summary
Itanium Processor • First implementation of IA-64 • Compiler based exploitation of ILP • Also has many features of superscalar
10-stage Pipeline • Front-end • Instruction delivery • Operand delivery • Execution
Front-end • IPG, Fetch, Rotate Prefetches up to 32 bytes per cycle (2 bundles) into a prefetch buffer (up to hold 8 bundles) Branch prediction is done using a multilevel adaptive predictor
Instruction delivery • EXP and REN Distributes up to 6 instructions to the 9 functional units Implements registers renaming for both rotation and register stacking
Operand delivery • WLD and REG Accesses the register file Performs register bypassing Accesses and updates a register scoreboard Checks predicate dependences
Execution • EXE, DET and WRB Executes instructions through ALUs and load/store units Detects exceptions and posts NaTs Retires instructions and performs write-back
Integer PerformanceSPECint benchmark: considerably slower • Itanium is considerably slower than Alpha 21264 and Pentium 4. • Only: 60% of of P4, 68% of Alpha Itanium: HP rx4610, 800MHz, 4MB off-chip L3 cache Alpha 21264: Compaq GS320, 1GHz, on-chip L2 cache Pentium 4: Compaq Precision 330, 2GHz, 256KB on-chip L2 cache
Floating Point Performance SPECfp benchmarks: a different story • Itanium is quicker than Alpha 21264 and Pentium 4. • 108% of of P4, 120% of Alpha Itanium: HP rx4610, 800MHz, 4MB off-chip, L3 cache Alpha 21264: Compaq GS320, 1GHz, on-chip L2 cache Pentium 4: Compaq Precision 330, 2GHz, on-chip L2 cache
Discussion on SPECfp • Floating point app: competitive .higher degrees of ILP .aggressive memory system Art benchmark: 4 times of Pentium 4 Alpha: outperform when tuned In terms of power: worse than P4 56% of floating point performance per watt
Summary By Us • Good floating point performance • Poor integer performance • Overall: not so good as Intel has advertised
Conclusion • Large code size • Only static instruction-level parallelism • Cannot manage cache misses/hits flexibly • Lack of applications