Montecito and POWER4 • Chris Thomas, Chris Chaney • 9/7/2005
Outline • Background • Design Summary • Montecito • POWER4 • Comparisons • Memory hierarchy • Threading
Background • Additional transistors yield diminishing returns when spent on exploiting instruction-level parallelism (ILP) • Commercial workloads offer plenty of thread-level parallelism (TLP) • The power envelope is a major design constraint • Memory latency is an increasingly important factor in performance
Montecito • 1.72 billion transistors • 100 W • ~27 MB of total cache (L1-L3) • 1.8 GHz • Two in-order cores, each dual-threaded • 6-issue
Montecito • Cache (per core) • 16 KB L1I and 16 KB L1D, write-through L1D • 1 MB L2I (parity), 256 KB L2D • 12 MB L3 (unified, Pellston) • L3 is asynchronous • Other arrays are parity/ECC protected • Off-chip bandwidth 10.66 GB/s • almost double that of the previous Itanium 2
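The "~27 MB of total cache" figure can be checked against the per-core sizes on this slide. A minimal sketch, assuming "16KB L1 (I & D)" means 16 KB each for instruction and data, and that all listed sizes are per core and binary (1 MB = 1024 KB); the names `per_core` and `total_cache_bytes` are illustrative only:

```python
KB = 1024
MB = 1024 * KB

# Per-core cache sizes as listed on the slide (assumed binary units).
per_core = {
    "L1I": 16 * KB,
    "L1D": 16 * KB,
    "L2I": 1 * MB,
    "L2D": 256 * KB,
    "L3": 12 * MB,
}

CORES = 2  # Montecito is dual-core


def total_cache_bytes(sizes, cores):
    """Sum every cache level for one core, then multiply by the core count."""
    return cores * sum(sizes.values())


total_mb = total_cache_bytes(per_core, CORES) / MB
print(f"total on-die cache: {total_mb:.4f} MB")
```

Under those assumptions the sum is 26.5625 MB, which rounds to the ~27 MB quoted earlier; the two 12 MB L3 arrays account for most of it.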
Montecito • TLP • Temporal multithreading (TMT) in the core • Simultaneous multithreading (SMT) in the memory system
Montecito • Power • Would be 300 W without power management • Foxton: dynamically scales voltage and frequency • Removed the clock from L3 accesses (saves 10 W)
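To see why scaling voltage and frequency together is so effective, the textbook CMOS relation (dynamic power proportional to V² · f) can be sketched as below. This is not Foxton's actual control algorithm, which the slides do not describe; it ignores leakage, and the function names are hypothetical. It only asks what uniform scale factor on both V and f would take the slide's 300 W down to its 100 W envelope.

```python
def dynamic_power(p_ref, v_scale, f_scale):
    """Textbook model: dynamic power scales with V^2 * f relative to p_ref."""
    return p_ref * v_scale ** 2 * f_scale


def uniform_scale_for_budget(p_ref, p_budget):
    """Scale s applied to both V and f so that p_ref * s^3 equals p_budget."""
    return (p_budget / p_ref) ** (1.0 / 3.0)


# 300 W unmanaged and 100 W envelope are the figures from the slides.
s = uniform_scale_for_budget(300.0, 100.0)
print(f"scale factor: {s:.3f}")
print(f"resulting power: {dynamic_power(300.0, s, s):.1f} W")
```

In this idealized model a reduction of roughly 30% in both voltage and frequency cuts dynamic power to a third, which is why a controller only needs to throttle during the most power-hungry phases.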
POWER4 • 174 million transistors • Up to 128 MB total cache per module • 1.1-1.3 GHz, deeply pipelined • 4 dual-core chips per module, one thread per core • 8-issue (peak) • Support for glueless SMPs up to 4 chips
POWER4 • Memory Hierarchy • Each processor has a dedicated 64 KB L1I and 32 KB L1D • Write-through L1s, parity protected • Each chip shares a 1.5 MB L2, ECC protected • Split into 3 banks, with separate cache controllers • L3 off chip, up to 32 MB per chip (eDRAM)
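The "up to 128 MB total cache per module" figure lines up with the off-chip L3 alone. A minimal sketch of the arithmetic, assuming four chips per module, two cores per chip, and binary units; the variable names are illustrative only:

```python
KB = 1024
MB = 1024 * KB

CHIPS_PER_MODULE = 4
CORES_PER_CHIP = 2

# Sizes as listed on the slides.
l1_per_core = 64 * KB + 32 * KB   # L1I + L1D, private to each core
l2_per_chip = int(1.5 * MB)       # shared by both cores on a chip
l3_per_chip = 32 * MB             # off-chip eDRAM, maximum configuration

l3_module = CHIPS_PER_MODULE * l3_per_chip
all_levels_module = CHIPS_PER_MODULE * (
    CORES_PER_CHIP * l1_per_core + l2_per_chip + l3_per_chip
)

print(f"L3 per module:         {l3_module / MB:.2f} MB")
print(f"L1+L2+L3 per module:   {all_levels_module / MB:.2f} MB")
```

With these numbers the L3 contributes 128 MB and the on-chip L1 and L2 arrays add a further 6.75 MB, so the headline figure is best read as the L3 capacity.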
POWER4 • Memory Hierarchy Continued • Coherency is maintained at the L2, using an enhanced MESI protocol • I/O is handled by a separate chip, connected via the GX bus • Bus frequencies scale with core frequency
POWER4 • Instruction Grouping • Helps to simplify tracking for precise interrupts • Groups of up to five instructions • Groups dispatch and complete in program order • Many cases force instructions to issue one at a time
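A toy model of group formation may help make the bullets above concrete. It is a sketch under stated assumptions, not the POWER4 dispatch logic: the only rule taken from the slide is the five-instruction limit, while "a branch closes the current group" and the `solo` kind (standing in for the cases that force one-instruction groups) are simplifications introduced here for illustration.

```python
GROUP_SIZE = 5  # from the slide: groups of up to five instructions


def form_groups(instrs):
    """Pack (name, kind) pairs into groups in program order.

    kind is one of "alu", "branch", "solo" (hypothetical labels).
    Assumed rules: a branch ends its group; a "solo" instruction
    always occupies a group by itself.
    """
    groups, current = [], []
    for name, kind in instrs:
        if kind == "solo":
            if current:               # flush whatever was being built
                groups.append(current)
                current = []
            groups.append([name])     # the solo instruction gets its own group
            continue
        current.append(name)
        if kind == "branch" or len(current) == GROUP_SIZE:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


program = [
    ("a", "alu"), ("b", "alu"), ("br", "branch"),
    ("c", "alu"), ("s", "solo"),
    ("d", "alu"), ("e", "alu"), ("f", "alu"), ("g", "alu"), ("h", "alu"),
    ("i", "alu"),
]
for group in form_groups(program):
    print(group)
```

Tracking completion per group rather than per instruction means a precise interrupt only has to identify a group boundary, at the cost of partially filled groups whenever a branch or a solo instruction cuts one short.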
Conclusions • No benchmarks were presented • High-ILP processors are now also exploiting TLP • No large instruction windows