Montecito and POWER4

Chris Thomas, Chris Chaney
9/7/2005

Outline
• Background
• Design Summary
  • Montecito
  • POWER4
• Comparisons
  • Memory hierarchy
  • Threading

Background
• Additional transistors provide diminishing returns when spent on exploiting ILP
• Commercial workloads offer plenty of TLP
• The power envelope is a major design constraint
• Memory latency is an increasingly important factor in performance

Montecito
• 1.72 billion transistors
• 100 W
• ~27 MB of total cache (L1-L3)
• 1.8 GHz
• Dual in-order cores, each core dual-threaded
• 6-issue

Montecito
• Cache (per core)
  • 16 KB L1 (I & D), write-through L1D
  • 1 MB L2I (parity), 256 KB L2D
  • 12 MB L3 (unified, Pellston)
  • L3 is asynchronous
  • Other arrays are parity/ECC protected
• Off-chip bandwidth 10.66 GB/s
  • almost double that of the previous Itanium 2
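The "~27 MB of total cache" figure can be checked against the per-core sizes on this slide. The following is a minimal sketch of that arithmetic; it assumes "16KB L1 (I & D)" means 16 KB for each of the two L1 caches, and the names `PER_CORE_KB` and `total_cache_mb` are made up for illustration.

```python
# Per-core cache sizes in KB, taken from the slide.
# Assumption: "16KB L1 (I & D)" means 16 KB each for L1I and L1D.
PER_CORE_KB = {
    "L1I": 16,
    "L1D": 16,
    "L2I": 1024,
    "L2D": 256,
    "L3": 12 * 1024,
}
CORES = 2  # Montecito is dual-core


def total_cache_mb(per_core_kb=PER_CORE_KB, cores=CORES):
    """Return the total cache across all cores, in MB."""
    return cores * sum(per_core_kb.values()) / 1024


if __name__ == "__main__":
    print(f"{total_cache_mb():.1f} MB")  # 26.6 MB, i.e. roughly 27 MB
```

Under that assumption the sum is about 26.6 MB, which is consistent with the rounded figure on the summary slide.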

Montecito
• TLP
  • TMT (temporal multithreading) in the core
  • SMT (simultaneous multithreading) in the memory system
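Temporal multithreading means only one thread uses the core pipeline in any given cycle, and the core switches threads over time. The sketch below is a toy model, not Montecito's actual switch policy: it assumes the core switches whenever the running thread takes a long-latency miss, and the op names `"alu"` and `"miss"` and the `miss_latency` parameter are hypothetical.

```python
from collections import deque


def run_tmt(threads, miss_latency=3):
    """Simulate switch-on-miss temporal multithreading on one core.

    threads: list of op lists; each op is "alu" (one cycle) or "miss"
    (one cycle to issue, then the thread stalls for miss_latency cycles).
    Returns a per-cycle trace of the thread id that used the core,
    with None marking an idle cycle.
    """
    pending = [deque(ops) for ops in threads]
    ready_at = [0] * len(threads)  # first cycle each thread may run again
    trace = []
    current = 0
    cycle = 0
    while any(pending):
        # Prefer the current thread, then fall back to any other thread.
        order = [current] + [t for t in range(len(threads)) if t != current]
        runnable = [t for t in order if pending[t] and ready_at[t] <= cycle]
        if not runnable:
            trace.append(None)  # every remaining thread is stalled
            cycle += 1
            continue
        current = runnable[0]
        op = pending[current].popleft()
        trace.append(current)
        if op == "miss":
            # Stall this thread and hand the core to the next one.
            ready_at[current] = cycle + 1 + miss_latency
            current = (current + 1) % len(threads)
        cycle += 1
    return trace


if __name__ == "__main__":
    one = run_tmt([["alu", "miss", "alu"]])
    two = run_tmt([["alu", "miss", "alu"], ["alu", "alu", "alu"]])
    print(one)  # [0, 0, None, None, None, 0]
    print(two)  # [0, 0, 1, 1, 1, 0]
```

With a single thread the core idles for the whole miss; with a second thread the same cycles do useful work, which is the motivation for TMT in an in-order core.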

Montecito
• Power
  • Would be 300 W without power management
  • Foxton
    • Dynamically scales voltage and frequency
  • Removed the clock from L3 accesses (saves 10 W)
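Dynamic voltage and frequency scaling works because dynamic power grows roughly with V² · f, so a small drop in voltage and frequency buys a large drop in power. The sketch below illustrates the idea of picking the fastest operating point that fits a power budget. It is not Foxton's actual control algorithm; the operating points, the activity constant `k`, and both function names are hypothetical.

```python
def dynamic_power(voltage, freq_ghz, k=1.0):
    """Classic CMOS dynamic-power relation: P = k * V^2 * f.

    k lumps together switched capacitance and workload activity.
    """
    return k * voltage ** 2 * freq_ghz


def pick_operating_point(points, k, budget_watts):
    """Return the fastest (voltage, freq_ghz) pair that fits the budget.

    Returns None when no operating point fits.
    """
    fitting = [p for p in points if dynamic_power(p[0], p[1], k) <= budget_watts]
    return max(fitting, key=lambda p: p[1]) if fitting else None


if __name__ == "__main__":
    # Hypothetical (voltage, GHz) operating points, not Montecito's real ones.
    points = [(1.2, 1.8), (1.1, 1.6), (1.0, 1.4)]
    # A low-activity workload (small k) can run at full speed ...
    print(pick_operating_point(points, k=30, budget_watts=100))
    # ... while a high-activity workload must step down to stay in budget.
    print(pick_operating_point(points, k=50, budget_watts=100))
```

The design point this illustrates is that the power limit is enforced per workload: code with low switching activity keeps the top frequency, and only power-hungry code is slowed.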

POWER4
• 174 million transistors
• Up to 128 MB total cache per module
• 1.1-1.3 GHz, deeply pipelined
• 4 dual-core chips per module, each core single-threaded
• 8-issue (peak)
• Support for glueless SMPs of up to 4 chips

POWER4
• Memory Hierarchy
  • Each processor has a dedicated 64 KB L1I and 32 KB L1D
  • Write-through L1s, parity protected
  • Each chip shares a 1.5 MB L2, ECC
    • split into 3 banks, with separate cache controllers
  • L3 off chip, up to 32 MB per chip (eDRAM)
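Splitting the shared L2 into three banks with separate controllers lets accesses from the two cores proceed in parallel when they fall in different banks. The slide does not say how addresses are assigned to banks, so the sketch below uses plain line interleaving with a modulo as a stand-in; the 128-byte line size and the function name `l2_bank` are assumptions for illustration.

```python
LINE_BYTES = 128  # assumed cache-line size
NUM_BANKS = 3     # from the slide: the L2 is split into 3 banks


def l2_bank(address, line_bytes=LINE_BYTES, num_banks=NUM_BANKS):
    """Map a physical byte address to an L2 bank by line interleaving.

    Assumption: consecutive cache lines rotate through the banks.
    The real chip may hash the address differently.
    """
    return (address // line_bytes) % num_banks


if __name__ == "__main__":
    for addr in (0, 127, 128, 256, 384):
        print(hex(addr), "-> bank", l2_bank(addr))
```

With this mapping, a sequential stream of lines touches banks 0, 1, 2, 0, ... in turn, so streaming accesses spread evenly across the three controllers.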

POWER4
• Memory Hierarchy Continued
  • Coherency takes place at the L2, using an enhanced MESI protocol
  • I/O is handled in a separate chip, connected via the GX bus
  • Bus frequencies scale with core frequency
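As a reminder of what the baseline protocol does, here is a minimal sketch of the textbook MESI state transitions for one cache line in one cache. It deliberately models plain MESI, not the enhanced POWER4 variant, whose extra states the slide does not describe. The event names and the `others_have_copy` flag are made up for illustration.

```python
STATES = {"M", "E", "S", "I"}  # Modified, Exclusive, Shared, Invalid


def next_state(state, event, others_have_copy=False):
    """Return the next textbook-MESI state of a line in the local cache.

    event is one of:
      "local_read"   - this cache's processor reads the line
      "local_write"  - this cache's processor writes the line
      "remote_read"  - another cache reads the line
      "remote_write" - another cache writes (or claims ownership of) the line
    others_have_copy only matters for a local read that misses (state I).
    """
    if state not in STATES:
        raise ValueError(f"unknown state: {state!r}")
    if event == "local_read":
        if state == "I":
            return "S" if others_have_copy else "E"
        return state  # M, E and S all satisfy reads locally
    if event == "local_write":
        return "M"  # other copies, if any, are invalidated
    if event == "remote_read":
        # M supplies or writes back its data; M and E both drop to S.
        return "I" if state == "I" else "S"
    if event == "remote_write":
        return "I"
    raise ValueError(f"unknown event: {event!r}")


if __name__ == "__main__":
    s = "I"
    for ev in ("local_read", "local_write", "remote_read", "remote_write"):
        s = next_state(s, ev)
        print(ev, "->", s)  # E, M, S, I in turn
```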

POWER4
• Instruction Grouping
  • Helps to simplify tracking for precise interrupts
  • Groups of up to five instructions
  • Groups are dispatched and completed in program order
  • Many cases force instructions to issue one by one
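Group formation can be sketched as a greedy pass over the instruction stream. The slide only states the five-instruction limit and that many cases force one-by-one issue; the two rules used below (a branch closes its group, and some instructions must sit in a group alone) are simplifying assumptions, and the kind labels `"op"`, `"branch"` and `"solo"` are hypothetical.

```python
MAX_GROUP = 5  # from the slide: groups of up to five instructions


def form_groups(instructions):
    """Greedily pack instructions into groups, preserving program order.

    instructions: list of (name, kind) pairs, kind in {"op", "branch", "solo"}.
    Assumed rules:
      - a group holds at most MAX_GROUP instructions;
      - a branch closes the group it is in;
      - a "solo" instruction gets a group of its own.
    """
    groups, current = [], []
    for name, kind in instructions:
        if kind == "solo":
            if current:  # close whatever was being built
                groups.append(current)
                current = []
            groups.append([name])
            continue
        current.append(name)
        if kind == "branch" or len(current) == MAX_GROUP:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [("a", "op"), ("b", "op"), ("br", "branch"), ("c", "op"),
               ("sync", "solo"), ("d", "op"), ("e", "op"), ("f", "op"),
               ("g", "op"), ("h", "op"), ("i", "op")]
    for group in form_groups(program):
        print(group)
```

Tracking whole groups instead of single instructions shrinks the bookkeeping needed for precise interrupts, at the cost of partly filled groups whenever a branch or a solo instruction cuts a group short.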

Conclusions
• No benchmarks were presented
• High-ILP processors are now also exploiting TLP
• No large instruction windows
