IBM Power 7

Harshit Dokania (hdokani2), Dhashrath Raguraman (raguram2), Prasanna Giridhar (giridha2)

System Overview
• Performance
  • 8 cores
  • 4 SMT threads per core
  • 45 nm technology, 567 sq mm die
  • 1.2B transistors
  • Integrated L1, L2 and L3 (local + global) cache
• Memory
  • On-chip eDRAM
  • 2 DDR3 memory controllers
• Microarchitecture
  • Fetch, Decode, Dispatch, Rename, Execute and Commit
  • 12 execution units per core
    • 2 integer and 4 binary floating-point units
    • 1 vector and 1 decimal floating-point unit
    • 2 load/store, 1 branch, 1 condition register

Instruction Fetch
• 8 instructions per cycle
• I-Cache
  • 32 KB, 4-way, 128 B cache line
• Tournament branch predictor
[Figure: instruction fetch datapath. Labels: address translation, tag, I-cache (32 KB, 4-way, 16 banks, 128 B line), 21-bit GHV, address hash, cache hit/miss, 8 instructions]
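The tournament idea named on this slide can be sketched as below. This is a minimal illustration, not POWER7's actual predictor: the table size, the XOR-fold hash, the 2-bit saturating counters, and the class name `TournamentPredictor` are all assumptions. Only the 21-bit global history vector (GHV) comes from the slide.

```python
GHV_BITS = 21      # from the slide: 21-bit global history vector
TABLE_BITS = 12    # assumption: 4096-entry tables, chosen only for illustration
MASK = (1 << TABLE_BITS) - 1


def bump(counter, up):
    """Update a 2-bit saturating counter (range 0..3)."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    """Hypothetical two-component predictor with a per-branch chooser."""

    def __init__(self):
        size = 1 << TABLE_BITS
        self.local = [1] * size    # indexed by branch address only
        self.glob = [1] * size     # indexed by address hashed with the GHV
        self.chooser = [1] * size  # >= 2 means "trust the global table"
        self.ghv = 0

    def _indices(self, addr):
        local_idx = (addr >> 2) & MASK
        # Assumed hash: fold the GHV down to TABLE_BITS and XOR with the address.
        folded = (self.ghv ^ (self.ghv >> TABLE_BITS)) & MASK
        return local_idx, local_idx ^ folded

    def predict(self, addr):
        li, gi = self._indices(addr)
        counter = self.glob[gi] if self.chooser[li] >= 2 else self.local[li]
        return counter >= 2

    def update(self, addr, taken):
        li, gi = self._indices(addr)
        local_ok = (self.local[li] >= 2) == taken
        global_ok = (self.glob[gi] >= 2) == taken
        # Train the chooser only when the two components disagree.
        if local_ok != global_ok:
            self.chooser[li] = bump(self.chooser[li], global_ok)
        self.local[li] = bump(self.local[li], taken)
        self.glob[gi] = bump(self.glob[gi], taken)
        self.ghv = ((self.ghv << 1) | int(taken)) & ((1 << GHV_BITS) - 1)
```

Calling `update(addr, taken)` after each resolved branch trains both tables; `predict(addr)` then returns whichever component the chooser currently favors for that branch.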

Dispatch and Rename
• Dispatch a group of up to 6 instructions
• Physical register renaming
• Global Completion Table

Issue
• Different issue queues
  • Unified issue queue
  • Branch issue queue
• Unified issue queue
  • Partitioned across threads
  • Complexity vs. performance
• Issue up to 8 instructions per cycle
[Figure: issue queues and execution units. Labels: BRQ, UQ0, UQ1, branch execution unit, FX0, LS0, VS0, FX1, LS1, VS1]

Pipeline
[Figure: pipeline stages IF, DE, EX, MEM, Commit]

Simultaneous Multithreading and Mode Selection
• Idea
  • A single thread cannot keep all of a core's execution resources busy
• Modes
  • ST: one instruction execution thread per core
  • SMT2: two instruction execution threads per core
  • SMT4: four instruction execution threads per core
• An 8-core POWER7 can execute 32 threads simultaneously
• Dynamic SMT mode selection based on comparing throughput in single-thread mode against SMT mode
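The thread count and the throughput-based mode choice can be illustrated with a short sketch. The function names and the sample throughput figures are hypothetical; the slide only states that modes are compared by throughput, not how the measurement is taken or how ties are resolved.

```python
THREADS_PER_MODE = {"ST": 1, "SMT2": 2, "SMT4": 4}


def max_hardware_threads(cores, mode):
    """Hardware threads available when every core runs in the given mode."""
    return cores * THREADS_PER_MODE[mode]


def pick_mode(measured_throughput):
    """Pick the mode with the highest measured throughput.

    Assumption: modes without a measurement count as 0.0, and ties go to
    the mode with fewer threads.
    """
    return max(
        THREADS_PER_MODE,
        key=lambda m: (measured_throughput.get(m, 0.0), -THREADS_PER_MODE[m]),
    )


# 8 cores x 4 threads per core = 32 threads, as stated on the slide.
print(max_hardware_threads(8, "SMT4"))

# Hypothetical relative throughput numbers, for illustration only.
print(pick_mode({"ST": 1.0, "SMT2": 1.4, "SMT4": 1.3}))
```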

Comparison
[Figure: execution-unit occupancy (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) for four designs: S80 hardware multithreading, single-thread out-of-order, POWER5 2-way SMT, and POWER7 4-way SMT. Legend: Thread 0 executing, Thread 1 executing, no thread executing]

Partitioning in SMT
• Non-partitioned resources
  • Complexity and larger area
  • Power consumption
• In SMT4
  • Thread 0 and Thread 1 use {UQ0, GPR0, FXU0, LSU0}
  • Thread 2 and Thread 3 use {UQ1, GPR1, FXU1, LSU1}
• In ST and SMT2
  • GPR0 and GPR1 store identical contents
  • Functional units (FX, LS) are shared
[Figure: register files GPR0 and GPR1 with thread labels T0, T2 and T1, T3]
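The partitioning rule above can be written as a small lookup. This is a sketch that follows the bullet text (threads 0 and 1 on one half, threads 2 and 3 on the other); the function name `resources_for` is hypothetical, and the slide does not say how the UQ halves are used in ST and SMT2, so they are left out for those modes.

```python
THREAD_LIMIT = {"ST": 1, "SMT2": 2, "SMT4": 4}


def resources_for(thread, mode):
    """Return the set of core resources a hardware thread may use."""
    if mode not in THREAD_LIMIT:
        raise ValueError(f"unknown mode: {mode}")
    if not 0 <= thread < THREAD_LIMIT[mode]:
        raise ValueError(f"thread {thread} is not valid in {mode}")
    if mode == "SMT4":
        # Threads 0 and 1 use half 0; threads 2 and 3 use half 1.
        half = 0 if thread in (0, 1) else 1
        return {f"UQ{half}", f"GPR{half}", f"FXU{half}", f"LSU{half}"}
    # ST and SMT2: both GPR copies hold identical contents, so the
    # fixed-point and load/store units on either half can be shared.
    return {"GPR0", "GPR1", "FXU0", "FXU1", "LSU0", "LSU1"}


print(sorted(resources_for(3, "SMT4")))
print(sorted(resources_for(0, "ST")))
```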

Cache Hierarchy
• L1 and L2 latency reduced
• On-chip L3 offers low latency and high bandwidth

L1 Cache
• 32 KB, 8-way associative, 128 B cache lines
• Increased bandwidth with concurrent reads and writes
  • Divided into banks
  • Each bank has 2 read ports and 1 write port
• Write-through policy: all stores go directly to L2

L2 Cache
• Receives all stores from L1 at low latency
• 256 KB, 8-way associative
• The local L3 (partial victim cache) management policy determines whether the L3 is exclusive or inclusive
  • Exclusive provides more capacity and associativity
  • Inclusive helps reduce energy

L3 Cache
• 32 MB divided into 8 regions of 4 MB each
  • 8-way associative, 128 B cache lines
  • Each region is coupled with an L2 cache
• L3 is built from eDRAM
  • eDRAM has higher latency and cycle time than SRAM
  • But it occupies less area and consumes less standby energy
• On chip for lower latency and higher bandwidth
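The cache parameters on these slides determine the number of sets in each cache: capacity divided by (ways × line size). A quick check, with one assumption flagged: the slides give no line size for L2, so 128 B is assumed here to match L1 and L3.

```python
KB = 1024
MB = 1024 * KB


def num_sets(capacity_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a multiple of ways * line size")
    return sets


caches = {
    "L1 I-cache": (32 * KB, 4, 128),
    "L1 D-cache": (32 * KB, 8, 128),
    "L2": (256 * KB, 8, 128),        # line size is an assumption
    "L3 region": (4 * MB, 8, 128),
}

for name, (capacity, ways, line) in caches.items():
    print(f"{name}: {num_sets(capacity, ways, line)} sets")
```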

L3 Cache Usage
• L2 uses its local L3 as a victim cache
• Other L3 regions can also use it as a victim cache
• Higher weight is given to the associated L2 than to other L3 regions

L2 Cast Out
• Check whether a copy of the line already exists
• If not, allocate a line

Lateral Cast Out
• Check whether a copy of the line already exists
• If not, decide whether to allocate a line based on capacity

L3 Replacement Policy
• All L3 lines are divided into 2 classes
  • Class 1: victim lines cast out by the associated L2
  • Class 2: victim lines from lateral cast outs by other L3 regions
• Lines are evicted in the following order
  • LRU line from class 2
  • If no such line exists, LRU line from class 1
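The two-class eviction order can be sketched for a single cache set. This is an illustrative model only: the class `L3Set`, the constants `LOCAL` and `LATERAL`, and the choice to keep a line's class unchanged on a repeated cast-in are assumptions, and the capacity-based allocation decision for lateral cast outs is not modeled.

```python
from collections import OrderedDict

LOCAL = 1    # class 1: cast out by the associated L2
LATERAL = 2  # class 2: lateral cast out from another L3 region


class L3Set:
    """One set of an L3 region, tracking recency and line class."""

    def __init__(self, ways=8):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> class, least recently used first

    def _victim(self):
        # Prefer the LRU line of class 2; fall back to the LRU line of class 1.
        for wanted in (LATERAL, LOCAL):
            for tag, line_class in self.lines.items():
                if line_class == wanted:
                    return tag
        return None

    def cast_in(self, tag, line_class):
        """Accept a cast-out line; return the evicted tag, or None."""
        if tag in self.lines:
            # A copy already exists: only refresh its recency (assumption).
            self.lines.move_to_end(tag)
            return None
        evicted = None
        if len(self.lines) >= self.ways:
            evicted = self._victim()
            del self.lines[evicted]
        self.lines[tag] = line_class
        return evicted
```

With a 2-way set holding one class 1 line and one class 2 line, a new cast-in evicts the class 2 line even if the class 1 line is older, which is the ordering the slide describes.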

Memory Subsystem
• Total sustained bandwidth of about 100 GB/s
[Figure: memory controllers MC0 and MC1 connected through buffer chips to DDR3 DRAM chips]

On-Chip Interconnect
[Figure: on-chip interconnect. Labels: coherence bus, 4x16B data]

Reliability, Availability, Serviceability (RAS)
• Stores previous architectural states
• Rolls back when an error is detected
• Unlike POWER6, no separate recovery unit is required
  • The register renaming mechanism is used instead
• Flushes all instructions in the pipeline
• Fences the core from the rest of the system
• Runs an automatic built-in self-test

Energy Management
• With 4 times as many cores on the chip, POWER7 stays within the same power envelope as POWER6
• Reasons
  • Reduced frequency (from 2 cores at 3.6 to 5 GHz down to 8 cores at 2.4 to 4.25 GHz)
  • Microarchitecture optimization (for example, combining GPR, FPR and VR)
  • On-chip eDRAM (one fifth compared to equivalent SRAM)

Power Saving Techniques
• Turn off threads in a core
  • Remaining threads share the core execution resources
• Turn off each core individually
  • Cache and eDRAM reallocation
  • Better core-level performance
• Per-core frequency control
  • Operating requirements
• Clock gating

Questions

THANK YOU
