Adam Kunk Anil John Pete Bohman UIUC - CS 433 IBM POWER7
Quick Facts • Released by IBM in 2010 (~ February) • Successor of the POWER6 • Shift from high frequency to multi-core • Implements IBM PowerPC architecture v2.06 • Clock Rate: 2.4 GHz - 4.25 GHz • Feature size: 45 nm • ISA: Power ISA v 2.06 (RISC) • Cores: 4, 6, 8 • Cache: L1, L2, L3 – On Chip References: [1], [2], [5]
Why the POWER7? • PERCS – Productive, Easy-to-use, Reliable Computing System • DARPA-funded contract that IBM won to develop the POWER7 ($244 million contract, awarded 2006) • The contract was to develop a petascale supercomputer architecture before 2011 under the HPCS (High Productivity Computing Systems) program • IBM, Cray, and Sun Microsystems received HPCS grants for Phase II • IBM was chosen for Phase III in 2006 References: [1], [2]
Blue Waters • Side note: • The Blue Waters system was meant to be the first supercomputer built on PERCS technology • However, the contract was cancelled due to cost and complexity References: [2], [12]
History of Power
• POWER4/4+ (180 nm, 2001), "First dual core in industry": dual core, chip multiprocessing, distributed switch, shared L2, dynamic LPARs (32)
• POWER5/5+ (130 nm / 90 nm, 2004), "Hardware virtualization for Unix & Linux": dual-core & quad-core modules, enhanced scaling, 2-thread SMT, distributed switch+, core parallelism+, FP performance+, memory bandwidth+
• POWER6/6+ (65 nm, 2007), "Fastest processor in industry": dual core, high frequencies, virtualization+, memory subsystem+, AltiVec, instruction retry, dynamic energy management, 2-thread SMT+, protection keys
• POWER7/7+ (45 nm / 32 nm, 2010), "Most POWERful & scalable processor in industry": 4/6/8 cores, 32 MB on-chip eDRAM, power-optimized cores, memory subsystem++, 4-thread SMT++, reliability+, VSM & VSX, protection keys+
• POWER8: future
References: [3]
POWER7 Layout
[Die diagram: 8 cores, each with a private L2, surrounding the central 32 MB eDRAM L3; SMP fabric, power bus, GX bus, and memory interfaces at the chip edges.]
Cores:
• 8 intelligent cores per chip (socket); 4- and 6-core models available
• 12 execution units per core
• Out-of-order execution
• 4-way SMT per core, for 32 threads per chip
• L1: 32 KB I-cache / 32 KB D-cache per core
• L2: 256 KB per core
Chip:
• 32 MB intelligent L3 cache (eDRAM) on chip
References: [3]
POWER7 Options (8, 6, 4 cores) References: [3]
Execution Units • Each POWER7 core has 12 execution units: • 2 fixed point units • 2 load store units • 4 double precision floating point units (2x power6) • 1 vector unit • 1 branch unit • 1 condition register unit • 1 decimal floating point unit References: [4]
Scalability • POWER7 scales up to 32 sockets, with up to 8 cores per socket • Each chip can execute 32 threads simultaneously (4 per core), so a full 32-socket system supports up to 32*32 = 1024 simultaneous threads • 360 GB/s peak SMP bandwidth per chip • 590 GB/s peak I/O bandwidth per chip • 33.12 GFLOPS per core (max) • 264.96 GFLOPS per chip (max) • Up to 20,000 coherent operations in flight (very aggressive out-of-order execution) References: [1], [3]
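The peak-FLOP figures are consistent with each core's four double-precision FP units each completing a fused multiply-add (2 FLOPs) per cycle. A quick arithmetic check; note the 4.14 GHz clock below is an assumed value chosen to reproduce the quoted numbers, not a figure from the slides:

```python
# Peak DP FLOP-rate check for POWER7 (assumed clock marked below).
fp_units = 4          # double-precision FP pipelines per core (from the slides)
flops_per_fma = 2     # a fused multiply-add counts as 2 FLOPs
clock_ghz = 4.14      # assumed clock that reproduces the quoted 33.12 GFLOPS
cores = 8             # cores per chip

per_core = fp_units * flops_per_fma * clock_ghz   # GFLOPS per core
per_chip = per_core * cores                       # GFLOPS per chip

print(per_core)  # 33.12
print(per_chip)  # 264.96
```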
POWER7 TurboCore • TurboCore mode: runs 4 of the 8 cores • 7.25% higher core frequency • 2X the L3 cache per active core (fluid cache) • Tradeoffs • Reduces per-core software licensing costs • Improves per-core throughput • Reduces capacity for highly parallel, transaction-based workloads
POWER7 Core • Each core implements “aggressive” out-of-order (OoO) instruction execution • The processor has an Instruction Sequence Unit capable of dispatching up to six instructions per cycle to a set of queues • Up to eight instructions per cycle can be issued to the Instruction Execution units References: [4]
Pipeline References: [5]
Instruction Fetch • Up to 8 instructions fetched per cycle from the L2 into the L1 I-cache or fetch buffer • Fetch rates are balanced across active threads • Instruction grouping • Instructions belonging to a group are issued together • Groups contain only independent instructions References: [5]
Branch Prediction • POWER7 uses separate mechanisms to predict the branch direction (taken/not taken) and the branch target address. • The Instruction Fetch Unit (IFU) supports a 3-cycle branch scan loop to scan instructions for branches, compute target addresses, and determine whether each branch is unconditional or predicted taken References: [5]
Branch Direction Prediction • Tournament Predictor (due to GSEL): • 8-K entry local BHT (LBHT) • BHT – Branch History Table • 16-K entry global BHT (GBHT) • 8-K entry global selection array (GSEL) • All arrays above provide branch direction predictions for all instructions in a fetch group (fetch group - up to 8 instructions) • The arrays are shared by all threads References: [5]
Branch Direction Prediction (cont.) • Indexing : • 8-K LBHT directly indexed by 10 bits from instruction fetch address • The GBHT and GSEL arrays are indexed by the instruction fetch address hashed with a 21-bit global history vector (GHV) folded down to 11 bits, one per thread References: [5]
Branch Direction Prediction (cont.) • The value in GSEL chooses between the LBHT and GBHT for the direction prediction of each individual branch • Hence the tournament predictor! • Each BHT (LBHT and GBHT) entry contains 2 bits: • Higher-order bit determines direction (taken/not taken) • Lower-order bit provides hysteresis (prediction strength, so a single mispredict does not immediately flip the direction) References: [5]
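As a rough illustration, the direction-prediction scheme above can be sketched as follows. The table sizes come from the slides; the fold/XOR hashing and the 2-bit counter update are simplified stand-ins for the real POWER7 logic:

```python
# Sketch of a POWER7-style tournament branch-direction predictor.
# Sizes from the slides (8K local, 16K global, 8K selector); the
# hashing below is a simplified assumption, not the actual circuit.

LBHT_SIZE, GBHT_SIZE, GSEL_SIZE = 8192, 16384, 8192

lbht = [1] * LBHT_SIZE   # 2-bit entries: high bit = direction, low bit = hysteresis
gbht = [1] * GBHT_SIZE
gsel = [1] * GSEL_SIZE

def fold(ghv, bits=11):
    """Fold a (21-bit) global history vector down to `bits` bits by XOR."""
    out = 0
    while ghv:
        out ^= ghv & ((1 << bits) - 1)
        ghv >>= bits
    return out

def predict(fetch_addr, ghv):
    """Predict taken/not-taken for one branch in a fetch group."""
    local = lbht[fetch_addr & (LBHT_SIZE - 1)]            # indexed by address alone
    global_ = gbht[(fetch_addr ^ fold(ghv)) & (GBHT_SIZE - 1)]  # address hashed with history
    chooser = gsel[(fetch_addr ^ fold(ghv)) & (GSEL_SIZE - 1)]
    entry = global_ if chooser >= 2 else local            # GSEL picks the source
    return entry >= 2                                     # high bit = taken

def train(counter, taken):
    """Saturating 2-bit update: move toward 3 if taken, toward 0 if not."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)
```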
Branch Target Address Prediction • Predicted in two ways: • Indirect branches that are not subroutine returns use a 128-entry count cache (shared by all active threads). • Count cache is indexed by doing an XOR of 7 bits from the instruction fetch address and the GHV (global history vector) • Each entry in the count cache contains a 62-bit predicted address with 2 confidence bits References: [5]
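A minimal sketch of a count-cache-style indirect-target predictor follows. The 128-entry size, XOR indexing, and 2 confidence bits come from the slide; the saturating update policy is an assumption, since the real update rules are not described:

```python
# Sketch of a 128-entry count cache for indirect-branch targets.
# Indexed by XOR of fetch-address bits with the global history vector;
# the confidence-update policy below is an illustrative assumption.

ENTRIES = 128

count_cache = [{"target": 0, "conf": 0} for _ in range(ENTRIES)]

def index(fetch_addr, ghv):
    return (fetch_addr ^ ghv) & (ENTRIES - 1)   # 7-bit index

def predict_target(fetch_addr, ghv):
    entry = count_cache[index(fetch_addr, ghv)]
    # Only use the prediction when confidence is high enough.
    return entry["target"] if entry["conf"] >= 2 else None

def update(fetch_addr, ghv, actual_target):
    entry = count_cache[index(fetch_addr, ghv)]
    if entry["target"] == actual_target:
        entry["conf"] = min(entry["conf"] + 1, 3)   # 2 confidence bits saturate at 3
    else:
        entry["target"] = actual_target             # retrain on a new target
        entry["conf"] = 0
```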
Branch Target Address Prediction (cont.) • Predicted in two ways: • Subroutine returns are predicted using a link stack (one per thread). • This is like the “Return Address Stack” discussed in lecture • Support in POWER7 modes: • ST, SMT2 16-entry link stack (per thread) • SMT4 8-entry link stack (per thread)
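The per-thread link stack behaves like the return-address stack from lecture; a sketch using the 16-entry ST/SMT2 depth quoted above (the eviction behavior when full is an assumption):

```python
# Sketch of a bounded link stack (return-address predictor).
# Depth 16 matches the ST/SMT2 configuration from the slide.

class LinkStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def push_call(self, return_addr):
        """On a subroutine call, remember where to come back to."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # oldest entry is lost when full (assumed)
        self.stack.append(return_addr)

    def predict_return(self):
        """On a subroutine return, predict the target from the top entry."""
        return self.stack.pop() if self.stack else None

ls = LinkStack()
ls.push_call(0x1000)
ls.push_call(0x2000)
print(hex(ls.predict_return()))  # 0x2000
print(hex(ls.predict_return()))  # 0x1000
```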
ILP • Advanced branch prediction • Large out-of-order execution windows • Large and fast caches • Execute more than one execution thread per core • A single 8-core Power7 processor can execute 32 threads in the same clock cycle.
POWER7 Demo • IBM POWER7 Demo • Visual representation of the SMT capabilities of the POWER7 • Brief introduction to the on-chip L3 cache
SMT • Simultaneous Multithreading • Separate instruction streams running concurrently on the same physical processor • POWER7 supports: • 2 pipes for storage instructions (load/stores) • 2 pipes for executing arithmetic instructions (add, subtract, etc.) • 1 pipe for branch instructions (control flow) • Parallel support for floating-point and vector operations References: [7], [8]
SMT (cont.) • Simultaneous multithreading modes: • SMT1 (ST): single instruction execution thread per core • SMT2: two instruction execution threads per core • SMT4: four instruction execution threads per core • This means an 8-core POWER7 can execute 32 threads simultaneously • POWER7 supports SMT1, SMT2, and SMT4 References: [5], [8]
Multithreading History
[Diagram: issue-slot occupancy of the execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) across generations: S80 hardware multithreading, single-thread out-of-order execution, POWER5 2-way SMT, and POWER7 4-way SMT, with colors marking which of threads 0-3 (or no thread) occupies each slot. Each added SMT thread fills issue slots that would otherwise go idle.]
References: [3]
Cache Overview References: [13]
Cache Design Considerations • On-chip cache required to supply sufficient bandwidth to 8 cores • The previous off-chip socket interface was unable to scale • Support dynamically disabled cores • Exploit ILP and SMT's increased ability to overlap cache latency References: [5] [13]
L1 Cache • I- and D-cache split to reduce latency • Way-prediction bits reduce hit latency • Write-through • No L1 write-backs required on line eviction • The high-speed L2 is able to handle the write bandwidth • Binary-tree (pseudo-)LRU replacement • Prefetching • On each L1 I-cache miss, prefetch the next 2 blocks References: [5] [13]
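The binary-tree LRU policy can be illustrated on a single 4-way set (the way count here is for brevity, not the actual L1 associativity): one bit per internal tree node points toward the less recently used half, so the victim is found by walking the tree:

```python
# Sketch of binary-tree pseudo-LRU for one 4-way cache set.
# Three tree bits: bit 0 picks a half, bits 1 and 2 pick within a half.
# A bit value of 0 means "the LRU side is the left", 1 means "the right".

class TreePLRU4:
    def __init__(self):
        self.bits = [0, 0, 0]

    def victim(self):
        """Walk the tree in the direction the bits point to find the LRU way."""
        half = self.bits[0]
        leaf = self.bits[1 + half]
        return half * 2 + leaf

    def touch(self, way):
        """On an access, flip the bits along the path to point away from `way`."""
        half, leaf = way // 2, way % 2
        self.bits[0] = 1 - half
        self.bits[1 + half] = 1 - leaf

plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())  # 0: least recently used after touching ways 0,1,2,3 in order
```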
L2 Cache • Superset of L1 (inclusive) • Reduced latency by decreasing capacity • L2 utilizes larger L3-Local cache as victim cache • Increased associativity References: [5] [13]
L3 Cache • 32 MB fluid L3 cache • Lateral cast-outs, disabled-core provisioning • 4 MB of local L3 cache per core (8 x 4 MB = 32 MB) • Local cache is closer to its respective core, reducing latency • L3 cache accesses are routed to the local L3 region first • Cache lines are cloned when used by multiple cores References: [5] [13]
eDRAM • Embedded Dynamic Random-Access Memory • Less area (1-transistor cell vs. 6-transistor SRAM cell) • Enables on-chip L3 cache • Reduces L3 latency • Larger internal bus size, which increases bandwidth • Compared to off-chip SRAM cache: • 1/6 the latency • 1/5 the standby power • Also used in game consoles (PS2, Wii, etc.) References: [5], [6]
Memory • 2 memory controllers per chip, 4 channels each • Exploits the elimination of the off-chip L3 cache interface • Up to 32 GB per core, 256 GB capacity per chip • 180 GB/s throughput • POWER6: 75 GB/s • 16 KB scheduling buffer References: [5]
Multicore • On-chip interconnect ties the 8 cores together • First-level SMP interconnect links up to 4 chips (a quad-chip module, QCM) • Second-level SMP interconnect links up to 8 QCMs • Scalable to 256-way SMP References: [5] [14]
Multicore • High-performance systems can contain up to 513 supernodes (SNs) • As system size grows, coherence broadcast traffic increases References: [5] [14]
Cache Coherence • Multicore coherence protocol: Extended MESI with behavioral and locality hints • NOTE: MESI is the “Illinois Protocol” • Most common protocol to support write-back cache • Each cache line marked with the following states (2 additional bits): • Modified: present only in current cache, dirty • Exclusive: present only in current cache, clean • Shared: may be stored in other caches, clean • Invalid: cache line invalid References: [9], [10]
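A minimal sketch of the four base MESI states listed above; POWER7's actual protocol extends these with additional states and behavioral/locality hints that are not modeled here:

```python
# Minimal MESI state machine for one cache line in one cache.
# Only the four base states from the slide are modeled; POWER7's
# extended protocol adds states and hints beyond this sketch.

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def on_local(state, op, others_have_copy):
    """Transition on a load/store from this cache's own core."""
    if op == "read":
        if state == INVALID:                 # miss: fetch the line
            return SHARED if others_have_copy else EXCLUSIVE
        return state                         # M/E/S read hits stay put
    if op == "write":
        return MODIFIED                      # any write makes the line dirty

def on_remote(state, op):
    """Transition when another cache reads or writes the same line."""
    if op == "read":
        return SHARED if state in (MODIFIED, EXCLUSIVE) else state
    if op == "write":
        return INVALID                       # remote write invalidates our copy

state = INVALID
state = on_local(state, "read", others_have_copy=False)   # -> E (clean, exclusive)
state = on_local(state, "write", others_have_copy=False)  # -> M (silent upgrade)
state = on_remote(state, "read")                          # -> S (we supply the data)
print(state)  # S
```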
Cache Coherence (cont.) • Multicore coherence implementation: directory at the L3 cache • Directory-based as opposed to a snooping-based system • Directory-based: the sharing status of a block of physical memory is kept in one location, called the directory • One centralized directory in the outermost cache (L3) • Snooping: every cache that has a copy of the data from a block of physical memory tracks the sharing status of the block • Caches monitor, or snoop, the broadcast medium References: [9], [11]
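The directory idea can be sketched as a presence-bit vector per block, in the style of a basic textbook directory protocol; this is a generic illustration, not the actual POWER7 L3 directory format:

```python
# Sketch of a directory entry tracking which caches share a memory block.
# One presence bit per cache plus a dirty flag, as in a basic
# directory-based protocol; the real POWER7 directory is more elaborate.

class DirectoryEntry:
    def __init__(self, num_caches=8):
        self.sharers = [False] * num_caches  # presence bit per cache
        self.dirty = False                   # True if some cache holds it modified

    def on_read(self, cache_id):
        """A cache reads the block: record it as a sharer."""
        self.sharers[cache_id] = True

    def on_write(self, cache_id):
        """A cache writes: all other sharers must be invalidated."""
        invalidate = [i for i, s in enumerate(self.sharers)
                      if s and i != cache_id]
        self.sharers = [False] * len(self.sharers)
        self.sharers[cache_id] = True
        self.dirty = True
        return invalidate        # caches that must drop their copies

entry = DirectoryEntry()
entry.on_read(0)
entry.on_read(3)
print(entry.on_write(3))  # [0]: cache 0 must invalidate its copy
```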
Maintaining The Balance References: [13]
Network • On-chip interconnect • 500 GB/s chip bandwidth • Non-blocking coherence transport mechanism, rate-matched to the slowest snoop-processing rate • 8 16-byte data buses References: [5] [14]
Network • SMP interconnect • shared by coherence and data traffic • independently tuned • 360 GB/s bandwidth • simultaneously balanced priorities and flow rates for coherence requests References: [5] [14]
Exceptions • Processor • Exceptions handled by the ISU (Instruction Sequence Unit) • Instructions tracked in groups • Flushes for groups are combined • Supports partial group flushes • Memory • Single-error-correct, double-error-detect (SEC/DED) ECC References: [4] [5]
Balanced Design • Performance achieved through: • More flexible execution units • Increased pipeline utilization with SMT4 • Aggressive out-of-order execution • Energy efficiency achieved through: • Independent frequency control per core • Dynamic partitioning and reallocation of the L3 cache References: [5] [7]
References • 1. Wikipedia: POWER7. http://en.wikipedia.org/wiki/POWER7 • 2. Wikipedia: PERCS. http://en.wikipedia.org/wiki/PERCS • 3. Central PA PUG POWER7 review.ppt: http://www.ibm.com/developerworks/wikis/download/attachments/104533149/POWER7+Technology.pdf
References (cont.) • 4. http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf • 5. http://www.serc.iisc.ernet.in/~govind/243/Power7.pdf • 6. http://en.wikipedia.org/wiki/EDRAM • 7. http://www.spscicomp.org/ScicomP16/presentations/Power7_Performance_Overview.pdf • 8. http://www-03.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf • 9. Computer Architecture: A Quantitative Approach. Fifth Edition. Morgan Kaufman. • 10. Wikipedia: MESI Protocol. http://en.wikipedia.org/wiki/MESI_protocol • 11. Wikipedia: Cache Coherence. http://en.wikipedia.org/wiki/Cache_coherence • 12. Wikipedia: Blue Waters. http://en.wikipedia.org/wiki/Blue_Waters • 13. http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf • 14. http://mmc.geofisica.unam.mx/edp/Ejemplitos/SC11/src/pdf/papers/tp39.pdf