
UIUC - CS 433 IBM POWER7


Presentation Transcript


  1. Adam Kunk, Anil John, Pete Bohman • UIUC - CS 433 • IBM POWER7

  2. Quick Facts • Released by IBM in February 2010 • Successor of the POWER6 • Shift from high frequency to multi-core • Implements the Power ISA v2.06 (RISC) • Clock rate: 2.4 GHz - 4.25 GHz • Feature size: 45 nm • Cores: 4, 6, 8 • Cache: L1, L2, L3 – on chip References: [1], [2], [5]

  3. Why the POWER7? • PERCS – Productive, Easy-to-use, Reliable Computer System • DARPA-funded contract that IBM won in order to develop the POWER7 ($244 million contract, awarded 2006) • The contract was to develop a petascale supercomputer architecture before 2011 under the HPCS (High Productivity Computing Systems) program • IBM, Cray, and Sun Microsystems received HPCS grants for Phase II • IBM was chosen for Phase III in 2006 References: [1], [2]

  4. Blue Waters • Side note: • The Blue Waters system was meant to be the first supercomputer built on PERCS technology • However, IBM's contract was cancelled due to cost and complexity References: [2]

  5. History of POWER (processor roadmap) References: [3]
  • POWER4/4+ (2001; 180 nm): First dual core in the industry — dual core, chip multiprocessing, distributed switch, shared L2, dynamic LPARs (32)
  • POWER5/5+ (2004; 130 nm, 90 nm): Hardware virtualization for UNIX & Linux — dual-core and quad-core modules, enhanced scaling, 2-thread SMT, distributed switch+, core parallelism+, FP performance+, memory bandwidth+
  • POWER6/6+ (2007; 65 nm): Fastest processor in the industry — dual core, high frequencies, virtualization+, memory subsystem+, AltiVec, instruction retry, dynamic energy management, 2-thread SMT+, protection keys
  • POWER7/7+ (2010; 45 nm, 32 nm): Most POWERful & scalable processor in the industry — 4/6/8 cores, 32 MB on-chip eDRAM, power-optimized cores, memory subsystem++, 4-thread SMT++, reliability+, VSM & VSX, protection keys+
  • POWER8: Future

  6. POWER7 Layout (die diagram: 8 cores, each with a private L2, surrounding the on-chip L3 eDRAM; SMP fabric / POWER bus, GX bus, and memory interfaces at the chip edges) Cores: • 8 intelligent cores per chip (socket) • 4- and 6-core models available • 12 execution units per core • Out-of-order execution • 4-way SMT per core • 32 threads per chip • L1 – 32 KB I-cache / 32 KB D-cache per core • L2 – 256 KB per core Chip: • 32 MB intelligent L3 cache on chip (eDRAM) References: [3]

  7. POWER7 Options (8, 6, 4 cores) References: [3]

  8. Execution Units • Each POWER7 core has 12 execution units: • 2 fixed-point units • 2 load/store units • 4 double-precision floating-point units (2x the POWER6) • 1 vector unit • 1 branch unit • 1 condition register unit • 1 decimal floating-point unit References: [4]

  9. Scalability • POWER7 can handle up to 32 sockets • 32 sockets with up to 8 cores/socket • Each chip can execute 32 threads simultaneously (8 cores * 4 SMT threads), giving up to 32 * 32 = 1024 simultaneous threads system-wide • 360 GB/s peak SMP bandwidth / chip • 590 GB/s peak I/O bandwidth / chip • 33.12 GFLOPS per core (max) • 264.96 GFLOPS per chip (max) • Up to 20,000 coherent operations in flight (very aggressive out-of-order execution) References: [1], [3]
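As a sanity check on those peak numbers, here is our own back-of-the-envelope working (not from the slides): we assume the 33.12 GFLOPS figure corresponds to a 4.14 GHz top bin and 8 double-precision flops per cycle, i.e. the 4 DP floating-point pipes each completing a fused multiply-add.

```latex
\text{threads}_{\max} = 32~\text{sockets} \times 8~\tfrac{\text{cores}}{\text{socket}} \times 4~\tfrac{\text{threads}}{\text{core}} = 1024
```

```latex
\text{GFLOPS}_{\text{core}} = 4.14~\text{GHz} \times 4~\text{DP pipes} \times 2~\tfrac{\text{flops}}{\text{FMA}} = 33.12,
\qquad
\text{GFLOPS}_{\text{chip}} = 8 \times 33.12 = 264.96
```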

  10. POWER7 TurboCore • TurboCore mode • Runs 4 of the 8 cores • ~7.25% higher core frequency • 2x the L3 cache available per active core (fluid cache) • Tradeoffs • Reduces per-core software license costs • Favors per-core transactional workloads • Gives up total throughput on highly parallel workloads

  11. POWER7 Core • Each core implements “aggressive” out-of-order (OoO) instruction execution • The processor has an Instruction Sequence Unit capable of dispatching up to six instructions per cycle to a set of queues • Up to eight instructions per cycle can be issued to the Instruction Execution units References: [4]

  12. Pipeline References: [5]

  13. Instruction Fetch • Up to 8 instructions fetched per cycle from the L2 into the L1 I-cache or the fetch buffer • Instruction fetch rates are balanced across active threads • Instruction grouping (see the sketch below) • Instructions belonging to a group are issued together • Groups contain only independent instructions References: [5]
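A conceptual sketch of forming a group of mutually independent instructions. This is purely illustrative: the dependence check, the 6-instruction limit, and all names are our simplifications, not the actual POWER7 grouping rules.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_GROUP 6   /* dispatch width assumed from slide 11 */

typedef struct { int dst_reg; int src_reg[2]; } Inst;

/* True if instruction 'b' reads a register that 'a' writes (RAW dependence). */
static bool depends_on(const Inst *a, const Inst *b)
{
    return b->src_reg[0] == a->dst_reg || b->src_reg[1] == a->dst_reg;
}

/* Fill 'group' with consecutive, mutually independent instructions from the
 * fetch buffer; stop at the first dependent instruction or when the group is
 * full. Returns the number of instructions placed in the group. */
size_t form_group(const Inst *fetch_buf, size_t n, const Inst *group[MAX_GROUP])
{
    size_t g = 0;
    for (size_t i = 0; i < n && g < MAX_GROUP; i++) {
        for (size_t j = 0; j < g; j++)
            if (depends_on(group[j], &fetch_buf[i]))
                return g;              /* a dependent inst starts the next group */
        group[g++] = &fetch_buf[i];
    }
    return g;
}
```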

  14. Branch Prediction • POWER7 uses different mechanisms to predict the branch direction (taken/not taken) and the branch target address • The Instruction Fetch Unit (IFU) supports a 3-cycle branch scan loop that scans fetched instructions for branches, computes their target addresses, and determines whether each branch is unconditional or predicted taken References: [5]

  15. Branch Direction Prediction • Tournament Predictor (due to GSEL): • 8-K entry local BHT (LBHT) • BHT – Branch History Table • 16-K entry global BHT (GBHT) • 8-K entry global selection array (GSEL) • All arrays above provide branch direction predictions for all instructions in a fetch group (fetch group - up to 8 instructions) • The arrays are shared by all threads References: [5]

  16. Branch Direction Prediction (cont.) • Indexing : • 8-K LBHT directly indexed by 10 bits from instruction fetch address • The GBHT and GSEL arrays are indexed by the instruction fetch address hashed with a 21-bit global history vector (GHV) folded down to 11 bits, one per thread References: [5]

  17. Branch Direction Prediction (cont.) • The value in GSEL chooses between the LBHT and GBHT predictions for each individual branch • Hence the tournament predictor! • Each BHT (LBHT and GBHT) entry contains 2 bits: • Higher-order bit gives the direction (taken/not taken) • Lower-order bit provides hysteresis, so a single misprediction does not immediately flip the predicted direction References: [5]
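A minimal C sketch of the lookup described on slides 15-17. The array sizes and the 2-bit entry layout come from the slides; the hash that folds the 21-bit GHV down to 11 bits, the index arithmetic, and the single-branch (rather than per-fetch-group) prediction are our own stand-ins, since the exact POWER7 hashing is not given here.

```c
#include <stdbool.h>
#include <stdint.h>

#define LBHT_ENTRIES (8  * 1024)   /* 8K-entry local BHT        */
#define GBHT_ENTRIES (16 * 1024)   /* 16K-entry global BHT      */
#define GSEL_ENTRIES (8  * 1024)   /* 8K-entry global selector  */

/* Each entry is 2 bits: bit 1 = predicted direction, bit 0 = hysteresis. */
static uint8_t lbht[LBHT_ENTRIES];
static uint8_t gbht[GBHT_ENTRIES];
static uint8_t gsel[GSEL_ENTRIES];

/* Fold the 21-bit per-thread global history vector to 11 bits and XOR it
 * into the fetch address (illustrative hash; the real folding is not given). */
static uint32_t ghist_index(uint64_t fetch_addr, uint32_t ghv, uint32_t size)
{
    uint32_t folded = (ghv ^ (ghv >> 11)) & 0x7FF;
    return ((uint32_t)(fetch_addr >> 2) ^ folded) & (size - 1);
}

/* Predict taken / not taken for one branch. The real arrays hand back
 * predictions for every slot of an 8-instruction fetch group at once;
 * this sketch predicts a single branch for clarity. */
bool predict_direction(uint64_t fetch_addr, uint32_t ghv)
{
    uint8_t local  = lbht[(uint32_t)(fetch_addr >> 2) & (LBHT_ENTRIES - 1)];
    uint8_t global = gbht[ghist_index(fetch_addr, ghv, GBHT_ENTRIES)];
    uint8_t sel    = gsel[ghist_index(fetch_addr, ghv, GSEL_ENTRIES)];

    uint8_t chosen = (sel & 0x2) ? global : local;   /* the tournament choice */
    return (chosen & 0x2) != 0;                      /* high bit = taken      */
}

/* On resolution each consulted entry is updated like a 2-bit saturating counter. */
void update_entry(uint8_t *entry, bool taken)
{
    if (taken)  { if (*entry < 3) (*entry)++; }
    else        { if (*entry > 0) (*entry)--; }
}
```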

  18. Branch Target Address Prediction • Predicted in two ways: • Indirect branches that are not subroutine returns use a 128-entry count cache (shared by all active threads). • Count cache is indexed by doing an XOR of 7 bits from the instruction fetch address and the GHV (global history vector) • Each entry in the count cache contains a 62-bit predicted address with 2 confidence bits References: [5]

  19. Branch Target Address Prediction (cont.) • Predicted in two ways: • Subroutine returns are predicted using a link stack (one per thread) • This is like the "Return Address Stack" discussed in lecture • Link stack size by POWER7 mode: • ST, SMT2: 16-entry link stack (per thread) • SMT4: 8-entry link stack (per thread)
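A sketch of both target predictors, using the sizes quoted on slides 18-19 (a 128-entry count cache with a 7-bit XOR index, and a 16-entry link stack). The push/pop policy shown is the generic return-address-stack behaviour; the field widths, names, and overflow handling are illustrative rather than POWER7's actual design.

```c
#include <stdint.h>

#define COUNT_CACHE_ENTRIES 128   /* indexed by 7 bits of (fetch addr XOR GHV) */
#define LINK_STACK_DEPTH     16   /* ST / SMT2 mode; 8 entries per thread in SMT4 */

/* Count cache: predicts targets of indirect branches that are not returns. */
typedef struct {
    uint64_t target;       /* 62-bit predicted address in the hardware */
    uint8_t  confidence;   /* 2 confidence bits                        */
} CountCacheEntry;

static CountCacheEntry count_cache[COUNT_CACHE_ENTRIES];

uint64_t predict_indirect(uint64_t fetch_addr, uint32_t ghv)
{
    uint32_t idx = ((uint32_t)(fetch_addr >> 2) ^ ghv) & (COUNT_CACHE_ENTRIES - 1);
    return count_cache[idx].target;
}

/* Link stack: predicts subroutine return addresses (one stack per thread). */
static uint64_t link_stack[LINK_STACK_DEPTH];
static int      link_top = -1;

void on_call(uint64_t return_addr)       /* a branch-and-link pushes its return address */
{
    if (link_top < LINK_STACK_DEPTH - 1)
        link_stack[++link_top] = return_addr;
}

uint64_t predict_return(void)            /* a branch-to-link-register pops the top entry */
{
    return (link_top >= 0) ? link_stack[link_top--] : 0;
}
```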

  20. ILP • Advanced branch prediction • Large out-of-order execution windows • Large and fast caches • Execute more than one execution thread per core • A single 8-core Power7 processor can execute 32 threads in the same clock cycle.

  21. POWER7 Demo • IBM POWER7 Demo • Visual representation of the SMT capabilities of the POWER7 • Brief introduction to the on-chip L3 cache

  22. SMT • Simultaneous Multithreading • Separate instruction streams running concurrently on the same physical processor • POWER7 supports: • 2 pipes for storage instructions (load/stores) • 2 pipes for executing arithmetic instructions (add, subtract, etc.) • 1 pipe for branch instructions (control flow) • Parallel support for floating-point and vector operations References: [7], [8]

  23. SMT (cont.) • Simultaneous Multithreading Explanation: • SMT1: Single instruction execution thread per core • SMT2: Two instruction execution threads per core • SMT4: Four instruction execution threads per core • This means that an 8-core Power7 can execute 32 threads simultaneously • POWER7 supports SMT1, SMT2, SMT4 References: [5], [8]
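Each SMT hardware thread appears to the operating system as its own logical CPU, so a fully enabled 8-core POWER7 in SMT4 mode reports 32 processors. Below is a minimal, generic POSIX sketch (not POWER-specific) of sizing a thread pool to whatever count the OS reports.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* Each worker is one software thread; in SMT4 mode the hardware can
     * interleave up to four of them on every core, every cycle. */
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    /* 8 cores x SMT4 = 32 online logical CPUs on a fully enabled POWER7 chip. */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t *tids = malloc(sizeof(pthread_t) * (size_t)ncpus);

    for (long i = 0; i < ncpus; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return 0;
}
```

On POWER Linux systems the SMT mode itself is typically switched with the ppc64_cpu --smt utility from powerpc-utils.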

  24. Multithreading History (diagram: occupancy of the FX0/FX1, FP0/FP1, LS0/LS1, BRX, and CRL execution pipes, comparing single-threaded out-of-order execution, S80 hardware multithreading, POWER5 2-way SMT, and POWER7 4-way SMT; shading shows which of threads 0-3 is executing and when no thread is executing) References: [3]

  25. Cache Overview References: [13]

  26. Cache Design Considerations • On-chip cache required to supply sufficient bandwidth to 8 cores • The previous off-chip (socket-interface) L3 was unable to scale • Must support dynamically enabled/disabled cores (e.g. TurboCore) • Exploit ILP and the extra SMT threads to overlap cache latency References: [5] [13]

  27. L1 Cache • I- and D-cache are split to reduce latency • Way-prediction bits reduce hit latency • Write-through • No L1 write-backs required on line eviction • The high-speed L2 is able to absorb the write bandwidth • Binary-tree pseudo-LRU replacement • Prefetching: on each L1 I-cache miss, prefetch the next 2 blocks (sketch below) References: [5] [13]
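A conceptual sketch of that "fetch the missing line, then prefetch the next two" policy. The 128-byte line size is POWER7's; the lookup and request functions are stubs we invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  128   /* POWER7 cache line size in bytes              */
#define PREFETCH_N   2   /* lines prefetched ahead on an I-cache miss    */

/* Stubs standing in for the real hardware actions (illustrative only). */
static bool icache_lookup(uint64_t line_addr) { (void)line_addr; return false; }
static void request_line_from_l2(uint64_t line_addr)
{
    printf("request line 0x%llx from L2\n", (unsigned long long)line_addr);
}

/* Demand fetch plus next-2-line sequential prefetch, as on slide 27. */
static void fetch_line(uint64_t addr)
{
    uint64_t line = addr & ~(uint64_t)(LINE_SIZE - 1);

    if (!icache_lookup(line)) {
        request_line_from_l2(line);              /* the demand miss itself        */
        for (int i = 1; i <= PREFETCH_N; i++)    /* hides latency when the fetch  */
            request_line_from_l2(line + (uint64_t)i * LINE_SIZE);  /* stream runs straight ahead */
    }
}

int main(void)
{
    fetch_line(0x10040);   /* misses, so three consecutive lines are requested */
    return 0;
}
```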

  28. L2 Cache • Superset of L1 (inclusive) • Reduced latency by decreasing capacity • L2 utilizes larger L3-Local cache as victim cache • Increased associativity References: [5] [13]

  29. L3 Cache • 32 MB fluid L3 cache • Lateral cast-outs between regions; capacity of disabled cores can be provisioned to active ones • 4 MB local L3 region per core (8 x 4 MB = 32 MB) • The local region sits closest to its core, reducing latency • An L3 access is routed to the local L3 region first • Cache lines are cloned when used by multiple cores References: [5] [13]

  30. eDRAM • Embedded Dynamic Random-Access Memory • Less area (1 transistor per cell vs. 6 for SRAM) • Enables the on-chip L3 cache • Reduces L3 latency • Larger internal bus width, which increases bandwidth • Compared to off-chip SRAM cache: • 1/6 the latency • 1/5 the standby power • Also used in game consoles (PS2, Wii, etc.) References: [5], [6]

  31. Memory • 2 memory controllers per chip, each with 4 channels • Exploits the elimination of the off-chip L3 cache interface • 32 GB per core, 256 GB capacity per chip • 180 GB/s throughput (POWER6: 75 GB/s) • 16 KB scheduling buffer References: [5]

  32. Multicore • The on-chip interconnect ties the 8 cores together • The first-level SMP interconnect links up to 4 chips (forming a quad-chip module, QCM) • The second-level SMP interconnect links up to 8 QCMs • Scalable to 256-way SMP References: [5] [14]

  33. Multicore • High-performance systems can contain up to 513 supernodes (SNs) • As system size grows, coherence broadcast traffic increases References: [5] [14]

  34. Cache Coherence • Multicore coherence protocol: Extended MESI with behavioral and locality hints • NOTE: MESI is the “Illinois Protocol” • Most common protocol to support write-back cache • Each cache line marked with the following states (2 additional bits): • Modified: present only in current cache, dirty • Exclusive: present only in current cache, clean • Shared: may be stored in other caches, clean • Invalid: cache line invalid References: [9], [10]
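As a refresher on the base protocol behind POWER7's extended MESI, here is a textbook sketch of the four states and what a write by the local core does in each. The behavioral and locality hints the slide mentions, and POWER's additional states, are not modelled; fabric actions are reduced to placeholders.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;

/* Textbook MESI response of one cache line to a write by the local core. */
MesiState on_local_write(MesiState s)
{
    switch (s) {
    case MODIFIED:   /* already dirty and private: the write hits silently     */
        return MODIFIED;
    case EXCLUSIVE:  /* private and clean: silent upgrade, no fabric traffic   */
        return MODIFIED;
    case SHARED:     /* other caches may hold copies: invalidate them first    */
        printf("broadcast invalidate\n");
        return MODIFIED;
    case INVALID:    /* not present: read the line with intent to modify       */
        printf("read with intent to modify\n");
        return MODIFIED;
    }
    return INVALID;  /* unreachable */
}
```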

  35. Cache Coherence (cont.) • Multicore coherence implementation: directory at the L3 cache • Directory as opposed to a snooping-based system • Directory-based: the sharing status of a block of physical memory is kept in one location, called the directory • One centralized directory in the outermost cache (L3) • Snooping: every cache that has a copy of the data from a block of physical memory tracks the sharing status of the block • Caches monitor, or snoop, the broadcast medium References: [9], [11]
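A sketch of what a per-line directory entry could track, versus every cache snooping the fabric. The sharer bit-vector, field names, and handler logic are illustrative textbook structures, not POWER7's actual L3 directory format.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 8

/* One directory entry per cached block: who holds it, and whether it is dirty. */
typedef struct {
    uint8_t sharers;  /* bit i set => core i's cache holds a copy             */
    bool    dirty;    /* true => exactly one sharer holds the line Modified   */
} DirEntry;

/* Read miss from 'core': if the line is dirty elsewhere, the owner must supply
 * (and clean) it; the reader is then recorded as a sharer, so a later write
 * knows exactly whom to invalidate, with no broadcast snoop needed. */
void directory_read_miss(DirEntry *e, int core)
{
    if (e->dirty)
        e->dirty = false;            /* owner writes back / forwards the line */
    e->sharers |= (uint8_t)(1u << core);
}

/* Write miss from 'core': invalidate every other sharer, record sole ownership. */
void directory_write_miss(DirEntry *e, int core)
{
    e->sharers = (uint8_t)(1u << core);
    e->dirty   = true;
}
```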

  36. Maintaining The Balance References: [13]

  37. Network • On-chip interconnect • 500 GB/s chip bandwidth • Non-blocking coherence transport mechanism • Coherence broadcasts are rate-matched to the slowest snoop processing rate • 8 16-byte data buses References: [5] [14]

  38. Network • SMP interconnect • Shared by coherence and data traffic, each independently tuned • 360 GB/s bandwidth • Priorities and flow rates for coherence requests and data are balanced simultaneously References: [5] [14]

  39. Exceptions • Processor • Exceptions handled by the ISU • Instructions are tracked in groups • Flushes for groups are combined • Supports partial group flushes • Memory • Single-error-correct, double-error-detect (SECDED) ECC References: [4] [5]

  40. Balanced Design • Performance achieved through: • More flexible execution units • Increased pipeline utilization with SMT4 • Aggressive out-of-order execution • Energy efficiency achieved through: • Independent frequency control per core • Dynamic partitioning and reallocation of the L3 cache References: [5] [7]

  41. References • 1. http://en.wikipedia.org/wiki/POWER7 • 2. http://en.wikipedia.org/wiki/PERCS • 3. Central PA PUG POWER7 review.ppt: http://www.ibm.com/developerworks/wikis/download/attachments/104533149/POWER7+Technology.pdf

  42. References (cont.) • 4. http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf • 5. http://www.serc.iisc.ernet.in/~govind/243/Power7.pdf • 6. http://en.wikipedia.org/wiki/EDRAM • 7. http://www.spscicomp.org/ScicomP16/presentations/Power7_Performance_Overview.pdf • 8. http://www-03.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf • 9. Computer Architecture: A Quantitative Approach, Fifth Edition. Morgan Kaufmann. • 10. Wikipedia: MESI Protocol. http://en.wikipedia.org/wiki/MESI_protocol • 11. Wikipedia: Cache Coherence. http://en.wikipedia.org/wiki/Cache_coherence • 12. Wikipedia: Blue Waters. http://en.wikipedia.org/wiki/Blue_Waters • 13. http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf • 14. http://mmc.geofisica.unam.mx/edp/Ejemplitos/SC11/src/pdf/papers/tp39.pdf
