POWER5



  1. POWER5 Ewen Cheslack-Postava Case Taintor Jake McPadden

  2. POWER5 Lineage • IBM 801 – widely considered the first true RISC processor • POWER1 – 3 chips wired together (branch, integer, floating point) • POWER2 – improved POWER1: added a second FPU, more cache, and 128-bit math • POWER3 – moved to a 64-bit architecture • POWER4…

  3. POWER4 • Dual-core • High-speed connections to up to 3 other pairs of POWER4 CPUs • Ability to turn off pair of CPUs to increase throughput • Apple G5 uses a single-core derivative of POWER4 (PowerPC 970) • POWER5 designed to allow for POWER4 optimizations to carry over

  4. Pipeline Requirements • Maintain binary compatibility • Maintain structural compatibility • Optimizations for POWER4 carry forward • Improved performance • Enhancements for server virtualization • Improved reliability, availability, and serviceability at chip and system levels

  5. Pipeline Improvements • Enhanced thread-level parallelism • Two threads per processor core, a.k.a. simultaneous multithreading (SMT) • 2 threads/core * 2 cores/chip = 4 threads/chip • Each thread has independent access to the L2 cache • Dynamic Power Management • Reliability, Availability, and Serviceability

  6. POWER5 Chip Stats • Copper interconnects • Decrease wire resistance and reduce delays in wire-dominated chip timing paths • 8 levels of metal • 389 mm²

  7. POWER5 Chip Stats • Silicon-on-insulator (SOI) devices • Thin layer of silicon (50 nm to 100 µm) on an insulating substrate, usually sapphire or silicon dioxide (80 nm) • Reduces the electrical charge a transistor has to move during a switching operation (compared to bulk CMOS) • Increased speed (up to 15%) • Reduced switching energy (up to 20%) • Allows for higher clock frequencies (> 5 GHz) • SOI chips cost more to produce and are therefore used for high-end applications • Reduces soft errors

  8. Pipeline • Pipeline is identical to POWER4 • All latencies, including the branch misprediction penalty and the load-to-use latency on an L1 data cache hit, are the same as POWER4

  9. POWER5 Pipeline IF – instruction fetch, IC – instruction cache, BP – branch predict, Dn – decode stage, Xfer – transfer, GD – group dispatch, MP – mapping, ISS – instruction issue, RF – register file read, EX – execute, EA – compute address, DC – data cache, F6 – six cycle floating point unit, Fmt – data format, WB – write back, CP – group commit

  10. Instruction Data Flow LSU – load/store unit, FXU – fixed point execution unit, FPU – floating point unit, BXU – branch execution unit, CRL – condition register logical execution unit

  11. Instruction Fetch • Fetch up to 8 instructions per cycle from instruction cache • Instruction cache and instruction translation shared between threads • One thread fetching per cycle
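
The fetch policy above can be sketched in a few lines of C. This is an illustration only, not the hardware's actual selection logic (which also weighs thread priority): the single fetch port alternates between the two threads' instruction buffers, skipping a thread whose 24-entry buffer cannot accept a full 8-instruction fetch group.

```c
#include <stdio.h>

#define FETCH_WIDTH 8   /* up to 8 instructions per cycle (slide 11) */
#define BUF_CAP     24  /* per-thread instruction buffer (slide 14)  */

static int buf_count[2];          /* occupancy of each thread's buffer */

/* Pick which thread owns the fetch stage this cycle: round-robin,
 * skipping a thread whose buffer cannot take a full fetch group. */
static int pick_fetch_thread(int cycle) {
    for (int i = 0; i < 2; i++) {
        int t = (cycle + i) & 1;               /* alternate starting thread */
        if (buf_count[t] + FETCH_WIDTH <= BUF_CAP)
            return t;
    }
    return -1;                                 /* both buffers full */
}

int main(void) {
    for (int cycle = 0; cycle < 8; cycle++) {
        int t = pick_fetch_thread(cycle);
        if (t >= 0) {
            buf_count[t] += FETCH_WIDTH;
            printf("cycle %d: thread %d fetches (buf now %d)\n",
                   cycle, t, buf_count[t]);
        }
        if (buf_count[0] >= 5) buf_count[0] -= 5;  /* thread 0 dispatches */
    }
    return 0;
}
```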

  12. Branch Prediction • Three branch history tables shared by the 2 threads • 1 bimodal predictor, 1 path-correlated predictor • 1 selector to predict which of the first 2 is correct • Can predict all branches – even if every instruction fetched is a branch
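
A classic way to realize this three-table scheme is a tournament predictor. The sketch below is hedged: it uses a gshare-style global predictor as a stand-in for POWER5's path-correlated table, and the 4096-entry table size is arbitrary (the slide does not give the real sizes). The chooser table learns, per branch, which of the two predictors to trust.

```c
#include <stdint.h>
#include <stdio.h>

#define TBITS 12
#define TSIZE (1u << TBITS)
#define TMASK (TSIZE - 1)

static uint8_t bimodal[TSIZE];  /* 2-bit counters indexed by PC           */
static uint8_t gshare[TSIZE];   /* 2-bit counters indexed by PC ^ history */
static uint8_t chooser[TSIZE];  /* 2-bit counters: which table to trust   */
static uint32_t ghist;          /* global branch-outcome history          */

static int taken(uint8_t c) { return c >= 2; }
static uint8_t bump(uint8_t c, int t) { return t ? (c < 3 ? c + 1 : 3)
                                                 : (c > 0 ? c - 1 : 0); }

int predict(uint32_t pc) {
    int b = taken(bimodal[pc & TMASK]);
    int g = taken(gshare[(pc ^ ghist) & TMASK]);
    return taken(chooser[pc & TMASK]) ? g : b;   /* chooser picks a table */
}

void update(uint32_t pc, int outcome) {
    uint32_t bi = pc & TMASK, gi = (pc ^ ghist) & TMASK;
    int b = taken(bimodal[bi]), g = taken(gshare[gi]);
    if (b != g)   /* train the chooser only where the two tables disagree */
        chooser[bi] = bump(chooser[bi], g == outcome);
    bimodal[bi] = bump(bimodal[bi], outcome);
    gshare[gi]  = bump(gshare[gi], outcome);
    ghist = (ghist << 1) | (unsigned)outcome;
}

int main(void) {
    int correct = 0;
    for (int i = 0; i < 1000; i++) {   /* alternating branch: history wins */
        int outcome = i & 1;
        correct += (predict(0x4000) == outcome);
        update(0x4000, outcome);
    }
    printf("alternating branch: %d/1000 correct\n", correct);
    return 0;
}
```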

  13. Branch Prediction • Branch-to-link-register (bclr) and branch-to-count-register (bcctr) targets are predicted using a return address stack and a count cache mechanism • Absolute and relative branch targets are computed directly in the branch scan function • Branches are entered in the branch information queue (BIQ) and deallocated in program order
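
The return address stack behind bclr prediction is simple enough to sketch directly: push the return address on a branch-and-link (call), pop it to predict the target of the matching bclr. The 16-entry depth below is an assumption for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define RAS_DEPTH 16   /* illustrative depth */

static uint64_t ras[RAS_DEPTH];
static int top;        /* wraps: very deep call chains overwrite old entries */

/* On a branch-and-link (call): push the return address. */
void ras_push(uint64_t return_addr) {
    ras[top] = return_addr;
    top = (top + 1) % RAS_DEPTH;
}

/* On a bclr (return): pop the predicted target. */
uint64_t ras_pop(void) {
    top = (top - 1 + RAS_DEPTH) % RAS_DEPTH;
    return ras[top];
}

int main(void) {
    ras_push(0x10000008);  /* call from 0x10000004 */
    ras_push(0x10000208);  /* nested call          */
    printf("predicted return: 0x%llx\n", (unsigned long long)ras_pop());
    printf("predicted return: 0x%llx\n", (unsigned long long)ras_pop());
    return 0;
}
```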

  14. Instruction Grouping • Separate instruction buffers for each thread • 24 instructions / buffer • 5 instructions taken from one thread’s buffer form an instruction group • All instructions in a group are decoded in parallel

  15. Group Dispatch & Register Renaming • When all resources necessary for group are available, group is dispatched (GD) • D0 – GD: instructions still in program order • MP – register renaming, registers mapped to physical registers • Register files shared dynamically by two threads • In ST mode all registers are available to single thread • Placed in shared issue queues
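
A minimal sketch of the mapping (MP) stage, assuming a shared pool of 120 physical registers (a commonly cited POWER5 figure, not stated on the slide): each thread keeps its own architected-to-physical map, destinations pull from a common free list, and dispatch stalls when the free list empties. In ST mode one thread would simply own the whole pool.

```c
#include <stdio.h>

#define NUM_ARCH 32    /* architected GPRs per thread */
#define NUM_PHYS 120   /* physical registers shared by both threads (assumed) */

static int map[2][NUM_ARCH];   /* per-thread arch -> phys mapping */
static int free_list[NUM_PHYS];
static int free_top;

/* Rename one destination register: grab a free physical register and
 * point the thread's mapping at it; -1 means dispatch must stall. */
int rename_dest(int thread, int arch_reg) {
    if (free_top == 0) return -1;
    int phys = free_list[--free_top];
    map[thread][arch_reg] = phys;
    return phys;
}

/* Source operands just read the current mapping. */
int rename_src(int thread, int arch_reg) { return map[thread][arch_reg]; }

int main(void) {
    /* initially thread t's arch reg r maps to phys t*32+r; the rest are free */
    for (int t = 0; t < 2; t++)
        for (int r = 0; r < NUM_ARCH; r++) map[t][r] = t * NUM_ARCH + r;
    for (int p = 2 * NUM_ARCH; p < NUM_PHYS; p++) free_list[free_top++] = p;

    /* thread 0: add r3, r4, r5 -> sources read the map, dest is renamed */
    int s1 = rename_src(0, 4), s2 = rename_src(0, 5);
    int d = rename_dest(0, 3);
    printf("p%d <- p%d + p%d\n", d, s1, s2);
    return 0;
}
```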

  16. Group Tracking • Instructions are tracked as groups to simplify the tracking logic • Control information is placed in the global completion table (GCT) at dispatch • Entries are allocated in program order, but the two threads’ entries may be intermingled • Entries in the GCT are deallocated when the group is committed (see the sketch after slide 19)

  17. Load/Store Reorder Queues • Load reorder queue (LRQ) and store reorder queue (SRQ) maintain program order of loads/stores within a thread • Allow for checking of address conflicts between loads and stores
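
The address-conflict check can be sketched as a search of the SRQ when a load executes: any older store to the same address is a conflict (the load must wait or forward), and an older store whose address is not yet computed makes the load unsafe to complete. This is a simplified model of the hazard check, not the exact hardware algorithm.

```c
#include <stdint.h>
#include <stdio.h>

#define QSIZE 16   /* 16 real entries per thread (slide 23) */

typedef struct {
    int valid;
    int age;            /* program order within the thread */
    uint64_t addr;
    int addr_known;     /* address already computed?       */
} SrqEntry;

static SrqEntry srq[QSIZE];

/* Check a load against all older stores in its thread's SRQ.
 * Returns 1 = conflict with a known older store, -1 = an older store's
 * address is still unknown (unsafe), 0 = no conflict. */
int load_check(int load_age, uint64_t load_addr) {
    int unsafe = 0;
    for (int i = 0; i < QSIZE; i++) {
        if (!srq[i].valid || srq[i].age >= load_age) continue;
        if (!srq[i].addr_known) unsafe = 1;
        else if (srq[i].addr == load_addr) return 1;
    }
    return unsafe ? -1 : 0;
}

int main(void) {
    srq[0] = (SrqEntry){1, 10, 0x1000, 1};   /* older store to 0x1000 */
    printf("load @0x1000: %d (conflict)\n", load_check(12, 0x1000));
    printf("load @0x2000: %d (clear)\n",    load_check(12, 0x2000));
    return 0;
}
```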

  18. Instruction Issue • No distinction made between instructions for different threads • No priority difference between threads • Independent of GCT group of instruction • Up to 8 instructions can issue per cycle • Instructions then flow through execution units and write back stage

  19. Group Commit • Group commit (CP) happens when • all instructions in the group have executed without exceptions, and • the group is the oldest group in its thread • One group can commit per cycle from each thread
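
Slides 16 and 19 together describe the GCT's life cycle: allocate an entry per group at dispatch, let entries from the two threads intermingle, and deallocate when the oldest fully executed, exception-free group in a thread commits. A toy C model follows; the entry count and the linear scans are illustrative, as the hardware tracks order differently.

```c
#include <stdio.h>

#define GCT_ENTRIES 20   /* size here is illustrative, not the real count */

typedef struct {
    int valid;
    int thread;       /* entries of the two threads intermingle freely */
    int age;          /* dispatch order, so we can find the oldest group */
    int pending;      /* instructions in the group not yet executed */
    int exception;    /* any instruction flagged an exception */
} GctEntry;

static GctEntry gct[GCT_ENTRIES];
static int next_age;

/* Dispatch allocates any free entry; program order is kept via 'age'. */
int gct_dispatch(int thread, int group_size) {
    for (int i = 0; i < GCT_ENTRIES; i++)
        if (!gct[i].valid) {
            gct[i] = (GctEntry){1, thread, next_age++, group_size, 0};
            return i;                 /* group tag */
        }
    return -1;                        /* GCT full: dispatch stalls */
}

/* Commit rule (slide 19): the group must be the oldest in its thread,
 * fully executed, and exception-free. One group per thread per cycle. */
int gct_try_commit(int thread) {
    int oldest = -1;
    for (int i = 0; i < GCT_ENTRIES; i++)
        if (gct[i].valid && gct[i].thread == thread &&
            (oldest < 0 || gct[i].age < gct[oldest].age))
            oldest = i;
    if (oldest >= 0 && gct[oldest].pending == 0 && !gct[oldest].exception) {
        gct[oldest].valid = 0;        /* deallocate on commit */
        return oldest;
    }
    return -1;
}

int main(void) {
    int a = gct_dispatch(0, 5);       /* thread 0, group of 5 */
    int b = gct_dispatch(1, 5);       /* thread 1 entry intermingles */
    gct[a].pending = 0;               /* pretend all 5 executed */
    printf("thread 0 commits group %d\n", gct_try_commit(0));
    printf("thread 1 commit attempt: %d (still pending)\n", gct_try_commit(1));
    (void)b;
    return 0;
}
```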

  20. Enhancements to Support SMT • Instruction and data caches are the same size as POWER4, but associativity doubles to 2-way and 4-way respectively • IC and DC entries can be fully shared between threads

  21. Enhancements to Support SMT • Two-step address translation • Effective address → virtual address using a 64-entry segment lookaside buffer (SLB) • Virtual address → physical address using a hashed page table, cached in a 1024-entry 4-way set-associative TLB • Two first-level translation tables (instruction, data) • SLB and TLB only used in case of a first-level miss

  22. Enhancements to Support SMT • First-level data translation table – fully associative, 128 entries • First-level instruction translation table – 2-way set associative, 128 entries • Entries in both tables are tagged with the thread number and not shared between threads • Entries in the TLB can be shared between threads
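
A sketch of the two-step walk behind a first-level miss, under stated assumptions: 4 KB pages, a direct-mapped 128-entry first-level table (the real data-side table is fully associative), and invented slb_lookup/tlb_lookup stand-ins for the SLB and the hashed page table. Note the thread tag on first-level entries, per slide 22.

```c
#include <stdint.h>
#include <stdio.h>

#define ERAT_ENTRIES 128   /* first-level table size (slide 22) */
#define PAGE_SHIFT   12    /* assume 4 KB pages for the sketch  */

typedef struct {
    int valid;
    int thread;            /* first-level entries are thread-tagged */
    uint64_t epage, ppage; /* effective and physical page numbers   */
} EratEntry;

static EratEntry erat[ERAT_ENTRIES];

/* Invented stand-ins for the SLB (EA -> VA) and for the TLB backed by
 * the hashed page table (VA -> PA); not the real lookup functions. */
static uint64_t slb_lookup(uint64_t ea) { return ea ^ 0x100000000ULL; }
static uint64_t tlb_lookup(uint64_t va) { return (va >> PAGE_SHIFT) & 0xFFFFF; }

uint64_t translate(int thread, uint64_t ea) {
    uint64_t epage = ea >> PAGE_SHIFT;
    uint64_t off   = ea & ((1u << PAGE_SHIFT) - 1);
    int idx = (int)(epage % ERAT_ENTRIES);

    /* Fast path: first-level hit, which must also match the thread tag. */
    if (erat[idx].valid && erat[idx].thread == thread && erat[idx].epage == epage)
        return (erat[idx].ppage << PAGE_SHIFT) | off;

    uint64_t va    = slb_lookup(ea);     /* step 1: segment (SLB)  */
    uint64_t ppage = tlb_lookup(va);     /* step 2: page (TLB/HPT) */
    erat[idx] = (EratEntry){1, thread, epage, ppage};
    return (ppage << PAGE_SHIFT) | off;
}

int main(void) {
    printf("PA = 0x%llx\n", (unsigned long long)translate(0, 0x12345678ULL));
    printf("PA = 0x%llx (first-level hit)\n",
           (unsigned long long)translate(0, 0x12345ABCULL));
    return 0;
}
```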

  23. Enhancements to Support SMT • LRQ and SRQ for each thread, 16 entries each • But threads can run out of queue space – add 32 virtual entries, 16 per thread • Virtual entries contain enough information to identify the instruction, but not the address for the load/store • A low-cost way to extend the LRQ/SRQ without stalling instruction dispatch (see the sketch below)
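
A minimal sketch of the virtual-entry idea: dispatch prefers one of the 16 real entries, falls back to a virtual entry that records only an instruction tag, and promotes the oldest virtual entry into a real slot when one frees. Only dispatch beyond 32 outstanding operations would then stall.

```c
#include <stdint.h>
#include <stdio.h>

#define REAL_ENTRIES 16    /* real LRQ/SRQ entries per thread (slide 23) */
#define VIRT_ENTRIES 16    /* virtual entries per thread (slide 23)      */

/* A real entry holds the full address; a virtual entry records only
 * enough to identify the instruction (here: a tag). */
typedef struct { int used; uint64_t addr; int itag; } RealEntry;
typedef struct { int used; int itag; } VirtEntry;

static RealEntry real_q[REAL_ENTRIES];
static VirtEntry virt_q[VIRT_ENTRIES];

/* Dispatch prefers a real entry and falls back to a virtual one so
 * dispatch need not stall; -1 means even virtual entries ran out. */
int lsq_dispatch(int itag) {
    for (int i = 0; i < REAL_ENTRIES; i++)
        if (!real_q[i].used) { real_q[i] = (RealEntry){1, 0, itag}; return 0; }
    for (int i = 0; i < VIRT_ENTRIES; i++)
        if (!virt_q[i].used) { virt_q[i] = (VirtEntry){1, itag}; return 0; }
    return -1;
}

/* When a real entry frees, promote the oldest waiting virtual entry;
 * only then can that instruction issue and compute its address. */
void lsq_release(int slot) {
    for (int i = 0; i < VIRT_ENTRIES; i++)
        if (virt_q[i].used) {
            real_q[slot] = (RealEntry){1, 0, virt_q[i].itag};
            virt_q[i].used = 0;
            return;
        }
    real_q[slot].used = 0;
}

int main(void) {
    for (int i = 0; i < 20; i++) lsq_dispatch(i);  /* 16 real + 4 virtual */
    lsq_release(3);                    /* frees a slot: itag 16 promoted */
    printf("slot 3 now holds itag %d\n", real_q[3].itag);
    return 0;
}
```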

  24. Enhancements to Support SMT • Branch Information Queue (BIQ) • 16 entries (same as POWER4) • Split in half for SMT mode • Performance modeling suggested this was a sufficient solution • Load Miss Queue (LMQ) • 8 entries (same as POWER4) • Added thread bit to allow dynamic sharing

  25. Enhancements to Support SMT • Dynamic resource balancing • Resource-balancing logic monitors resources (e.g. the GCT and LMQ) to determine if one thread exceeds a threshold • The offending thread can be throttled back to allow its sibling to continue to make progress • Methods of throttling • Reduce thread priority (thread is using too many GCT entries) • Inhibit instruction decoding until congestion clears (thread incurs too many L2 cache misses) • Flush all of the thread’s instructions waiting for dispatch and stop the thread from decoding until congestion clears (thread is executing an instruction that takes a long time to complete)
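
The balancing decision reduces to a priority of responses, roughly as below. Thresholds and inputs are invented for the sketch; the slide names the three responses but not the trigger values.

```c
#include <stdio.h>

#define GCT_THRESHOLD    12  /* illustrative, not the hardware value */
#define L2MISS_THRESHOLD  4  /* illustrative, not the hardware value */

enum Action { NONE, LOWER_PRIORITY, INHIBIT_DECODE, FLUSH_AND_HOLD };

/* Map each congestion symptom to the response slide 25 names for it. */
enum Action balance(int gct_used, int l2_misses, int long_op_running) {
    if (long_op_running)                return FLUSH_AND_HOLD;
    if (l2_misses > L2MISS_THRESHOLD)   return INHIBIT_DECODE;
    if (gct_used  > GCT_THRESHOLD)      return LOWER_PRIORITY;
    return NONE;
}

int main(void) {
    printf("%d\n", balance(15, 0, 0));  /* GCT hog        -> lower priority */
    printf("%d\n", balance(5, 6, 0));   /* L2 miss storm  -> inhibit decode */
    printf("%d\n", balance(5, 0, 1));   /* long operation -> flush and hold */
    return 0;
}
```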

  26. Enhancements to Support SMT • Thread priority • Supports 8 levels of priority • 0 = not running • 1 = lowest, 7 = highest • Give the thread with higher priority additional decode cycles • Both threads at lowest priority → power-saving mode
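
The slide says only that higher priority earns extra decode cycles; one illustrative policy (not necessarily the hardware's exact ratio) gives the lower-priority thread one decode cycle in every 2^(diff+1).

```c
#include <stdio.h>

/* Decide which thread gets the decode stage this cycle. The 2^(diff+1)
 * ratio below is an assumed policy for illustration. */
int decode_owner(int cycle, int prio0, int prio1) {
    if (prio0 == 0) return 1;              /* priority 0 = not running */
    if (prio1 == 0) return 0;
    if (prio0 == prio1) return cycle & 1;  /* equal priority: alternate */
    int diff = prio0 > prio1 ? prio0 - prio1 : prio1 - prio0;
    int low  = prio0 < prio1 ? 0 : 1;
    int period = 1 << (diff + 1);          /* low thread: 1 cycle in 2^(d+1) */
    return (cycle % period == 0) ? low : 1 - low;
}

int main(void) {
    /* thread 0 at priority 6, thread 1 at priority 3 */
    int share[2] = {0, 0};
    for (int c = 0; c < 1000; c++) share[decode_owner(c, 6, 3)]++;
    printf("decode cycles: thread0=%d thread1=%d\n", share[0], share[1]);
    return 0;
}
```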

  27. Single Threaded Mode • All rename registers, issue queues, LRQ, and SRQ are available to the active thread • Allows higher performance than POWER4 at equivalent frequencies • Software can change processor dynamically between single threaded and SMT mode

  28. RAS of POWER4 • High availability in POWER4 • Minimize component failure rates • Designed using techniques that permit hard and soft failure detection, recovery, isolation, repair deferral, and component replacement while system is operating • Fault tolerant techniques used for array, logic, storage, and I/O systems • Fault isolation and recovery

  29. RAS of POWER5 • Same techniques as POWER4 • New emphasis on reducing scheduled outages to further improve system availability • Firmware upgrades on a running machine • ECC on all system interconnects • Single-bit interconnect failures dynamically corrected • Deferred repair scheduled for persistent failures • The source of errors can be determined – on a non-recoverable error the system is taken down, the book containing the fault is taken offline, and the system is rebooted, with no human intervention • Thermal protection sensors

  30. Dynamic Power Management • Reduce switching power • Clock gating • Reduce leakage power • Minimal use of low-threshold transistors • Low-power mode • Two-stage fix for excess heat • Stage 1: alternate stalls and execution until the chip cools • Stage 2: clock throttling

  31. Effects of dynamic power management with and without simultaneous multithreading enabled. Photographs were taken with a heat-sensitive camera while a prototype POWER5 chip was undergoing tests in the laboratory.

  32. Memory Subsystem • Memory controller and L3 directory moved on-chip • Interfaces with DDR1 or DDR2 memory • Error correction/detection handled by ECC • Memory scrubbing for “soft errors” • Error correction while idle
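
Memory scrubbing is easy to illustrate with a toy single-error-correcting Hamming(7,4) code: a background pass re-reads each word, recomputes the syndrome, and rewrites any word with a flipped bit before a second error can accumulate. Real POWER5 ECC protects much wider words, but the principle is the same.

```c
#include <stdint.h>
#include <stdio.h>

static int bit(uint8_t v, int i) { return (v >> i) & 1; }

/* Encode 4 data bits into a 7-bit Hamming codeword (bit i of the
 * codeword is Hamming position i+1; parity at positions 1, 2, 4). */
uint8_t ham_encode(uint8_t d) {
    int d0 = bit(d,0), d1 = bit(d,1), d2 = bit(d,2), d3 = bit(d,3);
    int p1 = d0 ^ d1 ^ d3;               /* covers positions 1,3,5,7 */
    int p2 = d0 ^ d2 ^ d3;               /* covers positions 2,3,6,7 */
    int p4 = d1 ^ d2 ^ d3;               /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2<<1) | (d0<<2) | (p4<<3) |
                     (d1<<4) | (d2<<5) | (d3<<6));
}

/* Recompute the syndrome; a nonzero value names the flipped position.
 * Corrects the codeword in place and returns the decoded data bits. */
uint8_t ham_correct(uint8_t *cw, int *fixed) {
    uint8_t c = *cw;
    int s1 = bit(c,0)^bit(c,2)^bit(c,4)^bit(c,6);
    int s2 = bit(c,1)^bit(c,2)^bit(c,5)^bit(c,6);
    int s4 = bit(c,3)^bit(c,4)^bit(c,5)^bit(c,6);
    int syndrome = s1 | (s2<<1) | (s4<<2);
    *fixed = 0;
    if (syndrome) { c ^= (uint8_t)(1 << (syndrome - 1)); *fixed = 1; *cw = c; }
    return (uint8_t)(bit(c,2) | (bit(c,4)<<1) | (bit(c,5)<<2) | (bit(c,6)<<3));
}

int main(void) {
    uint8_t mem[8];
    for (int i = 0; i < 8; i++) mem[i] = ham_encode((uint8_t)i);
    mem[5] ^= 1 << 3;                    /* inject a soft error */
    for (int i = 0; i < 8; i++) {        /* idle-time scrub pass */
        int fixed;
        uint8_t d = ham_correct(&mem[i], &fixed);
        if (fixed) printf("scrub: fixed word %d (data=%u)\n", i, d);
    }
    return 0;
}
```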

  33. Cache Sizes

  34. Cache Hierarchy

  35. Cache Hierarchy • Reads from memory are written into L2 • L2 and L3 are shared between the cores • L3 (36 MB) acts as a victim cache for L2 • A cache line is reloaded into L2 if there is a hit in L3 • Lines are written back to main memory when a dirty line is evicted from L3
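
The victim-cache flow on this slide maps to three rules: L2 evictions drop into L3, an L3 hit moves the line back into L2, and only dirty lines evicted from L3 go to memory. The toy model below uses tiny fully-associative caches and simple round-robin eviction purely for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define L2_LINES 4
#define L3_LINES 8

typedef struct { int valid, dirty; uint64_t tag; } Line;
static Line l2[L2_LINES], l3[L3_LINES];
static int l2_rr;                        /* round-robin L2 victim pick */

static int find(Line *c, int n, uint64_t tag) {
    for (int i = 0; i < n; i++)
        if (c[i].valid && c[i].tag == tag) return i;
    return -1;
}

/* L2 victims drop into L3; a dirty line pushed out of L3 is what
 * finally gets written back to memory (slide 35). */
static void l3_insert(Line victim) {
    for (int i = 0; i < L3_LINES; i++)
        if (!l3[i].valid) { l3[i] = victim; return; }
    if (l3[0].dirty)
        printf("writeback 0x%llx to memory\n", (unsigned long long)l3[0].tag);
    l3[0] = victim;                      /* toy eviction: slot 0 */
}

void l2_access(uint64_t tag, int is_write) {
    int i = find(l2, L2_LINES, tag);
    if (i < 0) {                         /* L2 miss */
        Line incoming = {1, 0, tag};     /* default: fetched from memory */
        int j = find(l3, L3_LINES, tag);
        if (j >= 0) {                    /* L3 hit: reload line into L2 */
            incoming = l3[j];
            l3[j].valid = 0;
        }
        i = l2_rr++ % L2_LINES;
        if (l2[i].valid) l3_insert(l2[i]);  /* evicted L2 line -> L3 */
        l2[i] = incoming;
    }
    if (is_write) l2[i].dirty = 1;
}

int main(void) {
    for (uint64_t t = 0; t < 5; t++) l2_access(t, t == 0); /* dirty line 0 */
    l2_access(0, 0);                     /* comes back from L3, still dirty */
    printf("line 0 dirty in L2: %d\n", l2[find(l2, L2_LINES, 0)].dirty);
    return 0;
}
```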

  36. Important Notes on Diagram • Three buses between the controller and the SMI chips • Address/command bus • Unidirectional write data bus (8 bytes) • Unidirectional read data bus (16 bytes) • Each bus operates at twice the DIMM speed

  37. Important Notes on Diagram • 2 or 4 SMI chips can be used • Each SMI can interface with two DIMMs • 2-SMI mode – 8-byte read, 2-byte write • 4-SMI mode – 4-byte read, 2-byte write
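
The widths above, combined with slide 36's twice-DIMM-speed clocking, give the peak bus bandwidth directly. The DIMM frequency below is an assumed example figure, not taken from the slides.

```c
#include <stdio.h>

/* Peak bandwidth of the buses on slide 36: 16-byte read bus and
 * 8-byte write bus, both clocked at twice the DIMM speed. */
int main(void) {
    double dimm_mhz = 266.0;            /* assumed example DIMM rate */
    double bus_mhz  = 2.0 * dimm_mhz;   /* buses run at 2x DIMM speed */
    double read_gbs  = 16.0 * bus_mhz * 1e6 / 1e9;
    double write_gbs =  8.0 * bus_mhz * 1e6 / 1e9;
    printf("read:  %.1f GB/s\n", read_gbs);   /* 16 B * 532 MHz */
    printf("write: %.1f GB/s\n", write_gbs);  /*  8 B * 532 MHz */
    return 0;
}
```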

  38. Size does matter (die photos: POWER5 vs. Pentium III)

  39. …compensating?

  40. Possible configurations • DCM (Dual Chip Module) • One POWER5 chip, one L3 chip • MCM (Multi Chip Module) • Four POWER5 chips, four L3 chips • Communication is handled by a Fabric Bus Controller (FBC) • “distributed switch”

  41. Typical Configurations • 2 MCMs form a “book” • 16-way symmetric multiprocessor • (appears 32-way, since each core runs two SMT threads) • DCM books are also used

  42. Fabric Bus Controller • “buffers and sequences operations among the L2/L3, the functional units of the memory subsystem, the fabric buses that interconnect POWER5 chips on the MCM, and the fabric buses that interconnect multiple MCMs” • Separate address and data buses to facilitate split transactions • Each transaction tagged to allow for out of order replies

  43. 16-way system built with eight dual-chip modules.

  44. Address Bus • Addresses are broadcast from MCM to MCM using a ring structure • Each chip forwards the address down the ring and to the other chip in its module • Forwarding ends when the originating chip receives the address back
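
The ring broadcast can be sketched as follows; the module and chip counts are chosen for illustration (configurations vary, per slides 40-43), and the hop is modeled per module rather than per chip.

```c
#include <stdio.h>

#define NUM_MCM 4          /* ring of four modules, for illustration */
#define CHIPS_PER_MCM 4    /* slide 40: four POWER5 chips per MCM    */

/* An address travels module to module around the ring; at each hop it
 * is also handed to the other chips on that module, and forwarding
 * stops once it arrives back at the originating module (slide 44). */
void broadcast_address(int origin, unsigned long addr) {
    int mcm = (origin + 1) % NUM_MCM;
    while (mcm != origin) {
        for (int chip = 0; chip < CHIPS_PER_MCM; chip++)
            printf("MCM %d chip %d snoops 0x%lx\n", mcm, chip, addr);
        mcm = (mcm + 1) % NUM_MCM;       /* forward down the ring */
    }
    printf("address back at MCM %d: broadcast complete\n", origin);
}

int main(void) {
    broadcast_address(0, 0x1000UL);
    return 0;
}
```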

  45. Response Bus • Includes coherency information gleaned from memory subsystem “snooping” • One chip in MCM combines other three chips’ snoop responses with the previous MCM snoop response and forwards it on

  46. Response Bus • When the originating chip receives the responses, it transmits a “combined response” detailing the actions to be taken • Early combined response mechanism • Each MCM determines whether to send a cache line from its L2/L3 based on the previous snoop responses • Reduces cache-to-cache latency
