Section 2: The Technology
"Any sufficiently advanced technology is indistinguishable from magic." (Arthur C. Clarke)
Section Objectives
On completion of this unit you should be able to:
• Describe the relationship between technology and solutions.
• List key IBM technologies that are part of the POWER5 products.
• Describe the functional benefits that these technologies provide.
• Discuss the appropriate use of these technologies.
Concepts of Solution Design
IBM and Technology
[Diagram labels: Solutions, Products, Technology, Science]
Technology and innovation
• Having technology available is a necessary first step.
• Finding creative new ways to use the technology for the benefit of our clients is what innovation is about.
• Solution design is an opportunity for innovative application of technology.
When technology won’t ‘fix’ the problem
• When the technology is not related to the problem.
• When the client has unreasonable expectations.
POWER4 and POWER5 Cores
[Figure: POWER4 core and POWER5 core]
POWER5
• Designed for entry and high-end servers
• Enhanced memory subsystem
• Improved performance
• Simultaneous Multi-Threading
• Hardware support for Shared Processor Partitions (Micro-Partitioning)
• Dynamic power management
• Compatibility with existing POWER4 systems
• Enhanced reliability, availability, serviceability
[Chip diagram labels: enhanced distributed switch, two SMT cores, L3 directory, 1.9 MB L2 cache, memory controller, GX+, Chip-Chip / MCM-MCM / SMPLink]
Enhanced memory subsystem
• Improved L1 cache design
  • 2-way set associative i-cache
  • 4-way set associative d-cache
  • New replacement algorithm (LRU instead of FIFO)
• Larger L2 cache
  • 1.9 MB, 10-way set associative
• Improved L3 cache design
  • 36 MB, 12-way set associative
  • L3 on the processor side of the fabric
  • Satisfies L2 cache misses more frequently
  • Avoids traffic on the interchip fabric
• On-chip L3 directory and memory controller
  • L3 directory on the chip reduces off-chip delays after an L2 miss
  • Reduced memory latencies
• Improved pre-fetch algorithms
[Chip diagram labels: enhanced distributed switch, two SMT cores, L3 directory, memory controller, 1.9 MB L2 cache, Chip-Chip / MCM-MCM / SMPLink]
Enhanced memory subsystem
[Diagram: POWER4 system structure compared with POWER5 system structure. Components shown: processors, L2 cache, L3 directory, L3 cache, fabric controller, memory controller, memory]
• Reduced L3 latency
• Larger SMPs (64-way)
• Faster access to memory
• Number of chips cut in half
Simultaneous Multi-Threading (SMT)
• What is it?
• Why would I want it?
POWER4 pipeline
[Pipeline diagram: instruction fetch; instruction crack and group formation; out-of-order processing in the branch, load/store, fixed-point, and floating-point pipelines; branch redirects; interrupts and flushes]
POWER4 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit)
Multi-threading evolution
• Execution unit utilization is low in today’s microprocessors
• Average execution unit utilization is about 25% across a broad spectrum of environments
[Diagram: instruction streams from memory and the i-cache occupying execution units FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL over processor cycles]
Coarse-grained multi-threading
• Two instruction streams, but only one thread executes at any instant
• Hardware swaps in the second thread when a long-latency event occurs
• Swap requires several cycles
[Diagram: two instruction streams from memory and the i-cache occupying execution units FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL over processor cycles, with swap points marked]
Coarse-grained multi-threading (Cont.)
• Processor (for example, RS64-IV) is able to store context for two threads
  • Rapid switching between threads minimizes lost cycles due to I/O waits and cache misses
  • Can yield ~20% improvement for OLTP workloads
• Coarse-grained multi-threading is only beneficial where the number of active threads exceeds twice the number of CPUs
  • AIX must create a “dummy” thread if there are not enough real threads
  • Unnecessary switches to “dummy” threads can degrade performance by ~20%
  • Does not work with dynamic CPU deallocation
Fine-grained multi-threading
• Variant of coarse-grained multi-threading
• Thread execution in round-robin fashion
• Cycle remains unused when a thread encounters a long-latency event
[Diagram: instruction streams from memory and the i-cache occupying execution units FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL over processor cycles]
POWER5 pipeline
[Pipeline diagram: instruction fetch; instruction crack and group formation; out-of-order processing in the branch, load/store, fixed-point, and floating-point pipelines; branch redirects; interrupts and flushes]
POWER5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit)
Simultaneous multi-threading (SMT)
• Fewer idle execution units yield a 25-40% performance boost, and sometimes more
[Diagram: instruction streams from memory and the i-cache occupying execution units FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL over processor cycles]
Simultaneous multi-threading (SMT) (Cont.)
• Each chip appears as a 4-way SMP to software
  • Allows instructions from two threads to execute simultaneously
• Processor resources optimized for enhanced SMT performance
  • No context switching, no dummy threads
• Hardware, POWER Hypervisor, or OS controlled thread priority
  • Dynamic feedback of shared resources allows for balanced thread execution
• Dynamic switching between single and multithreaded mode
Dynamic resource balancing
• Threads share many resources
  • Global Completion Table, Branch History Table, Translation Lookaside Buffer, and so on
• Higher performance realized when resources are balanced across threads
  • Tendency to drift toward extremes accompanied by reduced performance
Adjustable thread priority
• Instances when unbalanced execution is desirable
  • No work for opposite thread
  • Thread waiting on lock
  • Software-determined non-uniform balance
  • Power management
• Control instruction decode rate
  • Software/hardware controls eight priority levels for each thread
[Chart: Hardware thread priorities. Instructions per cycle (Thread 0 IPC, Thread 1 IPC) for the Thread 0 priority, Thread 1 priority pairs 0,7 2,7 4,7 6,7 7,7 7,6 7,4 7,2 7,0 1,1, with regions labeled single-threaded operation and power save mode]
Single-threaded operation
• Advantageous for execution unit limited applications
  • Floating or fixed point intensive workloads
• Execution unit limited applications provide minimal performance leverage for SMT
  • Extra resources necessary for SMT provide higher performance benefit when dedicated to a single thread
• Determined dynamically on a per processor basis
[Diagram: thread states Active, Dormant, and Null, with transitions triggered by software or by hardware or software]
Micro-Partitioning overview
• Mainframe inspired technology
• Virtualized resources shared by multiple partitions
• Benefits
  • Finer grained resource allocation
  • More partitions (up to 254)
  • Higher resource utilization
• New partitioning model
  • POWER Hypervisor
  • Virtual processors
  • Fractional processor capacity partitions
  • Operating system optimized for Micro-Partitioning exploitation
  • Virtual I/O
Processor terminology
[Diagram: processor terms illustrated for a shared processor partition with SMT off, a shared processor partition with SMT on, and a dedicated processor partition with SMT off. Labels: logical (SMT), virtual, shared, dedicated, inactive (CUoD), deconfigured, entitled capacity, shared processor pool, installed physical processors]
Shared processor partitions
• Micro-Partitioning allows multiple partitions to share one physical processor
  • Up to 10 partitions per physical processor
  • Up to 254 partitions active at the same time
• Partition’s resource definition
  • Minimum, desired, and maximum values for each resource
  • Processor capacity
  • Virtual processors
  • Capped or uncapped
  • Capacity weight
  • Dedicated memory: minimum of 128 MB, in 16 MB increments
  • Physical or virtual I/O resources
[Diagram: LPAR 1 through LPAR 6 sharing physical processors]
Understanding min/max/desired resource values
• The desired value for a resource is given to a partition if enough resource is available.
• If there is not enough resource to meet the desired value, then a lower amount is allocated.
• If there is not enough resource to meet the minimum value, the partition will not start.
• The maximum value is only used as an upper limit for dynamic partitioning operations.
Partition capacity entitlement
• Processing units
  • 1.0 processing unit represents one physical processor
• Entitled processor capacity
  • Commitment of capacity that is reserved for the partition
  • Sets the upper limit of processor utilization for capped partitions
  • Each virtual processor must be granted at least 1/10 of a processing unit of entitlement
• Shared processor capacity is always delivered in terms of whole physical processors
[Diagram: one physical processor (1.0 processing units of processing capacity) divided into entitlements of 0.5 and 0.4 processing units; minimum requirement 0.1 processing units]
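The 0.1 processing unit minimum per virtual processor limits how many virtual processors a given entitlement can support. The following is a minimal Python sketch of that check, not an IBM tool or API. The function names are hypothetical, and capacity is held as exact fractions so that values such as 0.1 do not suffer floating-point rounding.

```python
from fractions import Fraction

# From the slide: each virtual processor needs at least 1/10 processing unit.
MIN_UNITS_PER_VP = Fraction(1, 10)


def max_virtual_processors(entitlement: str) -> int:
    """Largest VP count that an entitlement (in processing units) can support."""
    return int(Fraction(entitlement) / MIN_UNITS_PER_VP)


def is_valid_configuration(entitlement: str, virtual_processors: int) -> bool:
    """True if every VP would receive at least 0.1 processing units."""
    if virtual_processors < 1:
        return False
    return Fraction(entitlement) / virtual_processors >= MIN_UNITS_PER_VP
```

For example, an entitlement of 0.4 processing units can back at most four virtual processors under this rule.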
Capped and uncapped partitions
• Capped partition
  • Not allowed to exceed its entitlement
• Uncapped partition
  • Is allowed to exceed its entitlement
• Capacity weight
  • Used for prioritizing uncapped partitions
  • Value 0-255
  • Value of 0 referred to as a “soft cap”
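One way to picture how capacity weights prioritize uncapped partitions is a proportional split of spare pool capacity. The sketch below assumes that rule for illustration only; the slide says weights are used for prioritizing but does not give the exact algorithm. The function name is hypothetical.

```python
from fractions import Fraction
from typing import Dict


def share_spare_capacity(spare: str, weights: Dict[str, int]) -> Dict[str, Fraction]:
    """Split spare capacity among uncapped partitions in proportion to weight.

    Proportional sharing is an assumption made for this sketch.
    """
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be in the range 0-255")
    total = sum(weights.values())
    if total == 0:
        # Every partition has weight 0 ("soft cap"): nobody receives extra capacity.
        return {name: Fraction(0) for name in weights}
    return {name: Fraction(spare) * weight / total for name, weight in weights.items()}
```

Under this assumed rule a partition with weight 0 never receives spare capacity, which matches the "soft cap" description.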
Partition capacity entitlement example
• Shared pool has 2.0 processing units available
• LPARs activated in sequence
  • Partition 1 activated
    • Min = 1.0, max = 2.0, desired = 1.5
    • Starts with 1.5 allocated processing units
  • Partition 2 activated
    • Min = 1.0, max = 2.0, desired = 1.0
    • Does not start
  • Partition 3 activated
    • Min = 0.1, max = 1.0, desired = 0.8
    • Starts with 0.5 allocated processing units
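The activation sequence above can be reproduced with a short Python sketch. It applies the rules stated on the min/max/desired slide: grant the desired value if available, otherwise whatever remains, and refuse to start below the minimum. The function names are hypothetical, and the maximum value is omitted because it only bounds later dynamic operations.

```python
from fractions import Fraction
from typing import Optional


def activate(available: Fraction, minimum: str, desired: str) -> Optional[Fraction]:
    """Return the processing units granted at activation, or None if it cannot start."""
    lowest, wanted = Fraction(minimum), Fraction(desired)
    if available < lowest:
        return None                # below the minimum: the partition does not start
    return min(wanted, available)  # desired if possible, otherwise what is left


def activate_in_sequence(pool: str, partitions):
    """Activate (name, minimum, desired) tuples in order against a shared pool."""
    available = Fraction(pool)
    granted = {}
    for name, minimum, desired in partitions:
        units = activate(available, minimum, desired)
        granted[name] = units
        if units is not None:
            available -= units
    return granted


result = activate_in_sequence("2.0", [
    ("Partition 1", "1.0", "1.5"),
    ("Partition 2", "1.0", "1.0"),
    ("Partition 3", "0.1", "0.8"),
])
for name, units in result.items():
    print(name, "does not start" if units is None else f"starts with {float(units)}")
```

Partition 2 fails because only 0.5 processing units remain after Partition 1 starts, while Partition 3 accepts the remaining 0.5 because that still exceeds its minimum of 0.1.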
Understanding capacity allocation: an example
• A workload is run under different configurations.
• The size of the shared pool (number of physical processors) is fixed at 16.
• The capacity entitlement for the partition is fixed at 9.5.
• No other partitions are active.
Uncapped: 16 virtual processors
• 16 virtual processors.
• Uncapped.
• Can use all available resource.
• The workload requires 26 minutes to complete.
Uncapped: 12 virtual processors
• 12 virtual processors.
• Even though the partition is uncapped, it can only use 12 processing units.
• The workload now requires 27 minutes to complete.
Capped
• The partition is now capped and resource utilization is limited to the capacity entitlement of 9.5.
• Capping limits the amount of time each virtual processor is scheduled.
• The workload now requires 28 minutes to complete.
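The three configurations differ only in the ceiling on processing units the partition can consume. A minimal sketch of that ceiling follows, assuming as the example does that no other partitions are active; the function name is hypothetical.

```python
def usable_capacity(pool_processors: int, virtual_processors: int,
                    entitlement: float, capped: bool) -> float:
    """Upper bound on processing units a lone partition can consume."""
    if capped:
        # A capped partition may not exceed its capacity entitlement.
        return entitlement
    # An uncapped partition is still limited by its virtual processor count,
    # since each VP occupies at most one physical processor at a time.
    return float(min(pool_processors, virtual_processors))


print(usable_capacity(16, 16, 9.5, capped=False))  # uncapped, 16 VPs
print(usable_capacity(16, 12, 9.5, capped=False))  # uncapped, 12 VPs
print(usable_capacity(16, 12, 9.5, capped=True))   # capped
```

The ceilings are 16, 12, and 9.5 processing units respectively, which lines up with the ordering of the run times reported on the slides.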
Dynamic partitioning operations
• Add, move, or remove processor capacity
  • Remove, move, or add entitled shared processor capacity
  • Change between capped and uncapped processing
  • Change the weight of an uncapped partition
  • Add and remove virtual processors
    • Provided CE / VP ≥ 0.1
• Add, move, or remove memory
  • 16 MB logical memory block
• Add, move, or remove physical I/O adapter slots
• Add or remove virtual I/O adapter slots
• Min/max values defined for LPARs set the bounds within which DLPAR can work
Dynamic LPAR
• Standard on all new systems
• Move resources between live partitions
[Diagram: partitions Part#1 through Part#4 (Production, Legacy Apps, Test/Dev, File/Print) running AIX 5L and Linux on the Hypervisor, managed from the HMC]
Firmware POWER Hypervisor
POWER Hypervisor strategy
• New Hypervisor for POWER5 systems
  • Further convergence with iSeries
  • But brands will retain unique value propositions
  • Reduced development effort
  • Faster time to market
• New capabilities on pSeries servers
  • Shared processor partitions
  • Virtual I/O
• New capability on iSeries servers
  • Can run AIX 5L
POWER Hypervisor component sourcing
[Diagram: Hypervisor components sourced from pSeries and iSeries. Labels: H-Call interface, location codes, nucleus (SLIC), virtual I/O, load from flash, bus recovery, dump, drawer concurrent maintenance, slot/tower concurrent maintenance, message passing, 255 partitions, shared processor LPAR, NVRAM, partition on demand, Capacity on Demand, I/O configuration, virtual Ethernet, FSP, SCSI IOA, LAN IOA, VLAN IOA, VLAN, HMC, HSC]
POWER Hypervisor functions
• Same functions as POWER4 Hypervisor
  • Dynamic LPAR
  • Capacity Upgrade on Demand
• New, active functions
  • Dynamic Micro-Partitioning
  • Shared processor pool
  • Virtual I/O
  • Virtual LAN
• Machine is always in LPAR mode
  • Even with all resources dedicated to one OS
[Diagram: four POWER5 chips (CPU 0 through CPU 3) forming shared processor pools, with callouts for Dynamic Micro-Partitioning, virtual I/O (disk, LAN), Dynamic LPAR, and Capacity Upgrade on Demand (planned versus actual client capacity growth)]
POWER Hypervisor implementation
• Design enhancements to the previous POWER4 implementation enable the sharing of processors by multiple partitions
  • Hypervisor decrementer (HDECR)
  • New Processor Utilization Resource Register (PURR)
  • Refined virtual processor objects
    • Do not include physical characteristics of the processor
  • New Hypervisor calls
POWER Hypervisor processor dispatch
• Manages a set of processors on the machine (shared processor pool).
• POWER5 generates a 10 ms dispatch window.
  • Minimum allocation is 1 ms per physical processor.
• Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window.
  • ms/VP = CE * 10 / VPs
• The partition entitlement is evenly distributed among the online virtual processors.
• Once a capped partition has received its CE within a dispatch interval, it becomes not-runnable.
• A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval.
[Diagram: POWER Hypervisor’s processor dispatch. Virtual processor capacity entitlement for six shared processor partitions, dispatched onto a shared processor pool of four POWER5 chips (CPU 0 through CPU 3)]
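The formula ms/VP = CE * 10 / VPs can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and exact fractions are used so that entitlements such as 0.1 do not pick up rounding error.

```python
from fractions import Fraction

DISPATCH_WINDOW_MS = 10  # from the slide: POWER5 generates a 10 ms dispatch window


def ms_per_virtual_processor(entitlement: str, virtual_processors: int) -> Fraction:
    """Guaranteed milliseconds per VP in each dispatch window (CE * 10 / VPs)."""
    if virtual_processors < 1:
        raise ValueError("a partition needs at least one virtual processor")
    return Fraction(entitlement) * DISPATCH_WINDOW_MS / virtual_processors


print(float(ms_per_virtual_processor("0.8", 2)))  # CE 0.8 spread over 2 VPs
print(float(ms_per_virtual_processor("0.5", 5)))  # CE 0.5 spread over 5 VPs
```

The second case lands exactly on the 1 ms minimum allocation, which corresponds to the 0.1 processing unit floor per virtual processor.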
Dispatching and interrupt latencies
• Virtual processors have dispatch latency.
  • Dispatch latency is the time between a virtual processor becoming runnable and actually being dispatched.
• Timers have latency issues also.
• External interrupts have latency issues also.
Shared processor pool
• Processors not associated with dedicated processor partitions.
• No fixed relationship between virtual processors and physical processors.
• The POWER Hypervisor attempts to use the same physical processor.
  • Affinity scheduling
  • Home node
[Diagram: POWER Hypervisor’s processor dispatch. Virtual processor capacity entitlement for six shared processor partitions, dispatched onto a shared processor pool of four POWER5 chips (CPU 0 through CPU 3)]
Affinity scheduling
• When dispatching a VP, the POWER Hypervisor attempts to preserve affinity by using:
  • Same physical processor as before, or
  • Same chip, or
  • Same MCM
• When a physical processor becomes idle, the POWER Hypervisor looks for a runnable VP that:
  • Has affinity for it, or
  • Has no affinity for any processor, or
  • Is uncapped
• Similar to AIX affinity scheduling
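The preference order for placing a virtual processor (same processor, then same chip, then same MCM) can be sketched as a simple search. This is an illustration under stated assumptions, not the Hypervisor's algorithm: the topology map, the function name, the requirement that chip ids be unique across the machine, and the fallback to the first idle processor are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical topology: physical processor id -> (chip id, MCM id).
# Chip ids are assumed to be unique across the whole machine.
Topology = Dict[int, Tuple[int, int]]


def pick_processor(last: int, idle: List[int], topology: Topology) -> Optional[int]:
    """Choose an idle processor for a VP, preferring the one closest to where it last ran."""
    if not idle:
        return None
    if last in idle:
        return last                      # same physical processor as before
    last_chip, last_mcm = topology[last]
    for cpu in idle:
        if topology[cpu][0] == last_chip:
            return cpu                   # same chip
    for cpu in idle:
        if topology[cpu][1] == last_mcm:
            return cpu                   # same MCM
    return idle[0]                       # assumed fallback: no affinity left to preserve


topology = {0: (0, 0), 1: (0, 0), 2: (1, 0), 3: (1, 0), 4: (2, 1), 5: (2, 1)}
print(pick_processor(0, [4, 2, 1], topology))  # processor 1 shares a chip with processor 0
```

Searching in widening circles keeps a virtual processor near the caches and memory it used last, which is the point of preserving affinity.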
Operating system support
• Micro-Partitioning capable operating systems need to be modified to cede a virtual processor when they have no runnable work
  • Failure to do this results in wasted CPU resources
    • For example, a partition spends its CE waiting for I/O
  • Ceding results in better utilization of the pool
• Virtual processors may confer the remainder of their timeslice to another VP
  • For example, a VP holding a lock
• Virtual processors can be redispatched if they become runnable again during the same dispatch interval
Example
• LPAR1: capacity entitlement = 0.8 processing units; virtual processors = 2 (capped)
• LPAR2: capacity entitlement = 0.2 processing units; virtual processors = 1 (capped)
• LPAR3: capacity entitlement = 0.6 processing units; virtual processors = 3 (capped)
[Timeline: two consecutive POWER Hypervisor dispatch intervals (pass 1 and pass 2, 0 to 20 msec) on physical processors 0 and 1, showing the virtual processors of LPAR 1, LPAR 2, and LPAR 3 being dispatched, with idle periods on physical processor 0]
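The entitlements in this example can be turned into per-window time budgets to see why the timeline contains idle periods. The sketch below assumes the 10 ms dispatch window and the two physical processors shown in the timeline; the variable names are hypothetical.

```python
from fractions import Fraction

WINDOW_MS = 10           # dispatch window length
PHYSICAL_PROCESSORS = 2  # physical processors shown in the timeline

# name -> (capacity entitlement in processing units, number of virtual processors)
partitions = {"LPAR1": ("0.8", 2), "LPAR2": ("0.2", 1), "LPAR3": ("0.6", 3)}

# Each VP receives an equal share of its partition's entitlement per window.
per_vp_ms = {name: Fraction(ce) * WINDOW_MS / vps
             for name, (ce, vps) in partitions.items()}

# Total entitled time across all partitions, and what the pool has left over.
entitled_ms = sum(Fraction(ce) * WINDOW_MS for ce, _ in partitions.values())
idle_ms = PHYSICAL_PROCESSORS * WINDOW_MS - entitled_ms

for name, ms in per_vp_ms.items():
    print(f"{name}: {float(ms):.0f} ms per virtual processor per window")
print(f"Unclaimed pool time: {float(idle_ms):.0f} ms per window")
```

Because all three partitions are capped, the pool time that no entitlement claims stays idle instead of being handed to a busy partition.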
POWER Hypervisor and virtual I/O
• I/O operations without dedicating resources to an individual partition
• POWER Hypervisor’s virtual I/O related operations
  • Provide control and configuration structures for virtual adapter images required by the logical partitions
  • Operations that allow partitions controlled and secure access to physical I/O adapters in a different partition
  • The POWER Hypervisor does not own any physical I/O devices; they are owned by an I/O hosting partition
• I/O types supported
  • SCSI
  • Ethernet
  • Serial console
[Diagram labels: disk, LAN]
Performance monitoring and accounting
• CPU utilization is measured against CE.
  • An uncapped partition receiving more than its CE will record 100% but will be using more.
• SMT
  • Thread priorities compound the variable speed rate.
  • Twice as many logical CPUs.
• For accounting, an interval may be incorrectly allocated.
  • New hardware support is required.
• The Processor Utilization Resource Register (PURR) records actual clock ticks spent executing a partition.
  • Used by performance commands (for example, new flags) and accounting modules.
  • Third party tools will need to be modified.
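The first point, that utilization measured against CE saturates at 100% for an uncapped partition, can be shown with a small sketch. This is a simplified model of the behavior the slide describes, not an AIX command or API; the function names are hypothetical.

```python
def reported_utilization(consumed_units: float, entitlement: float) -> float:
    """Percent utilization measured against CE, saturating at 100%."""
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return min(100.0, 100.0 * consumed_units / entitlement)


def entitlement_consumed(consumed_units: float, entitlement: float) -> float:
    """Percent of CE actually consumed; may exceed 100% for uncapped partitions."""
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return 100.0 * consumed_units / entitlement


# An uncapped partition with CE = 0.5 that consumes 0.75 processing units:
print(reported_utilization(0.75, 0.5))   # what a CE-relative measure records
print(entitlement_consumed(0.75, 0.5))   # what the partition really used
```

The gap between the two figures is why tools need a measure of actual processor time, such as the PURR, rather than a percentage capped at the entitlement.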