Hyperthread Support in OpenVMS V8.3: What to do about Montecito?
Pre-Summary • We added some features to help you manage hyperthreads • SHOW CPU/BRIEF displays thread info • SET CPU/NOCOTHREAD • [SYSTEST]HTHREAD.EXE • We added some features to keep hyperthreads from hurting or confusing you • Scheduler change • Accounting change • You need to experiment with your own application mix to see if hyperthreads help you
Definitions of terms • Processor • A chip or package • Core • A ‘thing’ within a processor that physically executes programs • Hyperthread • A ‘thing’ within a core that logically executes programs • CPU • The OpenVMS abstraction for a ‘thing’ that executes programs • Thread of execution • Software concept of what a CPU executes
What is “Hyperthreading” vs “Dual Core”? • Both are features of new “Montecito” Itanium chips • Both abstracted as CPUs on OpenVMS • Very different in implementation
Dual Core • Two (nearly) complete CPUs on one chip • Think two older CPU chips glued together :-) • Separate cache, separate processing units, separate state. (Share bus interface) • Both cores executing simultaneously
Montecito Micrograph (figure) • Labeled features: 2-way multi-threading • 1MB L2I • Power management/frequency boost (Foxton) • Dual-core • 2x12MB L3 caches with Pellston • Soft error detection/correction • Arbiter
Hyperthreading • Hyperthread: A set of state (e.g. user registers, control registers, IP, etc) in a core • Shares execution resources with other threads • Only one hyperthread active (i.e. executing a program) at once on Montecito • When hyperthread blocks, other hyperthread activates • Also swaps on a timer
Montecito Multi-threading (timeline figure) • Serial Execution: Ai, Idle, Ai+1, Bi, Idle, Bi+1 • Montecito Multi-threaded Execution: Ai, Idle, Ai+1 overlapped with Bi, Bi+1 • Multi-threading decreases stalls and increases performance
Dynamic Thread Switching • Speculate that a long-latency event will stall execution • L3 miss • Uncached accesses • Timeouts ensure fairness • hint@pause gives software control • OS has no knowledge or control of hyperthread switches
Hyperthread Abstraction in VMS • Reminder: 1 processor (or package or chip) has • 2 Cores • 4 Threads • Each hyperthread appears in OpenVMS as a CPU • CPUs that share the same core are called “Cothread CPUs” • Note: Cores that share a processor (or package or chip) are not named or treated differently
Identifying CoThread CPUs on OpenVMS
$ show cpu/brief
System: XXXXXX, HP rx4640 (1.40GHz/12.0MB)
CPU 0    State: RUN    CPUDB: 8202A000    Handle: 00005D70
         Owner: 000004C8    Current: 000004C8    Partition 0
         Cothd: 8
CPU 1    State: RUN    CPUDB: 820FDF80    Handle: 00005E80
         Owner: 000004C8    Current: 000004C8    Partition 0
         Cothd: 9
CPU 2    State: RUN    CPUDB: 820FFC80    Handle: 00005F90
         Owner: 000004C8    Current: 000004C8    Partition 0
         Cothd: 10
CPU 3    State: RUN    CPUDB: 82101A80    Handle: 000060A0
         Owner: 000004C8    Current: 000004C8    Partition 0
         Cothd: 11
Tradeoffs with Hyperthreads: Basics • One core with two threads MAY perform better than one core with one thread (but not always) • One core with two threads NEVER performs as well as two cores
Montecito Multi-threading (No Stalls) (timeline figure) • Serial Execution: Ai, Ai+1, Bi, Bi+1 • Montecito Multi-threaded Execution: Ai, Ai+1, Bi, Bi+1
Multi-threading vs Two Cores (timeline figure) • Execution on Two Cores: Ai, Ai+1 in parallel with Bi, Bi+1 • Montecito Multi-threaded Execution: Ai, Ai+1, Bi, Bi+1
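The timeline slides can be approximated with a toy model. The Python sketch below is not from the presentation. It assumes each thread is a list of hypothetical (run, stall) bursts in arbitrary cycles, that a core switches hyperthreads only when the running one stalls (the timer-based switch is ignored), and that switching costs nothing.

```python
def serial_time(threads):
    """One core, no multi-threading: every run and every stall is paid in full."""
    return sum(run + stall for bursts in threads for run, stall in bursts)


def multithreaded_time(threads):
    """One core, switch-on-stall: a stalled thread waits in the background
    while another ready thread uses the core."""
    n = len(threads)
    next_burst = [0] * n   # index of the next burst each thread will execute
    ready_at = [0] * n     # time at which each thread's current stall ends
    clock = 0
    while True:
        pending = [i for i in range(n) if next_burst[i] < len(threads[i])]
        if not pending:
            # Finished once the core is idle and the last stall has drained.
            return max(clock, max(ready_at))
        runnable = [i for i in pending if ready_at[i] <= clock]
        if not runnable:
            # Every remaining thread is stalled: the core idles until one is ready.
            clock = min(ready_at[i] for i in pending)
            continue
        i = runnable[0]
        run, stall = threads[i][next_burst[i]]
        next_burst[i] += 1
        clock += run                  # the thread owns the core while it runs
        ready_at[i] = clock + stall   # its stall overlaps with other threads


def one_core_per_thread_time(threads):
    """Each thread gets its own core, so the slowest thread sets the total."""
    return max(sum(run + stall for run, stall in bursts) for bursts in threads)


# Hypothetical workloads: threads A and B, two bursts each.
with_stalls = [[(2, 3), (2, 0)], [(2, 3), (2, 0)]]
no_stalls = [[(2, 0), (2, 0)], [(2, 0), (2, 0)]]
```

Under these assumptions the stalling workload takes 14 cycles serially, 9 multi-threaded, and 7 on two cores, while the stall-free workload takes 8 cycles both serially and multi-threaded and 4 on two cores. That is the pattern the slides describe: hyperthreads help only when there are stalls to hide, and never match a second core.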
VMS support for Hyperthreading • Three categories of support • Managing/getting info • Reducing “waste” of hyperthread cycles • Scheduling
Managing/Getting Info • Hyperthread to CPU mapping • First thread of all cores followed by second threads • Example: 2 processor system. CPUs 0,1,2,3 are all separate cores. CPUs 4,5,6,7 are cothreads of 0,1,2,3 • SHOW CPU/BRIEF and /FULL • Notes which CPU is the cothread of the displayed CPU • SET CPU/[NO]COTHREAD • Stops one of the cothreads on the core associated with this CPU • Accounting • Only charge a process ½ the CPU time if the CPU's cothread is busy
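The numbering rule and the accounting rule on this slide can be written down in a few lines. This is a minimal Python sketch, not OpenVMS code: the function names and the interval representation are hypothetical, and it assumes every core has both hyperthreads enabled.

```python
def cothread_of(cpu, n_cores):
    """Return the CPU id of the cothread, assuming first threads are numbered
    0..n_cores-1 and second threads n_cores..2*n_cores-1."""
    if not 0 <= cpu < 2 * n_cores:
        raise ValueError("CPU id out of range for this core count")
    return cpu + n_cores if cpu < n_cores else cpu - n_cores


def charged_cpu_time(intervals):
    """Sum the CPU time charged to a process.

    intervals: list of (duration, cothread_busy) pairs (hypothetical format).
    Time spent while the cothread was busy is charged at one half.
    """
    return sum(duration * (0.5 if cothread_busy else 1.0)
               for duration, cothread_busy in intervals)
```

With 4 cores, `cothread_of(0, 4)` gives 4 and `cothread_of(7, 4)` gives 3, matching the 2 processor example. With 8 cores, `cothread_of(0, 8)` gives 8, which is consistent with the `Cothd: 8` shown for CPU 0 in the rx4640 listing, assuming that system has 8 cores.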
Managing • EFI command: cpuconfig threads on/off • Supported part of EFI • Requires two resets: one to get to EFI, one to make the thread command take effect • [systest]hthread.exe • Like RADCHECK, an unsupported but helpful little utility • Checks and modifies the firmware state of hyperthreading • $ hthread -show • $ hthread -on • $ hthread -off • The change takes effect after the next reboot (i.e. only a single reset)
Reducing Hyperthread Cycle Waste • Main point: A hyperthread spinning in halt or idle still uses cycles that its cothread might have used • Idle loop • hint@pause between each check for busy • Power saver mode as usual • STOP/CPU • hint@pause while halted • Future possibilities: • hint@pause while spinning on locks? • Tradeoffs abound!
Scheduler Changes • Two cores are always better than two hyperthreads on the same core, so: • Attempt to schedule processes on CPUs without a busy cothread • Ties in with waste reduction since an idle hyperthread will give up its cycles to its cothread
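The scheduling preference described here can be illustrated with a small selection function. This Python sketch is an assumption-laden illustration, not the actual OpenVMS scheduler: `pick_cpu` and its arguments are hypothetical, and it assumes the numbering where first threads come before second threads.

```python
def cothread_of(cpu, n_cores):
    """Cothread id under the 'first threads, then second threads' numbering."""
    return cpu + n_cores if cpu < n_cores else cpu - n_cores


def pick_cpu(busy, n_cores):
    """Choose an idle CPU for a newly runnable process.

    busy: set of CPU ids currently running a process.
    Preference order: an idle CPU whose cothread is also idle (a whole free
    core), then any idle CPU, then None if every CPU is busy.
    """
    idle = [cpu for cpu in range(2 * n_cores) if cpu not in busy]
    for cpu in idle:
        if cothread_of(cpu, n_cores) not in busy:
            return cpu
    return idle[0] if idle else None
```

For example, on 4 cores with CPUs 0 and 5 busy, the function skips CPU 1 (its cothread, CPU 5, is busy) and returns CPU 2. Only when every core already has one busy hyperthread does it fall back to sharing a core.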
Question you are too polite to ask • Why didn’t you change the scheduler to make good use of hyperthreads? • Answer: • We don’t know how. • Seriously, it is VERY application mix dependent.
Tradeoffs with Hyperthreading • Imagine you want to make the best use of hyperthreads • What threads of execution do you run on the same core?
Who shares a core? • Threads that share the same memory space (e.g. kernel threads within a process) • They might share some cache and require fewer cache fills and thus perform better! • But if they stall less, hyperthreads are less advantageous! • Threads that have nothing to do with each other • More cache misses so threads help more • But more cache misses means poorer individual performance! • Clearly there is a tradeoff somewhere, but we can’t make it automatically
My recommendation • Even without threads, Montecito works well • Try it with threads off; you will likely be happy • Experiment with processes on threads • Use affinity to group different processes on cothreads, or to avoid cothreads • Experiment with fastpath CPUs on threads. • Do you get better throughput spreading I/O across all threads or only using one thread per core?
Other features - Soon • ar.ruc • NUMA • Power control
Other features further out • User mode rfi • Might allow one to go to an instruction within a bundle • Useful for AST returns (maybe?)