Hyper-Threading, Chip Multiprocessors, and Both
Zoran Jovanovic
To Be Tackled in Multithreading • Review of Threading Algorithms • Hyper-Threading Concepts • Hyper-Threading Architecture • Advantages/Disadvantages
Threading Algorithms • Time-slicing • The processor switches between threads at fixed time intervals. • High overhead, especially if one of the threads is simply waiting (fine-grain multithreading). • Switch-on-event • The processor switches threads on long pauses, e.g., while waiting for data from a relatively slow source; the CPU's resources are given to other threads (coarse-grain multithreading).
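To make the two policies concrete, here is a toy simulation in C++ (all structure names and cycle counts are hypothetical, not from the slides): it runs the switch-on-event policy, switching threads only when the current one stalls, and contrasts it in comments with per-quantum time-slicing.

```cpp
#include <cstdio>

// Toy model of hardware multithreading policies. Everything here is a
// hypothetical sketch, not real hardware behavior.
struct Thread {
    int work_left;      // cycles of useful work remaining
    int stall_period;   // stalls after this many consecutive work cycles
    int run_streak = 0; // consecutive cycles run since the last stall
};

int main() {
    // Two toy threads: thread 0 stalls every 2 cycles, thread 1 every 3.
    Thread t[2] = {{6, 2}, {6, 3}};
    int cur = 0, cycles = 0, switches = 0;

    // Switch-on-event policy: run the current thread until it stalls.
    while (t[0].work_left > 0 || t[1].work_left > 0) {
        Thread &r = t[cur];
        if (r.work_left > 0) {
            --r.work_left;
            ++r.run_streak;
        }
        ++cycles;
        bool stalled = (r.run_streak == r.stall_period) || r.work_left == 0;
        if (stalled) {   // event: long pause, give the CPU to the other thread
            r.run_streak = 0;
            cur ^= 1;
            ++switches;
        }
    }
    printf("switch-on-event: %d cycles, %d thread switches\n", cycles, switches);
    // A time-sliced version would instead switch every fixed quantum,
    // paying the switch overhead even when no thread is stalled.
    return 0;
}
```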
Threading Algorithms (cont.) • Multiprocessing • Distributes the load over many processors • Adds extra cost • Simultaneous multithreading • Multiple threads execute on a single processor without switching • The basis of Intel's Hyper-Threading technology
Hyper-Threading Concept • At any point in time, only part of a processor's resources is used to execute a thread's code. • Unused resources can be put to work as well, for example by executing another thread/application in parallel. • Extremely useful in desktop and server applications where many threads are active.
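A minimal software-level sketch of the idea (assuming a hyper-threaded CPU and leaving scheduling to the OS; the workload sizes are made up): two independent threads, one compute-bound and one memory-bound, can be co-scheduled onto the two logical processors of one physical core, letting the memory thread's stall cycles be filled by the compute thread.

```cpp
#include <iostream>
#include <thread>
#include <vector>

// ALU-heavy loop: few memory stalls, keeps the execution units busy.
static long compute_bound() {
    long s = 0;
    for (long i = 1; i <= 50'000'000; ++i) s += i % 7;
    return s;
}

// Cache-unfriendly strided walk: frequently stalls waiting on memory.
static long memory_bound(const std::vector<int>& big) {
    long s = 0;
    for (std::size_t i = 0; i < big.size(); i += 64) s += big[i];
    return s;
}

int main() {
    std::vector<int> big(1 << 24, 1);  // ~64 MB to walk (hypothetical size)
    long a = 0, b = 0;
    // The OS may place these on the two logical processors of one SMT core,
    // overlapping one thread's memory stalls with the other's computation.
    std::thread t1([&] { a = compute_bound(); });
    std::thread t2([&] { b = memory_bound(big); });
    t1.join();
    t2.join();
    std::cout << a + b << '\n';
}
```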
Quick Recall: Many Resources IDLE! For an 8-way superscalar. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995. Slide source: John Kubiatowicz
(a) A superscalar processor with no multithreading; (b) a superscalar processor with coarse-grain multithreading; (c) a superscalar processor with fine-grain multithreading; (d) a superscalar processor with simultaneous multithreading (SMT)
Simultaneous Multithreading (SMT) Example: Intel's Pentium 4 with "Hyper-Threading" Key idea: exploit ILP across multiple threads! • i.e., convert thread-level parallelism into more ILP • Exploits the following features of modern processors: • Multiple functional units • Modern processors typically have more functional units available than a single thread can utilize • Register renaming and dynamic scheduling • Multiple instructions from independent threads can co-exist and co-execute!
Hyper-Threading Architecture • First used in the Intel Xeon MP processor • Makes a single physical processor appear as multiple logical processors • Each logical processor keeps its own copy of the architecture state • The logical processors share a single set of physical execution resources
Hyper-Threading Architecture • Operating systems and user programs can schedule processes or threads onto the logical processors as if they were physical processors in a multiprocessor system. • From an architecture perspective, we have to worry about the logical processors competing for shared resources: • Caches, execution units, branch predictors, control logic, and buses.
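For instance, the processor count the OS exposes is the logical one. A small standard C++ check (nothing Intel-specific assumed):

```cpp
#include <iostream>
#include <thread>

int main() {
    // Reports how many hardware threads the OS can schedule onto, i.e.
    // logical processors. On a 4-core CPU with Hyper-Threading this is
    // typically 8; it says nothing about how many cores are physical.
    // (May return 0 if the value is not computable on this platform.)
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "logical processors: " << n << '\n';
}
```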
Power5 dataflow ... • Why only two threads? • With four, one of the shared resources (physical registers, cache, memory bandwidth) would likely become a bottleneck • Cost: • The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
Advantages • The extra architecture adds only about 5% to the total die area. • No performance loss if only one thread is active; increased performance with multiple threads. • Better resource utilization.
Disadvantages • To benefit from hyper-threading, execution cannot be purely serial. • Threads are non-deterministic and involve extra design effort • Threads have increased overhead • Shared resource conflicts
Multicore: Multiprocessors on a single chip
Basic Shared Memory Architecture • Processors all connected to a large shared memory • Where are the caches? (Figure: processors P1, P2, ..., Pn connected to memory through an interconnect.) • Now take a closer look at structure, costs, limits, and programming
What About Caching? (Figure: processors P1, ..., Pn, each with a cache ($), connected by a bus to memory and I/O devices.) • Want high performance for shared memory: use caches! • Each processor has its own cache (or multiple caches) • Place data from memory into the cache • Write-back cache: don't send all writes over the bus to memory • Caches reduce average latency • Automatic replication closer to the processor • More important for a multiprocessor than a uniprocessor: latencies are longer • Normal uniprocessor mechanisms are used to access data • Loads and stores form a very low-overhead communication primitive • Problem: cache coherence! Slide source: John Kubiatowicz
Example Cache Coherence Problem (Figure: processors P1, P2, P3, each with a cache, on a bus with memory and I/O devices; memory initially holds u:5. Events: 1. P1 reads u = 5; 2. P3 reads u = 5; 3. P3 writes u = 7; then P1 and P2 read u = ?) • Things to note: • Processors could see different values for u after event 3 • With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when • How to fix with a bus: a coherence protocol • Use the bus to broadcast writes or invalidations • Simple protocols rely on the presence of a broadcast medium • A bus is not scalable beyond about 64 processors (max) • Capacity, bandwidth limitations Slide source: John Kubiatowicz
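The same scenario expressed in software (a sketch with standard C++ atomics; this demonstrates the guarantee that coherence plus atomics provide, not the bus protocol itself):

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> u{5};  // the shared location from the slide's example

int main() {
    std::thread p3([] {
        u.store(7);  // event 3: P3 writes u = 7
    });
    std::thread p1([] {
        // Coherence guarantees this read returns either 5 or 7, and once 7
        // has been observed, no later read goes back to the stale 5.
        std::cout << "P1 reads u = " << u.load() << '\n';
    });
    p1.join();
    p3.join();
    // With a plain (non-atomic) int and no synchronization, this would be a
    // data race: the compiler and caches could legally return stale values.
    return 0;
}
```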
Limits of Bus-Based Shared Memory (Figure: processors with caches on a shared bus to memory and I/O; each core demands 5.2 GB/s, but only 140 MB/s of that reaches the bus.) Assume: a 1 GHz processor without caches => 4 GB/s instruction bandwidth per processor (32-bit) => 1.2 GB/s data bandwidth at 30% load-store. Suppose a 98% instruction hit rate and 95% data hit rate => 80 MB/s instruction bandwidth per processor => 60 MB/s data bandwidth per processor => 140 MB/s combined bus bandwidth per processor. Assuming 1 GB/s of bus bandwidth, 8 processors will saturate the bus.
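The slide's arithmetic, reproduced as a small self-check (same numbers as above, nothing new assumed):

```cpp
#include <cstdio>

int main() {
    // Per-processor demand without caches (1 GHz, 32-bit accesses):
    double inst_bw = 1e9 * 4;          // 4 GB/s instruction fetch
    double data_bw = 1e9 * 0.30 * 4;   // 1.2 GB/s at a 30% load-store fraction
    // After caching, only misses reach the bus (98% inst, 95% data hits):
    double bus_inst = inst_bw * 0.02;  // 80 MB/s
    double bus_data = data_bw * 0.05;  // 60 MB/s
    double per_proc = bus_inst + bus_data;  // 140 MB/s per processor
    double bus = 1e9;                  // 1 GB/s shared bus
    printf("core demand: %.1f GB/s, bus demand: %.0f MB/s per processor\n",
           (inst_bw + data_bw) / 1e9, per_proc / 1e6);
    printf("bus supports ~%.1f processors, so 8 saturate it\n", bus / per_proc);
    return 0;
}
```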
Cache Organizations for Multi-cores • L1 caches are always private to a core • L2 caches can be private or shared • Advantages of a shared L2 cache: • Efficient dynamic allocation of space to each core • Data shared by multiple cores is not replicated • Every block has a fixed "home", hence it is easy to find the latest copy • Advantages of a private L2 cache: • Quick access to the private L2, good for small working sets • A private bus to the private L2 means less contention
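One practical consequence of private per-core caches is false sharing, sketched below (a hypothetical illustration, not from the slides; the 64-byte line size is an assumption typical of x86): two threads updating adjacent counters in one cache line make the coherence protocol bounce that line between the cores' private caches, while padding gives each counter its own line.

```cpp
#include <thread>

// False sharing: both counters live in one 64-byte cache line, so each
// core's write invalidates the line in the other core's private cache.
struct Shared { long a = 0; long b = 0; };

// Fix: force each counter onto its own cache line. (64 bytes is an
// assumed line size; C++17's std::hardware_destructive_interference_size
// is the portable constant where implemented.)
struct Padded { alignas(64) long a = 0; alignas(64) long b = 0; };

template <typename T>
void hammer(T& s) {
    // No data race: the threads write to distinct objects.
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) ++s.a; });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) ++s.b; });
    t1.join();
    t2.join();
}

int main() {
    Shared slow;  // typically much slower: the line ping-pongs between caches
    Padded fast;  // each counter stays resident in one core's private cache
    hammer(slow);
    hammer(fast);
    return 0;
}
```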
A Reminder: SMT (Simultaneous Multithreading) vs. CMP
A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 1997 (Figure: floorplans of superscalar (SS), SMT, and CMP chips.) • For the same area (a billion-transistor, DRAM-scale chip): • Superscalar and SMT: very complex • Wide issue • Advanced branch prediction • Register renaming • Out-of-order (OOO) instruction issue • Non-blocking data caches
SS and SMT vs. CMP • CPU cores: three main hardware design problems (of SS and SMT): • Area increases quadratically with core complexity • Number of registers: O(instruction window size) • Register ports: O(issue width) • CMP solves this problem (area roughly linear in total issue width) • Longer cycle times • Long wires, many MUXes and crossbars • Large buffers, queues, and register files • Clustering (decreases ILP) or deep pipelining (branch misprediction penalties) • CMP allows a small cycle time (with little effort): small and fast cores, though it relies on software to schedule threads and each core has poor ILP • Complex design and verification (for SS and SMT; a CMP replicates one simple core)
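A rough back-of-envelope model behind the quadratic claim (an assumption-laden sketch, not taken from the Hammond paper):

```latex
% Each of the W results produced per cycle may need forwarding to each of
% the W issue slots, so bypass wiring and register ports scale with W:
\[
  A_{\text{core}} \;\propto\; W^{2},
\qquad
  A_{\text{CMP}} \;\propto\; k \cdot w^{2} \;\propto\; W
  \quad (W = k\,w,\ w\ \text{fixed per core}),
\]
% i.e. a CMP built from k simple w-issue cores grows roughly linearly
% with total issue width W, as the slide claims.
```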
SS and SMT vs. CMP • Memory: • A 12-issue SS or SMT requires a multiported data cache (4-6 ports) • 2 × 128 KB (2-cycle latency) • CMP: 16 × 16 KB (single-cycle latency), but the secondary cache is slower (multiported) • Shared memory: write-through primary caches
Performance Comparison • Compress (integer app): low ILP and no TLP • Mpeg-2 (multimedia app): high ILP and TLP, moderate memory requirements (parallelized by hand) • + SMT utilizes its core resources better • + But CMP has 16 issue slots instead of 12 • Tomcatv (FP app): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler) • + CMP has large memory bandwidth on its primary caches • - SMT's fundamental problem: a unified and slow cache • Multiprogram: an integer multiprogramming workload, all computation-intensive (low ILP, high PLP)
CMP Motivation • How to utilize the available silicon? • Speculation (aggressive superscalar) • Simultaneous multithreading (SMT, Hyper-Threading) • Several processors on a single chip • What is a CMP (Chip MultiProcessor)? • Several processors (several masters) • Both shared- and distributed-memory architectures • Both homogeneous and heterogeneous processor types • Why? • Wire delays • Diminishing returns from uniprocessors • Very long design and verification times for modern processors
A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 1997 • TLP and PLP are becoming widespread in future applications • Various multimedia applications • Compilers and OS • This favours CMP • CMP: • Better performance with simple hardware • Higher clock rates, better memory bandwidth • Shorter pipelines • SMT has better utilization, but CMP has more resources (no wide-issue logic) • Although CMP is bad when there is no TLP and little ILP (compress), SMT and SS are not much better
A Reminder: SMT vs. CMP • SMT: • Pool of execution units (a wide machine) • Several logical processors, with a copy of the state for each • Multiple threads run concurrently • Better utilization and latency tolerance • CMP: • Simple cores • A moderate amount of parallelism • Threads run concurrently on different cores
SMT Dual-Core: All Four Threads Can Run Concurrently (Figure: two SMT cores, each with its own L1 D-cache and D-TLB, integer and floating-point schedulers, uop queues, rename/alloc logic, trace cache and uCode ROM, decoder, BTB and I-TLB, and L2 cache and control, connected to the bus; threads 1 and 2 run on one core while threads 3 and 4 run on the other.)