CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering, University of Alabama in Huntsville Aleksandar Milenkovic, milenka@ece.uah.edu http://www.ece.uah.edu/~milenka
Outline • Trends in microarchitecture • Exploiting thread-level parallelism • Exploiting TLP within a processor • Resource sharing • Performance implications • Design challenges • Intel’s HT technology
Trends in microarchitecture • Higher clock speeds • To achieve a high clock frequency, make the pipeline deeper (superpipelining) • Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles • ILP: Instruction-Level Parallelism • Extract parallelism within a single program • Superscalar processors have multiple execution units working in parallel • Challenge: finding enough instructions that can be executed concurrently • Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order
Trends in microarchitecture • Cache hierarchies • Processor-memory speed gap • Use caches to reduce memory latency • Multiple levels of caches: smaller and faster closer to the processor core • Thread-Level Parallelism • Multiple programs execute concurrently • Web servers have an abundance of software threads • Users: surfing the web, listening to music, encoding/decoding video streams, etc. (see the sketch below)
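To make software threads concrete, here is a minimal C++11 sketch (the two task names are hypothetical stand-ins for the user activities listed above): two independent threads that an SMT or CMP processor could execute at the same time.

```cpp
// Minimal illustration of thread-level parallelism in C++11.
// encode_video/play_music are hypothetical stand-ins for user tasks.
#include <iostream>
#include <thread>

void encode_video() { std::cout << "encoding a video stream\n"; }
void play_music()   { std::cout << "decoding an audio stream\n"; }

int main() {
    std::thread t1(encode_video);  // one independent software thread
    std::thread t2(play_music);    // a second, concurrent thread
    t1.join();                     // wait for both before exiting
    t2.join();
    return 0;
}
```

On typical Linux toolchains this builds with g++ -std=c++11 -pthread.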
Exploiting thread-level parallelism • CMP – Chip Multiprocessing • Multiple processors, each with a full set of architectural resources, reside on the same die • Processors may share an on-chip cache, or each can have its own cache • Examples: HP Mako, IBM Power4 • Challenges: power, die area (cost) • Time-slice multithreading • The processor switches between software threads after a predefined time slice • Can minimize the effects of long-lasting events • Still, some execution slots are wasted
Multithreading Within a Processor • Until now, we have executed multiple threads of an application on different processors – can multiple threads execute concurrently on the same processor? • Why is this desirable? • inexpensive – one CPU, no external interconnects • no remote or coherence misses (but more capacity misses) • Why does this make sense? • most processors can't find enough work – peak IPC is 6, average IPC is 1.5! • threads can share resources => we can increase the number of threads without a corresponding linear increase in area
What Resources are Shared? • Multiple threads are simultaneously active (in other words, a new thread can start without a context switch) • For correctness, each thread needs its own PC, its own logical regs (and its own mapping from logical to phys regs) • For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing => better utilization of resources • Each additional thread costs a PC, rename table, and ROB – cheap! (see the sketch below)
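A rough C++ sketch of that split between replicated and shared state (all sizes and names here are assumptions, not any real core's parameters):

```cpp
// Sketch: what each SMT hardware thread owns for correctness vs.
// what the core shares for utilization (sizes are assumed).
#include <array>
#include <cstdint>
#include <vector>

constexpr int kLogicalRegs = 32;
constexpr int kRobEntries  = 64;   // assumed per-thread ROB size

struct RobEntry { int dest_preg; bool ready; };

// Per-thread state: cheap to replicate.
struct ThreadContext {
    uint64_t pc;                               // its own PC
    std::array<int, kLogicalRegs> rename_map;  // logical -> physical map
    std::array<RobEntry, kRobEntries> rob;     // private ROB, so one
                                               // stalled thread does not
                                               // block others' commit
};

// Core-wide state: shared across threads for better utilization.
struct SmtCore {
    std::vector<uint64_t> phys_regs;        // one shared physical reg file
    // shared caches, branch predictor, issue queue, functional units ...
    std::array<ThreadContext, 4> threads;   // e.g., 4 hardware threads
};
```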
Approaches to Multithreading Within a Processor • Fine-grained multithreading: switches threads on every clock cycle • Pro: hides the latency of both short and long stalls • Con: slows down execution of individual threads that are ready to go • Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 misses) • Pros: no switching every clock cycle, no slowdown for ready-to-go threads • Con: limited ability to hide shorter stalls • Simultaneous Multithreading (SMT): exploits TLP at the same time it exploits ILP (both switching policies are sketched below)
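The two switching policies can be written as toy thread-selection functions (a simplified model; the thread count and the stall trigger are assumptions):

```cpp
// Toy thread-selection policies for a multithreaded core.
#include <cstddef>

constexpr std::size_t kThreads = 4;   // assumed hardware thread count

// Fine-grained: a different thread every clock cycle, round-robin.
std::size_t fine_grained_next(std::size_t current) {
    return (current + 1) % kThreads;
}

// Coarse-grained: keep the current thread until a costly stall occurs.
std::size_t coarse_grained_next(std::size_t current, bool costly_stall) {
    return costly_stall ? (current + 1) % kThreads : current;
}
```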
How Are Resources Shared? [Figure: issue slots per cycle under Superscalar, Coarse-grained Multithreading, Fine-grained Multithreading, and Simultaneous Multithreading; each box is an issue slot for a functional unit, rows are cycles, colors mark Threads 1-4 and idle slots; peak throughput is 4 IPC] • A superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss • Fine-grained multithreading can issue instructions from only a single thread in a cycle – it cannot find maximum work every cycle, but cache misses can be tolerated • Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot (see the issue sketch below)
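A toy model of why SMT finds the most work: each cycle it may fill issue slots with ready instructions from any thread, so a slot idles only when every thread is out of work (the 4-wide issue width and thread count are assumptions matching the figure):

```cpp
// Toy model of SMT issue-slot filling (width/thread count assumed).
#include <algorithm>
#include <array>
#include <cstddef>

constexpr int kIssueWidth = 4;   // peak throughput: 4 IPC

// ready[t] = ready-to-issue instructions in thread t this cycle.
int smt_issue(const std::array<int, 4>& ready) {
    int slots  = kIssueWidth;
    int issued = 0;
    for (std::size_t t = 0; t < ready.size() && slots > 0; ++t) {
        int take = std::min(ready[t], slots);  // mix threads in one cycle
        issued += take;
        slots  -= take;
    }
    return issued;  // < kIssueWidth only if ALL threads lack ready work
}
```

A fine-grained core is instead limited to std::min(ready[t], kIssueWidth) for the one thread t it selected this cycle.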
Resource Sharing [Figure: two threads share the back end. Each thread has its own Instr Fetch and Instr Rename stage; renamed instructions from both threads enter a shared Issue Queue, read a shared Register File, and execute on shared FUs. Thread-1: R1 <- R1 + R2; R3 <- R1 + R4; R5 <- R1 + R3 is renamed to P73 <- P1 + P2; P74 <- P73 + P4; P75 <- P73 + P74. Thread-2: R2 <- R1 + R2; R5 <- R1 + R2; R3 <- R5 + R3 is renamed to P76 <- P33 + P34; P77 <- P33 + P76; P78 <- P77 + P35.]
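The renaming shown in the figure can be sketched as follows (a simplified model: a shared counter stands in for the physical-register free list):

```cpp
// Toy model of SMT register renaming: per-thread map tables feeding
// one SHARED pool of physical registers (free list is a counter here).
#include <array>

struct MapTable {
    std::array<int, 32> map{};   // this thread's logical -> physical map
};

int next_free_preg = 73;         // shared allocator, as in the figure

// Rename "Rd <- Rs1 + Rs2" for one thread.
void rename_add(MapTable& t, int rd, int rs1, int rs2) {
    int ps1 = t.map[rs1];        // current mapping of first source
    int ps2 = t.map[rs2];        // current mapping of second source
    int pd  = next_free_preg++;  // new destination from the shared pool
    t.map[rd] = pd;              // later readers of Rd see the new preg
    // dispatch "P(pd) <- P(ps1) + P(ps2)" into the shared issue queue
    (void)ps1; (void)ps2;
}
```

With thread-1's table initialized to R1->P1, R2->P2, R4->P4 and thread-2's to R1->P33, R2->P34, R3->P35, the six instructions above rename to exactly the P73-P78 sequence in the figure.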
Performance Implications of SMT • Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread • While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources (sketched below) • With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x • The Alpha 21464 and Intel Pentium 4 are examples of SMT designs
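One plausible form of the ICOUNT heuristic (exactly which pipeline stages are counted is an assumption of this sketch): fetch from the thread with the fewest instructions in the pre-issue stages, which steers all threads toward an equal share of the machine.

```cpp
// Sketch of an ICOUNT-style fetch chooser (stage choice assumed).
#include <array>
#include <cstddef>

// icount[t] = instructions thread t has in decode/rename/issue queue.
std::size_t icount_pick(const std::array<int, 8>& icount) {
    std::size_t best = 0;
    for (std::size_t t = 1; t < icount.size(); ++t)
        if (icount[t] < icount[best]) best = t;  // fewest in flight wins
    return best;  // fetch from this thread next cycle
}
```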
Design Challenges • How many threads? • Enough to find sufficient parallelism • However, mixing too many threads will compromise the execution of individual threads • Processor front-end (instruction fetch) • Fetch as far as possible in a single thread (to maximize thread performance) • However, this limits the number of instructions available for scheduling from other threads • Larger register files (multiple contexts) • Minimize clock cycle time • Cache conflicts
Pentium 4: Hyper-Threading architecture • One physical processor appears as multiple logical processors • The HT implementation on the NetBurst microarchitecture has 2 logical processors [Figure: two copies of the Architectural State sitting on top of shared Processor execution resources] • Architectural state: • general-purpose registers • control registers • APIC: advanced programmable interrupt controller (a quick way to observe the logical processors is sketched below)
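A standard way to see how many logical processors the OS exposes is std::thread::hardware_concurrency() (a real C++11 API); on a Hyper-Threading part it typically reports twice the number of physical cores.

```cpp
// Print the number of logical processors visible to the OS.
#include <iostream>
#include <thread>

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "logical processors: " << n << '\n';  // 0 if unknown
    return 0;
}
```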
Pentium 4: Hyper-Threading architecture • Main processor resources are shared • caches, branch predictors, execution units, buses, control logic • Duplicated resources • register alias tables (map the architectural registers to physical rename registers) • next-instruction pointer and associated control logic • return stack predictor • instruction streaming buffers and trace-cache fill buffers
Pentium 4: Resource-sharing schemes • Partition – dedicate equal resources to each logical processor • Good when high utilization is expected and demand is somewhat unpredictable • Threshold – flexible resource sharing with a limit on maximum resource usage • Good for small resources with bursty utilization, where micro-ops stay in the structure for short, predictable periods • Full sharing – flexible sharing with no limits • Good for large structures with variable working-set sizes
Pentium 4: Shared vs. partitioned queues [Figure: the same queue organized as a single shared structure vs. partitioned between the two logical processors]
NetBurst Pipeline [Figure: the NetBurst pipeline, marking which structures use threshold sharing and which are partitioned]
Pentium 4: Shared vs. partitioned resources • Partitioned • e.g., the major pipeline queues • Threshold • puts a limit on the number of resource entries a logical processor can occupy • e.g., the scheduler • Fully shared resources • e.g., caches • modest interference in practice • a benefit if the threads share code and/or data (a threshold-sharing sketch follows)
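A toy sketch of the threshold scheme for a structure like the scheduler (both sizes are assumptions): entries come from one shared pool, but each logical processor is capped so a stalled thread cannot monopolize the structure.

```cpp
// Toy model of threshold sharing between two logical processors.
#include <array>

constexpr int kQueueSize    = 32;  // assumed total entries
constexpr int kMaxPerThread = 24;  // assumed per-thread cap (< total)

struct SharedQueue {
    int used_total = 0;
    std::array<int, 2> used_by{};  // entries held by each logical CPU

    bool try_allocate(int thread) {
        if (used_total >= kQueueSize) return false;          // queue full
        if (used_by[thread] >= kMaxPerThread) return false;  // over cap
        ++used_total;
        ++used_by[thread];
        return true;
    }
    void release(int thread) { --used_total; --used_by[thread]; }
};
```

Setting kMaxPerThread to half the queue size degenerates to partitioning; setting it to the full size degenerates to full sharing.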