
Motivation for Multithreaded Architectures



  1. Motivation for Multithreaded Architectures Processors not executing code at their hardware potential • late 70’s: performance lost to memory latency • 90’s: performance not in line with the increasingly complex parallel hardware • instruction issue bandwidth is increasing • number of functional units is increasing • instructions can execute out-of-order & complete in-order • still, processor utilization was decreasing & instruction throughput was not increasing in proportion to the issue width CSE 548P

  2. Motivation for Multithreaded Architectures CSE 548P

  3. Motivation for Multithreaded Architectures Major cause is the lack of instruction-level parallelism in a single executing thread Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor CSE 548P

  4. Multithreaded Processors Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling • holds processor state for more than one thread of execution • registers • PC • each thread’s state is a hardware context (sketched in C below) • execute the instruction stream from multiple threads without software context switching • utilize thread-level parallelism (TLP) to compensate for a lack of ILP CSE 548P
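
To make “hardware context” concrete, here is a minimal C sketch of the per-thread state such a processor replicates; the field widths and the 4-context count are illustrative assumptions, not taken from any particular machine.

    #include <stdint.h>

    #define NUM_ARCH_REGS 32          /* illustrative RISC-style register count */

    /* One hardware context: the state the processor must hold per thread so
     * it can interleave threads without a software context switch. */
    struct hw_context {
        uint64_t pc;                  /* program counter */
        uint64_t regs[NUM_ARCH_REGS]; /* architectural general-purpose registers */
        uint64_t status;              /* per-thread status/flag bits */
    };

    /* A 4-context multithreaded processor keeps an array of these on chip. */
    struct hw_context contexts[4];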

  5. Multithreading Traditional multithreaded processors switch to a different hardware context to avoid processor stalls 3 styles 1. coarse-grain multithreading • switch on a long-latency operation (e.g., L2 cache miss) • another thread executes while the miss is handled • modest increase in instruction throughput with potentially no slowdown to the thread with the miss • doesn’t hide latency of short-latency operations • doesn’t add much to an already long stall • need to fill the pipeline on a switch • no switch if no long-latency operations • HEP, IBM RS64 III CSE 548P

  6. Traditional Multithreading 3 styles 2. fine-grain multithreading • can switch to a different thread each cycle (usually round robin) • hides latencies of all kinds • larger increase in instruction throughput but slows down the execution of each thread • Cray (Tera) MTA CSE 548P
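
As a rough illustration of both switching styles, here is a minimal C sketch of the selection logic they share, assuming one ready bit per context: a fine-grain design consults it every cycle (round robin over ready threads), while a coarse-grain design consults it only when the running thread stalls on a long-latency miss. All names and the context count are illustrative.

    #include <stdint.h>

    #define NUM_THREADS 4    /* illustrative number of hardware contexts */

    /* Pick the next ready thread in round-robin order. 'ready[t]' would be
     * cleared by the hardware while thread t waits on a long-latency
     * operation and set again when the operation completes. */
    static int next_thread(const uint8_t ready[NUM_THREADS], int last) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (ready[t])
                return t;    /* issue from thread t next */
        }
        return -1;           /* no thread ready: pipeline bubble */
    }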

  7. Comparison of Issue Capabilities CSE 548P

  8. Multithreading 3 styles 3. simultaneous multithreading (SMT) • issues multiple instructions from multiple threads each cycle • no hardware context switching • huge boost in instruction throughput with less degradation to individual threads CSE 548P

  9. Comparison of Issue Capabilities CSE 548P

  10. Cray (Tera) MTA Fine-grain multithreaded processor • can switch to a different thread each cycle • switches to ready threads only • allows execution to remain with the current thread for a specific number of cycles (discussed under compiler support) • up to 128 hardware contexts • lots of latency to hide, mostly from the multi-hop interconnection network • average instruction latency was 70 cycles at one point, i.e., 70 instruction streams were needed to hide all latency, on average • processor state for all 128 contexts • GPRs (total of 4K registers!) • status registers (includes the PC) • branch target registers CSE 548P

  11. Cray (Tera) MTA Interesting features • no data caches • to avoid having to keep caches coherent (topic of the next lecture) • increases the latency for data accesses but reduces the variation • L1 & L2 instruction caches • instruction accesses are more predictable & have no coherency problem • prefetch straight-line & target code CSE 548P

  12. Cray (Tera) MTA Interesting features • trade-off between avoiding memory bank conflicts & exploiting spatial locality • memory distributed among processors • memory addresses are randomized to avoid conflicts • want to fully utilize all memory bandwidth • run-time system can confine consecutive virtual addresses to a single (close-by) memory unit • reduces latency (used mainly for instructions) CSE 548P
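
A hedged illustration of the randomization idea: hash each address to a bank so that a stride-1 sweep spreads across all banks instead of hammering one. The XOR-fold hash and the bank count are assumptions for illustration; the MTA used its own scrambling function.

    #include <stdint.h>

    #define NUM_BANKS 512   /* illustrative bank count */

    /* Scatter consecutive addresses across banks: without the hash, a
     * stride-1 sweep would hit bank (addr % NUM_BANKS) in order and a
     * power-of-two stride could hit the same bank repeatedly. */
    static unsigned bank_of(uint64_t addr) {
        uint64_t h = addr ^ (addr >> 9) ^ (addr >> 18);  /* XOR-fold hash */
        return (unsigned)(h % NUM_BANKS);
    }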

  13. Cray (Tera) MTA Interesting features • tagged memory for synchronization, pointer-chasing, trap handling • indirectly set full/empty bits to prevent data races • prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it • set to empty when producer instruction starts • consumer instructions block if they try to read the producer’s value • set to full when producer value is written • consumers can now read a valid value • explicitly set full/empty bits for synchronization • primarily used to synchronize threads that are accessing shared data (topic of the next lecture) • lock: read memory location & set to empty • other readers are blocked • unlock: write & set to full CSE 548P
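
A minimal software emulation of the full/empty discipline, assuming a single producer and a single consumer; on the MTA the bit lives in the memory hardware and a blocked stream simply stops issuing, rather than spinning as this C11 sketch does.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Emulated tagged memory word: a full/empty bit plus the data itself. */
    typedef struct {
        _Atomic int full;    /* 0 = empty, 1 = full */
        uint64_t value;
    } tagged_word_t;

    /* Producer: wait until empty (don't overwrite an unread value),
     * write, then set full; like the MTA's "write & set to full". */
    static void sync_write(tagged_word_t *w, uint64_t v) {
        while (atomic_load(&w->full))
            ;                       /* emulated blocking */
        w->value = v;
        atomic_store(&w->full, 1);
    }

    /* Consumer: block until full, read, then set empty;
     * like the MTA's "read memory location & set to empty". */
    static uint64_t sync_read(tagged_word_t *w) {
        while (!atomic_load(&w->full))
            ;                       /* emulated blocking */
        uint64_t v = w->value;
        atomic_store(&w->full, 0);
        return v;
    }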

  14. Cray (Tera) MTA Interesting features • virtual memory system • no paging: want pages pinned down in memory • page size is 256MB • user-mode trap handlers • fatal exceptions, normalizing floating point numbers CSE 548P

  15. Cray (Tera) MTA Compiler support • each instruction is a VLIW instruction • memory/arithmetic/branch • load/store architecture • need a good code scheduler • explicit dependence look-ahead • field in a memory instruction that specifies the number of independent LIW instructions that follow • deviation from fine-grain multithreading • 8 memory operations can be outstanding • handling branches • instruction to store a branch target in a register before the branch is executed • can start prefetching the target code CSE 548P
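
A sketch of how the look-ahead field might be consumed, under the assumption of a simple per-instruction encoding (not the MTA’s actual format): the compiler writes the number of independent instructions that follow a memory operation, and the issue logic keeps issuing from the same stream up to that bound.

    #include <stdint.h>

    /* Illustrative LIW instruction word with a compiler-filled field. */
    struct liw_instr {
        uint32_t ops;        /* packed memory/arithmetic/branch operations */
        uint8_t  lookahead;  /* # of following instructions independent of
                                this one's memory operation */
    };

    /* After issuing a memory op from this stream, the hardware may issue up
     * to 'lookahead' further instructions from the same stream before the
     * result is needed; this is the deviation from pure fine-grain
     * switching that lets up to 8 memory operations be outstanding. */
    static int can_keep_issuing(const struct liw_instr *mem_op,
                                int issued_since) {
        return issued_since < mem_op->lookahead;
    }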

  16. Cray (Tera) MTA Run-time support • thread manipulation • protection domains: group of threads executing in the same virtual address space • OS sets the maximum number of thread contexts (instruction streams) a domain is allowed • domain can create & kill threads within that limit, depending on its need for them • OS bases resource allocations on resource usage CSE 548P

  17. SMT: The Executive Summary Simultaneous multithreaded (SMT) processors combine designs from: • superscalar processors • traditional multithreaded processors The combination: • SMT issues & executes instructions from multiple threads simultaneously => converting TLP to ILP • threads share almost all hardware resources CSE 548P

  18. Performance Implications Multiprogramming workload • 2.5X on SPEC95, 4X on SPEC2000 Parallel programs • ~.7 on SPLASH2 Commercial databases • 2-3X on TPC B; 1.5 on TPC D Web servers & OS • 4X on Apache and Digital Unix CSE 548P

  19. Does this Processor Sound Familiar? Huge performance boost + straightforward implementation => • 2-context Intel Hyperthreading • 2-context Sun UltraSPARC on 4-processor CMP • 4-context Compaq 21464 • network processor & mobile device start-ups • others in the wings CSE 548P

  20. An SMT Architecture Three primary goals for this architecture: 1. Achieve significant throughput gains with multiple threads 2. Minimize the performance impact on a single thread executing alone 3. Minimize the architectural impact on a conventional out-of-order superscalar design CSE 548P

  21. Implementing SMT CSE 548P

  22. Implementing SMT Most hardware on current out-of-order processors can be used as is Out-of-order renaming & instruction scheduling mechanisms • physical register pool model • renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads • map thread-specific architectural registers onto a pool of thread-independent physical registers • operands are thereafter called by their physical names • an instruction is issued when its operands become available & a functional unit is free • the instruction scheduler does not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities) CSE 548P
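
A hedged C sketch of the physical-register-pool idea: the rename map is indexed by (thread, architectural register), so threads can never collide on register names. The sizes and names are illustrative assumptions; real hardware also returns an instruction’s previous mapping to the free list at retirement, which is omitted here.

    #include <stdint.h>

    #define NUM_THREADS 8
    #define ARCH_REGS   32
    #define PHYS_REGS   356   /* 8*32 architectural mappings + ~100 rename
                                 registers; counts are illustrative */

    static uint16_t map[NUM_THREADS][ARCH_REGS]; /* arch reg -> phys reg */
    static uint16_t free_list[PHYS_REGS];        /* unallocated phys regs */
    static int      free_top;                    /* filled at reset */

    /* Rename the destination of one instruction from thread t: later
     * readers of arch_reg in thread t will use physical register p. */
    static uint16_t rename_dest(int t, int arch_reg) {
        uint16_t p = free_list[--free_top];
        map[t][arch_reg] = p;
        return p;
    }

    /* Source operands are a lookup; the thread ID keeps maps separate,
     * which is what eliminates false dependences between threads. */
    static uint16_t rename_src(int t, int arch_reg) {
        return map[t][arch_reg];
    }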

  23. From Superscalar to SMT Extra pipeline stages for accessing thread-shared register files • 8 threads * 32 registers + renaming registers SMT instruction fetcher (ICOUNT) • fetch from 2 threads each cycle • count the number of instructions for each thread in the pre-execution stages • pick the 2 threads with the lowest number • in essence fetching from the two highest-throughput threads CSE 548P
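
A minimal sketch of the ICOUNT chooser, assuming the pipeline maintains a per-thread count of instructions in the pre-execution stages (decode, rename, instruction queues); the function and array names are illustrative.

    #define NUM_THREADS 8

    /* Pick the two threads with the fewest instructions in the
     * pre-execution stages; those are the threads moving instructions
     * through the front end fastest, i.e., the highest-throughput ones. */
    static void icount_pick2(const int icount[NUM_THREADS],
                             int *first, int *second) {
        *first = *second = -1;
        for (int t = 0; t < NUM_THREADS; t++) {
            if (*first < 0 || icount[t] < icount[*first]) {
                *second = *first;   /* old best becomes second best */
                *first = t;
            } else if (*second < 0 || icount[t] < icount[*second]) {
                *second = t;
            }
        }
    }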

  24. From Superscalar to SMT Per-thread hardware • small stuff • all part of current out-of-order processors • none endangers the cycle time • other per-thread processor state, e.g., • program counters • return stacks • thread identifiers, e.g., with BTB entries, TLB entries • per-thread bookkeeping for • instruction retirement • trapping • instruction queue flush This is why there is only a 10% increase to Alpha 21464 chip area. CSE 548P

  25. Implementing SMT Thread-shared hardware: • fetch buffers • branch prediction structures • instruction queues • functional units • active list • all caches & TLBs • MSHRs • store buffers This is why there is little single-thread performance degradation (~1.5%). CSE 548P

  26. Architecture Research Concept & potential of Simultaneous Multithreading: ISCA ’95 & ISCA 25th Anniversary Anthology Designing the microarchitecture: ISCA ’96 • straightforward extension of out-of-order superscalars I-fetch thread chooser: ISCA ’96 • 40% faster than round-robin The lockbox for cheap synchronization: HPCA ’98 • orders of magnitude faster • can parallelize previously unparallelizable codes CSE 548P

  27. Architecture Research Software-directed register deallocation: TPDS ’99 • large register-file performance w. small register file Mini-threads: HPCA ’03 • large SMT performance w. small SMTs SMT instruction speculation: TOCS ‘03 • a good thread mix is the most important performance factor CSE 548P

  28. Compiler Research Tuning compiler optimizations for SMT: Micro ‘97 & IJPP ’99 • data decomposition: use cyclic iteration scheduling • tiling: use cyclic tiling; no tile size sweet spot • software speculation: generates too much overhead code on integer programs Communicate last-use info to HW for early register deallocation: TPDS ’99 • now need ¼ the renaming registers Compiling for fewer registers/thread: HPCA ’03 • surprisingly little additional spill code (avg. 3%) CSE 548P

  29. OS Research Analysis of OS behavior on SMT: ASPLOS ‘00 • Kernel-kernel conflicts in I$ & D$ & branch mispredictions ameliorated by SMT instruction issue + thread-sharing in HW OS/runtime support for mini-threads: HPCA ’03 • dedicated server: recompile for fewer registers • multiprogrammed environment: multiple versions OS/runtime support for executing threaded programs: ISCA ’98 & PPoPP ’03 • dynamic memory allocation, synchronization, stack offsetting, page mapping CSE 548P

  30. Others are Now Carrying the Ball • Fault detection & recovery • Thread-level speculation • Instruction & data prefetching • Instruction issue hardware design • Thread scheduling • Single-thread execution • Profiling executing threads • SMT-CMP hybrids CSE 548P

  31. SMT Collaborators UW: Hank Levy, Steve Gribble, Dean Tullsen (UCSD), Jack Lo (Transmeta), Sujay Parekh (IBM Yorktown), Brian Dewey (Microsoft), Manu Thambi (Microsoft), Josh Redstone (interviewing), Mike Swift, Luke McDowell, Steve Swanson, Aaron Eakin (HP), Dimitriy Portnov (Google) DEC/Compaq: Joel Emer (now Intel), Rebecca Stamm, Luiz Barroso (now Google), Kourosh Gharachorloo (now Google) For more info on SMT: http://www.cs.washington.edu/research/smt CSE 548P

  32. SMT Performance SMT improves throughput by converting TLP from multiple contexts into ILP • performance increases with more contexts [Bar chart: Apache IPC vs. number of contexts (1, 2, 4, 8)] CSE 548P

  33. A Problem with Large SMTs Register file will not scale in future technologies • e.g., 4-context Alpha 21464: • 3 cycles to access • 4 times the size of the 64KB instruction cache First SMT implementations will be small • limited to 2 hardware contexts • expect lower throughput gains Goal of mini-threads: achieve performance gains of larger SMTs without the register cost CSE 548P

  34. Mini-threads: the Concept Execute more than one thread, called a mini-thread, within each context Mini-threads in a context share the context’s architectural register set CSE 548P

  35. SMT Processor Model [Diagram: two hardware contexts, each with one PC and its own set of architectural registers] CSE 548P

  36. Mini-thread Processor Model [Diagram: each context gains a second PC] CSE 548P

  37. Mini-thread Processor Model [Diagram: both PCs in a context fetch independently] CSE 548P

  38. Mini-thread Processor Model [Diagram: the mini-threads within each context share that context’s architectural register set] CSE 548P

  39. Mini-threads Executing on SMT The HW side: Add extra PCs to each hardware context • mini-threads execute using their own PC • mini-threads share the context’s architectural register set CSE 548P

  40. Mini-threads Executing on SMT The SW side: applications can choose to trade off registers for TLP • ignore the extra PCs and the machine is SMT • use the extra PCs for higher thread-level parallelism • an application is responsible for coordinating register usage across mini-threads within its context (a hypothetical interface is sketched below) CSE 548P
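
One way to picture the software side is the hypothetical runtime interface below; mt_fork and mt_self are invented names for illustration, since the papers define the hardware mechanism (extra PCs per context) but not a particular API.

    /* Hypothetical mini-thread interface (names are illustrative). */
    typedef void (*mt_entry_t)(void *arg);

    int mt_fork(mt_entry_t entry, void *arg); /* start a mini-thread on a spare
                                                 PC in the caller's context; it
                                                 shares the caller's
                                                 architectural registers */
    int mt_self(void);                        /* index of the calling
                                                 mini-thread in its context */

    /* Example convention the application itself must enforce: with 32
     * architectural registers and 2 mini-threads per context, each
     * mini-thread's code is compiled to use a disjoint half of the
     * register set (e.g., r0-r15 vs. r16-r31). */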

  41. The Register/TLP Trade-off + Throughput benefit of extra mini-threads (more TLP) − Instruction cost of fewer registers per mini-thread (spill code) +/− Spill-code impact on throughput +/− Instruction impact of extra mini-threads CSE 548P

  42. Two Mini-threads/Context Workload average increase in IPC • 1 context: 63% [Bar chart: % increase in IPC for Fmm, Barnes, Apache, Water-spatial, Raytrace] CSE 548P

  43. Two Mini-threads/Context Workload average increase in IPC • 1 context: 63% • 2 contexts: 40% • 4 contexts: 26% • 8 contexts: 9% [Bar chart: % increase in IPC for Fmm, Barnes, Apache, Water-spatial, Raytrace at 1, 2, 4 & 8 contexts] CSE 548P

  44. 16 Registers per Mini-thread • Only 3% more dynamic instructions on average! • IPC cost can be greater [Bar chart: % change in dynamic instructions for Fmm, Barnes, Apache, Water-spatial, Raytrace] CSE 548P

  45. 16 Registers per Mini-thread • 7% savings in spill code [Bar chart: % change in dynamic instructions for Fmm, Barnes, Apache, Water-spatial, Raytrace] CSE 548P

  46. 16 Registers per Mini-thread • Kernel: 0.8% [Bar chart: % change in dynamic instructions for Fmm, Barnes, Apache, Water-spatial, Raytrace] CSE 548P

  47. Bottom Line for 2 Mini-threads on SMT Workload average speedup • 1 context: 60% • 2 contexts: 38% • 4 contexts: 20% • 8 contexts: -2% [Bar chart: % speedup for Fmm, Barnes, Apache, Water-spatial, Raytrace at 1, 2, 4 & 8 contexts] CSE 548P

  48. Bottom Line for 2 Mini-threads on SMT Workload average speedup, if free to choose • 1 context: 60% • 2 contexts: 38% • 4 contexts: 22% • 8 contexts: 6% [Bar chart: % speedup for Fmm, Barnes, Apache, Water-spatial, Raytrace at 1, 2, 4 & 8 contexts] CSE 548P

  49. Bottom Line for 2 Mini-threads on SMT Workload average speedup • 1 context: 60% • 2 contexts: 38% • 4 contexts: 20% • 8 contexts: -2% [Bar chart: % speedup for Fmm, Barnes, Apache, Water-spatial, Raytrace at 1, 2, 4 & 8 contexts; data labels 83, 66, 43 & 10] CSE 548P

  50. Operating System & Runtime Support The issues: • Runtime and/or OS support for managing mini-threads • Compatible register usage between mini-threads & OS/runtime CSE 548P
