Advanced Topics in Pipelining - SMT and Single-Chip Multiprocessor • Priya Govindarajan • CMPE 200
Introduction • Researchers have proposed two alternative microarchitectures that exploit multiple threads of control: • simultaneous multithreading (SMT) [1] • chip multiprocessors (CMP) [2]
CMP vs. SMT • Why software and hardware trends will favor the CMP microarchitecture. • Conclusions drawn from performance results comparing simulated superscalar, SMT, and CMP microarchitectures.
SMT Discussion Outline • Introduction • Multithreading (MT) • Approaches to multithreading • Motivation for introducing SMT • Implementation of an SMT CPU • Performance estimates • Architectural abstraction
Introduction to SMT • SMT processors augment wide (issuing many instructions at once) superscalar processors with hardware that allows the processor to execute instructions from multiple threads of control concurrently • Dynamically selecting and executing instructions from many active threads simultaneously • Higher utilization of the processor’s execution resources • Provides latency tolerance when a thread stalls due to cache misses or data dependencies • When multiple threads are not available, however, an SMT processor simply looks like a conventional wide-issue superscalar.
Introduction to SMT • SMT exploits the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to support the integrated exploitation of TLP through MT. • MT can be built on top of an out-of-order processor by adding per-thread register renaming and PCs, and by providing the capability for instructions from multiple threads to commit.
Multithreading: Exploiting Thread-Level Parallelism • Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. • The processor must duplicate the independent state of each thread (register file, a separate PC, a separate page table). • Memory can be shared through the virtual memory mechanisms, which already support multiprocessing. • Needs hardware support for switching between threads.
Multithreading…. • Two main approaches to multithreading • Fine-grained multithreading • Coarse-grained multithreading
Fine-grained vs. coarse-grained multithreading • Fine-grained: switches between threads on each instruction, interleaving them in round-robin order and skipping any threads that are stalled. • Coarse-grained: switches threads only on costly stalls. (The two switching policies are contrasted in the sketch below.)
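As a rough illustration only (not any particular processor's logic), the following sketch contrasts the two policies: a fine-grained scheduler rotates round-robin every cycle and skips stalled threads, while a coarse-grained scheduler stays with one thread until it takes a costly stall. The Thread structure and function names are invented for this sketch.

#include <cstdio>
#include <vector>

// Hypothetical per-thread state used only to illustrate switch policies.
struct Thread {
    bool stalled = false;       // e.g., waiting on any cache miss
    bool costlyStall = false;   // e.g., a long-latency miss worth switching away from
};

// Fine-grained: pick a different, non-stalled thread every cycle (round-robin).
int fineGrainedNext(const std::vector<Thread>& threads, int current) {
    int n = static_cast<int>(threads.size());
    for (int step = 1; step <= n; ++step) {
        int candidate = (current + step) % n;
        if (!threads[candidate].stalled)
            return candidate;           // interleave, skipping stalled threads
    }
    return current;                     // all threads stalled; stay put
}

// Coarse-grained: keep the current thread until it takes a costly stall.
int coarseGrainedNext(const std::vector<Thread>& threads, int current) {
    if (!threads[current].costlyStall)
        return current;                 // short or no stall: do not switch
    return fineGrainedNext(threads, current);
}

int main() {
    std::vector<Thread> threads(4);
    threads[1].stalled = true;          // thread 1 is waiting on memory
    std::printf("fine-grained after thread 0: thread %d\n", fineGrainedNext(threads, 0));
    std::printf("coarse-grained after thread 0: thread %d\n", coarseGrainedNext(threads, 0));
}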
Fine-grained multithreading Advantages • Hides throughput losses that arise from both short and long stalls. Disadvantages • Slows down the execution of an individual thread, since a thread that is ready to execute without stalling will be delayed by instructions from other threads.
Coarse-grained multithreading Advantages • Relieves the need to make thread switching essentially free, and is much less likely to slow down the execution of an individual thread.
Coarse-grained multithreading Disadvantages • Throughput losses, especially from shorter stalls. • This is because a coarse-grained processor issues instructions from a single thread; when a stall occurs, the pipeline must be emptied or frozen. • The new thread that begins executing after the stall must fill the pipeline before any of its instructions can complete.
Simultaneous Multithreading • A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP. Why? • Modern multiple-issue processors often have more functional-unit parallelism available than a single thread can effectively use. • With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them (see the issue sketch below).
Challenges for an SMT processor • Dealing with the larger register file needed to hold multiple contexts • Maintaining low overhead on the clock cycle, particularly in instruction issue and completion • Ensuring that cache conflicts caused by the simultaneous execution of multiple threads do not cause significant performance degradation.
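As a toy model of this idea (invented structures and numbers, not a real pipeline), the sketch below fills a fixed number of issue slots each cycle from whichever threads have ready instructions, so a stall in one thread does not leave the slots empty.

#include <cstdio>
#include <deque>
#include <vector>

// Toy model of SMT instruction issue: each cycle, up to ISSUE_WIDTH ready
// instructions are chosen from any active thread. Instr and the queues are
// hypothetical; renaming is assumed to have removed false dependences, so
// instructions from different threads are independent by construction.

constexpr int ISSUE_WIDTH = 4;

struct Instr { int thread; int id; bool ready; };

int main() {
    std::vector<std::deque<Instr>> queues = {
        { {0, 0, true}, {0, 1, false} },   // thread 0: second op waits on a load
        { {1, 0, true}, {1, 1, true } },   // thread 1: both ready
        { {2, 0, true}, {2, 1, true } },   // thread 2: both ready
    };

    int issued = 0;
    for (int t = 0; t < static_cast<int>(queues.size()) && issued < ISSUE_WIDTH; ++t) {
        while (!queues[t].empty() && queues[t].front().ready && issued < ISSUE_WIDTH) {
            Instr in = queues[t].front();
            queues[t].pop_front();
            std::printf("slot %d <- thread %d, instr %d\n", issued, in.thread, in.id);
            ++issued;
        }
    }
    std::printf("%d of %d issue slots filled this cycle\n", issued, ISSUE_WIDTH);
}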
SMT • SMT will significantly enhance multistream performance across a wide range of applications without significant hardware cost and without major architectural changes
Instruction Issue • Reduced functional-unit utilization due to dependencies
Superscalar Issue • Superscalar issue gives more performance, but lower utilization
Simultaneous Multithreading • Maximum utilization of functional units by independent operations
Fine-Grained Multithreading • Interleaving leaves no wholly empty cycles, but intra-thread dependencies still limit performance
Architectural Abstraction • 1 CPU with 4 Thread Processing Units (TPUs) • Shared hardware resources
Changes for SMT • Basic pipeline – unchanged • Replicated resources • Program counters • Register maps • Shared resources • Register file (size increased) • Instruction queue • First- and second-level caches • Translation buffers • Branch predictor (the replicated/shared split is sketched below)
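A minimal data-layout sketch of this replicated/shared split is shown below; the names and sizes are invented for illustration and only indicate which structures exist once per thread and which are shared by all threads.

#include <array>
#include <cstdint>
#include <vector>

constexpr int HW_THREADS = 4;

struct PerThreadState {                       // replicated once per hardware thread
    std::uint64_t pc = 0;                     // program counter
    std::array<std::uint8_t, 32> regMap{};    // architectural-to-physical register map
};

struct SharedState {                          // one copy shared by all threads
    std::vector<std::uint64_t> physRegs = std::vector<std::uint64_t>(256);  // enlarged register file
    std::vector<std::uint32_t> instrQueue;    // mixes instructions from all threads
    // first/second-level caches, TLBs, and the branch predictor would also be shared
};

struct SmtCore {
    std::array<PerThreadState, HW_THREADS> threads;  // replicated resources
    SharedState shared;                               // shared resources
};

int main() {
    SmtCore core;
    core.threads[0].pc = 0x1000;              // each thread tracks its own PC
    (void)core;
}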
Single-Chip Multiprocessor • CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores. • If an application cannot be effectively decomposed into threads, CMPs will be underutilized.
Comparing Alternative Architectures • Superscalar architecture: issues up to 12 instructions per cycle
Comparing … • SMT architecture: 8 separate PCs, executes instructions from 8 different threads concurrently, multi-banked caches
Comparing … • Chip multiprocessor architecture: 8 small 2-issue superscalar processors; depends on TLP
SMT and Memory • Places large demands on the memory system • SMT requires more bandwidth from the primary cache (multithreading allows more loads and stores in flight) • To provide this, the design uses 128-Kbyte primary caches • Complex MESI (modified, exclusive, shared, invalid) cache-coherence protocol (a minimal state sketch follows below)
CMP and Memory • The eight cores are independent and integrated with their own pairs of caches – another form of clustering, which leads to a high-frequency design for the primary cache system • The small cache size and tight connection to these caches allow single-cycle access • Needs only a simpler coherence scheme
Quantitative performance: CPU cores • To keep the processor’s execution units busy, SMT features • advanced branch prediction • register renaming • out-of-order issue • non-blocking data caches • all of which make it inherently complex
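For reference, here is a minimal sketch of the four MESI states and a few common transitions for a single cache line as seen by one cache. It is a simplification: bus transactions, write-back details, and protocol races are omitted.

#include <cstdio>

enum class Mesi { Modified, Exclusive, Shared, Invalid };

// State after this cache's own processor reads the line.
Mesi onLocalRead(Mesi s, bool otherCachesHoldLine) {
    if (s == Mesi::Invalid)
        return otherCachesHoldLine ? Mesi::Shared : Mesi::Exclusive;
    return s;  // M, E, S: a local read hits and keeps the current state
}

// State after this cache's own processor writes the line
// (other copies are invalidated by a broadcast not modeled here).
Mesi onLocalWrite(Mesi) { return Mesi::Modified; }

// State after another processor reads the line (snooped read).
Mesi onRemoteRead(Mesi s) {
    if (s == Mesi::Modified || s == Mesi::Exclusive)
        return Mesi::Shared;   // M additionally writes the dirty data back
    return s;
}

// State after another processor writes the line (snooped invalidate).
Mesi onRemoteWrite(Mesi) { return Mesi::Invalid; }

int main() {
    Mesi line = Mesi::Invalid;
    line = onLocalRead(line, /*otherCachesHoldLine=*/false);  // I -> E
    line = onLocalWrite(line);                                // E -> M
    line = onRemoteRead(line);                                // M -> S (with write-back)
    std::printf("final state: %d\n", static_cast<int>(line));
}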
CMP Solution • As issue width grows, the number of registers increases and the number of ports on each register file must increase • CMP approach: exploit ILP using more processors instead of larger issue widths within a single processor, which keeps the hardware simple
SMT Approach • Longer cycle times • Long, high-capacitance I/O wires span the large buffers, queues, and register files • Extensive use of multiplexers and crossbars to interconnect these units adds more capacitance • The associated delays dominate the delay along the CPU’s critical path • The cycle-time impact of these structures can be mitigated by careful design: deep pipelining, or breaking the structures into small, fast clusters of closely related components connected by short wires • But deep pipelining increases branch misprediction penalties, and clustering tends to reduce the ability of the processor to find and exploit instruction-level parallelism.
CMP Solution • A short cycle time can be targeted with relatively little design effort, since the hardware is naturally clustered – each of the small CPUs is already a very small, fast cluster of components. • Since the OS allocates a single software thread of control to each processor, the partitioning of work among the “clusters” is natural and requires no hardware to dynamically allocate instructions to different clusters • Heavy reliance on software to direct instructions to clusters limits the ILP a CMP can exploit, but allows the clusters within the CMP to be small and fast.
SMT and CMP • From an architectural point of view, the SMT processor’s flexibility makes it superior. • However, the need to limit the effects of interconnect delays, which are growing much larger relative to transistor gate delays, will also drive billion-transistor chip design. • Interconnect delays will force the microarchitecture to be partitioned into small, localized processing elements. • CMP is much more promising because it is already partitioned into individual processing cores. • Because these cores are relatively simple, they are amenable to speed optimization and can be designed relatively easily.
Compiler support for SMT and CMP • Programmers must find TLP in order to maximize CMP performance • SMT also requires programmers to explicitly divide code into threads to get maximum performance, but unlike CMP it can dynamically find more ILP if TLP is limited • With a multithreaded OS, these problems should prove less daunting • Having all eight CPUs on a single chip allows designers to exploit TLP even when threads communicate frequently (a simple explicit-threading example follows below)
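As a simple example of the explicit thread-level decomposition a CMP depends on (illustrative code, not taken from any of the cited papers), the sketch below splits a reduction loop across one thread per core using standard C++ threads.

#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const int cores = 4;                       // e.g., one software thread per CMP core
    std::vector<int> data(1 << 20, 1);
    std::vector<long long> partial(cores, 0);  // one partial sum per thread, no sharing
    std::vector<std::thread> workers;

    const std::size_t chunk = data.size() / cores;
    for (int c = 0; c < cores; ++c) {
        workers.emplace_back([&, c] {
            std::size_t lo = c * chunk;
            std::size_t hi = (c == cores - 1) ? data.size() : lo + chunk;
            // Each thread reduces its own contiguous chunk of the array.
            partial[c] = std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);
        });
    }
    for (auto& w : workers) w.join();

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::printf("sum = %lld\n", total);
}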
Performance results • A comparison of the three architectures indicates that a multiprocessor on a chip will be the easiest to implement while still offering excellent performance.
Disadvantages of CMP • When code cannot be multithreaded, only one processor can be applied to the task • However, a single 2-issue processor on the CMP is only moderately slower than the superscalar or SMT processor, since applications with little thread-level parallelism also tend to lack ILP
Conclusion on CMP • CMP is a promising candidate for a billion-transistor architecture. • Offers superior performance using simple hardware • For code that can be parallelized into multiple threads, the small CMP cores will perform comparably or better • Easier to design and optimize • SMT uses resources more efficiently than CMP, but more execution units can be included in a CMP of similar area, since less die area needs to be devoted to wide-issue logic.
References
• D. Tullsen, S. Eggers, and H. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proc. 22nd Ann. Int’l Symp. Computer Architecture, ACM Press, New York, 1995, pp. 392-403.
• J. Borkenhagen, R. Eickemeyer, and R. Kalla, “A Multithreaded PowerPC Processor for Commercial Servers,” IBM Journal of Research and Development, Vol. 44, No. 6, Nov. 2000, pp. 885-898.
• J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen, “Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading,” ACM Transactions on Computer Systems, 15(2), Aug. 1997.
• K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The Case for a Single-Chip Multiprocessor,” Proc. 7th Int’l Conf. Architectural Support for Programming Languages and Operating Systems, ACM, New York, Oct. 1996, pp. 2-11.
• L. Hammond, B. A. Nayfeh, and K. Olukotun, “A Single-Chip Multiprocessor,” IEEE Computer, Sept. 1997.
• M. Gulati and N. Bagherzadeh, “Performance Study of a Multithreaded Superscalar Microprocessor,” Proc. 2nd Int’l Symp. High-Performance Computer Architecture, Feb. 1996, pp. 291-301.
• K. Park, S.-H. Choi, Y. Chung, W.-J. Hahn, and S.-H. Yoon, “On-Chip Multiprocessor with Simultaneous Multithreading,” http://etrij.etri.re.kr/etrij/pdfdata/22-04-02.pdf
• B. A. Nayfeh, L. Hammond, and K. Olukotun, “Evaluation of Design Alternatives for a Multiprocessor Microprocessor,” Proc. 23rd Ann. Int’l Symp. Computer Architecture, May 1996, pp. 67-77.
• L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun, “The Stanford Hydra CMP,” IEEE Micro, Vol. 20, No. 2, Mar./Apr. 2000.
• S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, “Simultaneous Multithreading: A Platform for Next-Generation Processors,” IEEE Micro, Sept./Oct. 1997, pp. 12-18.
• V. Krishnan and J. Torrellas, “Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor,” Proc. ACM Int’l Conf. Supercomputing (ICS ’98), June 1998, pp. 85-92.
• goethe.ira.uka.de/people/ungerer/proc-arch/EUROPAR-tutorial-slides.ppt
• http://www.acm.uiuc.edu/banks/20/6/page4.html
• Simultaneous Multithreading home page: http://www.cs.washington.edu/research/smt/
The Stanford Hydra Chip Multiprocessor Kunle Olukotun The Hydra Team Computer Systems Laboratory Stanford University
Technology Architecture • Transistors are cheap, plentiful and fast • Moore’s law • 100 million transistors by 2000 • Wires are cheap, plentiful and slow • Wires get slower relative to transistors • Long cross-chip wires are especially slow • Architectural implications • Plenty of room for innovation • Single cycle communication requires localized blocks of logic • High communication bandwidth across the chip easier to achieve than low latency
Exploiting Program Parallelism • Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions.
Hydra Approach • A single-chip multiprocessor architecture composed of simple fast processors • Multiple threads of control • Exploits parallelism at all levels • Memory renaming and thread-level speculation • Makes it easy to develop parallel programs • Keep design simple by taking advantage of single chip implementation
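As a very rough illustration of the thread-level speculation idea (not Hydra's actual hardware mechanism), the sketch below models loop iterations as speculative threads that keep read and write sets and must commit in program order. It conservatively flags a dependence violation whenever a more-speculative iteration's read set overlaps a less-speculative iteration's write set; the violating iteration would then be squashed and re-executed. The SpecThread structure and addresses are invented for this sketch.

#include <cstdio>
#include <unordered_set>

struct SpecThread {
    int iteration;                       // position in original program order
    std::unordered_set<int> readSet;     // addresses read while speculative
    std::unordered_set<int> writeSet;    // addresses written while speculative
};

// Conservative check: does `later` need to be squashed because of `earlier`?
bool mustSquash(const SpecThread& earlier, const SpecThread& later) {
    for (int addr : later.readSet)
        if (earlier.writeSet.count(addr))
            return true;                 // later may have consumed a stale value
    return false;
}

int main() {
    // Iteration 0 writes address 100; iteration 1 reads address 100.
    SpecThread t0{0, {}, {100}};
    SpecThread t1{1, {100}, {200}};

    if (mustSquash(t0, t1))
        std::printf("iteration %d squashed and re-executed\n", t1.iteration);
    else
        std::printf("both iterations commit in order\n");
}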
Outline • Base Hydra Architecture • Performance of base architecture • Speculative thread support • Speculative thread performance • Improving speculative thread performance • Hydra prototype design • Conclusions
The Base Hydra Design • Shared 2nd-level cache • Low latency interprocessor communication (10 cycles) • Separate read and write buses • Single-chip multiprocessor • Four processors • Separate primary caches • Write-through data caches to maintain coherence
Hydra vs. Superscalar • Figure: speedup of a 4-processor, 2-way-issue Hydra and a 6-way-issue superscalar across compress, eqntott, m88ksim, MPEG2, applu, apsi, swim, tomcatv, OLTP, and pmake • ILP only: the superscalar is 30-50% better than a single Hydra processor • ILP & fine-grained threads: superscalar and Hydra are comparable • ILP & coarse-grained threads: Hydra is 1.5-2x better • “The Case for a CMP,” ASPLOS ’96