
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm

Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Presented by Kim Ki Young @ DCSLab.



  1. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor • Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm • Presented by Kim Ki Young @ DCSLab

  2. Introduction • Simultaneous Multithreading (SMT) • A technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units • Attacks two major impediments to processor utilization • long latencies • limited per-thread parallelism

  3. In this paper • Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor • Show that SMT need not compromise single-thread performance • Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized model • Show how simultaneous multithreading creates an advantage previously unexploitable in other architectures

  4. Simultaneous Multithreading Processor Architecture

  5. Simultaneous Multithreading Processor Architecture • A projection of current superscalar design trends 3-5 years into the future • Changes necessary to support simultaneous multithreading • Multiple program counters • Separate return stack for each thread • Per-thread instruction retirement, instruction queue flush, and trap mechanisms • A thread id with each branch target buffer entry • A larger register file
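
The per-thread state listed on this slide can be pictured as a small data structure. The following is only a minimal sketch: the class and field names are hypothetical, the count of 8 hardware contexts is an assumption not stated on the slide, and the lookup rule (ignore an entry tagged with another thread's ID) is one plausible reading of why each branch target buffer entry carries a thread ID.

```python
from dataclasses import dataclass, field

NUM_CONTEXTS = 8  # assumed number of hardware contexts; not stated on the slide


@dataclass
class ThreadContext:
    """Per-thread front-end state (hypothetical field names)."""
    thread_id: int
    pc: int = 0                                       # one program counter per thread
    return_stack: list = field(default_factory=list)  # separate return stack per thread


@dataclass(frozen=True)
class BTBEntry:
    """Branch target buffer entry tagged with the owning thread."""
    thread_id: int
    branch_pc: int
    target_pc: int


def btb_lookup(btb, thread_id, pc):
    """Return the predicted target only if the entry belongs to this thread."""
    entry = btb.get(pc)
    if entry is not None and entry.thread_id == thread_id:
        return entry.target_pc
    return None  # no entry, or the entry was installed by another thread


contexts = [ThreadContext(thread_id=t) for t in range(NUM_CONTEXTS)]
btb = {0x400: BTBEntry(thread_id=1, branch_pc=0x400, target_pc=0x480)}

print(btb_lookup(btb, 1, 0x400))  # thread 1 owns the entry, so it gets the target
print(btb_lookup(btb, 0, 0x400))  # thread 0 gets no prediction from thread 1's entry
```

Everything else in the front end (the fetch unit, the BTB storage itself, the instruction queues) stays shared between threads in this sketch, which matches the slide's point that only a short list of structures needs to be replicated or tagged.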

  6. Simultaneous Multithreading Processor Architecture

  7. Hardware Details

  8. Methodology • Simulator • derived from MIPSI, a MIPS-based simulator • executes unmodified Alpha object code • Workload • SPEC92 benchmark suite • five floating-point programs and two integer programs, plus TeX • Compiler • Multiflow trace scheduling compiler

  9. Performance of the Base Hardware Design

  10. Performance of the Base Hardware Design • With only a single thread running, throughput is less than 2% below that of a superscalar without SMT support • Peak throughput is 84% higher than the superscalar • Three problems • IQ size • Fetch throughput • Lack of parallelism

  11. The Fetch Unit • Improve fetch throughput without increasing the fetch bandwidth • Naming scheme: alg.num1.num2 • alg: fetch selection method • num1: number of threads that can fetch in one cycle • num2: maximum number of instructions fetched per thread in one cycle • Partitioning the fetch unit • RR.1.8 • RR.2.4, RR.4.2 • require some hardware additions • RR.2.8 • requires additional logic
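
The alg.num1.num2 naming scheme above can be sketched in a few lines of Python. This is an illustrative model, not the paper's simulator: `FETCH_WIDTH = 8` assumes the total fetch bandwidth stays at 8 instructions per cycle (consistent with RR.1.8 and with the slide's "without increasing the fetch bandwidth"), and the function names, the `available` map, and the simple rotation in `round_robin_pick` are all hypothetical.

```python
from typing import NamedTuple

FETCH_WIDTH = 8  # assumed total instructions fetched per cycle


class FetchScheme(NamedTuple):
    alg: str                # fetch selection method, e.g. "RR" for round-robin
    threads_per_cycle: int  # num1
    insts_per_thread: int   # num2


def parse_scheme(name):
    """Split a name such as 'RR.2.4' into its three fields."""
    alg, num1, num2 = name.split(".")
    return FetchScheme(alg, int(num1), int(num2))


def round_robin_pick(scheme, num_threads, cycle):
    """Thread IDs chosen this cycle under a simple rotation (an assumption)."""
    start = (cycle * scheme.threads_per_cycle) % num_threads
    return [(start + i) % num_threads for i in range(scheme.threads_per_cycle)]


def fetch(scheme, selected, available):
    """Return {thread: instructions fetched} for one cycle.

    available[t] is how many instructions thread t could supply this cycle,
    for example before a taken branch ends its fetch block.
    """
    remaining = FETCH_WIDTH
    fetched = {}
    for t in selected[:scheme.threads_per_cycle]:
        n = min(scheme.insts_per_thread, available[t], remaining)
        fetched[t] = n
        remaining -= n
    return fetched


available = {0: 3, 1: 6}  # thread 0 can only supply 3 instructions this cycle
print(fetch(parse_scheme("RR.1.8"), [0], available))     # {0: 3}
print(fetch(parse_scheme("RR.2.4"), [0, 1], available))  # {0: 3, 1: 4}
print(fetch(parse_scheme("RR.2.8"), [0, 1], available))  # {0: 3, 1: 5}
```

In this toy case RR.1.8 wastes five of the eight fetch slots, RR.2.4 wastes one, and RR.2.8 fills all eight, which illustrates why splitting the same fetch bandwidth across threads can raise fetch throughput.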

  12. The Fetch Unit

  13. The Fetch Unit • Fetch Policies: give fetch priority to threads • BRCOUNT • that are least likely to be on a wrong path • MISSCOUNT • that have the fewest outstanding D-cache misses • ICOUNT • with the fewest instructions in decode, rename, and the instruction queues • IQPOSN • with instructions farthest from the head of the IQ
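
Each of the four policies above amounts to ranking threads by one per-thread counter and fetching from the best-ranked ones. The sketch below makes that concrete under stated assumptions: the `ThreadStats` fields are hypothetical, the unresolved-branch count is used as a proxy for "likely to be on a wrong path", and ties are broken by thread ID purely for determinism.

```python
from dataclasses import dataclass


@dataclass
class ThreadStats:
    thread_id: int
    unresolved_branches: int  # BRCOUNT input (proxy for wrong-path likelihood)
    outstanding_dmisses: int  # MISSCOUNT input
    insts_pre_issue: int      # ICOUNT input: instructions not yet issued
    iq_distance: int          # IQPOSN input: distance of oldest instruction from IQ head


# Lower key value means higher fetch priority.
POLICY_KEYS = {
    "BRCOUNT": lambda s: s.unresolved_branches,
    "MISSCOUNT": lambda s: s.outstanding_dmisses,
    "ICOUNT": lambda s: s.insts_pre_issue,
    "IQPOSN": lambda s: -s.iq_distance,  # farther from the head is better
}


def select_threads(policy, stats, n):
    """Return the IDs of the n threads with the highest fetch priority."""
    key = POLICY_KEYS[policy]
    ranked = sorted(stats, key=lambda s: (key(s), s.thread_id))
    return [s.thread_id for s in ranked[:n]]


stats = [
    ThreadStats(0, unresolved_branches=3, outstanding_dmisses=0, insts_pre_issue=20, iq_distance=1),
    ThreadStats(1, unresolved_branches=1, outstanding_dmisses=2, insts_pre_issue=5, iq_distance=10),
    ThreadStats(2, unresolved_branches=0, outstanding_dmisses=1, insts_pre_issue=12, iq_distance=4),
    ThreadStats(3, unresolved_branches=2, outstanding_dmisses=0, insts_pre_issue=9, iq_distance=7),
]

for policy in POLICY_KEYS:
    print(policy, select_threads(policy, stats, 2))
```

With these made-up counters, ICOUNT picks threads 1 and 3, the two with the fewest instructions waiting to issue, so a thread that is clogging the queue (thread 0 here) is starved of new fetch slots until it drains.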

  14. The Fetch Unit

  15. The Fetch Unit

  16. The Fetch Unit • Unblocking the Fetch Unit • BIGQ • increase the IQ's size without increasing the search space • double the queue size, but search only the first 32 entries for issue • ITAG • do the I-cache tag lookup a cycle early
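
The BIGQ idea (a larger queue whose issue logic still examines only a fixed window) can be sketched as follows. This is an illustration under assumptions: the queue is modeled as a Python list ordered oldest first, the 64-entry capacity follows from doubling an assumed 32-entry baseline, and the helper names are hypothetical.

```python
IQ_CAPACITY = 64    # doubled queue size (assumes a 32-entry baseline)
SEARCH_WINDOW = 32  # issue logic only examines the oldest 32 entries


def can_insert(iq):
    """The fetch/rename stages may keep filling the queue up to its full capacity."""
    return len(iq) < IQ_CAPACITY


def issue_candidates(iq, is_ready):
    """Ready instructions among the first SEARCH_WINDOW entries, oldest first."""
    return [inst for inst in iq[:SEARCH_WINDOW] if is_ready(inst)]


# Toy queue: instructions are just integers, and the even ones are ready.
iq = list(range(40))
ready = issue_candidates(iq, lambda inst: inst % 2 == 0)

print(len(ready))       # 16 ready instructions fall inside the window
print(34 in ready)      # False: entry 34 is ready but outside the searched window
print(can_insert(iq))   # True: entries beyond the window still buffer fetched work
```

The entries past the window act purely as a buffer, so the fetch unit stalls less often on a full queue while the cost of the issue search stays that of a 32-entry queue.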

  17. Choosing Instructions for Issue • Two sources of issue slot waste • Wrong-path instructions • result from mispredicted branches • Optimistically issued instructions • result from cache misses or bank conflicts • Issue Algorithms • OPT_LAST • SPEC_LAST • BRANCH_FIRST
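
The slide names the issue algorithms without defining them. The sketch below encodes one reading of the names, which should be treated as an assumption: OPT_LAST issues optimistic instructions only after all others, SPEC_LAST does the same for speculative instructions, and BRANCH_FIRST issues branches as early as possible. `OLDEST_FIRST` is included as an assumed baseline, and the `Inst` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    age: int                   # smaller value means older instruction
    is_branch: bool = False
    speculative: bool = False  # still behind an unresolved branch
    optimistic: bool = False   # issued on the assumption that a load will hit


# Each key sorts instructions into issue order; within a class, oldest goes first.
ISSUE_KEYS = {
    "OLDEST_FIRST": lambda i: i.age,
    "OPT_LAST": lambda i: (i.optimistic, i.age),
    "SPEC_LAST": lambda i: (i.speculative, i.age),
    "BRANCH_FIRST": lambda i: (not i.is_branch, i.age),
}


def issue_order(ready, algorithm):
    """Return the ready instructions in the order they would be offered issue slots."""
    return sorted(ready, key=ISSUE_KEYS[algorithm])


ready = [
    Inst(age=0, optimistic=True),
    Inst(age=1, speculative=True),
    Inst(age=2, is_branch=True),
    Inst(age=3),
]

for name in ISSUE_KEYS:
    print(name, [i.age for i in issue_order(ready, name)])
```

Because False sorts before True in Python, the tuple keys push the flagged class to the back (or, for BRANCH_FIRST, pull branches to the front) while preserving oldest-first order inside each group.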

  18. What are the Bottlenecks now? • Issue Bandwidth • not a bottleneck • Instruction Queue Size • not a bottleneck • experiments with larger queues increased throughput by less than 1% • Fetch Bandwidth • prime candidate for bottleneck status • increasing the IQ size and excess registers increased performance by another 7% • Branch Prediction • SMT is less sensitive to prediction accuracy

  19. What are the Bottlenecks now? • Speculative Execution • not a bottleneck • but eliminating it would be an issue • Memory Throughput • infinite-bandwidth caches would increase throughput by only 3% • Register File Size • no sharp drop-off point • Fetch throughput is still a bottleneck

  20. Summary • Borrows heavily from conventional superscalar design, requiring little additional hardware support • Minimizes the impact on single-thread performance, running only 2% slower in that scenario • Achieves significant throughput improvements over the superscalar when many threads are running

  21. SMT in modern commercial implementations • Intel Pentium 4, 2002 • Hyper-Threading Technology (HTT) • 30% speed improvement • MIPS MT • IBM POWER5, 2004 • two-thread SMT engine • Sun UltraSPARC T1, 2005 • CMT: SMT + CMP (chip-level multiprocessing)
