Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm Presented by Kim Ki Young @ DCSLab
Introduction • Simultaneous Multithreading (SMT) • A technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor’s functional units • Addresses two major impediments to processor utilization • long latencies • limited per-thread parallelism
In this paper • Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor • Show that SMT need not compromise single-thread performance • Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized model • Show how SMT creates an advantage previously unexploitable in other architectures: the ability to choose the best instructions, from all threads, for both fetch and issue each cycle
Simultaneous Multithreading Processor Architecture • A projection of current superscalar design trends 3-5 years into the future • Changes necessary to support simultaneous multithreading • Multiple program counters • Separate return stack for each thread • Per-thread instruction retirement, instruction queue flush, and trap mechanisms • A thread id with each branch target buffer entry • A larger register file
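One of the changes listed above, a thread ID stored with each branch target buffer entry, can be illustrated with a minimal sketch. The class name `TaggedBTB`, the direct-mapped organization, the 256-entry size, and the 4-byte instruction alignment are all assumptions made for illustration; the slides do not specify the BTB layout.

```python
class TaggedBTB:
    """Hypothetical direct-mapped branch target buffer whose entries carry a thread ID."""

    def __init__(self, num_entries=256):  # size is an illustrative assumption
        self.num_entries = num_entries
        self.table = [None] * num_entries

    def _index(self, pc):
        # Assumes 4-byte aligned instructions, so the low two bits are dropped.
        return (pc >> 2) % self.num_entries

    def update(self, thread_id, pc, target):
        # Record the predicted target together with the owning thread.
        self.table[self._index(pc)] = (thread_id, pc, target)

    def lookup(self, thread_id, pc):
        # A hit requires both the PC and the thread ID to match, so one
        # thread never consumes a target recorded by another thread.
        entry = self.table[self._index(pc)]
        if entry is not None and entry[:2] == (thread_id, pc):
            return entry[2]
        return None


btb = TaggedBTB()
btb.update(thread_id=0, pc=0x1000, target=0x2000)
print(btb.lookup(0, 0x1000))  # 8192 (0x2000): same thread, same PC
print(btb.lookup(1, 0x1000))  # None: same PC, but a different thread
```

Without the thread ID in the tag, the second lookup would return thread 0's target, which is the kind of cross-thread misprediction the tag is meant to avoid.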
Methodology • Simulator • derived from MIPSI, a MIPS-based simulator • executes unmodified Alpha object code • Workload • five floating-point programs and two integer programs from the SPEC92 benchmark suite, plus TeX • Compiled with the Multiflow trace scheduling compiler
Performance of the Base Hardware Design • With only a single thread running, throughput is less than 2% below that of a superscalar without SMT support • Peak throughput is 84% higher than the superscalar's • Three problems limit throughput • IQ size • Fetch throughput • Lack of parallelism
The Fetch Unit • Goal: improve fetch throughput without increasing the fetch bandwidth • Scheme notation: alg.num1.num2 • alg: fetch selection method (RR = round-robin) • num1: number of threads that can fetch in one cycle • num2: maximum number of instructions fetched per thread in one cycle • Partitioning the fetch unit • RR.1.8 • RR.2.4, RR.4.2 • require some additional hardware • RR.2.8 • requires additional logic
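The alg.num1.num2 notation can be made concrete with a small sketch. The names `FetchScheme`, `parse_scheme`, and `fetch_cycle` are hypothetical, and the total fetch bandwidth of eight instructions per cycle is an assumption inferred from the RR.1.8 scheme. The model only counts instructions; it ignores cache lines, banks, and branches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchScheme:
    policy: str      # alg: fetch selection method, e.g. "RR"
    threads: int     # num1: threads that may fetch in one cycle
    per_thread: int  # num2: max instructions fetched per thread per cycle


def parse_scheme(name):
    """Split a name such as 'RR.2.4' into its three fields."""
    policy, num1, num2 = name.split(".")
    return FetchScheme(policy, int(num1), int(num2))


def fetch_cycle(scheme, available, total_bandwidth=8):
    """Return how many instructions each selected thread fetches this cycle.

    `available` lists, in priority order, how many instructions each thread
    could supply before its fetch block ends. `total_bandwidth` is assumed.
    """
    fetched = []
    remaining = total_bandwidth
    for n in available[:scheme.threads]:
        take = min(n, scheme.per_thread, remaining)
        fetched.append(take)
        remaining -= take
    return fetched


# The first thread can supply only 3 instructions this cycle.
print(fetch_cycle(parse_scheme("RR.1.8"), [3, 8]))  # [3]
print(fetch_cycle(parse_scheme("RR.2.4"), [3, 8]))  # [3, 4]
print(fetch_cycle(parse_scheme("RR.2.8"), [3, 8]))  # [3, 5]
```

In this toy model, RR.2.8 is the only scheme that fills all eight slots when the first thread comes up short, which suggests why it needs logic beyond a simple partition of the fetch unit.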
The Fetch Unit • Fetch Policies: give fetch priority to threads • BRCOUNT • that are least likely to be on a wrong path • MISSCOUNT • that have the fewest outstanding D-cache misses • ICOUNT • with the fewest instructions in decode, rename, and the instruction queues • IQPOSN • with instructions farthest from the head of the IQ
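The counter-based policies above share one shape: keep a per-thread count and fetch from the threads with the lowest value. A minimal sketch follows, assuming the counters are already maintained elsewhere. The function name `pick_threads` and the tie-break by thread ID are illustrative choices, not taken from the slides.

```python
def pick_threads(counters, num_threads):
    """Return the IDs of the `num_threads` threads with the lowest counter.

    `counters` maps thread ID to a per-thread count. For ICOUNT this would be
    instructions in the front end and queues; for MISSCOUNT, outstanding
    D-cache misses. Ties go to the lower thread ID (a simplifying assumption).
    """
    ranked = sorted(counters, key=lambda tid: (counters[tid], tid))
    return ranked[:num_threads]


# ICOUNT-style example: thread 0 has clogged the queues, threads 1 and 3 have not.
icount = {0: 12, 1: 3, 2: 7, 3: 3}
print(pick_threads(icount, 2))  # [1, 3]
```

Because the selection depends only on the counter values, the same function covers a BRCOUNT variant if one assumes the count of unresolved branches as the proxy for "likely to be on a wrong path".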
The Fetch Unit • Unblocking the Fetch Unit • BIGQ • increase the IQ’s size without increasing the search space • double the size, but search only the first 32 entries • ITAG • do the I-cache tag lookup a cycle early
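The BIGQ idea, a larger queue with an unchanged search window, can be sketched as follows. The class name `BigQueue`, the ready-flag representation, and the default capacity of 64 (double the 32 searchable entries) are assumptions for illustration.

```python
class BigQueue:
    """Hypothetical instruction queue that buffers more than it searches."""

    def __init__(self, capacity=64, search_window=32):
        self.capacity = capacity
        self.search_window = search_window
        self.entries = []  # oldest first; each entry is (name, ready)

    def insert(self, name, ready):
        """Append an instruction; return False if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False  # a full queue is what blocks the fetch unit
        self.entries.append((name, ready))
        return True

    def issue(self, width):
        """Issue up to `width` ready instructions from the search window only."""
        window = self.entries[:self.search_window]
        chosen = [e for e in window if e[1]][:width]
        for e in chosen:
            self.entries.remove(e)
        return [name for name, _ in chosen]


q = BigQueue(capacity=4, search_window=2)
for name, ready in [("a", False), ("b", True), ("c", True), ("d", True)]:
    q.insert(name, ready)
print(q.insert("e", True))  # False: queue is full
print(q.issue(2))           # ['b']: "c" and "d" are ready but outside the window
print(q.issue(2))           # ['c']: "c" has moved into the window
```

The entries beyond the window act purely as a buffer, so the issue logic stays the same size while fetch stalls on a full queue less often.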
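Choosing Instructions for Issue • Two sources of issue slot waste • Wrong-path instructions • result from mispredicted branches • Optimistically issued instructions • result from cache misses or bank conflicts • Issue Algorithms • OPT_LAST • issue optimistic instructions as late as possible • SPEC_LAST • issue speculative instructions as late as possible • BRANCH_FIRST • issue branches as early as possible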
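The three issue algorithms can be read as different sort keys over the set of ready instructions. The sketch below assumes an oldest-first baseline (named `OLDEST_FIRST` here) and uses hypothetical flags on each instruction; the `Instr` class and `issue_order` function are illustrative, not part of the presented design.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    age: int                   # lower value = older instruction
    is_branch: bool = False
    speculative: bool = False  # assumed: sits behind an unresolved branch
    optimistic: bool = False   # assumed: scheduled expecting a load to hit


# Each policy is a sort key; False sorts before True, so flagged classes go last.
POLICIES = {
    "OLDEST_FIRST": lambda i: (i.age,),                    # assumed baseline
    "OPT_LAST": lambda i: (i.optimistic, i.age),
    "SPEC_LAST": lambda i: (i.speculative, i.age),
    "BRANCH_FIRST": lambda i: (not i.is_branch, i.age),
}


def issue_order(ready, policy="OLDEST_FIRST"):
    """Return instruction names in the order the policy would issue them."""
    return [i.name for i in sorted(ready, key=POLICIES[policy])]


ready = [
    Instr("add", 0, optimistic=True),
    Instr("br", 1, is_branch=True),
    Instr("mul", 2, speculative=True),
    Instr("sub", 3),
]
print(issue_order(ready, "OPT_LAST"))      # ['br', 'mul', 'sub', 'add']
print(issue_order(ready, "SPEC_LAST"))     # ['add', 'br', 'sub', 'mul']
print(issue_order(ready, "BRANCH_FIRST"))  # ['br', 'add', 'mul', 'sub']
```

Within each class the order falls back to age, so every policy degenerates to the oldest-first baseline when no instruction carries the relevant flag.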
What are the Bottlenecks now? • Issue Bandwidth • not a bottleneck • Instruction Queue Size • not a bottleneck • experiments with larger queues increased throughput by less than 1% • Fetch Bandwidth • prime candidate for bottleneck status • increasing the IQ size and the number of excess registers increased performance by another 7% • Branch Prediction • SMT is less sensitive to branch prediction quality than a single-threaded processor
What are the Bottlenecks now? • Speculative Execution • not a bottleneck • but eliminating it would create one • Memory Throughput • infinite-bandwidth caches would increase throughput by only 3% • Register File Size • no sharp drop-off point • Fetch throughput is still a bottleneck
Summary • Borrows heavily from conventional superscalar design, requiring little additional hardware support • Minimizes the impact on single-thread performance, running only 2% slower in that scenario • Achieves significant throughput improvements over the superscalar when many threads are running
SMT in modern commercial implementations • Intel Pentium 4, 2002 • Hyper-Threading Technology (HTT) • up to 30% speed improvement • MIPS MT • IBM POWER5, 2004 • two-thread SMT engine • Sun UltraSPARC T1, 2005 • CMT: SMT + CMP (chip-level multiprocessing)