Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm

Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor • Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm • Presented by Kim Ki Young @ DCSLab

Introduction • Simultaneous Multithreading (SMT) • A technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor’s functional units • Two major impediments to processor utilization • long latencies • limited per-thread parallelism

In this paper • Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor • Show that SMT need not compromise single-thread performance • Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized model • Show how simultaneous multithreading creates an advantage previously unexploitable in other architectures: the ability to choose the best instructions, from all threads, for fetch and issue each cycle

Simultaneous Multithreading Processor Architecture

Simultaneous Multithreading Processor Architecture • A projection of current superscalar design trends 3-5 years into the future • Changes necessary to support simultaneous multithreading • Multiple program counters • Separate return stack for each thread • Per-thread instruction retirement, instruction queue flush, and trap mechanisms • A thread id with each branch target buffer entry • A larger register file
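The per-thread additions listed above can be pictured as a small amount of replicated state next to shared structures. The following is only a sketch: the class names, field names, addresses, and the choice of eight contexts are illustrative assumptions, not details taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """Per-thread front-end state (hypothetical field names)."""
    thread_id: int
    pc: int = 0                                        # one program counter per thread
    return_stack: list = field(default_factory=list)   # separate return stack per thread


@dataclass
class BTBEntry:
    """Branch target buffer entry tagged with the owning thread's id."""
    thread_id: int
    branch_pc: int
    target_pc: int


def btb_lookup(btb, thread_id, branch_pc):
    """Return the predicted target only if the entry belongs to this thread."""
    entry = btb.get(branch_pc)
    if entry is not None and entry.thread_id == thread_id:
        return entry.target_pc
    return None  # no entry, or the entry belongs to another thread


# Assumption: eight hardware contexts, each starting at a made-up address.
contexts = [ThreadContext(thread_id=t, pc=0x1000 * (t + 1)) for t in range(8)]
btb = {0x2000: BTBEntry(thread_id=1, branch_pc=0x2000, target_pc=0x2400)}
```

Here the thread id on each BTB entry keeps one thread from consuming a prediction that was recorded for another thread at the same address.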


Hardware Details

Methodology • Simulator • derived from MIPSI, a MIPS-based simulator • executes unmodified Alpha object code • Workload • SPEC92 benchmark suite • five floating point programs, two integer programs, TeX • Compiler • Multiflow trace scheduling compiler

Performance of the Base Hardware Design

Performance of the Base Hardware Design • With only a single thread, throughput is less than 2% below a superscalar without SMT support • Peak throughput is 84% higher than the superscalar • Three problems • IQ size • Fetch throughput • Lack of parallelism

The Fetch Unit • Improve fetch throughput without increasing the fetch bandwidth • Naming scheme: alg.num1.num2 • alg: fetch selection method (RR = round-robin) • num1: number of threads that can fetch in one cycle • num2: maximum number of instructions fetched per thread in one cycle • Partitioning the fetch unit • RR.1.8: baseline • RR.2.4, RR.4.2: some hardware addition • RR.2.8: additional logic is required
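To make the alg.num1.num2 naming concrete, here is a minimal sketch that parses a scheme name and computes the most instructions it can fetch in one cycle. The function names and the total fetch bandwidth of 8 instructions per cycle are assumptions made for illustration.

```python
from typing import NamedTuple


class FetchScheme(NamedTuple):
    alg: str                # fetch selection method, e.g. "RR"
    threads_per_cycle: int  # num1
    insts_per_thread: int   # num2


def parse_scheme(name):
    """Split a name such as 'RR.2.4' into its three fields."""
    alg, num1, num2 = name.split(".")
    return FetchScheme(alg, int(num1), int(num2))


def max_fetch_per_cycle(scheme, fetch_bandwidth=8):
    """Upper bound on instructions fetched per cycle.

    Assumption: the total fetch bandwidth stays fixed (8 here), so a scheme
    like RR.2.8 is capped by the bandwidth, not by num1 * num2.
    """
    return min(fetch_bandwidth, scheme.threads_per_cycle * scheme.insts_per_thread)


for name in ("RR.1.8", "RR.2.4", "RR.4.2", "RR.2.8"):
    print(name, max_fetch_per_cycle(parse_scheme(name)))
```

Under that assumption all four schemes share the same ceiling; they differ in how many threads can contribute to reaching it.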


The Fetch Unit • Fetch Policies: which threads get fetch priority • BRCOUNT • threads that are least likely to be on a wrong path • MISSCOUNT • threads that have the fewest outstanding D-cache misses • ICOUNT • threads with the fewest instructions in decode, rename, and the instruction queues • IQPOSN • threads whose instructions are farthest from the head of the IQ
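As an illustration of a counter-based fetch policy, the sketch below picks the threads to fetch from in the ICOUNT style. The function name, the dictionary of per-thread counts, and breaking ties by lowest thread id are assumptions; real hardware would more likely compare counters in parallel than sort.

```python
def icount_select(inflight_counts, num_threads):
    """Return the ids of the threads with the fewest in-flight front-end instructions.

    inflight_counts maps thread id -> number of that thread's instructions
    currently in the pre-issue stages. Ties go to the lower thread id
    (an assumption made to keep the example deterministic).
    """
    ranked = sorted(inflight_counts, key=lambda tid: (inflight_counts[tid], tid))
    return ranked[:num_threads]


counts = {0: 12, 1: 3, 2: 7, 3: 3}
print(icount_select(counts, 2))  # threads 1 and 3 have the fewest instructions
```

The same shape would fit BRCOUNT or MISSCOUNT by swapping the counter, for example unresolved branches or outstanding D-cache misses per thread.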


The Fetch Unit • Unblocking the Fetch Unit • BIGQ • increase the IQ’s size without increasing the search space • double the size, but search only the first 32 entries • ITAG • do the I-cache tag lookup a cycle early
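The BIGQ idea, a larger queue with an unchanged search window, can be sketched as follows. The list-based queue and the `ready` predicate are hypothetical stand-ins for the hardware structures.

```python
def issuable_candidates(queue, ready, search_window=32):
    """Return ready instructions, looking only at the first `search_window` entries.

    The queue may hold more entries than the window (for example 64); the
    extra entries act as a buffer and are not searched for issue.
    """
    return [inst for inst in queue[:search_window] if ready(inst)]


# A doubled queue of 64 entries; entries are just indices in this toy example.
queue = list(range(64))
print(issuable_candidates(queue, lambda i: i % 16 == 0))  # entries 32 and 48 are not searched
```

Entries beyond the window still help: they let fetch keep running when the searchable part of the queue is full of instructions that cannot issue yet.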

Choosing Instructions for Issue • Two sources of issue slot waste • Wrong-path instructions • result from mispredicted branches • Optimistically issued instructions • result from cache misses or bank conflicts • Issue Algorithms • OPT_LAST • SPEC_LAST • BRANCH_FIRST
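One way to read the three issue algorithms is as different sort keys over the ready instructions. The sketch below assumes that OPT_LAST issues optimistic instructions after all others, SPEC_LAST issues speculative instructions after all others, and BRANCH_FIRST issues branches as early as possible, with age as the tie-breaker. The `Inst` fields and the OLDEST_FIRST baseline are added here for comparison and are assumptions, not taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    age: int                      # lower value = older instruction
    is_branch: bool = False
    is_speculative: bool = False  # behind an unresolved branch
    is_optimistic: bool = False   # issued assuming an earlier load hits


def issue_order(insts, policy="OLDEST_FIRST"):
    """Order ready instructions by the chosen policy, oldest first within a class."""
    keys = {
        "OLDEST_FIRST": lambda i: (i.age,),              # assumed baseline
        "OPT_LAST": lambda i: (i.is_optimistic, i.age),
        "SPEC_LAST": lambda i: (i.is_speculative, i.age),
        "BRANCH_FIRST": lambda i: (not i.is_branch, i.age),
    }
    return sorted(insts, key=keys[policy])


ready = [Inst(0, is_optimistic=True), Inst(1, is_branch=True), Inst(2, is_speculative=True)]
for policy in ("OLDEST_FIRST", "OPT_LAST", "SPEC_LAST", "BRANCH_FIRST"):
    print(policy, [i.age for i in issue_order(ready, policy)])
```

Each policy targets one of the two waste sources: OPT_LAST and SPEC_LAST delay instructions that may turn out to be useless, while BRANCH_FIRST tries to resolve mispredictions sooner.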

What are the Bottlenecks now? • Issue Bandwidth • not a bottleneck • Instruction Queue Size • not a bottleneck • experiments with larger queues increased throughput by less than 1% • Fetch Bandwidth • prime candidate for bottleneck status • increasing the IQ size and excess registers as well increased performance by another 7% • Branch Prediction • SMT is less sensitive to branch prediction quality than a single-threaded processor

What are the Bottlenecks now? • Speculative Execution • not a bottleneck • but eliminating it would be an issue • Memory Throughput • infinite-bandwidth caches would increase throughput by only 3% • Register File Size • no sharp drop-off point • Fetch throughput is still a bottleneck

Summary • Borrows heavily from conventional superscalar design, requiring little additional hardware support • Minimizes the impact on single-thread performance, running only 2% slower in that scenario • Achieves significant throughput improvements over the superscalar when many threads are running

SMT in modern commercial implementations • Intel Pentium 4, 2002 • Hyper-Threading Technology (HTT) • up to 30% speed improvement • MIPS MT • IBM POWER5, 2004 • two-thread SMT engine • Sun UltraSPARC T1, 2005 • CMT: SMT + CMP (chip-level multiprocessing)
