Simultaneous Multithreading: Maximizing On-Chip Parallelism. Presented by: Daron Shrode and Shey Liggett
Introduction • Simultaneous Multithreading (SM) – a technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle. • The objective of SM is to substantially increase processor utilization in the face of both long memory latencies and limited available parallelism per thread.
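A minimal sketch (not from the paper) of the idea in code: in one cycle, the processor's issue slots are filled with ready instructions drawn from whichever threads have them, rather than from a single thread. The thread names and per-thread ready counts below are hypothetical.

ISSUE_WIDTH = 8  # maximum instructions issued per cycle

# Hypothetical number of instructions each thread has ready this cycle.
ready = {"T0": 3, "T1": 2, "T2": 4, "T3": 1}

issued = []
slots_left = ISSUE_WIDTH
for thread, count in ready.items():
    take = min(count, slots_left)      # fill remaining slots from this thread
    issued.extend([thread] * take)
    slots_left -= take
    if slots_left == 0:
        break

print("SMT issues this cycle:", issued)                          # instructions from several threads
print("single-thread issue  :", min(ready["T0"], ISSUE_WIDTH))   # only T0's 3 ready instructions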
Overview • Introduce several SM models • Evaluate the performance of those models relative to superscalar and fine-grain multithreading • Show how to tune the cache hierarchy for SM processors • Demonstrate the potential performance and real-estate advantages of SM architectures over small-scale, on-chip multiprocessors
Simulation Environment • Developed a simulation environment that defines an implementation of an SM architecture • Uses emulation-based, instruction-level simulation, similar to Tango and g88 • Models the execution pipelines, the memory hierarchy (both in terms of hit rates and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor • Based on the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded execution
Simulation Environment (cont.) • The typical simulated configuration contains 10 functional units of four types (four integer, two floating-point, three load/store, and one branch) and a maximum issue rate of 8 instructions per cycle.
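As a rough illustration (an assumed encoding, not the simulator's actual data structures), the configuration stated above can be written down and sanity-checked as:

# Illustrative encoding of the simulated machine described on this slide;
# the numbers come from the slide, the structure itself is assumed.
config = {
    "issue_width": 8,            # maximum instructions issued per cycle
    "functional_units": {
        "integer": 4,
        "floating_point": 2,
        "load_store": 3,
        "branch": 1,
    },
}
assert sum(config["functional_units"].values()) == 10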
Superscalar Bottlenecks • No dominant source of wasted issue bandwidth and, therefore, no dominant solution • No single latency-tolerating technique will produce a dramatic increase in the performance of these programs if it only attacks specific types of latencies
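One common way to account for the wasted bandwidth, following the paper's distinction between unused slots in partially used cycles (horizontal waste) and entirely idle cycles (vertical waste), is sketched below with a hypothetical issue trace.

# Hypothetical per-cycle issue counts; the horizontal/vertical breakdown
# follows the paper's terminology, the trace itself is made up.
ISSUE_WIDTH = 8
issued_per_cycle = [3, 0, 5, 8, 0, 2]

total_slots = ISSUE_WIDTH * len(issued_per_cycle)
vertical_waste = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)       # fully idle cycles
horizontal_waste = sum(ISSUE_WIDTH - n for n in issued_per_cycle if n > 0)  # unused slots in busy cycles

print("utilization     :", sum(issued_per_cycle) / total_slots)
print("horizontal waste:", horizontal_waste, "slots")
print("vertical waste  :", vertical_waste, "slots")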
SM Machine Models (cont.) • In summary, the results show that simultaneous multithreading surpasses limits on the performance attainable through either single-thread execution or fine-grain multithreading, when run on a wide superscalar. • Simplified implementations of SM with limited per-thread capabilities can still attain high instruction throughput. • These improvements come without any significant tuning of the architecture for multithreaded execution.
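A toy model (all counts hypothetical) of why SM beats fine-grain multithreading on a wide machine: fine-grain multithreading gives each cycle to a single thread, so it removes idle cycles but still leaves unused slots within a cycle, while SM can fill those slots from other threads.

ISSUE_WIDTH = 8
# Hypothetical instructions ready per thread (columns) in each cycle (rows).
ready = [
    [2, 3, 1, 4],   # cycle 0
    [0, 5, 2, 1],   # cycle 1
    [3, 0, 4, 2],   # cycle 2
]

fine_grain = 0
smt = 0
for cycle, counts in enumerate(ready):
    # Fine-grain MT: a single thread (round-robin) owns the whole cycle.
    fine_grain += min(counts[cycle % len(counts)], ISSUE_WIDTH)
    # SM: slots are filled from all threads until the issue width is used up.
    smt += min(sum(counts), ISSUE_WIDTH)

print("fine-grain issued:", fine_grain)   # 2 + 5 + 4 = 11 instructions
print("SM issued        :", smt)          # 8 + 8 + 8 = 24 instructions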
Cache Design • Cache sharing caused performance degradation in SM processors. • Different cache configurations were simulated to determine optimum configurations
Cache Design • Two configurations appear to be good choices: • 64s.64s • 64p.64s • Important Note: cache sizes today are larger than those at the time of the paper (1995).
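The labels follow the paper's <I-cache>.<D-cache> naming, where the number is the size in KB and, as assumed here, 's' means a cache shared by all threads and 'p' means per-thread private partitions. A small sketch that spells the two recommended configurations out:

# Assumed reading of the cache labels: "<Icache>.<Dcache>", sizes in KB,
# 's' = shared by all threads, 'p' = divided into per-thread private caches.
def parse_config(label):
    icache, dcache = label.split(".")
    return {
        "icache_kb": int(icache[:-1]), "icache_shared": icache.endswith("s"),
        "dcache_kb": int(dcache[:-1]), "dcache_shared": dcache.endswith("s"),
    }

for label in ("64s.64s", "64p.64s"):
    print(label, "->", parse_config(label))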
Simultaneous Multithreading vs Single-Chip Multiprocessing • At an organizational level, the two are similar: • Multiple register sets • Multiple functional units • High issue bandwidth on a single chip
Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.) • The key difference is the way resources are partitioned and scheduled: • MP statically partitions resources among processors • SM allows the partitioning to change every cycle • MP and SM were tested in similar configurations to compare their performance (see the sketch after the next slide).
Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.)
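A toy sketch (numbers hypothetical) of the partitioning difference described above: an on-chip multiprocessor statically gives each core a fixed share of the total issue bandwidth, while an SM processor reallocates the full width among threads every cycle.

TOTAL_WIDTH = 8
THREADS = 4
# Hypothetical per-thread ready-instruction counts for one cycle.
ready = [6, 1, 0, 3]

# Multiprocessor: each core owns a fixed slice of the issue width.
per_core = TOTAL_WIDTH // THREADS
mp_issued = sum(min(n, per_core) for n in ready)

# SM: the whole width is a single pool shared by the threads this cycle.
smt_issued = min(sum(ready), TOTAL_WIDTH)

print("MP issues this cycle:", mp_issued)    # 2 + 1 + 0 + 2 = 5
print("SM issues this cycle:", smt_issued)   # min(10, 8) = 8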
Pentium 4 • Product Features • Available at 1.50, 1.60, 1.70, 1.80, 1.90, and 2 GHz • Binary compatible with applications running on previous members of the Intel microprocessor line • Intel® NetBurst™ micro-architecture • System bus frequency at 400 MHz • Rapid Execution Engine: Arithmetic Logic Units (ALUs) run at twice the processor core frequency • Hyper Pipelined Technology • Advanced Dynamic Execution • —Very deep out-of-order execution • —Enhanced branch prediction • Level 1 Execution Trace Cache stores 12K micro-ops and removes decoder latency from main execution loops • 8 KB Level 1 data cache • 256 KB Advanced Transfer Cache (on-die, full-speed Level 2 (L2) cache) with 8-way associativity and Error Correcting Code (ECC) • 144 new Streaming SIMD Extensions 2 (SSE2) instructions • Enhanced floating-point and multimedia unit for enhanced video, audio, encryption, and 3D performance • Power Management capabilities • —System Management mode • —Multiple low-power states • Optimized for 32-bit applications running on advanced 32-bit operating systems • 8-way cache associativity provides improved cache hit rate on load/store operations.
AMD Athlon The AMD Athlon XP processor features a seventh-generation microarchitecture with an integrated, exclusive L2 cache, which supports the growing processor and system bandwidth requirements of emerging software, graphics, I/O, and memory technologies. The high-speed execution core of the AMD Athlon XP processor includes multiple x86 instruction decoders, a dual-ported 128-Kbyte split level-one (L1) cache, an exclusive 256-Kbyte L2 cache, three independent integer pipelines, three address calculation pipelines, and a superscalar, fully pipelined, out-of-order, three-way floating-point engine. The floating-point engine is capable of delivering outstanding performance on numerically complex applications.
AMD Athlon (cont.) The following features summarize the AMD Athlon XP processor QuantiSpeed architecture: • An advanced nine-issue, superpipelined, superscalar x86 processor microarchitecture designed for increased Instructions Per Cycle (IPC) and high clock frequencies • Fully pipelined floating-point unit that executes all x87 (floating-point), MMX, SSE, and 3DNow! instructions • Hardware data prefetch that increases performance on high-end software applications by utilizing high-bandwidth system capability • Advanced two-level Translation Look-aside Buffer (TLB) structures for enhanced data and instruction address translation. The AMD Athlon XP processor with QuantiSpeed architecture incorporates three TLB optimizations: the L1 DTLB increases from 32 to 40 entries, the L2 ITLB and L2 DTLB both use an exclusive architecture, and TLB entries can be speculatively loaded.