Simultaneous Multithreading: Maximizing On-Chip Parallelism. Presented by: Daron Shrode and Shey Liggett
Introduction • Simultaneous Multithreading (SM) – a technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle. • The objective of SM is to substantially increase processor utilization in the face of both long memory latencies and limited available parallelism per thread.
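A minimal sketch (not from the paper) of the idea in code: in one cycle, the processor's issue slots are filled with ready instructions drawn from whichever threads have them, rather than from a single thread. The thread names and per-thread ready counts below are hypothetical.

ISSUE_WIDTH = 8  # maximum instructions issued per cycle

# Hypothetical number of instructions each thread has ready this cycle.
ready = {"T0": 3, "T1": 2, "T2": 4, "T3": 1}

issued = []
slots_left = ISSUE_WIDTH
for thread, count in ready.items():
    take = min(count, slots_left)      # fill remaining slots from this thread
    issued.extend([thread] * take)
    slots_left -= take
    if slots_left == 0:
        break

print("SMT issues this cycle:", issued)                          # instructions from several threads
print("single-thread issue  :", min(ready["T0"], ISSUE_WIDTH))   # only T0's 3 ready instructions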
Overview • Introduce several SM models • Evaluate the performance of those models relative to superscalar and fine-grain multithreading • Show how to tune the cache hierarchy for SM processors • Demonstrate the potential performance and real-estate advantages of SM architectures over small-scale, on-chip multiprocessors
Simulation Environment • Developed a simulation environment that defines an implementation of an SM architecture • Uses emulation-based, instruction-level simulation, similar to Tango and g88 • Models the execution pipelines, the memory hierarchy (both in terms of hit rates and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor • Based on the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded execution
Simulation Environment (cont.) • The typical simulated configuration contains 10 functional units of four types (four integer, two floating-point, three load/store, and one branch) and a maximum issue rate of 8 instructions per cycle.
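As a rough illustration (an assumed encoding, not the simulator's actual data structures), the configuration stated above can be written down and sanity-checked as:

# Illustrative encoding of the simulated machine described on this slide;
# the numbers come from the slide, the structure itself is assumed.
config = {
    "issue_width": 8,            # maximum instructions issued per cycle
    "functional_units": {
        "integer": 4,
        "floating_point": 2,
        "load_store": 3,
        "branch": 1,
    },
}
assert sum(config["functional_units"].values()) == 10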
Superscalar Bottlenecks • No dominant source of wasted issue bandwidth and, therefore, no dominant solution • No single latency-tolerating technique will produce a dramatic increase in the performance of these programs if it only attacks specific types of latencies
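One common way to account for the wasted bandwidth, following the paper's distinction between unused slots in partially used cycles (horizontal waste) and entirely idle cycles (vertical waste), is sketched below with a hypothetical issue trace.

# Hypothetical per-cycle issue counts; the horizontal/vertical breakdown
# follows the paper's terminology, the trace itself is made up.
ISSUE_WIDTH = 8
issued_per_cycle = [3, 0, 5, 8, 0, 2]

total_slots = ISSUE_WIDTH * len(issued_per_cycle)
vertical_waste = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)       # fully idle cycles
horizontal_waste = sum(ISSUE_WIDTH - n for n in issued_per_cycle if n > 0)  # unused slots in busy cycles

print("utilization     :", sum(issued_per_cycle) / total_slots)
print("horizontal waste:", horizontal_waste, "slots")
print("vertical waste  :", vertical_waste, "slots")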
SM Machine Models (cont.) • In summary, the results show that simultaneous multithreading surpasses limits on the performance attainable through either single-thread execution or fine-grain multithreading, when run on a wide superscalar. • Simplified implementations of SM with limited per-thread capabilities can still attain high instruction throughput. • These improvements come without any significant tuning of the architecture for multithreaded execution.
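A toy model (all counts hypothetical) of why SM beats fine-grain multithreading on a wide machine: fine-grain multithreading gives each cycle to a single thread, so it removes idle cycles but still leaves unused slots within a cycle, while SM can fill those slots from other threads.

ISSUE_WIDTH = 8
# Hypothetical instructions ready per thread (columns) in each cycle (rows).
ready = [
    [2, 3, 1, 4],   # cycle 0
    [0, 5, 2, 1],   # cycle 1
    [3, 0, 4, 2],   # cycle 2
]

fine_grain = 0
smt = 0
for cycle, counts in enumerate(ready):
    # Fine-grain MT: a single thread (round-robin) owns the whole cycle.
    fine_grain += min(counts[cycle % len(counts)], ISSUE_WIDTH)
    # SM: slots are filled from all threads until the issue width is used up.
    smt += min(sum(counts), ISSUE_WIDTH)

print("fine-grain issued:", fine_grain)   # 2 + 5 + 4 = 11 instructions
print("SM issued        :", smt)          # 8 + 8 + 8 = 24 instructions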
Cache Design • Cache sharing caused performance degradation in SM processors. • Different cache configurations were simulated to determine optimum configurations
Cache Design • Two configurations appear to be good choices: • 64s.64s • 64p.64s • Important Note: cache sizes today are larger than those at the time of the paper (1995).
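The labels follow the paper's <I-cache>.<D-cache> naming, where the number is the size in KB and, as assumed here, 's' means a cache shared by all threads and 'p' means per-thread private partitions. A small sketch that spells the two recommended configurations out:

# Assumed reading of the cache labels: "<Icache>.<Dcache>", sizes in KB,
# 's' = shared by all threads, 'p' = divided into per-thread private caches.
def parse_config(label):
    icache, dcache = label.split(".")
    return {
        "icache_kb": int(icache[:-1]), "icache_shared": icache.endswith("s"),
        "dcache_kb": int(dcache[:-1]), "dcache_shared": dcache.endswith("s"),
    }

for label in ("64s.64s", "64p.64s"):
    print(label, "->", parse_config(label))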
Simultaneous Multithreading vs Single-Chip Multiprocessing • At an organizational level, the two are similar: • Multiple register sets • Multiple functional units • High issue bandwidth on a single chip
Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.) • The key difference is the way resources are partitioned and scheduled: • MP statically partitions resources among processors • SM allows the partitioning to change every cycle • MP and SM were tested in similar configurations to compare their performance (see the sketch after the next slide).
Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.)
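A toy sketch (numbers hypothetical) of the partitioning difference described above: an on-chip multiprocessor statically gives each core a fixed share of the total issue bandwidth, while an SM processor reallocates the full width among threads every cycle.

TOTAL_WIDTH = 8
THREADS = 4
# Hypothetical per-thread ready-instruction counts for one cycle.
ready = [6, 1, 0, 3]

# Multiprocessor: each core owns a fixed slice of the issue width.
per_core = TOTAL_WIDTH // THREADS
mp_issued = sum(min(n, per_core) for n in ready)

# SM: the whole width is a single pool shared by the threads this cycle.
smt_issued = min(sum(ready), TOTAL_WIDTH)

print("MP issues this cycle:", mp_issued)    # 2 + 1 + 0 + 2 = 5
print("SM issues this cycle:", smt_issued)   # min(10, 8) = 8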
Pentium 4 • Product Features • Available at 1.50, 1.60, 1.70, 1.80, 1.90, and 2 GHz • Binary compatible with applications running on previous members of the Intel microprocessor line • Intel® NetBurst™ micro-architecture • System bus frequency at 400 MHz • Rapid Execution Engine: Arithmetic Logic Units (ALUs) run at twice the processor core frequency • Hyper Pipelined Technology • Advanced Dynamic Execution • —Very deep out-of-order execution • —Enhanced branch prediction • Level 1 Execution Trace Cache stores 12K micro-ops and removes decoder latency from main execution loops • 8 KB Level 1 data cache • 256 KB Advanced Transfer Cache (on-die, full-speed Level 2 (L2) cache) with 8-way associativity and Error Correcting Code (ECC) • 144 new Streaming SIMD Extensions 2 (SSE2) instructions • Enhanced floating-point and multimedia unit for enhanced video, audio, encryption, and 3D performance • Power Management capabilities • —System Management mode • —Multiple low-power states • Optimized for 32-bit applications running on advanced 32-bit operating systems • 8-way cache associativity provides improved cache hit rate on load/store operations.
AMD Athlon The AMD Athlon XP processor features a seventh-generation microarchitecture with an integrated, exclusive L2 cache, which supports the growing processor and system bandwidth requirements of emerging software, graphics, I/O, and memory technologies. The high-speed execution core of the AMD Athlon XP processor includes multiple x86 instruction decoders, a dual-ported 128-Kbyte split level-one (L1) cache, an exclusive 256-Kbyte L2 cache, three independent integer pipelines, three address calculation pipelines, and a superscalar, fully pipelined, out-of-order, three-way floating-point engine. The floating-point engine is capable of delivering outstanding performance on numerically complex applications.
AMD Athlon (cont.) The following features summarize the AMD Athlon XP processor QuantiSpeed architecture: • An advanced nine-issue, superpipelined, superscalar x86 processor microarchitecture designed for increased Instructions Per Cycle (IPC) and high clock frequencies • Fully pipelined floating-point unit that executes all x87 (floating-point), MMX, SSE, and 3DNow! instructions • Hardware data prefetch that increases performance on high-end software applications by utilizing high-bandwidth system capability • Advanced two-level Translation Look-aside Buffer (TLB) structures for enhanced data and instruction address translation. The AMD Athlon XP processor with QuantiSpeed architecture incorporates three TLB optimizations: the L1 DTLB increases from 32 to 40 entries, the L2 ITLB and L2 DTLB both use an exclusive architecture, and TLB entries can be speculatively loaded.