‘99 ACM/IEEE International Symposium on Computer Architecture
Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor
Sangyeun Cho, U of Minnesota/Samsung
Pen-Chung Yew, U of Minnesota
Gyungho Lee, U of Texas at San Antonio
Roadmap • Need for Higher Bandwidth Caches • Multi-Ported Data Caches • Data Decoupling • Motivation • Approach • Implementation Issues • Quantitative Evaluation • Conclusions Cho, Yew, and Lee
Wide-Issue Superscalar Processors • Current Generation • Alpha 21264 • Intel’s Merced • Future Generation (IEEE Computer, Sept. ‘97) • Superspeculative Processors • Trace Processors
Multi-Ported Data Caches • Cache Built with Multi-Ported Cells • Replicated Cache • Alpha 21164 • Interleaved Cache • MIPS R10K • Time-Division Multiplexing • Alpha 21264
Replicated Cache • Pros: • Simple design • Symmetric read ports • Cons: • Doubled area • Exclusive writes for data coherence
Time-Division Multiplexed Cache • Pros: • True 2-port cache • Cons: • Hardware design complexity • Not scalable beyond 2 ports
Interleaved Cache • Pros: • Scalable • Cons: • Asymmetric ports • Bank conflicts • Constraints on number of banks
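The bank-conflict problem in an interleaved cache can be sketched as follows. This is a minimal illustration, assuming 4 banks and 32-byte cache lines; these parameters are chosen for illustration and are not taken from the paper or any particular processor.

```python
# Minimal sketch of bank interleaving and same-cycle conflict detection.
# The bank count (4) and line size (32 bytes) are illustrative assumptions.
def bank_of(addr, n_banks=4, line_bytes=32):
    """Low-order bits of the cache-line index select the bank."""
    return (addr // line_bytes) % n_banks

def count_conflicts(addrs, n_banks=4):
    """Accesses issued in one cycle conflict when they map to the same bank."""
    banks = [bank_of(a, n_banks) for a in addrs]
    return len(banks) - len(set(banks))
```

Sequential line addresses spread evenly across banks, but two accesses whose line indices differ by a multiple of the bank count collide and must serialize.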
Window Logic Complexity • Identified as the major source of hardware complexity (Palacharla et al., ISCA ‘97) • More severe for the memory window • Difficult to partition • A thick network is needed to connect RSs and LSUs
Data Decoupling • A divide-and-conquer approach • Instructions partitioned before entering the RS • Narrower networks • Fewer ports to each cache
Data Decoupling: Operating Issues • Memory Stream Partitioning • Hardware classification • Compiler classification • Load Balancing • Enough instructions in different groups? • Are they well interleaved?
Case for Decoupling Stack Accesses • Easily identifiable: a simple 1-bit predictor with enough context information works well (>99.9% accuracy); a compiler mechanism helps reduce the prediction table space required for good performance, but is not essential. • Many of them: 30% of loads, 48% of stores. • Well interleaved: continuous supply of stack references with a reasonable window size. • Details are found in: Cho, Yew, and Lee, “Access Region Locality for High-Bandwidth Processor Memory System Design,” CSTR #99-004, Univ. of Minnesota.
Data Decoupling: Mechanism • Dynamically Predicting Access Regions for Partitioning Memory Instructions • Utilize access region locality • Refer to context information, e.g., global branch history, call site identifier • Dynamically Verifying Region Prediction • Let the TLB (i.e., page table) carry verification information so that a memory access is reissued on misprediction.
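The prediction scheme above might be sketched as follows: a table of 1-bit entries indexed by a hash of the instruction PC and global branch history. The table size and hash function here are illustrative assumptions; the slide only states that a 1-bit predictor with context information achieves >99.9% accuracy.

```python
# Sketch of a 1-bit access-region predictor. Indexing by PC XOR global
# branch history is an illustrative choice; entry count is arbitrary.
class RegionPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries  # 1 = predict stack region, 0 = non-stack

    def index(self, pc, ghist):
        return (pc ^ ghist) % len(self.table)

    def predict(self, pc, ghist):
        return self.table[self.index(pc, ghist)]

    def update(self, pc, ghist, actual_region):
        # 1-bit predictor: remember the last observed region for this context.
        self.table[self.index(pc, ghist)] = actual_region
```

Verification happens off the critical path: the TLB entry records the true region, and a mismatch forces the access to reissue to the correct side.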
Data Decoupling: Mechanism, Cont’d • Access Region Locality [figure]
Data Decoupling: Mechanism, Cont’d • Dynamic Partitioning Accuracy [figure: accuracy across go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, and FP.Avg, for prediction table sizes of 1 KB, 2 KB, 4 KB, 8 KB, and unlimited]
Data Decoupling: Optimizations • Fast Forwarding: uses the offset (used with $sp) to resolve dependences; can shorten latency. Example: st r3, 8($sp) … ld r4, 8($sp) — address matched, value forwarded. • Access Combining: combines accesses to adjacent locations; can save bandwidth. Example: st r3, 4($sp) and st r4, 8($sp) combine into st {r3,r4}, {4,8($sp)}.
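The access-combining optimization can be sketched as follows: a minimal 2-way combiner over word-sized stack stores, represented as (offset, register) pairs relative to $sp. The 4-byte word size and the pair representation are illustrative assumptions.

```python
# Minimal sketch of 2-way access combining for stack stores, assuming
# word-sized (4-byte) accesses queued as (offset, reg) pairs off $sp.
def combine_stores(stores):
    """Merge pairs of stores to adjacent offsets into one double-width access."""
    stores = sorted(stores)
    combined, i = [], 0
    while i < len(stores):
        if i + 1 < len(stores) and stores[i + 1][0] == stores[i][0] + 4:
            # Two adjacent stores issue as a single wider access.
            combined.append((stores[i][0], [stores[i][1], stores[i + 1][1]]))
            i += 2
        else:
            combined.append((stores[i][0], [stores[i][1]]))
            i += 1
    return combined
```

As the slide notes, 2-way combining captures most of the benefit, so the sketch stops at pairs rather than attempting wider merges.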
Benchmark Programs [table]
Program’s Memory Accesses [figure: breakdown across go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, and FP.Avg]
Program’s Frame Size Distribution • Stack references tend to access a small region. • Average size of dynamic frames was around 3 words. • Average size of static frames was around 7 words. [figure: distribution over frame sizes 0–16 words]
Base Machine Model [table]
Program’s Bandwidth Requirements • Performance suffers greatly with fewer than 3 cache ports. • We study 3 cases: cache with 2, 3, or 4 ports. [figure: integer and FP results]
Impact of LVC Size • 2 KB and 4 KB LVCs achieve high hit rates (~99.9%). • Set associativity matters less once the LVC is 2 KB or larger. • A small, simple LVC works well. [figure: 0.5 KB, 1 KB, 2 KB, and 4 KB LVCs]
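The intuition behind the small-LVC result can be sketched with a toy direct-mapped cache model: frames are tiny (around 3 words dynamically, per the earlier slide) and heavily reused, so even a 2 KB structure captures nearly all stack accesses. Cache and line sizes here are illustrative, not the paper's simulated configuration.

```python
# Toy direct-mapped Local Variable Cache (LVC) model. Sizes are
# illustrative assumptions, not the configuration evaluated in the paper.
class DirectMappedCache:
    def __init__(self, size_bytes=2048, line_bytes=16):
        self.lines = size_bytes // line_bytes
        self.line_bytes = line_bytes
        self.tags = [None] * self.lines
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        idx, tag = line % self.lines, line // self.lines
        if self.tags[idx] == tag:
            self.hits += 1
        else:
            self.tags[idx] = tag  # fill on miss
            self.misses += 1
```

Replaying repeated accesses to a small stack frame through this model yields a hit rate approaching 100% after the initial cold miss, mirroring the ~99.9% hit rates reported for the 2 KB LVC.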
Fast Data Forwarding [figure]
Access Combining • Effective (over 8% improvement) when LVC bandwidth is scarce. • 2-way combining is enough. [figure: (3+1) and (3+2) configurations]
Performance of Various Config.’s [figure: (N+0) through (N+5)]
Performance of 126.gcc [figure: (N+0) through (N+5)]
Performance of 130.li [figure: (N+0) through (N+5)]
Performance of 102.swim [figure: (N+0) through (N+5)]
Other Findings • LVC hit latency has less impact than data cache latency due to • Many loads hitting in the LVAQ • Out-of-order issue • Adding the LVC reduced conflict misses in • 130.li (by 24%) and 147.vortex (by 7%) • May reduce bandwidth requirements on the bus to the L2 cache
Overall Performance [figure: results for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, and FP.Avg]
Conclusions • Superscalar processors will be around… • But their design complexity will call for architectural solutions. • Memory bandwidth becomes critical. • Data Decoupling is a way to • Decrease the hardware complexity of memory issue logic and caches. • Provide additional bandwidth for decoupled stack accesses.