‘99 ACM/IEEE International Symposium on Computer Architecture
Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor
Sangyeun Cho, U of Minnesota/Samsung
Pen-Chung Yew, U of Minnesota
Gyungho Lee, U of Texas at San Antonio
Roadmap • Need for Higher Bandwidth Caches • Multi-Ported Data Caches • Data Decoupling • Motivation • Approach • Implementation Issues • Quantitative Evaluation • Conclusions Cho, Yew, and Lee
Wide-Issue Superscalar Processors • Current Generation • Alpha 21264 • Intel’s Merced • Future Generation (IEEE Computer, Sept. ‘97) • Superspeculative Processors • Trace Processors
Multi-Ported Data Caches • Cache Built with Multi-Ported Cells • Replicated Cache • Alpha 21164 • Interleaved Cache • MIPS R10K • Time-Division Multiplexing • Alpha 21264
Replicated Cache • Pros: • Simple design • Symmetric read ports • Cons: • Doubled area • Exclusive writes for data coherence
Time-Division Multiplexed Cache • Pros: • True 2-port cache • Cons: • Hardware design complexity • Not scalable beyond 2 ports
Interleaved Cache • Pros: • Scalable • Cons: • Asymmetric ports • Bank conflicts • Constraints on number of banks
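The bank-conflict problem in an interleaved cache can be sketched as follows. This is a minimal illustration, assuming 4 banks and 32-byte cache lines; these parameters are chosen for illustration and are not taken from the paper or any particular processor.

```python
# Minimal sketch of bank interleaving and same-cycle conflict detection.
# The bank count (4) and line size (32 bytes) are illustrative assumptions.
def bank_of(addr, n_banks=4, line_bytes=32):
    """Low-order bits of the cache-line index select the bank."""
    return (addr // line_bytes) % n_banks

def count_conflicts(addrs, n_banks=4):
    """Accesses issued in one cycle conflict when they map to the same bank."""
    banks = [bank_of(a, n_banks) for a in addrs]
    return len(banks) - len(set(banks))
```

Sequential line addresses spread evenly across banks, but two accesses whose line indices differ by a multiple of the bank count collide and must serialize.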
Window Logic Complexity • Identified as the major source of hardware complexity (Palacharla et al., ISCA ‘97) • More severe for the memory window • Difficult to partition • A thick network is needed to connect RSs and LSUs
Data Decoupling • A divide-and-conquer approach • Instructions partitioned before entering the RS • Narrower networks • Fewer ports to each cache
Data Decoupling: Operating Issues • Memory Stream Partitioning • Hardware classification • Compiler classification • Load Balancing • Enough instructions in different groups? • Are they well interleaved?
Case for Decoupling Stack Accesses • Easily identifiable: a simple 1-bit predictor with enough context information works well (>99.9% accuracy); a compiler mechanism helps reduce the prediction table space required for good performance, but is not essential. • Many of them: 30% of loads, 48% of stores. • Well interleaved: continuous supply of stack references with a reasonable window size. • Details are found in: Cho, Yew, and Lee, “Access Region Locality for High-Bandwidth Processor Memory System Design,” CSTR #99-004, Univ. of Minnesota.
Data Decoupling: Mechanism • Dynamically Predicting Access Regions for Partitioning Memory Instructions • Utilize access region locality • Refer to context information, e.g., global branch history, call site identifier • Dynamically Verifying Region Prediction • Let the TLB (i.e., page table) carry verification information so that a memory access is reissued on misprediction.
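The prediction scheme above might be sketched as follows: a table of 1-bit entries indexed by a hash of the instruction PC and global branch history. The table size and hash function here are illustrative assumptions; the slide only states that a 1-bit predictor with context information achieves >99.9% accuracy.

```python
# Sketch of a 1-bit access-region predictor. Indexing by PC XOR global
# branch history is an illustrative choice; entry count is arbitrary.
class RegionPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries  # 1 = predict stack region, 0 = non-stack

    def index(self, pc, ghist):
        return (pc ^ ghist) % len(self.table)

    def predict(self, pc, ghist):
        return self.table[self.index(pc, ghist)]

    def update(self, pc, ghist, actual_region):
        # 1-bit predictor: remember the last observed region for this context.
        self.table[self.index(pc, ghist)] = actual_region
```

Verification happens off the critical path: the TLB entry records the true region, and a mismatch forces the access to reissue to the correct side.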
Data Decoupling: Mechanism, Cont’d • Access Region Locality [figure]
Data Decoupling: Mechanism, Cont’d • Dynamic Partitioning Accuracy [figure: accuracy across go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, and FP.Avg, for prediction table sizes of 1 KB, 2 KB, 4 KB, 8 KB, and unlimited]
Data Decoupling: Optimizations • Fast Forwarding: uses the offset (used with $sp) to resolve dependences; can shorten latency. Example: st r3, 8($sp) … ld r4, 8($sp) — address matched, value forwarded. • Access Combining: combines accesses to adjacent locations; can save bandwidth. Example: st r3, 4($sp) and st r4, 8($sp) combine into st {r3,r4}, {4,8($sp)}.
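The access-combining optimization can be sketched as follows: a minimal 2-way combiner over word-sized stack stores, represented as (offset, register) pairs relative to $sp. The 4-byte word size and the pair representation are illustrative assumptions.

```python
# Minimal sketch of 2-way access combining for stack stores, assuming
# word-sized (4-byte) accesses queued as (offset, reg) pairs off $sp.
def combine_stores(stores):
    """Merge pairs of stores to adjacent offsets into one double-width access."""
    stores = sorted(stores)
    combined, i = [], 0
    while i < len(stores):
        if i + 1 < len(stores) and stores[i + 1][0] == stores[i][0] + 4:
            # Two adjacent stores issue as a single wider access.
            combined.append((stores[i][0], [stores[i][1], stores[i + 1][1]]))
            i += 2
        else:
            combined.append((stores[i][0], [stores[i][1]]))
            i += 1
    return combined
```

As the slide notes, 2-way combining captures most of the benefit, so the sketch stops at pairs rather than attempting wider merges.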
Benchmark Programs [table]
Program’s Memory Accesses [figure: breakdown across go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, and FP.Avg]
Program’s Frame Size Distribution • Stack references tend to access a small region. • Average size of dynamic frames was around 3 words. • Average size of static frames was around 7 words. [figure: distribution over frame sizes 0–16 words]
Base Machine Model [table]
Program’s Bandwidth Requirements • Performance suffers greatly with fewer than 3 cache ports. • We study 3 cases: cache with 2, 3, or 4 ports. [figure: integer and FP results]
Impact of LVC Size • 2 KB and 4 KB LVCs achieve high hit rates (~99.9%). • Set associativity matters less once the LVC is 2 KB or larger. • A small, simple LVC works well. [figure: 0.5 KB, 1 KB, 2 KB, and 4 KB LVCs]
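The intuition behind the small-LVC result can be sketched with a toy direct-mapped cache model: frames are tiny (around 3 words dynamically, per the earlier slide) and heavily reused, so even a 2 KB structure captures nearly all stack accesses. Cache and line sizes here are illustrative, not the paper's simulated configuration.

```python
# Toy direct-mapped Local Variable Cache (LVC) model. Sizes are
# illustrative assumptions, not the configuration evaluated in the paper.
class DirectMappedCache:
    def __init__(self, size_bytes=2048, line_bytes=16):
        self.lines = size_bytes // line_bytes
        self.line_bytes = line_bytes
        self.tags = [None] * self.lines
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        idx, tag = line % self.lines, line // self.lines
        if self.tags[idx] == tag:
            self.hits += 1
        else:
            self.tags[idx] = tag  # fill on miss
            self.misses += 1
```

Replaying repeated accesses to a small stack frame through this model yields a hit rate approaching 100% after the initial cold miss, mirroring the ~99.9% hit rates reported for the 2 KB LVC.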
Fast Data Forwarding [figure]
Access Combining • Effective (over 8% improvement) when LVC bandwidth is scarce. • 2-way combining is enough. [figure: (3+1) and (3+2) configurations]
Performance of Various Config.’s [figure: (N+0) through (N+5)]
Performance of 126.gcc [figure: (N+0) through (N+5)]
Performance of 130.li [figure: (N+0) through (N+5)]
Performance of 102.swim [figure: (N+0) through (N+5)]
Other Findings • LVC hit latency has less impact than data cache latency due to • Many loads hitting in the LVAQ • Out-of-order issue • Adding the LVC reduced conflict misses in • 130.li (by 24%) and 147.vortex (by 7%) • May reduce bandwidth requirements on the bus to the L2 cache
Overall Performance [figure: results for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, and FP.Avg]
Conclusions • Superscalar processors will be around… • But their design complexity will call for architectural solutions. • Memory bandwidth becomes critical. • Data Decoupling is a way to • Decrease the hardware complexity of memory issue logic and caches. • Provide additional bandwidth for decoupled stack accesses.