32nd Annual International Symposium on Microarchitecture. Access Region Locality for High-Bandwidth Processor Memory System Design. Sangyeun Cho Samsung/U of Minnesota Pen-Chung Yew U of Minnesota Gyungho Lee Iowa State U. Big Picture. On-Chip D-Cache Bandwidth Problem.
32nd Annual International Symposium on Microarchitecture
Access Region Locality for High-Bandwidth Processor Memory System Design
Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee (Iowa State U)
Big Picture Cho, Yew, and Lee
Wide-Issue Superscalar Processors
• Current Generation
  • Alpha 21264
  • Intel’s Merced
• Future Generation (IEEE Computer, Sept. ‘97)
  • Superspeculative Processors
  • Trace Processors
Multi-Ported Data Cache
• Replicated Cache
  • Alpha 21164
• Time-Division Multiplexed Cache
  • Alpha 21264
• Interleaved Cache
  • MIPS R10K
Window Logic Complexity
• Pointed out as the major hardware complexity (Palacharla et al., ISCA ‘97)
• More severe for the memory window
  • Difficult to partition
  • Thick network needed to connect RSs and LSUs
Data Decoupling: What is it?
• A Divide-and-Conquer approach
• Instruction stream partitioned before entering RS
• Narrower networks
• Fewer ports to each cache
• Needs a mechanism for proper partitioning
Data Decoupling: Operating Issues
• Memory Stream Partitioning
  • Hardware classification
  • Compiler classification
• Load Balancing
  • Enough instructions in different groups?
  • Are they well interleaved?
Access Region: Overview
• Access Region R
  • R = (L, U)
  • L: Lower Bound on Addr.
  • U: Upper Bound on Addr.
• For two regions R = (A, B) and Q = (C, D), if (D < A) or (B < C),
  • Regions R and Q are said to be exclusive, or non-overlapping.
  • Locations in exclusive regions are independent.
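The exclusivity test can be sketched in a few lines of Python. The names `Region` and `exclusive` are illustrative and not taken from the slides, and the bounds are assumed to be inclusive.

```python
from typing import NamedTuple


class Region(NamedTuple):
    lower: int  # L: lower bound on address (assumed inclusive)
    upper: int  # U: upper bound on address (assumed inclusive)


def exclusive(r: Region, q: Region) -> bool:
    """True when r and q do not overlap: q ends below r, or r ends below q."""
    return q.upper < r.lower or r.upper < q.lower


# Hypothetical address ranges, chosen only for illustration.
data = Region(0x1000_0000, 0x1000_FFFF)
stack = Region(0x7FFF_0000, 0x7FFF_FFFF)
print(exclusive(data, stack))  # True
```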
Access Region and Mem. Instructions
Partitioning Memory Space
• One way of partitioning memory space into regions:
  • Data Region / Heap Region / Stack Region
• This work assumes this partitioning.
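A minimal sketch of mapping an address to one of the three regions. The boundary constants `DATA_END` and `STACK_BASE` are hypothetical. A real system would derive them from the program's memory layout, and the heap/stack boundary moves at run time.

```python
from enum import Enum


class AccessRegion(Enum):
    DATA = "data"
    HEAP = "heap"
    STACK = "stack"


# Hypothetical fixed boundaries, for illustration only.
DATA_END = 0x1001_0000    # first address past the static data region
STACK_BASE = 0x7000_0000  # lowest address treated as stack


def classify(addr: int) -> AccessRegion:
    """Map an address to Data, Heap, or Stack under the assumed layout."""
    if addr >= STACK_BASE:
        return AccessRegion.STACK
    if addr < DATA_END:
        return AccessRegion.DATA
    return AccessRegion.HEAP
```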
Partitioning Memory Space, Cont’d
• Many accesses are toward Data and Stack regions.
• Some programs don’t access the Heap region at all.
[Figure omitted; axis in %]
Partitioning Memory Space, Cont’d
• Accesses to Data region are less bursty than others.
• Programs such as ijpeg have clustered region accesses.
• Window Size = 32
Partitioning Memory Space, Cont’d
• W/ a large window, Stack accesses become less bursty.
• Data and Stack regions have quite stable, constant demand.
• Window Size = 64
Partitioning Memory Space, Cont’d
• Many instructions access a single region (~98%).
• Multi-region-accessing instructions account for 0 to 9.6% of dynamic memory references.
[Figure: per-benchmark chart over go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; data labels: 1.8%, 1.9%, 50.4%, 51.1%, 1.6%, 16.2%, 45.4%, 31.6%]
Access Region Locality
• “A memory reference instruction typically accesses a single region at run time”
  • Only about 2% of all static memory instructions access more than a single region.
• “(Thus) the region it accesses is highly predictable”
  • Simple predictors with a small look-up table achieve high prediction accuracy.
Predicting Regions: Unlimited Case
• One predictor per memory instruction
• Predictor types:
  • 1-bit history saver (0: Data, 1: Stack)
  • 2-bit saturating counter
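The two predictor types can be sketched as follows, using the slide's encoding (0: Data, 1: Stack). The class names, the initial state of 0, and the 2-bit threshold (counter values 2 and 3 predict Stack) are assumptions made for illustration.

```python
class OneBitPredictor:
    """1-bit history: predict whatever region was accessed last time."""

    def __init__(self) -> None:
        self.last = 0  # assumed initial state: Data

    def predict(self) -> int:
        return self.last

    def update(self, actual: int) -> None:
        self.last = actual


class TwoBitPredictor:
    """2-bit saturating counter in [0, 3]; values >= 2 predict Stack."""

    def __init__(self) -> None:
        self.counter = 0  # assumed initial state: strongly Data

    def predict(self) -> int:
        return 1 if self.counter >= 2 else 0

    def update(self, actual: int) -> None:
        if actual == 1:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```

The 1-bit predictor flips after a single surprise, while the 2-bit counter needs two consecutive surprises when starting from a saturated state.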
Predicting Regions: Adding Context
• Run-time context
  • Caller’s ID (CID): in Link Register
  • Global Branch History (GBH)
  • Hybrid of above
Predicting Regions: Utilizing Static Info.
• Some instructions’ access regions are revealed through architecture and compiler conventions:
  • Use of Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
  • Use of Global Pointer ($GP) suggests that the region is non-Stack.
  • For others, assume non-Stack.
• Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
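The static rules above amount to a lookup on the base register of a load or store. The sketch below is illustrative: the function name, the lowercase register spellings, and the extra `revealed` flag (whether the convention actually reveals the region, as opposed to the default guess) are assumptions, not part of the slides.

```python
from typing import Tuple

STACK, NON_STACK = 1, 0


def static_region_hint(base_reg: str) -> Tuple[int, bool]:
    """Return (region, revealed) for a memory instruction's base register.

    revealed is True when the register convention itself indicates the
    region, and False when non-Stack is only the default assumption.
    """
    if base_reg in ("$sp", "$fp"):
        return STACK, True
    if base_reg == "$gp":
        return NON_STACK, True
    return NON_STACK, False
```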
Region Pred. Result: Unlimited Case
• 1-bit predictors do better than 2-bit predictors (not shown).
• Hybrid context bits achieve the best prediction rate on average.
[Figure: prediction results per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg); series: Static, Simple 1-bit, w/ CID, w/ GBH, w/ Hybrid]
Predicting Regions: Limited-Size ARPT
• The low n bits of the PC, XORed with hybrid context bits, are used to index into the Access Region Prediction Table (ARPT):
  • Table entries initialized to 0’s
  • 1 to denote stack access
• Decoding information exploited to save ARPT space
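A minimal sketch of a limited-size ARPT with 1-bit entries. The class and method names are illustrative. The `context` argument stands in for the hybrid context bits, whose exact construction the slides do not give. Dropping the two low PC bits before indexing is an added assumption for 4-byte-aligned instructions.

```python
class ARPT:
    """Access Region Prediction Table sketch: 1-bit entries, 1 = stack."""

    def __init__(self, index_bits: int) -> None:
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)  # entries initialized to 0

    def _index(self, pc: int, context: int) -> int:
        # Assumption: skip the two always-zero bits of an aligned PC,
        # then XOR with the context bits and keep the low n bits.
        return ((pc >> 2) ^ context) & self.mask

    def predict(self, pc: int, context: int) -> int:
        return self.table[self._index(pc, context)]

    def update(self, pc: int, context: int, actual: int) -> None:
        self.table[self._index(pc, context)] = actual
```

Instructions whose region is already revealed at decode would not need to consult or update the table, which is one way to read the slide's remark about saving ARPT space.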
Region Prediction Result: ARPT
• Over 99.9% accuracy w/ 4 KB or larger ARPT w/o compiler hints.
• Compiler hints relieve pressure due to smaller sizes.
[Figure: prediction accuracy per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg); series: 1 KB, 2 KB, 4 KB, 8 KB, Unlimited]
Dynamic Data Decoupling
Dynamic Data Decoupling, Cont’d
• Dynamically predicting access regions to classify memory instructions:
  • Utilize Access Region Prediction Table (ARPT).
  • Utilize any region information revealed through instruction decoding.
• Dispatching partitioned memory instructions into separate memory pipelines, connected to separate caches.
• Dynamically verifying region prediction
  • Let TLB (i.e., page table) contain verification information such that memory access is reissued on mispredictions.
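The predict, dispatch, and verify loop can be sketched as a toy trace-driven model. Everything here is an illustrative assumption rather than the authors' design: the `STACK_BASE` boundary stands in for the verification information the slides place in the TLB, the table is a plain PC-indexed 1-bit ARPT without context bits, and a misprediction is simply counted as one reissue.

```python
STACK, NON_STACK = 1, 0
STACK_BASE = 0x7000_0000  # hypothetical boundary standing in for TLB info


def true_region(addr: int) -> int:
    """Region found once the address is known (the verification step)."""
    return STACK if addr >= STACK_BASE else NON_STACK


def run(trace, table_bits: int = 8):
    """trace: iterable of (pc, addr). Returns (accesses per pipeline, reissues)."""
    table = [NON_STACK] * (1 << table_bits)
    mask = len(table) - 1
    per_pipeline = {STACK: 0, NON_STACK: 0}
    reissues = 0
    for pc, addr in trace:
        idx = (pc >> 2) & mask
        predicted = table[idx]      # dispatch to this pipeline
        actual = true_region(addr)  # verify after address translation
        if predicted != actual:
            reissues += 1           # reissue to the other pipeline
            table[idx] = actual     # train the 1-bit entry
        per_pipeline[actual] += 1
    return per_pipeline, reissues


trace = [(0x400, 0x7FFF_0000), (0x404, 0x1000_0000),
         (0x400, 0x7FFF_0008), (0x404, 0x1000_0004)]
print(run(trace))  # ({1: 2, 0: 2}, 1)
```

Only the first stack access mispredicts, because table entries start at 0 (non-stack). After that the entry is trained and the same instruction is dispatched correctly.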
Base Machine Model
Overall Performance
• Over (2+0) conf.
[Figure: performance per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg)]
Conclusions
• Access Region Locality says
  • Memory instructions access few regions at run time.
  • Accessed regions are accurately predictable.
• Access Region Locality leads to Access Region Prediction techniques.
• Access Region Prediction allows Dynamic Data Decoupling, shown to achieve performance comparable to very wide data caches.
Impact of LVC Size
• 2KB and 4KB LVCs achieve high hit rates (~99.9%).
• Set associativity is less important if the LVC is 2KB or more.
• A small, simple LVC works well.
[Figure: LVC sizes 0.5K, 1K, 2K, 4K]