Decoupled Architecture for Data Prefetching
Jichuan Chang <chang@cs.wisc.edu>
Kai Xu <xuk@cs.wisc.edu>
CS752
Outline
• Motivation
• Design and Evaluation
• Results
• Conclusions
Motivation
• Processor-memory performance gap
• Prefetching helps, but it has overhead.
• Transistors are cheap; will a coprocessor help?
[Figure: Main Processor and Prefetching CoProcessor. Labels: Info Flow, Cache, Prefetch Requests, Data, L1-L2 Internal Bus]
Why a dedicated coprocessor?
• Simple
  • It simplifies the design of the main processor.
• Powerful
  • It can (hopefully) exploit complex algorithms;
  • It handles computation overhead (i.e., pattern computation, address computation).
• Flexible
  • It can (hopefully) adapt to different situations;
  • It can implement different algorithms.
• But are these true?
The Generic Design
[Figure: Main Processor, Prefetching CoProcessor, ALU, Tables (RPT, PPW, CT, History, …), Cache, Stream Buffer, Bus. Labels: Info Flow, Prefetch Requests, Data. Questions posed: What? When? Where?]
Data Prefetching Techniques
• Regular Access Prefetching
  • Tagged Next Block Lookahead [Smith 82]
    • Exploits sequential access patterns;
  • Stride Prefetching [Baer & Chen 91]
    • Exploits stride access patterns;
• Dependency-based Prefetching [Roth, et al 98]
  • Discovers linked-data-structure access patterns
• Dead Block Correlation [Lai, et al 01]
  • History-based correlation prediction
• Stream Buffer [Jouppi 90]
  • Reduces cache pollution
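A stream buffer holds prefetched blocks outside the cache, which is why it reduces pollution. The sketch below is a minimal illustration, assuming block-number addressing and a refill-on-hit policy that the slides do not specify; the class and method names are hypothetical. The default of 8 entries and the fully associative probe follow the simulation settings in this talk.

```python
from collections import deque


class StreamBuffer:
    """Toy stream buffer holding sequential block numbers (illustrative only)."""

    def __init__(self, entries: int = 8):
        self.entries = entries
        self.blocks = deque()      # prefetched block numbers, oldest first
        self.next_block = None     # next sequential block to prefetch

    def _refill(self) -> None:
        # Assumption: top the buffer up with the following sequential blocks.
        while len(self.blocks) < self.entries:
            self.blocks.append(self.next_block)
            self.next_block += 1

    def allocate(self, miss_block: int) -> None:
        """On a miss in both the cache and the buffer, restart the stream."""
        self.blocks.clear()
        self.next_block = miss_block + 1
        self._refill()

    def lookup(self, block: int) -> bool:
        """Fully associative probe: any entry may hit, not just the head."""
        if block in self.blocks:
            self.blocks.remove(block)  # block moves into the cache
            self._refill()
            return True
        return False
```

A caller would probe `lookup` on every L1 miss and call `allocate` only when the probe fails.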
Simulation Settings
• SimpleScalar v3.0
  • Modified sim-outorder to implement
    • information sharing between the main processor (MP) and the prefetching coprocessor (PCP);
  • Modified cache module to implement
    • Prefetching schemes (between L1 and L2 cache),
    • Prefetch queue (len = 16); bus sharing/contention,
    • Stream buffer.
• Memory Parameters
  • L1 Data Cache: 4KB, 32B line, 4-way associative;
  • L2 Cache: 64KB, 64B line, 4-way associative;
  • Stream buffer: 8 entries, fully associative, 1-cycle hit;
  • Hit latency (cycles): L1 = 1, L2 = 12, Mem = 70 (2*);
  • Pipelined bus: bus contention/latency are modeled.
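As a quick check on the memory parameters above, the number of sets is size / (line size × associativity). The sketch below works that out and formats it in the stock sim-outorder `<name>:<nsets>:<bsize>:<assoc>:<repl>` cache syntax. The helper names, the cache names `dl1`/`ul2`, and the LRU policy (`l`) are assumptions; the slides do not state them.

```python
def num_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache."""
    assert size_bytes % (line_bytes * ways) == 0, "geometry must divide evenly"
    return size_bytes // (line_bytes * ways)


def cache_config(name: str, size_bytes: int, line_bytes: int, ways: int,
                 repl: str = "l") -> str:
    """Format as <name>:<nsets>:<bsize>:<assoc>:<repl> (repl='l' is assumed LRU)."""
    return f"{name}:{num_sets(size_bytes, line_bytes, ways)}:{line_bytes}:{ways}:{repl}"


L1_SETS = num_sets(4 * 1024, 32, 4)     # 4KB, 32B line, 4-way
L2_SETS = num_sets(64 * 1024, 64, 4)    # 64KB, 64B line, 4-way
L1_CFG = cache_config("dl1", 4 * 1024, 32, 4)
L2_CFG = cache_config("ul2", 64 * 1024, 64, 4)
```

With these parameters the L1 data cache has 32 sets and the L2 cache has 256 sets.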
Benchmarks
• From SPEC95
  • gcc
  • compress
  • swim
  • tomcatv
• Microbenchmarks
  • Matrix multiplication (128 × 128 double)
  • Binary tree (1M nodes, similar to treeadd)
Results (IPC)
Results (Miss Ratio)
L1-L2 Traffic Increase
Results (Delay Tolerance)
• How many cycles of delay can the PCP tolerate?
• More delay means
  • Less useful (prefetched data can't get back before the demand references)
  • More pollution (due to outdated information)
  • Fewer prefetches (due to bus contention)
• To avoid pollution, implement the prefetch queue as a circular buffer.
  • Overwrite outdated entries when the queue is full.
  • The major effect of larger delay will then be fewer prefetches.
• Hard to model memory behavior in SimpleScalar
  • Predetermined latency, no wake-up, no MSHR.
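The circular prefetch queue described above can be sketched as a ring buffer in which a push to a full queue overwrites the oldest (most outdated) request. This is an illustration, not the modified SimpleScalar code; the class name and interface are hypothetical, and the default capacity of 16 comes from the simulation settings.

```python
class PrefetchQueue:
    """Fixed-capacity ring buffer; when full, a push overwrites the oldest entry."""

    def __init__(self, capacity: int = 16):
        self.buf = [None] * capacity
        self.head = 0    # index of the oldest pending request
        self.count = 0   # number of pending requests

    def push(self, addr: int) -> None:
        cap = len(self.buf)
        tail = (self.head + self.count) % cap
        self.buf[tail] = addr
        if self.count == cap:
            # Queue was full: tail == head, so the oldest entry was overwritten.
            self.head = (self.head + 1) % cap
        else:
            self.count += 1

    def pop(self):
        """Return the oldest pending prefetch address, or None if empty."""
        if self.count == 0:
            return None
        addr = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return addr
```

Because stale requests are dropped rather than issued late, extra delay shows up as fewer prefetches instead of more pollution.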
Delay tolerance
• Preliminary result
  • For almost all schemes on all benchmarks:
  • The PCP can tolerate 8 cycles of delay
Can we integrate different schemes?
• Different applications need different schemes
• Brute-force approach
  • Use both tagged and stride prefetching
  • Good speedup, but much more memory traffic.
• Adapt prefetching policy dynamically?
  • Share the same hardware table
    • Using similar matching schemes
    • Hard to reconfigure/flush on context switches
  • Use separate tables
    • More hardware
    • Similar to a tournament predictor (just a thought)
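The tournament-predictor idea is only floated in the talk, not evaluated. One way it might look is a 2-bit saturating counter that picks between the tagged and stride schemes, nudged toward whichever scheme's last prediction would have covered the demand access. Everything in this sketch (names, counter width, update rule) is an assumption for illustration.

```python
class SchemeChooser:
    """2-bit saturating counter: 0-1 selects 'tagged', 2-3 selects 'stride'."""

    def __init__(self):
        self.counter = 1  # start weakly favouring tagged prefetching

    def select(self) -> str:
        return "stride" if self.counter >= 2 else "tagged"

    def update(self, tagged_correct: bool, stride_correct: bool) -> None:
        # Only move when the two schemes disagree, as in a tournament predictor.
        if tagged_correct == stride_correct:
            return
        if stride_correct:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```

Only the selected scheme would issue prefetches, which avoids the extra memory traffic of the brute-force approach at the cost of keeping both tables.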
Conclusions
• PCP helps performance! (2-30% speedup)
• PCP handles prefetching and can tolerate some delay.
• Different schemes work for different applications
  • They require different information (from different places);
  • PCP should be placed close to the info source;
  • Not easy to integrate different schemes.
• Limitations of our approach
  • PCP not fully utilized.
  • Relies on tables (caches/queues/buffers)
    • DBCP requires a large history table (7.6 M memory)!
• Delay is critical to performance
  • It limits the complexity of prefetch schemes,
  • It also determines where to place the PCP.
Future Work
• Evaluate more prefetching schemes
  • Dependency-based prefetching, etc.
• PCP running ahead
  • Probably with the help of a trace cache;
  • To fully utilize the PCP;
  • Needs checkpoint/rollback mechanisms.
• Coprocessor to support other functionality
  • Branch prediction, power management.
• PCP for multiprocessors
  • Suitable for One-Block-Lookahead.
  • Need to change the cache-coherence (CC) protocol.
Thank You! Questions?
Backup Slides
Tagged Prefetching
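A minimal sketch of tagged next-block lookahead: each block carries a tag bit that is clear while the block is prefetched but unreferenced, and the next block is prefetched on a demand miss or on the first demand reference to a prefetched block. The sketch assumes an unbounded cache (a dict, no evictions) and block-number addresses; the names are hypothetical.

```python
class TaggedPrefetcher:
    """Tagged next-block lookahead over an idealised, unbounded cache."""

    def __init__(self):
        # block number -> tag bit (True once demand-referenced)
        self.tag = {}

    def access(self, block: int) -> list:
        """Handle a demand access; return the block numbers to prefetch."""
        # First touch = demand miss, or first use of a prefetched block.
        first_touch = not self.tag.get(block, False)
        self.tag[block] = True
        prefetches = []
        if first_touch:
            nxt = block + 1
            if nxt not in self.tag:
                self.tag[nxt] = False  # present via prefetch, not yet used
                prefetches.append(nxt)
        return prefetches
```

On a sequential scan this keeps exactly one block of lookahead ahead of the demand stream.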
Stride Prefetching
• Reference Prediction Table (RPT)
  • Organized like a cache, indexed by PC
  • (Data address, stride, state)
• State Machine
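The RPT and its state machine can be sketched as below, using the four states usually attributed to Baer and Chen (initial, transient, steady, no-prediction). Two simplifications are assumptions of this sketch: the table is direct-mapped on `pc % num_entries`, and a prefetch is issued only in the steady state. The names are hypothetical.

```python
from dataclasses import dataclass

INITIAL, TRANSIENT, STEADY, NO_PRED = "initial", "transient", "steady", "no-pred"

# Next state when the predicted address (prev_addr + stride) was wrong.
ON_INCORRECT = {INITIAL: TRANSIENT, TRANSIENT: NO_PRED,
                STEADY: INITIAL, NO_PRED: NO_PRED}


@dataclass
class Entry:
    prev_addr: int
    stride: int = 0
    state: str = INITIAL


class RPT:
    def __init__(self, num_entries: int = 64):
        self.num_entries = num_entries
        self.table = {}  # index -> (pc tag, Entry)

    def access(self, pc: int, addr: int):
        """Record a load at `pc` to `addr`; return a prefetch address or None."""
        idx = pc % self.num_entries
        slot = self.table.get(idx)
        if slot is None or slot[0] != pc:
            self.table[idx] = (pc, Entry(prev_addr=addr))  # allocate / replace
            return None
        e = slot[1]
        if addr == e.prev_addr + e.stride:
            e.state = TRANSIENT if e.state == NO_PRED else STEADY
        else:
            if e.state != STEADY:          # steady keeps its stride on one miss
                e.stride = addr - e.prev_addr
            e.state = ON_INCORRECT[e.state]
        e.prev_addr = addr
        if e.state == STEADY and e.stride != 0:
            return addr + e.stride
        return None
```

For a load walking an array with an 8-byte stride, the first two accesses train the entry and later accesses each return the next address.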
Dependency-based Prefetching
• Potential Producer Window
• Correlation Table
• One Step Ahead
• Jump Pointer Generation/Maintenance
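A rough sketch of how the Potential Producer Window (PPW) and Correlation Table (CT) might cooperate to prefetch one step ahead: the PPW remembers recently loaded values, a later load whose base address matches one of them creates a producer-to-consumer correlation, and the next time the producer executes its value is used to prefetch for the consumer. This is a loose illustration of the idea, not the mechanism as published; table sizes, names, and the interface are assumptions, and jump pointers are not modeled.

```python
from collections import OrderedDict, defaultdict


class DependencePrefetcher:
    def __init__(self, ppw_size: int = 8):
        self.ppw_size = ppw_size
        self.ppw = OrderedDict()      # loaded value -> PC of the producing load
        self.ct = defaultdict(set)    # producer PC -> {(consumer PC, offset)}

    def on_load(self, pc: int, base: int, offset: int, value: int) -> list:
        """Observe a load of mem[base + offset] == value; return prefetch addresses."""
        # 1. Did an earlier load produce this load's base address?
        producer = self.ppw.get(base)
        if producer is not None:
            self.ct[producer].add((pc, offset))
        # 2. This load's value may become the base address of a future load.
        self.ppw[value] = pc
        self.ppw.move_to_end(value)
        if len(self.ppw) > self.ppw_size:
            self.ppw.popitem(last=False)   # drop the oldest window entry
        # 3. One step ahead: prefetch for every known consumer of this producer.
        return sorted(value + off for (_, off) in self.ct.get(pc, ()))
```

For a linked-list walk `p = p->next` with the `next` field at offset 8, the second iteration already yields a prefetch for the following node's `next` field.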