Decoupled Architecture for Data Prefetching

Jichuan Chang <chang@cs.wisc.edu>, Kai Xu <xuk@cs.wisc.edu>, CS752

Outline
• Motivation
• Design and Evaluation
• Results
• Conclusions

Motivation
• Processor-memory performance gap
• Prefetching helps, but it has overhead.
• Transistors are cheap; will a coprocessor help?
[Figure: Main Processor, Prefetching CoProcessor, Cache, L1-L2 Internal Bus; labels: Info Flow, Prefetch Requests, Data]

Why a dedicated coprocessor?
• Simple
  • It simplifies the design of the main processor.
• Powerful
  • It can (hopefully) exploit complex algorithms;
  • It handles computation overhead (i.e. pattern computation, address computation).
• Flexible
  • It can (hopefully) adapt to different situations;
  • It can implement different algorithms.
• But are these true?

The Generic Design
[Figure: Main Processor, Prefetching CoProcessor (ALU; tables: RPT, PPW, CT, History, ...), Cache, Stream Buffer, Bus; labels: Info Flow, Prefetch Requests, Data; design questions: What? When? Where?]

Data Prefetching Techniques
• Regular Access Prefetching
  • Tagged Next Block Lookahead [Smith 82]
    • Exploits sequential access patterns
  • Stride Prefetching [Baer & Chen 91]
    • Exploits stride access patterns
• Dependency-based Prefetching [Roth, et al 98]
  • Discovers linked-data-structure access patterns
• Dead Block Correlation [Lai, et al 01]
  • History-based correlation prediction
• Stream Buffer [Jouppi 90]
  • Reduces cache pollution

Simulation Settings
• SimpleScalar v3.0
  • Modified sim-outorder to implement
    • information sharing between the main processor (MP) and the prefetching coprocessor (PCP);
  • Modified cache module to implement
    • Prefetching schemes (between L1 and L2 cache),
    • Prefetch queue (len = 16); bus sharing/contention,
    • Stream buffer.
• Memory Parameters
  • L1 Data Cache: 4KB, 32B line, 4-way associative;
  • L2 Cache: 64KB, 64B line, 4-way associative;
  • Stream buffer: 8 entries, fully associative, 1-cycle hit;
  • Hit latency (cycles): L1 = 1, L2 = 12, Mem = 70 (2*);
  • Pipelined bus: bus contention/latency are modeled.
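
As a quick check on the memory parameters above, the number of sets in each cache follows from size / (line size × associativity). The sketch below is only illustrative: the helper names `num_sets` and `split_address` are hypothetical and are not taken from SimpleScalar.

```python
def num_sets(size_bytes, line_bytes, ways):
    """Sets in a set-associative cache: size / (line size * associativity)."""
    return size_bytes // (line_bytes * ways)


def split_address(addr, line_bytes, sets):
    """Split a byte address into (tag, set index, block offset)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


# Parameters from the slide: L1 = 4KB / 32B line / 4-way, L2 = 64KB / 64B line / 4-way.
L1_SETS = num_sets(4 * 1024, 32, 4)
L2_SETS = num_sets(64 * 1024, 64, 4)
```

With these parameters the L1 data cache has 32 sets and the L2 cache has 256 sets, so a prefetched block competes with only three other lines in its L1 set.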

Benchmarks
• From SPEC95
  • gcc
  • compress
  • swim
  • tomcatv
• Microbenchmarks
  • Matrix multiplication (128 x 128 double)
  • Binary tree (1M nodes, similar to treeadd)

Results (IPC)

Results (Miss Ratio)

Results (Prefetch Accuracy)

L1-L2 Traffic Increase

Results (Delay Tolerance)
• How many cycles of delay can the PCP tolerate?
• More delay means
  • Less useful prefetches (data can't get back before the demand references)
  • More pollution (due to outdated information)
  • Fewer prefetches (due to bus contention)
• To avoid pollution, implement the prefetch queue as a circular buffer.
  • Overwrite outdated entries when the queue is full.
  • The major effect of larger delay will then be fewer prefetches.
• Hard to model memory behavior in SimpleScalar
  • Predetermined latency, no wake-up, no MSHR.
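
The circular-buffer prefetch queue described above can be sketched as follows. This is a minimal illustration, not the authors' simulator code: the class name `PrefetchQueue` and its methods are hypothetical, and it assumes the oldest pending request is the "outdated" one to overwrite.

```python
from collections import deque


class PrefetchQueue:
    """Bounded queue of pending prefetch block addresses.

    Assumption: when the queue is full, the oldest request is treated as
    outdated and is overwritten by the newest one.
    """

    def __init__(self, length=16):
        # deque(maxlen=n) silently drops the leftmost (oldest) item on overflow.
        self.entries = deque(maxlen=length)

    def push(self, block_addr):
        self.entries.append(block_addr)

    def pop(self):
        """Return the oldest pending request, or None if the queue is empty."""
        return self.entries.popleft() if self.entries else None


q = PrefetchQueue(length=4)
for block in range(6):
    q.push(block)
```

After six pushes into a four-entry queue, requests 0 and 1 have been overwritten and only 2 through 5 remain, so a slow bus costs prefetches rather than issuing stale ones.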

Delay tolerance
• Preliminary result
  • For almost all schemes on all benchmarks:
    • The PCP can tolerate 8 cycles of delay

Can we integrate different schemes?
• Different applications need different schemes
• Brute force approach
  • Use both tagged and stride prefetching
  • Good speedup, but much more memory traffic.
• Adapt prefetching policy dynamically?
  • Share the same hardware table
    • Using similar matching schemes
    • Hard to reconfigure/flush on context switches
  • Use separate tables
    • More hardware
    • Similar to a tournament predictor (just a thought)
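
The tournament-predictor idea above is only floated as a thought in the slides, so the following is a speculative sketch rather than anything that was evaluated. It borrows the 2-bit saturating chooser used in tournament branch predictors; the class `TournamentChooser` and the notion of a prefetch being "useful" (for example, its block was referenced before eviction) are assumptions.

```python
class TournamentChooser:
    """2-bit saturating counter that selects between two prefetch schemes."""

    def __init__(self):
        self.counter = 2  # range 0..3; >= 2 selects stride, < 2 selects tagged

    def choose(self):
        return "stride" if self.counter >= 2 else "tagged"

    def update(self, tagged_useful, stride_useful):
        # Move toward whichever scheme alone would have been useful;
        # leave the counter unchanged when both agree.
        if stride_useful and not tagged_useful:
            self.counter = min(3, self.counter + 1)
        elif tagged_useful and not stride_useful:
            self.counter = max(0, self.counter - 1)


chooser = TournamentChooser()
first_choice = chooser.choose()
chooser.update(tagged_useful=True, stride_useful=False)
chooser.update(tagged_useful=True, stride_useful=False)
after_tagged_wins = chooser.choose()
```

Both schemes still need their own tables to produce candidate predictions, which matches the "more hardware" cost noted above; the chooser only decides which candidate is actually issued.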

Conclusions
• PCP helps performance! (2-30% speedup)
• PCP handles prefetching and can tolerate some delay.
• Different schemes work for different applications
  • They require different information (from different places);
  • PCP should be placed close to the info source;
  • Not easy to integrate different schemes.
• Limitations of our approach
  • PCP not fully utilized.
  • Relies on tables (caches/queues/buffers)
    • DBCP requires a large history table (7.6 M memory)!
• Delay is critical to performance
  • It limits the complexity of prefetch schemes,
  • It also determines where to place the PCP.

Future Work
• To evaluate more prefetching schemes
  • Dependency-based prefetching, etc.
• PCP Running Ahead
  • Probably with the help of a trace cache;
  • To fully utilize PCP;
  • Needs checkpoint/rollback mechanisms.
• CoProcessor to Support Other Functionalities
  • Branch prediction, power management.
• PCP for Multiprocessors
  • Suitable for One-Block-Lookahead.
  • Need to change the cache coherence (CC) protocol.

Thank You! Questions?

Backup Slides

Tagged Prefetching
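
The figure for this slide did not survive, so here is a minimal sketch of tagged next-block prefetching as it is commonly described: each block carries a tag bit, and the next sequential block is prefetched on a demand miss or on the first reference to a prefetched block. The class `TaggedPrefetcher` is hypothetical, and the model assumes an unbounded cache (no evictions) to keep the tag-bit logic visible.

```python
class TaggedPrefetcher:
    """Tagged next-block lookahead over an idealized, unbounded cache."""

    def __init__(self, line_bytes=32):
        self.line_bytes = line_bytes
        self.tag_bit = {}  # block number -> True once demand-referenced
        self.issued = []   # prefetched block numbers, in issue order

    def _prefetch(self, block):
        if block not in self.tag_bit:
            self.tag_bit[block] = False  # present but not yet referenced
            self.issued.append(block)

    def access(self, addr):
        """Process one demand reference; return True on a hit."""
        block = addr // self.line_bytes
        hit = block in self.tag_bit
        first_use = hit and not self.tag_bit[block]
        self.tag_bit[block] = True
        if not hit or first_use:
            self._prefetch(block + 1)
        return hit


p = TaggedPrefetcher(line_bytes=32)
hits = [p.access(addr) for addr in (0, 32, 64, 96)]
```

On this sequential trace only the first reference misses; each later reference is the first use of a prefetched block and triggers the next prefetch.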

Stride Prefetching
• Reference Prediction Table (RPT)
  • Organized like a cache, indexed by PC
  • Each entry holds (data address, stride, state)
• State Machine
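
The RPT and its state machine can be sketched as below. This follows the four-state scheme (initial, transient, steady, no-prediction) usually attributed to Baer and Chen, with two simplifications that are assumptions of this sketch: the table is direct-mapped, and a prefetch is issued only in the steady state. The names `StridePrefetcher` and `RPTEntry` are hypothetical.

```python
# (state, prediction correct?) -> next state
NEXT_STATE = {
    ("initial", True): "steady",      ("initial", False): "transient",
    ("transient", True): "steady",    ("transient", False): "no-pred",
    ("steady", True): "steady",       ("steady", False): "initial",
    ("no-pred", True): "transient",   ("no-pred", False): "no-pred",
}


class RPTEntry:
    def __init__(self, pc, addr):
        self.pc = pc            # tag: PC of the load/store instruction
        self.prev_addr = addr   # last data address seen for this PC
        self.stride = 0
        self.state = "initial"


class StridePrefetcher:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}  # index -> RPTEntry (direct-mapped)

    def access(self, pc, addr):
        """Update the RPT for one memory reference; return a prefetch address or None."""
        index = (pc >> 2) % self.entries  # assumes 4-byte instructions
        e = self.table.get(index)
        if e is None or e.pc != pc:
            self.table[index] = RPTEntry(pc, addr)
            return None
        correct = (addr - e.prev_addr) == e.stride
        # The stride is recomputed on a misprediction, except when leaving steady.
        if not correct and e.state != "steady":
            e.stride = addr - e.prev_addr
        e.state = NEXT_STATE[(e.state, correct)]
        e.prev_addr = addr
        if e.state == "steady" and e.stride != 0:
            return addr + e.stride
        return None


sp = StridePrefetcher()
predictions = [sp.access(0x400, a) for a in (100, 108, 116, 124)]
```

For a load that walks memory in 8-byte steps, the entry needs two references to learn the stride and one more to confirm it, after which every access yields a prefetch address one stride ahead.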

Dependency-based Prefetching
• Potential Producer Window (PPW)
• Correlation Table (CT)
• One Step Ahead
• Jump Pointer Generation/Maintenance
