Adaptive History-Based Memory Schedulers
Ibrahim Hur and Calvin Lin
IBM Austin and The University of Texas at Austin
Memory Bottleneck
• Memory system performance is not increasing as fast as CPU performance
• Latency: use caches, prefetching, …
• Bandwidth: use parallelism inside the memory system
How to Increase Memory Command Parallelism?
• Similar to instruction scheduling: commands can be reordered for higher bandwidth
[Figure: timeline over a four-bank DRAM; issuing Read Bank 0, Read Bank 0, Read Bank 1 back-to-back causes a bank conflict, while the reordered sequence Read Bank 0, Read Bank 1, Read Bank 0 avoids it]
Inside the Memory System
• Commands from the caches enter a Read Queue and a Write Queue (reorder queues, not FIFO)
• The arbiter schedules memory operations from these queues into a FIFO Memory Queue that feeds the DRAM
[Figure: memory controller with read/write reorder queues, the arbiter, and the FIFO memory queue in front of DRAM]
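To make this organization concrete, here is a minimal Python sketch of the structure (the class and field names are mine, not the Power5's): reads and writes wait in reorder queues that the arbiter may drain in any order, while the memory queue is a strict FIFO.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class MemoryCommand:
    is_read: bool
    rank: int
    bank: int
    port: int

class MemoryController:
    def __init__(self):
        self.read_queue = deque()    # reorder queue: the arbiter may pick any entry
        self.write_queue = deque()   # reorder queue: the arbiter may pick any entry
        self.memory_queue = deque()  # strict FIFO feeding the DRAM

    def tick(self, arbiter):
        # One scheduling decision per cycle: the arbiter is free to choose
        # any queued command, not just the oldest one.
        cmd = arbiter.select(self.read_queue, self.write_queue)
        if cmd is not None:
            self.memory_queue.append(cmd)
```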
Our Work
• Study memory command scheduling in the context of the IBM Power5
• Present new memory arbiters
• 20% increased bandwidth
• Very little cost: 0.04% increase in chip area
Outline
• The Problem
• Characteristics of DRAM
• Previous Scheduling Methods
• Our Approach
  • History-based schedulers
  • Adaptive history-based schedulers
• Results
• Conclusions
Understanding the Problem: Characteristics of DRAM
• Multi-dimensional structure
  • Banks, rows, and columns
  • IBM Power5: ranks and ports as well
• Access time is not uniform (see the cost sketch below)
  • Bank-to-bank conflicts
  • Read after Write to the same rank conflicts
  • Write after Read to a different port conflicts
  • …
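These pairwise constraints can be thought of as a cost on consecutive commands. The sketch below (with made-up cycle penalties, not actual Power5/DDR2 timings) operates on the MemoryCommand records from the previous sketch:

```python
def conflict_cost(prev, cmd):
    """Illustrative delay, in cycles, incurred by issuing cmd right after prev.
    The penalty values are placeholders; real values come from DRAM timing."""
    cost = 0
    if prev.rank == cmd.rank and prev.bank == cmd.bank:
        cost += 8   # bank-to-bank conflict: the bank needs recovery time
    if (not prev.is_read) and cmd.is_read and prev.rank == cmd.rank:
        cost += 6   # read after write to the same rank
    if prev.is_read and (not cmd.is_read) and prev.port != cmd.port:
        cost += 4   # write after read to a different port
    return cost
```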
Previous Scheduling Approaches: FIFO Scheduling
[Figure: commands flow from the caches through the read/write queues into the FIFO memory queue in arrival order; the arbiter does no reordering]
Memoryless Scheduling (adapted from Rixner et al., ISCA 2000)
[Figure: the same controller organization; the arbiter avoids issuing a command that conflicts with the one just issued, but a conflicting command can still impose a long delay]
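A memoryless arbiter in this spirit (a paraphrase for illustration, not the exact ISCA 2000 algorithm) remembers only the last command it issued: it scans the queues oldest-first and issues the first command with no conflict against that single command.

```python
class MemorylessArbiter:
    """Issues the oldest command that does not conflict with the previously
    issued command; if every candidate conflicts, falls back to the oldest."""
    def __init__(self):
        self.last = None

    def select(self, read_queue, write_queue):
        # Reads are scanned before writes here, purely for simplicity.
        candidates = list(read_queue) + list(write_queue)
        if not candidates:
            return None
        choice = next((c for c in candidates
                       if self.last is None or conflict_cost(self.last, c) == 0),
                      candidates[0])
        (read_queue if choice.is_read else write_queue).remove(choice)
        self.last = choice
        return choice
```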
What we really want
• Keep the pipeline full; don't hold commands in the reorder queues until conflicts are fully resolved
• Forward them to the memory queue in an order that minimizes future conflicts
• To do this, we need to know the history of the commands
[Figure: reorder queues feeding the memory queue through the arbiter; with commands A, B, C, D queued, issuing D next is better than C because it incurs a smaller future conflict cost]
Another Goal: Match the Application's Memory Command Behavior
• The arbiter should select commands from the queues roughly in the ratio in which the application generates them
• Otherwise, the read or write queue may become congested
• Command history is useful here too
Our Approach: History-Based Memory Schedulers
Benefits:
• Minimize contention costs
• Consider multiple constraints
• Match the application's memory access behavior
  • 2 reads per write?
  • 1 read per write?
  • …
• The result: a less congested memory system, i.e., more bandwidth
How does it work?
• Use a finite state machine (FSM)
• Each state in the FSM represents one possible history
• Transitions out of a state are prioritized
• At any state, the scheduler selects the available command with the highest priority
• The FSM is generated at design time (a sketch follows the example below)
An Example
[Figure: one FSM state with prioritized transitions; given the available commands in the reorder queues, the current state's First/Second/Third/Fourth Preference ordering picks the most appropriate command to send to memory and determines the next state]
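A sketch of how such an FSM could be generated and used (all names here are illustrative; the real tables come from the design-time cost analysis in the paper). The state is the history of the last two commands' classes, and each state stores a precomputed priority over command classes:

```python
from itertools import product

HISTORY_SIZE = 2  # a history of 2 works well (see Other Results)

def cmd_class(cmd):
    # Coarse equivalence class used to index the FSM; a real design would
    # also fold in rank and port.
    return ('R' if cmd.is_read else 'W', cmd.bank)

def build_fsm(classes, priority_of):
    """Design-time FSM generation: for every possible history (= state),
    precompute a priority rank for each command class. priority_of(state, cls)
    is a stand-in for the paper's cost/mix analysis."""
    fsm = {}
    for state in product(classes, repeat=HISTORY_SIZE):
        ranked = sorted(classes, key=lambda cls: priority_of(state, cls))
        fsm[state] = {cls: rank for rank, cls in enumerate(ranked)}
    return fsm

class HistoryBasedArbiter:
    def __init__(self, fsm, initial_state):
        self.fsm = fsm
        self.state = initial_state  # tuple of the last HISTORY_SIZE classes

    def select(self, read_queue, write_queue):
        candidates = list(read_queue) + list(write_queue)
        if not candidates:
            return None
        # Pick the available command whose class has the highest priority
        # (lowest rank) in the current state, then advance the history.
        choice = min(candidates, key=lambda c: self.fsm[self.state][cmd_class(c)])
        (read_queue if choice.is_read else write_queue).remove(choice)
        self.state = self.state[1:] + (cmd_class(choice),)
        return choice
```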
How to determine priorities?
• Two criteria:
  • A: minimize contention costs
  • B: satisfy the program's read/write command mix
• First method: use A, break ties with B
• Second method: use B, break ties with A
• Which method to use? Combine the two methods probabilistically (details in the paper; a sketch follows)
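One way to realize this for the `build_fsm` sketch above: decide per state, at design time, whether criterion A or B is primary. The weight `p` and the ranking helpers `cost_rank` and `mix_rank` below are assumptions for illustration, not values or functions from the paper.

```python
import random

def make_priority_of(cost_rank, mix_rank, p=0.7):
    """Returns a priority_of(state, cls) function for build_fsm. cost_rank
    and mix_rank are hypothetical helpers implementing criteria A and B.
    With probability p, A is primary and B breaks ties; otherwise the
    reverse. The choice is fixed per state at design time."""
    def priority_of(state, cls):
        # Seed with the state so every class in one state is ranked under
        # the same (A-first or B-first) ordering.
        if random.Random(hash(state)).random() < p:
            return (cost_rank(state, cls), mix_rank(state, cls))
        return (mix_rank(state, cls), cost_rank(state, cls))
    return priority_of
```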
Limitation of the History-Based Approach
• Designed for one particular mix of reads and writes
• Solution: adaptive history-based schedulers
  • Create multiple state machines, one for each read/write mix
  • Periodically select the most appropriate state machine
Adaptive History-Based Schedulers
[Figure: three history-based arbiters tuned for 2R:1W, 1R:1W, and 1R:2W command mixes; read, write, and cycle counters feed arbiter-selection logic that periodically selects which arbiter is active]
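A sketch of that selection logic (the epoch length is an assumption; the mixes mirror the 2R:1W, 1R:1W, and 1R:2W arbiters in the figure):

```python
class AdaptiveArbiter:
    """Keeps one history-based arbiter per read/write mix and periodically
    switches to the one whose mix best matches recent traffic."""
    def __init__(self, arbiters_by_read_fraction, epoch=10_000):
        # e.g. {2/3: arbiter_2R1W, 1/2: arbiter_1R1W, 1/3: arbiter_1R2W}
        self.arbiters = arbiters_by_read_fraction
        self.epoch = epoch
        self.reads = self.writes = self.cycles = 0
        self.current = next(iter(self.arbiters.values()))

    def select(self, read_queue, write_queue):
        cmd = self.current.select(read_queue, write_queue)
        if cmd is not None:
            self.reads += cmd.is_read
            self.writes += not cmd.is_read
        self.cycles += 1
        if self.cycles >= self.epoch:
            # Switch to the arbiter whose design-time read fraction is
            # closest to the observed read fraction, then reset counters.
            observed = self.reads / max(self.reads + self.writes, 1)
            best = min(self.arbiters, key=lambda frac: abs(frac - observed))
            self.current = self.arbiters[best]
            self.reads = self.writes = self.cycles = 0
        return cmd
```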
Evaluation
• Used a cycle-accurate simulator for the IBM Power5
  • 1.6 GHz, 266 MHz DDR2, 4 ranks, 4 banks, 2 ports
• Evaluated and compared our approach with previous approaches on data-intensive applications: Stream, NAS, and microbenchmarks
The IBM Power5
• 2 cores on a chip
• SMT capability
• Large on-chip L2 cache
• Hardware prefetching
• 276 million transistors
• Memory controller: 1.6% of chip area
Results 2: NAS Benchmarks (1 core active)
[Figure: results chart for the NAS benchmarks; not recoverable from the extraction]
DRAM Utilization
[Figure: the memory system supports up to 12 concurrent commands; histograms of the number of active commands in DRAM compare the memoryless approach with our approach]
Why does it work? (detailed analysis in the paper)
[Figure: memory controller diagram annotated "Low Occupancy in Reorder Queues", "Busy Memory System", "Full Memory Queue", "Full Reorder Queues"]
Other Results
• We obtain >95% of the performance of a perfect DRAM configuration (no conflicts)
• Results with a higher frequency and with no data prefetching are in the paper
• A history size of 2 works well
Conclusions
• Introduced adaptive history-based schedulers
• Evaluated on a highly tuned system, the IBM Power5
• Performance improvement:
  • Over FIFO: Stream 63%, NAS 11%
  • Over Memoryless: Stream 19%, NAS 5%
• Little cost: 0.04% chip area increase
Conclusions (cont.)
• Similar arbiters can be used in other places as well, e.g., cache controllers
• Can optimize for other criteria, e.g., power, or power + performance