Adaptive History-Based Memory Schedulers
Ibrahim Hur and Calvin Lin
IBM Austin and The University of Texas at Austin
Memory Bottleneck
• Memory system performance is not increasing as fast as CPU performance
• Latency: use caches, prefetching, …
• Bandwidth: use parallelism inside the memory system
How to Increase Memory Command Parallelism?
• Similar to instruction scheduling: commands can be reordered for higher bandwidth
[Figure: timeline over a four-bank DRAM; issuing Read Bank 0, Read Bank 0, Read Bank 1 back-to-back causes a bank conflict, while the reordered sequence Read Bank 0, Read Bank 1, Read Bank 0 avoids it]
Inside the Memory System
• Commands from the caches enter a Read Queue and a Write Queue (reorder queues, not FIFO)
• The arbiter schedules memory operations from these queues into a FIFO Memory Queue that feeds the DRAM
[Figure: memory controller with read/write reorder queues, the arbiter, and the FIFO memory queue in front of DRAM]
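To make this organization concrete, here is a minimal Python sketch of the structure (the class and field names are mine, not the Power5's): reads and writes wait in reorder queues that the arbiter may drain in any order, while the memory queue is a strict FIFO.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class MemoryCommand:
    is_read: bool
    rank: int
    bank: int
    port: int

class MemoryController:
    def __init__(self):
        self.read_queue = deque()    # reorder queue: the arbiter may pick any entry
        self.write_queue = deque()   # reorder queue: the arbiter may pick any entry
        self.memory_queue = deque()  # strict FIFO feeding the DRAM

    def tick(self, arbiter):
        # One scheduling decision per cycle: the arbiter is free to choose
        # any queued command, not just the oldest one.
        cmd = arbiter.select(self.read_queue, self.write_queue)
        if cmd is not None:
            self.memory_queue.append(cmd)
```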
Our Work
• Study memory command scheduling in the context of the IBM Power5
• Present new memory arbiters
• 20% increased bandwidth
• Very little cost: 0.04% increase in chip area
Outline
• The Problem
• Characteristics of DRAM
• Previous Scheduling Methods
• Our Approach
  • History-based schedulers
  • Adaptive history-based schedulers
• Results
• Conclusions
Understanding the Problem: Characteristics of DRAM
• Multi-dimensional structure
  • Banks, rows, and columns
  • IBM Power5: ranks and ports as well
• Access time is not uniform (see the cost sketch below)
  • Bank-to-bank conflicts
  • Read after Write to the same rank conflicts
  • Write after Read to a different port conflicts
  • …
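These pairwise constraints can be thought of as a cost on consecutive commands. The sketch below (with made-up cycle penalties, not actual Power5/DDR2 timings) operates on the MemoryCommand records from the previous sketch:

```python
def conflict_cost(prev, cmd):
    """Illustrative delay, in cycles, incurred by issuing cmd right after prev.
    The penalty values are placeholders; real values come from DRAM timing."""
    cost = 0
    if prev.rank == cmd.rank and prev.bank == cmd.bank:
        cost += 8   # bank-to-bank conflict: the bank needs recovery time
    if (not prev.is_read) and cmd.is_read and prev.rank == cmd.rank:
        cost += 6   # read after write to the same rank
    if prev.is_read and (not cmd.is_read) and prev.port != cmd.port:
        cost += 4   # write after read to a different port
    return cost
```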
Previous Scheduling Approaches: FIFO Scheduling
[Figure: commands flow from the caches through the read/write queues into the FIFO memory queue in arrival order; the arbiter does no reordering]
Memoryless Scheduling (adapted from Rixner et al., ISCA 2000)
[Figure: the same controller organization; the arbiter avoids issuing a command that conflicts with the one just issued, but a conflicting command can still impose a long delay]
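A memoryless arbiter in this spirit (a paraphrase for illustration, not the exact ISCA 2000 algorithm) remembers only the last command it issued: it scans the queues oldest-first and issues the first command with no conflict against that single command.

```python
class MemorylessArbiter:
    """Issues the oldest command that does not conflict with the previously
    issued command; if every candidate conflicts, falls back to the oldest."""
    def __init__(self):
        self.last = None

    def select(self, read_queue, write_queue):
        # Reads are scanned before writes here, purely for simplicity.
        candidates = list(read_queue) + list(write_queue)
        if not candidates:
            return None
        choice = next((c for c in candidates
                       if self.last is None or conflict_cost(self.last, c) == 0),
                      candidates[0])
        (read_queue if choice.is_read else write_queue).remove(choice)
        self.last = choice
        return choice
```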
What we really want
• Keep the pipeline full; don't hold commands in the reorder queues until conflicts are fully resolved
• Forward them to the memory queue in an order that minimizes future conflicts
• To do this, we need to know the history of the commands
[Figure: reorder queues feeding the memory queue through the arbiter; with commands A, B, C, D queued, issuing D next is better than C because it incurs a smaller future conflict cost]
Another Goal: Match the Application's Memory Command Behavior
• The arbiter should select commands from the queues roughly in the ratio in which the application generates them
• Otherwise, the read or write queue may become congested
• Command history is useful here too
Our Approach: History-Based Memory Schedulers
Benefits:
• Minimize contention costs
• Consider multiple constraints
• Match the application's memory access behavior
  • 2 reads per write?
  • 1 read per write?
  • …
• The result: a less congested memory system, i.e., more bandwidth
How does it work?
• Use a finite state machine (FSM)
• Each state in the FSM represents one possible history
• Transitions out of a state are prioritized
• At any state, the scheduler selects the available command with the highest priority
• The FSM is generated at design time (a sketch follows the example below)
An Example
[Figure: one FSM state with prioritized transitions; given the available commands in the reorder queues, the current state's First/Second/Third/Fourth Preference ordering picks the most appropriate command to send to memory and determines the next state]
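A sketch of how such an FSM could be generated and used (all names here are illustrative; the real tables come from the design-time cost analysis in the paper). The state is the history of the last two commands' classes, and each state stores a precomputed priority over command classes:

```python
from itertools import product

HISTORY_SIZE = 2  # a history of 2 works well (see Other Results)

def cmd_class(cmd):
    # Coarse equivalence class used to index the FSM; a real design would
    # also fold in rank and port.
    return ('R' if cmd.is_read else 'W', cmd.bank)

def build_fsm(classes, priority_of):
    """Design-time FSM generation: for every possible history (= state),
    precompute a priority rank for each command class. priority_of(state, cls)
    is a stand-in for the paper's cost/mix analysis."""
    fsm = {}
    for state in product(classes, repeat=HISTORY_SIZE):
        ranked = sorted(classes, key=lambda cls: priority_of(state, cls))
        fsm[state] = {cls: rank for rank, cls in enumerate(ranked)}
    return fsm

class HistoryBasedArbiter:
    def __init__(self, fsm, initial_state):
        self.fsm = fsm
        self.state = initial_state  # tuple of the last HISTORY_SIZE classes

    def select(self, read_queue, write_queue):
        candidates = list(read_queue) + list(write_queue)
        if not candidates:
            return None
        # Pick the available command whose class has the highest priority
        # (lowest rank) in the current state, then advance the history.
        choice = min(candidates, key=lambda c: self.fsm[self.state][cmd_class(c)])
        (read_queue if choice.is_read else write_queue).remove(choice)
        self.state = self.state[1:] + (cmd_class(choice),)
        return choice
```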
How to determine priorities?
• Two criteria:
  • A: minimize contention costs
  • B: satisfy the program's read/write command mix
• First method: use A, break ties with B
• Second method: use B, break ties with A
• Which method to use? Combine the two methods probabilistically (details in the paper; a sketch follows)
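One way to realize this for the `build_fsm` sketch above: decide per state, at design time, whether criterion A or B is primary. The weight `p` and the ranking helpers `cost_rank` and `mix_rank` below are assumptions for illustration, not values or functions from the paper.

```python
import random

def make_priority_of(cost_rank, mix_rank, p=0.7):
    """Returns a priority_of(state, cls) function for build_fsm. cost_rank
    and mix_rank are hypothetical helpers implementing criteria A and B.
    With probability p, A is primary and B breaks ties; otherwise the
    reverse. The choice is fixed per state at design time."""
    def priority_of(state, cls):
        # Seed with the state so every class in one state is ranked under
        # the same (A-first or B-first) ordering.
        if random.Random(hash(state)).random() < p:
            return (cost_rank(state, cls), mix_rank(state, cls))
        return (mix_rank(state, cls), cost_rank(state, cls))
    return priority_of
```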
Limitation of the History-Based Approach
• Designed for one particular mix of reads and writes
• Solution: adaptive history-based schedulers
  • Create multiple state machines, one for each read/write mix
  • Periodically select the most appropriate state machine
Adaptive History-Based Schedulers
[Figure: three history-based arbiters tuned for 2R:1W, 1R:1W, and 1R:2W command mixes; read, write, and cycle counters feed arbiter-selection logic that periodically selects which arbiter is active]
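A sketch of that selection logic (the epoch length is an assumption; the mixes mirror the 2R:1W, 1R:1W, and 1R:2W arbiters in the figure):

```python
class AdaptiveArbiter:
    """Keeps one history-based arbiter per read/write mix and periodically
    switches to the one whose mix best matches recent traffic."""
    def __init__(self, arbiters_by_read_fraction, epoch=10_000):
        # e.g. {2/3: arbiter_2R1W, 1/2: arbiter_1R1W, 1/3: arbiter_1R2W}
        self.arbiters = arbiters_by_read_fraction
        self.epoch = epoch
        self.reads = self.writes = self.cycles = 0
        self.current = next(iter(self.arbiters.values()))

    def select(self, read_queue, write_queue):
        cmd = self.current.select(read_queue, write_queue)
        if cmd is not None:
            self.reads += cmd.is_read
            self.writes += not cmd.is_read
        self.cycles += 1
        if self.cycles >= self.epoch:
            # Switch to the arbiter whose design-time read fraction is
            # closest to the observed read fraction, then reset counters.
            observed = self.reads / max(self.reads + self.writes, 1)
            best = min(self.arbiters, key=lambda frac: abs(frac - observed))
            self.current = self.arbiters[best]
            self.reads = self.writes = self.cycles = 0
        return cmd
```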
Evaluation
• Used a cycle-accurate simulator for the IBM Power5
  • 1.6 GHz, 266 MHz DDR2, 4 ranks, 4 banks, 2 ports
• Evaluated and compared our approach with previous approaches on data-intensive applications: Stream, NAS, and microbenchmarks
The IBM Power5
• 2 cores on a chip
• SMT capability
• Large on-chip L2 cache
• Hardware prefetching
• 276 million transistors
• Memory controller: 1.6% of chip area
Results 2: NAS Benchmarks (1 core active)
[Figure: results chart for the NAS benchmarks; not recoverable from the extraction]
DRAM Utilization
[Figure: the memory system supports up to 12 concurrent commands; histograms of the number of active commands in DRAM compare the memoryless approach with our approach]
Why does it work? (detailed analysis in the paper)
[Figure: memory controller diagram annotated "Low Occupancy in Reorder Queues", "Busy Memory System", "Full Memory Queue", "Full Reorder Queues"]
Other Results
• We obtain >95% of the performance of a perfect DRAM configuration (no conflicts)
• Results with a higher frequency and with no data prefetching are in the paper
• A history size of 2 works well
Conclusions
• Introduced adaptive history-based schedulers
• Evaluated on a highly tuned system, the IBM Power5
• Performance improvement:
  • Over FIFO: Stream 63%, NAS 11%
  • Over Memoryless: Stream 19%, NAS 5%
• Little cost: 0.04% chip area increase
Conclusions (cont.)
• Similar arbiters can be used in other places as well, e.g., cache controllers
• Can optimize for other criteria, e.g., power, or power + performance