Improving Memory Bank-Level Parallelism in the Presence of Prefetching

Chang Joo Lee, Veynu Narasiman, Onur Mutlu*, Yale N. Patt • Electrical and Computer Engineering, The University of Texas at Austin • * Electrical and Computer Engineering, Carnegie Mellon University

Main Memory System • Crucial to high performance computing • Made of DRAM chips • Multiple banks → Each bank can be accessed independently

Memory Bank-Level Parallelism (BLP) [Figure: a DRAM system with a DRAM controller, a DRAM request buffer, and DRAM banks 0 and 1. Req B0 (bank 0) and Req B1 (bank 1) are serviced with overlapped time, and their data is then returned over the data bus.] • DRAM throughput increased
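The benefit of overlapping bank accesses can be sketched with a toy timing model. This is a minimal illustration, not the simulator used in the talk: it assumes a fixed per-request bank latency, ignores data bus transfer time and row-buffer effects, and the function names `serial_time` and `overlapped_time` are hypothetical.

```python
from collections import Counter

def serial_time(bank_ids, latency):
    """Total time if every request waits for the previous one to finish."""
    return len(bank_ids) * latency

def overlapped_time(bank_ids, latency):
    """Total time if banks work in parallel.

    Requests to the same bank still serialize, so the busiest bank
    determines the total time.
    """
    per_bank = Counter(bank_ids)
    return max(per_bank.values(), default=0) * latency

# Req B0 goes to bank 0 and Req B1 goes to bank 1 (latency in arbitrary units).
requests = [0, 1]
print(serial_time(requests, 100))      # 200
print(overlapped_time(requests, 100))  # 100
```

Two requests to the same bank, for example `overlapped_time([0, 0], 100)`, gain nothing in this model, which is why the order in which requests reach the DRAM controller matters.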

Memory Latency-Tolerance Mechanisms • Out-of-order execution, prefetching, runahead etc. • Increase outstanding memory requests on the chip • Memory-Level Parallelism (MLP) [Glew’98] • Hope many requests will be serviced in parallel in the memory system • Higher performance can be achieved when BLP is exposed to the DRAM controller

Problems • On-chip buffers, e.g., Miss Status Holding Registers (MSHRs), are limited in size • This limits the BLP exposed to the DRAM controller • E.g., requests to the same bank fill up the MSHRs • In CMPs, memory requests from different cores are mixed together in DRAM request buffers • This destroys the BLP of each application running on the CMP • Request issue policies are critical to the BLP exploited by the DRAM controller

Goals and Proposal • Goals: 1. Maximize the BLP exposed from each core to the DRAM controller → Increase DRAM throughput for useful requests 2. Preserve the BLP of each application in CMPs → Increase system performance • BLP-Aware Prefetch Issue (BAPI): Decides the order in which prefetches are sent from the prefetcher to the MSHRs • BLP-Preserving Multi-core Request Issue (BPMRI): Decides the order in which memory requests are sent from each core to the DRAM request buffers

DRAM BLP-Aware Request Issue Policies • BLP-Aware Prefetch Issue (BAPI) • BLP-Preserving Multi-core Request Issue (BPMRI)

What Can Limit DRAM BLP? • Miss Status Holding Registers (MSHRs) are NOT large enough to handle many memory requests [Tuck, MICRO’06] • MSHRs keep track of all outstanding misses for a core → Total number of demand/prefetch requests ≤ total number of MSHR entries • Complex, latency-critical, and power-hungry → Not scalable • The request issue policy to the MSHRs affects the level of BLP exploited by the DRAM controller

What Can Limit DRAM BLP? [Figure: timelines of a demand request (Dem B0) and prefetch requests (Pref B0, Pref B1) moving from a core's prefetch request buffer through full MSHRs to the DRAM request buffers of banks 0 and 1, under two issue policies: FIFO (Intel Core) and BLP-aware. The BLP-aware timeline shows more overlapped bank service time and a shorter DRAM service time ("Saved time").] • Increasing the number of requests ≠ high DRAM BLP • A simple issue policy improves DRAM BLP

BLP-Aware Prefetch Issue (BAPI) • Sends prefetches to MSHRs based on current BLP exposed in the memory system • Sends a prefetch mapped to the least busy DRAM bank • Adaptively limits the issue of prefetches based on prefetch accuracy estimation • Low prefetch accuracy → Fewer prefetches issued to MSHRs • High prefetch accuracy → Maximize BLP

Implementation of BAPI • FIFO prefetch request buffer per DRAM bank • Stores prefetches mapped to the corresponding DRAM bank • MSHR occupancy counter per DRAM bank • Keeps track of the number of outstanding requests to the corresponding DRAM bank • Prefetch accuracy register • Stores the estimated prefetch accuracy periodically

BAPI Policy • Every prefetch issue cycle: • Make the oldest prefetch to each bank valid only if the bank’s MSHR occupancy counter ≤ prefetch send threshold • Among valid prefetches, select the request to the bank with the minimum MSHR occupancy counter value
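The two-step selection can be sketched as follows. This is a minimal sketch under stated assumptions, not the hardware design from the talk: the names `select_prefetch` and `issue_prefetch` are hypothetical, ties between banks are broken by lowest bank index (the slide does not specify a tie-break), and the check for a free MSHR entry is omitted.

```python
from collections import deque

def select_prefetch(prefetch_queues, mshr_occupancy, send_threshold):
    """Return the bank whose oldest prefetch should be issued, or None."""
    # Step 1: the oldest prefetch of a bank is valid only if that bank's
    # MSHR occupancy counter is at or below the prefetch send threshold.
    valid = [
        bank
        for bank, queue in enumerate(prefetch_queues)
        if queue and mshr_occupancy[bank] <= send_threshold
    ]
    if not valid:
        return None
    # Step 2: among valid banks, pick the one with the fewest outstanding
    # requests (assumption: the lowest bank index wins a tie).
    return min(valid, key=lambda bank: mshr_occupancy[bank])

def issue_prefetch(prefetch_queues, mshr_occupancy, send_threshold):
    """Issue one prefetch to the MSHRs and update the occupancy counter."""
    bank = select_prefetch(prefetch_queues, mshr_occupancy, send_threshold)
    if bank is None:
        return None
    prefetch = prefetch_queues[bank].popleft()  # per-bank FIFO: oldest first
    mshr_occupancy[bank] += 1
    return bank, prefetch

queues = [deque(["p0a", "p0b"]), deque(["p1a"])]
occupancy = [2, 0]  # bank 0 already has two outstanding requests
print(issue_prefetch(queues, occupancy, send_threshold=1))  # (1, 'p1a')
```

In the usage example, bank 0 is over the threshold, so the prefetch to the idle bank 1 is issued even though older prefetches to bank 0 are waiting.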

Adaptivity of BAPI • Prefetch Send Threshold • Reserves MSHR entries for prefetches to different banks • Adjusted based on prefetch accuracy • Low prefetch accuracy → low prefetch send threshold • High prefetch accuracy → high prefetch send threshold
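One way to picture the adaptivity is a lookup from estimated accuracy to a threshold. This is only a sketch: the accuracy estimate (useful prefetches divided by prefetches sent during the last interval) is an assumption about how the estimate could be formed, the function names are hypothetical, and the breakpoints and threshold values in `EXAMPLE_TABLE` are placeholders chosen for illustration, not values from the talk.

```python
def estimate_accuracy(useful_prefetches, sent_prefetches):
    """Assumed estimate: fraction of sent prefetches that were useful."""
    if sent_prefetches == 0:
        return 0.0
    return useful_prefetches / sent_prefetches

def prefetch_send_threshold(accuracy, table):
    """Map accuracy to a send threshold.

    `table` holds (minimum accuracy, threshold) pairs sorted by ascending
    minimum accuracy; the last pair whose minimum is met wins.
    """
    threshold = table[0][1]
    for min_accuracy, value in table:
        if accuracy >= min_accuracy:
            threshold = value
    return threshold

# Hypothetical placeholder values: low accuracy -> low threshold,
# high accuracy -> high threshold.
EXAMPLE_TABLE = [(0.0, 1), (0.5, 4), (0.9, 16)]

accuracy = estimate_accuracy(useful_prefetches=3, sent_prefetches=4)
print(accuracy, prefetch_send_threshold(accuracy, EXAMPLE_TABLE))  # 0.75 4
```

A low threshold leaves most MSHR entries free for demand requests and for prefetches to other banks, which is the reservation effect the slide describes.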

DRAM BLP-Aware Request Issue Policies • BLP-Aware Prefetch Issue (BAPI) • BLP-Preserving Multi-core Request Issue (BPMRI)

BLP Destruction in CMP Systems • DRAM request buffers are shared by multiple cores • To exploit the BLP of a core, the BLP should be exposed to DRAM request buffers • BLP potential of a core can be destroyed by the interference from other cores’ requests Request issue policy from each core to DRAM request buffers affects BLP of each application

Why is DRAM BLP Destroyed? [Figure: timelines of requests from Core A (Req A0, Req A1) and Core B (Req B0, Req B1) sent by the request issuer to the DRAM request buffers of banks 0 and 1, under two issue policies: round-robin and BLP-preserving. Stall times of Core A and Core B are shown for each policy. Slide annotations: "Serializes requests from each core", "Saved cycles for Core A", "Increased cycles for Core B".] • The issue policy should preserve DRAM BLP

BLP-Preserving Multi-Core Request Issue (BPMRI) • Consecutively sends requests from one core to DRAM request buffers • Limits the maximum number of consecutive requests sent from one core → Prevents starvation of memory non-intensive applications • Prioritizes memory non-intensive applications • Impact of delaying requests from a memory non-intensive application > Impact of delaying requests from a memory-intensive application

Implementation of BPMRI • Last-level (L2) cache miss counter per core • Stores the number of L2 cache misses from the core • Rank register per core • Fewer L2 cache misses → higher rank • More L2 cache misses → lower rank
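The ranking rule amounts to sorting cores by their miss counters. A minimal sketch, assuming ties are broken by lower core id (the slide does not say) and using the hypothetical name `rank_cores`:

```python
def rank_cores(l2_miss_counts):
    """Return core ids ordered from highest rank to lowest rank.

    Fewer L2 cache misses means a higher rank. Python's sort is stable,
    so cores with equal counts keep ascending core-id order (assumption).
    """
    return sorted(range(len(l2_miss_counts)),
                  key=lambda core: l2_miss_counts[core])

# Miss counters for cores 0..3 collected over one interval.
print(rank_cores([500, 20, 300, 20]))  # [1, 3, 2, 0]
```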

BPMRI Policy • Every request issue cycle:
  if consecutive requests from selected core ≥ request send threshold
  then selected core ← highest ranked core
  issue oldest request from selected core
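The policy can be sketched as a small issuer object. This is a minimal sketch, not the design from the talk. Two details are assumptions added to make it runnable: the issuer also re-selects when the selected core has no pending requests, and "highest ranked core" is read as the highest-ranked core that currently has a pending request. The class name `BPMRIIssuer` and its fields are hypothetical.

```python
from collections import deque

class BPMRIIssuer:
    def __init__(self, num_cores, request_send_threshold):
        self.queues = [deque() for _ in range(num_cores)]  # per-core FIFO
        self.ranking = list(range(num_cores))  # core ids, highest rank first
        self.threshold = request_send_threshold
        self.selected = None
        self.consecutive = 0

    def issue(self):
        """Issue one request to the DRAM request buffers, or return None."""
        need_new_core = (
            self.selected is None
            or self.consecutive >= self.threshold
            or not self.queues[self.selected]  # assumption: skip idle core
        )
        if need_new_core:
            candidates = [c for c in self.ranking if self.queues[c]]
            if not candidates:
                return None
            self.selected = candidates[0]
            self.consecutive = 0
        self.consecutive += 1
        return self.selected, self.queues[self.selected].popleft()

issuer = BPMRIIssuer(num_cores=2, request_send_threshold=2)
issuer.ranking = [1, 0]  # core 1 is the memory non-intensive core
issuer.queues[0].extend(["A0", "A1", "A2"])
issuer.queues[1].append("B0")
print([issuer.issue() for _ in range(4)])
# [(1, 'B0'), (0, 'A0'), (0, 'A1'), (0, 'A2')]
```

Requests from one core are sent back to back, so requests that map to different banks arrive at the DRAM request buffers together rather than interleaved with another core's requests.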

Simulation Methodology • x86 cycle-accurate simulator • Baseline processor configuration • Per core • 4-wide issue, out-of-order, 128-entry ROB • Stream prefetcher (prefetch degree: 4, prefetch distance: 64) • 32-entry MSHRs • 512KB 8-way L2 cache • Shared • On-chip, demand-first FR-FCFS memory controller(s) • 1, 2, and 4 DRAM channels for 1, 4, and 8-core systems • 64, 128, and 512-entry DRAM request buffers for 1, 4, and 8-core systems • DDR3 1600 DRAM, 15-15-15ns, 8KB row buffer

Simulation Methodology • Workloads • 14 most memory-intensive SPEC CPU 2000/2006 benchmarks for the single-core system • 30 and 15 pseudo-randomly chosen multiprogrammed SPEC 2000/2006 workloads for 4 and 8-core CMPs • BAPI’s prefetch send threshold: • BPMRI’s request send threshold: 10 • Prefetch accuracy estimation and rank decisions are made every 100K cycles

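Performance of BLP-Aware Issue Policies [Figure: performance of the BLP-aware issue policies on 1-core, 4-core, and 8-core systems; the chart is annotated with 8.5%, 13.6%, and 13.8%.]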

Hardware Storage Cost for 4-core CMP • Total storage: 94,440 bits (11.5KB) • 0.6% of L2 cache data storage • Logic is not on the critical path • Issue decision can be made slower than processor cycle
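The storage figures can be checked with a few lines of arithmetic. The sketch assumes that "L2 cache data storage" means the combined data arrays of the four per-core 512KB L2 caches from the baseline configuration; the variable names are illustrative.

```python
TOTAL_BITS = 94_440
CORES = 4
L2_BYTES_PER_CORE = 512 * 1024  # 512KB L2 per core (baseline configuration)

total_bytes = TOTAL_BITS / 8
total_kb = total_bytes / 1024
fraction_of_l2 = total_bytes / (CORES * L2_BYTES_PER_CORE)

print(f"{total_kb:.1f} KB")      # 11.5 KB
print(f"{fraction_of_l2:.1%}")   # 0.6%
```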

Conclusion • Uncontrolled memory request issue policies limit the level of BLP exploited by the DRAM controller • BLP-Aware Prefetch Issue • Increases the BLP of useful requests from each core exposed to the DRAM controller • BLP-Preserving Multi-core Request Issue • Ensures requests from the same core can be serviced in parallel by the DRAM controller • Both are simple and have low storage cost • Both significantly improve DRAM throughput and performance for single-core and multi-core systems • Applicable to other memory technologies

Questions?
