
Power Efficient DRAM Speculation

Power Efficient DRAM Speculation. Nidhi Aggarwal†, Jason F. Cantin‡, Mikko H. Lipasti†, and James E. Smith†. †University of Wisconsin-Madison, 1415 Engineering Drive, Madison, WI 53705. ‡International Business Machines, 11400 Burnet Road, Austin, TX 78758.


Presentation Transcript


  1. Power Efficient DRAM Speculation
     Nidhi Aggarwal†, Jason F. Cantin‡, Mikko H. Lipasti†, and James E. Smith†
     †University of Wisconsin-Madison, 1415 Engineering Drive, Madison, WI 53705
     ‡International Business Machines, 11400 Burnet Road, Austin, TX 78758
     The 14th Annual International Symposium on High Performance Computer Architecture, February 20th, 2008

  2. Overview
     Power Efficient DRAM Speculation:
     • Utilizes Region Coherence Arrays to identify requests likely to result in cache-to-cache transfers
     • Does not access DRAM speculatively for these requests
     • Reduces DRAM power and energy consumption
     HPCA 2008

  3. Problem
     DRAM power consumption is a growing problem
     • Large and increasing portion of the total system power in the mid-range and high-end markets
     • E.g., DRAM power in Niagara ~ 22% of system power
     Many systems access DRAM speculatively for performance
     + Reduces latency
     - Wastes DRAM power
     - Wastes DRAM bandwidth

  4. Opportunity
     Not all requests use data from DRAM
     • Depending on:
       • Number, size, and associativity of caches in the system
       • Number of processors
       • Amount of sharing in the application and OS
       • Protocols optimized for cache-to-cache transfers, e.g., IBM Power6
     There is no need to access DRAM if a request will not use the data
     Coarse-Grain Coherence Tracking can help detect these requests

  5. Example
     [Diagram labels: Processor A, Processor B, Memory Controller, Request, Address, Data, DRAM Read in progress]
     • Example: Useful DRAM Read (Response: Miss). DRAM latency is overlapped with the snoop.
     • Example: Unused DRAM Read (Response: Hit). DRAM power is wasted.

  6. Unused DRAM Reads
     29% of DRAM requests are unused reads

  7. Background
     Coarse-Grain Coherence Tracking:
     • Memory is divided into coarse-grain regions
       • Aligned, power-of-two multiple of cache line size
       • Can range from two lines to a physical page
     • A structure is added to each processor’s cache hierarchy to monitor the coherence of regions
       • Region Coherence Arrays (Cantin et al., ISCA’05)
       • RegionScout Filters (Moshovos, ISCA’05)
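
The region mapping described on this slide can be sketched as a simple address mask. This is only an illustration: the 512-byte region size is taken from the simulator configuration later in the deck, while the 64-byte line size and the function names are assumptions made for the example.

```python
REGION_SIZE = 512  # bytes; the region size listed on the Simulator slide
LINE_SIZE = 64     # bytes; ASSUMED cache line size, not stated in the slides


def region_base(addr: int, region_size: int = REGION_SIZE) -> int:
    """Return the base address of the aligned region containing addr."""
    # Regions are aligned and a power-of-two multiple of the line size,
    # so clearing the low-order bits yields the region base.
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    return addr & ~(region_size - 1)


def lines_per_region(region_size: int = REGION_SIZE,
                     line_size: int = LINE_SIZE) -> int:
    """Number of cache lines tracked by one region entry."""
    return region_size // line_size
```

Under these assumptions every line whose address falls in the same 512-byte window shares one region entry, which is what lets a single entry summarize the coherence state of several lines.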

  8. Background
     • The RCA is used to avoid broadcast snoops
       • Requests to data not currently shared
       • Reduces latency, snoop traffic
     • The RCA is also used to filter broadcast snoops from other processors
       • Reduces power, tag lookup bandwidth
     • Though RCAs were designed to detect non-shared data, they also accurately detect shared data

  9. Terminology
     • Regions have “unknown” external state if there is not a valid entry in the RCA
     • Regions have “externally-clean” state if other processors may have clean copies of lines
     • Regions have “externally-dirty” state if other processors may have modifiable copies of lines

  10. Unused DRAM Reads
      29% of DRAM reads unused, and to externally-dirty regions

  11. Unused DRAM Reads
      [Chart; callout values: 76%, 59%, 36%, 33%, 15%]

  12. Approach
      Utilize information from Region Coherence Arrays to identify requests likely to obtain data from other processors’ caches
      • Set a bit in the memory request to inform the memory controller not to speculatively access DRAM
      Buffer requests in the memory controller until snoop response arrives
      Use snoop response to validate prediction
      • If other processor will provide the data, drop the request
      • If not, perform DRAM read, incurring a latency penalty
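
The controller-side flow on this slide can be sketched as a small decision function. This is a simplified model under stated assumptions, not the authors' implementation: the `Outcome` record, the `handle_read` name, and the boolean inputs are hypothetical, and queueing and timing are ignored.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    dram_read: bool  # was DRAM accessed at all?
    wasted: bool     # DRAM was read, but another cache supplied the data
    delayed: bool    # DRAM read started only after the snoop response


def handle_read(no_speculate: bool, snoop_hit: bool) -> Outcome:
    """Model one read request at the memory controller.

    no_speculate: the extra request bit, set when the requester predicts
                  that another processor's cache will supply the data.
    snoop_hit:    True if the snoop response says another cache has the data.
    """
    if not no_speculate:
        # Speculative path: the DRAM read overlaps with the snoop.
        # If the snoop hits, the DRAM read was unused.
        return Outcome(dram_read=True, wasted=snoop_hit, delayed=False)
    if snoop_hit:
        # Buffered request, prediction correct: drop it, no DRAM access.
        return Outcome(dram_read=False, wasted=False, delayed=False)
    # Buffered request, prediction wrong: read DRAM late.
    return Outcome(dram_read=True, wasted=False, delayed=True)
```

The four input combinations correspond to the cases in the deck: a useful speculative read, an unused speculative read, a filtered read (power saved), and a mispredicted read (latency added).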

  13. Power-Efficient DRAM Speculation
      [Diagram labels: Processor A, Processor B, Memory Controller, Request, Address, Data, DRAM Read in progress]
      • Example: Unused Read, predicted to be unused (Response: Hit). Read buffered; power saved.
      • Example: Useful Read, predicted to be unused (Response: Miss). Read buffered; latency added.

  14. Policies
      • Baseline: All read requests speculatively access DRAM
      • Base-NoSpec: No read requests speculatively access DRAM
      • Shen-CRP: Read requests do not speculatively access DRAM if there is a tag match on an invalid frame in the cache
      • PEDS-DKD (“Delay Known Dirty”): Read requests speculatively access DRAM unless the region state is externally-dirty
      • PEDS-DLD (“Delay Likely Dirty”): Read requests speculatively access DRAM unless the region is externally-dirty, or was externally-dirty in the past (special state added to the RCA)
      • PEDS-DNC (“Delay Not Clean”): Only requests to a region that is externally-clean (or has been) speculatively access DRAM (special state added to the RCA)
      • PEDS-DAS (“Delay All Snoops”): No broadcast reads speculatively access DRAM
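
The policy list above can be read as one predicate per policy over the region state. The sketch below is one possible encoding, not the authors' hardware logic: the function name, the `was_dirty` / `was_clean` flags (standing in for the special RCA state), and the `broadcast` and `invalid_tag_match` inputs are all assumptions made for illustration.

```python
from enum import Enum


class RegionState(Enum):
    UNKNOWN = "unknown"                    # no valid RCA entry
    EXTERNALLY_CLEAN = "externally-clean"  # others may hold clean copies
    EXTERNALLY_DIRTY = "externally-dirty"  # others may hold modifiable copies


def speculate(policy, state, was_dirty=False, was_clean=False,
              broadcast=True, invalid_tag_match=False):
    """Return True if the read should access DRAM speculatively."""
    if policy == "Baseline":
        return True
    if policy == "Base-NoSpec":
        return False
    if policy == "Shen-CRP":
        # Delay when the cache has a tag match on an invalid frame.
        return not invalid_tag_match
    dirty = state is RegionState.EXTERNALLY_DIRTY
    clean = state is RegionState.EXTERNALLY_CLEAN
    if policy == "PEDS-DKD":
        return not dirty
    if policy == "PEDS-DLD":
        return not (dirty or was_dirty)
    if policy == "PEDS-DNC":
        return clean or was_clean
    if policy == "PEDS-DAS":
        # ASSUMPTION: only requests that skip the broadcast may speculate.
        return not broadcast
    raise ValueError(f"unknown policy: {policy}")
```

For example, `speculate("PEDS-DKD", RegionState.UNKNOWN)` allows speculation, while `speculate("PEDS-DNC", RegionState.UNKNOWN)` delays the read. This shows how the policies become progressively more conservative from DKD to DAS.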

  15. Overhead
      One additional bit in the memory request packet
      • Tag good/bad candidates for a speculative DRAM access
      One additional region state for some policies
      • PEDS-DLD and PEDS-DNC
      Space in memory controller queues to buffer requests until the snoop response arrives
      • Optional

  16. Simulator
      PHARMsim:
      • Execution-driven simulator built on top of SimOS-PPC
      • Four 4-way superscalar out-of-order processors (1.5GHz)
      • Two-level cache hierarchy with split L1, unified L2 caches
      • Separate address and data networks, shared memory controller
      • RCA with same # of sets / associativity as L2 cache, 512B regions
      DRAMSim:
      • Detailed DRAM timing/power model
      • Models DRAM power at the rank level
      • 8GB Micron DDR200, dual channel

  17. Workloads
      Scientific Benchmarks
      • Barnes
      • Ocean
      • Raytrace
      • Radiosity
      Multiprogrammed Workloads
      • SPECint95rate
      • SPECint2000rate
      Commercial Workloads
      • TPC-W
      • TPC-B
      • TPC-H
      • SPECweb99
      • SPECjbb2000

  18. Comparison – Reads Performed
      [Chart; callouts: ~33% reduction, ~15% reduction, ~28% reduction]

  19. Comparison – DRAM Power
      [Chart; callout: ~31% reduction]

  20. Comparison – DRAM Energy
      [Chart; callout: ~10% less reduction due to latency]

  21. Comparison – Execution Time
      [Chart; callouts: 7.4% increase; RCA Alone]

  22. Comparison – Time Between Requests
      [Chart; callouts: 3.7x, 2.3x, 2.3x]

  23. Future Work
      Add more bits to memory requests to enable the memory controller to better prioritize requests
      Combining PEDS with other DRAM power management techniques, e.g.:
      • A Comprehensive Approach to DRAM Power Management, Hur and Lin, HPCA’08
      • Memory Controller Policies for DRAM Power Management, Fan, Ellis, and Lebeck, ISLPED’01
      Combining PEDS with DRAM scheduling techniques
      • Memory Access Scheduling, Rixner et al., ISCA’00

  24. Conclusion
      Power Efficient DRAM Speculation:
      Reduces DRAM power consumption
      • Filters unnecessary DRAM reads
      • Reduces DRAM utilization → less dynamic power
      • Increases time between requests → less standby power
      Small performance impact
      • Few memory requests delayed unnecessarily
      • Fewer DRAM reads → less contention for other requests
      Reduces DRAM energy consumption

  25. Something to think about…
      DRAM power may soon dominate system power
      • And thus cooling costs, operating costs, and battery life
      This does not bode well for micro-architectural techniques
      • Speculative DRAM accesses
      • Prefetching
      • Run-ahead execution
      • Hardware/Software Parallelization
      Research needs to focus more on this
      • Beyond using existing low-power modes
      • Beyond filtering speculative accesses
      • Beyond just inserting another level of cache

  26. The End
