Power Efficient DRAM Speculation
Nidhi Aggarwal†, Jason F. Cantin‡, Mikko H. Lipasti†, and James E. Smith†
†University of Wisconsin-Madison, 1415 Engineering Drive, Madison, WI 53705
‡International Business Machines, 11400 Burnet Road, Austin, TX 78758
The 14th Annual International Symposium on High Performance Computer Architecture (HPCA 2008), February 20th, 2008
Overview Power Efficient DRAM Speculation: • Utilizes Region Coherence Arrays to identify requests likely to result in cache-to-cache transfers • Does not access DRAM speculatively for these requests • Reduces DRAM power and energy consumption HPCA 2008
Problem DRAM power consumption is a growing problem • Large and increasing portion of the total system power in the mid-range and high-end markets • E.g., DRAM power in Niagara ~ 22% of system power Many systems access DRAM speculatively for performance: + Reduces latency – Wastes DRAM power – Wastes DRAM bandwidth HPCA 2008
Opportunity Not all requests use data from DRAM • Depending on: • Number, size, and associativity of caches in the system • Number of processors • Amount of sharing in the application and OS • Protocols optimized for cache-to-cache transfers, e.g., IBM Power6 There is no need to access DRAM if a request will not use the data Coarse-Grain Coherence Tracking can help detect these requests HPCA 2008
Example (diagram): a memory read is broadcast while a DRAM read is started in parallel. Useful DRAM read: the snoop misses in the other processors' caches, the data is used from memory, and the DRAM latency is overlapped with the snoop. Unused DRAM read: another processor's cache hits and supplies the data, so the speculative DRAM read is discarded and its power is wasted.
Unused DRAM Reads 29% of DRAM requests are unused reads HPCA 2008
Background Coarse-Grain Coherence Tracking: • Memory is divided into coarse-grain regions • Aligned, power-of-two multiple of cache line size • Can range from two lines to a physical page • A structure is added to each processor’s cache hierarchy to monitor the coherence of regions • Region Coherence Arrays (Cantin et al., ISCA’05) • RegionScout Filters (Moshovos, ISCA’05) HPCA 2008
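To make the region granularity concrete, here is a minimal Python sketch of how a physical address would map to a region entry, assuming 64-byte cache lines and the 512-byte regions used later in the evaluation; the constants and function names are illustrative, not taken from the paper.

```python
LINE_SIZE = 64       # bytes per cache line (assumption, not stated on this slide)
REGION_SIZE = 512    # bytes per region: aligned, power-of-two multiple of the line size

def region_tag(phys_addr: int) -> int:
    """Region-aligned tag that would index a Region Coherence Array entry."""
    return phys_addr // REGION_SIZE

def line_within_region(phys_addr: int) -> int:
    """Which of the REGION_SIZE // LINE_SIZE lines in the region this address falls on."""
    return (phys_addr % REGION_SIZE) // LINE_SIZE

# Two addresses in the same 512B region share one RCA entry.
assert region_tag(0x1000) == region_tag(0x11C0)
assert line_within_region(0x11C0) == 7
```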
Background • The RCA is used to avoid broadcast snoops • Requests to data not currently shared • Reduces latency, snoop traffic • The RCA is also used to filter broadcast snoops from other processors • Reduces power, tag lookup bandwidth • Though RCAs were designed to detect non-shared data, they also accurately detect shared data HPCA 2008
Terminology Regions have “unknown” external state if there is not a valid entry in the RCA Regions have “externally-clean” state if other processors may have clean copies of lines Regions have “externally-dirty” state if other processors may have modifiable copies of lines HPCA 2008
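The three region states above can be captured in a small sketch; the enum and helper below are an illustrative reconstruction of how a requester might classify a read, not code from the paper.

```python
from enum import Enum, auto

class RegionState(Enum):
    UNKNOWN = auto()           # no valid entry for the region in the RCA
    EXTERNALLY_CLEAN = auto()  # other processors may have clean copies of lines
    EXTERNALLY_DIRTY = auto()  # other processors may have modifiable copies of lines

def likely_cache_to_cache(state: RegionState) -> bool:
    """Reads to externally-dirty regions are the ones likely to be sourced from
    another processor's cache, so a speculative DRAM access would likely go unused."""
    return state is RegionState.EXTERNALLY_DIRTY
```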
Unused DRAM Reads 29% of DRAM reads are unused and go to externally-dirty regions HPCA 2008
Unused DRAM Reads (chart: per-workload percentages of unused DRAM reads, ranging from 15% to 76%) HPCA 2008
Approach Utilize information from Region Coherence Arrays to identify requests likely to obtain data from other processors’ caches • Set a bit in the memory request to inform the memory controller not to speculatively access DRAM Buffer requests in the memory controller until snoop response arrives Use snoop response to validate prediction • If other processor will provide the data, drop the request • If not, perform DRAM read, incurring a latency penalty HPCA 2008
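A minimal sketch of the memory-controller side of this approach, assuming a single "no-speculation" bit carried with the request and a DRAM interface with a start_read() call; all class and method names here are hypothetical.

```python
class DramStub:
    """Stand-in for the DRAM channel; only the call the sketch needs."""
    def start_read(self, addr: int) -> None:
        print(f"DRAM read issued for 0x{addr:x}")

class MemoryController:
    """Buffers reads flagged as likely cache-to-cache transfers until the snoop
    response arrives; other reads access DRAM speculatively as before."""

    def __init__(self, dram: DramStub):
        self.dram = dram
        self.pending: dict[int, int] = {}  # request id -> buffered address

    def handle_read(self, req_id: int, addr: int, no_speculation: bool) -> None:
        if no_speculation:
            self.pending[req_id] = addr    # hold the read, save DRAM power
        else:
            self.dram.start_read(addr)     # overlap DRAM access with the snoop

    def handle_snoop_response(self, req_id: int, cache_will_supply: bool) -> None:
        addr = self.pending.pop(req_id, None)
        if addr is None:
            return                         # read was not buffered (speculative path)
        if cache_will_supply:
            return                         # prediction correct: drop the read entirely
        self.dram.start_read(addr)         # misprediction: read now, pay a latency penalty

# Example: a read predicted to be unused is dropped once another cache responds.
mc = MemoryController(DramStub())
mc.handle_read(req_id=1, addr=0x11C0, no_speculation=True)
mc.handle_snoop_response(req_id=1, cache_will_supply=True)  # no DRAM read issued
```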
Power-Efficient DRAM Speculation (diagram): Unused read, predicted to be unused: the memory controller buffers the read instead of accessing DRAM; the snoop hits in another processor's cache, the buffered read is dropped, and DRAM power is saved. Useful read, predicted to be unused: the snoop misses, so the DRAM read starts only after the snoop response, adding latency.
Policies
• Baseline: All read requests speculatively access DRAM
• Base-NoSpec: No read requests speculatively access DRAM
• Shen-CRP: Read requests do not speculatively access DRAM if there is a tag match on an invalid frame in the cache
• PEDS-DKD, “Delay Known Dirty”: Read requests speculatively access DRAM unless the region state is externally-dirty
• PEDS-DLD, “Delay Likely Dirty”: Read requests speculatively access DRAM unless the region is externally-dirty, or was externally-dirty in the past (special state added to the RCA)
• PEDS-DNC, “Delay Not Clean”: Only requests to a region that is externally-clean (or has been) speculatively access DRAM (special state added to the RCA)
• PEDS-DAS, “Delay All Snoops”: No broadcast reads speculatively access DRAM
HPCA 2008
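The per-policy speculation decision can be summarized in one illustrative function; the region-state strings and history flags below are a reconstruction for clarity, not code from the paper (Shen-CRP is omitted because it keys off cache tag state rather than the RCA).

```python
def speculate(policy: str, region_state: str,
              was_dirty: bool = False, was_clean: bool = False) -> bool:
    """Return True if a broadcast read should access DRAM speculatively.
    region_state is one of "unknown", "externally-clean", "externally-dirty"."""
    if policy == "Baseline":
        return True                                   # always speculate
    if policy == "Base-NoSpec":
        return False                                  # never speculate
    if policy == "PEDS-DKD":                          # Delay Known Dirty
        return region_state != "externally-dirty"
    if policy == "PEDS-DLD":                          # Delay Likely Dirty
        return region_state != "externally-dirty" and not was_dirty
    if policy == "PEDS-DNC":                          # Delay Not Clean
        return region_state == "externally-clean" or was_clean
    if policy == "PEDS-DAS":                          # Delay All Snoops
        return False
    raise ValueError(f"unknown policy: {policy}")

# Example: a read to an externally-dirty region is delayed under every PEDS policy.
assert not speculate("PEDS-DKD", "externally-dirty")
assert speculate("PEDS-DKD", "unknown")
```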
Overhead One additional bit in the memory request packet • Tag good/bad candidates for a speculative DRAM access One additional region state for some policies • PEDS-DLD and PEDS-DNC Space in memory controller queues to buffer requests until the snoop response arrives • Optional HPCA 2008
Simulator PHARMsim: • Execution-driven simulator built on top of SimOS-PPC • Four 4-way superscalar out-of-order processors (1.5GHz) • Two-level cache hierarchy with split L1, unified L2 caches • Separate address and data networks, shared memory controller • RCA with same # of sets / associativity as L2 cache, 512B regions DRAMSim: • Detailed DRAM timing/power model • Models DRAM power at the rank level • 8GB Micron DDR200, dual channel HPCA 2008
Workloads Scientific Benchmarks • Barnes • Ocean • Raytrace • Radiosity Multiprogrammed Workloads • SPECint95rate • SPECint2000rate Commercial Workloads • TPC-W • TPC-B • TPC-H • SPECweb99 • SPECjbb2000 HPCA 2008
Comparison – Reads Performed (chart; callouts highlight reductions of ~33%, ~15%, and ~28% in DRAM reads performed) HPCA 2008
Comparison – DRAM Power (chart; ~31% reduction in DRAM power) HPCA 2008
Comparison – DRAM Energy (chart; ~10% less reduction than DRAM power due to the added latency) HPCA 2008
Comparison – Execution Time (chart; callouts: 7.4% increase, RCA Alone) HPCA 2008
Comparison – Time Between Requests (chart; callouts: 3.7x, 2.3x, and 2.3x increases) HPCA 2008
Future Work Add more bits to memory requests to enable the memory controller to better prioritize requests Combine PEDS with other DRAM power management techniques, e.g.: • A Comprehensive Approach to DRAM Power Management, Hur and Lin, HPCA’08 • Memory Controller Policies for DRAM Power Management, Fan, Ellis, and Lebeck, ISLPED’01 Combine PEDS with DRAM scheduling techniques • Memory Access Scheduling, Rixner et al., ISCA’00 HPCA 2008
Conclusion Power Efficient DRAM Speculation: Reduces DRAM power consumption • Filters unnecessary DRAM reads • Reduces DRAM utilization → less dynamic power • Increases time between requests → less standby power Small performance impact • Few memory requests delayed unnecessarily • Fewer DRAM reads → less contention for other requests Reduces DRAM energy consumption HPCA 2008
Something to think about… DRAM power may soon dominate system power • And thus cooling costs, operating costs, and battery life This does not bode well for micro-architectural techniques • Speculative DRAM accesses • Prefetching • Run-ahead execution • Hardware/Software Parallelization Research needs to focus more on this • Beyond using existing low-power modes • Beyond filtering speculative accesses • Beyond just inserting another level of cache
The End HPCA 2008