Lecture 13: DRAM Innovations
• Today: energy efficiency, row buffer management, scheduling
Latency and Power Wall
• Power wall: 25-40% of datacenter power can be attributed to the DRAM system
• Latency and power can both be improved by employing smaller arrays; incurs a penalty in density and cost
• Latency and power can both be improved by increasing the row buffer hit rate; requires intelligent mapping of data to rows, clever scheduling of requests, etc.
• Power can be reduced by minimizing overfetch – either read fewer chips or read parts of a row; incurs penalties in area or bandwidth
Overfetch
• Overfetch is caused by multiple factors:
  • Each array is large (fewer peripherals → more density)
  • Involving more chips per access → more data transfer → higher pin bandwidth
  • More overfetch → more prefetch; helps apps with locality
  • Involving more chips per access → less data lost when a chip fails → lower overhead for reliability
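A quick back-of-the-envelope calculation illustrates the magnitude of overfetch; the chip count and row size below are illustrative assumptions, not numbers from a specific DRAM part.

```python
# Illustrative overfetch calculation: a rank of 8 x8 chips, each chip
# activating an 8 KB row, serves a single 64 B cache line request.
chips_per_rank = 8
row_buffer_per_chip_bytes = 8 * 1024
cache_line_bytes = 64

bytes_activated = chips_per_rank * row_buffer_per_chip_bytes   # 64 KB
overfetch_ratio = bytes_activated / cache_line_bytes           # 1024x
print(f"{bytes_activated} B activated for a {cache_line_bytes} B line "
      f"-> {overfetch_ratio:.0f}x overfetch")
```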
Selective Bitline Activation
• Additional logic per array so that only relevant bitlines are read out
• Essentially results in finer-grain partitioning of the DRAM arrays
• Two papers in 2010: Udipi et al., ISCA'10; Cooper-Balis and Jacob, IEEE Micro
Rank Subsetting
• Instead of using all chips in a rank to read out 64-bit words every cycle, form smaller parallel ranks
• Increases data transfer time; reduces the size of the row buffer
• But, lower energy per row read and compatible with modern DRAM chips
• Increases the number of banks and hence promotes parallelism (reduces queuing delays)
• Mini-Rank, MICRO'08; MC-DIMM, SC'09
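The latency/energy tradeoff can be sketched with a small calculation; the chip widths, chip counts, and the use of relative activation energy below are assumptions for illustration.

```python
# Rank subsetting sketch: splitting a 64-bit rank of eight x8 chips into
# smaller subsets lengthens the data transfer but activates fewer chips.
CACHE_LINE_BITS = 512        # 64 B cache line
CHIP_WIDTH_BITS = 8          # x8 DRAM chips assumed
FULL_RANK_CHIPS = 8

for chips_in_subset in (8, 4, 2, 1):
    subset_width = chips_in_subset * CHIP_WIDTH_BITS
    beats = CACHE_LINE_BITS // subset_width            # bus transfers per line
    act_energy = chips_in_subset / FULL_RANK_CHIPS     # relative activation energy
    print(f"{chips_in_subset} chips: {beats} beats per line, "
          f"{act_energy:.3f}x row activation energy")
```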
Row Buffer Management
• Open Page policy: maximizes row buffer hits, minimizes energy
• Close Page policy: helps performance when there is limited locality
• Hybrid policies: can close a row buffer after it has served its utility; lots of ways to predict utility: time, accesses, locality counters for a bank, etc.
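A minimal sketch of one possible hybrid policy, closing a row after a fixed number of hits or an idle timeout; the thresholds and the interface are assumptions, not a specific published proposal.

```python
# Hybrid row-buffer policy sketch: keep a row open only while it still
# appears useful; close it after N hits or after an idle timeout.
class HybridRowPolicy:
    def __init__(self, max_hits=4, idle_timeout_cycles=200):
        self.max_hits = max_hits
        self.idle_timeout_cycles = idle_timeout_cycles
        self.open_row = None
        self.hits = 0
        self.last_access = 0

    def access(self, row, cycle):
        if row == self.open_row:
            self.hits += 1                      # row-buffer hit
        else:
            self.open_row, self.hits = row, 1   # miss: precharge + activate
        self.last_access = cycle
        if self.hits >= self.max_hits:          # row has likely served its utility
            self.open_row = None

    def tick(self, cycle):
        # Speculatively close an idle row so a future miss avoids the precharge.
        if self.open_row is not None and \
           cycle - self.last_access > self.idle_timeout_cycles:
            self.open_row = None
```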
Micro-Pages (Sudan et al., ASPLOS'10)
• Organize data across banks to maximize locality in a row buffer
• Key observation: most locality is restricted to a small portion of an OS page
• Such hot micro-pages are identified with hardware counters and co-located on the same row
• Requires hardware indirection to a page's new location
• Works well only if most activity is confined to a few micro-pages
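The counter-plus-indirection mechanism could look roughly like the sketch below; the micro-page size, hot threshold, and remap-table structure are assumptions rather than the paper's exact parameters.

```python
# Micro-pages sketch: count accesses per micro-page each epoch, co-locate the
# hot ones on a reserved row, and redirect later accesses through a remap table.
from collections import Counter

MICRO_PAGE_BYTES = 1024          # assumed sub-OS-page granularity
HOT_THRESHOLD = 32               # assumed accesses per epoch to be "hot"

counters = Counter()             # per-micro-page hardware counters
remap = {}                       # micro-page -> new location on the hot row

def record_access(addr):
    counters[addr // MICRO_PAGE_BYTES] += 1

def end_of_epoch(hot_row_base):
    # Identify hot micro-pages, migrate them to the hot row, reset counters.
    hot = [mp for mp, n in counters.items() if n >= HOT_THRESHOLD]
    for slot, mp in enumerate(hot):
        remap[mp] = hot_row_base + slot * MICRO_PAGE_BYTES
    counters.clear()
    return hot

def translate(addr):
    # Hardware indirection to a migrated micro-page's new location.
    mp, offset = divmod(addr, MICRO_PAGE_BYTES)
    return remap[mp] + offset if mp in remap else addr
```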
Scheduling Policies
• The memory controller must manage several timing constraints and issue a command when all resources are available
• It must also maximize row buffer hit rates, fairness, and throughput
• Reads are typically given priority over writes; the write buffer must be drained when it is close to full; changing the direction of the bus requires a 5-10 ns delay
• Basic policies: FCFS, First-Ready-FCFS (prioritize row buffer hits)
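The two basic policies differ only in how the next request is picked from the queue; the request representation below is an assumption made for illustration.

```python
# FCFS vs. FR-FCFS request selection sketch.
from collections import namedtuple

Request = namedtuple("Request", "arrival bank row")

def fcfs(queue, open_rows):
    # Oldest request first, regardless of row-buffer state.
    return min(queue, key=lambda r: r.arrival)

def fr_fcfs(queue, open_rows):
    # First-Ready FCFS: prefer requests that hit an open row; break ties by age.
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    return min(hits or queue, key=lambda r: r.arrival)

q = [Request(arrival=1, bank=0, row=7), Request(arrival=2, bank=0, row=3)]
print(fr_fcfs(q, {0: 3}))   # the younger request is chosen because it hits row 3
```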
STFM (Mutlu and Moscibroda, MICRO'07)
• When multiple threads run together, threads with row buffer hits are prioritized by FR-FCFS
• Each thread has a slowdown: S = Tshared / Talone, where T is the number of cycles the ROB is stalled waiting for memory
• Unfairness is estimated as Smax / Smin
• If unfairness is higher than a threshold, thread priorities override other priorities (Stall Time Fair Memory scheduling)
• Estimation of Talone requires some book-keeping: does an access delay critical requests from other threads?
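In sketch form, the fairness check works as follows; the threshold value and the way Talone is supplied are assumptions (the paper estimates Talone online with interference book-keeping).

```python
# STFM sketch: compute per-thread slowdowns and trigger fairness enforcement
# when the unfairness ratio crosses a threshold.
UNFAIRNESS_THRESHOLD = 1.10   # assumed value

def slowdown(t_stall_shared, t_stall_alone):
    # S = Tshared / Talone, using ROB memory-stall cycles.
    return t_stall_shared / t_stall_alone

def fairness_mode(stall_times):
    # stall_times: {thread: (Tshared cycles, estimated Talone cycles)}
    s = {t: slowdown(sh, al) for t, (sh, al) in stall_times.items()}
    unfairness = max(s.values()) / min(s.values())
    # When unfair, the most-slowed-down thread's requests jump ahead of the
    # usual FR-FCFS row-hit and age priorities.
    victim = max(s, key=s.get) if unfairness > UNFAIRNESS_THRESHOLD else None
    return unfairness, victim
```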
PAR-BS (Mutlu and Moscibroda, ISCA'08)
• A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests
• Within a batch, priority is first given to row buffer hits, then to threads with a higher "rank", then to older requests
• Rank is computed based on the thread's memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time
• By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling
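A simplified sketch of batching and intra-batch prioritization; the value of R, the request format, and the use of per-thread batch load as the "memory intensity" rank are assumptions made to keep the example short.

```python
# PAR-BS sketch: mark up to R oldest requests per thread per bank as a batch,
# then order requests by (row hit, thread rank, age) within the batch.
from collections import Counter, namedtuple

Request = namedtuple("Request", "arrival thread bank row")
R = 5   # assumed per-thread cap on marked requests per bank

def form_batch(per_bank_queues):
    batch = []
    for bank, queue in per_bank_queues.items():
        taken = Counter()
        for req in sorted(queue, key=lambda r: r.arrival):
            if taken[req.thread] < R:
                batch.append(req)
                taken[req.thread] += 1
    return batch

def thread_ranks(batch):
    # Fewer marked requests (lower intensity) -> higher priority, which
    # shortens average batch completion time ("shortest job first").
    load = Counter(r.thread for r in batch)
    return {t: i for i, (t, _) in enumerate(sorted(load.items(), key=lambda kv: kv[1]))}

def priority_key(req, open_rows, ranks):
    # Smaller tuple = higher priority: row hit, then rank, then age.
    is_hit = open_rows.get(req.bank) == req.row
    return (0 if is_hit else 1, ranks[req.thread], req.arrival)
```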
TCM (Kim et al., MICRO 2010)
• Organize threads into latency-sensitive and bw-sensitive clusters based on memory intensity; the former gets higher priority
• Within the bw-sensitive cluster, priority is based on rank
• Rank is determined based on the "niceness" of a thread and the rank is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
• Threads with low row buffer hit rates and high bank-level parallelism are considered "nice" to others
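A rough sketch of the clustering and niceness ideas; the bandwidth-fraction threshold and the niceness formula below are simplifications, not the paper's exact definitions.

```python
# TCM sketch: low-intensity threads form the latency-sensitive cluster (higher
# priority); within the bw-sensitive cluster, "nicer" threads get higher rank.
def split_clusters(intensity, latency_bw_fraction=0.2):
    # intensity: {thread: memory intensity, e.g. misses per kilo-instruction}
    total = sum(intensity.values()) or 1.0
    latency_cluster, used = [], 0.0
    for t in sorted(intensity, key=intensity.get):
        if used + intensity[t] / total > latency_bw_fraction:
            break
        latency_cluster.append(t)
        used += intensity[t] / total
    bw_cluster = [t for t in intensity if t not in latency_cluster]
    return latency_cluster, bw_cluster

def niceness(row_hit_rate, bank_level_parallelism):
    # Low row-hit rate and high bank-level parallelism -> interferes less with
    # other threads -> "nicer" -> higher rank in the bw-sensitive cluster.
    return bank_level_parallelism - row_hit_rate
```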