
Richard Wells ECE 7810 April 21, 2009

Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?


Presentation Transcript


  1. Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance? Richard Wells, ECE 7810, April 21, 2009

  2. Reservations • The paper is old • Presented at ISCA 2001 • Only considers uniprocessor systems • The authors draw some conclusions that, while valid, are focused on their own research goals • Papers relating to our group's project have not been prevalent in recent years, except one already presented at the architecture reading club.

  3. Overview • Investigate DRAM system organization parameters to determine bottleneck • Determine synergy or antagonism between groups of parameters • Empirically determine the optimal DRAM system configuration

  4. Methodologies to increase system performance • Supporting concurrent transactions • Reducing latency • Reducing system overhead

  5. Previous approaches to reduce memory system overhead • DRAM Component • Increase bandwidth • Current “tack” taken by the PC industry • Reduce DRAM latency • ESDRAM • SRAM cache for the full row buffer • Allows precharge to begin immediately after access • FCRAM • Subdivide internal bank by activating only a portion of each wordline

  6. Previous approaches to reduce memory system overhead (cont.) • FCRAM (cont.): lower capacitance on each word access brings timing to 30 ns (2001) • MoSys • Subdivides storage into a large number of very small banks • Reduces latency of the DRAM core to nearly that of SRAM • VCDRAM • Set-associative SRAM buffer that holds a number of sub-pages

  7. The Jump • DRAM-oriented approaches do reduce application execution time • Because even zero-latency DRAM does not reduce memory-system overhead to zero, bus transactions are also considered • Other factors considered • Turnaround time • Queuing delays • Inefficiencies due to asymmetric read/write requests • In a multiprocessor, arbitration and cache coherence would add to the overhead

  8. CPU – DRAM Channel • Access reordering (cited Impulse group here at the U) • Compacts sparse data into densely-packed bus transactions • Reduces the number of bus transactions • Possibly reduces duration of bus transaction

  9. Increasing concurrency • Different banks on the same channel • Independent channels to different banks • Pipelined requests • Split-transaction bus

  10. Decreasing channel latency • Due to channel contention • Back to back read requests • Read arriving during precharge • Narrow channels • Large data burst size

  11. Addressing System Overhead • Bus turnaround time • Dead cycles due to asymmetric read/write shapes • Queuing overhead • Coalescing queued requests • Dynamic re-prioritization of requests

  12. Timing Assumptions • 10 ns address • 70 ns until burst starts on a read • 40 ns until a write can start

  13. Split Transaction Bus Assumptions • Overlapping Supported • Back-to-back reads • Back-to-back read/write pairs

  14. Burst Ordering, Coalescing • Critical-burst first, non-critical burst second, writes last • Coalesce writes followed by reads
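
The ordering rule on this slide can be sketched as a simple priority sort. This is only an illustration: the `Burst` class, its `arrival` field, and the numeric priority values are hypothetical names introduced here, not the paper's simulator code, and FIFO tie-breaking within a class is an assumption.

```python
from dataclasses import dataclass

# Priority classes follow the slide: critical read bursts, then
# non-critical read bursts, then writes. The numbers are illustrative.
PRIORITY = {"critical_read": 0, "noncritical_read": 1, "write": 2}

@dataclass
class Burst:
    kind: str     # one of the PRIORITY keys
    arrival: int  # arrival order, used only to break ties (assumed FIFO)

def schedule(bursts):
    """Return bursts in issue order: critical first, writes last."""
    return sorted(bursts, key=lambda b: (PRIORITY[b.kind], b.arrival))

queue = [
    Burst("write", 0),
    Burst("noncritical_read", 1),
    Burst("critical_read", 2),
    Burst("critical_read", 3),
]
order = [(b.kind, b.arrival) for b in schedule(queue)]
print(order)
```

The write that arrived first is issued last, which matches the idea of holding writes back until read traffic has been served.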

  15. Bit Addressing & Page Policy • Bit assignments chosen to exploit page mode and maximize the degree of memory concurrency • Most significant bits identify the smallest-scale component in the system • Least significant bits identify the largest-scale component in the system • Allows sequential addresses to be striped across channels, maximizing concurrency • Close-page auto-precharge policy
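
A minimal sketch of such a mapping, assuming a layout (from low bits to high bits) of burst offset, channel, bank, then everything else. The function name, the default burst size, and the channel and bank counts are assumptions for illustration; the slide does not give the exact bit fields.

```python
def decode_address(addr, burst_bytes=64, channels=2, banks=4):
    """Split a byte address into (channel, bank, rest).

    Assumed layout, low bits to high bits:
    burst offset | channel | bank | rest (row/column within the bank).
    """
    block = addr // burst_bytes  # drop the offset within one burst
    channel = block % channels   # lowest block bits pick the channel
    block //= channels
    bank = block % banks         # next bits pick the bank
    rest = block // banks        # remaining high bits
    return channel, bank, rest

# Four sequential 64-byte blocks alternate between the two channels.
seq = [decode_address(a)[:2] for a in range(0, 4 * 64, 64)]
print(seq)
```

With the largest-scale component (the channel) in the low bits, consecutive blocks land on different channels and can be serviced concurrently.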

  16. Simulation Environment • SimpleScalar (used in 6810) • 2 GHz clock • L1 caches 64 KB/64 KB, 2-way set associative • L2 cache unified 1 MB, 4-way set associative, 10-cycle access time • Lockup-free cache using miss status holding registers (MSHRs)

  17. Timing Calculations • CPU + DRAM determined by running a second simulation with perfect primary memory (available on next cycle)

  18. Results – Degrees of Freedom • Bus Speed: 800 MHz • Bus width: 1, 2, 4, 8 bytes • Channels: 1, 2, 4 • Banks/Channel: 1, 2, 4, 8 • Queue Size: infinite, 0, 1, 2, 8, 16, 32 • Turnaround: 0, 1 cycles • R/W shapes: symmetric, asymmetric
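
To get a feel for the size of this design space, the parameter values listed on the slide can be enumerated directly. The slide does not say whether every combination was simulated, so the count below is only an upper bound under that assumption; the dictionary keys are names chosen here for illustration.

```python
from itertools import product

# Values copied from the slide; bus speed is fixed at 800 MHz.
params = {
    "bus_width_bytes": [1, 2, 4, 8],
    "channels": [1, 2, 4],
    "banks_per_channel": [1, 2, 4, 8],
    "queue_size": ["infinite", 0, 1, 2, 8, 16, 32],
    "turnaround_cycles": [0, 1],
    "rw_shapes": ["symmetric", "asymmetric"],
}

# One dict per point in the full cross product of parameter values.
configs = [dict(zip(params, values)) for values in product(*params.values())]
print(len(configs))  # 4 * 3 * 4 * 7 * 2 * 2 = 1344
```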

  19. Results – Execution Times • Assumes infinite request queue • System parameters can lead to widely varying CPI

  20. Results – Turnaround and Banks • Turnaround accounts for only 5% of system-related overhead • Banks/Channel accounts for 1.2x – 2x variation, which shows that concurrency is important • Latency accounts for about 50% or more of CPI

  21. Results – Burst Length vs. BW • Accounts for 10-30% of execution time • Wider channels have optimal performance with larger bursts • Narrow channels have optimal performance with smaller bursts
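
The interaction between burst length and channel width can be illustrated with a back-of-the-envelope bus occupancy calculation. This sketch assumes one data transfer per bus cycle on the 800 MHz bus from the results slides; the slides do not state the transfers per cycle, and the function name is hypothetical.

```python
BUS_HZ = 800e6            # bus speed from the results slides
CYCLE_NS = 1e9 / BUS_HZ   # 1.25 ns per bus cycle

def burst_time_ns(burst_bytes, width_bytes, transfers_per_cycle=1):
    """Time a burst occupies the data bus (assumes 1 transfer per cycle)."""
    cycles = burst_bytes / (width_bytes * transfers_per_cycle)
    return cycles * CYCLE_NS

for width in (1, 2, 4, 8):
    print(width, "bytes wide:", burst_time_ns(64, width), "ns for a 64-byte burst")
```

Under these assumptions a 64-byte burst holds a 1-byte channel for 80 ns but an 8-byte channel for only 10 ns, which is consistent with narrow channels preferring smaller bursts.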

  22. Results - Concurrency

  23. Results – Concurrency (Cont.) • Increasing the number of banks typically increases performance, but not always by much • Using many narrow channels is risky because an application might not have much inherent concurrency • Optimal configurations: 1 channel x 4 bytes x 64-byte burst, 2 channels x 2 bytes x 64-byte burst, 1 channel x 4 bytes x 128-byte burst • Performance varies depending on the concurrency of the benchmark

  24. Results – Concurrency (Cont.) • “We find that, in a uniprocessor setting, concurrency is very important, but it is not more important than latency. . . . However, we find that if, in an attempt to increase support for concurrent transactions, one interleaves very small bursts or fragments the DRAM bus into multiple channels, one does so at the expense of latency, and this expense is too great for the levels of concurrency being produced.”

  25. Results – Request Queue Size

  26. Results – Request Queue Size • How queuing benefits system performance • Sub-blocks of different read requests can be interleaved • Writes can be buffered until read-burst traffic has died down • Read and write requests may be coalesced • Applications with significant write activity see more benefit from queuing • Bzip has many more writes than GCC • Anomalies are attributed to requests with temporal locality going to the same bank; with a small queue they are delayed.

  27. Conclusions • Tuning system-level parameters can improve memory system performance by 40% • Bus turnaround – 5-10% • Banks – 1.2x – 2x • Burst length vs. bandwidth – 10%-30% • Concurrency • Using smaller bursts to allow for interleaving is not a good idea because it comes at the expense of latency

  28. Our Project • To evaluate the effect of mat array size on the power and latency of DRAM chips • Simulators • Cacti • DRAMSim • Simics • Predicted Results • Positive • Decreased memory latency • Decreased power profile • Increased DIMM parallelism • Negative • Decreased row buffer hit rates • Decreased memory capacity (for the same chip area) • Increased cost per bit, an important metric

  29. How project relates to the paper • Trying to decrease the memory system bottlenecks • Although we have evaluated bottlenecks differently • Jacob indirectly showed the importance of minimizing DRAM latency • DRAM latency was largest portion of CPI so Amdahl’s law would justify reducing latency • Both our solutions could work together synergistically
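
The Amdahl's law argument can be made concrete with a short calculation. The 0.5 fraction below is taken from the earlier results slide (latency is about half of CPI); the latency reduction factors are hypothetical values chosen only for illustration.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

latency_fraction = 0.5  # "about 50% of CPI" from the results slide
for factor in (2, 4, 10):
    print(factor, round(amdahl_speedup(latency_fraction, factor), 3))
```

With half of CPI attributed to latency, halving DRAM latency gives roughly a 1.33x overall speedup, and no latency reduction alone can exceed 2x.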

  30. Additional thoughts • The current path of DRAM innovation has limitations • DRAM chips and DIMMs need to undergo fundamental changes, of which this could be a step • Helps power efficiency • Can balance with cost effectiveness • Partially addresses the memory gap

  31. Questions • Questions?
