
Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance

Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen
Institute of Computing Technology, Chinese Academy of Sciences
Workshop on the Influence of I/O on Microprocessor Architecture (IOM-2009), 2009.2.15


Presentation Transcript


  1. Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen Institute of Computing Technology, Chinese Academy of Sciences 2009.2.15 Workshop on The Influence of I/O on Microprocessor Architecture (IOM-2009)

  2. A Brief Intro of ICT, CAS • ICT has developed the Loongson CPU • ICT has built the fastest HPC in China, Dawning 5000, which delivers 233.5 TFlops and ranks 10th in the Top500.

  3. Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work

  4. Importance of I/O Operations • I/O operations are ubiquitous • Loading binary files: Disk → Memory • Browsing the web, media streaming: Network → Memory … • I/O operations are important • Many commercial applications are I/O intensive: • Databases, Internet applications, etc.

  5. State-of-the-Art I/O Technologies • I/O buses: ~20 GB/s • PCI-Express 2.0 • HyperTransport 3.0 • QuickPath Interconnect • I/O devices • RAID: 400 MB/s • 10GE: 1.25 GB/s

  6. A Typical Computer Architecture (block diagram, with the NIC as the example I/O device)

  7. Direct Memory Access (DMA) • DMA is an essential feature of I/O operation in all modern computers • DMA allows I/O subsystems to access system memory for reading and/or writing independently of the CPU • Many I/O devices use DMA • Including disk drive controllers, graphics cards, network cards, sound cards and GPUs
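To make the slide's point concrete, here is a minimal C sketch of how a driver might hand a transfer to a DMA engine. The descriptor layout and register addresses (DMA_DESC_ADDR, DMA_START) are hypothetical placeholders, not taken from any real device; the point is only that the CPU describes the work once and the engine then moves the data without further CPU involvement.

```c
#include <stdint.h>

/* Hypothetical DMA descriptor: the driver fills it in, the DMA engine
 * reads it to learn where the buffer is and how many bytes to move. */
struct dma_desc {
    uint64_t buf_addr;   /* physical address of the memory buffer     */
    uint32_t length;     /* number of bytes to transfer               */
    uint32_t flags;      /* e.g. direction, "interrupt on completion" */
};

/* Hypothetical memory-mapped engine registers. */
#define DMA_DESC_ADDR  ((volatile uint64_t *)0xFFFF0000)
#define DMA_START      ((volatile uint32_t *)0xFFFF0008)

/* Program one transfer: the CPU only posts the descriptor and starts
 * the engine; the data itself is moved independently of the CPU. */
void dma_submit(struct dma_desc *d)
{
    *DMA_DESC_ADDR = (uint64_t)(uintptr_t)d;  /* tell the engine where the descriptor is */
    *DMA_START     = 1;                       /* kick off the transfer                   */
    /* The CPU is now free; completion is typically signalled by an interrupt. */
}
```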

  8. Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work

  9. DMA in Computer Architecture (block diagram, with the NIC as the DMA-capable I/O device)

  10. An Example of Disk Read: DMA Receiving Operation (diagram: steps ① to ⑤ move the data via the DMA engine through the descriptor, driver buffer, kernel buffer, and user buffer in memory) • Cache access latency: ~20 cycles • Memory access latency: ~200 cycles
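A rough C sketch of the CPU side of this receive flow follows, assuming the usual reading of the slide's steps: the driver posts a descriptor, the DMA engine fills the driver buffer in memory, and the CPU then copies the data onward to the kernel and user buffers. The buffer and helper names (driver_buf, kernel_buf, copy_to_user_buf) are placeholders. The copies are exactly where the slide's latency numbers bite: if the freshly DMA-written data is not cached, every access costs on the order of 200 cycles instead of about 20.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffers and helper; names are for illustration only.
 * driver_buf is the buffer the DMA engine has just filled from the disk. */
extern char driver_buf[4096];
extern char kernel_buf[4096];
extern int  copy_to_user_buf(const char *src, size_t n);  /* stand-in for a copy_to_user-style call */

/* Completion handler for the disk read (steps ④ and ⑤ on the slide):
 * the CPU's first touch of the I/O data happens in these copies. */
int handle_disk_read_complete(size_t n)
{
    memcpy(kernel_buf, driver_buf, n);      /* driver buffer -> kernel buffer */
    return copy_to_user_buf(kernel_buf, n); /* kernel buffer -> user buffer   */
}
```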

  11. Potential Improvement of DMA (same diagram as slide 10: descriptor, driver buffer, kernel buffer, user buffer, DMA engine) • This is a typical shared-cache scheme

  12. Problems of the Shared-Cache Scheme • Cache pollution • Cache thrashing • Degrades performance when DMA requests are large (>100 KB), as for the "Oracle + TPC-H" application

  13. Rethink the DMA Mechanism • The nature of DMA • There is a producer-consumer relationship between the CPU and the DMA engine • Memory plays the role of a transient place for I/O data transferred between the processor and the I/O device • Corollaries • Once I/O data is produced, it will be consumed • I/O data within a DMA buffer is used only once in most cases (i.e., almost no reuse) • The characteristics of I/O data are different from those of CPU data • It may not be appropriate to store I/O data and CPU data together
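To make the "used only once" corollary concrete, here is a schematic consumer loop in C; the helper names (dma_buffer_ready, deliver_to_application, recycle_dma_buffer) are invented for illustration. Each DMA-filled buffer is read exactly once by the CPU and then handed back to the engine, so the I/O data contributes essentially no temporal locality to the CPU caches.

```c
#include <stddef.h>

/* Hypothetical helpers: get the next DMA-filled buffer, consume it, recycle it. */
extern size_t dma_buffer_ready(char **buf);
extern void   deliver_to_application(const char *buf, size_t n);
extern void   recycle_dma_buffer(char *buf);

/* Typical consumer loop: the produced I/O data is consumed once and the
 * buffer is immediately reused for new, unrelated I/O data. */
void consume_io_data(void)
{
    char  *buf;
    size_t n;
    while ((n = dma_buffer_ready(&buf)) != 0) {
        deliver_to_application(buf, n);   /* the single use of this data       */
        recycle_dma_buffer(buf);          /* contents will soon be overwritten */
    }
}
```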

  14. Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work

  15. DMA Cache Proposal • A dedicated cache • Storing I/O data • Capable of exchanging data with the processor's last-level cache (LLC) • Reduces the overhead of I/O data movement

  16. DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching (figures: CPU cache state diagram and DMA cache state diagram) • The DMA cache state diagram is similar to that of a CPU cache in a uniprocessor system • We are researching the multiprocessor platform…
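The transcript does not reproduce the two state diagrams, so the following C fragment is only an assumption based on the slide's statement that the DMA cache behaves like a uniprocessor CPU cache: a per-block invalid/clean/dirty state, with a dirty block supplying its data when the other cache's snoop hits it.

```c
#include <stdbool.h>

/* Assumed per-block states for the uniprocessor-style DMA cache. */
enum block_state { BLK_INVALID, BLK_CLEAN, BLK_DIRTY };

/* Transition taken when a snooped read from the other side (LLC or DMA
 * cache) hits this block: a dirty block must supply the data, and a
 * valid block drops to the clean state after sharing it. */
enum block_state on_snoop_read(enum block_state s, bool *must_supply_data)
{
    *must_supply_data = (s == BLK_DIRTY);
    return (s == BLK_INVALID) ? BLK_INVALID : BLK_CLEAN;
}
```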

  17. DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Additional data paths and data-access ports for the LLC are not required, because data migration between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism

  18. Data Path: CPU Read (diagram: a CPU read that misses in the LLC but hits in the DMA cache; the request travels over the system bus connecting the snoop controllers of the last-level cache, the DMA cache, the memory controller, and the DMA controller)

  19. Data Path: DMA Read (diagram: the mirror case, a DMA read that misses in the DMA cache but hits in the LLC, served over the same snoop path)
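Slides 18 and 19 describe symmetric lookups, so one hedged C sketch can cover both; the lookup helpers (llc_lookup, dma_cache_lookup, memory_read) are placeholders. What matters is only the probe order: a CPU read that misses in the LLC snoops the DMA cache before going to memory, and a DMA read that misses in the DMA cache snoops the LLC, all over the existing snoop path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup helpers; each returns true on a hit and fills 'out'. */
extern bool llc_lookup(uint64_t addr, void *out);
extern bool dma_cache_lookup(uint64_t addr, void *out);
extern void memory_read(uint64_t addr, void *out);

/* CPU read (slide 18): LLC first, then snoop the DMA cache, then memory. */
void cpu_read(uint64_t addr, void *out)
{
    if (llc_lookup(addr, out))       return;  /* hit in the LLC                */
    if (dma_cache_lookup(addr, out)) return;  /* miss in LLC, hit in DMA cache */
    memory_read(addr, out);                   /* miss in both                  */
}

/* DMA read (slide 19): the mirror image, probing the DMA cache first. */
void dma_read(uint64_t addr, void *out)
{
    if (dma_cache_lookup(addr, out)) return;  /* hit in the DMA cache          */
    if (llc_lookup(addr, out))       return;  /* miss in DMA cache, hit in LLC */
    memory_read(addr, out);                   /* miss in both                  */
}
```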

  20. DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • An LRU-like replacement policy • Invalid block • Clean block • Dirty block
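The slide lists the victim classes but not the exact mechanism, so the sketch below assumes the natural reading: prefer an invalid way, then the least recently used clean way, then the least recently used dirty way (which would additionally need a write-back).

```c
#include <stdbool.h>

#define WAYS 8   /* associativity is an assumption for the sketch */

struct way { bool valid; bool dirty; unsigned lru_age; };  /* larger age = older */

/* LRU-like victim selection: invalid > clean > dirty. */
int pick_victim(const struct way set[WAYS])
{
    int best = -1;

    /* 1. Any invalid way can be used immediately. */
    for (int i = 0; i < WAYS; i++)
        if (!set[i].valid) return i;

    /* 2. Otherwise, the oldest clean way (no write-back needed). */
    for (int i = 0; i < WAYS; i++)
        if (!set[i].dirty && (best < 0 || set[i].lru_age > set[best].lru_age))
            best = i;
    if (best >= 0) return best;

    /* 3. Otherwise, the oldest dirty way (write it back before reuse). */
    for (int i = 0; i < WAYS; i++)
        if (best < 0 || set[i].lru_age > set[best].lru_age)
            best = i;
    return best;
}
```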

  21. DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Adopt a write-allocate policy • Both write-back and write-through policies are available
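A short sketch of how a DMA write could be handled under these choices; the helper names are hypothetical. Write-allocate means a missing block is installed before the update; the write_through flag then selects whether memory is updated immediately or only when the dirty block is later evicted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical DMA cache operations used by the sketch. */
extern bool dma_cache_present(uint64_t addr);
extern void dma_cache_allocate(uint64_t addr);                 /* install the block         */
extern void dma_cache_update(uint64_t addr, const void *data,
                             bool mark_dirty);                 /* write data into the block */
extern void memory_write(uint64_t addr, const void *data);

void dma_write(uint64_t addr, const void *data, bool write_through)
{
    if (!dma_cache_present(addr))
        dma_cache_allocate(addr);            /* write-allocate on a miss */

    if (write_through) {
        dma_cache_update(addr, data, false); /* block stays clean         */
        memory_write(addr, data);            /* memory updated right away */
    } else {
        dma_cache_update(addr, data, true);  /* write-back: mark dirty; memory is
                                                updated when the block is evicted */
    }
}
```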

  22. DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Adopt straightforward sequential prefetching • Prefetching triggered by a cache miss • Fetch 4 blocks at a time
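A minimal sketch of the prefetcher as described: it fires only on a miss and issues four sequential block fetches. The 64-byte block size, and whether the count of four includes the missing block itself, are assumptions.

```c
#include <stdint.h>

#define BLOCK_SIZE      64   /* bytes per cache block (assumed)     */
#define PREFETCH_BLOCKS  4   /* the slide: fetch 4 blocks at a time */

extern void dma_cache_fetch(uint64_t block_addr);   /* hypothetical fill request */

/* Straightforward sequential prefetching, triggered only by a miss:
 * fetch the missing block and the following sequential blocks. */
void on_dma_cache_miss(uint64_t miss_addr)
{
    uint64_t block = miss_addr & ~(uint64_t)(BLOCK_SIZE - 1);
    for (int i = 0; i < PREFETCH_BLOCKS; i++)
        dma_cache_fetch(block + (uint64_t)i * BLOCK_SIZE);
}
```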

  23. Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work

  24. Memory Trace Collection • Hyper Memory Trace Tool (HMTT) • Capable of collecting all memory requests • Provides APIs for injecting tags into the memory trace to identify high-level system operations
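The transcript only states that HMTT exposes APIs for injecting tags into the trace; it does not show them, so the call names below (hmtt_tag_begin, hmtt_tag_end) are invented purely to illustrate how such tagging is typically used to bracket a high-level operation inside a raw memory trace.

```c
/* Hypothetical tag-injection calls standing in for HMTT's real API. */
extern void hmtt_tag_begin(int event_id);
extern void hmtt_tag_end(int event_id);

#define EVT_DMA_DISK_READ 1   /* illustrative event identifier */

/* Bracket one DMA disk read so that every memory reference recorded
 * between the two tags can later be attributed to this operation. */
void traced_disk_read(void)
{
    hmtt_tag_begin(EVT_DMA_DISK_READ);
    /* ... issue the DMA disk read and wait for completion ... */
    hmtt_tag_end(EVT_DMA_DISK_READ);
}
```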

  25. FPGA Emulation • L2 cache from Godson-2F • DDR2 memory controller from Godson-2F • DDR2 DIMM model from Micron Technology • Xtreme system from Cadence (diagram: memory trace fed into the L2 cache and DMA cache, then through the memory controller to DDR2 DRAM)

  26. Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work

  27. Experimental Setup • Configurations • Snoop cache (2 MB) • Shared cache (2 MB) • DMA cache • 256 KB + prefetch • 256 KB w/o prefetch • 128 KB + prefetch • 128 KB w/o prefetch • 64 KB + prefetch • 64 KB w/o prefetch • 32 KB + prefetch • 32 KB w/o prefetch • Machine • AMD Opteron • 2 GB memory • 1 GE NIC • IDE disk • Benchmarks • FileCopy • TPC-H • SPECweb2005

  28. Characterization of DMA • The portion of memory references due to DMA varies depending on the application • The size of DMA requests varies depending on the application

  29. Normalized Speedup • The baseline is the snoop cache scheme • The DMA cache schemes exhibit better performance than the others

  30. DMA Write & CPU Read Hit Rate • Both the shared cache and the DMA cache exhibit high hit rates • Then where do the cycles go in the shared-cache scheme?

  31. Breakdown of Normalized Total Cycles

  32. % of DMA Writes Causing Dirty Block Replacement • These DMA writes cause the cache pollution and thrashing problems • The 256 KB DMA cache largely eliminates these phenomena

  33. % of Valid Prefetched Blocks • DMA caches exhibit impressively high prefetching accuracy • This is because I/O data has a very regular access pattern

  34. Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work

  35. Conclusions and Ongoing Work • The nature of DMA • There is a producer-consumer relationship between the CPU and the DMA engine • Memory plays the role of a transient place for I/O data transferred between the processor and the I/O device • We propose a DMA cache scheme and discuss its design issues • Experimental results show that a DMA cache can significantly improve I/O performance • Ongoing work • The impact of multiprocessors and multiple DMA channels on the DMA cache • In theory, a shared cache with an intelligent replacement policy can achieve the effect of the DMA cache scheme • Godson-3 has integrated a dedicated cache-management policy for I/O data

  36. THANKS! Q&A?
