Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance. Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen Institute of Computing Technology, Chinese Academy of Sciences 2009.2.15 Workshop on The Influence of I/O on Microprocessor Architecture (IOM-2009).
A Brief Introduction to ICT, CAS • ICT has developed the Loongson CPU • ICT has built the fastest HPC system in China, Dawning 5000, which delivers 233.5 TFlops and ranks 10th in the Top500.
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Importance of I/O Operations • I/O is ubiquitous • Loading binary files: Disk → Memory • Web browsing, media streaming: Network → Memory… • I/O is important • Many commercial applications are I/O intensive: • Databases, Internet applications, etc.
State-of-the-Art I/O Technologies • I/O Buses: 20 GB/s • PCI-Express 2.0 • HyperTransport 3.0 • QuickPath Interconnect • I/O Devices • RAID: 400 MB/s • 10GE: 1.25 GB/s
Direct Memory Access (DMA) • DMA is an essential feature of I/O operation in all modern computers • DMA allows I/O subsystems to read and/or write system memory independently of the CPU • Many I/O devices use DMA • Including disk drive controllers, graphics cards, network cards, sound cards and GPUs
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
An Example of Disk Read: DMA Receiving Operation [Figure: steps ① to ⑤ among the CPU, the DMA engine, and memory, which holds the descriptor, driver buffer, kernel buffer, and user buffer] • Cache access latency: ~20 cycles • Memory access latency: ~200 cycles
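The two latency figures on this slide hint at why it matters where I/O data is served from. The sketch below is a minimal illustration, assuming a simple weighted-average model in which a miss costs exactly one memory access; the function name and the sample hit rates are hypothetical and do not come from the talk.

```python
CACHE_LATENCY_CYCLES = 20    # approximate figure quoted on the slide
MEMORY_LATENCY_CYCLES = 200  # approximate figure quoted on the slide


def average_access_cycles(hit_rate: float) -> float:
    """Weighted-average access latency for a given cache hit rate.

    Assumption: a miss costs one memory access and nothing else.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return hit_rate * CACHE_LATENCY_CYCLES + miss_rate * MEMORY_LATENCY_CYCLES


if __name__ == "__main__":
    # Illustrative hit rates only; they are not measurements from the talk.
    for rate in (0.0, 0.5, 1.0):
        print(rate, average_access_cycles(rate))
```

Under this model every access that moves from memory to cache saves roughly 180 cycles, which is the gap the later slides try to exploit.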
Potential Improvement of DMA [Figure: steps ① to ⑤ among the CPU, the DMA engine, and memory, which holds the descriptor, driver buffer, kernel buffer, and user buffer] • This is a typical shared-cache scheme
Problems of the Shared-Cache Scheme • Cache pollution • Cache thrashing • Performance degrades when DMA requests are large (>100 KB) for the “Oracle + TPC-H” application
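Cache pollution can be illustrated with a toy model. The sketch below assumes a tiny, fully associative LRU cache shared by CPU and DMA traffic; the class, the function name, the capacity, and the block numbers are all hypothetical and chosen only to make the effect visible.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache that tracks block addresses only."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self.blocks: "OrderedDict[int, None]" = OrderedDict()

    def access(self, block: int) -> bool:
        """Touch a block; return True on a hit, False on a miss (with fill)."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[block] = None
        return False


def resident_cpu_blocks_after_dma(dma_blocks: int,
                                  capacity: int = 8,
                                  working_set: int = 6) -> int:
    """Count CPU working-set blocks still cached after a DMA burst."""
    cache = LRUCache(capacity)
    for block in range(working_set):              # CPU warms its working set
        cache.access(block)
    for block in range(1000, 1000 + dma_blocks):  # DMA data streams through
        cache.access(block)
    return sum(block in cache.blocks for block in range(working_set))


if __name__ == "__main__":
    for burst in (2, 5, 8):
        print(burst, resident_cpu_blocks_after_dma(burst))
```

In this toy setup a burst of 2 DMA blocks leaves the whole working set resident, while a burst of 8 evicts all of it, which is the pollution effect the slide attributes to large DMA requests.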
Rethinking the DMA Mechanism • The Nature of DMA • There is a producer-consumer relationship between the CPU and the DMA engine • Memory serves as a transient staging area for I/O data transferred between the processor and the I/O device • Corollaries • Once I/O data is produced, it will be consumed • I/O data within a DMA buffer is used only once in most cases (i.e. almost no reuse) • The characteristics of I/O data differ from those of CPU data • It may not be appropriate to store I/O data and CPU data together
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
DMA Cache Proposal • A dedicated cache • Stores I/O data • Capable of exchanging data with the processor’s last-level cache (LLC) • Reduces the overhead of I/O data movement
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching [Figures: CPU cache state diagram; DMA cache state diagram] • The DMA cache state diagram is similar to that of a CPU cache in a uniprocessor system • Multiprocessor platforms are under investigation…
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Additional data paths and data access ports for the LLC are not required, because data migration between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism
Data Path: CPU Read [Figure: cmd and data paths of a CPU read over the system bus, connecting memory (Mem Ctrl), the I/O device (DMA Ctrl), and the DMA cache and last-level cache, each with a Cache Ctrl and a Snoop Ctrl] • Case shown: miss in LLC and hit in DMA cache • Decision point: hit in DMA cache?
Data Path: DMA Read [Figure: cmd and data paths of a DMA read over the system bus, connecting memory (Mem Ctrl), the I/O device (DMA Ctrl), and the last-level cache and DMA cache, each with a Cache Ctrl and a Snoop Ctrl] • Case shown: miss in DMA cache and hit in LLC • Decision point: hit in LLC?
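The two data-path slides can be read as a lookup order for each requester. The sketch below is one possible reading, not the authors' implementation: caches are modeled as plain sets of block addresses, the function names are hypothetical, and the fallback to memory when both caches miss is an assumption the slides do not spell out.

```python
from typing import Set


def serve_cpu_read(block: int, llc: Set[int], dma_cache: Set[int]) -> str:
    """Return which component supplies data for a CPU read."""
    if block in llc:
        return "LLC"
    if block in dma_cache:
        # Case shown on the slide: miss in LLC and hit in DMA cache.
        return "DMA cache"
    return "memory"  # assumed fallback when both caches miss


def serve_dma_read(block: int, llc: Set[int], dma_cache: Set[int]) -> str:
    """Return which component supplies data for a DMA read."""
    if block in dma_cache:
        return "DMA cache"
    if block in llc:
        # Case shown on the slide: miss in DMA cache and hit in LLC.
        return "LLC"
    return "memory"  # assumed fallback when both caches miss


if __name__ == "__main__":
    llc, dma_cache = {1, 2}, {3}
    print(serve_cpu_read(3, llc, dma_cache))
    print(serve_dma_read(1, llc, dma_cache))
```

Each requester checks its own cache first and relies on the snoop path to find data in the other one, which matches the point that no extra LLC ports are needed.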
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • An LRU-like replacement policy • Invalid block • Clean block • Dirty block
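One way to read the replacement-policy bullets is as a victim preference order: invalid blocks first, then clean blocks, then dirty blocks, with LRU breaking ties. That ordering is inferred from the order of the bullets and is an assumption, as are all names in the sketch below.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class State(IntEnum):
    INVALID = 0  # assumed first choice: nothing is lost
    CLEAN = 1    # assumed second choice: no write-back needed
    DIRTY = 2    # assumed last resort: requires a write-back


@dataclass
class Block:
    tag: int
    state: State
    last_used: int  # smaller value means less recently used


def choose_victim(blocks: List[Block]) -> Block:
    """Pick a victim by state rank first, then by least recent use."""
    return min(blocks, key=lambda b: (b.state, b.last_used))


if __name__ == "__main__":
    cache_set = [
        Block(tag=1, state=State.DIRTY, last_used=0),
        Block(tag=2, state=State.CLEAN, last_used=5),
        Block(tag=3, state=State.CLEAN, last_used=2),
    ]
    print(choose_victim(cache_set).tag)
```

In the usage example the oldest block is dirty, yet the older of the two clean blocks is chosen, which avoids a write-back on replacement.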
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Adopt a write-allocate policy • Both write-back and write-through policies are available
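The difference between the two write policies can be sketched by counting how many writes reach memory. This is a simplified model under stated assumptions: write-allocate, and a cache large enough that no dirty block is evicted before a single final flush. The function name and the sample write sequence are hypothetical.

```python
from typing import Sequence


def count_memory_writes(block_writes: Sequence[int], write_back: bool) -> int:
    """Count writes that reach memory for a sequence of block writes.

    Assumptions: write-allocate, and no dirty block is evicted before
    one final flush of the whole cache.
    """
    if write_back:
        # Each dirty block is written to memory once, at flush time.
        return len(set(block_writes))
    # Write-through: every write is forwarded to memory immediately.
    return len(block_writes)


if __name__ == "__main__":
    writes = [0, 1, 0, 1, 2]  # illustrative block addresses only
    print(count_memory_writes(writes, write_back=True))
    print(count_memory_writes(writes, write_back=False))
```

Under these assumptions write-back only saves memory traffic when blocks are rewritten before they leave the cache.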
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Adopt straightforward sequential prefetching • Prefetching is triggered by a cache miss • Fetch 4 blocks at a time
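Miss-triggered sequential prefetching can be sketched as follows. The model assumes that "4 blocks" means the missing block plus the next three, that the access stream is purely sequential, and that the cache never evicts; these are simplifications for illustration, and the names are hypothetical.

```python
PREFETCH_BLOCKS = 4  # blocks fetched per miss, as stated on the slide


def sequential_hit_rate(num_accesses: int, prefetch: bool = True) -> float:
    """Hit rate for a sequential block stream starting from an empty cache."""
    if num_accesses <= 0:
        raise ValueError("num_accesses must be positive")
    cached = set()
    hits = 0
    for block in range(num_accesses):
        if block in cached:
            hits += 1
            continue
        # Miss: fetch the missing block, plus its successors if prefetching.
        span = PREFETCH_BLOCKS if prefetch else 1
        cached.update(range(block, block + span))
    return hits / num_accesses


if __name__ == "__main__":
    print(sequential_hit_rate(16))
    print(sequential_hit_rate(16, prefetch=False))
```

For a perfectly sequential stream this model misses once per group of four blocks, so regular I/O access patterns are the favorable case for this kind of prefetcher.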
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Memory Trace Collection • Hyper Memory Trace Tool (HMTT) • Capable of collecting all memory requests • Provides APIs for injecting tags into the memory trace to identify high-level system operations
FPGA Emulation • L2 cache from Godson-2F • DDR2 memory controller from Godson-2F • DDR2 DIMM model from Micron Technology • Xtreme system from Cadence [Figure: emulation setup with memory trace, L2 cache, DMA cache, memory controller, and DDR2 DRAM]
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Experimental Setup • Configurations: Snoop Cache (2MB); Shared Cache (2MB); DMA Cache at 256KB, 128KB, 64KB, and 32KB, each with and without prefetch • Machine: AMD Opteron; 2 GB memory; 1 GE NIC; IDE disk • Benchmarks: FileCopy; TPC-H; SPECWeb2005
Characterization of DMA • The proportion of memory references due to DMA varies by application • The sizes of DMA requests vary by application
Normalized Speedup • The baseline is the snoop cache scheme • The DMA cache schemes exhibit better performance than the others
DMA Write & CPU Read Hit Rate • Both the shared cache and the DMA cache exhibit high hit rates • So where do the cycles go in the shared cache scheme?
% of DMA Writes Causing Dirty Block Replacement • These DMA writes cause cache pollution and thrashing • The 256KB DMA cache significantly reduces these effects
% of Valid Prefetched Blocks • DMA caches exhibit impressively high prefetching accuracy • This is because I/O data has a very regular access pattern.
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Conclusions and Ongoing Work • The Nature of DMA • There is a producer-consumer relationship between the CPU and the DMA engine • Memory serves as a transient staging area for I/O data transferred between the processor and the I/O device • We propose a DMA cache scheme and discuss its design issues. • Experimental results show that a DMA cache can significantly improve I/O performance. • Ongoing Work • The impact of multiprocessors and multiple DMA channels on the DMA cache • In theory, a shared cache with an intelligent replacement policy can achieve the effect of the DMA cache scheme. • Godson-3 has integrated a dedicated cache management policy for I/O data.