Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance. Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen. Institute of Computing Technology, Chinese Academy of Sciences. 2009.2.15, Workshop on the Influence of I/O on Microprocessor Architecture (IOM-2009).
A Brief Intro of ICT, CAS • ICT has developed the Loongson CPU • ICT has built the fastest HPC in China, Dawning 5000, which achieves 233.5 TFlops and ranks 10th in the Top500
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Importance of I/O Operations • I/O is ubiquitous • Loading binary files: Disk → Memory • Browsing the web, media streaming: Network → Memory … • I/O is important • Many commercial applications are I/O intensive: • Databases, Internet applications, etc.
State-of-the-Art I/O Technologies • I/O buses: ~20 GB/s • PCI Express 2.0 • HyperTransport 3.0 • QuickPath Interconnect • I/O devices • RAID: 400 MB/s • 10GE: 1.25 GB/s
Direct Memory Access (DMA) • DMA is an essential feature of I/O operation in all modern computers • DMA allows I/O subsystems to read and/or write system memory independently of the CPU • Many I/O devices use DMA, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
An Example of Disk Read: DMA Receiving Operation • [Diagram: ① the CPU prepares a descriptor; ② the DMA engine reads it; ③ the engine writes disk data into the driver buffer; ④ the data is copied to the kernel buffer; ⑤ and finally to the user buffer] • Cache access latency: ~20 cycles • Memory access latency: ~200 cycles
Potential Improvement of DMA • [Diagram: the same disk-read transfer, but with the DMA engine placing I/O data directly into the cache instead of memory] • This is a typical shared-cache scheme
Problems of the Shared-Cache Scheme • Cache pollution • Cache thrashing • Degrades performance when DMA requests are large (>100 KB), e.g., for the "Oracle + TPC-H" workload
Rethinking the DMA Mechanism • The nature of DMA • There is a producer-consumer relationship between the CPU and the DMA engine • Memory serves as a transient staging area for I/O data transferred between the processor and the I/O device • Corollaries • Once I/O data is produced, it will be consumed • I/O data within a DMA buffer is used only once in most cases (i.e., almost no reuse) • The characteristics of I/O data differ from those of CPU data • It may not be appropriate to store I/O data and CPU data together
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
DMA Cache Proposal • A dedicated cache • Stores I/O data • Capable of exchanging data with the processor's last-level cache (LLC) • Reduces the overhead of I/O data movement
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • [State diagrams: CPU cache and DMA cache] • The DMA cache state diagram is similar to that of a CPU cache in a uniprocessor system • We are investigating the multiprocessor platform…
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Additional data paths and data access ports for the LLC are not required, because data migration between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism
Data Path: CPU Read • [Diagram: a CPU read command misses in the LLC but hits in the DMA cache; the data is returned over the system bus via the snoop controllers]
Data Path: DMA Read • [Diagram: a DMA read command misses in the DMA cache but hits in the LLC; the data is returned over the system bus via the snoop controllers]
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • An LRU-like replacement policy, distinguishing: • Invalid blocks • Clean blocks • Dirty blocks
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Adopts a write-allocate policy • Either write-back or write-through is available
DMA Cache Design Issues • Cache Coherence • Data Path • Replacement Policy • Write Policy • Prefetching • Adopts straightforward sequential prefetching • Prefetching is triggered by a cache miss • Fetches 4 blocks at a time
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Memory Trace Collection • Hyper Memory Trace Tool (HMTT) • Capable of collecting all memory requests • Provides APIs for injecting tags into the memory trace to identify high-level system operations
FPGA Emulation • L2 cache from Godson-2F • DDR2 memory controller from Godson-2F • DDR2 DIMM model from Micron Technology • Xtreme system from Cadence • [Diagram: memory trace replayed through the L2 cache, DMA cache, memory controller, and DDR2 DRAM]
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Experimental Setup • Configurations • Snoop cache (2 MB) • Shared cache (2 MB) • DMA cache • 256 KB + prefetch • 256 KB w/o prefetch • 128 KB + prefetch • 128 KB w/o prefetch • 64 KB + prefetch • 64 KB w/o prefetch • 32 KB + prefetch • 32 KB w/o prefetch • Machine • AMD Opteron • 2 GB memory • 1 GE NIC • IDE disk • Benchmarks • FileCopy • TPC-H • SPECweb2005
Characterization of DMA • The portion of memory references issued by DMA varies across applications • The sizes of DMA requests vary across applications
Normalized Speedup • The baseline is the snoop cache scheme • The DMA cache schemes exhibit better performance than the others
DMA Write & CPU Read Hit Rate • Both the shared cache and the DMA cache exhibit high hit rates • Then where do the cycles go in the shared-cache scheme?
% of DMA Writes Causing Dirty Block Replacement • These DMA writes cause the cache pollution and thrashing problems • The 256 KB DMA cache largely eliminates these phenomena
% of Valid Prefetched Blocks • DMA caches can exhibit impressively high prefetching accuracy • This is because I/O data has very regular access patterns
Overview • Background • Nature of DMA Mechanism • DMA Cache Scheme • Research Methodology • Evaluations • Conclusions and Ongoing Work
Conclusions and Ongoing Work • The nature of DMA • There is a producer-consumer relationship between the CPU and the DMA engine • Memory serves as a transient staging area for I/O data transferred between the processor and the I/O device • We propose a DMA cache scheme and discuss its design issues • Experimental results show that a DMA cache can significantly improve I/O performance • Ongoing work • The impact of multiprocessors and multiple DMA channels on the DMA cache • In theory, a shared cache with an intelligent replacement policy can achieve the same effect as the DMA cache scheme • Godson-3 has integrated a dedicated cache management policy for I/O data