
Memory System Performance of High End SMPs, PCs and Clusters of PCs


Presentation Transcript


  1. Memory System Performance of High End SMPs, PCs and Clusters of PCs Ch. Kurmann, T. Stricker, Laboratory for Computer Systems, ETHZ - Swiss Institute of Technology, CH-8092 Zurich. Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/

  2. Memory Systems • Low End designs in PCs: • extremely low cost • standard I/O interface • High End designs in “Killer” Workstations: • well-engineered memory systems • support for additional data streams • better I/O busses • Are Low End SMPs the universal compute nodes for parallel and distributed systems?

  3. Contribution • The answer most likely lies in memory system performance. • How significant are the differences in memory system performance? • Limitations of Low End memory systems • for local computation (e.g. in scientific applications) • for inter-node communication (e.g. in databases)

  4. Extended Copy Transfer Characterization ECT is a method to characterize the performance of memory systems (ISCA 95 and HPCA 97): • Categories • Access pattern, stride (spatial locality) • Working set (temporal locality) • Value • Transfer bandwidth (for a large amount of data) • The same kind of chart results from one microbenchmark (sketched below), covering • Local and Remote transfers • compute and communication accesses
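The slides do not reproduce the benchmark itself. Below is a minimal sketch of an ECT-style strided-load microbenchmark, assuming a POSIX timer; the function names, working-set sizes and repetition scheme are our own illustrative choices, not the authors' tuned code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Load every stride-th 64-bit word of a working set of ws bytes,
 * repeating until roughly `total` bytes of words have been loaded.
 * Returns the achieved load bandwidth in MByte/s. */
static double load_bandwidth(size_t ws, size_t stride, size_t total) {
    size_t n = ws / sizeof(long);
    long *buf = malloc(ws);
    volatile long sink = 0;              /* keeps the loads alive */
    for (size_t i = 0; i < n; i++)
        buf[i] = (long)i;                /* warm the working set */

    size_t touched = 0;
    double t0 = seconds();
    while (touched < total) {
        for (size_t i = 0; i < n; i += stride)
            sink += buf[i];
        touched += ((n + stride - 1) / stride) * sizeof(long);
    }
    double dt = seconds() - t0;
    free(buf);
    (void)sink;
    return (double)touched / dt / 1e6;
}

int main(void) {
    /* Sweep temporal locality (working set) and spatial locality
     * (stride): the two axes of the ECT charts on the next slides. */
    for (size_t ws = 512; ws <= (size_t)16 << 20; ws *= 4)
        for (size_t stride = 1; stride <= 128; stride *= 2)
            printf("ws=%8zu stride=%3zu %7.0f MByte/s\n",
                   ws, stride, load_bandwidth(ws, stride, (size_t)64 << 20));
    return 0;
}
```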

  5. Measurement Problems Some parameter combinations are hard to measure, even with carefully tuned C code: • Reduced performance for large strides and small working sets in L1 caches is a measurement artifact, not architecture related. • Compilers occasionally generate suboptimal instruction schedules for loads / stores.
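The slides do not show the tuned inner loops; one common way to work around weak compiler scheduling, sketched here as an assumption on our part, is to unroll the strided-load loop by hand so each iteration issues several independent loads:

```c
/* Hand-unrolled variant of the strided-load loop in the sketch
 * above: four independent loads per iteration give the compiler
 * and an out-of-order core more scheduling freedom.
 * (Illustrative only; buf, n, stride and sink as defined above.) */
size_t i;
for (i = 0; i + 3 * stride < n; i += 4 * stride) {
    long a = buf[i];
    long b = buf[i + stride];
    long c = buf[i + 2 * stride];
    long d = buf[i + 3 * stride];
    sink += a + b + c + d;
}
for (; i < n; i += stride)               /* remainder */
    sink += buf[i];
```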

  6. Local Load Access: Pentium Pro PC (Pentium Pro FX, one processor, 200 MHz) [figure: load bandwidth (MByte/s, 0–600) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–16 MByte); L1, L2 and DRAM regions marked]

  7. Local Load Access: SGI Origin (SGI Origin 10000, one processor, 195 MHz) [figure: load bandwidth (MByte/s, 0–1600) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–64 MByte); L1 and L2 regions marked]

  8. Local Load Access: DEC 8400 (DEC Alpha 8400, one processor, 300 MHz) [figure: load bandwidth (MByte/s, 0–1200) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–64 MByte); L1, L2 and L3 regions marked]

  9. Local Load Access: Sun Enterprise (Sun Ultra Enterprise, one UltraSPARC II, 248 MHz) [figure: load bandwidth (MByte/s, 0–700) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–16 MByte); L1, L2 and DRAM regions marked]

  10. Local Load Access: SGI Cray T3E (Cray T3E, one processor, 300 MHz) [figure: load bandwidth (MByte/s, 0–1200) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–16 MByte); L1, L2 and DRAM regions marked]

  11. Comparison - Local Access

  12. Performance in an SMP setting • Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors • Topics of interest: • small working sets in caches: performance remains the same • large working sets in memory: interesting differences • behavior for even/odd strides • “Gather copy stream” (strided load / contiguous store; see the sketch below)
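A “gather copy stream” in this sense reads the source with a stride but writes the destination contiguously; a minimal sketch, with hypothetical names:

```c
/* Gather copy stream: strided load / contiguous store.
 * Hypothetical helper, not the authors' benchmark code; src must
 * hold at least cnt * stride 64-bit words, dst at least cnt. */
static void gather_copy(long *dst, const long *src,
                        size_t cnt, size_t stride) {
    for (size_t i = 0; i < cnt; i++)
        dst[i] = src[i * stride];    /* scattered reads, dense writes */
}
```

Because the stores stay dense and cache-line friendly, the measured bandwidth mainly reflects the cost of the strided read stream.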

  13. Local Copy: Pentium Pro SMP

  14. Local Copy: SGI Origin CC-NUMA

  15. Local Copy: DEC 8400 SMP

  16. Local Copy: Sun Enterprise SMP

  17. Remote in Parallel Computers [diagram: parallel & network computers (SGI Cray T3E, SGI Origin, Clusters of PCs - CoPs) have a processor (P), cache (C) and memory (M) per node, connected by a network; symmetric multiprocessors (DEC 8400, Sun Enterprise, Pentium Pro SMPs) share memory among processors and caches over a bus]

  18. Remote Transfers: CoPs - Pentium Pro with SCI / Myrinet [figure: copy bandwidth (MByte/s, 0–80) over access pattern (stride between 64-bit words, 1–64); curves for local copy, remote copy by Myrinet, and remote copy by SCI]

  19. Remote Transfers: SGI Origin

  20. Remote Transfers: DEC 8400

  21. Remote Transfers: SGI Cray T3E

  22. Comparison - Remote Transfers

  23. Improvement of PC Chipsets • Intel 440 BX AGP Chip Set: 400 MHz / 100 MHz • Intel 440 LX AGP Chip Set: 233 MHz / 66 MHz • Intel 440 FX Natoma Chip Set: 200 MHz / 66 MHz

  24. Conclusion • ECT characterizations for different memory systems: • T3E (MPP node), Origin (CC-NUMA), DEC 8400 (SMP) • CoPs: Intel P6 SMPs and clusters • High End SMP vs. Low End SMP: • less than half the performance on two-processor PCs • Fast communication puts high demands on the memory system: • unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform at all in PC-SMPs and CoPs • Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.
