Memory System Performance of High End SMPs, PCs and Clusters of PCs
Ch. Kurmann, T. Stricker
Laboratory for Computer Systems, ETHZ - Swiss Institute of Technology, CH-8092 Zurich
Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/
Memory Systems
• Low End designs in PCs:
  • extremely low cost
  • standard I/O interface
• High End designs in “Killer” Workstations:
  • well engineered memory systems
  • support for additional data streams
  • better I/O busses
• Are Low End SMPs the universal compute nodes for parallel and distributed systems?
Contribution
• The answer probably depends on memory system performance.
• How significant are the differences in memory system performance?
• Limitations of Low End memory systems:
  • for local computation (e.g. in scientific applications)
  • for inter-node communication (e.g. in databases)
Extended Copy Transfer Characterization
ECT is a method to characterize the performance of memory systems (ISCA95 and HPCA97):
• Categories (the two axes of each chart):
  • Access pattern, stride (spatial locality)
  • Working set (temporal locality)
• Value (the quantity plotted):
  • Transfer bandwidth (large amount of data)
• The same kind of chart results from one microbenchmark for:
  • local and remote transfers
  • compute and communicate accesses
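The load measurement behind such a chart can be sketched in C as follows. This is a minimal illustration, not the authors' ECT code: the function names `strided_load_sum` and `load_bandwidth_mbytes`, the use of `clock()` for timing, and the convention 1 MByte = 10^6 bytes are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Sum a working set of `words` 64-bit words, visiting them with the given
 * stride (counted in 64-bit words). Each pass of the outer loop starts at a
 * different offset, so every word is loaded exactly once per call. */
uint64_t strided_load_sum(const uint64_t *buf, size_t words, size_t stride)
{
    assert(stride >= 1);
    uint64_t sum = 0;
    for (size_t offset = 0; offset < stride; offset++)
        for (size_t i = offset; i < words; i += stride)
            sum += buf[i];
    return sum;
}

/* Time `reps` sweeps over the working set and return the load bandwidth in
 * MByte/s (assumed here: 1 MByte = 10^6 bytes). The accumulated sum is
 * written to *sink so the loads have a visible result. An optimizing
 * compiler may still reorder or merge the sweeps; a real benchmark needs
 * more care than this sketch takes. */
double load_bandwidth_mbytes(const uint64_t *buf, size_t words,
                             size_t stride, int reps, uint64_t *sink)
{
    uint64_t acc = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        acc += strided_load_sum(buf, words, stride);
    clock_t t1 = clock();
    *sink = acc;

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0; /* run too short to time with clock() */
    return (double)words * sizeof(uint64_t) * reps / secs / 1e6;
}
```

One chart would then come from calling `load_bandwidth_mbytes` once per (stride, working set) pair, for example strides 1 to 128 and working sets from 0.5 K to 16 M, with `reps` chosen large enough that small working sets run long enough to be timed.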
Measurement Problems
Some parameter combinations are hard to measure, even with carefully tuned C code:
• Reduced performance for large strides and small working sets in L1 caches is a measurement artifact and not related to the architecture.
• Compilers occasionally generate suboptimal instruction schedules for loads / stores.
Local Load Access: Pentium Pro PC
[Figure: load bandwidth (MByte/s, axis 0 to 600) of a Pentium Pro FX, one processor at 200 MHz, plotted over access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 16 M); regions marked L1, L2 and DRAM.]
Local Load Access: SGI Origin
[Figure: load bandwidth (MByte/s, axis 0 to 1600) of an SGI Origin 10000, one processor at 195 MHz, plotted over access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 64 M); regions marked L1 and L2.]
Local Load Access: DEC 8400
[Figure: load bandwidth (MByte/s, axis 0 to 1200) of a DEC Alpha 8400, one processor at 300 MHz, plotted over access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 64 M); regions marked L1, L2 and L3.]
Local Load Access: Sun Enterprise
[Figure: load bandwidth (MByte/s, axis 0 to 700) of a Sun Ultra Enterprise, one UltraSPARC II at 248 MHz, plotted over access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 16 M); regions marked L1, L2 and DRAM.]
Local Load Access: SGI Cray T3E
[Figure: load bandwidth (MByte/s, axis 0 to 1200) of a Cray T3E, one processor at 300 MHz, plotted over access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 16 M); regions marked L1, L2 and DRAM.]
Performance in an SMP setting
• Copy bandwidth decreases under simultaneous access by 1, 2, 4 and 8 processors
• Topics of interest:
  • small working sets in caches: performance remains the same
  • large working sets in memory: interesting differences
  • behavior for even/uneven strides
  • “Gather copy stream” (strided load / contiguous store)
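The “gather copy stream” named above combines a strided load with a contiguous store. A minimal C sketch of that access pattern follows; the name `gather_copy` and its signature are hypothetical and not taken from the authors' benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather copy: load every `stride`-th 64-bit word from `src` (strided load)
 * and store the loaded values back to back in `dst` (contiguous store).
 * `dst` must have room for at least ceil(src_words / stride) words.
 * Returns the number of words copied. */
size_t gather_copy(uint64_t *dst, const uint64_t *src,
                   size_t src_words, size_t stride)
{
    assert(stride >= 1);
    size_t n = 0;
    for (size_t i = 0; i < src_words; i += stride)
        dst[n++] = src[i];
    return n;
}
```

In an SMP measurement, each processor would run such a loop on its own buffers at the same time, so that the copy streams compete for the shared memory system. With stride 1 the routine degenerates into a plain contiguous copy.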
Remote in Parallel Computers
[Figure: block diagrams of three machine classes built from processors (P), caches (C) and memory (M), connected by a bus or a network: parallel computers (SGI Cray T3E, SGI Origin), symmetric multiprocessors (DEC 8400, Sun Enterprise), and Clusters of PCs (CoPs) built from Pentium Pro SMPs.]
Remote Transfers: CoPs Pentium Pro with SCI / Myrinet
[Figure: remote copy bandwidth (MByte/s, axis 0 to 80) plotted over access pattern (stride between 64-bit words, 1 to 64), with series for local copy, remote copy by Myrinet and remote copy by SCI.]
Improvement of PC Chipsets
• Intel 440 BX AGP Chip Set: 400 MHz / 100 MHz
• Intel 440 LX AGP Chip Set: 233 MHz / 66 MHz
• Intel 440 FX Natoma Chip Set: 200 MHz / 66 MHz
Conclusion
• ECT characterizations for different memory systems:
  • T3E (MPP node), Origin (NUMA), DEC 8400 (SMP)
  • CoPs: Intel P6 SMPs and clusters
• High End SMP vs. Low End SMP:
  • Less than half the performance on two-processor PCs.
• Fast communication puts high demands on the memory system:
  • Unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform at all in PC SMPs and CoPs.
• Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.