
Memory System Performance of High End SMPs, PCs and Clusters of PCs


Presentation Transcript


  1. Memory System Performance of High End SMPs, PCs and Clusters of PCs Ch. Kurmann, T. Stricker, Laboratory for Computer Systems, ETHZ - Swiss Institute of Technology, CH-8092 Zurich. Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/

  2. Memory Systems • Low End designs in PCs: • extremely low cost • standard I/O interface • High End designs in “Killer” Workstations: • well-engineered memory systems • support for additional data streams • better I/O busses • Are Low End SMPs the universal compute nodes for parallel and distributed systems?

  3. Contribution • The answer most likely lies in memory system performance. • How significant are the differences in memory system performance? • Limitations of Low End memory systems • for local computation (e.g. in scientific applications) • for inter-node communication (e.g. in databases)

  4. Extended Copy Transfer Characterization ECT is a method to characterize the performance of memory systems (ISCA 95 and HPCA 97): • Categories • Access pattern, stride (spatial locality) • Working set (temporal locality) • Value • Transfer bandwidth (for a large amount of data) • The same kind of chart results from one microbenchmark (sketched below), covering • Local and Remote transfers • compute and communication accesses
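The slides do not reproduce the benchmark itself. Below is a minimal sketch of an ECT-style strided-load microbenchmark, assuming a POSIX timer; the function names, working-set sizes and repetition scheme are our own illustrative choices, not the authors' tuned code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Load every stride-th 64-bit word of a working set of ws bytes,
 * repeating until roughly `total` bytes of words have been loaded.
 * Returns the achieved load bandwidth in MByte/s. */
static double load_bandwidth(size_t ws, size_t stride, size_t total) {
    size_t n = ws / sizeof(long);
    long *buf = malloc(ws);
    volatile long sink = 0;              /* keeps the loads alive */
    for (size_t i = 0; i < n; i++)
        buf[i] = (long)i;                /* warm the working set */

    size_t touched = 0;
    double t0 = seconds();
    while (touched < total) {
        for (size_t i = 0; i < n; i += stride)
            sink += buf[i];
        touched += ((n + stride - 1) / stride) * sizeof(long);
    }
    double dt = seconds() - t0;
    free(buf);
    (void)sink;
    return (double)touched / dt / 1e6;
}

int main(void) {
    /* Sweep temporal locality (working set) and spatial locality
     * (stride): the two axes of the ECT charts on the next slides. */
    for (size_t ws = 512; ws <= (size_t)16 << 20; ws *= 4)
        for (size_t stride = 1; stride <= 128; stride *= 2)
            printf("ws=%8zu stride=%3zu %7.0f MByte/s\n",
                   ws, stride, load_bandwidth(ws, stride, (size_t)64 << 20));
    return 0;
}
```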

  5. Measurement Problems Some parameter combinations are hard to measure, even with carefully tuned C code: • Reduced performance for large strides and small working sets in L1 caches is a measurement artifact, not architecture related. • Compilers occasionally generate suboptimal instruction schedules for loads / stores.
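The slides do not show the tuned inner loops; one common way to work around weak compiler scheduling, sketched here as an assumption on our part, is to unroll the strided-load loop by hand so each iteration issues several independent loads:

```c
/* Hand-unrolled variant of the strided-load loop in the sketch
 * above: four independent loads per iteration give the compiler
 * and an out-of-order core more scheduling freedom.
 * (Illustrative only; buf, n, stride and sink as defined above.) */
size_t i;
for (i = 0; i + 3 * stride < n; i += 4 * stride) {
    long a = buf[i];
    long b = buf[i + stride];
    long c = buf[i + 2 * stride];
    long d = buf[i + 3 * stride];
    sink += a + b + c + d;
}
for (; i < n; i += stride)               /* remainder */
    sink += buf[i];
```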

  6. Local Load Access: Pentium Pro PC (Pentium Pro FX, one processor, 200 MHz) [figure: load bandwidth (MByte/s, 0–600) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–16 MByte); L1, L2 and DRAM regions marked]

  7. Local Load Access: SGI Origin (SGI Origin 10000, one processor, 195 MHz) [figure: load bandwidth (MByte/s, 0–1600) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–64 MByte); L1 and L2 regions marked]

  8. Local Load Access: DEC 8400 (DEC Alpha 8400, one processor, 300 MHz) [figure: load bandwidth (MByte/s, 0–1200) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–64 MByte); L1, L2 and L3 regions marked]

  9. Local Load Access: Sun Enterprise (Sun Ultra Enterprise, one UltraSPARC II, 248 MHz) [figure: load bandwidth (MByte/s, 0–700) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–16 MByte); L1, L2 and DRAM regions marked]

  10. Local Load Access: SGI Cray T3E (Cray T3E, one processor, 300 MHz) [figure: load bandwidth (MByte/s, 0–1200) over access pattern (stride between 64-bit words, 1–128) and working set (0.5 KByte–16 MByte); L1, L2 and DRAM regions marked]

  11. Comparison - Local Access

  12. Performance in an SMP setting • Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors • Topics of interest: • small working sets in caches: performance remains the same • large working sets in memory: interesting differences • behavior for even/odd strides • “Gather copy stream” (strided load / contiguous store; see the sketch below)
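A “gather copy stream” in this sense reads the source with a stride but writes the destination contiguously; a minimal sketch, with hypothetical names:

```c
/* Gather copy stream: strided load / contiguous store.
 * Hypothetical helper, not the authors' benchmark code; src must
 * hold at least cnt * stride 64-bit words, dst at least cnt. */
static void gather_copy(long *dst, const long *src,
                        size_t cnt, size_t stride) {
    for (size_t i = 0; i < cnt; i++)
        dst[i] = src[i * stride];    /* scattered reads, dense writes */
}
```

Because the stores stay dense and cache-line friendly, the measured bandwidth mainly reflects the cost of the strided read stream.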

  13. Local Copy: Pentium Pro SMP

  14. Local Copy: SGI Origin CC-NUMA

  15. Local Copy: DEC 8400 SMP

  16. Local Copy: Sun Enterprise SMP

  17. Remote in Parallel Computers [diagram: parallel & network computers (SGI Cray T3E, SGI Origin, Clusters of PCs - CoPs) have a processor (P), cache (C) and memory (M) per node, connected by a network; symmetric multiprocessors (DEC 8400, Sun Enterprise, Pentium Pro SMPs) share memory among processors and caches over a bus]

  18. Remote Transfers: CoPs - Pentium Pro with SCI / Myrinet [figure: copy bandwidth (MByte/s, 0–80) over access pattern (stride between 64-bit words, 1–64); curves for local copy, remote copy by Myrinet, and remote copy by SCI]

  19. Remote Transfers: SGI Origin

  20. Remote Transfers: DEC 8400

  21. Remote Transfers: SGI Cray T3E

  22. Comparison - Remote Transfers

  23. Improvement of PC Chipsets • Intel 440 BX AGP Chip Set: 400 MHz / 100 MHz • Intel 440 LX AGP Chip Set: 233 MHz / 66 MHz • Intel 440 FX Natoma Chip Set: 200 MHz / 66 MHz

  24. Conclusion • ECT characterizations for different memory systems: • T3E (MPP node), Origin (CC-NUMA), DEC 8400 (SMP) • CoPs: Intel P6 SMPs and clusters • High End SMP vs. Low End SMP: • less than half the performance on two-processor PCs • Fast communication puts high demands on the memory system: • unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform at all in PC-SMPs and CoPs • Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.
