
Dell Research In HPC


Presentation Transcript


  1. Dell Research In HPC. GridPP7, 1st July 2003. Steve Smith, HPC Business Manager, Dell EMEA, steve_l_smith@dell.com

  2. The Changing Models of High Performance Computing
  - Traditional HPC architecture: proprietary, RISC, vector, custom systems.
  - Current HPC architecture: clusters, SMP and blades on standards-based hardware, with applications, OS and middleware layered above.
  - Future HPC architecture: shared resources, Grid, distributed computing and rich-client applications.
  © Copyright 2002-2003 Intel Corporation

  3. HPCC Building Blocks, from application down to platform:
  - Benchmark: parallel benchmarks (NAS, HINT, Linpack…) and parallel applications
  - Middleware: MPI/Pro, MPICH, MVICH, PVM
  - OS: Windows, Linux
  - Protocol: Elan, TCP, VIA, GM
  - Interconnect: Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics
  - Platform: PowerEdge & Precision (IA-32 & IA-64)
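  A minimal sketch of how the middleware layer exercises the interconnect layer beneath it: the ping-pong below is not part of the deck and assumes an MPICH-style toolchain (mpicc, mpirun); the same source can be rebuilt against TCP/Gigabit Ethernet or GM/Myrinet backends and the reported latency and bandwidth compared.

    /* pingpong.c - hedged sketch of a point-to-point latency/bandwidth probe.
     * Build and run (MPICH-style, names may differ per site):
     *   mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong            */
    #include <mpi.h>
    #include <stdio.h>

    #define NBYTES (1 << 20)   /* 1 MB message */
    #define REPS   100

    int main(int argc, char **argv)
    {
        int rank;
        static char buf[NBYTES];
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {            /* rank 0 sends first, then waits for the echo */
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {     /* rank 1 echoes each message back */
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg round trip %.3f ms, ~%.1f MB/s one-way\n",
                   1e3 * (t1 - t0) / REPS,
                   2.0 * NBYTES * REPS / (t1 - t0) / 1e6);

        MPI_Finalize();
        return 0;
    }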

  4. HPCC Components and Research Topics
  - Application benchmarks: custom application benchmarks, standard benchmarks, performance studies.
  - Vertical solutions (application prototyping / sizing): energy/petroleum, life science, automotive manufacturing and design.
  - Cluster file systems: reliable PVFS; GFS, GPFS and other storage cluster solutions.
  - Job scheduling: resource monitoring/management, dynamic resource allocation, checkpoint restart and job redistribution.
  - Node monitoring & management: cluster monitoring, load analysis and balancing, remote access, web-based GUI.
  - Development tools: compilers and math libraries; performance tools such as MPI analyzers/profilers, debuggers, performance analyzers and optimizers.
  - Software monitoring & management: cluster monitoring, distributed system performance monitoring, workload analysis and balancing, remote access, web-based GUI.
  - Middleware / API: MPI 2.0 / fault-tolerant MPI; MPICH, MPICH-GM, MPI/LAM, PVM; operating systems.
  - Interconnects: interconnect technologies (FE, GbE, 10GbE with RDMA; Myrinet, Quadrics, Scali; InfiniBand) and interconnect protocols.
  - Cluster installation management: remote installation/configuration, PXE support, System Imager, LinuxBIOS.
  - Platform hardware: IA-32 / IA-64 processor and platform comparison; standard rack-mounted, blade and brick servers/workstations.

  5. 128-node Configuration with Myrinet

  6. HPCC Technology Roadmap, Q3 FY03 through Q3 FY05, tracked against TOP500 submissions (Nov 2002, June 2003, Nov 2003, June 2004):
  - Grid, data grid and middleware: Grid Engine (GE), Condor-G, Platform Computing, Globus Toolkit, MPICH-G2, cycle stealing.
  - File systems and storage: NFS, PVFS2, Global File System, Lustre File System 1.0 and 2.0, Qluster, ADIC, iSCSI.
  - Cluster monitoring: Ganglia, Clumon (NCSA).
  - Interconnects: Myrinet 2000, Myrinet hybrid switch, Quadrics, Scali, 10GbE, InfiniBand prototyping.
  - Vertical solutions: financial (MATLAB); manufacturing (Fluent, LS-DYNA, Nastran); life science (BLASTs); energy (Eclipse, LandMark VIP).
  - Platform baselining: Yukon 2P 2U, Everglades 2P 1U, Big Bend 2P 1U.

  7. In-the-box scalability of Xeon servers: 71% scalability within the box.

  8. In-the-box Xeon (533 MHz FSB) scaling (Goto's BLAS library: http://www.cs.utexas.edu/users/flame/goto/): 32% performance improvement.

  9. Goto comparison on Myrinet: 37% improvement with Goto's library.

  10. Goto comparison on Gigabit Ethernet, 64 nodes / 128 processors: 25% improvement with Goto's library.
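  The Goto comparisons above come down to DGEMM throughput, since HPL spends most of its time in matrix-matrix multiply. Below is a hedged single-node sketch of how such a BLAS comparison can be timed; the file name, matrix size and link line are assumptions, and the idea is simply to build once against the reference BLAS and once against Goto's library and compare the reported rates.

    /* dgemm_bench.c - hedged sketch: time one square DGEMM to compare BLAS builds.
     * Link against the library under test, e.g.
     *   cc dgemm_bench.c -o dgemm_bench -lcblas -lblas   (library names vary)   */
    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        const int n = 2000;                       /* problem size: adjust to memory */
        double *A = malloc(sizeof(double) * n * n);
        double *B = malloc(sizeof(double) * n * n);
        double *C = malloc(sizeof(double) * n * n);
        if (!A || !B || !C) return 1;
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        double t0 = now();
        /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        double t1 = now();

        printf("n=%d  time=%.2f s  ~%.2f GFLOP/s\n",
               n, t1 - t0, 2.0 * n * n * n / (t1 - t0) / 1e9);
        free(A); free(B); free(C);
        return 0;
    }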

  11. Process-to-Processor Mapping. The diagram places four MPI processes on two dual-CPU nodes joined by a switch and contrasts two placements: round robin (the default), which alternates consecutive ranks between the nodes, and process-mapped placement, which keeps consecutive ranks together on the same node (processes 1 and 2 on node 1, processes 3 and 4 on node 2). A verification sketch follows below.
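  To check which placement a given launch actually produced, it is enough to have every rank report its host. This is a sketch only, assuming an MPICH-style launcher and a machine file listing the cluster nodes; the file and program names are illustrative.

    /* rankmap.c - hedged sketch: report which node each MPI rank landed on,
     * to distinguish round-robin from process-mapped (block) placement.
     * Example launch (MPICH-style): mpirun -np 4 -machinefile machines ./rankmap */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* One line per rank: compare the host column with the intended mapping. */
        printf("rank %d of %d runs on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

  With round-robin placement, consecutive ranks report alternating hosts; with process-mapped placement, neighbouring ranks share a host, which keeps nearest-neighbour traffic inside the node instead of crossing the switch.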

  12. Message Count for an HPL 16-process Run

  13. Message Lengths for an HPL 16-process Run

  14. HPL results on the Xeon cluster over Fast Ethernet, Gigabit Ethernet and Myrinet: a balanced system designed for HPL-type applications. Size-major mapping beats round robin by 7% in one configuration and by 35% in another.

  15. Reservoir Simulation – Message Statistics

  16. Reservoir Simulation – Process Mapping on Gigabit Ethernet: 11% improvement with GigE.

  17. How Hyper-Threading Technology Works. The diagram plots execution-resource utilization over time for a first and a second thread/task: without Hyper-Threading Technology the two run one after the other, while with Hyper-Threading they share execution resources and complete sooner, saving up to 30% of the time. Greater resource utilization equals greater performance. © Copyright 2002-2003 Intel Corporation

  18. HPL performance comparison on a 16-node dual-Xeon 2.4 GHz cluster. The Linpack results plot GFLOPS against problem size (2000 to 56000) for four configurations: 16x4 processes with HT on, 16x2 processes without HT, 16x2 processes with HT on, and 16x1 processes without HT. Hyper-Threading provides roughly 6% improvement on the 16-node, 32-processor cluster.

  19. NPB-FT (Fast Fourier Transform). The chart shows Mop/s with and without HT for configurations from 1x2 (1x4 with HT) to 32x2 (32x4 with HT) nodes x processors. L2 cache misses increased from 68% without HT to 76% with HT.

  20. NPB-EP (Embarrassingly Parallel). The chart shows Mop/s with and without HT for configurations from 1x2 (1x4 with HT) to 32x2 (32x4 with HT) nodes x processors. EP requires almost no communication; SSE and x87 utilization increased from 94% without HT to 99% with HT.

  21. Observations
  • Compute-intensive applications with fine-tuned floating-point code gain little from Hyper-Threading, because the CPU's execution resources are already highly utilized.
  • Cache-friendly applications can suffer when Hyper-Threading is enabled, because processes on the two logical processors compete for the shared cache, which can degrade performance.
  • Communication-bound or I/O-bound parallel applications may benefit from Hyper-Threading if communication and computation can be interleaved between processes.
  • Current Linux support for Hyper-Threading is limited, and performance can degrade significantly if Hyper-Threading is not applied properly. To the OS, the logical CPUs are almost indistinguishable from physical CPUs (see the sketch below), and the current Linux scheduler treats each logical CPU as a separate physical CPU, which does not maximize multiprocessing performance. A patch for better HT support is available (source: "fully HT-aware scheduler" support, 2.5.31-BK-curr, by Ingo Molnar).
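  To illustrate the last point, here is a minimal sketch, not from the deck, that assumes a Linux /proc filesystem: it counts the processors the kernel exposes. On a dual-Xeon node with Hyper-Threading enabled it reports four, and nothing in the plain "processor" entries separates logical from physical CPUs; the "physical id" field, where the kernel provides it, is what HT-aware tools use to tell siblings apart.

    /* cpucount.c - hedged sketch: count the CPUs Linux exposes in /proc/cpuinfo.
     * On a dual-Xeon with Hyper-Threading enabled this prints 4 logical CPUs. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) {
            perror("/proc/cpuinfo");
            return 1;
        }

        char line[256];
        int cpus = 0, physical_ids = 0;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "processor", 9) == 0)
                cpus++;                      /* one entry per logical CPU */
            if (strncmp(line, "physical id", 11) == 0)
                physical_ids++;              /* present only on HT-aware kernels */
        }
        fclose(f);

        printf("logical CPUs seen by the OS: %d\n", cpus);
        printf("lines carrying a 'physical id' tag: %d\n", physical_ids);
        return 0;
    }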

  22. Thank You. Steve Smith, HPC Business Manager, Dell EMEA, steve_l_smith@dell.com
