Dell Research in HPC
GridPP7, 1st July 2003
Steve Smith, HPC Business Manager, Dell EMEA
steve_l_smith@dell.com
The Changing Models of High Performance Computing (diagram): the traditional HPC architecture was proprietary and custom, built around RISC and vector systems; the current HPC architecture is standards-based, built from clusters, SMP systems and blades; the future HPC architecture layers applications, OS, middleware and hardware over shared, distributed resources, with grid computing and rich clients. © Copyright 2002-2003 Intel Corporation
HPCC Building Blocks
• Benchmark: parallel benchmarks (NAS, HINT, Linpack…) and parallel applications
• Middleware: MPI/Pro, MPICH, MVICH, PVM
• OS: Windows, Linux
• Protocol: Elan, TCP, VIA, GM
• Interconnect: Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics
• Platform: PowerEdge & Precision (IA32 & IA64)
HPCC Components and Research Topics
• Application Benchmark: custom application benchmarks, standard benchmarks, performance studies
• Vertical Solutions (application prototyping / sizing): Energy/Petroleum, Life Science, Automotive manufacturing and design
• Cluster File System: reliable PVFS, GFS, GPFS…, storage cluster solutions
• Job Scheduler: resource monitoring/management, dynamic resource allocation, checkpoint restart and job redistribution
• Node Monitoring & Management: cluster monitoring, load analysis and balancing, remote access, web-based GUI
• Development Tools: compilers and math libraries; performance tools (MPI analyzer/profiler, debugger, performance analyzer and optimizer)
• Software Monitoring & Management: cluster monitoring, distributed system performance monitoring, workload analysis and balancing, remote access, web-based GUI
• Middleware / API: MPI 2.0 / fault-tolerant MPI; MPICH, MPICH-GM, MPI/LAM, PVM
• Interconnects: interconnect technologies (FE, GbE, 10GE… with RDMA; Myrinet, Quadrics, Scali, InfiniBand) and interconnect protocols
• Cluster Installation Management: remote installation/configuration, PXE support, System Imager, LinuxBIOS
• Platform Hardware: IA-32 / IA-64 processor and platform comparison; standard rack-mounted, blade and brick servers/workstations
The stack spans applications, middleware, operating systems and cluster hardware.
HPCC Technology Roadmap (timeline chart, Q3 FY03 through Q3 FY05)
• TOP500 milestones: Nov 2002, June 2003, Nov 2003, June 2004
• Grid and data-grid computing: Grid Engine (GE), Condor-G, Platform Computing, Globus Toolkit, MPICH-G2, cycle stealing
• File systems and storage: NFS, PVFS2, Global File System, Lustre File System 1.0 and 2.0, ADIC, Qluster, iSCSI
• Cluster monitoring: Ganglia, Clumon (NCSA)
• Interconnects: Myrinet 2000, Myrinet hybrid switch, Quadrics, Scali, 10GbE, InfiniBand prototyping
• Vertical solutions: Financial (MATLAB), Manufacturing (Fluent, LS-DYNA, Nastran), Life Science (BLAST), Energy (Eclipse, Landmark VIP)
• Platform baselining: Yukon (2P 2U), Everglades (2P 1U), Big Bend (2P 1U)
In-the-Box Scalability of Xeon Servers: 71% scalability in the box
In-the-Box Xeon (533 MHz FSB) Scaling (Goto BLAS: http://www.cs.utexas.edu/users/flame/goto/): 32% performance improvement
Goto Comparison on Myrinet: 37% improvement with Goto's library
Goto Comparison on Gigabit Ethernet (64 nodes / 128 processors): 25% improvement with Goto's library
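HPL throughput is bounded largely by DGEMM performance, which is what Goto's library improves. As a rough, modern stand-in for the kind of per-node check behind these comparisons, the sketch below times a large matrix multiply and converts it to GFLOPS; it uses numpy (whose dot product dispatches to whatever BLAS it was built against) rather than the slides' actual HPL setup, and the matrix size is an illustrative choice.

    # Quick DGEMM throughput check on one node (a sketch, not the slides' setup).
    # numpy dispatches to the BLAS it was built against; swapping in an
    # optimized BLAS such as Goto's is what the comparisons above measure.
    import time
    import numpy as np

    n = 2000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    start = time.perf_counter()
    c = a @ b
    elapsed = time.perf_counter() - start

    gflops = 2.0 * n**3 / elapsed / 1e9   # ~2*n^3 flops for an n x n matmul
    print(f"{n}x{n} DGEMM: {elapsed:.2f} s, ~{gflops:.1f} GFLOPS")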
Process-to-Processor Mapping (diagram, two dual-CPU nodes connected by a switch): with the default round-robin placement, consecutive processes are spread across nodes, so neighbouring processes communicate through the switch; with process-mapped placement, consecutive processes fill one node before moving to the next, so neighbouring processes can communicate within a node. A sketch of generating the two layouts follows.
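To make the two placements concrete, here is a minimal sketch that writes MPICH-style machinefiles for each layout. The host names (node01, node02) and the two-processes-per-node count are illustrative assumptions, and exact machinefile syntax varies between MPI implementations, so treat this as a sketch rather than a recipe.

    # Generate machinefiles for the two placements in the diagram above.
    # Host names are hypothetical; the real cluster names are not in the slides.
    nodes = ["node01", "node02"]
    procs_per_node = 2
    total = len(nodes) * procs_per_node

    # Round robin (default): consecutive ranks land on different nodes, so
    # neighbouring ranks talk across the switch.
    round_robin = [nodes[rank % len(nodes)] for rank in range(total)]

    # Process mapped: a node is filled before moving to the next, so
    # neighbouring ranks can use intra-node (shared-memory) communication.
    process_mapped = [node for node in nodes for _ in range(procs_per_node)]

    with open("machines.roundrobin", "w") as f:
        f.write("\n".join(round_robin) + "\n")
    with open("machines.mapped", "w") as f:
        f.write("\n".join(process_mapped) + "\n")

    print("round robin   :", round_robin)     # node01 node02 node01 node02
    print("process mapped:", process_mapped)  # node01 node01 node02 node02

With classic MPICH the file is typically passed via mpirun -machinefile; the ordering of the host entries is what determines which ranks end up sharing a node.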
HPL Results on the Xeon Cluster (Fast Ethernet, Gigabit Ethernet, Myrinet): a balanced system designed for HPL-type applications. Size-major (process-mapped) placement beats round robin by 7% in one configuration and by 35% in another.
Reservoir Simulation – Process Mapping – Gigabit Ethernet: 11% improvement with GigE
How Hyper-Threading Technology Works (diagram of execution resource utilization over time): without Hyper-Threading Technology the first and second thread/task run one after the other; with Hyper-Threading Technology they share the execution resources and finish sooner, saving up to 30% of the time. Greater resource utilization equals greater performance. © Copyright 2002-2003 Intel Corporation. A toy model of this effect follows.
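As a rough illustration of the time-saved figure, this toy model treats each thread as keeping the execution units busy for some fraction of its time slice and assumes the two threads overlap perfectly when Hyper-Threading is on. The utilization values are illustrative assumptions, and the model ignores the cache and issue-bandwidth contention that caps real gains at roughly the 30% quoted above.

    # Toy model (not a measurement): each thread keeps the execution units busy
    # a fraction u of its time slice. Without HT the two threads run back to
    # back; with HT they overlap, so the pair finishes in about max(1, 2u) units.
    def time_saved(u):
        without_ht = 2.0             # two full slices, one per thread
        with_ht = max(1.0, 2.0 * u)  # overlapped execution, limited by contention
        return 1.0 - with_ht / without_ht

    for u in (0.5, 0.7, 0.9):
        print(f"utilization {u:.0%}: about {time_saved(u):.0%} of the time saved")
    # At ~70% utilization the model gives ~30% saved, in line with the slide;
    # lower utilizations overstate the benefit because contention is ignored.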
HPL Performance Comparison on a 16-node dual-Xeon 2.4 GHz cluster (Linpack performance results): Hyper-Threading provides ~6% improvement on a 16-node, 32-processor cluster. Chart: GFLOPS (0 to 90) versus problem size (2,000 to 56,000) for four configurations: 16x4 processes with HT on, 16x2 processes without HT, 16x2 processes with HT on, and 16x1 processes without HT.
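A quick sanity check on these numbers: the sketch below computes the cluster's theoretical peak, an efficiency estimate, and the HPL matrix footprint for a few of the problem sizes on the x-axis. The 2 double-precision flops per cycle figure for these Xeons and the ~80 GFLOPS "measured" value are assumptions on my part; the exact chart values are not preserved in the text.

    # Back-of-the-envelope check for the HPL chart above (assumptions inline).
    nodes, cpus_per_node, ghz = 16, 2, 2.4
    flops_per_cycle = 2                     # assumed SSE2 double-precision rate
    peak_gflops = nodes * cpus_per_node * ghz * flops_per_cycle  # ~153.6 GFLOPS

    measured_gflops = 80.0                  # illustrative value, not read exactly
    print(f"theoretical peak ~{peak_gflops:.1f} GFLOPS, "
          f"efficiency ~{measured_gflops / peak_gflops:.0%}")

    # HPL memory use is dominated by the N x N double-precision matrix.
    for n in (2000, 20000, 56000):
        gib = n * n * 8 / 2**30
        print(f"N = {n:>6}: matrix ~{gib:6.2f} GiB total, "
              f"~{gib / nodes:5.2f} GiB per node")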
NPB-FT (Fast Fourier Transform). Chart: Mop/s (0 to 9,000) versus configuration (nodes x processors: 1x2 to 32x2 without HT, 1x4 to 32x4 with HT). L2 cache misses increased from 68% without HT to 76% with HT.
NPB-EP (Embarrassingly Parallel). Chart: Mop/s (0 to 1,000) versus configuration (nodes x processors: 1x2 to 32x2 without HT, 1x4 to 32x4 with HT). EP requires almost no communication; SSE and x87 utilization increased from 94% without HT to 99% with HT.
Observations
• Compute-intensive applications with fine-tuned floating-point code are less likely to gain from Hyper-Threading, because the CPU's execution resources may already be highly utilized.
• Cache-friendly applications may suffer when Hyper-Threading is enabled, because processes on the logical processors compete for the shared cache, which can degrade performance.
• Communication-bound or I/O-bound parallel applications may benefit from Hyper-Threading if communication and computation can be interleaved between processes.
• Current Linux support for Hyper-Threading is limited, and performance can degrade significantly if Hyper-Threading is not applied properly (see the sketch after this list):
• To the OS, the logical CPUs are almost indistinguishable from physical CPUs.
• The current Linux scheduler treats each logical CPU as a separate physical CPU, which does not maximize multiprocessing performance.
• A patch for better HT support is available (a "fully HT-aware scheduler" in 2.5.31-BK-curr, by Ingo Molnar).
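To see how the logical CPUs mentioned above appear to the OS, here is a minimal sketch that parses /proc/cpuinfo. It relies on the "physical id" and "siblings" fields that Hyper-Threading-capable Linux kernels of this era generally expose; their presence is an assumption about the target kernel rather than something stated in the slides.

    # Minimal sketch: count logical vs physical CPUs on Linux via /proc/cpuinfo.
    # Assumes the kernel reports "physical id" and "siblings" fields.
    def cpu_topology(path="/proc/cpuinfo"):
        logical = 0
        physical_ids = set()
        siblings = 1
        with open(path) as f:
            for line in f:
                if line.startswith("processor"):
                    logical += 1
                elif line.startswith("physical id"):
                    physical_ids.add(line.split(":", 1)[1].strip())
                elif line.startswith("siblings"):
                    siblings = int(line.split(":", 1)[1])
        physical = len(physical_ids) or logical   # fall back if no "physical id"
        return logical, physical, siblings

    logical, physical, siblings = cpu_topology()
    print(f"{logical} logical CPU(s) on {physical} physical package(s), "
          f"{siblings} sibling(s) per package -> Hyper-Threading "
          f"{'appears on' if siblings > 1 else 'off or absent'}")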
Thank You Steve Smith HPC Business Manager Dell EMEA steve_l_smith@dell.com