Current and Future for NT Clustering with HPVM
Philip M. Papadopoulos
Department of Computer Science and Engineering
University of California, San Diego
JPC4 - Oak Ridge, TN
Outline
• NT Clustering - our clusters, software
• What's new in the latest version, HPVM 1.9
• Looking at performance
  • Gratuitous bandwidth and latency
  • Iowa State results (Luecke, Raffin, Coyle)
• Futures for HPVM
  • Natural upgrade paths (Windows 2000, Lanai 7, …)
  • Adding dynamics
Why NT?
• Technical reasons
  • Good support for SMP systems
  • The system is designed to be threaded at all levels
  • User-scheduled, ultra-lightweight threads (NT fibers) are very powerful (see the fiber sketch below)
  • Integrated, extensible performance monitoring system
  • Well-supported device driver development environment
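The fiber mention above refers to the Win32 fiber API (ConvertThreadToFiber, CreateFiber, SwitchToFiber). Below is a minimal, self-contained sketch of two cooperatively scheduled fibers yielding to each other; it only illustrates the mechanism and is not HPVM code.

```c
/* Minimal Win32 fiber demo: two cooperatively scheduled fibers
 * yielding to each other. Illustrative only -- not HPVM code. */
#include <windows.h>
#include <stdio.h>

static LPVOID g_mainFiber;   /* fiber context of the original thread */

static VOID CALLBACK WorkerFiber(LPVOID arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++) {
        printf("worker step %d\n", i);
        SwitchToFiber(g_mainFiber);   /* yield back; no kernel scheduler involved */
    }
    SwitchToFiber(g_mainFiber);       /* never return from a fiber routine */
}

int main(void)
{
    g_mainFiber = ConvertThreadToFiber(NULL);          /* thread must become a fiber first */
    LPVOID worker = CreateFiber(0, WorkerFiber, NULL); /* 0 = default stack size */

    for (int i = 0; i < 3; i++)
        SwitchToFiber(worker);                         /* run worker until it yields */

    DeleteFiber(worker);
    return 0;
}
```

The point of the example is that the switch between fibers is entirely user-level, which is what makes this style of threading attractive for a low-latency messaging layer.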
Remote Access (NT is Challenged)
• Myth: you can't do things remotely in NT
• Fact: you can, it just doesn't have a unified remote abstraction like rsh/ssh (think client/server)
  • Remote manipulation of the registry (regini.exe)
  • Remote administrative access to the file system
  • Ability to create remote threads (CreateRemoteThread)
  • Ability to start/stop services (sc.exe; see the sketch below)
• Too many interfaces! One must essentially learn new tools to perform (scripted) remote administration.
• NT Terminal Server and Windows 2000 improve access, but still fall short of X Windows.
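As one concrete example of the "many interfaces" point, the Service Control Manager API (which sc.exe wraps) accepts a remote machine name directly. The node name "\\node01" and service name "SomeClusterService" below are placeholders, not anything from the talk.

```c
/* Hedged sketch: starting a service on a remote NT node through the
 * Win32 Service Control Manager API (what sc.exe wraps). */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Connect to the SCM on a remote machine; "\\\\node01" is a placeholder. */
    SC_HANDLE scm = OpenSCManagerA("\\\\node01", NULL, SC_MANAGER_CONNECT);
    if (!scm) {
        fprintf(stderr, "OpenSCManager failed: %lu\n", GetLastError());
        return 1;
    }
    SC_HANDLE svc = OpenServiceA(scm, "SomeClusterService", SERVICE_START);
    if (svc) {
        if (!StartServiceA(svc, 0, NULL))
            fprintf(stderr, "StartService failed: %lu\n", GetLastError());
        CloseServiceHandle(svc);
    }
    CloseServiceHandle(scm);
    return 0;
}
```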
Hardware/Software Environment
• Our clusters
  • 64 dual-processor Pentium IIs
    • 32 HP Kayak: 300 MHz, 384 MB, 100 GB disk
    • 32 HP NetServer LPr: 450 MHz, 1024 MB, 36 GB disk
  • Myrinet - Lanai 4 32-bit PCI cards in all 64 machines
  • Giganet - hardware VIA, on the NetServers only
• NT Terminal Server 4.0 on all nodes
• LSF for managing/starting parallel jobs
• HPVM is the "clusterware"
High Performance Virtual Machines
PI: Andrew A. Chien; co-PIs: Daniel Reed, David Padua
Students: Scott Pakin, Mario Lauria*, Louis Giannini, Paff Liu*, Geta Sampemane, Kay Connelly, and Andy Lavery
Research Staff: Philip Papadopoulos, Greg Bruno, Caroline Papadopoulos*, Mason Katz*, Greg Koenig, and Qian Liu
*Funded from other sources
URL: http://www-csag.ucsd.edu/projects/hpvm.html
DARPA #E313, AFOSR F30602-96-1-0286
What is HPVM?
• High-performance (MPP-class), thread-safe communication
• A layered set of APIs (not just MPI) that allow applications to obtain a significant fraction of hardware performance
• A small number of services that allow distributed processes to find and communicate with each other
• Device driver support for Myrinet; vendor driver for VIA
• The focus/contribution has been effective layering
  • Especially short-message performance
Supported APIs
• FM (Fast Messages) - core messaging layer; reliable, in-order delivery
• MPI - MPICH 1.0 based
• SHMEM - put/get interface, similar to Cray's (sketched below)
• BSP - Bulk Synchronous Parallel (Oxford)
• Global Arrays - global abstraction for matrix operations (PNNL)
• TCGMSG - Theoretical Chemistry Group Messaging
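For readers unfamiliar with the put/get model, here is a sketch in the Cray/OpenSHMEM style (start_pes, shmem_long_put, shmem_barrier_all, the <mpp/shmem.h> header). HPVM's SHMEM layer is described only as "similar to Cray", so treat these exact names and the header path as assumptions.

```c
/* Sketch of the put/get programming model offered by a SHMEM layer.
 * Names follow the Cray/OpenSHMEM convention; HPVM's SHMEM API may
 * spell these differently. */
#include <stdio.h>
#include <mpp/shmem.h>   /* OpenSHMEM implementations use <shmem.h> */

long remote_value;       /* symmetric: exists at the same address on every PE */

int main(void)
{
    start_pes(0);
    int me = _my_pe();
    int npes = _num_pes();

    long mine = 100 + me;
    /* One-sided write: deposit my value into PE (me+1)'s copy of remote_value. */
    shmem_long_put(&remote_value, &mine, 1, (me + 1) % npes);

    shmem_barrier_all();  /* ensure all puts have completed and are visible */
    printf("PE %d received %ld\n", me, remote_value);
    return 0;
}
```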
Libraries/Layering
[Diagram: SHMEM, Global Arrays, MPI, and BSP layered over Fast Messages, which runs over Myrinet, VIA, or shared memory (SMP)]
• All libraries are layered on top of FM
• Semantics are active-message like (see the sketch below)
• FM is designed for building other libraries; the FM level is not desirable for applications
• Designed for efficient gather/scatter and header processing
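To make "active-message like" concrete, here is a hypothetical handler-table sketch of that style. None of these names (am_register, am_send, am_extract) are the real FM interface; they only illustrate the model of a message carrying a handler reference that the receiver runs on extraction.

```c
/* Hypothetical active-message-style interface, to illustrate the FM model.
 * None of these names are the real FM API. */
#include <string.h>
#include <stdio.h>

typedef void (*am_handler_t)(void *payload, int len);

#define MAX_HANDLERS 16
static am_handler_t handler_table[MAX_HANDLERS];

/* A library (e.g., an MPI implementation) registers its handlers once. */
static void am_register(int id, am_handler_t h) { handler_table[id] = h; }

/* Sender side: the message names the handler to run at the receiver. */
struct am_packet { int handler_id; int len; char payload[256]; };

static void am_send(struct am_packet *pkt, int handler_id,
                    const void *data, int len)
{
    pkt->handler_id = handler_id;
    pkt->len = len;
    memcpy(pkt->payload, data, (size_t)len);
    /* ...hand pkt to the NIC / transport here... */
}

/* Receiver side: "extract" walks arrived packets and runs their handlers. */
static void am_extract(struct am_packet *arrived, int count)
{
    for (int i = 0; i < count; i++)
        handler_table[arrived[i].handler_id](arrived[i].payload, arrived[i].len);
}

static void print_handler(void *payload, int len)
{
    printf("handler got %d bytes: %s\n", len, (char *)payload);
}

int main(void)
{
    am_register(0, print_handler);
    struct am_packet pkt;
    am_send(&pkt, 0, "hello", 6);
    am_extract(&pkt, 1);   /* in real life, driven by the message arrival path */
    return 0;
}
```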
What's New in HPVM 1.9
• Better performance (relative to v1.1 at NCSA)
  • 25% more bandwidth (80 MB/s → 100+ MB/s)
  • 14% latency reduction (10 µs → 8.6 µs)
• Three transports
  • Shared memory transport + [Myrinet, VIA]
  • Standalone desktop version uses shared memory
• Integration with the NT Performance Monitor
• Improved configuration/installation
• BSP API added
Performance Basics (v1.9)
• Myrinet
  • FM: 100+ MB/s, 8.6 µsec latency
  • MPI: 91 MB/s @ 64 KB, 9.6 µsec latency
  • Approximately 10% overhead
• Giganet (VIA)
  • FM: 81 MB/s, 14.7 µsec latency
  • MPI: 77 MB/s, 18.6 µsec latency
  • 5% bandwidth overhead, but 26% latency overhead!
• Shared memory transport
  • FM: 195 MB/s, 3.13 µsec latency
  • MPI: 85 MB/s, 5.75 µsec latency
  • Our software structure requires 2 memory copies/packet :-(
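Latency and bandwidth figures like these are conventionally measured with a ping-pong microbenchmark; the sketch below shows the generic pattern at 64 KB and is not the HPVM team's actual test harness.

```c
/* Generic MPI ping-pong sketch of the kind used to produce latency and
 * bandwidth numbers like those above; not the HPVM team's benchmark. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    const int len = 65536;                 /* 64 KB message */
    char *buf = malloc(len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time per message */
    if (rank == 0)
        printf("one-way time %.2f usec, bandwidth %.1f MB/s\n",
               t * 1e6, (len / t) / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```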
Gratuitous Bandwidth Graphs
• FM bandwidth is usually a good indicator of deliverable bandwidth
• High bandwidth is attained even for small messages
• N1/2 ~ 512 bytes (the message size at which half the peak bandwidth is reached)
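For reference, the half-power point can be derived from the usual linear cost model; this is a standard textbook relation, not a formula from the talk.

```latex
% Linear cost model: startup time t_0 plus a per-byte cost 1/r_inf.
% Delivered bandwidth and the half-power point n_{1/2}:
\[
  B(n) = \frac{n}{t_0 + n/r_\infty}, \qquad
  B(n_{1/2}) = \tfrac{1}{2}\, r_\infty
  \;\Longrightarrow\;
  n_{1/2} = t_0\, r_\infty .
\]
```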
"Nothing is more humbling or more revealing than having others use your software."
Iowa State Performance Results
• "Comparing the Communication Performance and Scalability of a Linux and a NT Cluster of PCs, a Cray Origin 2000, an IBM SP and a Cray T3E-600"
  • Glenn R. Luecke, Bruno Raffin and James J. Coyle, Iowa State
• Machines
  • 64-node NT SuperCluster, NCSA, dual PIII 550 MHz, HPVM 1.1
  • 64-node AltaCluster, ABQ HPCC, dual PII 450 MHz, GM
  • Origin 2000, 64 nodes, Eagan MN, dual 300 MHz R12000
  • T3E-600, 512 processors, Eagan MN, Alpha EV5 300 MHz
  • IBM SP, 250 processors, Maui (96 were 160 MHz)
• They ran MPI benchmarks at 8 bytes, 10000 bytes, and 1 MB (the right-shift pattern is sketched below)
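The right-shift test sends from each rank to its right neighbor and receives from its left; the sketch below captures that pattern, with the message size and lack of repetition averaging as assumptions rather than the published code.

```c
/* Right-shift pattern: rank i sends to (i+1) mod P and receives from
 * (i-1+P) mod P. A sketch of the benchmark idea, not the published code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int len = 10000;                  /* 8, 10000, or 1 MB in the study */
    char *sendbuf = malloc(len), *recvbuf = malloc(len);
    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    /* Combined send/recv avoids deadlock regardless of message size. */
    MPI_Sendrecv(sendbuf, len, MPI_CHAR, right, 0,
                 recvbuf, len, MPI_CHAR, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t = MPI_Wtime() - t0;
    if (rank == 0)
        printf("right shift of %d bytes on %d procs: %.3f ms\n",
               len, nprocs, t * 1e3);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```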
Right Shift - 8 Byte Messages
[Figure: time (ms) vs. number of processors]
• FM optimization for short messages
Right Shift - 10000 Byte Messages
[Figure: time (ms) vs. number of processors]
• FM starts at 25 MB/s and drops to 12 MB/s above 64 nodes
Right Shift - 1 MB Messages
[Figure: time (ms) vs. number of processors]
• The change at 64 processors prompted the shared memory transport in HPVM 1.9
  • Curve flattened (better scalability)
• Recently (last week), found a fairness issue in the FM Lanai control program
MPI Barrier - 8 Bytes
[Figure: time (ms) vs. number of processors]
• FM significantly faster at 128 processors (4x - 9x)
MPI Barrier - 10000 Bytes
[Figure: time (ms) vs. number of processors]
• FM 2.5x slower than the T3E, 2x slower than the O2K
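A generic way to time a barrier is a simple repetition loop, as sketched below; the Iowa State benchmark at 10000 bytes presumably measures a richer operation, so this is only the basic pattern.

```c
/* Generic MPI_Barrier timing loop; a sketch only, the published benchmark
 * may define the measured operation differently. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    MPI_Barrier(MPI_COMM_WORLD);            /* warm up / synchronize the start */
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / reps;
    if (rank == 0)
        printf("average barrier time: %.3f ms\n", t * 1e3);
    MPI_Finalize();
    return 0;
}
```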
Interpreting These Numbers
• Concentration on short-message performance puts clusters on par with (expensive) traditional supercomputers
• Longer-message performance is not as competitive; version 1.9 addresses some of the issues
• Lends some understanding of large application performance on the NT SuperCluster
Future HPVM Development
• (Obvious) things that will happen
  • Support for Windows 2000
  • Alpha NT - move towards a 64-bit code base
  • Support for the new Myrinet Lanai 7 hardware
• HPVM development will move into a support role for other projects
  • Agile Objects: high-performance OO computing
  • Federated clusters
  • Tracking the NCSA SuperCluster hardware curve
Current State for Reference
• HPVM supports multiple processes/node and multiple process groups/cluster
  • Inter-group communication is not supported
• In-order, reliable messaging is guaranteed by a credit-based flow control scheme (sketched below)
  • The static scheme is simple but inflexible
• Only one route between any pair of processes
  • Even if multiple routes are available, only one is used
• Communication within the cluster is very fast; outside it is not
• The speed comes from many static constraints
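A minimal sketch of the credit idea referenced above: the sender spends a credit per packet and stalls at zero, and the receiver returns credits as it frees buffers. The per-peer credit count and function names are illustrative assumptions, not HPVM's implementation.

```c
/* Credit-based flow control sketch. Illustrative only -- not HPVM's scheme. */
#include <stdio.h>

#define CREDITS_PER_PEER 8          /* static buffer reservation per peer */

struct peer {
    int credits;                    /* packets we may still send to this peer */
};

/* Returns 1 if the packet was sent, 0 if the sender must wait for credits. */
static int try_send(struct peer *p /*, const void *pkt */)
{
    if (p->credits == 0)
        return 0;                   /* receiver's buffers presumed full */
    p->credits--;
    /* ...hand the packet to the network here... */
    return 1;
}

/* Called when the receiver acknowledges that it freed n buffers. */
static void credit_return(struct peer *p, int n)
{
    p->credits += n;
}

int main(void)
{
    struct peer p = { CREDITS_PER_PEER };
    int sent = 0;
    for (int i = 0; i < 12; i++) {
        if (!try_send(&p)) {        /* out of credits: wait for an update */
            credit_return(&p, 4);   /* pretend the receiver freed 4 buffers */
            try_send(&p);
        }
        sent++;
    }
    printf("sent %d packets, %d credits left\n", sent, p.credits);
    return 0;
}
```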
Designed and Now Implementing
• Dynamic flow control scheme for better scalability
  • Support larger clusters
• Multiple routes and out-of-order packet re-sequencing (see the sketch below)
  • Allow parallel paths for high-performance WAN connections
• Support inter-group communication
  • Driven by Agile Objects' need for remote method invocation / client-server interactions
• Support "federated clusters"
  • Integration into the Grid: bring the performance of the cluster outside of the machine room
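Out-of-order arrival over multiple routes is typically handled with sequence numbers and a re-sequencing window, as in the sketch below; the window size and structure names are assumptions, not HPVM's design.

```c
/* Re-sequencing sketch: packets arriving out of order over multiple routes
 * are held until the next expected sequence number shows up. */
#include <stdio.h>
#include <string.h>

#define WINDOW 16

struct reseq {
    unsigned next;                  /* next sequence number to deliver */
    int      present[WINDOW];       /* slot occupied? (indexed by seq % WINDOW) */
    char     data[WINDOW][64];      /* stashed out-of-order payloads */
};

static void deliver(unsigned seq, const char *payload)
{
    printf("deliver #%u: %s\n", seq, payload);
}

/* Called for every arriving packet, in whatever order the routes produce. */
static void on_arrival(struct reseq *r, unsigned seq, const char *payload)
{
    int slot = (int)(seq % WINDOW);
    r->present[slot] = 1;
    strncpy(r->data[slot], payload, sizeof r->data[slot] - 1);

    /* Drain everything that is now contiguous from 'next' upward. */
    while (r->present[r->next % WINDOW]) {
        int s = (int)(r->next % WINDOW);
        deliver(r->next, r->data[s]);
        r->present[s] = 0;
        r->next++;
    }
}

int main(void)
{
    struct reseq r = { 0 };
    on_arrival(&r, 1, "second");    /* arrives early: held */
    on_arrival(&r, 0, "first");     /* fills the gap: 0 and 1 delivered */
    on_arrival(&r, 2, "third");
    return 0;
}
```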
Is Linux in HPVM's Future?
• Maybe ;-)
• The critical technical hurdle is finding a user-scheduled, lightweight thread package
  • The NT version makes use of "fibers"
• The major impediment is time and a driving project
Summary
• HPVM gives good relative and absolute performance
• HPVM is moving past the "numbers game"
  • Concentrate on overall usability
  • Integration into the Grid
• Software development will continue, but it takes on a support role for driving projects
• Check out www-csag.ucsd.edu