Microsoft HPC Server 2008
A brief overview with emphasis on cluster performance
Eric Lantz (elantz@microsoft.com), Lead Program Manager, HPC Team, Microsoft Corp.
Fab Tillier (ftillier@microsoft.com), Developer, HPC Team, Microsoft Corp.
HPC Server 2008
A brief overview of this second release from Microsoft's HPC team.
Some Applicable Market Data
• IDC Cluster Study (113 sites, 303 clusters, 29/35/36 GIA split)
  • Industry self-reports an average of 85 nodes per cluster
  • When needing more computing power: ~50% buy a new cluster, ~50% add nodes to an existing cluster
  • When purchasing: 61% buy direct from the vendor, 67% have integration done by the vendor, and 51% use a standard benchmark in the purchase decision
  • A premium is paid for lower network latency as well as for power and cooling solutions
• Applications Study (IDC Cluster Study; IDC App Study: 250 codes, 112 vendors, 11 countries; site visits)
  • Apps use 4-128 CPUs and are mostly developed in-house
  • The majority are multi-threaded; only 15% use the whole cluster
  • In practice, 82% are run at 32 processors or below
  • Excel running in parallel is an application of broad interest
• Top challenges for implementing clusters:
  • Facility issues with power and cooling
  • System management capability
  • Complexity of implementing parallel algorithms
  • Interconnect latency
  • Complexity of system purchase and deployment
Sources: 2006 IDC Cluster Study, HECMS, 2006 Microsoft HEWS Study
Key HPC Server 2008 Features
• Systems Management
  • New admin console based on the System Center UI framework integrates every aspect of cluster management
  • Monitoring heat map allows viewing cluster status at a glance
  • High availability for multiple head nodes
  • Improved compute node provisioning using Windows Deployment Services
  • Built-in system diagnostics and cluster reporting
• Job Scheduling
  • Integration with Windows Communication Foundation, allowing SOA application developers to harness the power of parallel computing offered by HPC solutions
  • Job scheduling granularity at the processor core, processor socket, and compute node levels
  • Support for the Open Grid Forum's HPC Basic Profile interface
• Networking and MPI
  • NetworkDirect, providing dramatic RDMA network performance improvements for MPI applications
  • Improved Network Configuration Wizard
  • New shared-memory MS-MPI implementation for multicore servers
  • MS-MPI integrated with Event Tracing for Windows, with Open Trace Format translation
• Storage
  • Improved iSCSI SAN and Server Message Block (SMB) v2 support in Windows Server 2008
  • New parallel file system support and vendor partnerships for clusters with high-performance storage needs
  • New memory cache vendor partnerships
End-To-End Approach To Performance
• Multi-core is key: big improvements in MS-MPI shared memory communications
• NetworkDirect: a new RDMA networking interface built for speed and stability
• Devs can't tune what they can't see: MS-MPI integrated with Event Tracing for Windows
• Perf takes a village: partnering for perf
• Regular Top500 runs, performed by the HPC Server 2008 product team on a permanent, scale-testing cluster
Multi-Core is Key: Big improvements in MS-MPI shared memory communications
• MS-MPI automatically routes between:
  • Shared memory: between processes on a single [multi-proc] node
  • Network: TCP, RDMA (Winsock Direct, NetworkDirect)
• MS-MPI v1 monitored incoming shared-memory traffic by aggressively polling [for low latency], which caused:
  • Erratic latency measurements
  • High CPU utilization
• MS-MPI v2 uses an entirely new shared-memory approach:
  • Direct process-to-process copy to increase shared-memory throughput
  • Advanced algorithms to get the best shared-memory latency while keeping CPU utilization low
(Preliminary shared-memory results were shown in the accompanying chart.)
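To make the latency discussion concrete, here is a minimal ping-pong sketch written against the standard MPI C API; nothing in it is MS-MPI-specific, and the iteration count, message size, and launch command in the comment are illustrative choices rather than anything taken from the slides. Running both ranks on a single node exercises the shared-memory path described above.

    // Minimal MPI ping-pong latency sketch (generic MPI; builds against MS-MPI or any
    // other implementation). Launching both ranks on one node, e.g. "mpiexec -n 2
    // pingpong.exe" (illustrative command line), exercises the shared-memory path.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iterations = 10000;
        char msg = 0;                       // 1-byte payload: measures latency, not bandwidth
        MPI_Status status;

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        for (int i = 0; i < iterations; ++i)
        {
            if (rank == 0)
            {
                MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            }
            else if (rank == 1)
            {
                MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double elapsed = MPI_Wtime() - start;
        if (rank == 0)
        {
            // Half of the average round-trip time approximates one-way latency.
            printf("one-way latency: %.2f us\n", elapsed / iterations / 2.0 * 1e6);
        }

        MPI_Finalize();
        return 0;
    }

A polling-heavy shared-memory progress engine (the MS-MPI v1 behavior described above) tends to show up in exactly this kind of test as jittery latency numbers and a fully busy core.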
NetworkDirect: A new RDMA networking interface built for speed and stability
• Priorities
  • Performance equal to hardware-optimized stacks for MPI micro-benchmarks
  • Focus on an MPI-only solution for CCS v2
  • Verbs-based design for a close fit with native, high-performance networking interfaces
  • Coordinated with the Windows Networking team's long-term plans
• Implementation
  • MS-MPI v2 is capable of 4 networking paths:
    • Shared memory between processors on a motherboard
    • TCP/IP stack ("normal" Ethernet)
    • Winsock Direct for sockets-based RDMA
    • New NetworkDirect interface
  • HPC team is partnering with networking IHVs to develop and distribute drivers for this new interface
[Architecture diagram: in user mode, socket-based apps and MPI apps sit above MS-MPI, Windows Sockets (Winsock + WSD), the WinSock Direct and NetworkDirect providers, and the user-mode access layer; in kernel mode, TCP/IP, NDIS, and the mini-port driver sit above the networking hardware, with kernel-bypass paths from the RDMA providers straight to the hardware. Components are marked as CCP, OS, or IHV deliverables.]
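Because path selection happens inside MS-MPI, application code is identical across all four transports. The large-message bandwidth sketch below, again written against the generic MPI C API, is the kind of test where the RDMA paths (Winsock Direct, NetworkDirect) pull ahead of plain TCP; the 4 MB message size and iteration count are arbitrary illustrative values.

    // Minimal MPI bandwidth sketch (generic MPI). The same binary runs over whichever
    // path MS-MPI selects (shared memory, TCP, Winsock Direct, NetworkDirect); no
    // transport-specific code appears at the application level.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iterations = 100;
        const int msgBytes = 4 * 1024 * 1024;          // 4 MB: large enough to favor RDMA
        std::vector<char> buffer(msgBytes, 0);
        MPI_Status status;

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        for (int i = 0; i < iterations; ++i)
        {
            if (rank == 0)
                MPI_Send(&buffer[0], msgBytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&buffer[0], msgBytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)
        {
            double gigabytes = double(msgBytes) * iterations / 1e9;
            printf("bandwidth: %.2f GB/s\n", gigabytes / elapsed);
        }

        MPI_Finalize();
        return 0;
    }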
Devs can't tune what they can't see: MS-MPI integrated with Event Tracing for Windows
• Single, time-correlated log of OS, driver, MPI, and app events
• CCS-specific additions:
  • High-precision CPU clock correction
  • Log consolidation from multiple compute nodes into a single record of parallel app execution
• Dual purpose:
  • Performance analysis
  • Application troubleshooting
• Trace data display:
  • Visual Studio & Windows ETW tools
  • Coming soon: Vampir Viewer for Windows
[Diagram: mpiexec.exe -trace launches the app with the MS-MPI trace settings (mpitrace.mof); logman.exe provides trace control and clock synchronization through the Windows ETW infrastructure; per-node trace log files are consolidated at the end of the job and can be converted to text or consumed as a live feed.]
Perf Takes a Village (Partnering for Perf)
• Networking hardware vendors
  • NetworkDirect design review
  • NetworkDirect & Winsock Direct provider development
• Windows Core Networking team
• Commercial software vendors
  • Win64 best practices
  • MPI usage patterns
  • Collaborative performance tuning
  • 3 ISVs and counting
• 4 benchmarking centers online: IBM, HP, Dell, SGI
Regular Top500 Runs
• The MS HPC team just completed its 3rd entry on the Top500 list
  • Using our dev/test scale cluster (Rainier)
  • Currently #116 on the Top500
  • Best efficiency of any Clovertown system with SDR IB (77.1%)
  • Learnings incorporated into white papers and the CCS product
• Location: Microsoft Tukwila data center (22 miles from the Redmond campus)
• Configuration:
  • 260 Dell blade servers: 1 head node, 256 compute nodes, 1 IIS server, 1 file server
  • Networks: App/MPI on InfiniBand; private on Gb-E; public on Gb-E
  • Each compute node has two quad-core Intel Xeon 5320 (Clovertown, 1.86 GHz) and 8 GB RAM
• Totals: 2,080 cores, 2+ TB RAM
What is NetworkDirect?
What verbs should look like for Windows:
• A Service Provider Interface (SPI): verbs specifications are not APIs!
• Aligned with industry-standard verbs
  • Some changes for simplicity
  • Some changes for convergence of IB and iWARP
• Windows-centric design
  • Leverages Windows asynchronous I/O capabilities
ND SPI Traits
• Explicit resource management
  • Application manages memory registrations
  • Application manages CQ-to-endpoint bindings
• Only asynchronous data transfers
  • Initiate requests on an endpoint
  • Get request results from the associated CQ
• Application can use an event-driven and/or polling I/O model
  • Leverages Win32 asynchronous I/O for event-driven operation
  • No kernel transitions in polling mode
• "Simple" memory management model
  • Memory registrations are used for local access
  • Memory windows are used for remote access
• IP addressing
  • No proprietary address management required
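The real SPI definitions ship as a header in the HPC Server 2008 (beta) SDK (see the references at the end). The sketch below uses hypothetical placeholder interfaces and methods (IExampleAdapter, IExampleCompletionQueue, IExampleEndpoint, RegisterMemory, GetResults, and so on), invented for illustration and not taken from the shipping header; it only shows the traits listed above in code shape, i.e. explicit registration completed through Win32 overlapped I/O, application-controlled CQ-to-endpoint binding, asynchronous initiation, and a pure polling completion path. It compiles, but a real provider would have to stand behind the interfaces for it to run.

    // Hypothetical interface shapes -- NOT the real NetworkDirect SPI header -- used
    // only to illustrate the resource-management and completion flow described above.
    #include <windows.h>

    struct ExampleResult                    // completion entry drained from a CQ
    {
        HRESULT Status;
        ULONG   BytesTransferred;
        void*   RequestContext;
    };

    struct IExampleCompletionQueue
    {
        // Polling model: no kernel transition; returns the number of completions drained.
        virtual ULONG GetResults(ExampleResult* results, ULONG maxResults) = 0;
        // Event-driven model: arm a notification delivered via Win32 overlapped I/O.
        virtual HRESULT Notify(OVERLAPPED* ov) = 0;
    };

    struct IExampleMemoryRegion { };        // local registration; remote access would use a memory window

    struct IExampleEndpoint
    {
        // Data transfer is asynchronous only: Send() merely initiates the request.
        virtual HRESULT Send(void* requestContext, const void* buffer, ULONG length,
                             IExampleMemoryRegion* mr) = 0;
    };

    struct IExampleAdapter
    {
        // Registration is explicit and asynchronous (completed through the OVERLAPPED).
        virtual HRESULT RegisterMemory(void* buffer, SIZE_T length, OVERLAPPED* ov,
                                       IExampleMemoryRegion** mr) = 0;
        virtual HRESULT CreateCompletionQueue(ULONG depth, IExampleCompletionQueue** cq) = 0;
        // The application binds each endpoint to a CQ at creation time.
        virtual HRESULT CreateEndpoint(IExampleCompletionQueue* cq, IExampleEndpoint** ep) = 0;
    };

    // Flow sketch: register once, bind endpoint to CQ, initiate a send, poll for completion.
    HRESULT SendOnce(IExampleAdapter* adapter, HANDLE adapterFile, void* buf, SIZE_T len)
    {
        OVERLAPPED ov = {};
        ov.hEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);

        IExampleMemoryRegion* mr = nullptr;
        HRESULT hr = adapter->RegisterMemory(buf, len, &ov, &mr);
        if (hr == HRESULT_FROM_WIN32(ERROR_IO_PENDING))
        {
            DWORD bytes = 0;
            GetOverlappedResult(adapterFile, &ov, &bytes, TRUE);   // wait for registration
        }

        IExampleCompletionQueue* cq = nullptr;
        IExampleEndpoint* ep = nullptr;
        adapter->CreateCompletionQueue(64, &cq);
        adapter->CreateEndpoint(cq, &ep);

        ep->Send(/*requestContext*/ buf, buf, static_cast<ULONG>(len), mr);

        ExampleResult result;
        while (cq->GetResults(&result, 1) == 0)
            ;                                                      // pure polling: no kernel transition

        CloseHandle(ov.hEvent);
        return result.Status;
    }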
ND SPI Model
• Collection of COM interfaces
  • No COM runtime dependency; uses the interface model only
  • Follows the model adopted by the UMDF
• Thread-less providers
  • No callbacks
• Aligned with industry-standard verbs
  • Facilitates IHV adoption
Why COM Interfaces?
• Well-understood programming model
• Easily extensible via IUnknown::QueryInterface
  • Allows retrieving any interface supported by an object
• Object oriented
• C/C++ language independent
  • Callers and providers can be independently implemented in C or C++ without impact on one another
  • Interfaces support native code syntax; no wrappers
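As a concrete illustration of the QueryInterface-based extensibility, the sketch below asks an object for a newer interface and falls back gracefully if it is not supported. IUnknown and QueryInterface are standard COM; the IExampleAdapter2 interface and its IID are hypothetical, invented for this example rather than taken from the ND SPI.

    // Extensibility pattern: a caller holding a v1 interface pointer can ask the same
    // object for a newer interface via IUnknown::QueryInterface.
    #include <windows.h>
    #include <unknwn.h>

    // Hypothetical IID for the illustrative extended interface.
    static const IID IID_IExampleAdapter2 =
        { 0x6f9bbb1a, 0x1234, 0x4cde, { 0x9a, 0xbc, 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 } };

    struct IExampleAdapter2 : public IUnknown
    {
        virtual HRESULT STDMETHODCALLTYPE NewCapability() = 0;
    };

    HRESULT UseNewCapabilityIfAvailable(IUnknown* adapter)
    {
        IExampleAdapter2* v2 = nullptr;
        HRESULT hr = adapter->QueryInterface(IID_IExampleAdapter2,
                                             reinterpret_cast<void**>(&v2));
        if (SUCCEEDED(hr))
        {
            hr = v2->NewCapability();   // provider supports the extended interface
            v2->Release();              // release the reference QueryInterface added
        }
        return hr;                      // E_NOINTERFACE: caller falls back to v1 behavior
    }

Because only the interface model is used (no COM runtime dependency), a provider can hand out these interfaces without registration or apartment concerns, which is the same approach the UMDF takes.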
Asynchronous Operations
• Win32 overlapped operations used for:
  • Memory registration
  • CQ notification
  • Connection management
• Client controls the threading and completion mechanism
  • I/O completion port or GetOverlappedResult
• Simpler for kernel drivers to support
  • IoCompleteRequest: the I/O manager handles the rest
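The overlapped pattern referred to here is ordinary Win32 asynchronous I/O, so it can be illustrated with a plain file handle standing in for an ND adapter or CQ. In the sketch below, the file path and buffer size are arbitrary; the point is only the completion-harvesting choice the client controls, an I/O completion port versus a blocking GetOverlappedResult call.

    // Illustration of the two completion mechanisms, using overlapped file I/O in
    // place of NetworkDirect requests: issue the request, then harvest completion
    // either through an I/O completion port or via GetOverlappedResult.
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        HANDLE file = CreateFileW(L"C:\\Windows\\win.ini", GENERIC_READ, FILE_SHARE_READ,
                                  nullptr, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
        if (file == INVALID_HANDLE_VALUE)
            return 1;

        // Mechanism 1: the client binds the handle to a completion port it owns, so it
        // also decides which threads call GetQueuedCompletionStatus.
        HANDLE iocp = CreateIoCompletionPort(file, nullptr, /*CompletionKey*/ 1, 0);

        char buffer[512];
        OVERLAPPED ov = {};
        if (!ReadFile(file, buffer, sizeof(buffer), nullptr, &ov) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;

        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* completed = nullptr;
        if (GetQueuedCompletionStatus(iocp, &bytes, &key, &completed, INFINITE))
            printf("IOCP completion: %lu bytes\n", bytes);

        // Mechanism 2 (alternative): skip the completion port and block on the single
        // request with GetOverlappedResult(file, &ov, &bytes, TRUE).

        CloseHandle(iocp);
        CloseHandle(file);
        return 0;
    }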
References
• Microsoft HPC web site: HPC Server 2008 (beta) available now!
  http://www.microsoft.com/hpc
• Network Direct SPI documentation, header, and test executables: in the HPC Server 2008 (beta) SDK
  http://www.microsoft.com/hpc
• Microsoft HPC community site
  http://windowshpc.net/default.aspx
• Argonne National Lab's MPI website
  http://www-unix.mcs.anl.gov/mpi/
• CCS 2003 Performance Tuning whitepaper
  http://www.microsoft.com/downloads/details.aspx?FamilyID=40cd8152-f89d-4abf-ab1c-a467e180cce4&DisplayLang=en
  Or go to http://www.microsoft.com/downloads and search for "CCS Performance"
Socrates software boosts performance by 30% on the Microsoft cluster, achieving 77.1% overall cluster efficiency
The performance improvement was demonstrated with exactly the same hardware and is attributed to:
• Improved networking performance of MS-MPI's NetworkDirect interface
• An entirely new MS-MPI implementation for shared memory communications
• Tools and scripts to optimize process placement and tune the Linpack parameters for this 256-node, 2048-processor cluster
• Windows Server 2008 improvements in querying completion port status
• Use of Visual Studio's Profile Guided Optimization (POGO) on the Linpack, MS-MPI, and ND provider binaries