High-Performance Computing on the Windows Server Platform
Marvin Theimer, Software Architect, Windows Server HPC Group
hpcinfo@microsoft.com
Microsoft Corporation
Session Outline
• Brief introduction to HPC
  • Definitions
  • Market trends
• Overview of V1 version of Windows Server 2003 CCE
  • Features
  • System architecture
• Key challenges for future HPC systems
  • Too many factors affect performance
  • Grid computing economics
  • Data management
Defining High Performance Computing (HPC)
HPC definition: using compute resources to solve computationally intensive problems
[Figure: different platforms for achieving results, and the role of HPC in science. Recoverable labels: Computational Modeling; Sensors; Persist (DB, FS, ...); Technical and Scientific Computing; HPC Use; Mining; Interpretation]
Cluster HPC Scenario
[Figure: cluster architecture. Recoverable labels:
  Head node: User Mgmt, Cluster Mgmt, Job Mgmt, Resource Mgmt; Web service; Job Policy, reports
  Access: User (Web page, Cmd line), Admin (Management), Input, Job
  Data: Sensors, Workflow, Computation; Data mining, Visualization, Workflow; Remote query; DB or FS
  Cluster node: Job Mgr, Resource Mgr, Node Mgr, MPI, User App
  Interconnect: high speed, low latency (1GE, Infiniband, Myricom)]
Top 500 Supercomputer Trends
• Clusters over 50%
• Industry usage is rising
• GigE is gaining
• IA is winning
Commoditized HPC Systems are Affecting Every Vertical
• Leverage volume markets of industry-standard hardware and software
• Rapid procurement, installation and integration of systems
• Cluster-ready applications accelerating market growth
  • Engineering
  • Bioinformatics
  • Oil & Gas
  • Finance
  • Entertainment
  • Government/Research
The convergence of affordable high-performance hardware and commercial apps is making supercomputing a mainstream market.
Cheap, Interactive HPC Systems Are Making Supercomputing Personal
[Figure labels: Grids of personal & departmental clusters; Personal workstations & departmental servers; Minicomputers; Mainframes]
The Evolving Nature of HPC
[Figure labels: IT Mgr; Manual, batch execution; Interactive Computation and Visualization; SQL]
Windows-based HPC Today
Technical Solution
• Partner Driven Solution Stack Ecosystem
• Partnerships with ISVs to develop on Windows
• Partnership with Cornell Theory Center
[Figure: solution stack, by layer]
  Management: LSF, PBSPro, DataSynapse, MSTI
  Applications: Parallel Applications
  Middleware: MPI/Pro, MPICH-1.2, WMPI, MPI-NT
  OS: Windows, Visual Studio
  Protocol: TCP
  Interconnect: Gigabit Ethernet, Fast Ethernet
  Platform: Intel (32-bit & 64-bit) & AMD x64
What Windows-based HPC needs to provide
Users require:
• An integrated, supported solution stack leveraging the Windows infrastructure
• Simplified job submission, status and progress monitoring
• Maximum compute performance and scalability
• Simplified environment from desktops to HPC clusters
Administrators require:
• Ease of setup and deployment
• Better cluster monitoring and management for maximum resource utilization
• Flexible, extensible, policy-driven job scheduling and resource allocation
• High availability
• Secure process startup and complete cleanup
Developers require:
• Programming environment that enables high productivity
• Availability of optimized compilers (Fortran) and math libraries
• Parallel debugger, profiler, and visualization tools
• Parallel programming models (MPI)
V1 Plans
• Introduce compute cluster solution
• Windows Server 2003 Compute Cluster Edition, based on Windows Server 2003 SP1 x64 Standard Edition
• Features for job management, IT admins and developers
• Build partner ecosystem around the Windows Server Compute Cluster Edition from day one
• Establish Microsoft credibility in the HPC community
• Create worldwide Centers of Innovation
Technologies
Platform
• Windows Server 2003 SP1 64-bit Edition
• x64 processors (Intel EM64T & AMD Opteron)
• Ethernet, Ethernet over RDMA and Infiniband support
Administration
• Prescriptive, simplified cluster setup and administration
• Scripted, image-based compute node management
• Active Directory based security, impersonation and delegation
• Cluster-wide job scheduling and resource management
Development
• MPICH-2 from Argonne National Laboratory
• Cluster scheduler accessible via DCOM, HTTP, and Web Services
• Visual Studio 2005: compilers, parallel debugger
• Partner-delivered compilers and libraries
Windows HPC Environment
[Figure: cluster architecture on Windows Server 2003, Compute Cluster Edition. Recoverable labels:
  Head node: User Mgmt, Cluster Mgmt, Job Mgmt, Resource Mgmt; Web service; Job Policy, reports; Active Directory; Microsoft Operations Manager
  Access: User (Web page, Cmd line), Admin (Management), Input, Job
  Data: Sensors, Workflow, Computation; Data mining, Visualization, Workflow; Remote query; DB or FS
  Cluster node: Job Mgr, Resource Mgr, Node Mgr, MPI, User App
  Interconnect: high speed, low latency (Ethernet over RDMA, Infiniband)]
Architectural Overview
[Figure: system architecture diagram. The box layout is not recoverable from the extracted text; the labels that survive are:
  Boxes: User Workstation, Developer Workstation, Head Node, Cluster Nodes, Cluster Data
  Operating systems and hardware: Windows XP, Windows Server 2003 CCE, x86/64, Disk, GigE, GigE/RDMA, Infiniband
  Workstation components: Application, Job Scripts, Job Sched UI, SFU, HPC SDK, MPI, Sched WS, Policy API, Compilers, Libs, COM, WSE, Whidbey
  Head node components: Job Scheduler, IIS6, RIS, AD, MSDE, WSE3, HTTP
  Cluster node components: Node Manager, HPC Application, MPI-2 over TCP, SHM, WSD/SDP
  Legend: Application, 3rd Party, Windows OS, MS Component, HPC Component]
Difficult to Tune Performance
• Example: tightly-coupled MPI applications
  • Very sensitive to network performance characteristics
    • Communication times measured in microseconds: O(10 µs) for interconnects such as InfiniBand; O(100 µs) for GigE
    • OS network stack is a significant factor: things like RDMA can make a big difference
    • Excited about the prospects of industry-standard RDMA hardware
    • We are working with InfiniBand and GigE vendors to ensure our stack supports them
    • Driver quality is an important facet
    • We are supporting the OpenIB initiative
    • Considering the creation of a WHQL program for InfiniBand
  • Very sensitive to mismatched node performance
    • Random OS activities can add millisecond delays to microsecond communication times
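The last point can be made concrete with a small back-of-the-envelope model. The sketch below is not from the talk: it assumes a bulk-synchronous application in which every node waits for the slowest one at each step, and it assumes each node is independently interrupted with some small probability per step. The 10 µs communication time and the 1 ms delay come from the slide; the interruption probability of 0.001 is a purely illustrative assumption.

```python
def prob_step_delayed(nodes: int, p_noise: float) -> float:
    """Probability that at least one of `nodes` is interrupted during a step.

    Assumes interruptions are independent across nodes (a simplification).
    """
    return 1.0 - (1.0 - p_noise) ** nodes


def expected_step_time_us(nodes: int, comm_us: float,
                          noise_us: float, p_noise: float) -> float:
    """Expected duration of one synchronized step, in microseconds.

    All nodes wait for the slowest one, so a single interrupted node
    delays the whole step by `noise_us` (simplification: a delayed step
    costs the same no matter how many nodes were interrupted).
    """
    return comm_us + noise_us * prob_step_delayed(nodes, p_noise)


if __name__ == "__main__":
    # comm_us and noise_us follow the slide; p_noise is an assumed value.
    for nodes in (1, 16, 256):
        t = expected_step_time_us(nodes, comm_us=10.0,
                                  noise_us=1000.0, p_noise=0.001)
        print(f"{nodes:4d} nodes: {t:8.1f} us per step")
```

Under these assumptions the expected step time grows with the node count even though each individual node is rarely disturbed, which is why mismatched or noisy nodes matter more as clusters scale.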
Need self-tuning systems
• Application configuration has a significant impact
  • Incorrect assumptions about hardware/communications architecture can dramatically affect performance
  • Choice of communication strategy
  • Choice of communication granularity
  • …
• Tuning is an end-to-end issue:
  • OS support
  • ISV library support
  • ISV application support
Computational Grid Economics
• What $1 will buy you (roughly):
  • Computers cost $1000 (roughly)
    • 1 CPU day (~10 Tera-ops) == $1 (roughly, assuming 3-year use cycle)
  • 10 TB network transfer costs == $1 (roughly, assuming 1 Gbps interconnect)
  • Internet bandwidth costs roughly $100/Mbps/month (not including routers and management)
    • 1 GB network transfer costs == $1 (roughly)
• Some observations:
  • HPC cluster communication is 10,000x cheaper than WAN communication
  • Break-even point for instructions computed per byte transferred:
    • Cluster: O(1) instrs/byte
    • WAN: O(10,000) instrs/byte
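The break-even figures follow directly from the dollar equivalences on this slide. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and treating "10 Tera-ops" as 10 × 10^12 instructions; the constant and function names are illustrative, not from the talk:

```python
# Rough dollar equivalences taken from the slide (order-of-magnitude only).
OPS_PER_DOLLAR = 10e12            # ~10 Tera-ops, i.e. one CPU day
CLUSTER_BYTES_PER_DOLLAR = 10e12  # 10 TB over a 1 Gbps cluster interconnect
WAN_BYTES_PER_DOLLAR = 1e9        # 1 GB over the Internet


def break_even_instrs_per_byte(bytes_per_dollar: float) -> float:
    """Instructions that cost as much to execute as moving one byte.

    Below this ratio, moving the data costs more than computing on it.
    """
    return OPS_PER_DOLLAR / bytes_per_dollar


if __name__ == "__main__":
    cluster = break_even_instrs_per_byte(CLUSTER_BYTES_PER_DOLLAR)
    wan = break_even_instrs_per_byte(WAN_BYTES_PER_DOLLAR)
    print(f"cluster break-even: {cluster:,.0f} instrs/byte")
    print(f"WAN break-even:     {wan:,.0f} instrs/byte")
    print(f"WAN/cluster cost ratio per byte: {wan / cluster:,.0f}x")
```

An application that executes far fewer instructions per transferred byte than the WAN break-even spends more on moving its data than on computing with it, so it is uneconomic to run across the wide area.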
Computational Grid Economics Implications
• "Small data, high compute" applications work well across the Internet, such as SETI@home and Folding@home
• MPI-style parallel, distributed applications work well in clusters and across LANs, but are uneconomic and do not work well in wide-area settings
• Data analysis is usually best done by moving the programs to the data, not the data to the programs
  • Move questions and answers, not petabyte-scale datasets
• The Internet is NOT the CPU backplane (Internet-2 will not change this)
Exploding Data Sizes
• Experimental data: TBs → PBs
• Modeling data:
  • Today: 10's to 100's of GB is the common case
  • Tomorrow: TBs
• Near-future example: CFD simulation of a turbine engine
  • 10**9 mesh nodes, each containing 16 double-precision variables
  • 128 GB / time-step
  • Simulate 1000's of time steps → 100's of TBs / simulation
  • Archived for future reference
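The turbine-engine numbers can be checked with a few lines of arithmetic. This sketch assumes 8-byte IEEE double-precision values and decimal units (1 GB = 10^9 bytes); the function names are illustrative:

```python
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision


def bytes_per_time_step(mesh_nodes: int, vars_per_node: int) -> int:
    """Raw output size of one simulation time step, in bytes."""
    return mesh_nodes * vars_per_node * BYTES_PER_DOUBLE


def bytes_per_simulation(mesh_nodes: int, vars_per_node: int,
                         time_steps: int) -> int:
    """Raw output size of a whole run if every time step is kept."""
    return bytes_per_time_step(mesh_nodes, vars_per_node) * time_steps


if __name__ == "__main__":
    step = bytes_per_time_step(10**9, 16)
    run = bytes_per_simulation(10**9, 16, 1000)
    print(f"per time step: {step / 10**9:.0f} GB")
    print(f"per 1000-step simulation: {run / 10**12:.0f} TB")
```

With 10**9 mesh nodes and 16 variables per node, one time step is 128 GB, so a run of 1000 steps already produces 128 TB and longer runs reach the hundreds of terabytes quoted on the slide.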
Whole-System Modeling and Workflow
• Today: mostly about computation
  • Stand-alone static simulations of individual parts/phenomena
  • Mostly batch
  • Simple workflows: short, deterministic pipelines (though some are massively parallel)
• Future: mostly about data that is produced and consumed by computational steps
  • Dynamic whole-system modeling via multiple, interacting simulations
  • More complex workflows (don't yet know how complex)
  • More interactive analysis
  • More sharing
Whole-System Modeling Example: Turbine Engine
• Interacting simulations
  • CFD simulation of dynamic airflow through turbine
  • FE stress analysis of engine & wing parts
  • "Impedance" issues between various simulations (time steps, meshes, ...)
• Serial workflow steps
  • Crack analysis of engine & wing parts
  • Visualization of results
Interactive Workflow Example
• Base CFD simulation produces huge output
  • Points of interest may not be easy to find
• Find and then focus on important details
  • Data analysis/mining of output
  • Restart simulation at a desired point in time/space
  • Visualize simulation from that point forward
  • Modify simulation from that point forward (e.g. higher fidelity)
Data Analysis and Mining
• Traditional approach:
  • Keep data in flat files
  • Write C or Perl programs to compute specific analysis queries
• Problems with this approach:
  • Imposes significant development times
  • Scientists must reinvent DB indexing and query technologies
• Results from the astronomy community:
  • Relational databases can yield speed-ups of one to two orders of magnitude
  • SQL + application/domain-specific stored procedures greatly simplify creation of analysis queries
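To illustrate the contrast between the two approaches, the sketch below answers the same question twice: once with a hand-written scan over flat records and once with a declarative SQL query over an indexed table. Everything here is hypothetical: the `objects` table, its columns, and the synthetic catalog are invented for illustration, and SQLite (from the Python standard library) stands in for a full relational database. It is meant to show how the query is expressed, not to reproduce the speed-ups reported by the astronomy community.

```python
import sqlite3


def make_rows(n):
    """Deterministic synthetic catalog rows: (id, ra, dec, magnitude)."""
    return [(i,
             float((i * 37) % 360),
             float((i * 17) % 180) - 90.0,
             10.0 + (i * 7) % 15)
            for i in range(n)]


def scan_count(rows, max_mag, ra_lo, ra_hi):
    """Flat-file style: custom code that visits every record."""
    return sum(1 for _, ra, _, mag in rows
               if mag < max_mag and ra_lo <= ra < ra_hi)


def build_db(rows):
    """Load the rows into an in-memory database and index the magnitude."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects ("
                 "id INTEGER PRIMARY KEY, ra REAL, dec REAL, magnitude REAL)")
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_objects_mag ON objects (magnitude)")
    conn.commit()
    return conn


def sql_count(conn, max_mag, ra_lo, ra_hi):
    """Database style: state the question and let the engine plan the access."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM objects "
        "WHERE magnitude < ? AND ra >= ? AND ra < ?",
        (max_mag, ra_lo, ra_hi)).fetchone()
    return count


if __name__ == "__main__":
    rows = make_rows(1000)
    conn = build_db(rows)
    print("scan:", scan_count(rows, 12.0, 0.0, 90.0))
    print("sql: ", sql_count(conn, 12.0, 0.0, 90.0))
```

A new analysis question in the second style is a new `WHERE` clause rather than a new program, and the indexing work is done once by the database instead of being reinvented per script.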
Combining Simulation with Experimental Data: Drug Discovery
• Clinical trial database describes toxicity & side effects observed for tested drugs
• Simulation searches for candidate compounds that have a desired effect on a biological system
• Clinical data is searched for drugs that contain a candidate compound or "near neighbor"; toxicity results are retrieved and used to decide whether the candidate compound should be rejected
Sharing
• Simulations (or ensembles of simulations) mostly done in isolation
  • No sharing except for archival output
• Some coarse-grained sharing
  • Check-out/check-in of large components
  • Example: automotive design
    • Check-out component
    • CAE-based design & simulation of component
    • Check-in with design rule checking step
• Data warehouses typically only need coarse-grained update granularity
  • Bulk or coarse-grained updates
  • Modeling & simulations done in the context of particular versions of the data
• Audit trails and reproducible workflows becoming increasingly important
Data Management Needs
• Cluster file systems and/or parallel DBs to handle I/O bandwidth needs of large, parallel, distributed applications
• Data warehouses for experimental data and archived simulation output
• Coarse-grained geographic replication to accommodate distributed workforces and workflows
• Indexing and query capabilities to do data mining & analysis
• Audit trails, workflow recorders, etc.
Windows HPC Roadmap
V1: Introduce complete Windows-based compute cluster solution
• Complete development platform: SFU, MPI, compilers, parallel debugger
• Integrated cluster setup and management
• Secure and programmable job scheduling and management
V2: Apply Microsoft's core innovation to HPC
• Enhanced development support
• Harnessing unused desktop cycles for HPC
• Meta scheduling to integrate desktops and forests of clusters
• Efficient management and manipulation of large datasets and workflows
Long-term vision: Revolutionize scientific and technical computation
• Ease of parallel programming
• Exploitation of hardware features (e.g. GPUs and multi-core chips)
• Integration with high-level tools (e.g. Excel)
• Support for domain-specific platforms
Call To Action
• IHVs
  • Develop Winsock Direct drivers for your RDMA cards
    • Automatically let our MPI stack take advantage of low latency
  • Develop support for diskless scenarios (e.g. iSCSI)
• OEMs
  • Offer turn-key clusters
    • Pre-wired for management and RDMA networks
  • Support "boot from net" diskless scenarios
  • Support WS-Management
  • Consider noise and power requirements for personal and workgroup configurations
Community Resources
• Windows Hardware & Driver Central (WHDC): www.microsoft.com/whdc/default.mspx
• Technical Communities: www.microsoft.com/communities/products/default.mspx
• Non-Microsoft Community Sites: www.microsoft.com/communities/related/default.mspx
• Microsoft Public Newsgroups: www.microsoft.com/communities/newsgroups
• Technical Chats and Webcasts: www.microsoft.com/communities/chats/default.mspx and www.microsoft.com/webcasts
• Microsoft Blogs: www.microsoft.com/communities/blogs
Related WinHEC Sessions
• TWNE05005: Winsock Direct Value Proposition-Partner Concepts
• TWNE05006: Implementing Convergent Networking-Partner Concepts
To Learn More
• Microsoft:
  • Microsoft HPC website: http://www.microsoft.com/hpc/
• Other sites:
  • CTC Activities: http://cmssrv.tc.cornell.edu/ctc/winhpc/
  • 3rd Party Windows Cluster Resource Centre: www.windowsclusters.org
  • HPC related-links web site: http://www.microsoft.com/windows2000/hpc/miscresources.asp
• Some useful articles & presentations:
  • “Supercomputing in the Third Millenium”, by George Spix: http://www.microsoft.com/windows2000/hpc/supercom.asp
  • Introduction of the book “Beowulf Cluster Computing with Windows” by Thomas Sterling, Gordon Bell, and Janusz Kowalik
  • “Distributed Computing Economics”, by Jim Gray, MSR-TR-2003-24: http://research.microsoft.com/research/pubs/view.aspx?tr_id=655
  • “Web Services, Large Databases, and what Microsoft is doing in the Grid Computing Space”, presentation by Jim Gray: http://research.microsoft.com/~Gray/talks/WebServices_Grid.ppt
• Send questions to hpcinfo@microsoft.com