High Performance Computing With Microsoft Compute Cluster Solution
Kyril Faenov (kyrilf@microsoft.com)
DAT301
Director of High Performance Computing
Microsoft Corporation
High Performance Computing
• Cutting-edge problems in science, engineering and business always require capabilities beyond those of standalone computers
• Market pressures demand an accelerated innovation cycle, overall cost reduction and thorough outcome modeling
  • Aircraft design utilizing composite materials
  • Vehicle fuel efficiency and safety improvements
  • Simulations of enzyme catalysis, protein folding
  • Targeted material and drug design
  • Simulation of nanoscale electronic devices
  • Financial portfolio risk modeling
  • Digital content creation and enhancement
  • Supply chain modeling and optimization
  • Long-term climate projections
Volume economics of industry-standard hardware and commercial software applications are rapidly bringing HPC capabilities to a broader range of users
Evolution Of HPC Applications And Systems
[Figure: diagram. Labels: IT Mgr; Manual, batch execution; Interactive Computation and Visualization; SQL]
Microsoft Compute Cluster Solution
[Figure: architecture diagram. Labels: Head Node; Active Directory; Job Mgmt; Cluster Mgmt; Scheduling; Resource Mgmt; Desktop App; Job; Policy, reports; Admin Console; User; Admin; User Console; Management; Cmd line; Input; Data; DB/FS; Node Manager; High speed, low latency interconnect; Job Execution; User App; MPI]
CCS Product Summary
• What is it?
  • A solution that allows compute-intensive applications to easily and cost-effectively scale their performance using compute clusters
• Core Platform
  • Based on Windows Server 2003 SP1 64-bit Edition
  • Ethernet, InfiniBand and other interconnect support leveraging Winsock Direct
• Administration
  • Prescriptive, simplified cluster setup and administration
  • Scripted, image-based compute node management
  • Active Directory based security
  • Scalable, extensible job scheduling and resource management
• Development
  • Cluster scheduler programmable via .NET and DCOM
  • MPI2 stack with performance and security enhancements for parallel applications
  • Visual Studio 2005: OpenMP, Parallel Debugger
Models Of Parallel Programming
• Data Parallel
  • Shared memory (load, store, lock, unlock...)
    • Already present in Windows today!
  • Message Passing (send, receive, broadcast...)
    • MPI support in Compute Cluster Solution
  • Directive-based (compiler needs help...)
    • OpenMP support in Visual Studio 2005
  • Transparent (compiler works magic...)
    • Holy grail
• Task parallel
• Data-flow and Vector
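As an illustration of the directive-based model, here is a minimal sketch in C in which a single OpenMP pragma asks the compiler to split a loop across threads and combine the per-thread sums. The function name sum_of_squares is hypothetical and not part of the original slides. OpenMP has to be enabled at compile time (for example /openmp in Visual C++ or -fopenmp in GCC); without it the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <assert.h>

/* Hypothetical helper: returns v[0]^2 + ... + v[n-1]^2.
   The directive is the "help" the compiler needs: it marks the loop as
   parallel and declares sum as a reduction variable, so each thread
   accumulates a private copy that is added together at the end. */
double sum_of_squares(const double *v, int n)
{
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += v[i] * v[i];
    return sum;
}
```

The loop body is unchanged from the serial version; only the directive differs, which is the main appeal of this model compared with explicit message passing.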
Job/Task Conceptual Model
[Figure: diagram of four job types: Serial Job, Parallel MPI Job, Parameter Sweep Job, Task Flow Job. Labels: Task; Proc; IPC]
About MPI
• Early HPC systems (Intel’s NX, IBM’s EUI, etc.) were not portable
• The MPI Forum organized in 1992 with broad participation by
  • vendors: IBM, Intel, TMC, SGI, Convex, Meiko
  • portability library writers: PVM, p4
  • users: application scientists and library writers
• MPI is a standard specification with many implementations
  • MPICH and MPICH2 reference implementations from Argonne
  • MS-MPI based on (and compatible with) MPICH2
  • Other implementations include LAM/MPI, Open MPI, MPI-Pro, WMPI
• Why did the MS HPC team choose MPI?
  • MPI has emerged as the de facto standard for parallel programming
• MPI consists of 3 parts
  • Full-featured API of 160+ functions
  • Secure process launch and communication runtime
  • Command-line tool (mpiexec) to launch jobs
MS-MPI Leverages Winsock Direct
[Figure: network stack diagram. User Mode labels: HPC Application; MPI; WinSock DLL; Winsock Switch (switch traffic based on sub-net); IB WinSock Provider DLL (IB w/ RDMA); GigE RDMA WinSock Provider DLL (GigE w/ RDMA); User API (verbs based); User Host Channel Adapter Driver; Manage hardware resources in user space (e.g., send and receive queues). Kernel Mode labels: TCP; IP; NDIS; Miniport (GigE); Miniport (IPoIB); Kernel API (verbs based); Virtual Bus Driver; Host Channel Adapter Driver. Other labels: Ethernet; Networking Hardware. Legend: OS component; IHV-provided component]
Fundamental MPI Features
Programming with MPI
• Communicators
  • Groups of nodes used for communications
  • MPI_COMM_WORLD is your friend
• Rank (a node’s ID)
  • Target communications
  • Segregate work
• Collective Operations
  • Collect and reduce data in a single call
  • sum, min, max, and/or, etc.
• Fine control of comms and buffers if you like
  • MPI and derived data types
Launching Jobs
• MPIexec arguments
  • # of processors required
  • Names of specific compute nodes to use
  • Launch and working directories
  • Environment variables to set for this job
  • Global values (for all compute nodes, not just the launch node)
  • Point to files of command line arguments
• env MPICH_NETMASK to control the network used for this MPI job
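To make "rank segregates work" and "reduce data in a single call" concrete, here is a serial sketch in plain C. It makes no MPI calls: the function names partial_sum and reduce_sum are hypothetical, and the loop over ranks only stands in for what a sum reduction across real processes would compute. Each rank takes every nprocs-th item starting at rank + 1, and the partial results are then added together.

```c
#include <assert.h>

/* Work assigned to one rank: sum the integers
   rank+1, rank+1+nprocs, rank+1+2*nprocs, ... that are <= n.
   In a real MPI program each process would run this once, using its own rank. */
long partial_sum(int rank, int nprocs, int n)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    long sum = 0;
    for (int i = rank + 1; i <= n; i += nprocs)
        sum += i;
    return sum;
}

/* Serial stand-in for a sum reduction: combine every rank's partial result.
   The total does not depend on how many ranks share the work. */
long reduce_sum(int nprocs, int n)
{
    long total = 0;
    for (int rank = 0; rank < nprocs; rank++)
        total += partial_sum(rank, nprocs, n);
    return total;
}
```

With 4 ranks and n = 10, rank 0 handles items 1, 5 and 9, rank 1 handles 2, 6 and 10, and so on; the strided assignment keeps the shares balanced without any coordination between ranks.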
Example: Calculate Pi
[Figure: plot with both axes marked 1, divided into n intervals]
Parallel Execution Visualization
[Figure: execution trace, with a detail view magnified 1000x]
• Each line represents 1000s of messages
• Detailed view shows opportunities for optimization
Job Scheduler Stack
[Figure: stack diagram. Labels: Jobs/Tasks; Client Node; Admission; Head Node; Allocation; Activation; Compute Node]
End-To-End Security
[Figure: security flow diagram. Labels: Client; Scheduler; Node Mgr; Kerberos; Secure channel; credential; Logon as user; Data Protection API; Active Directory; MSDE; Spawn; Logon token; Task; Data; DB/FS; LSA; Automatic ticket renewal]
Community Resources
At PDC, go see
• FUN302: Programming with Concurrency: Multithreading Best Practices (9/13 2:45pm)
• FUN405: Programming with Concurrency: Multithreading on Windows (9/13 4:15pm)
• FUN323: Microsoft Research: Future Possibilities in Concurrency (9/16 8:00am)
• Bob Muglia’s Windows Server keynote (9/15 8:30am)
• Product Pavilion: meet HPC team members
• Visit the Hands On Lab to try the demos yourself!
To Learn More
• Microsoft Compute Cluster Solution Beta 1: Released Today!
  • http://connect.microsoft.com/availableprograms.aspx
• Microsoft HPC website
  • http://www.microsoft.com/hpc/
• Public newsgroup
  • nntp://microsoft.public.windows.hpc/
• MPICH home and documentation
  • http://www-unix.mcs.anl.gov/mpi/mpich/
© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
Welcome to the Multi-Core Era
[Figure: chart of log transistors/die and log CPU clock freq against year (1975, 2003, 2015). Data labels: 10,000 and 1 MHz; 100 M and 3 GHz; 5 B and <10 GHz. Growth annotations: >30%/y and <10%/y]
• How can programmers benefit from concurrency today?
• How is concurrency supported and used in the Microsoft platform?
• What techniques is Microsoft Research investigating for programming future highly-parallel hardware?
Example: Calculate pi

    #include "mpi.h"
    #include <stdio.h>   /* printf, scanf */
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int done = 0, n, myid, numprocs, i, rc;
        /* Initialize "correct" pi */
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x, a;

        MPI_Init(&argc, &argv);                    /* Start MPI */
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* Get # procs assigned to this job */
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* Get proc # of this proc */

        /* Compute */
        while (!done) {
            /* On proc 0, ask user for number of intervals */
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            /* Send # of intervals to all procs */
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;
Example: Calculate pi (2)

            /* Sum this proc's share of the intervals */
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;

            /* Sum all proc intervals */
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            /* Report: on proc 0, print the estimated value of pi
               and the deviation from the "correct" value */
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
        MPI_Finalize();
        return 0;
    }
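The arithmetic in the loop is the midpoint rule applied to a standard integral for pi. The formulas below restate what the code computes; the symbols r (for myid) and p (for numprocs) are notation introduced here for illustration and do not appear on the slides.

```latex
\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; h \sum_{i=1}^{n} \frac{4}{1+x_i^2},
\qquad h = \frac{1}{n}, \quad x_i = h\left(i - \tfrac{1}{2}\right)
\]
\[
\mathrm{mypi}_r = h \sum_{\substack{1 \le i \le n \\ i \equiv r+1 \pmod{p}}} \frac{4}{1+x_i^2},
\qquad
\pi \approx \sum_{r=0}^{p-1} \mathrm{mypi}_r
\]
```

The first line is the serial approximation with n intervals of width h. The second line shows the split: process r sums the terms i = r+1, r+1+p, r+1+2p, ..., and the final sum over r is what MPI_Reduce with MPI_SUM delivers to process 0. The \substack and \pmod commands assume the amsmath package.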
Job Submission: serial job

    C:\> job submit /stdout:\\hn\users\foo\^%CCP_JOBID^%.txt myapp.exe
    Job submitted, ID:4

Escape character: ^ (placed before each %)

job submit options:

    [/scheduler:host]
    [/jobname:JobName]
    [/numprocessors:min[-max]]
    [/runtime:{[[DAYS:]HOURS:]MINUTES|infinite}]
    [/priority:{Highest|AboveNormal|Normal|BelowNormal|Lowest}]
    [/projectname:name]
    [/askednodes:node1[,node2[,...]]]
    [/exclusive:{true | false}]
    [/name:TaskName]
    [/rerunnable:{true | false}]
    [/checkpointable:{true | false}]
    [/runtillcancelled:{true | false}]
    [/stdin:file] [/stdout:file] [/stderr:file]
    [/lic:feature1:amt1 /lic:feature2:amt2 ... /lic:featureN:amtN]
    [/workdir:folder]
    command [arguments]
Job Submission: MPI job

    C:\> job submit /numprocessors:4-8 mpiexec -hosts ^%CCP_NODES^% myapp.exe

    C:\> job submit /numcpus:4-8 mytask.bat

mytask.bat / myjob.bat:

    rem setup environment variables
    set Path="C:\program files\vendor\bin"
    set LM_LICENSE_FILE="c:\program files\vendor\license.bat"
    rem Stage the input files
    . . .
    rem Invoke MPI
    mpiexec -hosts %CCP_NODES% myapp.exe arg1 arg2 …
    rem Stage out the results
    . . .
Job Submission: Parametric Sweep

    # Create a job container
    $str = `job new /numprocessors:4-8`;
    if ($str =~ /ID: (\d+)/) {
        $jobid = $1;
    }

    # add parametric tasks
    for ($i = 0; $i < 128; $i++) {
        `job add $jobid /stdout:\\\\hn\\users\\foo\\output.$i /stderr:\\\\hn\\users\\foo\\output.$i myapp.bat`;
    }

    # submit the job
    `job submit /id:$jobid`;
Job Submission: Task Flow

    # create a job container
    $str = `job new /numprocessors:4-8`;
    if ($str =~ /ID: (\d+)/) {
        $jobid = $1;
    }

    # add a set-up task
    `job add $jobid /name:setup setup.bat`;

    # all these tasks wait for the setup task to complete
    for ($i = 0; $i < 128; $i++) {
        `job add $jobid /name:compute /depend:setup compute.bat`;
    }

    # this task waits for all the "compute" tasks to complete
    `job add $jobid /name:aggregate /depend:compute aggregate.bat`;

[Figure: task flow diagram with nodes labeled "setup", "compute" and "aggregate"]
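The dependency rule used above (a task that depends on a name waits until every task carrying that name has completed) can be sketched in C as follows. This is an illustrative model only, not the scheduler's actual implementation: the Task struct and the functions is_ready and run_in_waves are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical task record: its name, the name it depends on
   (NULL if none), and whether it has completed. */
typedef struct {
    const char *name;
    const char *depend;
    int done;
} Task;

/* A task is ready when it has not run yet and every task carrying
   the name it depends on has completed. */
static int is_ready(const Task *tasks, size_t count, size_t idx)
{
    if (tasks[idx].done) return 0;
    if (tasks[idx].depend == NULL) return 1;
    for (size_t j = 0; j < count; j++) {
        if (strcmp(tasks[j].name, tasks[idx].depend) == 0 && !tasks[j].done)
            return 0;
    }
    return 1;
}

/* Runs the tasks in waves: each wave holds all tasks that are ready
   at the same time. order[] receives task indices in completion order.
   Returns the number of waves, or -1 if no progress is possible
   (for example, a dependency cycle). */
int run_in_waves(Task *tasks, size_t count, size_t *order)
{
    assert(count == 0 || (tasks != NULL && order != NULL));
    size_t finished = 0;
    int waves = 0;
    while (finished < count) {
        size_t wave_start = finished;
        /* Collect the whole wave first, so tasks in the same wave
           cannot unlock each other. */
        for (size_t i = 0; i < count; i++)
            if (is_ready(tasks, count, i))
                order[finished++] = i;
        if (finished == wave_start) return -1;
        /* Only now mark the collected tasks as completed. */
        for (size_t k = wave_start; k < finished; k++)
            tasks[order[k]].done = 1;
        waves++;
    }
    return waves;
}
```

For one "setup" task, several "compute" tasks that depend on setup, and one "aggregate" task that depends on compute, this yields three waves regardless of the order in which the tasks were added.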