430 likes | 548 Views
HPC Profile BOF. Marvin Theimer Microsoft Corporation Marty Humphrey University of Virginia. Agenda. 11:00 – 11:15 Review of Charter ( Humphrey ) 11:15 – 11:30 HPC Use Cases – Base Case and Common Cases ( Theimer ) 11:30 – 11:45 Extensible Job Submission Design ( Theimer )
E N D
HPC Profile BOF Marvin Theimer Microsoft Corporation Marty Humphrey University of Virginia
Agenda • 11:00 – 11:15 Review of Charter (Humphrey) • 11:15 – 11:30 HPC Use Cases – Base Case and Common Cases (Theimer) • 11:30 – 11:45 Extensible Job Submission Design (Theimer) • 11:45 – 12:00 Comparative Analysis • Extensible Job Submission Design and JSDL/BES (Wasson) • ESI – Snelling/Foster (Theimer) • 12:00 – 12:30 Discussion
History • GGF14: Chicago Jul 14 2005 • “Minimal Web Services BOF” (aka WS-Management) • Newhouse, Theimer, Humphrey, Tollefsrud • GGF15: Boston Oct 6 2005 • UVa update on WS-Management use for OGSA (Wasson) • Specific technical thoughts on the support of dual stacks • Suspended given rumored “reconcilation” • GGF16: Athens Feb 14 2006 • “An evolutionary approach to realizing the Grid vision” • Theimer, Parastatidis, Hey, Humphrey, Fox • OGSA F2F Feb 17 206 • Theimer gives detailed presentation of the “evolutionary” paper
More History • Mar 15 2006 • “Toward Converging Web Service Standards for Resources, Events, and Management” • HP, IBM, Intel, Microsoft • OGSA F2F: Sunnyvale, CA April 5 2006 • Theimer presented the use-case document • Since March 2006, active engagement of OGSA-WG mailing list to build consensus
OGSA HPC Profile WG (Computing Area) • Objective: the profile and protocol specifications needed to realize the vertical use case of batch job scheduling of scientific/technical applications • “use case” = HPC use case • Output: HPCP (normative) • Scope • Identify any changes/extensions that are deemed necessary to existing protocol specifications and will work with the relevant working groups to try to affect the identified changes/extensions • Identify additional protocol specifications that need to be defined and will either work on their definition or spin them out to additionally defined working groups.
OGSA HPC Profile WG (Computing Area) • “sub-profiles” • interface for specifying and submitting and scheduling jobs • interface for bulk data staging • Evolutionary approach • A simple base case will be defined that we expect to have universally implemented by all batch job scheduling clients and schedulers. • All additional functionality will be defined in terms of optional extensions (which are anticipated to be widely applicable)
Pre-existing Documents • JSDL • BES • “An evolutionary approach to realizing the Grid vision”
Status • Use-case in final revisions • Resource reservation • Provisioning • Execution • Next: what the framework should be for defining extension profiles • Aggressive milestones to meet vendor deadlines
Deliverables • OGSA HPC Use Cases – Base Case and Common Cases (GFD-I) • OGSA HPC profile specification (GFD-R.P) • OGSA HPC initial common cases extension profile specification (GFD-R.P)
HPC Use Cases – Base Case and Common Cases[ GFD-I ](11:15 – 11:30)
Goals • BASE case: • ALL scheduling clients and services are expected to understand • HPC, not Grid (i.e., do NOT span administrative domains) • Common Cases: • Represent some significant fraction of implementors, not all implementors • NOT all cases – only common cases • capture client-visible functionality requirements rather than being technology/system-design-driven
Base Case • High throughout compute cluster used only within the enterprise • User requests: • Submit a job with specification of resource requirements unique jobID or fault • Query a specific job for its current state • Cancel a specific job • List jobs • State diagram: queued, running, finished
Base Case (cont) • Only small set of “standard” resources • number of CPUs/compute nodes needed, memory requirements, disk requirements, etc. • Only equality of string values and numeric relationships among pairs of numeric values are provided in the base use case. • Once a job has been submitted it can be cancelled, but its resource requests can't be modified
Base case: Out of Scope • Data access issues • Programs are assumed to be pre-installed • Creation and management of user security credentials • No need for directory services beyond something like DNS • Management of the system resources
Base case: Fault tolerance model • Job fails because of “system problems” • Job must be resubmitted by client • the job scheduler will not automatically rerun the job • Failure of the scheduler may or may not cause currently running jobs to fail.
Base case: Job Exits • Whether it exited successfully, with an error code, or terminated due to some other fault situation. • How long it ran in terms of wall-clock time.
Base case: scheduling policy • FIFO • Out-of-scope: • quotas and other forms of SLAs • Non-independent jobs • Infrastructure support for parallel, distributed programs (such as MPI) • Reservation of resources separate from allocation to a running job (e.g., reserve 3 cpus for future use) • Interactive access to running jobs
Common Cases • Purpose of enumerating common cases: use as the basis for creating appropriate extension mechanisms • 13 cases
13 Common Cases • Exposing existing schedulers’ functionality • Condor, Globus, LSF, Maui, Microsoft-CCS, PBS, SGE, etc. • Polling vs. notification • notification “call-back” messages for significant changes in the state of a job • What are the semantics of message delivery? • At-Most-Once and Exactly-Once Submission Guarantees • The base use case allows the possibility that a client can’t directly tell whether its job submission request has been successfully received by a job scheduler or not • Types of Data Access • non-transparent staging of data between independent storage systems. • explicitly supports transparent data access within a virtual organization or across a federated set of organizations
13 Common Cases (cont.) • Types of Programs to Install/Provision/Run • users may have programs that require explicit installation of some form. • Multiple Organization Use Cases • Submission of jobs requires additional security support (e.g., “foreign” credential) • Data current resides outside of the enterprise in question • Additional sandboxing of non-local users • Extended Resource Descriptions • allow arbitrary resource types whose semantics are not understood by the HPC infrastructure • accounting information returned for a job
13 Common Cases (cont.) • Extended Client/System Administrator Operations • users may wish to modify the requirements for a job after it has already been submitted • Arrays of jobs • system administrators: suspension/resumption of jobs and migration of jobs among compute nodes • Extended Scheduling Policies • shortest/smallest-job-first, weighted-fair-share scheduling, etc. • multiple submission queues, job submission quotas, and various forms of SLAs, such as guarantees on how quickly a job will be scheduled to run.
13 Common Cases (cont.) • Parallel Programs and Workflows of Programs • instantiate such programs (e.g., MPI) across multiple compute nodes in a suitable manner, including provision of information that will allow the various program instances to find each other within the cluster • Programs may have execution dependencies on each other. • Advanced Reservations and Interactive Use Cases • reserve resources for use at a specific future time • communicate in real time with external client users • Cycle Scavenging • batch job scheduler dispatches jobs to machines that have dynamically indicated to it that they are currently available for running guest jobs.
13 Common Cases (cont.) • Multiple Schedulers • submit work to the whole of the computing infrastructure without having to manually select which facility to submit to
Status • Need feedback • Is the base case sufficient? • Missing any “common” cases? • Any of the 13 “too uncommon”?
Extensible Job Submission Design (EJS) • Main focus: extensibility • Philosophy: • Cover all the bases (resource reservation, provisioning, execution, data staging, etc.) • Keep it simple • Approach: • Minimalist base cases (overall and for each sub-component) • Optional extensions to enable both richer semantics and evolution
What is a Job? • OGSA glossary: • Job: User-defined task that is scheduled to be carried out by an execution subsystem • Task: ??? • Single program instance? • Distributed MPI program? • What about data staging? BES defines simple workflows • Execution subsystem: ??? • Job queue? • Process? Compute node? Multiple compute nodes? • Workflow: • Focus is on business processes & services • No mention of executing multiple user-defined tasks or data staging steps • Batch job scheduling literature: • Job ~ accounting entity under which multiple user-defined steps are run
Core Concepts • Task: execution of one or more program instances in one or more execution subsystems • Compute node: execution subsystem that actually executes a program • Resources: • Compute node CPUs, memory, disk space, etc. • Aggregates: # of compute nodes, all resources of a compute node, etc. • Scheduler: allocates resources to job and tasks • Resource allocation: 3 distinct phases • Clients query schedulers about available resources • Clients reserve resources • Schedulers allocate resources to tasks or to reservation requests • Job: reified resource reservation against which tasks can be run
Canceled Pending Running New Finished Failed Base Task States
Canceled New Unsatisfied Satisfied Finished Failed Base Job States
Multiple Schedulers Cluster13 Cluster13-1 CN Cluster13-headnode Task1 Sched13 Cluster13-2 Meta-sched Client CN Task1 Cluster42 Desktop-foo Sched42 Cluster42-8 CN CN Task3 42-1 Task2 42-7 …
Other Topics Covered • Advertising resource information • Failure and recovery model • Security and credential delegation
Types of Extensions • Purely additive extensions allowed (i.e. no changes to base semantics) • Additional WSDL operations (incl. for parameter overloading) • Array operations • Extended state diagrams • Extended resource descriptions • Extended information representations • Multiple, composable, extensible “micro”-protocols
Canceled Canceled New Pending Running Finished New Pending Running Finished Migrate Suspend Running: Migrating Running: Suspended Failed Failed Canceled New Pending Running: Stage-in Running: Executing Running: Stage-out Finished Failed Specialization of States Profile A: Task state transition diagram for a scheduling profile that extends the base protocol to support task migration Profile C: Task state transition diagram for a scheduling profile that extends the base protocol to support task suspension Profile B: Task state transition diagram for a scheduling profile that extends the base protocol to support the notion of staging in data to a compute node before a task runs and staging data out back to the client user after the task has finished execution
Base Interoperability Interface • Task interface: • CreateTask(schedulerEPR, resourceDescr, credentialsDescr, lifetime) taskDescr • QueryTask(taskEPR, taskID, queryDescr) taskDescr • CancelTask(taskEPR, taskID) • Scheduler interface: • QueryScheduler(schedulerEPR, queryDescr) schedulerDescr
Generic Extensions • Array operations • Notifications • Query operation modifiers • Idempotent message delivery semantics • EPR resolution
Task Interface Extensions • Re-execution of failed tasks • Additional & extended resource definitions • Additional operations • ModifyTask • … • Additional scheduling policies • Support for parallel/distributed programs • Data staging • Provisioning • Static workflow
Resource Reservations • Job interface: • CreateJob (schedulerEPR, resourceDescr, credentialsDescr, lifetime) rsrvDesc • QueryJob (rsrvEPR, rsrvID) rsrvDescr • ModifyJob (rsrvEPR, rsrvID, resourceDescr) rsrvDescr • CancelJob (rsrvEPR, rsrvID)
Multiple Schedulers • Hierarchical information option • Client scheduler list • AnnounceScheduler (schedulerEPR, announcerDesc)
Comparison of ESI to Extensible Job Submission Design • Focus of ESI: reconciliation/synthesis of Globus and Unicore • Focus of EJS: extensibility