Grid(Lab) Resource Management System

Grid(Lab) Resource Management System …and general Grid Resource Management Jarek Nabrzyski et al. naber@man.poznan.pl Poznan Supercomputing And Networking Center

GridLab • EU-funded project involving 11 European and 3 American partners (Globus and Condor teams) • January 2002 – December 2004 • Main goal: to develop a Grid Application Toolkit (GAT) and a set of grid services and tools... • resource management (GRMS), • data management, • monitoring, • adaptive components, • mobile user support, • security services, • portals, ... and to test them on a real testbed with real applications CGW 2003

GridLab Members • PSNC (Poznan) - coordination • AEI (Potsdam) • ZIB (Berlin) • Univ. of Lecce • Cardiff University • Vrije Univ. (Amsterdam) • SZTAKI (Budapest) • Masaryk Univ. (Brno) • NTUA (Athens) • Sun Microsystems • Compaq (HP) • ANL (Chicago, I. Foster) • ISI (LA, C. Kesselman) • Univ. of Wisconsin (M. Livny) • collaborating with: • Users! • EU Astrophysics Network, • DFN TiKSL/GriKSL • NSF ASC Project • other Grid projects • Globus, Condor, • GrADS, • PROGRESS, • GriPhyN/iVDGL, • CrossGrid and all the other European Grid Projects (GRIDSTART) • other...

It’s Easy to Forget How Different 2003 is From 1993 • Ubiquitous Internet: 100+ million hosts • Collaboration & resource sharing the norm • Ultra-high-speed networks: 10+ Gb/s • Global optical networks • Enormous quantities of data: Petabytes • For an increasing number of communities, gating step is not collection but analysis • Huge quantities of computing: 100+ Top/s • Ubiquitous computing via clusters • Moore’s law everywhere: 1000x/decade • Instruments, detectors, sensors, scanners Courtesy of Ian Foster

And Thus, The Grid “Problem” (or Opportunity) • Dynamically link resources/services • From collaborators, customers, eUtilities, … (members of evolving “virtual organization”) • Into a “virtual computing system” • Dynamic, multi-faceted system spanning institutions and industries • Configured to meet instantaneous needs, for: • Multi-faceted QoX for demanding workloads • Security, performance, reliability, … Courtesy of Ian Foster

Integration as a Fundamental Challenge • Many sources of data, services, computation • Security & policy must underlie access & management decisions • Registries organize services of interest to a community • Resource management is needed to ensure progress & arbitrate competing demands • Data integration activities may require access to, & exploration of, data at many locations • Exploration & analysis may involve complex, multi-step workflows [Diagram labels: Discovery, Access, R, RM, Policy service, Security service] Courtesy of Ian Foster

Grid Scheduling Current approach: • Extension of job scheduling for parallel computers • Resource discovery and load-distribution to a remote resource • Usually batch job scheduling model on remote machine What Grid scheduling actually requires: • Co-allocation and coordination of different resource allocations for a Grid job • Instantaneous ad-hoc allocation is not always suitable This complex task involves: • “Cooperation” between different resource providers • Interaction with local resource management systems (interfaces and functions of many LRMs differ from each other) • Support for reservations and service level agreements • Orchestration of coordinated resource allocations
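Co-allocation, as required above, amounts to finding a time window in which every resource needed by a Grid job is free at once. The following is a minimal sketch only, assuming each resource publishes its free time as (start, end) intervals; the function name `earliest_common_slot`, the resource names and the interval data are illustrative and not part of any GridLab or GGF interface.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) in arbitrary time units, end exclusive


def earliest_common_slot(free: Dict[str, List[Interval]], duration: int) -> Optional[int]:
    """Return the earliest start time at which every resource is free for `duration`."""
    # If a common slot exists, it can be shifted left until it touches the start
    # of some free interval, so only interval starts need to be tried.
    candidates = sorted({start for intervals in free.values() for start, _ in intervals})
    for t in candidates:
        if all(
            any(start <= t and t + duration <= end for start, end in intervals)
            for intervals in free.values()
        ):
            return t
    return None  # no simultaneous window: the job cannot be co-allocated


# Hypothetical free intervals for three resources of one Grid job.
free_slots = {
    "computer1": [(0, 4), (6, 12)],
    "computer2": [(2, 10)],
    "network2": [(5, 9), (10, 14)],
}
slot = earliest_common_slot(free_slots, duration=3)
print(slot)
```

A real Grid scheduler would obtain these intervals through advance-reservation queries to each local resource manager rather than from a static table.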

Allocation for Grid Job Example [Timeline figure: allocations per resource over time; coordinating them is the task of Grid resource management] • Data: storing data, data access • Network 1: data transfer • Computer 1: loading data, parallel computation, providing data • Network 2: communication for computation • Computer 2: parallel computation • Software license: software usage • Storage: data storage • Network 3: communication for visualization • VR-Cave: visualization

Local Scheduling Systems Observation: • Local resource management (LRM) systems exist • require extension for Grids by additional software or • will directly support Grids in the future • DRMAA is available today for some LRM systems • Different LRM systems will be part of the Grid and will perform lower-level scheduling • In addition the Grid will require some higher-level scheduling for coordinating the user’s jobs. • Multi-level scheduling model

Multi-level Grid Scheduling Architecture [Diagram: Grid user and Grid scheduler at the higher level (Grid scheduling architecture); at the lower level, one scheduler per resource, each with its own schedule over time and local job queues, for Resource 1, Resource 2, ... Resource n] Courtesy of Ramin Yahyapour

User Objective Local computing typically has: • A given scheduling objective, such as minimization of response time • Use of batch queuing strategies • Simple scheduling algorithms: FCFS, Backfilling Grid Computing requires: • Individual scheduling objective • better resources • faster execution • cheaper execution • More complex objective functions apply for individual Grid jobs!

Provider/Owner Objective Local computing typically has: • Single scheduling objective for the whole system: • e.g. minimization of average weighted response time or high utilization/job throughput In Grid Computing: • Individual policies must be considered: • access policy, • priority policy, • accounting policy, and others • More complex objective functions apply for individual resource allocations! • User and owner policies/objectives may be subject to privacy considerations!

Grid Economics – Different Business Models • Cost model • Use of a resource • Reservation of a resource • Individual scheduling objective functions • User and owner objective functions • Formulation of an objective function • Integration of the function in a scheduling algorithm • Market-economic approaches • Application of computational intelligence • Resource selection • The scheduling instances act as broker • Collection and evaluation of resource offers

Scheduling Model Using a brokerage/trading strategy: • Higher-level scheduling: submit Grid job description, discover resources, query for allocation offers, collect offers, select offers, coordinate allocations • Consider individual user policies • Consider community policies • Lower-level scheduling: analyze query, generate allocation offer • Consider individual owner policies
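The brokerage model above can be sketched in a few lines. This is an illustration under stated assumptions, not the GridLab implementation: each lower-level scheduler is modelled as a plain function that analyzes a query and returns an allocation offer or declines, and the names `Offer`, `Query`, `site_a`, `site_b` and `broker`, as well as all numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Offer:
    provider: str
    start: int      # earliest start time offered
    runtime: int    # estimated run time on this resource
    cost: float     # price asked by the owner


@dataclass(frozen=True)
class Query:
    user: str
    cpus: int


# A lower-level scheduler: analyzes the query and generates an offer,
# or returns None when an owner policy forbids the allocation.
Provider = Callable[[Query], Optional[Offer]]


def site_a(q: Query) -> Optional[Offer]:
    return Offer("site-a", start=10, runtime=50, cost=5.0) if q.cpus <= 64 else None


def site_b(q: Query) -> Optional[Offer]:
    # Hypothetical owner policy: this site only serves small jobs.
    return Offer("site-b", start=0, runtime=80, cost=2.0) if q.cpus <= 8 else None


def broker(q: Query, providers: List[Provider],
           objective: Callable[[Offer], float]) -> Optional[Offer]:
    """Collect offers from all providers and select the one minimising the user's objective."""
    offers = [o for o in (p(q) for p in providers) if o is not None]
    return min(offers, key=objective, default=None)


sites = [site_a, site_b]
fastest = broker(Query("alice", 4), sites, lambda o: o.start + o.runtime)
cheapest = broker(Query("alice", 4), sites, lambda o: o.cost)
big_job = broker(Query("alice", 32), sites, lambda o: o.cost)
```

The same set of offers yields different selections depending on the user's objective, while each provider keeps control over which offers it generates.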

Properties of Multi-Level Scheduling Model • Multi-level scheduling must support different RM systems and strategies. • Provider can enforce individual policies in generating resource offers • User receives resource allocation optimized to the individual objective • Different higher-level scheduling strategies can be applied • Multiple levels of scheduling instances are possible • Support for fault-tolerant and load-balanced services

Negotiation in Grids • Multilevel Grid scheduling architecture • Lower level local scheduling instance • Implementation of owner policies • Higher level Grid scheduling instance • Resource selection and coordination • (Static) Interface definition between both instances • Different types of resources • Different local scheduling systems with different properties • Different owner policies • (Dynamic) Communication between both instances • Resource discovery • Job monitoring

GGF WG Scheduling Attributes Define the attributes of a lower-level scheduling instance that can be exploited by a higher-level scheduling instance. • Attributes of allocation properties • Guaranteed completion time of allocation, • Allocations run-to-completion, … • Attributes of available information • Access to tentative schedule, • Exclusive control,… • Attributes for manipulating allocation execution • Preemption, • Migration,… • Attributes for requesting resources • Allocation Offers, • Advance Reservation,…

Towards Grid Scheduling Grid Scheduling Methods: • Support for individual scheduling objectives and policies • Economic scheduling methods for Grids • Multi-criteria scheduling models (most general) Architectural requirements: • Generic job description • Negotiation interface between higher- and lower-level scheduler • Economic management services • Workflow management • Integration of data and network management • Interoperability is key!

Data and Network Scheduling Most new resource types can be included via individual lower-level resource management systems. Additional considerations for • Data management • Select resources according to data availability • But data can be moved if necessary! • Network management • Consider advance reservation of bandwidth or SLA • Network resources usually depend on the selection of other resources! • Coordinate data transfers and storage allocation • User empowered/owned lambdas!

Example of a Scheduling Process Example: 40 resources of requested type are found. 12 resources are selected. 8 resources are available. Network and data dependencies are detected. Utility function is evaluated. 6th tried allocation is confirmed. Data/network provided and job is started. Scheduling Service: • receives job description • queries Information Service for static resource information • prioritizes and pre-selects resources • queries for dynamic information about resource availability • queries Data and Network Management Services • generates schedule for job • reserves allocation if possible, otherwise selects another allocation • delegates job monitoring to Job Manager Job Manager/Network and Data Management: service, monitor and initiate allocation
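The steps of the Scheduling Service listed above can be read as a filter-rank-reserve loop. The sketch below is a simplified illustration: the callbacks `is_available`, `utility` and `try_reserve` stand in for the Information Service, the utility function and the reservation request, and the resource dictionaries are made-up data. Data and network dependencies are omitted.

```python
from typing import Any, Callable, Dict, List, Optional

Resource = Dict[str, Any]


def schedule(
    job: Dict[str, Any],
    resources: List[Resource],
    is_available: Callable[[Resource], bool],
    utility: Callable[[Resource], float],
    try_reserve: Callable[[Resource], bool],
) -> Optional[Resource]:
    # 1. Static pre-selection against the job description.
    candidates = [r for r in resources if r["type"] == job["type"]]
    # 2. Dynamic information: keep only resources that are currently available.
    candidates = [r for r in candidates if is_available(r)]
    # 3. Evaluate the utility function and try allocations best-first.
    for r in sorted(candidates, key=utility, reverse=True):
        # 4. Reserve if possible, otherwise move on to the next allocation.
        if try_reserve(r):
            return r
    return None


pool = [
    {"name": "r1", "type": "cluster", "free": True, "speed": 3.0},
    {"name": "r2", "type": "cluster", "free": True, "speed": 2.0},
    {"name": "r3", "type": "cluster", "free": False, "speed": 9.0},
    {"name": "r4", "type": "storage", "free": True, "speed": 1.0},
]
chosen = schedule(
    {"type": "cluster"},
    pool,
    is_available=lambda r: r["free"],
    utility=lambda r: r["speed"],
    try_reserve=lambda r: r["name"] != "r1",  # pretend r1 rejects the reservation
)
```

Here the best-ranked resource refuses the reservation, so the second allocation tried is confirmed, mirroring the "6th tried allocation is confirmed" step in the example on a smaller scale.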

Conclusions for Grid Scheduling Grids ultimately require coordinated scheduling services. • Support for different scheduling instances • different local management systems • different scheduling algorithms/strategies • For arbitrary resources • not only computing resources, also • data, storage, network, software etc. • Support for co-allocation and reservation • necessary for coordinated grid usage (see data, network, software, storage) • Different scheduling objectives • cost, quality, other • Grid resource management services are a key to success of the Grid vision! • …so are the applications that could show the Grid benefit!

Integration of a Grid Scheduling System • Globus as de-facto standard • but no higher-level scheduling services available • Many projects include scheduling requirements • Focus on a running implementation for a specific problem • No attempt to generate a general solution • Grid scheduling cannot be developed by single groups • Requirements for several other services • Community effort is key! • Requires open Grid standards that enable Grid scheduling • Support for different implementations while being interoperable

Activities • Core service infrastructure • OGSA/OGSI • GGF hosts several groups in the area of Grid scheduling and resource management. Examples: • WG Scheduling Attributes (finished) • WG Grid Resource Allocation Agreement Protocol (active) • WG Grid Economic Services Architecture (active) • WG Scheduling Ontology (proposed) • RG Grid Scheduling Architecture (proposed) • Network of Excellence “CoreGRID” (proposed) • define the software infrastructure for Next Generation Grids

What are Basic Blocks for a Grid Scheduling Architecture? [Diagram components: Information Service (static & scheduled/forecasted; query for resources), Scheduling Service (reservation), Data Management Service, Network Management Service, Job Supervisor Service, Accounting and Billing Service; Data Manager, Compute Manager and Network Manager (maintain information); management systems for compute/storage/visualization etc., data resources and network resources] Basic Blocks and Requirements are still to be defined! Courtesy of Ramin Yahyapour

Conclusion • Resource management and scheduling is a key service in a Next Generation Grid • In a large Grid the user cannot handle this task • Nor is the orchestration of resources a provider task • System integration is complex but vital • Individual results may be of limited benefit without being embedded in a larger project • Basic research is required in this area. • No ready-to-implement solution is available (although EDG, CrossGrid, GridLab etc. work on it) • New concepts are necessary A significant joint effort is needed to support Grid Scheduling! Research is also still necessary!

RM in GridLab: What our users want... • Two primary applications: Cactus (simulations) and Triana (data mining/analysis) • other application communities are also being engaged, • Application oriented environment • Resources (grid) on demand • Adaptive applications/adaptive scenarios – adaptive grid environment • job checkpoint, migration, spawn off a new job when needed, • Open, pervasive, not even restricted to a single Virtual Organization • The ability to work in a disconnected environment • start my job on a disconnected laptop; migrate it to grid when it becomes available • from laptops to fully deployed Virtual Organizations • Mobile working • Security on all levels

What our users want... (cont.) • The infrastructure must provide capabilities to customise choice of service implementation (e.g. using efficiency, reliability, first succeeding, all) • Advance reservation of resources, • To be able to express their preferences regarding their jobs on one hand and to understand the resource and VO policies on the other hand, • Policy information and negotiation mechanisms • what is the usage policy of the remote resources? • Prediction-based information • How long will my job run on a particular resource? • What resources do I need to complete the job before deadline?

Coalescing Binary Scenario [Diagram labels: Controller; Email, SMS notification; Logical File Name; GW Data; Distributed Storage; GAT (GRMS, Adaptive): submit job, optimised mapping; GAT (Data Management); CB Search; GridLab Test-bed]

GridLab RMS approach • Grid resources are not only the machines, but also databases, files, users, administrators, instruments, mobile devices, jobs/applications ... • Many metrics for scheduling: throughput, cost, latency, deadline, other time and cost metrics... • Grid resource management consists of job/resource scheduling, security (authorization services,...), local policies, negotiations, accounting, ... • GRM is a negotiation process driven by both users and resource owners, and thus a multicriteria decision-making process • WS-Agreement is badly needed • Two ongoing implementations: production keep-up and future full-feature

GRMS - the VO RMS • GRMS is a VO (community) resource management system • Its component-based, pluggable architecture allows it to be used as a framework for many different VOs • Components include: • Resource discovery (now MDS-based, other solutions are easy to add: now adding the GridLab testbed information system: see Ludek’s talk tomorrow) • Scheduling (Multicriteria, economy, workflow, co-scheduling, SLA-based scheduling) - work in progress • Job Management • Workflow Management • Resource Reservation

GRMS - the plan [Diagram labels: Information Services; Data Management; Authorization System; Adaptive; GRMS: Resource Discovery, File Transfer Unit, Jobs Queue, Broker, Job Receiver, Execution Unit, Monitoring, SLA Negotiation, Scheduler, Workflow Manager, Resource Reservation, Prediction Unit; Globus, other; Local Resources (Managers)]

Current implementation • submitJob - submits new job, • migrateJob - migrates existing job, • getMyJobsList - returns list of jobs belonging to the user, • registerApplicationAccess - registers application access, • getJobStatus - returns GRMS status of the job, • getHostName - returns host name, on which the job is/was running, • getJobInfo - returns a structure describing the job, • findResources - returns resources matching user's requirements, • cancelJob - cancels the job, • getServiceDescription - returns description of a service.
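To make the shape of this interface concrete, here is a toy in-memory stand-in that mirrors a few of the operation names above. Only the operation names come from the slide; the parameter lists, return types, status strings and the class name `InMemoryGrms` are assumptions made for illustration and do not reflect the actual GRMS web service signatures.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    job_id: str
    owner: str
    description: str            # job description document (XML in GRMS)
    host: str
    status: str = "SUBMITTED"   # status names are assumptions


class InMemoryGrms:
    """Toy stand-in mirroring a few operation names from the slide."""

    def __init__(self, hosts: List[str]) -> None:
        self._hosts = hosts
        self._jobs: Dict[str, JobRecord] = {}
        self._ids = itertools.count(1)

    def submitJob(self, owner: str, description: str) -> str:
        job_id = f"job-{next(self._ids)}"
        # No scheduling here: the first host is always chosen.
        self._jobs[job_id] = JobRecord(job_id, owner, description, self._hosts[0])
        return job_id

    def migrateJob(self, job_id: str) -> None:
        # Move to the next host in the list; a real broker would re-run scheduling.
        job = self._jobs[job_id]
        nxt = (self._hosts.index(job.host) + 1) % len(self._hosts)
        job.host = self._hosts[nxt]

    def cancelJob(self, job_id: str) -> None:
        self._jobs[job_id].status = "CANCELLED"

    def getJobStatus(self, job_id: str) -> str:
        return self._jobs[job_id].status

    def getHostName(self, job_id: str) -> str:
        return self._jobs[job_id].host

    def getMyJobsList(self, owner: str) -> List[str]:
        return [j.job_id for j in self._jobs.values() if j.owner == owner]


grms = InMemoryGrms(["hostA", "hostB"])
jid = grms.submitJob("alice", "<grmsjob/>")
grms.migrateJob(jid)
```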

GRMS - overview [Diagram labels: Resource Management System: User Access Layer, Broker, Resource Discovery, Job Manager; Grid Environment: GridLab Services (Data Management, Adaptive Components, GridLab Authorization Service), (GAT) Application, GridLab Portal, Globus Infrastructure (MDS, GRAM, GridFTP, GASS)]

GRMS – detailed view [Diagram components: DB, Resource Discovery, GridLab Services, Web Service Interface, Workflow Mgmt., Job Queue, Broker, User Access Layer, System Layer Services, Task Registry, Job Manager]

GRMS – detailed view (cont.) [Same diagram as the previous slide, with a Web Service Interface attached to the individual components]

GRMS functionality • Ability to choose the best resources for the job execution, according to Job Description and chosen mapping algorithm; • Ability to submit the GRMS Task according to provided Job Description; • Ability to migrate the GRMS Task to better resource, according to provided Job Description; • Ability to cancel the Task; • Provides information about the Task status; • Provides other information about Tasks (name of host where the Task is/was running, start time, finish time);

GRMS functionality (cont.) • Provides list of candidate resources for the Task execution (according to provided Job Description); • Provides a list of Tasks submitted by given user; • Ability to transfer input and output files (GridFTP, GASS, WP8 Data Management System); • Ability to contact Adaptive Components Services to get additional information about resources • Ability to register GAT Application callback information • Ability to submit a set of tasks with precedence constraints (work-flow of tasks and input/output files)

GRMS modules • Broker Module • Steers the process of job submission • Chooses the best resources for job execution (scheduling algorithm) • Transfers input and output files for job's executable • Resource Discovery Module • Finds resources that fulfill the requirements described in the Job Description • Provides information about resources, required for job scheduling

GRMS modules (cont.) • Job Manager Module • Ability to check current status of job • Ability to cancel running job • Monitors for status changes of running job • Workflow Management Module • Creates workflow graph of tasks from Job Description • Puts tasks into the Job Queue • Controls the tasks execution according to precedence constraints

GRMS modules (cont.) • Web Service Interface • Provides GSI enabled web service interface for Clients (GAT Application, GridLab Portal) • Job Queue • Allows tasks to be put into the queue • Provides a way of getting tasks from the queue according to the configured algorithm (FIFO) • Task Registry • Stores information about the task execution (start time, finish time, machine where executed, current status, Job Description)

Job Description • Task executable • file location • arguments • file argument (files which have to be present in working directory of running executable) • environment variables • standard input • standard output • standard error • checkpoint files

Job Description (cont.) • Resource requirements of executable • name of host for job execution (if provided, no scheduling algorithm is used) • operating system • required local resource management system • minimum memory required • minimum number of CPUs required • minimum speed of CPU • other parameters passed directly to Globus GRAM
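Requirements of this kind lend themselves to simple matchmaking: filter the known hosts by the stated minimums, and skip the filter when a host name is given. The sketch below is an illustration covering only a subset of the fields listed above; the class names `Requirements` and `Host`, the field names and the sample hosts are hypothetical, not the GRMS data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Requirements:
    os_name: Optional[str] = None
    min_memory_mb: int = 0
    min_cpus: int = 1
    hostname: Optional[str] = None  # if set, no scheduling algorithm is used


@dataclass(frozen=True)
class Host:
    name: str
    os_name: str
    memory_mb: int
    cpus: int


def find_resources(req: Requirements, hosts: List[Host]) -> List[Host]:
    """Return the hosts that satisfy the resource requirements."""
    if req.hostname is not None:
        # An explicit host name bypasses matchmaking entirely.
        return [h for h in hosts if h.name == req.hostname]
    return [
        h for h in hosts
        if (req.os_name is None or h.os_name == req.os_name)
        and h.memory_mb >= req.min_memory_mb
        and h.cpus >= req.min_cpus
    ]


hosts = [
    Host("n1", "Linux", 256, 4),
    Host("n2", "Linux", 64, 8),
    Host("n3", "IRIX", 512, 2),
]
matches = find_resources(Requirements(os_name="Linux", min_memory_mb=128, min_cpus=2), hosts)
pinned = find_resources(Requirements(hostname="n3"), hosts)
```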

Job Description – new elements • Job Description consists of one or more Task descriptions • Each Task can have a section which denotes parent tasks

Job Description - example

<grmsjob appid="MyApplication">
  <task id="1">
    <resource>
      <osname>Linux</osname>
      <memory>128</memory>
      <cpucount>2</cpucount>
    </resource>
    <executable type="single" count="1">
      <file name="String" type="in">
        <url>gsiftp://rage.man.poznan.pl/~/Apps/MyApp</url>
      </file>
      <arguments>
        <value>12</value>
        <value>abc</value>
      </arguments>
      <stdin>
        <url>gsiftp://rage.man.poznan.pl/~/Apps/appstdin.txt</url>
      </stdin>
      <stdout>
        <url>gsiftp://rage.man.poznan.pl/~/Apps/appstdout.txt</url>
      </stdout>
    </executable>
  </task>
</grmsjob>

Job Description – example 2

<grmsjob appid="MyApplication">
  <task id="task1">
    <resource> ... </resource>
    <executable type="single" count="1"> ... </executable>
  </task>
  <task id="task2">
    <resource> ... </resource>
    <executable type="single" count="1"> ... </executable>
    <workflow>
      <parent>task1</parent>
    </workflow>
  </task>
</grmsjob>
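The parent sections define precedence constraints, so a valid execution order is a topological order of the task graph. The sketch below shows one way to derive it with the Python standard library (`graphlib` requires Python 3.9 or later). The XML is a reduced, well-formed variant of the example above with a hypothetical third task added for illustration; the function name `execution_order` is likewise an assumption, not part of GRMS.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter
from typing import List

# Reduced job description: only task ids and parent links are kept.
# task3 is added here for illustration and is not in the original example.
JOB_XML = """
<grmsjob appid="MyApplication">
  <task id="task1"/>
  <task id="task2">
    <workflow><parent>task1</parent></workflow>
  </task>
  <task id="task3">
    <workflow><parent>task1</parent><parent>task2</parent></workflow>
  </task>
</grmsjob>
"""


def execution_order(job_xml: str) -> List[str]:
    root = ET.fromstring(job_xml)
    # Map each task id to the set of parent task ids it must wait for.
    deps = {
        task.get("id"): {p.text.strip() for p in task.findall("./workflow/parent")}
        for task in root.findall("task")
    }
    # static_order() yields every task after all of its parents and raises
    # graphlib.CycleError if the constraints are circular.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(JOB_XML)
print(order)
```

A workflow manager would typically release tasks incrementally as parents finish rather than computing the full order up front; `TopologicalSorter` also supports that style through `prepare()`, `get_ready()` and `done()`.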

Research focus of GRMS • Focus on the infrastructure is not enough for the efficient GRM • Focus on policies • Focus on multicriteria aspects of the GRM • users, their preferences and applications • resource owners’ preferences • preference models, multicriteria decision making, knowledge will be crucial for efficient resource management • Focus on AI techniques for GRM • Focus on business models, economy grids • Cost negotiation mechanisms could be part of the SLA negotiation process (WS-Agreement) contradictory in nature

GRMS and SLA

GRMS and SLA (cont.)

STAKEHOLDERS OF THE GRID RESOURCE MANAGEMENT PROCESS • End-users (consumers) • have requirements concerning their applications (e.g. expect good performance and a good response time) • have requirements concerning resources (e.g. prefer machines with large storage or with certain configurations) • Resource Administrators and Owners (providers) • share resources to achieve some benefits • VO Administrator (criteria and preferences must be secure) • requires robust and reliable provisioning of resources • manages and controls VO by making global policies (community policies)

Multicriteria RM in GridLab • Gathering of information • apps requirements (resource requirements, environment, etc.) • user preferences (which criteria and how important) • user support, preference modeling tools, • Selection phase • choose the best resources (schedule) based on the information provided and on the resource availability (estimates, predictions) • from simple matchmaking to multiple optimisation techniques • Execution phase • file staging, execution control, job monitoring, migration, usually re-selection of resources, application adaptation (application managers, adaptive services from GridLab)
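One simple instance of the selection phase is a weighted sum over normalised criteria, where the weights encode the user's preferences. This is a minimal sketch of that generic technique, not the multicriteria method used in GRMS; the criteria names, weights and candidate values are invented for illustration.

```python
from typing import Dict, List

# Criteria where a smaller raw value is better (an assumption of this sketch).
MINIMISE = {"cost", "expected_runtime"}


def rank(candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Rank candidates by a weighted sum of min-max normalised criteria."""
    scores = {name: 0.0 for name in candidates}
    for crit, w in weights.items():
        values = [c[crit] for c in candidates.values()]
        lo, hi = min(values), max(values)
        for name, c in candidates.items():
            if hi == lo:
                norm = 1.0  # criterion does not discriminate between candidates
            else:
                norm = (c[crit] - lo) / (hi - lo)
                if crit in MINIMISE:
                    norm = 1.0 - norm  # flip so that 1.0 is always "best"
            scores[name] += w * norm
    return sorted(scores, key=scores.get, reverse=True)


candidates = {
    "clusterA": {"cost": 10.0, "expected_runtime": 100.0, "memory": 512.0},
    "clusterB": {"cost": 4.0, "expected_runtime": 160.0, "memory": 256.0},
    "clusterC": {"cost": 7.0, "expected_runtime": 120.0, "memory": 1024.0},
}
cost_first = rank(candidates, {"cost": 0.7, "expected_runtime": 0.2, "memory": 0.1})
time_first = rank(candidates, {"cost": 0.1, "expected_runtime": 0.8, "memory": 0.1})
```

Changing only the weights reorders the same candidates, which is the point of gathering user preferences before selection. Resource owners' preferences could enter the same way, as additional criteria or as constraints on the candidate set.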
