Grid Scheduling through Service-Level Agreement Karl Czajkowski The Globus Project http://www.globus.org/
Overview • Introduction to Grid Environments • The Resource Management Problem • Cross-domain applications • Resource owner goals vs. application goals • An Open Architecture to Manage Resources • Service-Level Agreement (SLA) • GRAM and Managed Services • Related and Ongoing Work
Grid Resource Environment • Distributed users and resources • Variable resource status • Variable grouping and connectivity • Decentralized scheduling/policy [Figure: dispersed users and resources (R, some marked "?") connected by a network and grouped into VO-A and VO-B]
Social/Policy Conflicts • Application Goals • Users: deadlines and availability goals • Applications: need coordinated resources • Localized Resource Owner Goals • Policies towards users • Optimization goals • Community Goals Emerge As: • An aggregate user/application? • A virtual resource? Both!
Data-Intensive Example • Concurrent resource requirements • Large scale storage, computing, network, graphics • Datapath involves autonomous domains [Figure: datapath labeled Parallel I/O, Reduction, Sorting, Transport, Rendering, TCP/IP, Receive Buffer]
Early Co-Allocation in Grids • SF-Express (1997-8) • Real-time simulation • 12+ supercomputers, 1400 processors • Required advance reservation • Brokered by telephone! • Globus DUROC software to sync startup • Over 45 minutes to recover from failure • In use today in MPICH-G2 (MPI library)
Traditional Scheduling • Closed-System Model • Presumption of global owner/authority • Sandboxed applications with no interactions • “Toss job over the fence and wait” • Utilization as Primary Metric • Deep batch queues allow tighter packing • No incentives for matching user schedule • Sub-cultures Counter Site Policies • Users learn tricks for “gaming” their site
An Open Negotiation Model • Resources in a Global Context • Advertisement and negotiation • Normalized remote client interface • Resource maintains autonomy • Users or Agents Bridge Resources • Drive task submission and provisioning • Coordinate acts across domains • Community-based Mediation • Coordination for collective interest
Community Scheduling Example • Individual users • Require service • Have application goals • Community schedulers • Broker service • Aggregate scheduling • Individual resources • Provide service • Have policy autonomy • Serve above clients
Negotiation Phases • Discovery • “What resources are relevant to my interest?” • Finds service providers • Monitoring • “What’s happening to them now?” • Compares service providers • Service-Level Agreement • “Will they provide what I need?” • The core Resource Management problem • Process can iterate due to adaptation
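The three phases above can be sketched as a small loop. This is only an illustrative toy, not Globus code: the `Provider` record, the "most free units first" ranking, and the capacity check standing in for SLA negotiation are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    kind: str        # hypothetical resource class, e.g. "cpu" or "disk"
    free_units: int  # availability as reported by monitoring


def discover(providers, kind):
    """Discovery: which providers are relevant to the request?"""
    return [p for p in providers if p.kind == kind]


def monitor(candidates):
    """Monitoring: compare candidates by current state (most free first)."""
    return sorted(candidates, key=lambda p: p.free_units, reverse=True)


def negotiate(provider, units):
    """SLA step: will this provider promise what is needed? (toy policy)"""
    return provider.free_units >= units


def acquire(providers, kind, units):
    """Try ranked candidates in turn; return the first that agrees, else None."""
    for p in monitor(discover(providers, kind)):
        if negotiate(p, units):
            return p.name
    return None


providers = [
    Provider("siteA", "cpu", 16),
    Provider("siteB", "cpu", 64),
    Provider("siteC", "disk", 500),
]
print(acquire(providers, "cpu", 32))  # siteB
```

In a real Grid the loop would restart from discovery when resource status changes, which is the adaptation the slide refers to.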
Service-Level Agreement • Three kinds of SLA • Task submission (do something) • Resource reservation (pre-agreement) • Lazy task/resource binding (apply reservation) • Simple protocol for negotiating SLAs • Basic 2-party negotiation • Support for basic offer/accept pattern • Optional counter-offer patterns • Variable commitment phase for stricter promises • Client may maintain multiple 2-party SLAs
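A minimal sketch of the offer/accept pattern with an optional counter-offer follows. The `Offer` fields, the `ProviderSide` policy, and the round limit are hypothetical and chosen only to make the message flow concrete; the actual protocol messages are not specified in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    cpus: int
    start_hour: int


class ProviderSide:
    """Hypothetical provider policy: caps CPUs and counter-offers at the cap."""

    def __init__(self, max_cpus):
        self.max_cpus = max_cpus

    def respond(self, offer):
        if offer.cpus <= self.max_cpus:
            return ("accept", offer)
        return ("counter", Offer(self.max_cpus, offer.start_hour))


def negotiate(provider, offer, min_cpus, max_rounds=3):
    """Two-party negotiation: return the agreed Offer, or None on failure."""
    for _ in range(max_rounds):
        verdict, terms = provider.respond(offer)
        if verdict == "accept":
            return terms
        if terms.cpus < min_cpus:
            return None      # counter-offer is below what the client can use
        offer = terms        # client adopts the counter-offer and re-offers
    return None


agreed = negotiate(ProviderSide(max_cpus=64), Offer(cpus=128, start_hour=9), min_cpus=32)
print(agreed)  # Offer(cpus=64, start_hour=9)
```

A client coordinating several resources would run one such exchange per provider, matching the note that a client may maintain multiple 2-party SLAs.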
Many Types of Service • Must support service heterogeneity • Resources • Hardware: disks, CPU, memory, networks, display… • Logical: accounts, services… • Capabilities: space, throughput… • Tasks • Data: stored file, data read/write • Compute: execution, suspended/swapped job • SLAs bear embedded term languages • Isolate domain-specific details
Domain Extension: File Transfer • Single goal • Reliable deadline transfer • Specialized scheduler • Brokers basic services • Synthesizes new service • Fault-handling logic • Distributed resources • Storage space • Storage bandwidth • Network bandwidth
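To make the deadline-transfer goal concrete, here is a rough feasibility check a specialized scheduler might perform before brokering the underlying services. The function names, the decimal unit conversion, and the "slowest stage limits the rate" rule are assumptions for illustration only.

```python
def required_rate_mbps(size_gb, seconds_to_deadline):
    """Minimum sustained rate (megabits/s) to move size_gb before the deadline."""
    if seconds_to_deadline <= 0:
        raise ValueError("deadline already passed")
    megabits = size_gb * 8 * 1000  # decimal gigabytes -> megabits
    return megabits / seconds_to_deadline


def feasible(size_gb, seconds_to_deadline, dest_free_gb, storage_mbps, network_mbps):
    """All three resources must suffice: space, storage bandwidth, network bandwidth."""
    rate = required_rate_mbps(size_gb, seconds_to_deadline)
    return dest_free_gb >= size_gb and min(storage_mbps, network_mbps) >= rate


# 90 GB in one hour needs 90 * 8000 / 3600 = 200 Mb/s end to end.
print(required_rate_mbps(90, 3600))           # 200.0
print(feasible(90, 3600, 100, 400, 250))      # True
print(feasible(90, 3600, 100, 400, 150))      # False (network too slow)
```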
Technical Challenges • Complex Security Requirements • Global Scalability • Similar ideals to Internet • Interoperable infrastructure • Policy-configurable for social needs • Permanence or “Evolve in Place” • Cannot take World off-line for service • Over time: upgrade, extend, adapt • Accept heterogeneity
GRAM Architecture [Figure labels: Application, Coordinator, Planner, Information Service (Monitor & Discover), Domain-specific SLA, Concrete SLA, Incremental SLAs, SLA implementation, GRAM2 (three instances), Local resource managers, Job, CPU, Disk]
WS-Agreement • New standardization effort • Generalizes GRAM ideas • Service-oriented architecture • Resource becomes Service Provider • Tasks become NegotiatedServices • SLAs presented as Agreement services • Still supports extensible domain terms
Agreement-based Jobs • Agreement represents “queue entry” • Commitment with job parameters etc. • Agreement Provider • e.g. job scheduler/queuing system • Management interface to service provider • Service Provider • e.g. scheduled resource (compute nodes) • Service is the Job computation
Advance Reservation for Jobs • Schedule-based commitment of service • Requires schedule-based SLA terms • Optional Pre-Agreement (RSLA) • Agreement to facilitate future Job Agreement • Characterizes virtual resource needed for Job • May not need full job terms • Job Agreement almost as usual • May exploit Pre-Agreement • Reference existing promise of resource schedule • May get schedule commitment in one shot • Directly include schedule terms • (Can think of as atomic advance reserve/claim)
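A schedule-based commitment implies the provider can check a requested time window against promises it has already made. The sketch below shows one simple way to do that admission check; the `ReservationBook` class and its integer time units are hypothetical and not part of GRAM or WS-Agreement.

```python
class ReservationBook:
    """Toy admission control for advance reservations on a fixed node pool."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reservations = []  # list of (start, end, nodes), end exclusive

    def load_at(self, t):
        """Nodes already promised at time t."""
        return sum(n for s, e, n in self.reservations if s <= t < e)

    def reserve(self, start, end, nodes):
        """Accept the reservation only if capacity holds over [start, end)."""
        if start >= end:
            raise ValueError("empty reservation window")
        # Load can only rise at a reservation start, so checking the window
        # start plus every existing start inside the window is sufficient.
        points = [start] + [s for s, _, _ in self.reservations if start < s < end]
        if any(self.load_at(t) + nodes > self.capacity for t in points):
            return False
        self.reservations.append((start, end, nodes))
        return True


book = ReservationBook(capacity=128)
print(book.reserve(9, 12, 100))   # True
print(book.reserve(11, 13, 64))   # False: overlaps the first promise
print(book.reserve(12, 14, 64))   # True: first promise has ended
```

An accepted call plays the role of the Pre-Agreement; a later Job Agreement would reference the stored window instead of negotiating the schedule again.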
Need for Complex Description • 128 physical nodes • Physical topology • Interconnect • RAM, disk size • Subject of RSLA • Single MPI job • Subject of TSLA • May reference RSLAs • Quality requirements • Real-time parameters • CPU, disk performance • Subject of BSLA
Future Models • Service behavioral descriptions • Unified service term model • Capture user/application requirements • Capture provider capabilities • Core meta-language • Facilitates planner/decision designs • Extends with domain concepts • Extensible negotiability mark-up • Capture range of negotiability for variable terms • Capture importance of terms (required/optional) • Capture cost of options (fees/penalties)
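One possible reading of the negotiability mark-up is a per-term record carrying an acceptable range, a required/optional flag, and a cost. The sketch below is an assumption about how such terms could be evaluated, not a proposed standard; the `Term` fields and the penalty rule are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Term:
    name: str
    preferred: float
    minimum: float        # lowest acceptable value (range of negotiability)
    required: bool        # importance: required vs. optional
    penalty: float = 0.0  # cost incurred when an optional term is not met


def evaluate(terms, offered):
    """Return (acceptable, total_penalty) for an offer given as {name: value}."""
    total = 0.0
    for t in terms:
        value = offered.get(t.name)
        if value is not None and value >= t.minimum:
            continue                 # term satisfied within its range
        if t.required:
            return (False, 0.0)      # a required term failed: reject outright
        total += t.penalty           # optional term missed: accept at a cost
    return (True, total)


terms = [
    Term("cpus", preferred=128, minimum=64, required=True),
    Term("ram_gb", preferred=512, minimum=256, required=False, penalty=10.0),
]
print(evaluate(terms, {"cpus": 96}))                  # (True, 10.0)
print(evaluate(terms, {"cpus": 32, "ram_gb": 512}))   # (False, 0.0)
```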
SLA Types in Depth • Resource SLA (RSLA), i.e. reservation • A promise of resource availability • Client must utilize promise in subsequent SLAs • Task SLA (TSLA), i.e. execution • A promise to perform a task • Complex task requirements • May reference an RSLA (implicit binding) • Binding SLA (BSLA), i.e. claim • Binds a resource capability to a TSLA • May reference an RSLA (otherwise obtain implicitly) • May be created lazily to provision the task
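The relationships among the three SLA types can be sketched as plain records. The class and field names below are hypothetical, and the capacity check in `bind` is an added assumption; only the references between SLAs follow the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RSLA:
    """Reservation: a promise of resource availability."""
    resource: str
    amount: int


@dataclass
class TSLA:
    """Execution: a promise to perform a task; may reference an RSLA."""
    task: str
    rsla: Optional[RSLA] = None


@dataclass
class BSLA:
    """Claim: binds a resource capability to a TSLA."""
    tsla: TSLA
    amount: int
    rsla: Optional[RSLA] = None


def bind(tsla, amount, rsla=None):
    """Create a BSLA, falling back to the TSLA's own RSLA when none is given."""
    source = rsla if rsla is not None else tsla.rsla
    if source is not None and amount > source.amount:
        raise ValueError("binding exceeds reserved amount")
    return BSLA(tsla=tsla, amount=amount, rsla=source)


scratch = RSLA(resource="scratch_gb", amount=50)
job = TSLA(task="complex job", rsla=scratch)
claim = bind(job, 30)          # lazily created when the task is provisioned
print(claim.rsla is scratch)   # True
```

When neither the BSLA nor the TSLA names an RSLA, `source` stays `None`, which corresponds to the provider obtaining the resource implicitly.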
Resource Lifecycle • S0: Start with no SLAs • S1: Create SLAs • TSLA or RSLA • S2: Bind task/resource • Explicit BSLA • Implicit provider schedule • S3: Active task • Resource consumption • Backtrack to S0 • On task completion • On expiration • On failure
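The lifecycle reads naturally as a small state machine. In this sketch the event names are hypothetical, and allowing the backtrack events from any state is an assumption; the slide only lists the states and the three reasons for returning to S0.

```python
from enum import Enum


class State(Enum):
    S0 = "no SLAs"
    S1 = "SLAs created"
    S2 = "task/resource bound"
    S3 = "active task"


# Forward transitions; event names are illustrative only.
TRANSITIONS = {
    (State.S0, "create_sla"): State.S1,  # TSLA or RSLA
    (State.S1, "bind"): State.S2,        # explicit BSLA or implicit schedule
    (State.S2, "start"): State.S3,       # resource consumption begins
}
BACKTRACK = {"complete", "expire", "fail"}


def step(state, event):
    """Advance the lifecycle by one event, or backtrack to S0."""
    if event in BACKTRACK:
        return State.S0
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal event {event!r} in state {state.name}") from None


state = State.S0
for event in ["create_sla", "bind", "start", "complete"]:
    state = step(state, event)
    print(event, "->", state.name)
```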
Incremental Negotiation • RSLA: reserve resources for future use • TSLA: submit task to scheduler • BSLA: bind reservation to task • Resources change state due to SLAs and scheduler decisions
Linking SLAs for Complex Case • Dependent SLAs nest intrinsically • BSLA2 defined in terms of RSLA2 and TSLA4 • Chained SLAs simplify negotiation • Optionally link destruction/reclamation time [Figure: TSLA1 (account tmpuser1); RSLA1 (50 GB in /scratch filesystem); BSLA1 (30 GB for /scratch/tmpuser1/foo/* files); TSLA2 (complex job); TSLA3, TSLA4, RSLA2, and BSLA2 with labels Net, Stage in, Stage out]
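Linked destruction can be sketched as a dependency graph in which destroying one SLA also reclaims every SLA defined in terms of it. The `SLAGraph` class is hypothetical; only the SLA names and the BSLA2 dependency come from the slide.

```python
from collections import defaultdict


class SLAGraph:
    """Toy registry of SLAs and the SLAs defined in terms of them."""

    def __init__(self):
        self.dependents = defaultdict(set)  # parent -> dependent SLA names
        self.live = set()

    def add(self, name, depends_on=()):
        missing = [p for p in depends_on if p not in self.live]
        if missing:
            raise KeyError(f"unknown parent SLAs: {missing}")
        self.live.add(name)
        for parent in depends_on:
            self.dependents[parent].add(name)

    def destroy(self, name):
        """Destroy an SLA and, transitively, everything that depends on it."""
        removed = []
        stack = [name]
        while stack:
            current = stack.pop()
            if current in self.live:
                self.live.remove(current)
                removed.append(current)
                stack.extend(self.dependents.pop(current, ()))
        return sorted(removed)


g = SLAGraph()
g.add("RSLA2")
g.add("TSLA4")
g.add("BSLA2", depends_on=("RSLA2", "TSLA4"))
print(g.destroy("RSLA2"))  # ['BSLA2', 'RSLA2']
print(sorted(g.live))      # ['TSLA4']
```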
Related Work • Academic Contemporaries • Condor Matchmaking • Economy-based Scheduling • Work-flow Planning • Commercial Scheduler Examples • Many examples for traditional sites • Several generalized for “the enterprise” • Platform Computing • LSF scaled to lots of jobs • MultiCluster for site-to-site resource sharing • IBM eWLM • Goal-based provisioning of transactional flows
Condor Matchmaking • At heart: a scheduling algorithm • Heuristics for pairing job with resource • Match symmetric “Classified Ads” • Great for bulk/commodity matching • Closed system view • Subsumes resource through “lease” • Sandboxed job environment • Favor vertical integration over generality • Tuned high-throughput system
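The idea of symmetric ads can be illustrated with a toy matcher in which each side publishes attributes plus a requirement on the other side. This is not the Condor ClassAd language or API; the dictionary layout, the lambda predicates, and the first-fit pairing are assumptions for the example.

```python
def matches(job, machine):
    """Symmetric match: each ad's requirements must hold for the other ad."""
    return job["requirements"](machine) and machine["requirements"](job)


def matchmake(jobs, machines):
    """First-fit pairing of jobs to still-free machines."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                pairs.append((job["name"], machine["name"]))
                free.remove(machine)
                break  # stop iterating after mutating the list
    return pairs


jobs = [
    {"name": "j1", "owner": "alice", "requirements": lambda m: m["memory"] >= 4},
    {"name": "j2", "owner": "bob", "requirements": lambda m: m["memory"] >= 1},
]
machines = [
    {"name": "m1", "memory": 2, "requirements": lambda j: j["owner"] != "bob"},
    {"name": "m2", "memory": 8, "requirements": lambda j: True},
]
print(matchmake(jobs, machines))  # [('j1', 'm2')]
```

Here j2 stays unmatched because the only remaining machine's own policy rejects it, which shows why the requirement check has to run in both directions.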
Future Work • SLA interaction with policy • SLA negotiation subject to policy • One SLA affects another, e.g. RSLA subdivision • One client “more important” than another • SLA implemented by low-level policies • Domain-specific SLA maps to resource SLAs • Resource SLAs map to resource control mechanisms • Resource characterization • Advertisement of resources: options, cost • Interoperable capability languages
Conclusion • Generic SLA management • Compositional for complex scenarios • Extensible for unique requirements • Requires work on Grid service modeling • To describe jobs, resource requirements, etc. • Enhancement to proven architectures • Encompasses GRAM+GARA • Evolution of the Globus Toolkit RM • GRAM evolving since 1997 • WS-Agreement standard in progress