
Grid Scheduling through Service-Level Agreement


Presentation Transcript


  1. Grid Scheduling through Service-Level Agreement Karl Czajkowski The Globus Project http://www.globus.org/

  2. Overview • Introduction to Grid Environments • The Resource Management Problem • Cross-domain applications • Resource owner goals vs. application goals • An Open Architecture to Manage Resources • Service-Level Agreement (SLA) • GRAM and Managed Services • Related and Ongoing Work

  3. Grid Resource Environment [Figure: resources (R) and dispersed users connected across a network, with variable grouping into virtual organizations VO-A and VO-B] • Distributed users and resources • Variable resource status • Variable grouping and connectivity • Decentralized scheduling/policy

  4. Social/Policy Conflicts • Application Goals • Users: deadlines and availability goals • Applications: need coordinated resources • Localized Resource Owner Goals • Policies towards users • Optimization goals • Community Goals Emerge As: • An aggregate user/application? • A virtual resource? Both!

  5. Data-Intensive Example [Figure: data pipeline spanning Parallel I/O, Reduction, Sorting, Transport, Rendering, and TCP/IP receive buffers] • Concurrent resource requirements • Large scale storage, computing, network, graphics • Datapath involves autonomous domains

  6. Early Co-Allocation in Grids • SF-Express (1997-8) • Real-time simulation • 12+ supercomputers, 1400 processors • Required advance reservation • Brokered by telephone! • Globus DUROC software to sync startup • Over 45 minutes to recover from failure • In use today in MPICH-G2 (MPI library)

  7. Traditional Scheduling • Closed-System Model • Presumption of global owner/authority • Sandboxed applications with no interactions • “Toss job over the fence and wait” • Utilization as Primary Metric • Deep batch queues allow tighter packing • No incentives for matching user schedule • Sub-cultures Counter Site Policies • Users learn tricks for “gaming” their site

  8. An Open Negotiation Model • Resources in a Global Context • Advertisement and negotiation • Normalized remote client interface • Resource maintains autonomy • Users or Agents Bridge Resources • Drive task submission and provisioning • Coordinate acts across domains • Community-based Mediation • Coordination for collective interest

  9. Community Scheduling Example • Individual users • Require service • Have application goals • Community schedulers • Broker service • Aggregate scheduling • Individual resources • Provide service • Have policy autonomy • Serve above clients

  10. Negotiation Phases • Discovery • “What resources are relevant to interest?” • Finds service providers • Monitoring • “What’s happening to them now?” • Compare service providers • Service-Level Agreement • “Will they provide what I need?” • The core Resource Management problem • Process can iterate due to adaptation
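
Read as a client-side loop, the three phases chain together: discovery finds candidate providers, monitoring ranks them, and negotiation either yields an SLA or sends the client back around (adaptation). The sketch below is a minimal, self-contained illustration under that reading; the provider records, scoring rule, and acceptance test are invented and are not part of any Globus interface.

    # Minimal sketch of the discovery -> monitoring -> SLA loop.
    # All data and rules here are invented for illustration.
    providers = [
        {"name": "siteA", "free_cpus": 16, "load": 0.7},
        {"name": "siteB", "free_cpus": 64, "load": 0.2},
    ]

    def discover(request):
        # Discovery: "what resources are relevant to my interest?"
        return [p for p in providers if p["free_cpus"] >= request["cpus"]]

    def current_load(provider):
        # Monitoring: "what's happening to them now?" (lower load ranks first)
        return provider["load"]

    def negotiate(provider, request):
        # SLA: "will they provide what I need?" (toy provider policy)
        if provider["free_cpus"] >= request["cpus"]:
            return {"provider": provider["name"], "terms": dict(request)}
        return None

    request = {"cpus": 32}
    for p in sorted(discover(request), key=current_load):
        sla = negotiate(p, request)
        if sla:
            print("agreed:", sla)
            break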

  11. Service-Level Agreement • Three kinds of SLA • Task submission (do something) • Resource reservation (pre-agreement) • Lazy task/resource binding (apply reservation) • Simple protocol for negotiating SLAs • Basic 2-party negotiation • Support for basic offer/accept pattern • Optional counter-offer patterns • Variable commitment phase for stricter promises • Client may maintain multiple 2-party SLAs
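
The offer/accept and counter-offer patterns amount to a short two-party exchange. The toy below is only a sketch of that pattern; the message shapes, the provider's counter-offer rule, and the client's acceptance floor are assumptions, not the GRAM or WS-Agreement protocol.

    # Toy two-party negotiation: offer, optional counter-offer, accept.
    # Message shapes and policies are invented for illustration.
    def provider_respond(offer, available_nodes=48):
        if offer["nodes"] <= available_nodes:
            return {"kind": "accept", "terms": offer}
        # Counter-offer with what the provider can actually commit to.
        return {"kind": "counter", "terms": {**offer, "nodes": available_nodes}}

    def client_negotiate(offer, floor=32):
        reply = provider_respond(offer)
        if reply["kind"] == "counter" and reply["terms"]["nodes"] >= floor:
            # The counter-offer is good enough: re-offer it to commit.
            reply = provider_respond(reply["terms"])
        return reply["terms"] if reply["kind"] == "accept" else None

    sla = client_negotiate({"nodes": 64, "walltime_min": 120})
    print(sla)   # the 48-node counter-offer clears the floor, so an SLA forms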

  12. Many Types of Service • Must support service heterogeneity • Resources • Hardware: disks, CPU, memory, networks, display… • Logical: accounts, services… • Capabilities: space, throughput… • Tasks • Data: stored file, data read/write • Compute: execution, suspended/swapped job • SLAs bear embedded term languages • Isolate domain-specific details
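
One way to picture the embedded term language: the SLA envelope carries a few generic fields that the negotiation and management layers understand, while domain-specific terms ride along as an opaque payload that only the matching domain scheduler interprets. The field names and values below are illustrative, not a Globus schema.

    # Two SLAs with the same generic envelope but different embedded
    # term languages (compute job vs. data transfer). Illustrative only.
    compute_sla = {
        "type": "task",                     # generic: task / reservation / binding
        "parties": ["client-X", "site-A"],  # generic: who agreed
        "expires": "2003-06-30T12:00Z",     # generic: lifetime
        "terms": {                          # domain-specific: compute terms
            "executable": "/bin/a.out",
            "count": 64,
        },
    }
    transfer_sla = {
        "type": "task",
        "parties": ["client-X", "storage-B"],
        "expires": "2003-06-30T12:00Z",
        "terms": {                          # domain-specific: transfer terms
            "source": "gsiftp://a.example.org/data",
            "destination": "gsiftp://b.example.org/data",
            "deadline_min": 90,
        },
    }

    def still_valid(sla, now="2003-06-01T00:00Z"):
        # The generic layer reasons only over envelope fields;
        # it never looks inside sla["terms"].
        return sla["expires"] > now

    print(still_valid(compute_sla), still_valid(transfer_sla))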

  13. Domain Extension: File Transfer • Single goal • Reliable deadline transfer • Specialized scheduler • Brokers basic services • Synthesizes new service • Fault-handling logic • Distributed resources • Storage space • Storage bandwidth • Network bandwidth
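
The transfer scheduler can be sketched as a small broker that derives the bandwidth needed to meet the deadline and assembles SLAs covering the whole datapath, retrying on failure. The negotiate() helper and the SLA shapes below are invented; this is an illustration of the composition idea, not the actual service.

    # Sketch: synthesize "reliable deadline transfer" from lower-level SLAs.
    def negotiate(resource, need):
        # Stand-in for a real per-resource negotiation; assume it succeeds.
        return {"resource": resource, "need": need}

    def deadline_transfer(size_gb, deadline_s, retries=3):
        rate_gbit_s = size_gb * 8 / deadline_s   # bandwidth needed to make the deadline
        for _attempt in range(retries):
            slas = [
                negotiate("dest-storage-space", {"gb": size_gb}),
                negotiate("source-storage-bandwidth", {"gbit_s": rate_gbit_s}),
                negotiate("network-bandwidth", {"gbit_s": rate_gbit_s}),
            ]
            if all(slas):
                return slas   # a coordinated set covering the whole datapath
            # Fault handling: a fuller version would release any partial
            # SLAs here before retrying.
        raise RuntimeError("could not assemble transfer SLAs")

    print(deadline_transfer(size_gb=500, deadline_s=3600))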

  14. Technical Challenges • Complex Security Requirements • Global Scalability • Similar ideals to Internet • Interoperable infrastructure • Policy-configurable for social needs • Permanence or “Evolve in Place” • Cannot take World off-line for service • Over time: upgrade, extend, adapt • Accept heterogeneity

  15. GRAM Architecture [Figure: an application's Planner issues domain-specific SLAs to a Coordinator (the SLA implementation), which monitors and discovers resources through an Information Service and translates the request into concrete, incremental SLAs with local resource managers (GRAM2) controlling Job, CPU, and Disk]

  16. WS-Agreement • New standardization effort • Generalizes GRAM ideas • Service-oriented architecture • Resource becomes Service Provider • Tasks become NegotiatedServices • SLAs presented as Agreement services • Still supports extensible domain terms

  17. WS-Agreement Entities

  18. WS-Agreement Adds Management

  19. Virtualized Providers

  20. Agreement-based Jobs • Agreement represents “queue entry” • Commitment with job parameters etc. • Agreement Provider • i.e. Job scheduler/Queuing system • Management interface to service provider • Service Provider • i.e. scheduled resource (compute nodes) • Service is the Job computation

  21. Advance Reservation for Jobs • Schedule-based commitment of service • Requires schedule-based SLA terms • Optional Pre-Agreement (RSLA) • Agreement to facilitate future Job Agreement • Characterizes virtual resource needed for Job • May not need full job terms • Job Agreement almost as usual • May exploit Pre-Agreement • Reference existing promise of resource schedule • May get schedule commitment in one shot • Directly include schedule terms • (Can think of as atomic advance reserve/claim)
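
The two routes above, a pre-agreement that a later job claims versus a one-shot job carrying its own schedule terms, can be contrasted as two request shapes. The dictionaries below are purely illustrative; the field names are not GRAM RSL or WS-Agreement terms.

    # Route 1: pre-agreement (RSLA) first, then a job agreement that
    # references the existing promise of a resource schedule.
    rsla = {"id": "RSLA-7", "nodes": 64, "start": "14:00", "end": "16:00"}
    job_via_rsla = {
        "type": "job",
        "reservation": rsla["id"],    # exploit the pre-agreement
        "executable": "/bin/sim",
    }

    # Route 2: one-shot job, with schedule terms included directly in the
    # agreement (an atomic advance reserve plus claim).
    job_one_shot = {
        "type": "job",
        "nodes": 64,
        "start": "14:00",
        "end": "16:00",
        "executable": "/bin/sim",
    }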

  22. Need for Complex Description • 128 physical nodes: physical topology, interconnect, RAM, disk size (subject of RSLA) • Single MPI job (subject of TSLA; may reference RSLAs) • Quality requirements: real-time parameters, CPU, disk performance (subject of BSLA)

  23. MDS Resource Models (History)

  24. Future Models • Service behavioral descriptions • Unified service term model • Capture user/application requirements • Capture provider capabilities • Core meta-language • Facilitates planner/decision designs • Extends with domain concepts • Extensible negotiability mark-up • Capture range of negotiability for variable terms • Capture importance of terms (required/optional) • Capture cost of options (fees/penalties)

  25. SLA Types in Depth • Resource SLA (RSLA), i.e. reservation • A promise of resource availability • Client must utilize promise in subsequent SLAs • Task SLA (TSLA), i.e. execution • A promise to perform a task • Complex task requirements • May reference an RSLA (implicit binding) • Binding SLA (BSLA), i.e. claim • Binds a resource capability to a TSLA • May reference an RSLA (otherwise obtain implicitly) • May be created lazily to provision the task
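
As a type sketch, the three SLA kinds and the references between them might look like the following. These classes only illustrate the relationships described above; they are not the actual GRAM protocol types.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RSLA:                       # reservation: a promise of availability
        resource: str
        capacity: dict
        window: tuple                 # (start, end)

    @dataclass
    class TSLA:                       # execution: a promise to perform a task
        task: dict
        rsla: Optional[RSLA] = None   # implicit binding if a reservation is referenced

    @dataclass
    class BSLA:                       # claim: binds a capability to a task
        tsla: TSLA
        capability: dict
        rsla: Optional[RSLA] = None   # otherwise obtained implicitly by the provider

    # BSLAs can be created lazily, after the task is already running.
    resv = RSLA("cluster-A", {"cpus": 64}, ("14:00", "16:00"))
    job = TSLA({"executable": "/bin/sim"}, rsla=resv)
    claim = BSLA(job, {"disk_gb": 30})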

  26. Resource Lifecycle • S0: Start with no SLAs • S1: Create SLAs • TSLA or RSLA • S2: Bind task/resource • Explicit BSLA • Implicit provider schedule • S3: Active task • Resource consumption • Backtrack to S0 • On task completion • On expiration • On failure
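
The lifecycle reads naturally as a small state machine. The states and events below follow the slide; treating expiration and failure as events out of the active state S3 only is a simplifying assumption of this sketch.

    # Resource lifecycle from the slide as a transition table.
    TRANSITIONS = {
        ("S0", "create_sla"): "S1",   # TSLA or RSLA established
        ("S1", "bind"): "S2",         # explicit BSLA or implicit provider schedule
        ("S2", "activate"): "S3",     # task actively consuming the resource
        ("S3", "complete"): "S0",     # backtrack on completion,
        ("S3", "expire"): "S0",       #   on expiration,
        ("S3", "fail"): "S0",         #   or on failure
    }

    def step(state, event):
        # Unknown events leave the state unchanged in this sketch.
        return TRANSITIONS.get((state, event), state)

    state = "S0"
    for event in ["create_sla", "bind", "activate", "complete"]:
        state = step(state, event)
    print(state)   # back to S0 once the task completes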

  27. Incremental Negotiation • RSLA: reserve resources for future use • TSLA: submit task to scheduler • BSLA: bind reservation to task • Resources change state due to SLAs and scheduler decisions

  28. Linking SLAs for Complex Case [Figure: TSLA1 (account tmpuser1), RSLA1 (50 GB in /scratch filesystem), BSLA1 (30 GB for /scratch/tmpuser1/foo/* files), TSLA2 (complex job) with stage-in/stage-out tasks TSLA3 and TSLA4, RSLA2 (network), and BSLA2] • Dependent SLAs nest intrinsically • BSLA2 defined in terms of RSLA2 and TSLA4 • Chained SLAs simplify negotiation • Optionally link destruction/reclamation time
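
Written out as linked records, the chained example in the figure might look like the sketch below. The link fields, and the assignment of stage-in to TSLA3 and stage-out to TSLA4, are assumptions for illustration; only the BSLA2 dependence on RSLA2 and TSLA4 is stated on the slide.

    # The chained-SLA example as linked records (illustrative only).
    tsla1 = {"id": "TSLA1", "task": "account tmpuser1"}
    rsla1 = {"id": "RSLA1", "promise": "50 GB in /scratch filesystem"}
    bsla1 = {"id": "BSLA1",
             "claim": "30 GB for /scratch/tmpuser1/foo/* files",
             "uses": ["RSLA1", "TSLA1"]}          # inferred from the path, not stated

    tsla2 = {"id": "TSLA2", "task": "complex job"}
    tsla3 = {"id": "TSLA3", "task": "stage in",  "part_of": "TSLA2"}   # assumed assignment
    tsla4 = {"id": "TSLA4", "task": "stage out", "part_of": "TSLA2"}   # assumed assignment
    rsla2 = {"id": "RSLA2", "promise": "network bandwidth"}
    bsla2 = {"id": "BSLA2", "uses": ["RSLA2", "TSLA4"]}   # as stated on the slide

    # Optionally, destruction/reclamation times can be linked so that
    # tearing down one SLA in the chain reclaims its dependents.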

  29. Related Work • Academic Contemporaries • Condor Matchmaking • Economy-based Scheduling • Work-flow Planning • Commercial Scheduler Examples • Many examples for traditional sites • Several generalized for “the enterprise” • Platform Computing • LSF scaled to lots of jobs • MultiCluster for site-to-site resource sharing • IBM eWLM • Goal-based provisioning of transactional flows

  30. Condor Matchmaking • At heart: a scheduling algorithm • Heuristics for pairing job with resource • Match symmetric “Classified Ads” • Great for bulk/commodity matching • Closed system view • Subsumes resource through “lease” • Sandboxed job environment • Favor vertical integration over generality • Tuned high-throughput system
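
As a much-simplified illustration of the symmetric matching idea: both the job ad and the machine ad carry a requirements predicate about the other side, and a match exists only when each satisfies the other. This sketch is not ClassAd syntax and not Condor's actual algorithm.

    # Symmetric "classified ad" matching, heavily simplified.
    job_ad = {
        "attrs": {"memory_mb": 2048, "owner": "alice"},
        "requires": lambda other: other["attrs"]["memory_mb"] >= 2048,
    }
    machine_ad = {
        "attrs": {"memory_mb": 4096},
        "requires": lambda other: other["attrs"]["owner"] in {"alice", "bob"},
    }

    def match(a, b):
        # A pairing holds only if both requirements are satisfied.
        return a["requires"](b) and b["requires"](a)

    print(match(job_ad, machine_ad))   # True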

  31. Future Work • SLA interaction with policy • SLA negotiation subject to policy • One SLA affects another, e.g. RSLA subdivision • One client “more important” than another • SLA implemented by low-level policies • Domain-specific SLA maps to resource SLAs • Resource SLAs map to resource control mechanisms • Resource characterization • Advertisement of resources: options, cost • Interoperable capability languages

  32. Conclusion • Generic SLA management • Compositional for complex scenarios • Extensible for unique requirements • Requires work on Grid service modeling • To describe jobs, resource requirements, etc. • Enhancement to proven architectures • Encompasses GRAM+GARA • Evolution of the Globus Toolkit RM • GRAM evolving since 1997 • WS-Agreement standard in progress
