SLA Management in AssessGrid Dominic Battré, TU Berlin
AssessGrid in a Nutshell • Requirement for Service Level Agreements from users • Reluctance to sign SLAs by providers
AssessGrid in a Nutshell [Figure: failed jobs vs. successful jobs on DAS-2, Grid 3, TeraGrid, …; statistics from 2005/2006]
AssessGrid in a Nutshell • User: • Which provider is reliable? • How reliable is a provider? • Does a provider lie? • Provider: • How reliable am I? • Can I sign SLAs? • Can I improve my reliability?
Agenda • AssessGrid in a Nutshell • Content of SLAs • Demo • Job submission and provider selection • Fault Tolerance • Underlying technology • Negotiation Manager • Risk Assessment and Management • Content of SLAs as WS-Agreement • Future Challenges
Content of SLAs • Participating parties • Job Definition • Scheduling • Executable • File Staging • Acceptable Probability of Failure • Price and penalty [Figure: schedule of Jobs 1 to 7 plotted on axes nodes vs. time. Each job is specified with: number of nodes, runtime, earliest start time, latest finish time]
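The per-job fields listed on this slide can be sketched as a small record. This is a minimal illustration, not the AssessGrid data model: the class name `JobRequest`, its field names, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical container for the per-job SLA fields named on the slide."""
    nodes: int                 # number of nodes
    runtime: timedelta         # requested runtime
    earliest_start: datetime   # earliest start time
    latest_finish: datetime    # latest finish time
    max_pof: float             # acceptable probability of failure, in [0, 1]
    price_eur: float
    penalty_eur: float

    def slack(self) -> timedelta:
        # Time by which the job can be shifted inside its window.
        return (self.latest_finish - self.earliest_start) - self.runtime

    def is_schedulable(self) -> bool:
        # The job fits only if the window is at least as long as the runtime.
        return (
            self.nodes > 0
            and 0.0 <= self.max_pof <= 1.0
            and self.slack() >= timedelta(0)
        )


# Sample values are invented for illustration.
job = JobRequest(
    nodes=4,
    runtime=timedelta(hours=2),
    earliest_start=datetime(2008, 1, 1, 8, 0),
    latest_finish=datetime(2008, 1, 1, 12, 0),
    max_pof=0.05,
    price_eur=100.0,
    penalty_eur=200.0,
)
```

The slack between the time window and the runtime is what gives a scheduler room to move a job along the time axis.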
Job Submission and Provider Selection
Job Submission and Provider Selection Specify Job End-User Broker Providers • Program, Input, Output • Acceptable PoF • Penalty in case of failure • Deadline
Job Submission and Provider Selection Get Quotes End-User Broker Providers
Job Submission and Provider Selection Get Quotes End-User Broker Providers • Forwarding based on • Matching of templates to request • Quotes created in the past • Performance in the past
Job Submission and Provider Selection Generate Quotes End-User Broker Providers • Calculate Probability of Failure (PoF) • Calculate required number of spare nodes, extra time • Calculate price • Check available resources in schedule
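How a probability of failure and a number of spare nodes might relate can be sketched with a simple model. The slides do not state the model AssessGrid uses; the sketch below assumes independent node failures at a constant rate, and treats a job as failed when more nodes fail than there are spares. All function names are hypothetical.

```python
import math
from typing import Optional


def node_failure_prob(failures_per_hour: float, runtime_hours: float) -> float:
    # Assumption: exponentially distributed time to failure per node.
    return 1.0 - math.exp(-failures_per_hour * runtime_hours)


def job_pof(nodes: int, spares: int, p: float) -> float:
    # Assumption: the job survives if at most `spares` of the
    # nodes + spares machines fail (binomial model, independent nodes).
    total = nodes + spares
    survive = sum(
        math.comb(total, k) * p**k * (1.0 - p) ** (total - k)
        for k in range(spares + 1)
    )
    return 1.0 - survive


def spares_needed(nodes: int, p: float, max_pof: float,
                  max_spares: int = 64) -> Optional[int]:
    # Smallest number of spare nodes that brings the PoF under the bound,
    # or None if no count up to max_spares is enough.
    for spares in range(max_spares + 1):
        if job_pof(nodes, spares, p) <= max_pof:
            return spares
    return None
```

Under these assumptions a provider could refuse a quote when `spares_needed` returns `None`, or price the extra nodes into the quote otherwise.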
Job Submission and Provider Selection Quotes End-User Broker Providers
Job Submission and Provider Selection Enhance Quotes End-User Broker Providers • Broker's own PoF estimate in case of unreliable providers • Rank quotes according to the user's preferences
Job Submission and Provider Selection Quotes End-User Broker Providers
Job Submission and Provider Selection Select Provider End-User Broker Providers • Criteria: • Price, PoF, Adjusted PoF • AHP-Ranking
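The selection criteria above (price, PoF, adjusted PoF) can be illustrated with a simple shortlist. This sketch is a stand-in, not the broker's ranking: it filters on the broker-adjusted PoF and sorts by price, and it does not implement the AHP ranking. The `Quote` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Quote:
    provider: str
    price_eur: float
    pof: float            # PoF as stated by the provider
    adjusted_pof: float   # PoF as re-estimated by the broker


def shortlist(quotes: List[Quote], max_pof: float) -> List[Quote]:
    # Keep quotes whose adjusted PoF meets the user's bound,
    # cheapest first; ties are broken by the lower adjusted PoF.
    acceptable = [q for q in quotes if q.adjusted_pof <= max_pof]
    return sorted(acceptable, key=lambda q: (q.price_eur, q.adjusted_pof))
```

Filtering on the adjusted value rather than the stated one reflects the broker's role of correcting over-optimistic providers.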
Job Submission and Provider Selection Get Reputation End-User Broker Providers
DS Analytical Hierarchy Process [Figure: criteria hierarchy. Criteria: Past Performance, Maintenance, Security, Customer Support, Infrastructure, Experience. Under Maintenance: Staff 24/7, Staff training/yr, Staff experience. Under Infrastructure: Red. Power, Red. Storage, Storage Age, …]
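For readers unfamiliar with AHP, the step that turns pairwise comparisons of criteria into weights can be sketched as follows. This shows the classic AHP row geometric mean approximation, not the DS variant named on the slide, and the comparison values in the usage note are invented.

```python
import math
from typing import List


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    # pairwise[i][j] states how much more important criterion i is
    # than criterion j; pairwise[j][i] should be its reciprocal.
    # Row geometric means, normalised to sum to 1, approximate the
    # principal eigenvector used in classic AHP.
    means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(means)
    return [m / total for m in means]
```

With three criteria compared as `[[1, 2, 4], [0.5, 1, 2], [0.25, 0.5, 1]]`, the first criterion receives four times the weight of the third. For a perfectly consistent matrix like this one the approximation is exact.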
Job Submission and Provider Selection Create Agreement End-User Broker Providers
Underlying Technology: The Negotiation Manager
Negotiation Manager • Globus Toolkit 4 • Apache 2 License • 2 Flavours • Simple Framework • AssessGrid Implementation (OpenCCS, Risk Assessment, …) • Features • Template Store • Access Control, Credential Delegation • State Management • Staging by GridFTP • Simple Validation of CreationConstraints • Extensible • WS-Notification • Optional: Quote Mechanism • Optional: Cheap Cancellation Extension
Template Store • Optional component • Templates stored persistently in RDBMS • Get, Insert, Delete by WS-RF • Monitoring by WS-Notification • Access policies: • Everybody can read • Admin(s) can modify • Templates used in AssessGrid • Regular Job (POSIX and SPMD) • Out-sourced Job with checkpoint data-set
Access Control • Default: • 3 User Groups • Admins, Owners, Users • Admin has access to anything • Owner is legally responsible • Users have read access • Owner and Users are different in case of SLA outsourcing • Defaults can be overridden • Option to delegate credentials
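The default policy with three groups can be sketched as a lookup table that a deployment may replace. The action names and the rights given to owners are assumptions; the slide only states that admins can access anything and that users have read access.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Role(Enum):
    ADMIN = "admin"
    OWNER = "owner"
    USER = "user"


# Hypothetical action names; the owner's rights are an assumption.
DEFAULT_POLICY: Dict[Role, Set[str]] = {
    Role.ADMIN: {"read", "modify", "terminate"},
    Role.OWNER: {"read", "terminate"},
    Role.USER: {"read"},
}


def is_allowed(role: Role, action: str,
               policy: Optional[Dict[Role, Set[str]]] = None) -> bool:
    # Passing a custom policy overrides the defaults.
    active = DEFAULT_POLICY if policy is None else policy
    return action in active.get(role, set())
```

Keeping the policy as data rather than code is one simple way to make the defaults replaceable per deployment.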
State Management • Asynchronous, multi-threaded, persistent state management [State diagram: Start, Wait for stage-in, Do stage-in, Stage-in done, Wait for execution, Wait for termination, Do stage-out, Stage-out done, Cleanup]
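The states named on this slide can be sketched as a small state machine. The strictly linear order used here is an assumption, since the arrows of the original diagram are not available, and the sketch leaves out threading. Persistence is only hinted at by saving and restoring the state name.

```python
from enum import Enum
from typing import Optional


class JobState(Enum):
    START = "Start"
    WAIT_FOR_STAGE_IN = "Wait for stage-in"
    DO_STAGE_IN = "Do stage-in"
    STAGE_IN_DONE = "Stage-in done"
    WAIT_FOR_EXECUTION = "Wait for execution"
    WAIT_FOR_TERMINATION = "Wait for termination"
    DO_STAGE_OUT = "Do stage-out"
    STAGE_OUT_DONE = "Stage-out done"
    CLEANUP = "Cleanup"


# Assumed linear order: the order in which the members are defined.
_ORDER = list(JobState)


def next_state(state: JobState) -> Optional[JobState]:
    # Returns the following state, or None after the final one.
    index = _ORDER.index(state)
    return _ORDER[index + 1] if index + 1 < len(_ORDER) else None


def save(state: JobState) -> str:
    # A persistent store would keep this string with the agreement.
    return state.name


def restore(name: str) -> JobState:
    return JobState[name]
```

Storing the state name lets a restarted service resume each agreement where it left off.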
File-staging • Files specified by JSDL • User delegates credentials • User estimates duration • Shorter duration triggers earlier execution • Longer duration triggers later execution • Staging by GridFTP
CreationConstraints • Difficult to support namespaces: • //wsag:…/assessgrid:… - prefixes are just strings • Very difficult to support structural information • xs:group, xs:all, xs:choice, xs:sequence • Possible but difficult to support xs:restriction • xs:simple • Check for enumeration (xs:restriction of xs:string) • Check for valid dates (xs:restriction of xs:date) • Everything else close to impossible • {min,max}{In,Ex}clusive • totalDigits, fractionDigits, length, … probably useless [Diagram: agreement template parts: Context, Terms, Creation Constraints]
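The two checks described as feasible, enumeration membership and date validity, can be sketched in plain Python. This does not parse XML Schema and is not the Negotiation Manager's validator. The function names and the sample values are hypothetical, and time zone suffixes that xs:date permits are not handled.

```python
from datetime import date
from typing import Iterable


def check_enumeration(value: str, allowed: Iterable[str]) -> bool:
    # Mirrors an xs:restriction of xs:string with xs:enumeration facets.
    return value in set(allowed)


def check_date(value: str) -> bool:
    # Accepts calendar dates such as 2008-02-29; rejects impossible ones.
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

Both checks look only at a single leaf value, which is why they are tractable while structural constraints are not.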
Optional Quote Mechanism [Sequence diagram, User and Provider: Get Template; Fill Template; Create Quote (modify); Create Agreement (bound); Yes / No (bound)]
Extensible [Diagram contrasting two designs. Labels: Not / But; WSDL; Black Box (deployed); NegMgr; WSDL Interface; Domain-specific Implementation (deployed)]
Cancellation Policy • Motivation: • Serious issues of 3-way commit protocol (reservations) • Goal: Cheap Cancellation Policy • “Full refund if product bought online is returned online within 14 days” (German law) • “Cancellation before first day of validity: 15 EUR, after that: not possible” (Deutsche Bahn) • “less than 24 hours before scheduled stay: 50% of first day for cancellation” (hotels)
Cancellation Policy - Rules • Ends of periods: +5min, -1d • Price: 1 EUR, -80% [Timeline: createQuote, createAgreement, Earliest Start]
Cancellation Policy - Combination [Figure: cancellation price over time. Time axis: createQuote, createAgreement, +5min, -1d, Earliest Start. Price levels: 0.50 EUR, -50%, Full price] • Used in Broker for roll-back of unsuccessful workflow mappings
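A combined cancellation policy of this kind can be sketched as a piecewise function of time. The mapping of prices to periods is one possible reading of the figure and is an assumption: 0.50 EUR up to 5 minutes after createAgreement, half the full price up to one day before the earliest start, and the full price after that. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def cancellation_fee(now: datetime, agreement_created: datetime,
                     earliest_start: datetime, full_price: float) -> float:
    # Period 1: shortly after createAgreement, cancelling is almost free.
    if now <= agreement_created + timedelta(minutes=5):
        return 0.50
    # Period 2: until one day before the earliest start, half price.
    if now <= earliest_start - timedelta(days=1):
        return 0.5 * full_price
    # Period 3: too late, the full price is due.
    return full_price
```

A cheap first period is what would let a broker reserve resources at several providers and roll back the ones it does not need.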
Context
<wsag:Context>
  …
  <wsag:AgreementInitiator>
    <AG:DistinguishedName> /C=DE/O=… </AG:DistinguishedName>
  </wsag:AgreementInitiator>
  <wsag:AgreementResponder>…</…>
  <AG:ServiceUsers>
    <AG:ServiceUser>DN</…>
  </AG:ServiceUsers>
  …
</wsag:Context>
Terms, SDTs • Conjunction of terms • Common structure of templates • WS-AG too powerful/difficult to fully support • Service Description Term (one) • assessgrid:ServiceDescription (extension of abstract ServiceTermType) • jsdl:POSIXExecutable / SPMD (executable, arguments, environment) • jsdl:Resources • jsdl:DataStaging * • assessgrid:PoF (upper bound)
Terms, GuaranteeTerms • No hierarchy but two meta guarantees • ProviderFulfillsAllObligations • e.g. Reward: 1000 EUR, Penalty 1000 EUR • ConsumerFulfillsAllObligations • e.g. Reward: 0 EUR, Penalty 1000 EUR • First violation is responsible for failure • No hardware problem, then User fault • Other Guarantees • Execution Time • Any start time (best effort) • Exact start time • Earliest start time, latest finish time • Maximum StageIn/Out time • No Cancellation [Diagram labels: No timely execution, No stage-out]
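The rule that the first violation determines responsibility can be sketched as follows. The settlement directions are an assumption: with no violation the consumer pays the provider the reward, otherwise the responsible party pays the penalty to the other side. The amounts default to the example values on the slide, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Violation:
    time: float      # seconds since the agreement was created
    party: str       # "provider" or "consumer"
    guarantee: str   # name of the violated guarantee


def responsible_party(violations: List[Violation]) -> Optional[str]:
    # The earliest violation decides who is responsible for the failure.
    if not violations:
        return None
    return min(violations, key=lambda v: v.time).party


def settle(violations: List[Violation], reward_eur: float = 1000.0,
           penalty_eur: float = 1000.0) -> Tuple[str, str, float]:
    # Returns (payer, payee, amount) under the assumptions stated above.
    culprit = responsible_party(violations)
    if culprit is None:
        return ("consumer", "provider", reward_eur)
    if culprit == "provider":
        return ("provider", "consumer", penalty_eur)
    return ("consumer", "provider", penalty_eur)
```

For example, a late stage-in by the consumer followed by a missed deadline at the provider would be settled against the consumer, because the stage-in violation came first.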
Stuff I did not talk about • Risk Assessment • Risk Management • Checkpointing details, runtime extension, spare nodes, … • Confidence and Reputation Service • Workflows • Description in WS-Agreement • Mapping to individual SLAs • Simulation tools
Future Challenges • Failure detection and analysis • (Re)negotiation • Risk Assessment • Interoperability of WS-Agreement implementations by micro-specs – or even common template structures • Automatic evaluation of CreationConstraints • After-the-fact resolution of disagreements • Third party blaming • Persisting Problems • Dependencies of violated guarantees • Violation caused by third party or unknown cause • Failure/success of entire SLA