Explore the current directions, contracts, and status of resource selection, with details on scheduling, contracts, and migration management. The protocol involves Cactus Worm Server, GridFTP, and performance detection for improved resource selection.
Outline • Resource Selection: Current Directions • Contracts: Current Directions • Current Status • Resource Selection • Request Protocol • Response Protocol • Resource “Scheduling” • Contracts • Migration Manager
Resource Selection: Current Directions
Cactus Worm Server [Diagram. Components shown: Cactus Flesh; “Worm” Migration Module; User Supplied Application Payload; External GridFTP Server (Source); GridFTP Client Thorn; External GridFTP Server (Destination); Performance Degradation Detection; External Resource Selection Service; Resource Selection Client Thorn; Migration Logic Manager. Legend: External Processes, Thorns, Cactus Application Unit, Current Architecture, Under Development, Data transfer.]
Resource Selector Architecture (UCSD) [Diagram. The Resource Selection Client Thorn sends a request in ClassAds format (protocol? HTTP? SOAP?) and receives a response (format?). Components shown: HFA/GradsSoft Translator; UCSD Resource Selection Library (HFA/GradsSoft); MDS; GRIS’s; NWS.]
Resource Selector Architecture (ClassAds) [Diagram. The Resource Selection Client Thorn sends a request in ClassAds format (protocol? HTTP? SOAP?) and receives a response (format?). Components shown: Resource Selection Engine; ClassAds library; UTk Project; MDS; GRIS’s; NWS. Annotation: “Needed for recovery and timeliness?”]
Resource Selector Architecture (Other RS’s) [Diagram. The Resource Selection Client Thorn sends a request in some format (protocol? HTTP? SOAP?) and receives a response in some format. Component shown: Other Resource Selection Service.]
Contract Monitoring: Current Directions
Contract Monitor • Driven by three user-controllable parameters • Time quantum for “time per iteration” • % degradation in time per iteration (relative to prior average) before noting violation • Number of violations before migration • Potential causes of violation • Competing load on CPU • Computation requires more processing power: e.g., mesh refinement, new subcomputation • Hardware problems
Contract Monitor Details • The end user specifies several variables. • These variables can be changed at run time by contacting the application through an HTTP interface. • These variables include: • time quantum • % degradation • number of violations before migration • The system then calculates the average wall-clock time per iteration for each time quantum. • If the average time per iteration in any time quantum is worse (by more than the specified percentage) than the average over all previous quanta, a violation is noted.
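The violation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cactus implementation: the class and method names are hypothetical, the baseline is taken as the plain mean of the earlier per-quantum averages, and a violating quantum is still added to that history.

```python
from dataclasses import dataclass, field


@dataclass
class ContractMonitor:
    """Illustrative contract monitor; every name here is hypothetical."""
    time_quantum: float      # wall-clock seconds per quantum
    degradation_pct: float   # allowed % slowdown relative to the prior average
    max_violations: int      # violations tolerated before migration
    violations: int = 0
    _quantum_averages: list = field(default_factory=list)
    _elapsed: float = 0.0
    _iterations: int = 0

    def record_iteration(self, seconds: float) -> bool:
        """Record one iteration's wall-clock time.

        Returns True once more than max_violations violations have been
        noted, i.e. when migration should be requested.
        """
        self._elapsed += seconds
        self._iterations += 1
        if self._elapsed < self.time_quantum:
            return False
        # The quantum is complete: compute its average time per iteration.
        current = self._elapsed / self._iterations
        self._elapsed, self._iterations = 0.0, 0
        if self._quantum_averages:
            # Assumption: baseline is the mean of all earlier quantum averages.
            baseline = sum(self._quantum_averages) / len(self._quantum_averages)
            if current > baseline * (1 + self.degradation_pct / 100):
                self.violations += 1
        self._quantum_averages.append(current)
        return self.violations > self.max_violations
```

A caller would invoke `record_iteration` once per iteration with the measured wall-clock time and start migration when it returns `True`. Changing `degradation_pct` or `max_violations` on the object between calls mimics adjusting the parameters at run time.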
Actions Taken on Contract Violation • Occurs when more than the specified number of violations have been noted • New set of resources requested from the ResourceSelector • Checkpoints application • Moves checkpoint data to the new resources along with other data needed for restart • Restarts application on the new resources
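The sequence of actions above can be written as a small driver. This is a hedged sketch: the four callables are hypothetical stand-ins for the Resource Selector request, the checkpoint, the data transfer, and the restart, and the choice to abort when no machines are offered is an assumption rather than documented behavior.

```python
from typing import Callable, Sequence


def migrate(
    select_resources: Callable[[], Sequence[str]],
    checkpoint: Callable[[], str],
    transfer: Callable[[str, Sequence[str]], None],
    restart: Callable[[Sequence[str]], None],
) -> bool:
    """Run the migration steps in order; return False if no resources are offered."""
    machines = select_resources()        # 1. request new resources
    if not machines:
        return False                     # assumption: stay put if nothing is offered
    checkpoint_path = checkpoint()       # 2. checkpoint the application
    transfer(checkpoint_path, machines)  # 3. move checkpoint and restart data
    restart(machines)                    # 4. restart on the new resources
    return True
```

Passing the steps in as functions keeps the ordering visible and lets each step be replaced by a stub when testing.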
Resource Selection • Demonstrated migration using RS with a simple protocol (raw sockets). • Working on a more robust protocol over HTTP, with ClassAds for the request and XML for the response • Robustness (error handling) is critical on a real grid • Important to use a well-known protocol • Working on incorporating the performance model into ClassAds
Resource Selection: Example Input
[
  Type = "request";
  Owner = "dangulo";
  RequiredDomains = {"cs.uiuc.edu", "ucsd.edu"};
  requirements = "other.opSys=="LINUX" && other.minMemSize > (100G/other.CPUCount) && Include(other.domains, RequiredDomains)";
  Rank = other.minCPUSpeed * other.CPUCount / (other.maxCPULoad+1);
]
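To make the Rank expression concrete, the sketch below evaluates it in Python for a few candidate resource sets and picks the highest-ranked one, which is how ClassAds matchmaking treats Rank. The candidate names and attribute values are invented for illustration, and the requirements filter is deliberately left out.

```python
def rank(candidate: dict) -> float:
    """Rank = other.minCPUSpeed * other.CPUCount / (other.maxCPULoad + 1)."""
    return (
        candidate["minCPUSpeed"] * candidate["CPUCount"]
        / (candidate["maxCPULoad"] + 1)
    )


# Hypothetical candidates; these numbers are not measurements.
candidates = [
    {"name": "set-a", "minCPUSpeed": 500, "CPUCount": 8, "maxCPULoad": 1.0},
    {"name": "set-b", "minCPUSpeed": 800, "CPUCount": 4, "maxCPULoad": 0.0},
    {"name": "set-c", "minCPUSpeed": 1000, "CPUCount": 8, "maxCPULoad": 3.0},
]

# A higher Rank is preferred.
best = max(candidates, key=rank)
print(best["name"], rank(best))
```

The `+1` in the denominator keeps the expression finite for an idle resource set, and a heavily loaded set loses rank even if it has more or faster CPUs.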
Resource Selection: Input • Need to specify other user-centric information • Cactus is installed in user space • We’re investigating whether we can put the Performance Model equations into the ClassAds format in order to pass them to the Resource Selector. • The “Rank” value in the preceding slide shows a simple example of this.
Resource Selection: Example output
<virtualMachine>
  <result statusCode="200" statusMessage="OK"/>
  <machineList>
    <machine dns="amajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="bmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="cmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="dmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="emajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="fmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="hmajor.cs.uiuc.edu" processor=" 1"/>
  </machineList>
</virtualMachine>
Resource Selection: Example output (no resource is found)
<virtualMachine>
  <result statusCode="204" statusMessage="No match Resource is Found"/>
  <machineList>
  </machineList>
</virtualMachine>
Resource Selection: Example output (bad request from client: request format error)
<virtualMachine>
  <result statusCode="400" statusMessage="Bad Request"/>
  <machineList>
  </machineList>
</virtualMachine>
Resource Selection: Example output (MDS server is down)
<virtualMachine>
  <result statusCode="601" statusMessage="MDS Service is not available"/>
  <machineList>
  </machineList>
</virtualMachine>
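A client could read these responses with Python's standard XML parser, as in the minimal sketch below. It assumes the response is well-formed XML, so each machine element must be self-closed. The function name and the returned tuple layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_response(xml_text: str):
    """Return (status_code, status_message, [(dns, processor), ...])."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    status_code = int(result.get("statusCode"))
    status_message = result.get("statusMessage")
    machines = [
        # The examples pad the processor value with a space, so strip it.
        (m.get("dns"), int(m.get("processor").strip()))
        for m in root.findall("machineList/machine")
    ]
    return status_code, status_message, machines
```

Because every response carries the same `result` element, the client can branch on the status code first (200, 204, 400, 601 in the examples) and treat an empty machine list as "nothing to migrate to".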
Resource “Scheduling” • What word should we use for allocating machines to data? (“Scheduling” seems wrong.) • We’re assuming that RS does this • We need to map RS output to the Cactus machine distribution
Contract Monitoring • Demonstrated detection of performance degradation • Application monitors placed in Cactus scheduling • routine called once per iteration • accesses Cactus internal timing API • synchronization implies that timing on all nodes is identical • could use different Cactus scheduling times to get node-dependent results
Migration Manager • In initial development • Will allow RS selection to occur asynchronously • Will make intelligent choice on whether migration will actually help • Will not migrate to seemingly lower quality resources