Results of Meeting on Workload Manager Components Interaction
DataGrid WP1
F. Pacini (fpacini@datamat.it)
Summary
• General Decisions
• General LB Model
• Interactions between UI and RB
• Interactions between UI and LB
• Interactions between RB and JSS
General Decisions (1/4)
[Figure: diagram with elements labelled RB, JSS, GK, UI and LB]
General Decisions (2/4)
• SUBMITTED: this state is generated by the UI just after it has assigned the Job ID and just before it submits the job to the RB. The UI assigns a unique Job ID (dg_jobID), built from e.g. hostname+time+PID, and logs this event.
• CHKPT: this is system-level checkpointing of jobs running on a CE, independent of application checkpointing.
• All parties are strongly invited to use the above Job State Diagram to identify the status of a job in the system!
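The hostname+time+PID idea can be sketched in Python as follows. This is a minimal illustration only: the function name `make_job_id`, the field order and the "-" separator are assumptions, not the agreed dg_jobID format.

```python
import os
import socket
import time


def make_job_id(hostname=None, timestamp=None, pid=None):
    """Build a job identifier from hostname, submission time and process ID.

    Hypothetical layout: the field order and the '-' separator are
    assumptions made for this sketch.
    """
    hostname = hostname if hostname is not None else socket.gethostname()
    timestamp = timestamp if timestamp is not None else int(time.time())
    pid = pid if pid is not None else os.getpid()
    return f"{hostname}-{timestamp}-{pid}"


# Fixed inputs make the result reproducible for the example.
print(make_job_id("ui01.example.org", 1000000000, 4242))
```

Two UI processes on the same host get different PIDs, and the same process at different seconds gets different timestamps, which is what makes the combination a reasonable uniqueness key.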
General Decisions (3/4)
• UI will only perform syntax checking on class-ad attributes: whether they exist and are written in the correct format (i.e. <attribute name> = <expression>). UI will NOT perform any semantic checks on the attribute values. UI will assign default values to some mandatory attributes, if needed and when possible.
• UI will be able to contact different RBs and LBs according to the lists contained in local conf files.
• UI will be able to use different conf files, as specified by the user at UI start-up (e.g. Start_UI –config <filename>).
• UI will provide a command for the creation of a proxy (such as the Globus grid-proxy-init function).
• JSS will have a proxy repository. CESNET will give details on this.
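A purely syntactic check of the kind described above might look like the sketch below. It assumes one attribute per line; the names `parse_job_description` and `DEFAULTS`, and the default value shown, are hypothetical and stand in for whatever the UI man pages define.

```python
import re

# An attribute name, then '=', then a non-empty expression.
ATTRIBUTE_LINE = re.compile(r"^\s*([A-Za-z_]\w*)\s*=\s*(\S.*?)\s*$")

# Hypothetical default; the real mandatory attributes are in the UI man pages.
DEFAULTS = {"RetryCount": "3"}


def parse_job_description(text):
    """Return (attributes, errors) after a syntax-only check."""
    attributes, errors = {}, []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        match = ATTRIBUTE_LINE.match(line)
        if match is None:
            errors.append(f"line {number}: expected <attribute name> = <expression>")
            continue
        name, expression = match.groups()
        # The expression is stored verbatim: no semantic check is made.
        attributes[name] = expression
    for name, value in DEFAULTS.items():
        attributes.setdefault(name, value)  # fill in missing defaults only
    return attributes, errors
```

Note that the expression is never evaluated, which mirrors the decision that semantic checks are not the UI's job.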
General Decisions (4/4)
• The need for a security mechanism for interactions between UI and RB/LB has been highlighted. Parties will perform a kind of handshake, and a secure channel will be established for their communications. CESNET is going to provide an API encapsulating this mechanism for LB, and will also provide a “how-to” to support UI-RB communications.
• If the handshake is successful, RB will send to UI the address of “its” LB.
• Both RB and LB will provide C/C++ APIs to be used by UI, also covering network communications. Since UI will use Python for PM9 (at least!), the SWIG tool will be used to wrap them (http://www.swig.org). NOTE: to get the maximum benefit from SWIG, it is strongly recommended to comply with ANSI C/C++ standards.
• LB requires some modifications to be applied to the Globus job-manager. CESNET will take care of this.
General LB Model
• We have agreed on the LB model proposed by CESNET in their doc Logging and Bookkeeping Service, Rev. 1.35.
• This model is based on a PUSH mechanism, where all the actors within the Workload Manager push logging info using the defined APIs.
• Some modifications to the Globus job-manager are needed in order to use these APIs (e.g. RSL schema modification to take into account the dg_jobID). CESNET will take care of this.
• The events that CESNET proposes to log will be analysed by all parties to build a final agreed set.
Interactions between UI and RB (1/8)
• The Job Submission UI contacts the RB when the following commands are issued by the user:
  • dg-job-submit
  • dg-list-job-match
  • dg-job-cancel
Interactions between UI and RB (2/8)
• dg-job-submit
  dg-job-submit <jdl_file> [-resource res_id] [-notify e_mail_address]
• The core information flowing from the UI to the RB consists of a job class-ad built from the job description file.
• There is a minimal set of mandatory attributes to be specified in the class-ad sent to the RB (ref. UI Man Pages).
• The user will be uniquely identified via his/her CertificateSubject (=> UserID no longer needed).
• UI builds the dg_jobID also using the RB and related LB addresses, for later M&C purposes.
Interactions between UI and RB (3/8)
• If dg-job-submit has been issued with the “-resource” option, then the job class-ad contains the attribute:
  • ResourceID = res_id
  and the RB shall submit the job to the resource identified by “res_id” without going through the match-making process.
• A new attribute RetryCount will be added to the class-ad to allow the user to specify the number of retries in case the specified resource is temporarily unavailable. If the user does not provide this attribute, the UI will fill it in with a default value.
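The RB-side behaviour for “-resource” can be sketched as below. The class-ad is modelled as a plain dict, and `select_resource`, `is_available`, `match_make` and `DEFAULT_RETRY_COUNT` are hypothetical names. Treating RetryCount as the number of attempts after the first one is also an assumption.

```python
DEFAULT_RETRY_COUNT = 3  # hypothetical default, normally filled in by the UI


def select_resource(job_ad, is_available, match_make):
    """Pick the target resource for a job class-ad (a plain dict here).

    is_available(resource_id) -> bool and match_make(job_ad) -> resource_id
    are callables supplied by the caller; both are placeholders.
    """
    resource_id = job_ad.get("ResourceID")
    if resource_id is None:
        # No explicit resource: go through the normal match-making process.
        return match_make(job_ad)
    retries = int(job_ad.get("RetryCount", DEFAULT_RETRY_COUNT))
    # One initial attempt plus `retries` further attempts.
    for _ in range(retries + 1):
        if is_available(resource_id):
            return resource_id
    return None  # resource stayed unavailable; the caller decides what to do
```

A real implementation would wait between attempts; the delay is left out here to keep the sketch short.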
Interactions between UI and RB (4/8)
• If dg-job-submit has been issued with the “-notify” option, then the job class-ad contains the attribute:
  • UserContact = e_mail_address
• For PM9, the user will be notified of job status changes according to the following scheme:
  • the RB shall send an e-mail notification to e_mail_address when the matchmaking process has finished and the job is ready to be submitted to JSS (READY status)
  • the JSS shall send an e-mail notification to e_mail_address when the job starts running on the CE (RUNNING status)
  • the RB shall send an e-mail notification to e_mail_address when the job has finished (ABORTED or DONE status)
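The PM9 notification scheme above amounts to a small lookup table. The sketch below only encodes who notifies on which status; the function name `notification_for`, the message text and the use of a `dg_jobID` key inside the class-ad dict are assumptions.

```python
# Which component notifies the user on which job status (PM9 scheme).
NOTIFIER_BY_STATUS = {
    "READY": "RB",
    "RUNNING": "JSS",
    "ABORTED": "RB",
    "DONE": "RB",
}


def notification_for(component, status, job_ad):
    """Return (recipient, message) if `component` must notify, else None."""
    recipient = job_ad.get("UserContact")
    if recipient is None:
        return None  # job was submitted without "-notify"
    if NOTIFIER_BY_STATUS.get(status) != component:
        return None  # another component (or nobody) is responsible
    job_id = job_ad.get("dg_jobID", "<unknown>")
    return recipient, f"Job {job_id} is now {status}"
```

Keeping the responsibility table in one place makes it harder for RB and JSS to send duplicate mails for the same transition.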
Interactions between UI and RB (5/8)
• dg-list-job-match
  dg-list-job-match <jdl_file>
• The core information flowing from the UI to the RB consists of a job class-ad built from the job description file.
• There is a minimal set of mandatory attributes to be specified in the class-ad sent to the RB (ref. UI Man Pages).
• The user will be uniquely identified via his/her CertificateSubject (=> UserID no longer needed).
Interactions between UI and RB (6/8)
• In this case the RB does not submit the job but only searches for resources compatible with the input job class-ad. The RB will send back to the UI a list of suitable resources, each identified by its Resource ID, which is composed of the Globus Gatekeeper contact string plus a queue name (if any).
Interactions between UI and RB (7/8)
• dg-job-cancel
  dg-job-cancel <jobID1……..jobIDn | -all>
• The core information flowing from the UI to the RB consists of a list of dg_jobIDs.
• The user will be uniquely identified via his/her CertificateSubject (=> UserID no longer needed).
• According to the RB address provided in each dg_jobID, the UI will contact the relevant RBs with the cancellation request.
Interactions between UI and RB (8/8)
• If the dg-job-cancel command has been issued with the “-all” input parameter, no dg_jobIDs are available a priori. For PM9, the UI will contact all the RBs listed in the conf file currently in use, asking for the cancellation of all jobs owned by the requesting user.
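The routing of cancellation requests can be sketched as grouping job IDs by RB address. The dg_jobID layout is not specified here, so the extraction is passed in as a function; `plan_cancellation`, `rb_from_id` and the "<rb>|<lb>|<unique part>" layout are hypothetical and used only for illustration.

```python
from collections import defaultdict


def plan_cancellation(job_ids, rb_address_of, configured_rbs, cancel_all=False):
    """Return {rb_address: job IDs to cancel}.

    A value of None means "all jobs owned by the requesting user".
    """
    if cancel_all:
        # "-all": no job IDs are known, so every configured RB is asked.
        return {rb: None for rb in configured_rbs}
    plan = defaultdict(list)
    for job_id in job_ids:
        plan[rb_address_of(job_id)].append(job_id)
    return dict(plan)


def rb_from_id(job_id):
    """Extract the RB address from the hypothetical '<rb>|<lb>|<unique>' layout."""
    return job_id.split("|")[0]


print(plan_cancellation(["rb1|lb1|a", "rb2|lb1|b", "rb1|lb2|c"], rb_from_id, ["rb1", "rb2"]))
```

The same grouping applies to status and logging queries, with the LB address taking the place of the RB address.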
Interactions between UI and LB (1/5)
• The Job Submission UI contacts the LB when the following commands are issued by the user:
  • dg-job-status
  • dg-get-logging-info
Interactions between UI and LB (2/5)
• dg-job-status
  dg-job-status <jobID1……..jobIDn | -all> [full]
• The core information flowing from the UI to the LB consists of a list of dg_jobIDs and the required information level.
• The user will be uniquely identified via his/her CertificateSubject (=> UserID no longer needed).
• According to the LB address provided in each dg_jobID, the UI will contact the relevant LBs with the status request.
Interactions between UI and LB (3/5)
• If the dg-job-status command has been issued with the “-all” input parameter, no dg_jobIDs are available a priori. For PM9, the UI will contact all the LBs listed in the conf file currently in use, asking for the status of all jobs owned by the requesting user.
• The set of information to be returned in short and full mode has been agreed. Details are in the UI Man Pages.
Interactions between UI and LB (4/5)
• dg-get-logging-info
  dg-get-logging-info <jobID1……..jobIDn | -all> [-from T1] [-to T2] [-full]
• The core information flowing from the UI to the LB consists of a list of dg_jobIDs, the required information level and the time interval.
• The user will be uniquely identified via his/her CertificateSubject (=> UserID no longer needed).
• According to the LB address provided in each dg_jobID, the UI will contact the relevant LBs with the logging information request.
Interactions between UI and LB (5/5)
• If the dg-get-logging-info command has been issued with the “-all” input parameter, no dg_jobIDs are available a priori. For PM9, the UI will contact all the LBs listed in the conf file currently in use, asking for the logging information of all jobs owned by the requesting user.
• The set of information to be returned in short and full mode has been agreed. Details are in the UI Man Pages.
• CESNET will extend some APIs to support the time interval as a query filter.
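The effect of the “-from”/“-to” time interval as a query filter can be illustrated as follows. This is not the LB API: `filter_events`, the event dicts with a numeric `time` field, and the inclusive bounds are all assumptions made for the sketch.

```python
def filter_events(events, t_from=None, t_to=None):
    """Keep the events whose timestamp lies in [t_from, t_to].

    A bound of None means "unbounded on that side". Both bounds are
    treated as inclusive, which is an assumption of this sketch.
    """
    selected = []
    for event in events:
        if t_from is not None and event["time"] < t_from:
            continue
        if t_to is not None and event["time"] > t_to:
            continue
        selected.append(event)
    return selected


events = [
    {"time": 100, "event": "SUBMITTED"},
    {"time": 200, "event": "READY"},
    {"time": 300, "event": "RUNNING"},
]
print(filter_events(events, t_from=150, t_to=300))
```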
Interactions between RB and JSS (1/2)
• The core information flowing from the RB to the JSS (via a JSS API) consists of a job class-ad into which the RB has inserted the following attributes:
  • GKContactString
  • QueueName (if any)
  If the UserContact attribute is present in the received class-ad, the JSS will interpret this as a request to send e-mail notifications when needed.
• For PM9, JSS will transform this class-ad into the Condor Submission File. This file shall contain the dg_jobID, to be passed to the GK in the RSL expression.
• For PM9, JSS is also responsible for maintaining the mapping between dg_jobID and Condor-G ID.
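The dg_jobID to Condor-G ID mapping kept by JSS can be sketched as a two-way lookup. The class name `JobIdMap`, its methods and the in-memory storage are assumptions; a real JSS would presumably need the mapping to survive restarts.

```python
class JobIdMap:
    """Two-way mapping between dg_jobID and Condor-G ID (in memory only)."""

    def __init__(self):
        self._condor_by_dg = {}
        self._dg_by_condor = {}

    def register(self, dg_job_id, condor_id):
        """Record a new pair; each ID may appear in at most one pair."""
        if dg_job_id in self._condor_by_dg or condor_id in self._dg_by_condor:
            raise ValueError("ID already registered")
        self._condor_by_dg[dg_job_id] = condor_id
        self._dg_by_condor[condor_id] = dg_job_id

    def condor_id(self, dg_job_id):
        """Look up the Condor-G ID, e.g. to cancel a job by its dg_jobID."""
        return self._condor_by_dg[dg_job_id]

    def dg_job_id(self, condor_id):
        """Look up the dg_jobID, e.g. when reading Condor-G log events."""
        return self._dg_by_condor[condor_id]
```

Both directions are needed: requests arrive keyed by dg_jobID, while Condor-G reports events keyed by its own ID.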
Interactions between RB and JSS (2/2)
• For PM9, JSS will inspect the Condor-G logfile to detect job status transitions.
• After job completion, JSS will notify the RB of this event.
• Assuming that RB and JSS sit on the same computer, their communications will be based on local sockets.
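Detecting status transitions from the logfile could look roughly like the sketch below. It assumes event header lines of the form "<3-digit code> (<cluster>.<proc>.<subproc>) <date> <time> <text>", as in Condor user logs. The mapping from event codes to the job states used here is an assumption, as are the names `EVENT_HEADER`, `STATE_BY_EVENT` and `status_transitions`.

```python
import re

# Assumed header layout: "<code> (<cluster>.<proc>.<subproc>) <date> <time> <text>".
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) ")

# Assumed mapping from log event codes to job states.
STATE_BY_EVENT = {"001": "RUNNING", "005": "DONE", "009": "ABORTED"}


def status_transitions(log_text):
    """Yield (condor_id, state) for every recognised event, in log order."""
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match is None:
            continue  # detail or separator line, not an event header
        code, cluster, proc = match.groups()
        if code in STATE_BY_EVENT:
            yield f"{int(cluster)}.{int(proc)}", STATE_BY_EVENT[code]


sample = (
    "000 (012.000.000) 07/10 12:00:00 Job submitted from host: <10.0.0.1:1234>\n"
    "...\n"
    "001 (012.000.000) 07/10 12:05:00 Job executing on host: ce01\n"
    "...\n"
    "005 (012.000.000) 07/10 12:30:00 Job terminated.\n"
    "...\n"
)
print(list(status_transitions(sample)))
```

The Condor-G ID yielded for each event would then be translated back to a dg_jobID before the RB is notified.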