This article presents solutions for the dynamic deployment of a VO-specific Condor scheduler using GT4, discussing the challenges and various approaches.
Dynamic Deployment of VO Specific Condor Scheduler using GT4
Gaurang Mehta - gmehta@isi.edu
Center For Grid Technologies, Information Sciences Institute/USC
Dynamic Deployment of VO Condor Scheduler

Outline
• Introduction
• VO Problem
• Solution 1: Glideins
• Solution 2: Glideins via GCB
• Solution 3: Condor Brick
• Conclusion

Grid Job Submission
[Figure: the submit node (Collector, Negotiator, Master, Schedd) sends GRAM jobs to GT4 PBS GRAM, which passes them as PBS jobs to the cluster worker nodes.]

Introduction
• GT4 GRAM allows remote job submission to a cluster.
• GRAM has no scheduling capabilities of its own.
• Condor-G can act as a meta-scheduler that submits to different clusters via GRAM.
• The job description is translated several times before the job is executed.

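As an illustration of the Condor-G to GRAM path described above, here is a minimal Python sketch that writes out a Condor-G submit description for a GT4 GRAM endpoint. The host name, executable and file names are hypothetical. The `grid_resource = gt4 ...` form follows the Condor manuals of the GT4 era as best recalled, so treat the exact syntax as an assumption and check it against the Condor version in use.

```python
def condor_g_gt4_submit(gram_url: str, scheduler: str, executable: str) -> str:
    """Return the text of a Condor-G submit description for a GT4 GRAM service.

    gram_url and executable are placeholders supplied by the caller; scheduler
    is the GRAM job-manager type on the remote side (for example "PBS" or "Fork").
    """
    lines = [
        "universe = grid",
        # Assumed syntax: grid type, GRAM service URL, remote scheduler type.
        f"grid_resource = gt4 {gram_url} {scheduler}",
        f"executable = {executable}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical endpoint and job script, for illustration only.
    print(condor_g_gt4_submit(
        "https://cluster.example.org:8443/wsrf/services/ManagedJobFactoryService",
        "PBS",
        "analysis.sh",
    ))
```

Each such description is turned into a GRAM request and then into a PBS job, which is the chain of translations the last bullet refers to.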
VO Requirements
• The Southern California Earthquake Center (SCEC):
  • Gathers new information about earthquakes in Southern California.
  • Integrates that information into a comprehensive and predictive understanding of earthquake phenomena.
• SCEC wants to run millions of earthquake analysis jobs simultaneously on tens of resources (thousands of CPUs).
• They want reliable performance across all their jobs.
• They want to hold the allocated resources for a period of time and pipe as many jobs to those resources as they can.
• They want to run on a wide range of resource configurations.

Solution 1: Glideins
[Figure: cluster on a public network. The submit node (Collector, Master, Negotiator, Schedd) sends a glidein request to GT4 PBS GRAM; PBS runs the glidein request on the cluster worker nodes, which connect to the Collector and execute jobs.]

Solution 1: Glideins
• Glideins add resources to an existing Condor setup by running Condor startd daemons, via GT4 GRAM, on a remote scheduler and cluster.
• Glideins can be set up and started either by running the condor_glidein command or by writing your own Condor or GRAM RSL.
• Pros
  • Allows you to create your own personal Condor cluster (VO grid).
  • Allows VO-specific policies and priorities to be applied to all the nodes in the grid.
• Cons
  • Glideins are not usable, without additional help, when the remote resources have private IP addresses.
  • The submit node becomes a bottleneck for the number of jobs that can be submitted.

Solution 2: Glideins via GCB
[Figure: cluster on a private network with outgoing connections allowed. The submit node (Collector, Master, Negotiator, Schedd) on the public network sends a glidein request to GT4 PBS GRAM; PBS runs the glidein request on the cluster worker nodes, which connect to the Collector. An X marks the job-execution connection to the worker nodes as blocked.]

Solution 2: Glideins with GCB
• Some clusters sit behind a firewall and have only private IP addresses.
• Such clusters may still allow outgoing connections from each node.
• In that case glideins cannot work directly: the shadow process cannot communicate directly with the starter process on the remote node.
• The Generic Connection Broker (GCB) sits on the public network and acts as a proxy relay.

Solution 2: Glideins via GCB
[Figure: cluster on a private network with outgoing connections allowed, with a GCB on the public network. The submit node (Collector, Master, Negotiator, Schedd) sends a glidein request to GT4 PBS GRAM; PBS runs the glidein request on the cluster worker nodes, which open a connection to the GCB, connect to the Collector and execute jobs.]

Solution 2: Glideins with GCB
• Pros
  • GCB allows the shadow and starter processes to communicate across network boundaries.
  • Allows a VO-specific grid to be created.
  • Allows VO-specific policies and priorities to be applied to the entire VO grid.
• Cons
  • Additional overhead of running and maintaining a GCB proxy.
  • The GCB is a point of failure for all the jobs.
  • Only works if the remote cluster setup allows outgoing connections to the public network.
  • The submit node is a bottleneck for the number of jobs that can be submitted.

Solution 3: Condor Brick
• Some clusters with private IP addresses allow neither incoming nor outgoing connections, except to the boundary servers.
• In such cases glideins cannot be run either directly or via GCB.
• Neither of the earlier solutions allows per-cluster or per-site VO policies and priorities to be applied. (It may be possible, but I don't know of a way.)
• Welcome, Condor Brick.

Solution 3: Condor Brick
[Figure: the glidein-via-GCB setup from Solution 2, captioned "Cluster on a private network with outgoing connection allowed". The submit node (Collector, Master, Negotiator, Schedd) sends a glidein request to GT4 PBS GRAM and PBS runs it on the cluster worker nodes; an X marks the worker nodes' connection to the Collector as blocked.]

Condor Brick
[Figure: a Condor Brick consists of the Condor Master, Condor Collector, Condor Negotiator and Condor Schedd, plus VO-specific policies and priorities (Condor config). It sits between the private network and the public network.]

Solution 3: Condor Brick
• A Condor Brick is a set of Condor daemons and configuration files.
• Condor Bricks bind to all the interfaces on the remote server.
• The brick sits on the boundary and talks to both the public and private networks.
• Deploy the brick dynamically on remote clusters, when needed, using a GT4 GRAM fork job-manager on any boundary machine.
• Dynamically glide in the required cluster nodes to the brick using a GT4 <sched> job-manager (schedd - startd communication over the LAN).
• Use Condor-C to submit jobs to remote Condor Bricks (schedd - schedd communication over the WAN).

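The last step above, handing jobs from the submit node's schedd to a brick's schedd with Condor-C, might look roughly like the following sketch. It only builds the text of a submit description; the brick host names and the executable are hypothetical, and the attributes shown (`grid_resource = condor ...`, `+remote_jobuniverse`) should be checked against the manual for the Condor version being deployed.

```python
def condor_c_submit(brick_schedd: str, brick_collector: str, executable: str) -> str:
    """Return the text of a Condor-C submit description targeting a remote brick.

    brick_schedd is the name of the schedd inside the brick and brick_collector
    the host running the brick's collector; both are placeholders here.
    """
    lines = [
        "universe = grid",
        # Condor-C: forward the job to a remote schedd, located via its collector.
        f"grid_resource = condor {brick_schedd} {brick_collector}",
        # Run the job in the vanilla universe (universe number 5) at the brick.
        "+remote_jobuniverse = 5",
        f"executable = {executable}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical boundary machine hosting the brick.
    print(condor_c_submit("brick@gateway.example.org", "gateway.example.org", "analysis.sh"))
```

Because the job stays a Condor job end to end, no translation into a GRAM request is needed for each individual job; GRAM is used only to deploy the brick and to glide in the worker nodes.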
Solution 3: Condor Brick
[Figure: cluster on a private network with no incoming or outgoing connections allowed. The submit node (Collector, Negotiator, Master, Schedd) on the public network deploys the Condor Brick through GT4 Fork GRAM and sends glidein requests through GT4 PBS GRAM; PBS runs the glidein requests on the cluster worker nodes, which connect to the brick. The submit node executes jobs via Condor-C on the brick, and the worker nodes execute them.]

Solution 3: Condor Brick
• A Condor Brick at each cluster distributes the job load away from the submit node.
• A Condor Brick at each cluster enables VO-specific policies and priorities to be implemented at a per-cluster or per-site level.
• Uniform scheduling system from the submit node to the cluster.
• High job throughput, because the grid overhead in each job submission is reduced.
• Condor Bricks bind on all interfaces, making this the most generic of the three solutions.

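The network conditions that separate the three solutions can be summarised as a small decision function. This is only a sketch of the reasoning in the slides; the `ClusterNetwork` type and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterNetwork:
    """Hypothetical description of a cluster's network setup."""
    public_ips: bool        # worker nodes are reachable on the public network
    outgoing_allowed: bool  # worker nodes may open connections to the public network


def choose_solution(net: ClusterNetwork) -> str:
    """Pick the simplest of the three approaches that works for this cluster."""
    if net.public_ips:
        return "glideins"
    if net.outgoing_allowed:
        return "glideins via GCB"
    # Neither incoming nor outgoing connections: only a brick on a boundary machine works.
    return "Condor Brick"


if __name__ == "__main__":
    for net in (
        ClusterNetwork(public_ips=True, outgoing_allowed=True),
        ClusterNetwork(public_ips=False, outgoing_allowed=True),
        ClusterNetwork(public_ips=False, outgoing_allowed=False),
    ):
        print(net, "->", choose_solution(net))
```

The function returns the least involved option for each case. Since the brick binds on all interfaces, it would also work in the first two cases, and it is the only one of the three that offers per-cluster policies.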
Conclusion
• Dynamically deploying a Condor scheduler for the VO on each cluster results in a robust, uniform scheduling environment.
• A VO can control the policies and priorities for all its members, either across the entire VO grid or on each cluster in the VO grid.
• The Condor Brick eliminates some of the grid latency.
• The Condor Brick allows the use of clusters with different network and firewall configurations.

Questions?
• Thanks to Miron and the Condor team.