The FNAL/CMS GlideinWMS: experience at BNL
Maxim Potekhin
Panda/DDM Workshop
October 4, 2007
BNL
glideinWMS
• What will be covered:
• a very brief overview of a few existing Workload Management Systems
• the general idea of the FNAL/CMS glideinWMS (I. Sfiligoi, FNAL)
• the glideinWMS test bench at BNL
• strengths and weaknesses of the glideinWMS, and what we can learn from it in the context of the Panda system
Workload Management Systems Overview
• for details, see the talk by Igor Sfiligoi and Burt Holzman at CHEP07:
• http://indico.cern.ch/contributionDisplay.py?contribId=216&sessionId=26&confId=3580
• systems considered:
• Condor-G
• ReSS
• gLite WMS
• glideinWMS – Igor’s effort
• what was looked at:
• performance
• scalability
• reliability
Workload Management Systems Overview
• notes on Condor-G:
• significantly, it is used as the underlying submission mechanism for most other WMSs
• part of the Condor distribution
• job submission with Condor-G (see the sketch below):
• scales up to 7k jobs in the queue
• start-up speed depends on the number of Grid Managers in the configuration (from 30 with 30 managers to 60 with 100)
• the same applies to job-removal speed
• empirical result: with 100 managers, Condor-G has enough throughput to saturate the batch system
• if the CE crashes, jobs may stay in the queue forever
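To make the submission path concrete, here is a minimal sketch of a Condor-G (grid universe) submission. The Python glue, the gatekeeper contact string, and the file names are illustrative placeholders; only the submit-description keywords (universe, grid_resource, etc.) are standard Condor syntax.

```python
import subprocess
import tempfile

# A minimal Condor-G submit description. The gatekeeper contact string
# is a placeholder; Grid Manager settings in condor_config (not shown)
# control how many Grid Manager processes service such jobs.
SUBMIT_DESCRIPTION = """\
universe      = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-condor
executable    = payload.sh
output        = job.out
error         = job.err
log           = job.log
queue
"""

def submit():
    # Write the description to a temporary file and hand it to
    # condor_submit, which returns non-zero on failure.
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_DESCRIPTION)
        path = f.name
    subprocess.run(["condor_submit", path], check=True)

if __name__ == "__main__":
    submit()
```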
Workload Management Systems Overview
• notes on ReSS:
• the Resource Selection System is a matchmaker for Condor-G, using information harvested from CEMon on the Grid sites (a toy illustration of the matchmaking idea follows below)
• submission is still done via Condor-G, with ReSS deciding where to submit
• tested with 4x10k jobs queued
• characteristics similar to plain Condor-G
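A toy illustration of the matchmaking idea, under stated assumptions: attribute names, values, and site names below are invented, and real ReSS matches ClassAds built from CEMon data rather than Python dictionaries.

```python
# Toy matchmaker in the spirit of ReSS: pick a site whose advertised
# attributes satisfy the job's requirements. All attributes invented.
sites = [
    {"name": "site_a", "free_slots": 0,   "os": "SL4", "mem_mb": 1024},
    {"name": "site_b", "free_slots": 120, "os": "SL4", "mem_mb": 4096},
]

def job_requirements(site):
    """A job's Requirements expression, as a plain Python predicate."""
    return site["free_slots"] > 0 and site["mem_mb"] >= 2048

def match(requirements, sites):
    """Return the first advertised site satisfying the requirements."""
    for site in sites:
        if requirements(site):
            return site["name"]
    return None  # no match: the job stays idle, as in a real matchmaker

print(match(job_requirements, sites))  # -> site_b
```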
Workload Management Systems Overview
• notes on gLite WMS:
• relies on BDII for information on Grid sites
• has a dedicated client
• uses Condor-G for submission
• performance:
• slow, down to 5 submissions per minute
• however, in collection mode, submission can be very fast, with an effective rate of 1000 jobs per minute. Caveat: occasional failures of collection submissions due to overload of the WMS itself
• monitoring:
• no easy way to find the IDs of one’s own jobs
glideinWMS
• For a complete overview of the FNAL/CMS glideinWMS, see http://home.fnal.gov/~sfiligoi/glideinWMS/doc/manual/index.html#overview (the two following slides are borrowed from that source)
• in the context of glideinWMS, a glidein is a Condor startd submitted as a Grid job
• once it starts, it registers with a Condor pool available from the submission node (see the sketch below)
• users then submit their payload jobs to Condor, while being insulated from the Grid implementation details
• obvious benefits of a familiar environment, and of the monitoring tools that come with it
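Concretely, a glidein submission is just another Condor-G job whose payload is a startup script that launches a startd reporting back to the home collector. A hedged sketch follows; all host and file names are placeholders, and the real glideinWMS Factory generates far richer descriptions than this.

```python
import pathlib

# Sketch of what a glidein submission amounts to: a grid-universe job
# whose executable is a startup script that configures and launches a
# Condor startd pointing back at the home collector. Names below are
# placeholders; the real Factory generates these files.
GLIDEIN_SUBMIT = """\
universe      = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-pbs
executable    = glidein_startup.sh
arguments     = -collector collector.example.org:9618
output        = glidein.out
error         = glidein.err
log           = glidein.log
queue
"""

pathlib.Path("glidein.sub").write_text(GLIDEIN_SUBMIT)
# Hand glidein.sub to condor_submit; once the remote batch job starts,
# the startd it launches registers with collector.example.org and the
# slot shows up in condor_status like any local worker node.
```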
glideinWMS
• our experience with glideinWMS:
• the installation script is complex; it mostly works and is being improved
• thanks to readily available support from the developer (I. Sfiligoi) during installation, we got it up and running after a few initial misconfigurations
• for testing purposes, the Front End and the Factory were colocated on the submission machine
• we used an instance of the Apache server on a separate node to host the payload
• we successfully ran a test job on Panda Pilots, which in turn were deployed on the BNL cluster via the glideinWMS mechanism – a nice proof of principle
• the monitoring tools that come with the product work rather nicely (they provide detailed statistics of the Factory operation in graphic and XML formats)
glideinWMS
• The payload hosted on the Apache server: note the sha1sum file that is used to verify the payload’s authenticity
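The verification step itself is simple. Here is a sketch of what a pilot-side check might look like; the URLs are placeholders for the Apache server hosting the payload, and the real glideinWMS startup script performs an equivalent check in its own way.

```python
import hashlib
import urllib.request

# Sketch of payload verification against a published sha1sum file.
# Both URLs are placeholders.
PAYLOAD_URL = "http://webhost.example.org/payload.tar.gz"
SHA1SUM_URL = "http://webhost.example.org/payload.tar.gz.sha1"

def fetch(url):
    with urllib.request.urlopen(url) as response:
        return response.read()

payload = fetch(PAYLOAD_URL)
# A sha1sum file holds "<hexdigest>  <filename>"; take the first field.
expected = fetch(SHA1SUM_URL).decode().split()[0]
actual = hashlib.sha1(payload).hexdigest()
if actual != expected:
    raise RuntimeError("payload checksum mismatch: refusing to run it")
```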
glideinWMS
• observations:
• Condor-G is used for glidein submissions; we can therefore expect the same intrinsic limitations as with other WMSs
• Q: how does glideinWMS apparently do better in certain tests?
• A: preemptive submission of startds, which allows the following:
• sequential execution of jobs off the same startd
• effectively, advance reservation on remote sites by occupying batch slots
• Q: is that compatible with current practices and policies?
• A: remains to be seen, and probably not... (cf. the large number of slots that need to be “hot” for speedy submission)
• Q: what happens to the unused glideins?
• A: they die after a predetermined timeout, typically 20 minutes (see the sketch below)
• empirically, the latencies behave as expected, i.e. when a sufficient number of glideins are active, submissions are indeed speedy; when there are not enough glideins, submission is effectively throttled
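A conceptual sketch of the idle-timeout behavior: the 20-minute figure comes from the slide above, but the polling loop and the is_claimed() probe are invented stand-ins; the real glidein drives this through Condor startd configuration, not a hand-rolled loop.

```python
import time

# Conceptual sketch of the glidein idle timeout: if no payload claims
# the startd within the configured window (20 minutes per the text),
# the glidein shuts itself down and releases the batch slot.
IDLE_TIMEOUT = 20 * 60  # seconds

def is_claimed():
    """Stand-in: in reality one would query the startd's activity."""
    return False

def glidein_main_loop():
    idle_since = time.time()
    while True:
        if is_claimed():
            idle_since = time.time()  # reset the clock while busy
        elif time.time() - idle_since > IDLE_TIMEOUT:
            return  # die quietly; the batch slot goes back to the site
        time.sleep(30)
```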
glideinWMS
• observations:
• Q: how does glideinWMS handle inter-site, inter-process communication despite the presence of firewalls?
• A: by using GCB (Generic Connection Brokering), which is reached by the two communicating nodes via an outgoing connection on either side; the GCB must reside on the public Internet (a toy illustration follows below)
• Q: how does using the GCB affect security?
• A: most likely slightly adversely, but that remains to be seen (see the GCB site for details). Almost by definition, anything that defeats a firewall can’t possibly enhance the security of the system.
• Q: are there known scalability problems with GCB?
• A: yes, a GCB instance will keel over if more than approx. 600 connections are made, and will take all the associated Condor jobs with it. Work is being done to rectify that; currently, however, this is more than a trivial limitation, and a potential vulnerability of a production system
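The brokering idea can be pictured with a toy relay. This is emphatically not the real GCB protocol, and the port and host values are invented: it only illustrates why two endpoints that cannot accept inbound connections can still talk through a broker that both reach with outgoing connections.

```python
import socket
import threading

# Toy illustration of connection brokering (NOT the real GCB protocol):
# two endpoints that cannot accept inbound connections each dial OUT to
# a broker on the public Internet, and the broker splices their streams.

def pump(src, dst):
    """Copy bytes one way until the source closes."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)

def broker(listen_port=9618):
    srv = socket.socket()
    srv.bind(("", listen_port))
    srv.listen(2)
    # Both firewalled parties connect outbound to us...
    a, _ = srv.accept()
    b, _ = srv.accept()
    # ...and we relay traffic between them in both directions. Note that
    # the broker sees all traffic, which is one reason such a scheme
    # cannot improve the security of the system.
    threading.Thread(target=pump, args=(a, b), daemon=True).start()
    pump(b, a)

if __name__ == "__main__":
    broker()
```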
glideinWMS
• observations:
• Q: what are other scalability problems with the glideinWMS?
• A: due to the memory requirements of current Condor implementations, the submission machine itself can be a bottleneck; however, tests were successful with up to 4k running jobs, and this problem can be further circumvented by using multiple submission machines. The number of queued jobs can be significantly higher.
• scalability issues can be addressed with multiple Front Ends and multiple Factories
• hardware matters! Dedicated machines are needed for critical functions.
• redundancy?
• one of the useful features of the glideinWMS is that the loss of a startd process is handled elastically by the system, i.e. no user jobs are lost. However, this overlaps with an identical feature of the Panda Pilot submission... Do we need an extra layer of indirection?
glideinWMS
• Conclusions:
• we have learned a lot about the glideinWMS and, having installed it locally, demonstrated that it can be used to instantiate Panda Pilots on the Grid
• in our opinion, the glideinWMS is in principle subject to the same limitations as any WMS that uses Condor-G as the underlying remote job submission protocol. It works around them by running payloads serially on the same CE (which is effectively sequestered by the user), and in addition by keeping a configurable number of jobs idling on the remote site, waiting for payloads. The latter can be achieved with the existing Panda system, yet it is not likely to be aligned with site policies.
• we are in the process of integrating our experience with the glideinWMS into our work on the Panda Pilot
• certain features of the glideinWMS are highly practical and can be used in our systems, such as using checksums to validate payloads, thus mitigating security gaps in the Panda system