Condor Week Summary March 14-16, 2005 Madison, Wisconsin
Overview • Annual meeting at UW-Madison. • About 80 participants at this year's meeting. • Participants come from universities, research labs, and industry. • A single plenary session with talks from users and developers.
Overview • Topics ranged from basic to advanced. • Selected highlights in today’s talk. • Slides from this year’s talks can be found at http://www.cs.wisc.edu/condor/CondorWeek2005
Condor Week Topics • distributed computing and Condor • data handling and Condor • third-party contributions to Condor • reports from the field • Condor roadmap
Condor Grids (by Alan De Smet) • Various alternatives for accessing remote computing resources (distributed computing, flocking, Globus/Condor-G, Condor-C, etc.). • Discussed the pros and cons of each approach (ACF uses Globus/Condor-G).
Condor-G Status and News • Globus Toolkit 2 is stable. • Globus Toolkit 3 is supported, but we think most people are moving to… • Globus Toolkit 4, which is in progress. • The GT4 beta works now with Condor 6.7.6. • Condor will officially support GT4 soon after its official release.
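For reference, a minimal Condor-G submit file of this era looked roughly like the sketch below; the gatekeeper hostname and file names are placeholders:

    universe        = globus
    globusscheduler = gatekeeper.example.edu/jobmanager-pbs
    executable      = analysis
    output          = analysis.out
    error           = analysis.err
    log             = analysis.log
    queue

Submitting this with condor_submit hands the job to Condor-G, which manages it on the remote Globus resource on the user's behalf.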
Glidein (by Dan Bradley) • You have access to a cluster running some other batch system. • You want Condor features, such as • queue management • matchmaking • checkpoint migration
What Does Glidein Do? • Installation and setup of Condor. • May be done remotely. • Launching Condor. • Through Condor-G submission to Globus. • Or you run the startup script however you like.
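As an illustration, a glidein of this era could be launched with the condor_glidein tool; the invocation below is a sketch (the contact string is a placeholder, and options varied between versions):

    # Ask condor_glidein to install and start 10 temporary Condor
    # daemons on a remote Globus-managed cluster.
    condor_glidein -count 10 gatekeeper.example.edu/jobmanager-pbs

Once the glidein startds report to your collector, ordinary Condor matchmaking schedules jobs onto them just like local machines.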
Condor and DBMS (by Jeff Naughton) • Premise: A running Condor system is awash in data: • Operational data • Historical data • User data • DBMS technology can help capture, organize, manage, archive, and query this data.
Three potential levels of involvement • Passively collect and organize data, expose it through DB query interfaces. • Move/extend some data-related portions of Condor to DBMS (Condor writes to and reads from DBMS) • Provide services to help users manage their data.
Why do this? • For Condor administrators: • easier to analyze and troubleshoot; • easier to audit; • easier to explore current and past system status and behavior.
Our projects and plans • Quill: Transparently provide a DBMS query interface to job_queue and history data. [ready to deploy!] • CondorDB: Transparently captures and provides interface to critical data from all Condor daemons. [status: partial prototype working in our own “sandbox”]
Quill • Job ClassAd information mirrored into an RDBMS. • Both active jobs and historical jobs. • Benefits BOTH scalability and accessibility. [Diagram: Quill mirrors the schedd's job-queue log into an RDBMS holding queue and history tables, alongside the master, startd, and schedd daemons.]
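To sketch the kind of access this enables: once the queue and history live in an RDBMS, ordinary SQL applies. The table and column names below are hypothetical, not Quill's actual schema:

    -- Count running jobs per owner (hypothetical schema).
    SELECT owner, COUNT(*)
    FROM   jobs
    WHERE  jobstatus = 2     -- 2 = Running in Condor's JobStatus encoding
    GROUP  BY owner;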
Longer-term plans • Tight integration of DBMS technology and Condor [status: thinking hard!]. • DBMS-inspired data management services to help Condor users manage their own data. [status: thinking really hard!]
Stork (by Tevfik Kosar) • Condor tool for data movement. • First available in v6.7.6; will be included in the next stable release (6.8.0). • Prototypes deployed at various sites.
[Slide: data-intensive application domains and their data rates — Bioinformatics (BLAST), High Energy Physics (LHC), Educational Technology (WCER EVP), and Astronomy surveys (LSST, 2MASS, SDSS, DPOSS, GSC-II, WFCAM, VISTA, NVSS, FIRST, GALEX, ROSAT, OGLE, ...), with volumes ranging from roughly 20 TB/year to 11 PB/year.]
Stork: Data Placement Scheduler • First scheduler specialized for data movement/placement. • De-couples data placement from computation. • Understands the characteristics and semantics of data placement jobs. • Can make smart scheduling decisions for reliable and efficient data placement. http://www.cs.wisc.edu/condor/stork
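A Stork data-placement job is described in ClassAd syntax and handed to the scheduler with stork_submit; a minimal transfer request looked roughly like the sketch below (the URLs are placeholders):

    [
      dap_type = "transfer";
      src_url  = "gsiftp://source.example.edu/data/input.dat";
      dest_url = "file:///scratch/input.dat";
    ]

Because the transfer is a first-class job, Stork can queue it, retry it on failure, and order it relative to compute jobs (e.g., via DAGMan).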
Stork can also: • Allocate/de-allocate (optical) network links. • Allocate/de-allocate storage space. • Register/un-register files with the Metadata Catalog. • Locate the physical location of a logical file name. • Control concurrency levels on storage servers.
Storage Management (by Jeff Weber) • NeST (Network Storage Technology) is another project at UW-Madison. • To be coupled to Condor and Stork. • No stable release available yet.
Overview of NeST • NeST: Network Storage Technology • Lightweight: Configuration and installation can be performed in minutes. • Multi-protocol: Supports Chirp, GridFTP, NFS, HTTP • Chirp is NeST’s internal protocol • Secure: GSI authentication • Allocation: NeST negotiates “mini storage contracts” between users and server.
Why storage allocations? • Users need both temporary storage and long-term guaranteed storage. • Administrators need a storage solution with configurable limits and policy. • Administrators will benefit from NeST's autonomous reclamation of expired storage allocations.
Storage allocations in NeST • Lot – abstraction for a storage allocation, with an associated handle. • The handle is used for all subsequent operations on this lot. • The client requests a lot of a specified size and duration; the server accepts or rejects the request.
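To make the lot idea concrete, here is a purely illustrative client session; the command names and options are hypothetical, not NeST's actual interface:

    # Request a 5 GB lot for 7 days; the server returns a handle.
    # (Hypothetical commands, for illustration only.)
    nest_request_lot --size 5GB --duration 7d
    => lot handle: 42

    # All later operations name the lot by its handle.
    nest_put --lot 42 results.tar.gz

When the 7 days expire, the server can autonomously reclaim the space, which is what makes the "mini storage contract" enforceable.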
Condor and SRM (by Derek Wright) • Coordinate computation and data movement with Condor. • A Condor ClassAd hook (STARTD_CRON_JOBS) queries the DRM for files in its cache and publishes the list in each node's ClassAd. • An FSM keeps track of all files required by jobs in the system and contacts the HRM if required files are missing. • Regular Condor matchmaking then schedules jobs where the files already exist.
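A sketch of the ClassAd hook configuration, assuming a site-provided probe script; the job name, attribute prefix, and script path are placeholders, and the exact STARTD_CRON syntax varied by Condor version (consult the manual for yours):

    # Every 5 minutes, run a probe whose output lines become
    # DRM_-prefixed attributes in this startd's ClassAd,
    # advertising which files the local DRM cache holds.
    STARTD_CRON_JOBS = drm_probe:DRM_:/usr/local/bin/drm_probe.sh:5m

Jobs (or the FSM) can then write Requirements or Rank expressions against those advertised attributes so matchmaking prefers nodes that already cache the needed files.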
3rd party contributions to Condor • High availability features (Technion – Israel Institute of Technology). • Privilege separation in Condor (Univ. of Cambridge). • Optimizing Condor throughput (CORE Feature Animation). • Web interface to Condor (University College London).
[Diagram: current Condor pool — a single Central Manager running the Negotiator and Collector, serving many machines that each run a Startd and Schedd.]
[Diagram: highly available Condor pool — one active Central Manager plus idle backups; if the active manager fails, a backup takes over serving the same Startd/Schedd machines.]
Highly Available Central Manager • Our solution: a highly available Central Manager. • Automatic failure detection. • Transparent failover to a backup matchmaker (no global configuration change for the pool entities). • "Split brain" reconciliation after network partitions. • State replication between the active manager and the backups. • No changes to Negotiator/Collector code.
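A sketch of the configuration on each central-manager candidate, using the HAD knobs of this era; the hostnames and ports are placeholders, and the manual has the full recipe:

    # Run the high-availability and replication daemons alongside
    # the usual central-manager daemons.
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION

    # All central-manager candidates, in priority order.
    HAD_LIST         = cm1.example.edu:51450, cm2.example.edu:51450
    REPLICATION_LIST = cm1.example.edu:41450, cm2.example.edu:41450

    # Replicate accountant state between active and backup managers.
    HAD_USE_REPLICATION = True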
What is privilege separation? • Isolation of those parts of the code that run at different privilege levels. [Diagram: without privilege separation, the Condor daemons and the Condor job all run under root's privilege; with privilege separation, the root-level code is isolated from the daemons and the job.]
Throughput Optimization (CORE Feature Animation) Negotiation-cycle time before => after each change: • Removed groups: 6 => 5.5 min • Significant attributes: 5.5 => 3 min • Schedd algorithm: 3 => 1.5 min • Separate servers: 1.5 => 0.6 min • Cycle delay: 0.6 => 0.33 min • Server loads: <1 (middleware), <2 (central manager)
Web Service Interface to Condor • Facilitates the development of third-party applications capable of interacting with Condor (remotely). • E.g., build a higher-level, application-specific scheduler that submits jobs to multiple Condor pools based on application semantics. • These can be built using a wide range of languages/SOAP packages. • BirdBath has been tested with: Java (Apache Axis, XSUL), Python (ZSI), C# (.NET), C/C++ (gSOAP). • Makes Condor accessible from platforms where its command-line tools are not supported/installed.
Condor Plans (by Todd Tannenbaum) • Condor 6.8.0 (stable series) available in May 2005. • Fail-over, persistence, and other features. • Improved scalability and accessibility (APIs, Grid middleware, Web-based interfaces, etc.). • Grid universe and security improvements.
BAM! More tasty Condor goodness! • Condor can now transfer job data files larger than 2 GB, on all platforms that support 64-bit file offsets. • Real-time spooling of stdout/stderr/stdin in any universe, including VANILLA, allowing real-time monitoring of job progress. • The Condor installer on Win32 uses MSI (thanks Micron!). • condor_transfer_data (DZero). • STARTD_VM_EXPRS (INFN). • New condor_vacate_job tool. • condor_status -negotiator.
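For example, the two new tools operate on a job id; the cluster.proc below is arbitrary:

    # Fetch the spooled output files of job 42.0 back to the submit machine.
    condor_transfer_data 42.0

    # Evict job 42.0 without vacating every job on its machine.
    condor_vacate_job 42.0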
And More… • New startd policy expression MaxJobRetirementTime: specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt it. • New -peaceful option to condor_off and condor_restart. • noop_job = True. • Preliminary support for the Tool Daemon Protocol (TDP). • TDP's goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools: users specify a "tool" to be spawned alongside their regular Condor job. • On Linux, a monitoring tool can attach with ptrace() before the job's main() function is called.
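A minimal sketch of how the retirement knob and the -peaceful option combine; the two-hour figure is arbitrary:

    # startd config: give running jobs up to 2 hours to finish
    # on their own before they are preempted.
    MaxJobRetirementTime = 2 * 3600

    # Shut down the startd only after running jobs retire gracefully.
    condor_off -peaceful -startd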
Hey Jobs! We're watching you! • condor_starter enforces limits. • The starter already monitors many job characteristics (image size, CPU usage, etc). • Threshold expressions: use more resources than you said you would, and BAM! • Local Universe: just like the Scheduler Universe, but with a condor_starter, so jobs get all the advantages of the starter. [Diagram: on both the submit side (schedd) and the execute side (startd), a starter is spawned to watch over each job.]
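As an illustration of a threshold expression, the sketch below uses the long-standing startd policy knobs rather than any new starter-specific syntax; the 2 GB limit is arbitrary:

    # Evict any job whose memory image grows past roughly 2 GB.
    # (ImageSize is reported in KiB.)
    PREEMPT      = (ImageSize > 2000000)
    WANT_SUSPEND = False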
ClassAd Improvements in Condor! • Conditionals: IfThenElse(condition, then, else). • String functions: strcat(), strcmp(), toUpper(), etc. • StringList functions: StrListContains(), StrListAppend(), StrListRemove(), etc., operating on CSV-style "string lists" such as MyList = "Joe, Jon, Jeff, Jim, Jake". • Others: type tests and some math functions.
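A sketch of how these functions might appear in a startd policy expression; the attribute names are illustrative and the argument order for StrListContains is assumed here:

    # Hypothetical policy: rank jobs from listed users higher.
    MyFriends = "joe, jon, jeff"
    Rank = IfThenElse(StrListContains(MyFriends, User), 10, 0)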
Accounting Groups and Group Quota Support • Accounting groups (with CORE Feature Animation). • Accounting group quotas (inspiration: CDF @ Fermi). • Sample problem: a cluster with 500 nodes, where the Chemistry Dept purchased 100 of them and Chemistry users must always be able to use them. • Could use Machine Rank, but this ties the policy to specific machines. • Or use the new group support, sketched below: • Each group can be given a quota in the config file. • Job ads can specify group membership. • Group quotas are satisfied first. • Accounting by user and by group.
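A sketch of the configuration side, using the group-quota knob names of this era; the group and user names are placeholders:

    # Negotiator config: define a group with a 100-machine quota.
    GROUP_NAMES = group_chemistry
    GROUP_QUOTA_group_chemistry = 100

In the user's submit file, the job claims membership in the group:

    +AccountingGroup = "group_chemistry.alice"

The negotiator satisfies group quotas before handing out the remaining machines, so Chemistry's 100 slots stay available to its users.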
Improved Scalability • Much faster negotiation. • SIGNIFICANT_ATTRIBUTES determined automatically. • The schedd uses non-blocking TCP connects to the startd. • Negotiator caching. • The collector forks to answer queries. • More…
What’s brewing for after v6.8.0? Can I commit this to CVS?? • More data, data, data: Stork distributed with v6.8.0 (including DAGMan support), NeST managing Condor spool files and checkpoint servers, Stork used for Condor job data transfers. • Virtual machines (and the future of the Standard Universe). • Condor and Shibboleth (with Georgetown Univ). • Least-privilege security access (with U of Cambridge). • Dynamic temporary accounts (with EGEE, Argonne). • Leverage database technology (with the UW DB group). • “Automatic” glideins (NMI Nanohub – Purdue, U of Florida). • Easier updates. • New ClassAds (integration with Optena). • Hierarchical matchmaking.