What’s new in Condor? Condor Week 2006. So Todd… where is v6.8? Well, v6.7 has been a challenge…
Around since the 80’s 80’s Mullet Boy
100 people surveyed! Favorite “ility”? Deployability!
Existing Ports • Digital UNIX 4.0 Alpha • AIX 5.2 (clipped) PowerPC • Tru64 5.1 (clipped) Alpha • HP UNIX 10.20 PA RISC • HP UNIX 11.00 (clipped using hpux10.20 32 bit) PA RISC • Irix 6.5 (clipped) SGI • Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) Alpha • Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 Intel x86 • Linux 2.4.x (glibc 2.2) - Red Hat 8 Intel x86 • Linux 2.4.x (glibc 2.3) - Red Hat 9 Intel x86 • Enterprise Server 8.1 Intel Itanium • Solaris 8 Sparc • Solaris 9 Sparc • Microsoft Windows 2000 or XP (clipped) Intel x86 CondorWeek 2005
New Ports Sigh… • Introduced in v6.6.x • MacOSX (“clipped”) PowerPC • Debian Linux 3.1 Intel x86 • Fedora Core 1 Intel x86 • Red Hat Enterprise Linux 3 Intel x86 • SuSE Linux Enterprise Server 8.1 Intel Itanium • Introduced in v6.7.x • AIX 5.1 (“clipped”) PowerPC • Fedora Core 2 on x86 • Fedora Core 3 on x86 • SuSE 8.0 (“clipped”) on AMD64 • Solaris 10 (“clipped”) on Sparc • Scientific Linux (Release 303) on x86 • Still to be introduced in v6.7.x (before v6.8.0) • HPUX 11i 64-bit pa-risc • RHEL 4 on x86 • “native” 64 bit AMD Linux “Psilord” – The Condor porting doctor. Talk to him in person tomorrow.
Porting Table • See http://www.cs.wisc.edu/condor/porting/port_table.html • Highlights • Almost every 32-bit Linux flavor as “full” • Every other Unix, MacOS and Windows available as “clipped” • Solaris 10 and HP-UX 11.x now “clipped” • FreeBSD 4 contribution from Yahoo!, added 5 and 6 • X86_64 Linux: “full” running in the lab
Backfill Jobs • Execute machines will run a locally staged executable when otherwise idle. • Currently designed for BOINC.
# Turn on backfill functionality, and use BOINC
ENABLE_BACKFILL = TRUE
BACKFILL_SYSTEM = BOINC
# Spawn a backfill job if we've been Unclaimed for more than 5 minutes
START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
# Evict a backfill job if the machine is busy (based on keyboard
# activity or cpu load)
EVICT_BACKFILL = $(MachineBusy)
Joining Condor’s Einstein@Home Compute Team • If you’re running BOINC backfill jobs in Condor and want to use your cycles to help another UW project, please join the Einstein@Home computation • Join the “Condor Backfill” team: • http://einstein.phys.uwm.edu/team_display.php?teamid=5994 • http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994
More “deployability” • “Personal” Condor Support on Win32 • LocalSystem not required • MSI installer on Win32 (thanks Micron!) • New tools • condor_cold_start and • condor_cold_stop • Safe, dynamic Condor service deployment. More info @ Research BOF 9am Rm219
100 people surveyed! Favorite “ility”? Availability!
Condor with Firewalls and NATs: GCB in v6.8.0! [Diagram: a Client app and a Server app, each on top of a GCB layer over TCP/IP, with a Relay point between them; call labels: listen, accept, connect, translate.]
Job Progress continues if connection is interrupted • Now for Vanilla, Java, and Grid universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines. • If there is a network outage between the execute and submit machines • If the submit machine restarts • Grid Universe was tricky… • To take advantage of this feature, put the following line into your job’s submit description file: JobLeaseDuration = <N seconds> For example: job_lease_duration = 1200
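To make the lease idea above concrete, here is a minimal Python sketch. It is a toy model, not Condor code: the `JobLease` class and its methods are hypothetical names, and it assumes only that a job is allowed to keep running while the time since the last contact with the submit side is within the lease duration.

```python
from dataclasses import dataclass


@dataclass
class JobLease:
    """Toy model of a job lease (hypothetical; not Condor's implementation)."""
    duration: float       # lease length in seconds, e.g. 1200
    last_renewed: float   # time of the last successful contact, in seconds

    def renew(self, now: float) -> None:
        # Assumption: every successful contact with the submit side renews the lease.
        self.last_renewed = now

    def expired(self, now: float) -> bool:
        # The job may keep running until the lease has fully elapsed.
        return now - self.last_renewed > self.duration


lease = JobLease(duration=1200, last_renewed=0.0)

# A 10-minute outage fits inside a 1200-second lease; a 25-minute one does not.
survives_short_outage = not lease.expired(now=600.0)
survives_long_outage = not lease.expired(now=1500.0)
```

Under these assumptions, a longer lease tolerates longer outages, at the cost of a job lingering longer after its submit machine is gone for good.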
Job Progress continues if submit machine fails • Condor can now support a submit machine “hot spare” (schedd failover) • If your submit machine A is down for longer than N minutes, a second machine B can take over • Requires shared filesystem between machines A and B
Central Manager Failover • Condor Central Manager has two services • condor_collector • Now a list of collectors is supported • condor_negotiator (matchmaker) • If it fails, an election process runs and another takes over • Accounting state is periodically replicated • Contributed technology from Technion
Reliability, cont. • Time shifts • Quill • Closing windows of vulnerability
100 people surveyed! Favorite “ility”? Lightweight?
100 people surveyed! Favorite “ility”? X Lightweight?
100 people surveyed! Favorite “ility”? Functionality!
Security • Common Authentication Methods between Condor on Unix and Win32 • Kerberos 1.4 • Additional hopeful benefit: Authentication against MS Active Directory! • SSL • Password (shared secret) • Starter only runs known executables • More powerful, unified map file(s) • GSI credentials delegated
With Condor on Win32, it would be nice if … • My jobs could access my files just like the condor_shadow can • I didn’t have to tie my execute machines to a single account • I didn’t have to run condor_store_cred from every machine where my credential is needed (thank you Optena)
The Windows CredD • A centralized repository for user passwords
C:\>condor_store_cred add
Account: gquinn@CROW
Enter password:
Operation succeeded.
[Diagram: “store password” sends <password> to the credd; stored passwords shown: myp4sswd, y0urs.]
The Windows CredD • Submit machines can use the CredD to impersonate the user in the shadow [Diagram: schedd, shadow, and credd; labels: “fetch password”, <password>; stored passwords shown: myp4sswd, y0urs.]
The Windows CredD • Execute machines can use the CredD to run jobs as the submitting user! [Diagram: starter, condor_exec.exe, and credd; labels: “fetch password”, <password>; stored passwords shown: myp4sswd, y0urs.]
Running Jobs as Submitting User • In submit file:
Run_job_as_owner = true
• In config file on submit and execute nodes:
CREDD_HOST = vault.cs.wisc.edu
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True
Some Condor APIs • Command Line tools • condor_submit, condor_q, etc • -format, -constraint, -xml • Condor Perl Module • Chirp • Checkpoint Library API • MW --- improved! • DRMAA (Works w/ Win32, on SourceForge) • Condor Grid ASCII Protocol (GAHP) • Web Service Interface
DRMAA • Distributed Resource Management Application API (DRMAA) • GGF Working Group • An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems • An API with C and Java bindings • not a protocol • Scope • Does: job submission, monitoring, control, final status • Does not: file staging, reservations, security, …
Condor GAHP • The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout • Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events
GAHP, cont. Example:
R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: E
S: RESULTS
R: E
S: COMMANDS
R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
S: VERSION
R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
R: S
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: S
S: RESULTS
R: S 0
S: RESULTS
R: S 1
R: 100 0
S: QUIT
R: S
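A client reading such a session mostly needs to split each reply into fields and check the leading status token. The Python sketch below shows one way to do that. It assumes, as the `NCSA\ CoG\ Gahpd` string in the example suggests, that a backslash escapes a space inside a field, and that replies start with `S` for success or `E` for error. The function names are hypothetical and not part of any Condor library.

```python
from typing import List, Tuple


def split_gahp_fields(line: str) -> List[str]:
    """Split a GAHP line on unescaped spaces; a backslash escapes the next char."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)      # keep the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True          # the next character is taken literally
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_reply(line: str) -> Tuple[bool, List[str]]:
    """Return (succeeded, remaining fields) for a reply such as 'S 1' or 'E'."""
    fields = split_gahp_fields(line)
    return fields[0] == "S", fields[1:]


ok, args = parse_reply(r"S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $")
failed_ping = parse_reply("E")
pending_results = parse_reply("S 1")
```

In the session above, the first `GRAM_PING` is answered with `E` because the proxy has not been initialized yet; after `INITIALIZE_FROM_FILE` the same command returns `S`, and the result for request 100 is picked up later via `RESULTS`.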
Web Service Interfaces • SOAP over http or https to the Condor daemons • Use any language or platform (where you can find a decent SOAP library) • Functionality Exposed in current release • Submit jobs • Retrieve job output • Remove/hold/release jobs • Query machine status (fetch ads from collector) • Query job status (fetch ads from the schedd)
Getting machine status via SOAP (in Java with Axis)
locator = new CondorCollectorLocator();
collector = locator.getcondorCollector(new URL("http://machine:port"));
ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information you don’t have to write any of these functions.
More Functionality changes.. • FINALLY, clean/consistent cross-platform quoting rules for arguments and environment variables (see condor_submit man page) • Schedd can run HawkEye modules, just like the Startd • Enables monitoring on the submit machine • condor_history : now faster than a snail, and cleans up droppings. • DeferralTime, DeferralWindow • Coordinated starts • BIND_ALL_INTERFACES in config file • WANT_REMOTE_IO in job ClassAd
ClassAd Functions in Condor! • Conditionals • IfThenElse(condition,then,else) • String functions • Strcat(), strcmp(), toUpper(), etc. • StringList functions • Example of a “string list” (CSV style) • Mylist = “Joe, Jon, Jeff, Jim, Jake” • StrListContains(), StrListAppend(), StrListRemove(), etc. • Others • Regular expressions, arithmetic, etc…
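For readers unfamiliar with the “string list” idea, the following Python sketch mimics the behavior the slide describes. It is an illustration only: the Python function names simply echo the slide, and the exact ClassAd function names, argument order, and delimiter rules should be taken from the Condor manual rather than from this sketch.

```python
from typing import List


def parse_string_list(value: str) -> List[str]:
    """Split a CSV-style string list, trimming whitespace around each item."""
    return [item.strip() for item in value.split(",") if item.strip()]


def str_list_contains(value: str, member: str) -> bool:
    # Membership is tested per item, not as a substring of the whole string.
    return member in parse_string_list(value)


def str_list_append(value: str, member: str) -> str:
    return ", ".join(parse_string_list(value) + [member])


def str_list_remove(value: str, member: str) -> str:
    return ", ".join(i for i in parse_string_list(value) if i != member)


def if_then_else(condition: bool, then, otherwise):
    # Python stand-in for the IfThenElse(condition, then, else) conditional.
    return then if condition else otherwise


mylist = "Joe, Jon, Jeff, Jim, Jake"
has_jeff = str_list_contains(mylist, "Jeff")
has_je = str_list_contains(mylist, "Je")
longer = str_list_append(mylist, "Jill")
shorter = str_list_remove(mylist, "Jon")
label = if_then_else(has_jeff, "member", "stranger")
```

The per-item comparison is the point of having list functions at all: a plain substring test would wrongly report that “Je” is in the list.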
Accounting Groups and Group Quota Support • Account Group (w/ CORE Feature Animation) • Account Group Quota (inspiration CDF @ Fermi) • Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them • Could use Machine Rank… • but this ties to specific machines • Or could use new group support • Each group can be given a quota in config file • Job ads can specify group membership • Group quotas are satisfied first • Accounting by user and by group
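The sample problem can be worked through with a small Python sketch of “group quotas are satisfied first”. This is a simplified model under stated assumptions, not the negotiator's algorithm: the `allocate` function is hypothetical, and the second pass (handing leftover slots to any remaining demand in order) is an assumption made only to complete the example.

```python
from typing import Dict


def allocate(total_slots: int,
             quotas: Dict[str, int],
             demand: Dict[str, int]) -> Dict[str, int]:
    """Give each group up to its quota first, then hand out what is left."""
    allocation: Dict[str, int] = {}
    remaining = total_slots

    # Pass 1: group quotas are satisfied first.
    for group, quota in quotas.items():
        granted = min(quota, demand.get(group, 0), remaining)
        allocation[group] = granted
        remaining -= granted

    # Pass 2 (assumption): leftover slots go to any unmet demand, in order.
    for group, wanted in demand.items():
        extra = min(wanted - allocation.get(group, 0), remaining)
        allocation[group] = allocation.get(group, 0) + extra
        remaining -= extra

    return allocation


# 500 nodes, Chemistry owns 100 of them; Physics floods the pool with jobs.
result = allocate(
    total_slots=500,
    quotas={"chemistry": 100},
    demand={"physics": 600, "chemistry": 150},
)
```

Chemistry gets its 100 slots no matter how many Physics jobs are queued, and no specific machines are reserved, which is the advantage over the Machine Rank approach.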
100 people surveyed! Favorite “ility”? Universability!
Grid Universe • With new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
universe = grid
gridtype = gt2
• Other gridtypes? • GT2 (Globus Toolkit 2) • GT3 (Globus Toolkit 3.2) • GT4 (Globus Toolkit 3.9.5+) • UNICORE • Nordugrid • PBS (OpenPBS, PBSPro – technology from INFN) • LSF (Platform LSF – technology from INFN) • CONDOR (thanks gLite!) ‘Condor-G’ ‘Condor-C’
Other Grid Universe improvements • Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI http://grid.ncsa.uiuc.edu/myproxy (both GT2 and GT4) • GT4 : we start a GridFTP server behind the scenes • GridFTP server bundled w/ Condor nowadays • Some functionality present in Condor-G added to Condor-C • Forwarding of refreshed credentials (EGEE) • GSI authentication support • Cleaner ClassAd representation (URL)
Parallel Universe • Replaces the “MPI” universe • Allows running arbitrary programs that need to gang-schedule multiple machines • MPICH, LAM, … • FT-MPICH (Seoul National Univ) • Great for testing environments
Hey Jobs! We’re watching you! • Local Universe • Just like Scheduler Universe, but there is a condor_starter • All advantages of the starter [Diagram: Submit side with schedd, starter, and job; Execute side with startd, starter, and job; caption: “Hey, job, behave or else!”]
100 people surveyed! Favorite “ility”? Scalability!
Faster Negotiation • SIGNIFICANT_ATTRIBUTES determined automatically • Job attributes AutoClusterId and AutoClusterAttributes • Rounding of Attributes • Schedd uses non-blocking TCP connects to the startd • Negotiator caching • Collector Forks for queries • More coming…
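The intuition behind auto-clustering and attribute rounding can be sketched in a few lines of Python: jobs whose significant attributes are identical can be considered once as a group, and rounding a noisy attribute makes more jobs identical. Everything here is illustrative. The `auto_cluster` function, the job dictionaries, and the fixed rounding step are hypothetical, and Condor's actual rounding rule and attribute selection may differ.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def round_up(value: int, step: int) -> int:
    """Round value up to the next multiple of step."""
    return ((value + step - 1) // step) * step


def auto_cluster(jobs: List[dict],
                 significant: List[str],
                 round_steps: Optional[Dict[str, int]] = None
                 ) -> Dict[Tuple, List[int]]:
    """Group job ids by the (rounded) values of their significant attributes."""
    round_steps = round_steps or {}
    clusters = defaultdict(list)
    for job in jobs:
        key = []
        for attr in significant:
            value = job.get(attr)
            if attr in round_steps and value is not None:
                value = round_up(value, round_steps[attr])
            key.append(value)
        clusters[tuple(key)].append(job["id"])
    return dict(clusters)


jobs = [
    {"id": 1, "Owner": "alice", "ImageSize": 1010},
    {"id": 2, "Owner": "alice", "ImageSize": 1990},
    {"id": 3, "Owner": "bob", "ImageSize": 1990},
]

exact = auto_cluster(jobs, ["Owner", "ImageSize"])
rounded = auto_cluster(jobs, ["Owner", "ImageSize"], {"ImageSize": 1000})
```

Without rounding, the three jobs fall into three clusters; with rounding, the two jobs from the same owner collapse into one, so a matchmaker would have one fewer distinct request to evaluate.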