Explore the roadmap for Condor Version 6.7.x focusing on scalability, resources, failover, and accessibility. Learn about key improvements such as increased job capacity, better matchmaking, and enhanced security.
Outline • The “Big Picture” • Version 6.7.x • Availability • Failover • Scalability • Resources, jobs, matchmaking framework, files • Accessibility • APIs, more Grid middleware, network
Big Picture What do we want to achieve in a new Condor developer series? • Technology Transfer • Building a bridge between the Condor production software development activity and the academic core research activity • BAD-FS, Stork, Diskrouter, Parrot (transparent I/O), Schedd Glidein, VO Schedulers, HA, Management, Improved ClassAds…
What do we want to achieve, cont. • New Ports: Go to where the cycles are! • The Red Hat Dilemma • Our porting ‘hopper’: • AIX 5.1L on the PowerPC architecture • Red Hat AS server on x86 • Fedora Core on x86 • Fedora Core 2 on x86 • Red Hat AS server on AMD64 • SuSE 8.0 on AMD64 • Red Hat AS server on IA64 • HPUX 11.11 64-bit
What do we want to achieve, cont. • Improve existing ports • Move “clipped wing” ports to full ports (w/ checkpoint, process migration) • Mac OS X, Windows • Better integration into environments • Windows: operate better w/ DFS, use MSI • Unix: operate w/ AFS
What do we want to achieve, cont. • Address changes in the computing landscape • Firewalls, NATs • 64-bit operating systems • Emphasis on data • Movement towards standards such as WS, OGSA, …
Version 6.7.x Theme • Version 6.7.x • Scalability • Resources, jobs, matchmaking framework, security • Availability • Failover • Accessibility • APIs, more Grid middleware, network
High Availability in v6.7.x What happens if my submit machine reboots? Once upon a time, only one answer: job restarts. Checkpoint? No Checkpoint?
New: Job Progress continues if connection is interrupted • For Vanilla and Java universe jobs, Condor now supports reestablishment of the connection between the submitting and executing machines. • To take advantage of this feature, put the following line into your job’s submit description file: JobLeaseDuration = <N seconds> For example: JobLeaseDuration = 1200
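A minimal sketch of what such a submit description file might look like, generated from Python so the lease value can be checked before writing. The helper name `render_submit_description` and the executable name `my_job` are hypothetical; only the `JobLeaseDuration` line comes from the slide.

```python
def render_submit_description(executable, universe="vanilla", lease_seconds=1200):
    """Return the text of a submit description file that requests a job lease.

    `executable` is a placeholder name; `lease_seconds` is the N from the slide.
    """
    if lease_seconds <= 0:
        raise ValueError("lease_seconds must be positive")
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",           # hypothetical executable
        f"JobLeaseDuration = {lease_seconds}",  # line shown on the slide
        "queue",
    ]
    return "\n".join(lines) + "\n"


submit_text = render_submit_description("my_job")
print(submit_text)
```

The generated text would be saved to a file and handed to condor_submit as usual.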
What if the submission point spontaneously explodes? (don’t try this at home)
More High Availability Solutions • Condor can support a submit machine “hot spare” • If your submit machine is down for longer than N minutes, a second machine can take over • Two mechanisms available • Job Mirroring • Described by Jaime earlier today • High Availability Daemon Failover • Just tell the condor_master to run ONE instance
Daemon Failover [Diagram: Machine A and Machine B. Labels: Master, SchedD, Active, Active (hot spare). Lock steps: Obtain Lock, Refresh Lock, Check Lock.]
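The lock steps named on this slide (obtain, refresh, check) can be sketched as a small simulation. This is an illustration of the general lock-with-timeout idea, not Condor's implementation: the in-memory `Lock`, the `tick` function, and the 300-second `LOCK_TIMEOUT` are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

LOCK_TIMEOUT = 300.0  # assumed: seconds without a refresh before the lock is stale


@dataclass
class Lock:
    """In-memory stand-in for the shared lock both machines can see."""
    holder: Optional[str] = None
    refreshed_at: float = 0.0


def tick(machine: str, lock: Lock, now: float) -> bool:
    """One cycle on `machine`; returns True if it should run the single instance."""
    if lock.holder == machine:
        lock.refreshed_at = now  # Refresh Lock
        return True
    if lock.holder is None or now - lock.refreshed_at > LOCK_TIMEOUT:
        lock.holder = machine    # Obtain Lock (free or stale)
        lock.refreshed_at = now
        return True
    return False                 # Check Lock: the other machine still holds it


lock = Lock()
history = [
    tick("A", lock, 0.0),    # A obtains the lock
    tick("B", lock, 10.0),   # B checks, stays a spare
    tick("A", lock, 60.0),   # A refreshes
    tick("B", lock, 200.0),  # A is down, but the lock is not stale yet
    tick("B", lock, 400.0),  # lock is stale, B takes over
]
print(history, lock.holder)
```

The timeout plays the role of the "down for longer than N minutes" condition: the spare takes over only after the lock has gone unrefreshed for that long.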
Accessibility • Support for GCB • Condor working w/ NATs, Firewalls • Distributed Resource Management Application API (DRMAA) • GGF Working Group • An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems • Condor DRMAA interface to appear in v6.7.0
SOAP/Grid Service [Diagram: condor_schedd reachable via Cedar, as a Web Service (SOAP over HTTPS), and via OGSI (SOAP over HTTPG).]
New “Grid Universe” • With new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as: universe = grid gridtype = gt2 • Other gridtypes? GT3 for OGSA-based Globus Toolkit 3
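The change from the old declaration to the new one can be sketched as a small text rewrite. The function name `migrate_universe` and the sample submit description (including the executable name `my_job`) are hypothetical; the `universe = grid` and `gridtype = gt2` lines are the ones given on the slide.

```python
def migrate_universe(submit_text: str) -> str:
    """Rewrite 'universe = globus' into the new grid universe form."""
    out = []
    for line in submit_text.splitlines():
        key, sep, value = line.partition("=")
        is_old_form = (
            sep == "="
            and key.strip().lower() == "universe"
            and value.strip().lower() == "globus"
        )
        if is_old_form:
            out.append("universe = grid")
            out.append("gridtype = gt2")  # old globus universe maps to gt2
        else:
            out.append(line)              # every other line is left unchanged
    return "\n".join(out)


old = "universe = globus\nexecutable = my_job\nqueue"
new = migrate_universe(old)
print(new)
```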
Condor-G improvements • Condor-G can submit to either Globus GT2 or GT3 resources, including support for GT3 with web services. • Condor-G includes everything required; no need for client to have a GT3 installation. • Good migration path to OGSA • Condor-G to Nordugrid, Unicore, Condor, ORACLE • Support for credential refresh via the MyProxy Online Credential Management in NMI http://grid.ncsa.uiuc.edu/myproxy/
Why Condor + MyProxy? • Long-lived tasks or services need credentials • Task lifetime is difficult to predict • Don’t want to delegate long-lived credentials • Fear of compromise • Instead, renew credentials with MyProxy as needed during the task’s lifetime • Provides a single point of monitoring and control • Renewal policy can be modified at any time • For example, disable renewals if compromise is detected or suspected
Credential Renewal [Diagram: Home side with the Condor-G Scheduler and MyProxy; Remote side with the Resource Manager and the Job. Labels: Submit Jobs, Enable Renewal, Retrieve Credentials, Launch Job, Refresh Credentials.]
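The "renew as needed" policy can be sketched as follows. This does not use the MyProxy API: `RenewalService`, `Credential`, `maybe_renew`, and all the numbers are assumptions chosen to illustrate short-lived credentials, on-demand renewal, and a central switch that disables renewals.

```python
from dataclasses import dataclass


@dataclass
class Credential:
    expires_at: float  # absolute expiry time, in seconds


class RenewalService:
    """Stand-in for a credential repository; not the MyProxy interface."""

    def __init__(self, lifetime: float):
        self.lifetime = lifetime  # short lifetime issued on each renewal
        self.enabled = True       # single point of control

    def renew(self, now: float) -> Credential:
        if not self.enabled:
            raise PermissionError("renewal disabled")
        return Credential(expires_at=now + self.lifetime)


def maybe_renew(cred: Credential, service: RenewalService,
                now: float, threshold: float) -> Credential:
    """Renew only when the remaining lifetime is at or below `threshold`."""
    if cred.expires_at - now > threshold:
        return cred
    return service.renew(now)


service = RenewalService(lifetime=3600.0)
cred = service.renew(now=0.0)
cred = maybe_renew(cred, service, now=1000.0, threshold=600.0)  # plenty left, unchanged
cred = maybe_renew(cred, service, now=3200.0, threshold=600.0)  # near expiry, renewed
service.enabled = False  # e.g. compromise suspected
try:
    maybe_renew(cred, service, now=6500.0, threshold=600.0)
    renewal_blocked = False
except PermissionError:
    renewal_blocked = True
print(cred.expires_at, renewal_blocked)
```

Because the job only ever holds a short-lived credential, switching off renewals bounds how long a compromised credential stays usable.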
More… • Condor can now transfer job data files larger than 2 GB in size. • On all platforms that support 64-bit file offsets • Real-time spooling of stdout/stderr/stdin in any universe, including Vanilla • Real-time monitoring of job progress
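The 2 GB figure matches the largest value a signed 32-bit file offset can hold, which is why 64-bit offsets are the stated requirement. A short worked check, with hypothetical helper names `max_offset` and `fits`:

```python
def max_offset(bits: int) -> int:
    """Largest byte offset representable in a signed integer of `bits` bits."""
    return 2 ** (bits - 1) - 1


def fits(file_size: int, bits: int) -> bool:
    """True if every offset up to `file_size` fits in a signed `bits`-bit integer."""
    return 0 <= file_size <= max_offset(bits)


three_gib = 3 * 1024 ** 3  # example file size: 3 GiB

print(max_offset(32))       # 2147483647, just under 2 GiB
print(fits(three_gib, 32))  # False
print(fits(three_gib, 64))  # True
```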