The Roadmap to New Releases Todd Tannenbaum Department of Computer Sciences University of Wisconsin-Madison http://www.cs.wisc.edu/condor condor-admin@cs.wisc.edu
Stable vs. Development Series • Much like the Linux kernel, Condor provides two different releases at any time: • Stable series • Development series • Allows Condor to be both a research project and a production-ready system
Stable series • Series number in version is even (e.g. 6.2.0) • Releases are heavily tested • Only bug fixes and ports to new platforms are added on a stable series
Stable series (cont.) • A given stable release is always compatible with other releases from the same series • Recommended for production pools
Development Series • Series number in the version is odd (e.g. 6.1.17, 6.3.1) • New features and new technology are added frequently • Versions from the same development series are not always compatible with each other
Development Series (cont.) • Releases are not as heavily tested • Generally not recommended for production pools • … unless new features are required • … unless we recommend otherwise :^)
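The even/odd numbering rule can be checked mechanically. The sketch below assumes the "series number" is the second dot-separated component of the version string, which matches the examples on the slides (6.2.0, 6.1.17, 6.3.1); the function names are made up for illustration and are not part of Condor.

```python
def series_kind(version: str) -> str:
    """Classify a version string such as '6.3.2' by its series number."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.series, got {version!r}")
    series = int(parts[1])  # assumption: second component is the series number
    return "stable" if series % 2 == 0 else "development"


def same_series(a: str, b: str) -> bool:
    """True when two versions share the same major and series numbers."""
    return a.split(".")[:2] == b.split(".")[:2]


for v in ("6.2.0", "6.1.17", "6.3.1", "6.4.0"):
    print(v, series_kind(v))
```

A helper like `same_series` is only a hint about compatibility: releases within one stable series are compatible, while releases within one development series may not be.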
Where is Condor Today? • Version 6.3.2 is being released ASAP; this is the v6.4.0 release candidate. • We expect version 6.4.0 to be released by the end of March.
New Ports in 6.4.0 • Full support (with checkpointing and remote system calls): • RedHat 7.x (Linux 2.4.x kernel + glibc 2.2.x)
New Ports in 6.4.0 (cont.) • "Clipped" support (no checkpointing, PVM, or remote system calls, but all other functionality is available) • Windows 2000 • Mac OS X
Secure Communication • Secure network communication • Strong user authentication • Multiple methods supported: Kerberos, X509, NT LanMan, … • Encryption • Integrity • Authorization based on host or user
New Job Universes • MPI Universe • Launch MPI jobs linked with MPICH library • Globus Universe • Faster, more reliable, better integrated • Java Universe
Java Universe Job
universe = java
executable = Main.class
jar_files = MyLibrary.jar
input = infile
output = outfile
arguments = Main 1 2 3
queue
Submit with condor_submit
Why not use Vanilla Universe for Java jobs? • Java Universe provides more than just inserting “java” at the start of the execute line • Knows which machines have a JVM installed • Knows the location, version, and performance of JVM on each machine • Provides more information about Java job completion than just JVM exit code • Program runs in a Java wrapper, allowing Condor to report Java exceptions, etc.
Java support, cont.
condor_status -java
Name            JavaVendor    Ver     State     Activity   LoadAv   Mem
aish.cs.wisc.   Sun Microsy   1.2.2   Owner     Idle       0.000    249
anfrom.cs.wis   Sun Microsy   1.2.2   Owner     Idle       0.030    249
babe.cs.wisc.   Sun Microsy   1.2.2   Claimed   Busy       1.120    123
...
Condor File Transfer • Condor will transfer job files from the submit machine to the execute machine • Files to send and/or receive are specified at submit time • Transfer is atomic • All files are transferred, or the transfer fails • In v6.2, available only in Condor for Windows
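The all-or-nothing behavior can be illustrated with a staging directory. This is a minimal sketch of the general technique, not Condor's implementation; `transfer_all_or_nothing` is a hypothetical name, and the copy is local rather than over the network.

```python
import os
import shutil
import tempfile


def transfer_all_or_nothing(files, source_dir, dest_dir):
    """Copy every named file into dest_dir, or deliver none if a copy fails."""
    # Stage inside dest_dir so the final renames stay on one filesystem.
    staging = tempfile.mkdtemp(dir=dest_dir, prefix=".staging-")
    try:
        for name in files:
            shutil.copy2(os.path.join(source_dir, name),
                         os.path.join(staging, name))
        # Only reached when every copy succeeded: publish the files.
        for name in files:
            os.replace(os.path.join(staging, name),
                       os.path.join(dest_dir, name))
    finally:
        shutil.rmtree(staging, ignore_errors=True)


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ("x", "y"):
        with open(os.path.join(src, name), "w") as f:
            f.write(name)
    try:
        transfer_all_or_nothing(["x", "y", "z"], src, dst)  # "z" is missing
    except FileNotFoundError:
        pass
    leftover = sorted(os.listdir(dst))   # nothing delivered after the failure
    transfer_all_or_nothing(["x", "y"], src, dst)
    delivered = sorted(os.listdir(dst))
    print(leftover, delivered)
```

The publish step is a series of renames rather than one atomic operation, so this only approximates atomicity: a failure is far more likely during the copy phase, where it leaves the destination untouched.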
File Transfer, cont. • Example:
transfer_input_files = x, y, z …
transfer_output_files = a, b, c …
transfer_files = [ ALWAYS | ONEXIT ]
• Note: Condor can automatically figure out output files • Default: Send back any new/changed files
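One way to picture the "send back any new/changed files" default is a before/after snapshot of the job's scratch directory. The sketch below is an assumption about how such detection could work, comparing modification time and size per file; it is not taken from the Condor source, and the helper names are hypothetical.

```python
import os
import tempfile


def snapshot(directory: str) -> dict:
    """Map each regular file name in `directory` to (mtime_ns, size)."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                info = entry.stat()
                result[entry.name] = (info.st_mtime_ns, info.st_size)
    return result


def new_or_changed(before: dict, after: dict) -> list:
    """Names missing from `before` or whose (mtime, size) signature differs."""
    return sorted(name for name, sig in after.items() if before.get(name) != sig)


with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "x"), "w") as f:
        f.write("input")
    before = snapshot(scratch)             # state when the job starts
    with open(os.path.join(scratch, "a"), "w") as f:
        f.write("result")                  # new file created by the job
    with open(os.path.join(scratch, "x"), "a") as f:
        f.write(" modified")               # existing file changed (size grows)
    changed = new_or_changed(before, snapshot(scratch))
    print(changed)
```

Listing files explicitly with transfer_output_files avoids relying on this kind of detection.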
Remote I/O Socket • A job can request that the condor_starter process on the execute machine create a Remote I/O Socket • Used for online access to files on the submit machine, without the Standard Universe • Usable in Vanilla, Java, … • Libraries provided for Java and for C, e.g.:
Java: FileInputStream -> ChirpInputStream
C: open() -> chirp_open()
[Figure: Remote I/O architecture. Submission site: shadow, I/O server, home file system. Execution site: starter, I/O proxy, job with I/O library. Labeled links: secure remote I/O, local I/O (Chirp), fork, local system calls.]
Job Policy Expressions • User can supply job policy expressions in the submit file. • Can be used to describe a successful run.
on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>
Job Policy Examples
• Do not remove if exits with a signal:
on_exit_remove = ExitBySignal == False
• Place on hold if exits with nonzero status or ran for less than an hour:
on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
• Place on hold if job has spent more than 50% of its time suspended:
periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
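To see how these expressions evaluate, the sketch below re-expresses the three examples as Python predicates over a dictionary of job attributes. It is an illustration only: Condor evaluates ClassAd expressions, not Python. "Nonzero status" is read here as a nonzero ExitCode, the remaining attribute names are taken from the slide as-is, and the sample values are invented.

```python
def on_exit_remove(job: dict) -> bool:
    """Leave the queue only when the job did not exit by a signal."""
    return job["ExitBySignal"] is False


def on_exit_hold(job: dict) -> bool:
    """Hold on a nonzero exit code, or on a run shorter than one hour."""
    failed = (not job["ExitBySignal"]) and job["ExitCode"] != 0
    too_short = (job["ServerStartTime"] - job["JobStartDate"]) < 3600
    return failed or too_short


def periodic_hold(job: dict) -> bool:
    """Hold when more than half of the wall-clock time was spent suspended."""
    return job["CumulativeSuspensionTime"] > job["RemoteWallClockTime"] / 2.0


# Hypothetical job: ran 8000 seconds, exited normally with status 1.
job = {
    "ExitBySignal": False,
    "ExitCode": 1,
    "JobStartDate": 2000,
    "ServerStartTime": 10000,
    "CumulativeSuspensionTime": 100,
    "RemoteWallClockTime": 8000,
}

print(on_exit_remove(job), on_exit_hold(job), periodic_hold(job))
```

For this sample job, the hold expression fires because of the nonzero exit code, while the suspension check does not (100 seconds is well under half of 8000).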
Firewall Support • Port Restrictions • In the condor_config file you can specify:
LOWPORT = x
HIGHPORT = y
• All dynamic ports will be between x and y inclusive • Condor + Firewalls/Private Networks: • Who: Se-Chang Son • Time: 9am-12pm Weds • Where: rm 3387
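The effect of LOWPORT and HIGHPORT can be pictured as choosing a port from a closed range. The sketch below is a generic illustration of that idea, not Condor's port-selection code; `pick_dynamic_port` and `port_is_free` are hypothetical names, and the range 9600 to 9700 is an arbitrary example.

```python
import socket


def port_is_free(port: int) -> bool:
    """Probe by binding a TCP socket on localhost; False if the bind fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind(("127.0.0.1", port))
        except OSError:
            return False
    return True


def pick_dynamic_port(low: int, high: int, is_free=port_is_free) -> int:
    """Return the first free port in [low, high], both ends inclusive."""
    if low > high:
        raise ValueError("LOWPORT must not exceed HIGHPORT")
    for port in range(low, high + 1):   # high + 1 keeps the upper bound inclusive
        if is_free(port):
            return port
    raise RuntimeError(f"no free port between {low} and {high}")


# Demo with a stand-in predicate so the result does not depend on the host:
# pretend only odd-numbered ports are free.
print(pick_dynamic_port(9600, 9700, is_free=lambda p: p % 2 == 1))
```

A firewall then only needs to admit the range x through y. Note that probing and later binding are separate steps, so a real daemon would bind directly and retry on failure rather than rely on the probe.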
Condor on Windows • On both NT and Win2k • New universes added: MPI, Java, Scheduler (and Globus in the works!) • DAGMan ported • CondorView ported • Run shadow + DAGMan as the user • Allows submission from directories on shared filesystems
And more… • Unix Man pages • Fetch/consolidate log files remotely • ClassAd chaining • Many DAGMan improvements • Bug fixes, etc…
What’s Next? Future Directions • Increased focus on standalone tools built with Condor Technology • DAGMan • NeST • PFS • HawkEye • Condor-G …
What’s Next? • Big Item: More focus on being a service provider than just an end-user tool: • Developer APIs / libraries • SOAP access to services • XML representations of user logs, ClassAds, accounting info, etc.
More what’s next… • Condor on Windows • Increased support from Microsoft Research • Remote I/O • Complete Shared Filesystem support • Condor-G • MPI Scheduling Improvements
More what’s next… • New version of ClassAds into Condor • Conditionals !! • if/then/else • Aggregates (lists, nested classads) • Built-in functions • String operations, pattern matching, time operators, unit conversions • Clean implementations in C++ and Java • ClassAd collections
More what’s next… • Re-write of the condor_schedd • Performance enhancements and lowered resource requirements (particularly RAM) • Re-write of the checkpoint server • Add secure communication • NeST technology infusion • Enhanced support for multiple servers • Store meta-data along with checkpoint files