Working Group Updates, SSS-OSCAR Releases, API Discussions, External Users, and SciDAC Phase 2
Al Geist
May 10-11, 2005
Chicago, IL
Scalable Systems Software

Participating organizations: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, IBM, Cray, Intel, SGI, NCSA, PSC
Focus areas: Resource Management • Accounting & User Management • System Monitoring • System Build & Configure • Job Management

Problem
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-teraflop systems

Goals
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers

To learn more visit www.scidac.org/ScalableSystems
Scalable Systems Software Suite – Any updates to this diagram?

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite.

[Architecture diagram: components connected by standard XML interfaces over a shared authentication/communication infrastructure]
• Meta Services (Grid interfaces): Meta Scheduler, Meta Monitor, Meta Manager
• Accounting, Scheduler, System & Job Monitor, Node State Manager, Service Directory, Event Manager
• Node Configuration & Build Manager, Allocation Management, Usage Reports, Process Manager, Job Queue Manager, Hardware Infrastructure Manager
• Validation & Testing, Checkpoint/Restart
• Packaged and released as SSS-OSCAR
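The glue in this picture is the set of standard XML interfaces: each component registers with the Service Directory and exchanges XML messages over the common authentication/communication infrastructure (ssslib). As a rough illustration only, here is a minimal Python sketch of composing such a message with ElementTree; the element and attribute names are assumptions, not the published component schemas.

    # Minimal sketch of composing an SSS-style XML query in Python.
    # Element and attribute names below are illustrative assumptions,
    # not the actual schemas voted on by the working groups.
    import xml.etree.ElementTree as ET

    def build_node_query(hostname):
        """Build an XML request asking a monitor component about one node."""
        req = ET.Element("get-node-state")
        node = ET.SubElement(req, "node", name=hostname)
        ET.SubElement(node, "field", name="load")
        ET.SubElement(node, "field", name="memory-free")
        return ET.tostring(req)

    if __name__ == "__main__":
        print(build_node_query("node042").decode())
        # In the real suite this payload would travel through ssslib,
        # which handles service lookup, authentication, and the wire protocol.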
Components in Suites – multiple component implementations exist
• Meta Manager: Grid scheduler
• Meta Services / Meta Monitor: Warehouse
• Node State Manager: NSM
• Scheduler: Maui sched
• System & Job Monitor: Warehouse (also SuperMon, NWPerf)
• Accounting / Allocation Management: Gold
• Service Directory: ssslib
• Build & Configuration Manager: BCM
• Event Manager: EM
• Usage Reports: Gold
• Validation & Testing: APITest
• Process Manager: PM
• Queue Manager: Bamboo (compliant with PBS and LoadLeveler job scripts)
• Hardware Infrastructure Manager: HIM
• Checkpoint/Restart: BLCR
Scalable Systems Users
• Production use today:
  • Running an SSS suite at ANL and Ames
  • Running components at PNNL
  • Maui w/ SSS API (3000/mo), Moab (Amazon, Ford, TeraGrid, …)
• Who can we involve before the end of the project?
  • National Leadership-class Facility? NLCF is a partnership between ORNL (Cray), ANL (BG), PNNL (cluster)
  • NERSC and NSF centers
  • NCSA cluster(s)
  • NERSC cluster?
Goals for This Meeting
• Updates on the Integrated Software Suite components
• Planning for SciDAC phase 2 – discuss new directions and the June SciDAC meeting
• Preparing for the next SSS-OSCAR software suite release – what is missing? What needs to be done?
• Getting more outside users – production use and feedback to the suite
• Discussion of involvement with NLCF machines: IBM BG/L, Cray XT3, clusters
Highlights of Last Meeting (Jan. 25-26 in DC) – details in the main project notebook
• Fred attended – he gave the state of MICS, SciDAC-2, and his vision for a changed focus
• Discussion of the whitepaper and presentation for Strayer – ideas and Fred's feedback
• API discussions – voted to approve the Process Manager API (12 yes, 0 no, 0 abstain)
• New Warehouse protocol presented
• Agreed to quarterly suite releases this year, and set the dates
Since Last Meeting
• CS ISICs met with SciDAC director (Strayer), Feb 17 in DC
  • Whitepaper – some issues with Mezzacappa
  • Gave an hour "highlight" presentation on goals, impact, and potential CS ISIC ideas for the next round
  • Strayer was very positive. Fred reported that the meeting could not have gone any better.
• Cray Software Workshop (called by Fred)
  • January in Minneapolis
  • Status of Cray SW and how DOE research could help
  • Several SSS members were there. Anything since?
• Telecons and new entries in electronic notebooks
  • Pretty sparse since the last meeting
Major Topics for This Meeting
• Latest news on the Software Suite components
• Preparing for the next SSS-OSCAR software suite release
• Discuss ideas for the next round of CS ISICs
• Preparation for upcoming meetings in June
• Presentation and first vote on the Queue Manager API
• Getting more users and feedback on the suite
Agenda – May 10
8:00  Continental breakfast
8:30  Al Geist – Project status
9:00  Discussion of ideas presented to Strayer
9:30  Scott Jackson – Resource Management components
10:30 Break
11:00 Will McLendon – Validation and Testing; Ron Oldfield – integrated SSS test suites
12:00 Lunch (on own at cafeteria)
1:30  Paul Hargrove – Process Management and Monitoring
2:30  Narayan Desai – Node Build, Configure, and Cobalt on BG/L
3:30  Break
4:00  Craig Steffen – SSSRMAP in ssslib
4:30  Discussion of getting SSS users and feedback
5:30  Adjourn for dinner
Agenda – May 11
8:00  Continental breakfast
8:30  Thomas Naughton – SSS-OSCAR software releases through SC05
9:30  Discussion and voting: Bret Bode – XML API for Queue Manager
10:30 Group discussion of ideas for SciDAC-2
11:00 Preparations for upcoming meetings
      • FastOS meeting, June 8-10
      • SciDAC PI meeting, June 26-30 (poster and panels)
      • Set next meeting date/location: August 17-19, ORNL
12:00 Meeting ends
Ideas Presented to SciDAC Director Mike Strayer
February 17, 2005, Washington DC
View to the Future – HW, CS, and Science Teams all contribute to the science breakthroughs

[Diagram: Ultrascale hardware (Rainer, Blue Gene, Red Storm) and OS/HW teams feed the leadership-class platforms; SciDAC CS teams contribute software & libraries and a computing environment with a common look & feel across diverse HW; SciDAC science teams bring the high-end science problem and research team; together these produce tuned codes and breakthrough science.]
SciDAC Phase 2 and CS ISICs
• Future CS ISICs need to be mindful of the needs of:
  • National Leadership Computing Facility – w/ Cray, IBM BG, SGI, clusters, multiple OS; no one architecture is best for all applications
  • SciDAC science teams – needs depend on the application areas chosen; end stations? Do they have special SW needs?
  • FastOS research projects – complement, don't duplicate these efforts
  • Cray software roadmap – making the leadership computers usable, efficient, fast
Gaps and Potential Next Steps
• Heterogeneous leadership-class machines – science teams need a robust environment that presents similar programming interfaces and tools across the different machines
• Fault tolerance requirements in apps and systems software – particularly as systems scale up to petascale around 2010
• Support for application users submitting interactive jobs – computational steering as a means of scientific discovery
• High-performance file system and I/O research – increasing demands of security, scalability, and fault tolerance
• Security – one-time passwords and their impact on scientific progress
Heterogeneous Machines
• Heterogeneous architectures – vector, scalar, SMP, hybrids, clusters. How is a science team to know what is best for them?
• Multiple OS – even within one machine, e.g., Blue Gene, Red Storm. How to effectively and efficiently administer such systems?
• Diverse programming environment – science teams need a robust environment that presents similar programming interfaces and tools across the different machines
• Diverse system management environment – managing and scheduling multiple node types; system updates, accounting, … everything will be harder in round 2
Fault Tolerance
• Holistic fault tolerance – research into schemes that take into account the full impact of faults: application, middleware, OS, and hardware
• Fault tolerance in systems software – research into prediction and prevention; survivability and resiliency when faults cannot be avoided
• Application recovery – transparent failure recovery; research into intelligent checkpointing based on active monitoring, sophisticated rule-based recovery, diskless checkpointing, …
• For petascale systems, research into recovery w/o checkpointing
Interactive Computing
• Batch jobs are not always the best for science – good for large numbers of users and a wide mix of jobs, but the National Leadership Computing Facility has a different focus
• Computational steering as a paradigm for discovery – break the cycle: simulate, dump results, analyze, rerun simulation; more efficient use of the computer resources
• Needed for application development – scaling studies on terascale systems; debugging applications that only fail at scale
File System and I/O Research
• Lustre is today's answer – there are already concerns about its capabilities as systems scale up to 100+ TF
• What is the answer for 2010? Research is needed to explore the file system and I/O requirements of the petascale systems that will be here in 5 years
• I/O continues to be a bottleneck in large systems – hitting the memory access wall on a node; too expensive to scale I/O bandwidth with teraflops across nodes
• Research is needed to understand how to structure applications or modify I/O so that applications run efficiently
Security
• New, stricter access policies at computer centers – attacks on supercomputer centers have gotten worse
• One-time passwords, PIV? Sites are shifting policies, tightening firewalls, going to SecurID tokens
• Impact on scientific progress – collaboration within international teams; foreign nationals' clearance delays; access to data and computational resources
• Advances required in system software – to allow compliance with different site policies and handle the tightest requirements; study how to reduce the impact on scientists
Meeting notes
Al Geist – project status
Al Geist – ideas for CS ISICs in the next round of SciDAC

Scott Jackson – Resource Management update
• Production use at more places, e.g., U. Utah Icebox (430 proc)
• Incorporation of SSSRMAP into ssslib in progress
• Paper accepted and new documents (see RM notebook)
• SOAP as the basis for SSSRMAP v4 – discussion of pros and cons (scalability issues, but ssslib can support it)
• Fault tolerance in Gold using hot failover
• New Gold release v2 b2.10.2 includes distributed accounting, simplified allocation management, and support for the MySQL database
• Bamboo QM v1.1 released
• New Fountain component as an alternative to Warehouse; work on support for SuperMon, Ganglia, and NWPerf
• Maui – improved grid scheduler, multisite authentication, support for Globus 4
• Future work – increase deployment base, ssslib integration, portability, support for LoadLeveler-like multi-step jobs and the PBS job language, release of the Silver meta-scheduler
Meeting notes
Will McLendon – APITest project status
• Current release v1.0
• Latest work – new look using cascading style sheets; new capabilities – pass/fail batch files (see the sketch after these notes), better parse-error reporting
• User Guide documentation done (50 pages) and SNL approved
• SW requirements: Python 2.3+, ElementTree, MySQL, ssslib, Twisted (version 2.0 added new dependencies)
• Helping fix bad tests – led to a good discussion of this utility
• Future work: config file, test developer GUI, more…

Ron Oldfield – testing SSS suites
• Two weeks ago hired a full-time contractor (Tod Cordenbach) plus a summer student
• Goals and deliverables for the summer work: performance testing of SSS-OSCAR, comparison to other components, write a tech report of the results
• What is important for each component: scheduler, job launch, queue, I/O, …
• Discussion of metrics. Scalability? User time, admin time, HW resource efficiency
• Report what works, what doesn't, and what is performance critical
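As a rough illustration of the pass/fail style of check that APITest batch files drive against the component XML interfaces, here is a minimal Python sketch; the fake reply, tag names, and check are assumptions for demonstration, not the real component schemas or the actual APITest file format.

    # Illustrative pass/fail check in the spirit of an APITest test case.
    # The fake reply and expected tag are assumptions for demonstration only.
    import xml.etree.ElementTree as ET

    def check_response(reply_xml, expected_tag):
        """Return True if a component's XML reply parses and has the expected root tag."""
        try:
            root = ET.fromstring(reply_xml)
        except ET.ParseError:
            return False              # an unparseable reply counts as a failure
        return root.tag == expected_tag

    # Example: a made-up reply that a queue-manager status query might return.
    fake_reply = "<queue-status><job id='17' state='queued'/></queue-status>"
    print("PASS" if check_response(fake_reply, "queue-status") else "FAIL")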
Meeting notes
Paul Hargrove – PM update
• Checkpoint (BLCR) status: users on four continents, bug fixes, works with Linux 2.6.11, partial AMD64/EM64T port
• Next step is process groups/sessions
• OpenMPI work this summer (student of Lumsdaine)
• Have a sketch of a less restrictive syntax API
• Process Manager status: complete rewrite of MPD – more OO and pythonic; provided a non-MPD implementation for BG/L using the SSS API (see the sketch after these notes)

Narayan Desai – BCM update
• SSS infrastructure in use at ANL: clusters, BG/L, IA32, PPC64
• Better documentation
• LRS syntax: spec done, SDK complete; to do: ssslib integration
• BG/L: arrived in January, initial Cobalt (SSS) suite on it in February; many features being requested, e.g., node modes set in mpirun; DB2 used for everything
• Cobalt – same as the SW on Chiba City; all-Python components implemented using the SSS-SDK; several major extensions required for BG/L
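A rough sketch of the kind of client-side request the rewritten, SSS-API-based process manager might accept when creating a process group; the element and attribute names are assumptions, not the Process Manager API that was voted on.

    # Rough sketch of building a process-group creation request in Python.
    # Element/attribute names are illustrative assumptions; the authoritative
    # format is the Process Manager API voted on at the January meeting.
    import xml.etree.ElementTree as ET

    def create_process_group(executable, nodes, procs_per_node=1):
        """Compose an XML request to start one executable across several nodes."""
        pg = ET.Element("create-process-group")
        ET.SubElement(pg, "process-spec", {
            "exec": executable,
            "nodes": " ".join(nodes),
            "ppn": str(procs_per_node),
        })
        return ET.tostring(pg)

    print(create_process_group("/bin/hostname", ["n001", "n002"]).decode())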
Meeting notes
Narayan Desai – Cobalt update for BG/L
• Scheduler (bgsched): new implementation – needed to be topology aware and use DB2; the partition unit is 512 nodes (see the partition-selection sketch after these notes)
• Queue Manager (cqm): same SW as Chiba City; the OS change on BG/L is trivial since the system is rebooted for each job
• Process Manager (bgpm): new implementation – compute nodes don't run a full OS, so no MPD; mpirun is complicated
• Allocation Manager (am): same as Chiba City; very simple design
• Experiences: SSS really works – easy to port; the simple approach makes the system easy to understand; agility required for BG/L
• Comprehensive interfaces expose all information – admins can access internal state, component behavior is less mysterious, extracting new info is easy
• Shipping Cobalt to a couple of other sites
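To make the topology constraint concrete, here is a simplified Python sketch of partition-level selection on BG/L; the partition names, sizes, and states are invented for illustration, and the real bgsched reads this information from DB2.

    # Simplified illustration of partition-level scheduling on BG/L.
    # Partition data below is invented; Cobalt's bgsched gets partition and
    # wiring information from DB2 and must respect the machine topology.
    PARTITION_SIZE = 512  # scheduling unit on BG/L

    partitions = [
        {"name": "R00-M0", "nodes": 512,  "state": "free"},
        {"name": "R00-M1", "nodes": 512,  "state": "busy"},
        {"name": "R01",    "nodes": 1024, "state": "free"},
    ]

    def pick_partition(requested_nodes):
        """Return the smallest free partition that satisfies the request."""
        needed = max(requested_nodes, PARTITION_SIZE)  # can't allocate below one partition
        candidates = [p for p in partitions
                      if p["state"] == "free" and p["nodes"] >= needed]
        return min(candidates, key=lambda p: p["nodes"]) if candidates else None

    print(pick_partition(300))   # fits in a single 512-node partition
    print(pick_partition(600))   # needs the 1024-node partition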
Meeting notes
Craig Steffen – (no slides)
• Not as much to report – sidetracked for the past three months on other projects; gives reasons
• Warehouse bugs also not done; fixes to be done by the next OSCAR release
• Graphical display for Warehouse created – same interfaces as Maui w.r.t. requesting everything from all nodes
• SSSRMAP into ssslib – initial skeleton code for integration into ssslib begun; needs questions answered by Jackson and Narayan to proceed
Meeting notes
Thomas Naughton – SSS-OSCAR releases
• Testing for the v1.1 release; base OSCAR v4.1 includes SSS
• APITest runs post-install tests on packages
• Discussion that Debian support will require both RPM and DEB formats
• Future work: complete v1.1 testing, migrate distribution to the FRE repository, extend SSS component tests, distribute as a basic OSCAR "package set"; ordering within a phase is needed (work around for now)
• Release schedule (version – freeze – release – new):
  v1.0   –        Nov (SC04)  first full suite release
  v1.1   Feb 15   May         Gold update, bug fixes
  v1.2   Jun 15   July        RH9 to Fedora 2, OSCAR 4.1, BLCR to Linux 2.6, improved tests, close known bug reports
  v2.0b  Aug 15   Sept        less restrictive syntax switch-over, perf tests, Silver meta-scheduler, Fedora 4
  v2.0   Oct 15   Nov (SC05)  bug fixes, minor updates
• In OSCAR 5.0 as a package set (after SC05)
• Remove the Bugzilla link from the web page
Meeting notes
Bret Bode – Queue Manager API
• Lists all the functions, then goes through the detailed schema of each
• Bamboo uses SSSRMAP messaging and the wire protocol
• Authentication – uses ssslib
• Authorization – uses info in the SSSRMAP wire protocol
• Questions and discussion of the interfaces
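For orientation, a hedged sketch of what an SSSRMAP-style submission to the Queue Manager could look like; the Envelope/Body/Request/Object shape follows the general SSSRMAP pattern, but the specific action name and job properties here are assumptions rather than the API presented for the vote.

    # Hedged sketch of composing an SSSRMAP-style request body in Python.
    # The overall Envelope/Body/Request/Object structure follows the general
    # SSSRMAP pattern; the specific action and job fields are assumptions.
    import xml.etree.ElementTree as ET

    envelope = ET.Element("Envelope")
    body = ET.SubElement(envelope, "Body")
    request = ET.SubElement(body, "Request", action="Submit")
    ET.SubElement(request, "Object").text = "Job"
    data = ET.SubElement(request, "Data")
    job = ET.SubElement(data, "Job")
    ET.SubElement(job, "Executable").text = "/home/user/a.out"
    ET.SubElement(job, "NodeCount").text = "4"

    # In the suite, ssslib would add the authentication pieces required by
    # SSSRMAP and deliver the message over the common wire protocol.
    print(ET.tostring(envelope).decode())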