Progress on Release, API Discussions, Vote on APIs, and Quarterly Report
Al Geist
May 6-7, 2004
Chicago, IL
Participating Organizations
Coordinator: Al Geist
ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, NCSA, PSC, SDSC, IBM, SGI, Cray, Intel
How do we position ourselves for the DOE Ultrascale facility winner, to be announced May 12?
Regardless of who is chosen, we should try to be in a position to help with the system software needs of the facility.
Scalable Systems Software
Participating Organizations: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, IBM, Cray, Intel, SGI, NCSA, PSC, SDSC
[Diagram labels: Resource Management; Accounting & user mgmt; System Monitoring; System Build & Configure; Job management]
Problem
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
Goals
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers
To learn more visit www.scidac.org/ScalableSystems
Scalable Systems Software Suite
Updates to this diagram
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
[Diagram components: Grid Interfaces; Meta Scheduler; Meta Monitor; Meta Manager; Meta Services; Accounting; Scheduler; System & Job Monitor; Node State Manager; Service Directory; Standard XML interfaces; Node Configuration & Build Manager; authentication; communication; Event Manager; Allocation Management; Packaging & Install; Usage Reports; Process Manager; Job Queue Manager; Hardware Infrastructure Manager; Validation & Testing; Checkpoint / Restart]
Review of Last Meeting
Scalable Systems Software Center
January 15-16, Argonne
Details in main project notebook
Highlights from Jan. mtg
Craig – 1280-node dual-Xeon cluster “Titanium” is available this evening to test the scalability of the SSS suite. One node will be used as head node to install our suite and run on the entire cluster.
Could build everything but Bamboo and ssslib, due to Xerces
Will begin to be available at 6pm
Late-night session on the 1280-node testbed
PM ran at 1280, worked at 4000, hung at 6000
Warehouse had a problem at 1280 and took out the head node
RM components ran on the head node OK until Warehouse crashed it
Scott Jackson – Gold running on 11 TF PNNL cluster
Thomas Naughton – 2nd release in March. Discussion of how many orgs in our group could shake down the tarball. Group feels it is better to have a few very reliable components than all components
Highlights from Jan. mtg (cont.)
Rusty Lusk – Process Manager spec for first vote
Presentation and discussion…
Who is responsible for limit enforcement, PM or QM? I.e., must use a certain amount of memory, must not execute OS command (in general, things that happen after fork)
Rusty says the question is good and he needs to think about how this may affect the interface.
Other items to think about
- use of wildcard as “to be returned” operator – OK
- Inclusion but don’t show me.
- Dynamic jobs and PM.
- improve readability
Delay vote until we have a written proposal.
Highlights from Jan. mtg
Discussion of having two XML syntax styles (functional, object)
Al says he would like to see one common style across the suite; he does not care which one, as long as the whole group can agree.
Narayan – Restriction Syntax overview. An issue of uniqueness was brought up, to be taken into consideration by Narayan.
Rusty Lusk – Restriction Syntax on Chiba City
David would like to see a paper on the requirements of the Chiba effort.
Andrew, Paul, and Craig offer to investigate a prototype translator to see how, or if, it is possible.
Investigate standardization of tokens across the two syntaxes
Progress Since Last Meeting Scalable Systems Software Center January-May
SciDAC PI mtg – March 22-24, 2004
In Charleston, SC, with several attending for Scalable Systems
2-page project summary report
Annual report for Fred
20-minute talk – presented by Rusty
Fred asked each ISIC to use a new speaker
Poster presentation – by Stephen/John
Systems Software Suite 2nd Release
Target date March ‘04, so we could announce it at the PI meeting.
Real status?
SSS-OSCAR – will hear more in next talk
Need a way to test that the suite is installed correctly
Five Project Notebooks
• A main notebook for general information
• And individual notebooks for each working group
• Over 300 total pages
• BC and PM groups need to get specs into their notebooks
• Add Telecom meeting notes even if short (Kudos to RM group)
Get to all notebooks through main web site www.scidac.org/ScalableSystems
Click on side bar or on “project notebooks” at bottom of page
Bi-Weekly Working Group Telecoms
RM notes are the only ones I see in the notebook
Resource management, scheduling, and accounting: Tuesday 3:00 pm (Eastern), 1-800-664-0771, keyword “SSS mtg”
Process management, monitoring, and checkpointing: Thursday 1:00 pm (Eastern), 1-877-252-5250, mtg code 160910
Node build, configuration, and information service: Thursday 3:00 pm (Eastern), 1-888-469-1934, mtg code (changes)
This Meeting Scalable Systems Software Center May 6-7, 2004
Major Topics this Meeting
Stability of Systems Software Suite – second release is out. Are we ready for outside users?
Quarterly Report due – would like to get one to Fred by end of May. Will need text from WG leaders.
Formal API presentations and voting – we left several things hanging last meeting
MICS PI Mtg – August 9-12 at Argonne. A good time to have a highlight of outside user(s)
SC04 Mtg – November in Pittsburgh. Talks? Tutorial? Birds of a feather?
Agenda – May 6
• 8:30 Al Geist – Project Status
• 9:15 Thomas Naughton – SSS OSCAR software suite release
• Working Group Reports
  • Progress report on what their group has done
  • API Proposals for adoption by the group
  • Progress on software suite improvements
• 9:30 Narayan Desai – Node Build, Configure
• 10:30 Break
• 11:30 Will McClendon – Validation and Testing
• 12:30 Lunch (on own – cafeteria)
• 1:30 Ron Oldfield – ASAP testing, and formalism issues
• 2:00 Paul Hargrove – Process Management
  • Craig and Rusty
• 3:00 Scott Jackson – Resource Management
• 4:00 Paul/Craig – findings about trying to build a syntax translator
• 4:30 Group Discussion on getting outside users of 2nd release
• 5:00 Al – Discussion on SC04, other conferences, papers, etc.
• 5:30 Adjourn
Agenda – May 7
8:30 Discussion, proposals, votes
  Craig – discussion
  Paul – straw vote on two syntaxes
  Rusty – Process Manager proposal (deferred)
  Scott – Allocation Manager proposal (deferred)
  Al – Quarterly report, papers, SC04, other meetings
10:30 Break
11:00 Al Geist – Release 2 and outside users (Jazz? Ram? NCSA? SNL?)
  MICS PI Mtg August at Argonne (news to come)
  Next meeting date: August 26-27, 2004; location: Argonne
12:00 Meeting ends
Meeting notes
Al Geist – presents project overview and goals for this meeting
Thomas Naughton – SSS-OSCAR
In tarball: Bamboo, BLCR, Gold, LAM/MPI, MAUI-SSS, SSSLib, Warehouse, MPD2
SSSLib contains SD, EM, PM, BCM, NSM, NHw, plus communication
Todo: bug tracker; test sss-oscar-v2a6-v3.0 for pre-release; documentation (use SciDAC review 1-pager); add license-sss to directory
Need: a test suite and a few test machines to test on
Discussion on APITest and who creates tests, etc. Each does their own.
Establish release schedule through SC04
Add easier way for authors to “test just their stuff”
SC04 – fully tested release v1.0 with all SSS components; code freeze Friday, September 3
Meeting notes
Narayan Desai – Build Configure
Library improvements: bugfixes, testing of Java support, SSL testing
Infrastructure improvements: sss Python library improvements, EM bugfixes
BCM component usage experience
Hardware infrastructure – still seeking purpose
Restriction Syntax examples given and discussed
Craig thankful that !d (don’t display this field) now works
Uniqueness issue: default is to return all duplicates; new flag “unique=true” removes duplicates
Much discussion. Rusty suggests removing only duplicate lines
Paul brings up the problem with “action” commands, i.e. killing jobs twice
Al says the problem is not solvable in general in restriction syntax
Scott asked whether RMAP syntax can handle this. Much work on the board, and a question about the atomicity of queries that require multiple SQL queries to complete.
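The uniqueness discussion can be illustrated with a minimal Python sketch. It is not the restriction syntax or any SSS component: the `query` function, its `unique` argument, and the node names are hypothetical stand-ins. It only shows the behavior described in the notes, where the default returns all duplicates and a flag removes result rows that are identical to an earlier row.

```python
def query(rows, unique=False):
    """Return result rows; with unique=True, drop rows identical to an earlier one.

    Hypothetical helper for illustration only, not part of any SSS library.
    """
    if not unique:
        return list(rows)  # default behavior: duplicates are kept
    seen = set()
    result = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Made-up result rows; the first two are exact duplicates.
rows = [
    {"node": "ccn1", "state": "up"},
    {"node": "ccn1", "state": "up"},
    {"node": "ccn2", "state": "down"},
]

all_rows = query(rows)                  # keeps both copies of ccn1
unique_rows = query(rows, unique=True)  # keeps only the first copy of ccn1
```

Only rows that match in every field are collapsed here, which corresponds to removing duplicate lines rather than deduplicating on a single field. For an “action” command the same list would be acted on once per row, which is why duplicate rows matter beyond display.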
Meeting notes
Will McClendon – Component Interface Testing
APITest v0.1.2
Now available by FTP, having been put under the GPL Cplant license: ftp://ftp.sandia.gov/outgoing/apitest (also in notebook)
Not integrated back into ssslib
HTTP interface development
“Twisted Python” framework
Info and www.effbot.org
Scott helped find bug in Python popen3 – now uses Twisted SpawnProcess
Better support for browsing test data within session
Batch and test data stored in memory in XML file format; writing data out to file available soon
Shows an XML example that runs a test. Several questions answered
Shows an XML batch file example. Runs live demo – works fine. Discussion follows.
Ron Oldfield – replacing Eric DeBenedictis, who is moving to other SNL jobs
- ORNL help set up a testing environment
- Testing for correct installation and individual tests, then whole-suite test
Meeting notes
Ron Oldfield (cont) – simulating real workloads
Performance and scalability testing needed in the future
Portability is important for our reference implementation; discussion of code portability vs feature portability
Authorization also needs testing
What are the issues in a lightweight OS?
Standard naming conventions, both format and semantics; someone really needs to go through the existing schemas
RMAP dictionary makes a good starting point
Paul Hargrove – process management
Development continues on all three components
Syntax translation effort to be discussed later today.
Checkpoint
- pre-emption (suspend and resume) works
- checkpointing (ckpt works, restart in progress)
Todo: migration; checkpoint file management so as not to overflow disks (list, delete)
Query: “can I restart here?”
Meeting notes
Paul Hargrove – process management (cont)
Suspend/resume works with Bamboo, SD, EM, OM, PM components
Still need to design restart-time interactions with RM group
Open files support under testing
Bug fix releases as needed.
Checkpoint manager outstanding issues
Implement full interface using restriction syntax, event generation, error reporting
Must implement file management (think ls and rm), expiration
Craig Steffen – no slides
Tried to run on 1280 nodes on Tungsten: failed; did run on 128
Can now run on 1024 nodes. Being stopped by #sockets limit
Harvesting of other info can now be done, e.g. Myrinet HW
Next: adding support for “job” management; start interfacing with Build group; help to get it on Chiba
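The checkpoint file management item (“think ls and rm, expiration”) could look roughly like the following Python sketch. This is an illustration under stated assumptions, not the checkpoint manager's interface: the function names are hypothetical, checkpoints are assumed to be plain files in one directory, and expiration is assumed to mean deleting files older than a given age.

```python
import os
import time


def list_checkpoints(directory):
    """Return (name, size_bytes, mtime) for each regular file, like `ls`."""
    entries = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            st = os.stat(path)
            entries.append((name, st.st_size, st.st_mtime))
    return entries


def delete_checkpoint(directory, name):
    """Remove one checkpoint file, like `rm`."""
    os.remove(os.path.join(directory, name))


def expire_checkpoints(directory, max_age_seconds, now=None):
    """Delete files older than max_age_seconds and return the names removed."""
    now = time.time() if now is None else now
    removed = []
    for name, _size, mtime in list_checkpoints(directory):
        if now - mtime > max_age_seconds:
            delete_checkpoint(directory, name)
            removed.append(name)
    return removed
```

Running `expire_checkpoints` periodically is one simple way to keep old checkpoint files from filling the disk; a real component would also need the event generation and error reporting listed above.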
Meeting notes
Rusty Lusk – process manager update
PM component – added “limits” interface, dynamic jobs (mpi_comm_spawn)
Can spawn lots of nodes and then use “unused” ones as needed
Shows limits spec
MPD2
- improvements found by production use on Chiba
- support for limits
- support for mpi_comm_spawn
- interactive debugging via mpigdb – allows control of stdin, stderr, stdout
Future: need to work more closely with QM; QM interface for requesting dynamic jobs
Meeting notes
Scott Jackson – resource manager update
Diagram on board
[Figure: Multi-step job, showing a job group of jobs and a task group of tasks (T)]
Released SSSRMAP v3 spec. New things:
- wire protocol
- message format
- job groups
Latest software release (in OSCAR) uses SSSRMAP v2
Second release of Bamboo in March, with epilogue and prologue support
Gold now fully SSSRMAP v2
- second alpha release due June, which will be in Perl (first release in Java ran into memory size limits)
- user guide done
- first release running on PNNL’s SGI Altix
Testing using APITest begun
Silver: several improvements in XML
Future work: implement SSSRMAP v3 in the components
- merger of Maui 3.2 and SSS. Integrate chkpt/restart. Limit enforcement
- now SSS affects all Maui users. Ability to handle dynamic jobs
Meeting notes
Paul – translator report (no slides)
Looked at the two syntaxes to see whether we could automate translation between SSSRMAP and restriction syntax (RS)
Found: SSSRMAP can say 4<proc<16, but RS cannot
RS band-aid – special operators to handle ranges
For multiple-table queries (nested), RS syntax doesn’t have the information (primary data type) to know how to combine multiple SQL results
There is no way to translate between these cases.
Paul discourages the implementation of a translator.
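To make the range example concrete, here is a minimal Python sketch. It uses neither SSSRMAP nor restriction syntax; `Condition`, `matches`, and the node records are hypothetical names invented for illustration. It shows that a range such as 4<proc<16 is a conjunction of two comparisons on the same field. Assuming the RS limitation is of that kind, a form that carries one comparison per field would need a dedicated range operator to say the same thing.

```python
import operator
from dataclasses import dataclass

# Comparison operators keyed by their textual form (illustrative only).
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Condition:
    """One comparison on one field; a hypothetical type, not an SSS schema."""
    field: str
    op: str
    value: int


def matches(record, conditions):
    """Return True when the record satisfies every condition (logical AND)."""
    return all(OPS[c.op](record[c.field], c.value) for c in conditions)


# 4 < proc < 16 written as two comparisons on the same field.
proc_range = [Condition("proc", ">", 4), Condition("proc", "<", 16)]

# Made-up node records for the example.
nodes = [
    {"name": "n1", "proc": 2},
    {"name": "n2", "proc": 8},
    {"name": "n3", "proc": 16},
]
selected = [n["name"] for n in nodes if matches(n, proc_range)]
```

Both bounds are strict, so a node with exactly 4 or 16 processors is excluded.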
Meeting notes – Day 2
Craig – General thoughts on official V1.0 (no slides)
Released at SC04, this will be the first time many people will see it
Our orthogonal directions in syntax are damaging
If we don’t make a decision soon - project progress towards V1.0
Brett, who works with both, favors SSSRMAP; he likes its more descriptive and OO nature.
Rusty says that we need two written proposals for a component that we can compare and vote on; otherwise we are all just talk.
Paul says that one is better but two is not too bad.
Scott doesn’t think we can reconcile
Paul asks for a straw vote for a preference; Scott seconds
SSSRMAP – 7 votes, 5 institutions (but one is Al)
Restriction Syntax – 3, all ANL
Abstain – 3 votes, 2 institutions
Craig says he will do whatever it takes to make either work. He is going to make ssslib SSSRMAP work
Neil says “users” are the guiding factor and RMAP is better there
Paul says understandability and acceptability are key and RMAP is better
Both say that RS is more compact and elegant.
Meeting notes – Day 2 (cont)
Narayan asks: does it just need documentation and tutorials?
Paul says no. There is a closer match to SOAP et al. The OO nature was not a factor in his choice, but it is more popular today.
Neil says potential users won’t have a Narayan to figure this out. Components are both client and server, so the developer has to know the syntax.
Rusty – what if something were added to RS that made it easier to use or understand? He is not sure it is a good idea.
Will – documentation is better in RMAP, and he has looked at RMAP more. Would all this stuff be more abstracted? Users do as little as they can and read the manual only after they get stuck. Doesn’t care which, as long as we pick ONE! Need to have the same look and feel across the project.
Rick – I don’t care which. I don’t like XML. What about the SD and EM that are already accepted?
Al says that he feels RMAP would be more acceptable to vendors, and that this would be critical to the long-term success of the project.
Paul says that the Process Manager document is not complete enough to vote on at this time.
Meeting notes – Day 2 (cont) Discussion -