SciDAC Scalable Systems Software Center August 14-15 Atlanta GA
Agenda - August 14
8:00 Wireless set up
9:00 Introductions
9:30 Overview and Goals of the Center (Geist)
10:00 SciDAC ISIC Expectations (Johnson)
10:30 Discussion of meeting goals
11:30 Strawman proposal for an interface framework
12:00 Lunch (as group to hotel restaurant)
1:00 Other proposals/ideas for system interfaces
    Doug - runtime architecture (Scyld vs Cplant vs ???)
    Karl - rm-api, cpu sets, need to schedule fat nodes
    Don - Scyld boot method, multicast status info
    Paul - checkpoint/restart
    Sung - science appliance project
3:00 Enumerating key attributes
4:00 Discuss merits of attribute database
5:00 Break for dinner

Agenda - August 15
8:00 Wireless set up
8:30 Decide on logistics of consensus
9:30 Overnight proposals?
10:00 Decide on working groups for key areas
      Begin initial discussion of interfaces & integration
12:00 Next meeting dates, what happens till then
12:30 Meeting ends
      Lunch and further discussion for hangers-on

Scalable Systems Software for Terascale Computer Centers
www.scidac.org/ScalableSystems
Problem
• Computer centers use an incompatible, ad hoc set of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
Solution
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers
Impact
• Revolutionize the way system software is designed and used.
[Diagram labels: Resource Management; Accounting & user mgmt; System Monitoring; System Build & Configure; Job management]

Goal and Vision of the Center
Four Goals
• Collectively (with industry) agree on and specify standardized interfaces between system components in order to promote interoperability, portability, and long-term usability.
• Produce a fully integrated suite of systems software and tools for the effective management and utilization of terascale computational resources, particularly those at the DOE facilities.
• Research and develop more advanced versions of the components, as well as the OS modifications required to support the scalability and performance requirements of SciDAC applications.
• Carry out a software lifecycle plan for support and maintenance of the systems software suite.

Scope: The Spaghetti and Meatballs Picture
[Diagram of interconnected components: Access control / Security manager (interacts with all components); Meta Scheduler; Meta Monitor; Meta Manager; Node Configuration & Build Manager; System Monitor; Accounting; Scheduler; Resource Allocation management; Job Manager & Monitor; Queue Manager; User DB; Data Migration; High Performance Communication & I/O; Usage Reports; User utilities; Checkpoint/Restart; File System; Application Environment]

Working with Computer Centers
Our customers are the managers and system administrators at the terascale computer centers around the nation.
• their guidance
• their feedback
Working with other SciDAC Centers
• Common Component Architecture: parallel startup, event services, runtime framework
• Scalable Data Management
• others?

Meeting Goals
• Decide logistics of reaching consensus on standard interfaces
    MPI-like process, CCA-like process, other?
    How to deal with errata
• Enumerate key attributes common across system components
    expect there are fewer than 30
• Discuss whether an attribute database should be a part of the architecture
    could be considered as just another component
• Begin defining interfaces and working groups for key areas: node configuration & building, resource management, parallel job startup, system & job monitoring

Infrastructure
Project Web Page - www.scidac.org/ScalableSystems
    proposal
    plan
    overview slides
    links to individual sites and software downloads
Project Notebook - www.csm.ornl.gov/~geist/enote/system.html
    meeting notes (like this meeting)
    progress reports
    draft standards for group to comment on
CVS when we begin to produce software suite

Strawman: a common integrated interface framework
• Easy to swap components
    Vendor optimized
    highly scalable version
• Common pool of attributes
    XML format for attributes
• Standardized request protocol
    Choose an existing transfer protocol - TCP
• Every component uses the same framework
[Diagram: Attribute database holding User, Host, OS, Mem, Allocation, Etc…]

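To make the strawman concrete, here is a minimal sketch of what an XML-encoded attribute request between two components might look like. The slide fixes only "XML format for attributes" and a standardized request carried over an existing transfer protocol such as TCP; the element names (`request`, `attribute`), the `component` and `action` fields, and the sample values are all hypothetical and not part of any agreed interface.

```python
import xml.etree.ElementTree as ET


def build_request(component, action, attributes):
    """Serialize a request as an XML string.

    The element and field names used here are illustrative assumptions,
    not a standard defined by the Center.
    """
    root = ET.Element("request", {"component": component, "action": action})
    for name, value in attributes.items():
        attr = ET.SubElement(root, "attribute", {"name": name})
        attr.text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_request(xml_text):
    """Recover (component, action, attributes) from an XML request string."""
    root = ET.fromstring(xml_text)
    attributes = {a.get("name"): a.text for a in root.findall("attribute")}
    return root.get("component"), root.get("action"), attributes


# Hypothetical request drawing on the common attribute pool (User, Host, OS).
message = build_request(
    "scheduler", "query", {"User": "alice", "Host": "node17", "OS": "Linux"}
)
component, action, attrs = parse_request(message)
```

Because every component would build and parse messages the same way, a component could in principle be swapped for a vendor-optimized version without changing its peers; the resulting string is what would be written to the TCP connection.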
Objects and Components
Components:
    Job Manager, System Monitor, Accounting, Allocation management, Logging, Node Management, Process Management, Job monitor, Configuration management, Scheduler, Queue manager, Meta-services, Information service, System management, Checkpoint, File staging, Security manager
Objects:
    Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition

Node, system, and configuration Services
Services: Start job, Signal job
Services: Start job, Signal job
Services: Start job, Signal job
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition

Job and System Monitor Services
Services: Start job, Signal job
Services: Start job, Signal job
Services: Start job, Signal job
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition

Accounting and logging Services
Services: Start job, Signal job
Services: Start job, Signal job
Services: Start job, Signal job
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition

Job and Process Mgmt (+chkpt) Services
Services: Start job, Signal job
Services: Start job, Signal job
Services: Start job, Signal job
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition

Scheduler, Queue, and meta- Services
Services: Start job, Signal job
Services: Start job, Signal job
Services: Start job, Signal job
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition

Information Services
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition
Static Services: Start job, Signal job
Slow Services: Start job, Signal job
Fast Services: Start job, Signal job

Security Mgr Services
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition
Services: Start job, Signal job

Storage and I/O Services
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition
Services: Start job, Signal job

Consensus and Voting Rules
Written Documentation:
• Written draft standards available to everyone in Project notebook
• Drafts must be presented a week to 10 days before a vote
• Errata or extensions: revisit interface standard every 6 months
Voting:
• Pass with simple majority of people voting yes/no
• Who can vote? Organizations with physical attendance at 2 of last 3 meetings
• One vote per organization
• No email-in or phone votes accepted
• Straw votes are non-binding and may be used for guidance
• Two formal votes are required to accept a chapter for final vote
    Both votes can't occur at the same meeting
• Global vote of whole document as standard interface

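The eligibility and majority rules above can be sketched in a few lines. This is only an illustration under stated assumptions: attendance is recorded as one boolean per meeting (oldest first), abstentions are simply left out of the tally, and a tie is treated as a failed vote, which the slide does not specify. The organization names are placeholders.

```python
def eligible(attendance, window=3, required=2):
    """True if the organization attended at least `required` of the last
    `window` meetings. `attendance` is a list of booleans, oldest first."""
    return sum(attendance[-window:]) >= required


def tally(votes, attendance_by_org):
    """Count one yes/no vote per eligible organization.

    Returns (passed, yes_count, no_count). A tie is treated as a failure
    here, which is an assumption rather than a stated rule.
    """
    yes = no = 0
    for org, vote in votes.items():
        if not eligible(attendance_by_org.get(org, [])):
            continue  # organizations without enough attendance cannot vote
        if vote == "yes":
            yes += 1
        elif vote == "no":
            no += 1
    return yes > no, yes, no


# Placeholder organizations: OrgC attended only 1 of the last 3 meetings.
attendance_by_org = {
    "OrgA": [True, True, False],
    "OrgB": [False, True, True],
    "OrgC": [True, False, False],
}
votes = {"OrgA": "yes", "OrgB": "yes", "OrgC": "no"}
passed, yes_count, no_count = tally(votes, attendance_by_org)
```

Keying the vote dictionary by organization enforces the one-vote-per-organization rule by construction.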
Other organizational:
• Weekly teleconference of Working Groups: up to the groups
• Video Conference Meetings: explore AG in future