Quality in Chaos: a view from the TeraGrid environment. John Towns TeraGrid Forum Chair Director of Persistent Infrastructure National Center for Supercomputing Applications University of Illinois jtowns@ncsa.illinois.edu with the assistance of many TeraGrid colleagues!!.
What is Cyberinfrastructure? • Computing systems, • data storage systems, and data repositories, • visualization environments, • and people, • all linked together by high performance networks.
The Vision of TeraGrid • Three part mission: • support the most advanced computational science in multiple domains • empower new communities of users • provide resources and services that can be extended to a broader cyberinfrastructure • TeraGrid is… • an advanced, nationally distributed, open cyberinfrastructure comprised of supercomputing, storage, and visualization systems, data collections, and science gateways, integrated by software services and high bandwidth networks, coordinated through common policies and operations, and supported by computing and technology experts, that enables and supports leading-edge scientific discovery and promotes science and technology education • a complex collaboration of over a dozen organizations and NSF awards working together to provide collective services that go beyond what can be provided by individual institutions
What is TeraGrid? (simple definition) A complex collaboration of over a dozen organizations working together to provide cyberinfrastructure that goes beyond what can be provided by individual institutions, to improve research productivity and enable breakthroughs not otherwise possible.
TeraGrid Objectives • DEEP Science: Enabling Petascale Science • make science more productive through an integrated set of very-high capability resources • address key challenges prioritized by users • WIDE Impact: Empowering Communities • bring TeraGrid capabilities to the broad science community • partner with science community leaders - “Science Gateways” • OPEN Infrastructure, OPEN Partnership • provide a coordinated, general purpose, reliable set of services and resources • partner with campuses and facilities
What you can do with the TeraGrid: Simulation of cell membrane processes Work by Emad Tajkhorshid and James Gumbart, of University of Illinois Urbana-Champaign. • Mechanics of Force Propagation in TonB-Dependent Outer Membrane Transport. Biophysical Journal 93:496-504 (2007). • Results of the simulation may be seen at www.life.uiuc.edu/emad/TonB-BtuB/btub-2.5Ans.mpg • Modeled mechanisms for transport of molecules through cell membrane. • Used 400,000 CPU hours [45 processor-years] on systems at National Center for Supercomputing Applications, IU, Pittsburgh Supercomputing Center Image courtesy of Emad Tajkhorshid, UIUC
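The bracketed conversion on this slide is plain arithmetic. A minimal check, assuming a 365-day year and one processor running continuously (the function name is illustrative, not from the slides):

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, processor never idle

def cpu_hours_to_processor_years(cpu_hours: float) -> float:
    """Convert CPU hours to processor-years (one processor running nonstop)."""
    return cpu_hours / HOURS_PER_YEAR

years = cpu_hours_to_processor_years(400_000)
print(f"{years:.1f} processor-years")
```

Under these assumptions the result falls between 45 and 46 processor-years, consistent with the figure of 45 quoted on the slide.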
TG App: SCEC-PSHA • Part of Southern California Earthquake Center (Tom Jordan, USC) • Using large scale simulation data, estimate probabilistic seismic hazard (PSHA) curves for sites in southern California (probability that ground motion will exceed some threshold over a given time period) • Used by hospitals, power plants, schools, etc. as part of their risk assessment • For each location, need a CyberShake run followed by roughly 840,000 parallel short jobs • parallelize across locations, not individual workflows • Completed over 300 locations to date, targeting 2000 sites in 2010 Managing these requires effective grid workflow tools for job submission, data management and error recovery, using Pegasus (ISI) and DAGMan (Wisconsin) Information/image courtesy of Phil Maechling
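The slide describes a hazard curve as the probability that ground motion exceeds a threshold over a given time period. A minimal sketch of that idea, assuming exceedances follow a Poisson process with a known annual rate (a common modelling assumption that the slides do not state); the thresholds and rates below are made-up values, not SCEC results:

```python
import math

def exceedance_probability(annual_rate: float, years: float) -> float:
    """P(at least one exceedance within `years`), assuming a Poisson process."""
    return 1.0 - math.exp(-annual_rate * years)

# Hypothetical annual exceedance rates keyed by ground-motion threshold (in g).
# Illustrative numbers only.
annual_rates = {0.1: 0.05, 0.2: 0.01, 0.4: 0.002}

# One point on the hazard curve per threshold, for a 50-year window.
hazard_curve = {
    threshold: exceedance_probability(rate, years=50)
    for threshold, rate in annual_rates.items()
}

for threshold, prob in hazard_curve.items():
    print(f"ground motion > {threshold:.1f} g in 50 years: {prob:.3f}")
```

In the real application the rates would come from the simulation runs for each site; here they are placeholders so that the shape of the calculation is visible.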
What is the TeraGrid? • An instrument that delivers high-end IT resources/services: computation, storage, visualization, and data/services • a computational facility – over a PetaFLOP in parallel computing capability • a data storage and management facility - over 20 PetaBytes of storage (disk and tape), over 100 scientific data collections • a high-bandwidth national data network • A service: help desk and consulting, Advanced Support for TeraGrid Applications (ASTA), education and training events and resources • Something you can use without financial cost • research accounts allocated via peer review • Startup and Education accounts automatic • World’s largest distributed cyberinfrastructure for scientific research • supported by National Science Foundation
11 Resource Providers, One Facility [Map of TeraGrid sites: Grid Infrastructure Group (UChicago), UW, UC/ANL, PSC, NCAR, PU, NCSA, Caltech, UNC/RENCI, IU, ORNL, USC/ISI, NICS, SDSC, LONI, TACC. Legend: Resource Provider (RP), Software Integration Partner, Network Hub]
TeraGrid Resources and Services • Computing • more than one petaflop of computing power today and growing • 500 Tflop Ranger (Sun Constellation) at Texas Advanced Computing Center (TACC) • 1.03 PFlop Kraken (Cray XT5) at National Institute for Computational Sciences (NICS), University of Tennessee • Remote visualization servers and software • 60 TFlop condor-based viz resource at Purdue University • Data • allocation of data storage facilities • over 100 Scientific Data Collections • Central allocations process • Technical Support • central point of contact for support of all systems • Advanced Support for TeraGrid Applications (ASTA) • education and training events and resources • over 30 Science Gateways
How is TeraGrid Organized? • TG is set up like a large cooperative research group • evolved from many years of collaborative arrangements between the centers • still evolving! • Federation of 12 awards • Resource Providers (RPs) • provide the computing, storage, and visualization resources • Grid Infrastructure Group (GIG) • central planning, reporting, coordination, facilitation, and management group • Strategically led by the TeraGrid Forum • made up of the PIs from each RP and the GIG • led by the TG Forum Chair, who is responsible for coordinating the group (elected position) • John Towns – TG Forum Chair • responsible for the strategic decision making that affects the collaboration • Day-to-Day Functioning via Working Groups (WGs): • each WG under a GIG Area Director (AD), includes RP representatives and/or users, and focuses on a targeted area of TeraGrid
Impacting Many Agencies [Two pie charts: "Supported Research Funding by Agency" ($91.5M Direct Support of Funded Research) and "Resource Usage by Agency" (10B NUs Delivered). Categories: NSF, NIH, DOE, NASA, DOD, International, University, Industry, Other]
So why are you here anyhow?? • For moderate scale research projects funded by federal agencies, quality is an afterthought • $10s of millions/year • well.. perhaps just assumed as implicitly needed • no explicit treatment of quality in many programs • Of course, large scale projects have quality as a first class concern • $100s of millions/year • DOD has recognized the importance of quality in modeling and simulation efforts • specifically designed verification, validation, and accreditation (VV&A) processes • understand the simulation’s capabilities, limitations, and performance relative to the real-world objects it simulates • http://vva.msco.mil/ • NSF MREFC planning processes have quality concerns stated in solicitations for these projects
TeraGrid is no exception • Initially defined largely as a technology research activity with intent to support production (academic definition) resources • Behaved organizationally much like an individual investigator research team • lack of clear structure and processes • In the end, TeraGrid is both operations and research • operations: • facilities/services on which researchers rely • infrastructure on which other providers build • research: • learning how to do distributed, collaborative science on a global, federated infrastructure • learning how to run multi-institution shared infrastructure • Further, lack of recognition of what TeraGrid really is • an emerging and evolving infrastructure for enabling science and engineering • (initially) treated as a research project • Thus, something of a “perfect storm”
but…. • TeraGrid has become quite successful anyhow • the picture was perhaps not so bleak • participant centers embodied a great deal of experience and expertise • no lack of vision (perhaps too much) or passion amongst participants • we came to some basic realizations • Fundamentally, we had to mature as a distributed infrastructure organization • while we provided many technically interesting things, we had lost sight of the “quality” of what we provided • we had to understand what that meant!
Quality in a TeraGrid Context • What did this mean for us? • TeraGrid must deliver important services reliably and without barriers to entry to a community of scientists and engineers not interested in the nerdy details we TeraGrid geeks loved to wallow in…
TeraGrid faced many challenges on this front … • Relied on software we obtained elsewhere and had little control over • “academic grade” quality was a generous description for much of it • We integrated this software along with many services into a distributed environment • resources at various sites governed by conflicting policies • software often not based on standards or did not comply with them • The distributed organization presented many faces to the user community for many of the services provided • participants’ desire to maintain their own identity while playing nice in the larger environment • TeraGrid had (has) many organizational challenges • no strong central management/authority • participants frequently pitted against one another in life/death funding competitions • And the list goes on…
What TeraGrid needed to do… • Create a more stable distributed environment and facilitate use by the user community • institute basic quality assurance mechanisms • Quality Assurance working group • increased stability/reliability of software infrastructure • Inca system, • new interfaces to environment • Science Gateways, workflow support • Reduce the number of faces presented to the user community • reduce electronic interfaces • User Portal, POPS, trouble ticket submission • create common user environment across multiple heterogeneous systems • reduce “human faces” • centralized helpdesk, integrated/coordinated advanced support functions • Focus on facilitating use and not new technology development • support for new and advanced users • understand the challenges users face in our environment
Something Important Going for Us • TeraGrid was not a revolutionary idea suddenly instituted as a project • built on a long history of NSF-funded supercomputing centers • initially funded in 1985 • a progression of NSF programs • a handful of major centers funded early on • some loose collaboration of those centers through the 1980s and 1990s • first NSF program to fund collections of centers in 1997 • evolution of that program to TeraGrid • This provided an important resource • staff with a passion for delivering resources and services to support science and engineering • a culture of striving to do our best in this developed • But… • most staff were subject matter experts and not process driven • we regularly work with cutting edge technologies • no luxury of spending 2 years developing software using traditional software engineering processes
Creating a stable and reliable environment: QA Working Group • Goal: improve reliability of production TeraGrid software components/services • Increase reliability of services: • prioritize testing/debugging of services most relevant to users • identify existing tests to be used and/or develop new tests • improve the use of the Inca monitoring framework • Increase availability of CTSS services: • improve time from detected failure to notification • map errors to potential problem resolution procedures • Develop/propose a more formal process for CTSS software deployment
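Two of the goals above, shortening the time from detected failure to notification and mapping errors to resolution procedures, can be sketched in a few lines. Everything in this example is hypothetical: the error signatures, the procedures, and the function names are invented for illustration and are not Inca's or CTSS's actual interfaces.

```python
from datetime import datetime, timedelta

# Hypothetical error signatures mapped to resolution procedures.
# Neither the signatures nor the procedures come from TeraGrid or Inca.
RESOLUTION_PROCEDURES = {
    "gridftp: connection refused": "Restart the GridFTP server and re-run the test.",
    "certificate expired": "Renew the host certificate and reload the service.",
}
DEFAULT_PROCEDURE = "Escalate to the resource provider's help desk."

def suggest_procedure(error_message: str) -> str:
    """Return the first procedure whose signature appears in the message."""
    lowered = error_message.lower()
    for signature, procedure in RESOLUTION_PROCEDURES.items():
        if signature in lowered:
            return procedure
    return DEFAULT_PROCEDURE

def notification_delay(detected_at: datetime, notified_at: datetime) -> timedelta:
    """Time elapsed between detecting a failure and notifying someone."""
    return notified_at - detected_at

detected = datetime(2010, 1, 1, 12, 0)
notified = datetime(2010, 1, 1, 12, 25)
print(suggest_procedure("GridFTP: connection refused on port 2811"))
print(notification_delay(detected, notified))
```

Tracking the delay as an explicit quantity is what makes "improve time from detected failure to notification" measurable; the lookup table is the simplest possible form of an error-to-procedure map.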
Creating a stable and reliable environment: Build & Test Facility • Lower software build and support costs across providers • Improve software quality • Make software builds reproducible • Faster software turnaround time • Provide public access to software manufacture process
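One way to read "make software builds reproducible" is that building the same source twice yields byte-identical artifacts. A minimal sketch of such a check, assuming a build can be modelled as a function from source bytes to artifact bytes; both build functions below are hypothetical stand-ins, not the Build & Test Facility's tooling:

```python
import hashlib
import itertools

def digest(artifact: bytes) -> str:
    """SHA-256 digest of a build artifact."""
    return hashlib.sha256(artifact).hexdigest()

def is_reproducible(build, source: bytes, runs: int = 2) -> bool:
    """True if repeated builds of the same source give identical artifacts."""
    return len({digest(build(source)) for _ in range(runs)}) == 1

def stable_build(source: bytes) -> bytes:
    # Hypothetical build step that depends only on its input.
    return b"BIN:" + source

_build_number = itertools.count()

def unstable_build(source: bytes) -> bytes:
    # Hypothetical build step that embeds a changing build number.
    return b"BIN:" + source + str(next(_build_number)).encode()

print(is_reproducible(stable_build, b"int main() {}"))    # True
print(is_reproducible(unstable_build, b"int main() {}"))  # False
```

Embedded build numbers and timestamps are typical reasons a build fails this kind of check.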
Reducing the number of electronic faces • TeraGrid User Portal • access RP resources and special multi-site services • current, up-to-date information about TG environment • manage and monitor allocations via common tools • first line of support for users • documentation, information about hardware and software resources • education, outreach and training events and resources • Common User Environment Working Group • remove barriers to user movement between TeraGrid resources • coordinate with RP staff and TG WG • CUE Management System, CUE Build Environment, CUE Testing Platform, CUE Variable Collection • Science Gateways • user access without allocation request • simplifies access to resources • immediate reach to communities of researchers
What is a Science Gateway? • A Science Gateway • enables scientific communities of users with a common scientific goal • uses high performance computing • has a common interface • leverages community investment • Three common forms: • web-based portals • application programs running on users' machines but accessing services in TeraGrid • coordinated access points enabling users to move seamlessly between TeraGrid and other grids
How can a Gateway help? • Make science more productive • researchers use same tools • complex workflows • common data formats • data sharing • Bring TeraGrid capabilities to the broad science community • lots of disk space • lots of compute resources • powerful analysis capabilities • nice interface to information
Support for New and Established Users • Advanced Support for TeraGrid Applications (ASTA) • request help with code optimization, workflow improvement and gateways through • TeraGrid Pathways • new user support, mentoring, fellowships • Campus Champions • individuals at your institution to offer support • HPC University • online public resources • TeraGrid Annual Conference • showcases capabilities, achievements and impact of TeraGrid in research • presentations, demos, posters, visualizations • tutorials, training and peer support • student competitions and volunteer opportunities
Understanding the Challenges Users Face • Established User Interaction Council • key group of project leaders chaired by Director of Science • Regular analysis of trouble tickets to identify problem areas • leverage expertise and experience of other staff in resolving • often results in an agreement among support teams at the 11 RPs how to (better, faster) resolve problems in future • relevant insights are promptly reflected in the online materials • documentation, User Portal, Knowledge Base • cross-cutting operational issues identified and reported to the User Interaction
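The regular trouble-ticket analysis described above can be pictured as two simple aggregations: which categories generate the most tickets, and which categories show up at more than one provider. A minimal sketch with invented ticket data; the provider labels, categories, and function names are all hypothetical:

```python
from collections import Counter

# Hypothetical tickets as (resource provider, category). Not real TeraGrid data.
tickets = [
    ("RP-A", "login"),
    ("RP-B", "file transfer"),
    ("RP-A", "file transfer"),
    ("RP-C", "login"),
    ("RP-B", "file transfer"),
    ("RP-C", "job scheduling"),
]

def problem_areas(tickets, top_n=2):
    """Most frequent ticket categories, as (category, count) pairs."""
    counts = Counter(category for _, category in tickets)
    return counts.most_common(top_n)

def cross_cutting(tickets, min_providers=2):
    """Categories reported at `min_providers` or more distinct providers."""
    providers = {}
    for rp, category in tickets:
        providers.setdefault(category, set()).add(rp)
    return sorted(c for c, rps in providers.items() if len(rps) >= min_providers)

print(problem_areas(tickets))
print(cross_cutting(tickets))
```

Categories that span several providers are the natural candidates for the cross-cutting operational issues mentioned on the slide.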
But what did we learn? • We did many things to improve the quality of the product we delivered to our customers • established many practices and procedures • adopted many formal software engineering practices • improved the user experience in using our resources and services • paid attention to the experiences our users had in making use of the environment • But these were not the heart of what has made us successful
It's all about the people! • Staff with a passion to produce a quality product in the form of integrated software, services and resources • who were willing to go beyond traditional research activities to attain the goal • Staff with a passion to enable the work of scientists and engineers • with the expertise in the use of advanced technologies • Staff with a vision for excellence • who connected with our user community on many levels Never underestimate the value of the staff working on your projects!