TeraGrid: A Terascale Distributed Discovery Environment
Jay Boisseau, TeraGrid Executive Steering Committee (ESC) Member and Director, Texas Advanced Computing Center at The University of Texas at Austin
Outline
• What is TeraGrid?
• User Requirements
• TeraGrid Software
• TeraGrid Resources & Support
• Science Gateways
• Summary
The TeraGrid Vision
• Integrating the nation's most powerful resources
  • Provide a unified, general-purpose, reliable set of services and resources.
  • Strategy: an extensible virtual organization of people and resources across TeraGrid partner sites.
• Enabling the nation's terascale science
  • Make science more productive through a unified set of very-high-capability resources.
  • Strategy: leverage TeraGrid's unique resources to create new capabilities, driven and prioritized by science partners.
• Empowering communities to leverage TeraGrid capabilities
  • Bring TeraGrid capabilities to the broad science community (no longer just "big" science).
  • Strategy: Science Gateways connecting communities, and an integrated roadmap with peer grids and software efforts.
The TeraGrid Strategy: Make It Extensible!
• Building a distributed system of unprecedented scale
  • 40+ teraflops compute, 1+ petabyte storage, 10-40 Gb/s networking
• Creating a unified user environment across heterogeneous resources
  • User software environment, user support resources
• Created an initial community of over 500 users, 80 PIs
• Integrating new partners to introduce new capabilities
  • Additional computing and visualization capabilities
  • New types of resources: data collections, instruments
The TeraGrid Team
• The TeraGrid team has two major components:
  • 9 Resource Providers (RPs), who provide resources and expertise
    • Seven universities
    • Two government laboratories
    • Expected to grow
  • The Grid Integration Group (GIG), which provides leadership in grid integration among the RPs
    • Led by a Director, assisted by the Executive Steering Committee, Area Directors, and a Project Manager
    • Includes participation by staff at each RP
• Funding now provided for people, not just networks and hardware!
Integration: Converging NSF Initiatives
• High-end capabilities: U.S. Core Centers, TeraGrid…
  • Integrating high-end, production-quality supercomputer centers
  • Building tightly coupled, unique large-scale resources
  • STRENGTH: time-critical and/or unique high-end capabilities
• Communities: GriPhyN, iVDGL, LEAD, GEON, NEESGrid…
  • ITR and MRI projects integrate science communities
  • Building community-specific capabilities and tools
  • STRENGTH: community integration and tailored capabilities; high-capacity, loosely coupled capabilities
• Common software base: NSF/NMI, DOE, NASA programs
  • Projects integrating, packaging, and distributing software and tools from the grid community
  • Building common middleware components and integrated distributions
  • STRENGTH: large-scale deployment, common software base, assured-quality software components and component sets
Coherence: Unified User Environment
• Do I have to learn how to use 9 systems?
  • Coordinated TeraGrid Software and Services (CTSS)
  • Transition toward services and a service-oriented architecture: from "software stack" to "software and services"
• Do I have to submit proposals for 9+ allocations?
  • Unified NRAC for Core and TeraGrid resources; roaming allocations
• Can I use TeraGrid the way I use other grids?
  • Partnership with the Globus Alliance, NMI GRIDS Center, and other grids
  • History of collaboration and successful interoperation with other grids
TeraGrid User Survey
• TeraGrid capabilities must be user-driven
• Undertook a needs analysis in Summer 2004 with 16 science partner teams
  • These may not be widely representative, so the analysis will be repeated every year with an increasing number of groups
• 62 items considered; the top 10 needs are reflected in the TeraGrid roadmap
TeraGrid User Input
[Chart: the top 10 user needs, rated by overall score and number of partners in need, spanning data, grid computing, and science gateway categories]
• Remote file read/write
• High-performance file transfer
• Coupled applications, co-scheduling
• Grid portal toolkits
• Grid workflow tools
• Batch metascheduling
• Global file system
• Client-side computing tools
• Batch-scheduled parameter sweep tools
• Advanced reservations
Some Common Grid Computing Use Cases
• Submitting large numbers of individual jobs (see the sketch below)
  • Requires grid scheduling across multiple systems
  • Requires automated data movement or a common file system
• Running on-demand jobs for time-critical applications (e.g., weather forecasts, medical treatments)
  • Requires preemptive scheduling
  • Requires fault tolerance (checkpoint/recovery)
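To make the first use case concrete, here is a minimal sketch, assuming a Python client with the pre-WS Globus GRAM command-line tools (globus-job-submit, globus-job-status) on the PATH and a valid proxy credential already in place; the site contact strings and the executable are hypothetical examples, not actual TeraGrid endpoints.

```python
import subprocess

# Hypothetical GRAM gatekeeper contacts for three TeraGrid sites.
SITES = ["tg-login.siteA.example.org", "tg-login.siteB.example.org",
         "tg-login.siteC.example.org"]

def submit(site, executable, args):
    """Submit one job via pre-WS GRAM; returns the job contact URL."""
    cmd = ["globus-job-submit", site, executable] + args
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout.strip()

def status(contact):
    """Poll a previously submitted job (PENDING, ACTIVE, DONE, FAILED...)."""
    return subprocess.run(["globus-job-status", contact],
                          capture_output=True, text=True).stdout.strip()

# Fan 300 independent jobs out across the sites, round-robin.
contacts = [submit(SITES[i % len(SITES)], "/bin/hostname", [])
            for i in range(300)]
print(f"submitted {len(contacts)} jobs; first status: {status(contacts[0])}")
```

Even this toy version shows why the bullets above matter: without a metascheduler the round-robin placement is blind to queue depth, and without automated data movement each job must find its input at whichever site it lands on.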
Highest Priority Items
• Common to many projects that are quite different in their specific usage scenarios:
  • Efficient cross-site data management
  • Efficient cross-site computing
  • Capabilities to customize Science Gateways to the needs of specific user communities
  • Simplified management of accounts, allocations, and security credentials across sites
Bringing TeraGrid Capabilities to Communities
• A new generation of "users" that access TeraGrid via Science Gateways, scaling well beyond the traditional "user" with a shell login account
• [Table: projected user community size for each science gateway project]
• The impact on society from gateways enabling decision support is much larger!
Exploiting TeraGrid's Unique Capabilities
• Aquaporin mechanism (Klaus Schulten, UIUC): water moves through aquaporin channels in single file, oxygen leading the way in. At the most constricted point of the channel, the water molecule flips; protons can't do this. The animation was pointed to by the 2003 Nobel chemistry prize announcement.
• ENZO (astrophysics; Mike Norman, UCSD): an adaptive mesh refinement, grid-based hybrid code designed for simulations of cosmological structure formation.
• Reservoir modeling (J. Saltz, OSU): given an (unproduced) oil field, permeability and other material properties (based on geostatistical models), and the locations of a few producer/injector wells, where is the best place for a third injector? The goal is fully automatic optimization of injector well placement.
• GAFEM (groundwater modeling): a parallel code, developed at North Carolina State University, for solving large-scale groundwater inverse problems.
Exploiting TeraGrid's Unique Capabilities: Flood Modeling
Merry Maisel (TACC), Gordon Wells (UT)
• Flood modeling needs more than traditional batch-scheduled HPC systems!
  • Precipitation data, groundwater data, terrain data
  • Rapid large-scale data visualization
  • On-demand scheduling
  • Ensemble scheduling
  • Real-time visualization of simulations
  • Computational steering of possible remedies
  • Simplified access to results via web portals for field agents, decision makers, etc.
• TeraGrid adds the necessary data and visualization systems, portals, and grid services
Harnessing TeraGrid for Education
• Example: Nanohub is used by undergraduate and graduate students to complete coursework in dozens of courses at 10 universities.
User Inputs Determine the TeraGrid Roadmap
• Top priorities are reflected in the Grid Capabilities and Software Integration roadmap. First targets:
  • User-defined reservations
  • Resource matching and wait-time estimation
  • Grid interfaces for on-demand and reserved access
  • Parallel/striped data movers (see the transfer sketch below)
  • A co-scheduling service defined for high-performance data transfers
  • Dedicated GridFTP transfer nodes available to production users
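For a flavor of what the data-mover targets look like at the command level, here is a minimal sketch, assuming the Globus globus-url-copy client is installed and a proxy credential exists; the endpoint hostnames and file paths are hypothetical.

```python
import subprocess

# Hypothetical GridFTP endpoints at two TeraGrid sites.
SRC = "gsiftp://gridftp.siteA.example.org/scratch/run42/output.dat"
DST = "gsiftp://gridftp.siteB.example.org/archive/run42/output.dat"

# -p 8    : eight parallel TCP streams for the transfer
# -stripe : stripe the transfer across multiple data nodes where the
#           server supports it (the "striped data mover" case above)
# -tcp-bs : TCP buffer size in bytes, tuned for a high-bandwidth WAN path
subprocess.run(["globus-url-copy", "-p", "8", "-stripe",
                "-tcp-bs", "4194304", SRC, DST], check=True)
```

Parallel streams and large TCP buffers are what let a single logical transfer fill a 10-40 Gb/s path; the dedicated transfer nodes in the roadmap exist so those tuned settings do not compete with login-node traffic.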
Working Groups and Requirements Analysis Teams (RATs)
• Working Groups: Applications; Data; External Relations; Grid Interoperability; Networks; Operations; Performance Evaluation; Portals; Security; Software; Test Harness and Information Services (THIS); User Services; Visualization
• RATs: Science Gateways; Security; Advanced Application Support; User Portal; CTSS Evolution; Data Transport Tools; Job Scheduling Tools; TeraGrid Network
Software Strategy
• Identify existing solutions; develop solutions only as needed
• Some solutions are frameworks that we must tailor to our goals
  • Information services / site interfaces
• Some solutions do not exist
  • Software function verification: the INCA project, a scripted implementation of the docs (see the sketch below)
  • Global account / accounting management: AMIE
  • Data movers, etc.
• Deploy, integrate, harden, and support!
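To illustrate the idea behind "scripted docs" verification (this is not INCA's actual reporter API, just a minimal sketch of the concept): each check runs a command the documentation promises will exist on a resource and compares the output against what the docs say. The command list and expected strings here are hypothetical.

```python
import shutil
import subprocess

# Each entry: (documented command, version flag, substring the docs promise).
# These specific commands and expectations are hypothetical examples.
CHECKS = [
    ("globus-url-copy", "-version", "globus-url-copy"),
    ("gcc", "--version", "gcc"),
]

def run_check(cmd, flag, expected):
    """Return (passed, detail) for one documented capability."""
    if shutil.which(cmd) is None:
        return False, f"{cmd}: not found on PATH"
    out = subprocess.run([cmd, flag], capture_output=True, text=True)
    ok = expected in (out.stdout + out.stderr)
    return ok, f"{cmd}: {'ok' if ok else 'unexpected output'}"

for cmd, flag, expected in CHECKS:
    passed, detail = run_check(cmd, flag, expected)
    print(("PASS" if passed else "FAIL"), detail)
```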
TeraGrid Software Stack Offerings
• Core software
  • Grid service servers and clients
  • Data management and access tools
  • Authentication services
  • Environment commonality and management
  • Applications: a springboard for workflow and service-oriented work
• Platform-specific software
  • Compilers
  • Binary compatibility opportunities
  • Performance tools
  • Visualization software
• Services
  • Databases
  • Data archives
  • Instruments
TeraGrid Software Development
• Consortium of leading project members
  • Define primary goals and targets
  • Mine helpdesk data
  • Review pending software request candidates
• Transition test environments to production
  • Eliminate software workarounds
  • Implement solutions derived from user surveys
• Deployment testbeds
  • Separate environments as well as alternate access points
  • Independent testbeds in place
  • Internal staff testing from applications teams
  • Initial beta users
Software Roadmap
• Near-term work (in progress)
  • Co-scheduled file transfers
  • Production-level GridFTP resources
  • Metascheduling (grid scheduling)
  • Simple workflow tools (a minimal sketch follows)
• Future directions
  • On-demand integration with the Open Science Grid
  • Grid checkpoint/restart
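The "simple workflow tools" item is about chaining dependent steps (stage data in, compute, stage out) instead of submitting them by hand. A minimal sketch of the idea, under the assumption that this is not any TeraGrid tool's actual interface; real steps would be globus-url-copy transfers and GRAM submissions, while echo placeholders keep the sketch runnable anywhere.

```python
import subprocess

# A tiny dependency-ordered workflow: each step lists the steps it waits on.
STEPS = {
    "stage_in":  ([], ["echo", "stage input data to the compute site"]),
    "compute":   (["stage_in"], ["echo", "run the solver"]),
    "stage_out": (["compute"], ["echo", "archive the results"]),
}

def run_workflow(steps):
    done = set()
    while len(done) < len(steps):
        progressed = False
        for name, (deps, cmd) in steps.items():
            if name not in done and all(d in done for d in deps):
                subprocess.run(cmd, check=True)  # fail fast on any step error
                done.add(name)
                progressed = True
        if not progressed:
            raise RuntimeError("cycle or unsatisfiable dependency in workflow")

run_workflow(STEPS)
```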
Grid Roadmap
• Near term
  • User-defined reservations
  • Web services testbeds
  • Resource wait-time estimation, to be used by workflow tools
  • Striped data movers
  • WAN file system prototypes
• Longer term
  • Integrated tools for workflow scheduling
  • Commercial grid middleware opportunities
TeraGrid Resources
Partners will add resources, and TeraGrid will add partners!
TeraGrid Usage by NSF Division
[Chart: usage share across NSF divisions — CDA, IBN, CCR, ECS, DMS, BCS, ASC, DMR, AST, MCB, CHE, PHY, CTS; includes DTF/ETF clusters only]
TeraGrid User Support Strategy
• Proactive and rapid response for general user needs
• Sustained assistance for groundbreaking applications
• GIG coordination, with staffing from all RP sites
  • Area Director (AD): Sergiu Sanielevici (PSC)
• Peering with Core Centers user support teams
User Support Team (UST): Trouble Tickets
• Filter TeraGrid Operations Center (TOC) trouble tickets: system issue or possible user issue?
• For each ticket, designate a Point of Contact (POC) to contact the user within 24 hours
  • Communicate status if known
  • Begin a dialog to consult on a solution or workaround
• Designate a Problem Response Squad (PRS) to assist the POC
  • Experts who respond to the POC's postings to the UST list and/or are requested by the AD
  • All UST members monitor progress reports and contribute their expertise
  • PRS membership may evolve with our understanding of the problem, including support from hardware and software teams
• Ensure all of GIG/RP/Core helps and learns
  • Weekly review of user issues selected by the AD; decide on escalation
  • Inform TG development plans
User Support Team (UST): Advanced Support
• For applications/groups judged by TG management to be groundbreaking in exploiting DEEP/WIDE TG infrastructure
• An "embedded" Point of Contact (labor intensive)
  • Becomes a de facto member of the application group
  • A prior working relationship with the application group is a plus
  • Can write and test code, redesign algorithms, optimize, etc. — but no throwing work over the fence
  • Represents the needs of the application group to systems people, if required
  • Alerts the AD to success stories
The Gateway Concept
• The goal and approach
  • Engage advanced scientific communities that are not traditional users of the supercomputing centers
  • Build science gateways providing community-tailored access to TeraGrid services and capabilities
• Science Gateways take two forms:
  • Web-based portals that front-end grid services providing TeraGrid-deployed applications used by a community
  • Coordinated access points enabling users to move seamlessly between TeraGrid and other grids
Grid Portal Gateways [Diagram: workflow composer]
• The portal is accessed through a browser or desktop tools
  • Provides grid authentication and access to services (see the credential sketch below)
  • Provides direct access to TeraGrid-hosted applications as services
• The required support services
  • Searchable metadata catalogs
  • Information space management
  • Workflow managers
  • Resource brokers
  • Application deployment services
  • Authorization services
• Builds on NSF and DOE software
  • NMI portal framework, GridPort
  • NMI grid tools: Condor, Globus, etc.
  • OSG, HEP tools: Clarens, MonALISA
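The "grid authentication" bullet typically means the portal never holds the user's long-term key: it fetches a short-lived proxy credential from a MyProxy server on the user's behalf and uses that for all grid calls. A minimal sketch, assuming the MyProxy command-line client is installed; the server name, account, and file path are hypothetical.

```python
import os
import subprocess

def fetch_proxy(username, passphrase,
                server="myproxy.example.teragrid.org",  # hypothetical server
                proxy_file="/tmp/x509up_portal"):
    """Retrieve a short-lived proxy credential for the user from MyProxy.

    -S reads the passphrase from stdin, -t limits the proxy lifetime (hours),
    -o writes the delegated credential to proxy_file.
    """
    subprocess.run(
        ["myproxy-logon", "-s", server, "-l", username,
         "-S", "-t", "2", "-o", proxy_file],
        input=passphrase, text=True, check=True)
    return proxy_file

# Subsequent grid tools (globus-job-submit, globus-url-copy) pick up the
# delegated credential via the X509_USER_PROXY environment variable.
os.environ["X509_USER_PROXY"] = fetch_proxy("jdoe", "secret-passphrase")
```

The short lifetime is the design point: a compromised portal leaks at most a two-hour credential, not the user's identity.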
Gateways that Bridge to Community Grids
• Many community grids already exist or are being built
  • NEESGrid, LIGO, Earth Systems Grid, NVO, Open Science Grid, etc.
• TeraGrid will provide a service framework enabling access in ways that are transparent to their users
  • The community maintains and controls the gateway
• Different communities have different requirements
  • NEES and LEAD will use TeraGrid to provision compute services
  • LIGO and NVO have substantial data distribution problems
  • All of them require remote execution of complex workflows
[Diagram: storms forming → streaming observations → forecast model → data mining, driven by on-demand grid computing]
The Architecture of Gateway Services
[Diagram: the user's desktop talks to a grid portal server, which calls TeraGrid gateway services — proxy certificate server/vault, user metadata catalog, application workflow, application deployment, application events, resource broker, application resource catalogs, replica management. These sit on core grid services — security, notification, resource allocation, grid orchestration, data management, accounting, policy, reservations and scheduling, administration and monitoring — built on the Web Services Resource Framework and WS-Notification, over the physical resource layer.]
Flood Modeling Gateway
• University of Texas at Austin: TACC, Center for Research in Water Resources, Center for Space Research
• Oak Ridge National Lab
• Purdue University
Large-scale flooding along Brays Bayou in central Houston, triggered by heavy rainfall during Tropical Storm Allison (June 9, 2001), caused more than $2 billion of damage.
Gordon Wells (UT); David Maidment (UT); Budhu Bhaduri (ORNL); Gilbert Rochon (Purdue)
Biomedical and Biology
• Building biomedical communities — Dan Reed (UNC)
  • National Evolutionary Synthesis Center
  • Carolina Center for Exploratory Genetic Analysis
  • Portals and federated databases for the biomedical research community
Neutron Science Gateway
• Matching instrument science with TeraGrid
• Focusing on application use cases that can be uniquely served by TeraGrid; for example, a proposed scenario from the March 2003 SETENS proposal
Neutron Science TeraGrid Gateway (NSTG), John Cobb, ORNL
Summary
SURA Opportunities with TeraGrid
• Identify applications in SURA universities
• Leverage TeraGrid technologies in SURA grid activities
• Provide tech transfer back to TeraGrid
• Deploy grids in the SURA region that interoperate with TeraGrid, allowing users to 'scale up' to TeraGrid
Summary
• TeraGrid is a national cyberinfrastructure partnership for world-class computational research, with many types of resources for knowledge discovery
• TeraGrid aims to integrate with other grids and other researchers around the world
• The All Hands Meeting in April will yield new details on roadmaps, software, capabilities, and opportunities
For More Information
• TeraGrid: http://www.teragrid.org
• TACC: http://www.tacc.utexas.edu
• Feel free to contact me directly: Jay Boisseau, boisseau@tacc.utexas.edu
Note: TACC is about to announce the new International Partnerships for Advanced Computing (IPAC) program, with initial members from Latin America and Spain, which can serve as a 'gateway' into TeraGrid.