TeraGrid-Wide Operations
Von Welch, Area Director for Networking, Operations and Security
NCSA, University of Illinois
April 2009
Highlights
• TeraGrid surpassed 1 petaflops of aggregate computing power.
• Aggregate compute power available grew 3.5x from 2007 to 2008, primarily a result of the Track 2 systems at TACC and NICS coming online.
• NUs used and allocated grew roughly 4x from 2007 to 2008.
• Significant improvement in instrumentation, including tracking of grid usage and data transfers.
• Inca now provides historical tracking of software and service reliability, along with a new interface for both users and administrators.
• An international security incident touched TeraGrid, prompting a very strong incident response as well as improved procedures for a new attack vector.
• Improvements in authentication procedures and cross-resource single sign-on.
Big Picture Resource Changes
• Sun Constellation Cluster (Ranger) at TACC, Feb. 2008
  • Initially 504 Tflops; upgraded in July 2008 to 580 Tflops
• Cray XT4 (Kraken) at NICS, Aug. 2008
  • 166 Tflops and 18,000 compute cores
• Additional resources that entered production in 2008:
  • Two Dell PowerEdge 1950 clusters: a 668-node system at LONI (QueenBee) and an 893-node system at Purdue (Steele)
  • PSC's SGI Altix 4700 shared-memory NUMA system (Pople)
  • FPGA-based resource at Purdue (Brutus)
  • Remote visualization system at TACC (Spur)
• Other improvements:
  • The Condor pool at Purdue grew from 7,700 to more than 22,800 processor cores.
  • Indiana integrated its Condor resources with the Purdue flock, simplifying use.
• Decommissioned systems:
  • NCSA's Tungsten, PSC's Rachel, Purdue's Lear, SDSC's DataStar and Blue Gene, and TACC's Maverick
TeraGrid HPC Usage, 2008
In 2008:
• Aggregate HPC power increased by 3.5x
• NUs requested and awarded quadrupled
• NUs delivered increased by 2.5x
[Chart: quarterly NU usage, annotated with Ranger entering production in Feb. 2008 and Kraken in Aug. 2008; 3.9B NUs were delivered in all of 2007 versus 3.8B NUs in Q4 2008 alone.]
TeraGrid Operations Center
• Manages the TG Ticket System and a 24x7 toll-free call center
• Responds to all users and provides front-line resolution where possible
• Routes remaining tickets to User Services, RP sites, etc. as appropriate
• Maintains situational awareness across the TG project (upgrades, maintenance, etc.)
2008 statistics:
• Created 7,762 tickets; immediately resolved 2,652 (34%)
• Took 675 phone calls; immediately resolved 454 (67%)
Instrumentation and Monitoring
• Monitoring and statistics gathering for TG services
  • E.g., backbone, grid services (GRAM, GridFTP)
• Used for individual troubleshooting, e.g., LEAD
• Moving toward views of the big picture
[Figures: Inca custom display for LEAD; daily peak bandwidth used; GridFTP usage by day.]
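As a rough illustration of the kind of aggregation behind a "GridFTP usage by day" view, the sketch below tallies bytes transferred per day from a transfer log. The CSV log format (date and bytes columns) and the file name are hypothetical placeholders, not the actual GridFTP log syntax behind TeraGrid's instrumentation.

```python
from collections import defaultdict
import csv

def daily_gridftp_usage(log_path):
    """Sum bytes transferred per day from a simple CSV transfer log.

    Assumes a hypothetical log with 'date' (YYYY-MM-DD) and 'bytes'
    columns -- real GridFTP logs would need their own parser.
    """
    totals = defaultdict(int)
    with open(log_path) as f:
        for row in csv.DictReader(f):
            totals[row["date"]] += int(row["bytes"])
    return dict(sorted(totals.items()))

if __name__ == "__main__":
    for day, nbytes in daily_gridftp_usage("transfers.csv").items():
        print(f"{day}: {nbytes / 1e9:.2f} GB")
```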
Inca Grid Monitoring System
• Automated, user-level testing to improve reliability by detecting grid infrastructure problems
• Provides detailed information about tests and their execution to aid in debugging problems
• 2,538 pieces of test data are being collected
• Originally designed for TeraGrid, and now successfully used in other large-scale projects including ARCS, DEISA, and NGS
2008 improvements:
• Custom views: LEAD, User Portal
• Email notifications of errors
• Historical views
• Recognizes scheduled downtime
• 20 new tests written; 77 TeraGrid tests modified
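To make "automated, user-level testing" concrete, here is a minimal sketch of a probe that runs client commands the way an ordinary user would and reports pass/fail with timing and detail. It is illustrative only: real Inca reporters use Inca's own reporter framework and result schema, and the specific commands checked here are hypothetical examples.

```python
import subprocess
import time

def probe(name, argv):
    """Run a client command as an ordinary user and report pass/fail.

    Illustrative only: real Inca reporters use Inca's own reporter
    framework and schema, not this ad-hoc print format.
    """
    start = time.time()
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
        ok = result.returncode == 0
        lines = (result.stdout or result.stderr).strip().splitlines()
        detail = lines[0] if lines else ""
    except (OSError, subprocess.TimeoutExpired) as exc:
        ok, detail = False, str(exc)
    print(f"{name}: {'PASS' if ok else 'FAIL'} ({time.time() - start:.1f}s) {detail}")
    return ok

if __name__ == "__main__":
    # Hypothetical checks of grid client tools a user would rely on.
    probe("gram-client", ["globusrun", "-version"])
    probe("gridftp-client", ["globus-url-copy", "-version"])
```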
TeraGrid Backbone Network
• TeraGrid's 10 Gb/s backbone runs from Chicago to Denver to Los Angeles, contracted from NLR.
• Dedicated 10 Gb/s link(s) run from each RP to one of the three core routers.
• Usage: daily bandwidth peaks on the backbone are typically in the 2-4 Gb/s range, with ~3% increase per month.
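For a rough sense of the headroom implied by that trend, the compound ~3%/month growth can be projected forward; the sketch below estimates when a 4 Gb/s daily peak would reach the 10 Gb/s per-link capacity. This is an extrapolation for illustration only, not a TeraGrid capacity plan.

```python
import math

PEAK_GBPS = 4.0       # upper end of current daily peaks
CAPACITY_GBPS = 10.0  # per-link backbone capacity
GROWTH = 0.03         # observed ~3% increase per month

# Solve peak * (1 + g)**m >= capacity for m.
months = math.log(CAPACITY_GBPS / PEAK_GBPS) / math.log(1 + GROWTH)
print(f"At ~3%/month, {PEAK_GBPS:g} Gb/s peaks reach {CAPACITY_GBPS:g} Gb/s "
      f"in about {months:.0f} months.")
```

Under these assumptions the result is roughly 31 months, i.e., about two and a half years of headroom at the observed growth rate.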
Security
• Gateway Summit (with the Science Gateways team)
  • Formed best practices and common processes among sites
  • Developed understanding between sites and Gateway developers
• User Portal password reset procedure
• Risk assessments for Science Gateways and the User Portal
• TAGPMA leadership for PKI interoperability
• Uncovered a large-scale attack in collaboration with EU Grid partners
• Established secure communications, including with EU partners: secure wiki, SELS, Jabber (secure IM)
Single Sign-on
• GSI-SSHTerm (from NGS) added to the User Portal
  • Consistently among the top 5 most-used applications
  • Augments command-line functionality already in place
• Started deploying catastrophic failover for single sign-on (see the sketch after this list)
  • Replicating the NCSA MyProxy PKI to PSC
  • Implemented client changes on RPs and the User Portal for failover
• Developed policies for coherent TeraGrid identities
  • Identities (X.509 distinguished names) come from allocations, RPs, and users, a complicated, error-prone process
  • Tests written for TGCDB; Inca tests for RPs will follow
• Started adding Shibboleth support to the User Portal
  • TeraGrid is now a member of InCommon (as a service provider)
  • Will migrate to the new Internet2 framework when ready
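A minimal sketch of the client-side failover behavior described above, assuming the standard myproxy-logon command from the MyProxy distribution. The server hostnames and username are hypothetical placeholders, and the ordered-failover logic is illustrative, not TeraGrid's deployed client code.

```python
import subprocess

# Hypothetical hostnames standing in for the NCSA primary and its PSC replica.
MYPROXY_SERVERS = ["myproxy-primary.example.org", "myproxy-replica.example.org"]

def get_proxy(username):
    """Try each MyProxy server in order until a credential is retrieved.

    myproxy-logon (-s server, -l username) is the standard MyProxy client;
    it prompts for the passphrase on the terminal. The failover ordering
    here is an illustrative sketch.
    """
    for server in MYPROXY_SERVERS:
        result = subprocess.run(["myproxy-logon", "-s", server, "-l", username])
        if result.returncode == 0:
            print(f"Retrieved proxy credential from {server}")
            return True
        print(f"{server} unavailable or refused; trying next server")
    return False

if __name__ == "__main__":
    get_proxy("jdoe")  # hypothetical username
```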