
TeraGrid-Wide Operations



Presentation Transcript


  1. TeraGrid-Wide Operations
  Von Welch, Area Director for Networking, Operations and Security
  NCSA, University of Illinois
  April 2009

  2. Highlights
  • TeraGrid surpassed 1 petaflops of aggregate computing power.
  • Aggregate compute power available increased 3.5x from 2007 to 2008.
    • Primarily the result of Track 2 systems at TACC and NICS coming online.
  • NUs used and allocated increased roughly 4x from 2007 to 2008.
  • Significant improvement in instrumentation, including tracking of grid usage and data transfers.
  • Inca providing historical tracking of software and service reliability, along with a new interface for both users and administrators.
  • An international security incident touched TeraGrid, resulting in a very strong incident response as well as improved procedures for a new attack vector.
  • Improvements in authentication procedures and cross-resource single sign-on.

  3. Big Picture Resource Changes
  • Sun Constellation Cluster (Ranger) at TACC, Feb ’08
    • Initially 504 Tflops; upgraded in July 2008 to 580 Tflops
  • Cray XT4 (Kraken) at NICS, Aug ’08
    • 166 Tflops and 18,000 compute cores
  • Additional resources that entered production in 2008:
    • Two Dell PowerEdge 1950 clusters: a 668-node system at LONI (QueenBee) and an 893-node system at Purdue (Steele)
    • PSC’s SGI Altix 4700 shared-memory NUMA system (Pople)
    • FPGA-based resource at Purdue (Brutus)
    • Remote visualization system at TACC (Spur)
  • Other improvements:
    • The Condor pool at Purdue grew from 7,700 to more than 22,800 processor cores.
    • Indiana integrated its Condor resources with the Purdue flock, simplifying use.
  • Decommissioned systems:
    • NCSA’s Tungsten, PSC’s Rachel, Purdue’s Lear, SDSC’s DataStar and Blue Gene, and TACC’s Maverick.

  4. TeraGrid HPC Usage, 2008
  • In 2008:
    • Aggregate HPC power increased by 3.5x
    • NUs requested and awarded quadrupled
    • NUs delivered increased by 2.5x
  [Chart: quarterly NU usage, annotated with Ranger entering production in Feb. 2008 and Kraken in Aug. 2008; 3.9B NUs delivered in all of 2007 vs. 3.8B NUs in Q4 2008 alone.]

  5. TeraGrid Operations Center
  • Manage the TG ticket system and 24x7 toll-free call center
  • Respond to all users and provide front-line resolution if possible
  • Route remaining tickets to User Services, RP sites, etc. as appropriate
  • Maintain situational awareness across the TG project (upgrades, maintenance, etc.)
  2008 statistics (see the quick check below):
  • Created 7,762 tickets; immediately resolved 2,652 (34%)
  • Took 675 phone calls; immediately resolved 454 (67%)
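
The immediate-resolution percentages follow directly from the raw counts on the slide; a minimal arithmetic check in Python:

    # Quick check of the TOC immediate-resolution rates quoted above.
    tickets_created, tickets_resolved = 7_762, 2_652
    calls_taken, calls_resolved = 675, 454

    print(f"Ticket resolution rate: {tickets_resolved / tickets_created:.0%}")  # ~34%
    print(f"Call resolution rate:   {calls_resolved / calls_taken:.0%}")        # ~67%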

  6. Instrumentation and Monitoring
  • Monitoring and statistics gathering for TG services
    • E.g. backbone, grid services (GRAM, GridFTP)
  • Used for individual troubleshooting – e.g. LEAD
  • Moving to views of the big picture (see the aggregation sketch below)
  [Figures: Inca custom display for LEAD; daily peak bandwidth used; GridFTP usage by day.]
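
The project’s own instrumentation is not reproduced here; purely as an illustration of the kind of roll-up behind a “GridFTP usage by day” view, here is a minimal sketch. The transfers.csv file and its column names are hypothetical, not the real TeraGrid log format:

    # Hypothetical sketch: roll per-transfer records up into daily GridFTP volume.
    # Assumes a CSV with columns "timestamp" (ISO 8601) and "bytes".
    import csv
    from collections import defaultdict
    from datetime import datetime

    daily_bytes = defaultdict(int)
    with open("transfers.csv", newline="") as f:
        for row in csv.DictReader(f):
            day = datetime.fromisoformat(row["timestamp"]).date()
            daily_bytes[day] += int(row["bytes"])

    for day in sorted(daily_bytes):
        print(day, f"{daily_bytes[day] / 1e12:.2f} TB")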

  7. Inca Grid Monitoring System
  • Automated, user-level testing to improve reliability by detecting Grid infrastructure problems (an illustrative check is sketched below).
  • Provides detailed information about tests and their execution to aid in debugging problems.
  • 2,538 pieces of test data are being collected.
  • Originally designed for TeraGrid, and now successfully used in other large-scale projects including ARCS, DEISA, and NGS.
  2008 improvements:
  • Custom views: LEAD, User Portal
  • Email notifications of errors
  • Historical views
  • Recognizes scheduled downtime
  • 20 new tests written; 77 TeraGrid tests modified
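
Inca’s real reporters have their own framework and output format; as a stand-in illustration of an automated, user-level availability check, here is a minimal sketch. The host name is a placeholder (2811 is the standard GridFTP control port):

    # Illustrative only: a tiny user-level service check in the spirit of the
    # Inca tests described above. The host is a placeholder.
    import socket
    import sys

    HOST, PORT, TIMEOUT = "gridftp.example.teragrid.org", 2811, 10

    try:
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            print(f"PASS: {HOST}:{PORT} accepted a connection")
        sys.exit(0)
    except OSError as err:
        print(f"FAIL: {HOST}:{PORT} unreachable ({err})")
        sys.exit(1)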

  8. TeraGrid Backbone Network
  • TeraGrid 10 Gb/s backbone runs from Chicago to Denver to Los Angeles; contracted from NLR.
  • Dedicated 10 Gb/s link(s) from each RP to one of the three core routers.
  • Usage: daily bandwidth peaks on the backbone are typically in the 2–4 Gb/s range, with ~3% increase per month.
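
At ~3% per month that peak roughly doubles in about two years; a minimal projection, assuming the growth rate stays constant and taking 3 Gb/s (the midpoint of the quoted range) as the starting point:

    # Project the daily peak bandwidth under a constant ~3% monthly growth rate.
    # The 3.0 Gb/s starting value is an assumption (midpoint of the 2-4 Gb/s range).
    peak_gbps, monthly_growth = 3.0, 0.03

    for month in range(0, 25, 6):
        projected = peak_gbps * (1 + monthly_growth) ** month
        print(f"month {month:2d}: ~{projected:.1f} Gb/s")
    # 1.03**24 ≈ 2.0, i.e. the peak roughly doubles over 24 months.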

  9. Security
  • Gateway Summit (with the Science Gateways team)
    • Form best practices and common processes among sites
    • Develop understanding between sites and Gateway developers
  • User Portal password reset procedure
  • Risk assessments for Science Gateways and the User Portal
  • TAGPMA leadership for PKI interoperability
  • Uncovered a large-scale attack in collaboration with EU Grid partners
  • Established secure communications: Secure Wiki, SELS, Jabber (secure IM) – including EU partners

  10. Single Sign-On
  • GSI-SSHTERM (from NGS) added to the User Portal
    • Consistently among the top 5 most-used applications
    • Augments command-line functionality already in place
  • Started deploying catastrophic failover for single sign-on (see the sketch below)
    • Replicating the NCSA MyProxy PKI to PSC
    • Implemented client changes on RPs and the User Portal for failover
  • Developed policies for coherent TeraGrid identities
    • Identities (X.509 distinguished names) come from allocations, RPs and users – a complicated, error-prone process
    • Tests written for TGCDB; Inca tests for RPs will follow
  • Started addition of Shibboleth support for the User Portal
    • TeraGrid now a member of InCommon (as a service provider)
    • Will migrate to new Internet Framework when ready
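
On the failover client changes: the essence is to try a replica when the primary MyProxy server is unreachable. A minimal sketch, assuming the standard myproxy-logon command-line client is installed; the host names and username are placeholders, not the actual TeraGrid endpoints:

    # Hypothetical sketch of client-side MyProxy failover: try the primary
    # server, then fall back to the replica. Hosts and username are placeholders.
    import subprocess

    MYPROXY_SERVERS = ["myproxy.primary.example.org", "myproxy.replica.example.org"]
    USERNAME = "tguser"

    def get_proxy(username: str) -> bool:
        for server in MYPROXY_SERVERS:
            # myproxy-logon -s <server> -l <username> retrieves a short-lived proxy credential.
            result = subprocess.run(["myproxy-logon", "-s", server, "-l", username])
            if result.returncode == 0:
                print(f"Obtained proxy credential from {server}")
                return True
            print(f"{server} failed (exit {result.returncode}); trying the next server")
        return False

    if __name__ == "__main__":
        get_proxy(USERNAME)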

  11. Questions?
