Network, Operations and Security Area
Tony Rimovsky
NOS Area Director
tony@ncsa.uiuc.edu
The TeraGrid Map
[Figure: map of TeraGrid sites: UC/ANL, PSC, NCAR, PU, NCSA, IU, ORNL, NICS, SDSC, LONI, TACC. Legend: Resource Provider (RP), Network Hub.]
Networking
• Origins of the TeraGrid network
  • Originally 4 sites with 3x10G each
  • Full mesh of 10Gbps links
• Evolution
  • Most sites now at 10G, not 30G
  • TG Backbone is now 10G
  • One router serves almost all Resource Providers
• Key question: Why continue to have a TeraGrid-specific network?
  • Variation in capacities to R&E Networks
  • Application-specific utility
    • GPFS-WAN and Lustre-WAN
  • Security
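The original topology figures can be checked with a little arithmetic. The sketch below assumes one reading of the bullets above: that "full mesh" means one 10 Gbps link between every pair of the four original sites, which gives each site three links (3x10G). The site names are placeholders, not the actual original sites.

```python
from itertools import combinations

def full_mesh_links(sites):
    """Return every unordered pair of sites: one link per pair in a full mesh."""
    return list(combinations(sites, 2))

def per_site_capacity_gbps(sites, link_gbps):
    """Aggregate capacity at each site, assuming one link to every other site."""
    return {site: (len(sites) - 1) * link_gbps for site in sites}

# Placeholder names for the four original sites (assumption for illustration).
sites = ["site_a", "site_b", "site_c", "site_d"]

links = full_mesh_links(sites)
capacity = per_site_capacity_gbps(sites, link_gbps=10)

print(len(links))          # number of 10G links in the mesh
print(capacity["site_a"])  # aggregate Gbps at one site
```

Under that assumption, four sites need six links and each site terminates 30 Gbps, which matches the "30G" figure in the Evolution bullets.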
Networking
• Networking challenges
  • Tracking application-specific use of the network
  • Finding a new architecture paradigm
Operations
• TeraGrid GIG does not operate resources
• Resource Providers operate resources individually, but in a coordinated fashion
  • Accounts, Allocations, Accounting, Software, Processes, User Support and Policy are all coordinated
  • Coordinated does not necessarily mean “the same”
  • Resources are operated under a range of complementary awards. These have included Resource Provider, HPCOPS and Track 2 awards.
• Some activities are either shared or have an impact across the project
  • Operations Center, Accounting/Allocations, Networking, Security
• Instrumenting core activities is key
  • Inca validation testing helps coordinate software
  • Common accounting provides the ability to report across resources
  • Usage instrumentation helps us understand how users interact with TeraGrid across all platforms
Operations
• TGCDB/AMIE, POPS and Core2
• Some Definitions
  • TGCDB is the database of account and accounting records
  • AMIE is the protocol for transferring records
  • POPS is the system for submitting and reviewing allocations
  • Core2 is the current name for a major review and redesign of the account/allocations system
• The accounting system is significant to us and the user community for several reasons
  • Common account/allocation mechanism across HPC resources
  • Relatively easy to add new resources
  • Facilitates user portability and access to resources across the project
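To illustrate why a common accounting record makes cross-resource reporting and adding new resources straightforward, here is a minimal sketch. The record fields (`resource`, `project`, `su`) and all values are hypothetical and are not taken from the TGCDB schema or the AMIE protocol.

```python
from collections import defaultdict

# Hypothetical usage records in one shared shape, regardless of which
# resource produced them. Field names and values are illustrative only.
records = [
    {"resource": "resource_a", "project": "proj1", "su": 120.0},
    {"resource": "resource_b", "project": "proj1", "su": 30.0},
    {"resource": "resource_b", "project": "proj2", "su": 45.5},
]

def usage_by_project(usage_records):
    """Sum charged units per project across all resources."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["project"]] += record["su"]
    return dict(totals)

def usage_by_resource(usage_records):
    """Sum charged units per resource across all projects."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["resource"]] += record["su"]
    return dict(totals)

print(usage_by_project(records))
print(usage_by_resource(records))
```

Because every resource reports in the same shape in this sketch, a new resource only has to emit records with those fields; the reporting functions do not change.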
Operations
• Operations Challenges
  • There are a lot of places to collect data
  • It is difficult to get a complete picture in any particular area
    • e.g., network traffic levels can be measured, but the real question is about the applications that are driving that traffic. Some applications can be measured. Others are more challenging.
  • New systems with unique architectures make it challenging to balance commonality with resource needs.
Security
• Security has two main thrusts:
  • Operational Security/Incident Response
  • Security Architecture
• Operational Security/Incident Response
  • Security events happen. The goal of TG IR is to control the spread of incidents among the sites.
  • Communication is key to success.
    • All sites participate in IR
    • Regular calls, combined with distinct tools, maintain a secure and rapid communication environment in the event of an incident
  • The group is very successful at IR
  • Operational Security includes TAGPMA, writing and reviewing policy, and working with WGs on implementation details.
Security
• Security Architecture
  • Emphasis is on design and keeping track of the big picture
  • Grid security
  • Gateway and Campus AAA
Security Challenges
• Policy crafting and adoption
  • NSF, DOE and campus cultures bring unique perspectives
  • We try for consensus, and people are passionate.
  • Example: Centralized password management
• Grid security and operations
  • Operational people are focused on traditional computer security and exposures
  • Architectural group creation was driven by the need for big-picture security
  • Example: capturing the process for distributing DNs for SSO
• Certificate-based authentication needs
  • Accounting and record keeping
  • IR logging
• Gateways, community accounts, and accountability
  • Example: Attribute passing and tracking