The SLAC Cluster

Chuck Boeheim, Assistant Director of SLAC Computing Services, describes the SLAC cluster: Solaris and Linux farms, AFS, NFS, and Objectivity servers, and more, run by a staff that also supports most Unix desktops on site. The talk covers growth in systems and staffing, physical plant and networking challenges such as speed matching, glitches, and monitoring, system administration at scale, and user application issues including workload scheduling, system and network limits, and job scheduling.

Presentation Transcript


  1. The SLAC Cluster
  Chuck Boeheim, Assistant Director, SLAC Computing Services

  2. Components
  • Solaris Farm: 900 single-CPU units
  • Linux Farm: 512 dual-CPU units
  • AFS: 7 servers, 3 TB
  • NFS: 21 servers, 16 TB
  • Objectivity: 94 servers, 52 TB
  • LSF: master, backup, and license servers
  • HPSS: master + 10 tape movers
  • Interactive: 25 servers, plus an E10000
  • Build Farm: 12 servers
  • Network: 9 Cisco 6509 switches

  3. Staffing
  • Same staff supports most Unix desktops on site

  4. Growth in Systems

  5. Growth in Staffing

  6. Ratio of Systems/Staff

  7. Physical
  • Racking, power, cooling, seismic, network
  • Remote power management
  • Remote console management
  • Installation
  • Burn-in, DOAs
  • Maintenance
  • Replacement burn-in
  • Divergence from original models
  • Locating a machine (see the lookup sketch after this list)
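With well over a thousand boxes on the floor, even finding a given machine needs tooling. A minimal sketch of one approach in Python, assuming a flat inventory file that maps hostnames to physical locations; the file name and column layout are hypothetical, not SLAC's actual asset database:

    import csv
    import sys

    INVENTORY = "inventory.csv"  # hypothetical layout: hostname,building,rack,slot

    def locate(hostname):
        """Return the inventory row for a hostname, or None if unknown."""
        with open(INVENTORY, newline="") as f:
            for row in csv.DictReader(f):
                if row["hostname"] == hostname:
                    return row
        return None

    if __name__ == "__main__":
        where = locate(sys.argv[1])
        if where is None:
            sys.exit(f"{sys.argv[1]}: not in inventory")
        print(f"{where['hostname']}: building {where['building']}, "
              f"rack {where['rack']}, slot {where['slot']}")

Keeping the inventory in a flat file also makes it trivial to grep and to regenerate as machines are installed or replaced.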

  8. Networking
  • Gb to servers
  • 100 Mb to farm nodes
  • Speed matching problems at switches
  • Network glitches and storms
  • Network monitoring (see the counter-watching sketch after this list)
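Glitches and storms tend to show up first as bursts of interface errors and drops. A minimal sketch of the counter-watching idea in Python: it samples /proc/net/dev twice and reports any interface whose counters moved, so it is Linux-specific, and the 5-second window is an arbitrary assumption rather than anything from the talk:

    import time

    def read_counters():
        """Return {iface: (rx_errs, rx_drop)} parsed from /proc/net/dev."""
        counters = {}
        with open("/proc/net/dev") as f:
            for line in f.readlines()[2:]:      # skip the two header lines
                iface, data = line.split(":", 1)
                fields = data.split()
                counters[iface.strip()] = (int(fields[2]), int(fields[3]))
        return counters

    before = read_counters()
    time.sleep(5)
    after = read_counters()

    for iface, (errs, drops) in after.items():
        old_errs, old_drops = before.get(iface, (errs, drops))
        if errs - old_errs or drops - old_drops:
            print(f"{iface}: +{errs - old_errs} errors, "
                  f"+{drops - old_drops} drops in 5s")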

  9. System Admin
  • Network install (256 machines in < 1 hr)
  • Patch management
  • Power up/down
  • Nightly maintenance
  • System Ranger (monitor)
  • Report summarization (see the sketch after this list)
  • “A Cluster is a large Error Amplifier”
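“A Cluster is a large Error Amplifier”: one bad NFS mount turns into hundreds of identical complaints, so nightly reports must aggregate rather than enumerate. A minimal sketch of report summarization in Python; the input format (one "host message" line per event on stdin) is an assumption, not the actual SLAC tooling:

    import sys
    from collections import Counter, defaultdict

    messages = Counter()        # message -> occurrence count
    hosts = defaultdict(set)    # message -> set of reporting hosts

    for line in sys.stdin:      # merged nightly reports, one line per event
        host, _, message = line.strip().partition(" ")
        messages[message] += 1
        hosts[message].add(host)

    for message, count in messages.most_common(20):
        print(f"{count:6d} events on {len(hosts[message]):4d} hosts: {message}")

Fed the concatenated reports (cat reports/* | summarize.py), it prints the top 20 distinct messages with event and host counts, which is usually enough to spot the one real fault behind the noise.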

  10. User Application Issues
  • Workload scheduling
    • Startup effects
    • Distribution vs. hot spots
  • System and network limits (a sketch for checking them follows this list)
    • File descriptors
    • Memory
    • Cache contention
    • NIS, DNS, AMD
  • Job scheduling
  • Test beds
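Jobs on the farm hit per-process ceilings (open file descriptors, address space) long before the hardware runs out. A minimal sketch for checking the limits a batch job will actually run under, using Python's standard resource module (Unix-only); which limits to report is an illustrative choice, not a list from the talk:

    import resource

    LIMITS = {
        "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
        "address space (RLIMIT_AS)": resource.RLIMIT_AS,
        "core file size (RLIMIT_CORE)": resource.RLIMIT_CORE,
    }

    def show(value):
        """Render RLIM_INFINITY as 'unlimited'."""
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    for name, which in LIMITS.items():
        soft, hard = resource.getrlimit(which)
        print(f"{name}: soft={show(soft)}  hard={show(hard)}")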
