Chuck Boeheim, Assistant Director of SLAC Computing Services, describes the SLAC cluster: Solaris and Linux farms, AFS, NFS, and Objectivity servers, and the systems that support them. The slides cover staffing (the same staff supports most Unix desktops on site), physical installation and maintenance, networking problems such as speed matching and glitches, system administration, and user application issues including workload scheduling and system and network limits.
The SLAC Cluster
Chuck Boeheim
Assistant Director, SLAC Computing Services
Components
• Solaris Farm: 900 single-CPU units
• Linux Farm: 512 dual-CPU units
• AFS: 7 servers, 3 TB
• NFS: 21 servers, 16 TB
• Objectivity: 94 servers, 52 TB
• LSF: master, backup, license
• HPSS: master + 10 tape movers
• Interactive: 25 servers + E10000
• Build Farm: 12 servers
• Network: 9 Cisco 6509 switches
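The component counts above can be totalled to give a sense of scale. The following is a minimal sketch that only adds up the figures listed on the slide; the dictionary layout and variable names are illustrative, not part of the original talk.

```python
# Figures copied from the Components slide; storage sizes are in TB.
farms = {
    "Solaris Farm": {"units": 900, "cpus_per_unit": 1},
    "Linux Farm": {"units": 512, "cpus_per_unit": 2},
}
storage = {
    "AFS": {"servers": 7, "tb": 3},
    "NFS": {"servers": 21, "tb": 16},
    "Objectivity": {"servers": 94, "tb": 52},
}

farm_nodes = sum(f["units"] for f in farms.values())
farm_cpus = sum(f["units"] * f["cpus_per_unit"] for f in farms.values())
storage_servers = sum(s["servers"] for s in storage.values())
storage_tb = sum(s["tb"] for s in storage.values())

print(f"farm nodes:       {farm_nodes}")       # 1412
print(f"farm CPUs:        {farm_cpus}")        # 1924
print(f"storage servers:  {storage_servers}")  # 122
print(f"storage capacity: {storage_tb} TB")    # 71
```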
Staffing
• Same staff supports most Unix desktops on site
Physical
• Racking, power, cooling, seismic, network
• Remote power management
• Remote console management
• Installation
• Burn-in, DOAs
• Maintenance
• Replacement burn-in
• Divergence from original models
• Locating a machine
Networking
• Gb to servers
• 100Mb to farm nodes
• Speed matching (problems) at switches
• Network glitches and storms
• Network monitoring
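The speed-matching bullet follows from the two link speeds listed above. The sketch below works through the arithmetic under a deliberately simple model: a server sends at full line rate and the switch must absorb the difference. It ignores protocol flow control, and the function names and the 10 ms burst length are assumptions chosen for illustration.

```python
SERVER_LINK_MBPS = 1000  # "Gb to servers"
NODE_LINK_MBPS = 100     # "100Mb to farm nodes"


def nodes_to_saturate_server(server_mbps=SERVER_LINK_MBPS, node_mbps=NODE_LINK_MBPS):
    """Number of farm nodes at line rate needed to fill one server link."""
    return server_mbps // node_mbps


def burst_backlog_megabits(burst_ms, server_mbps=SERVER_LINK_MBPS, node_mbps=NODE_LINK_MBPS):
    """Megabits the switch must buffer (or drop) when a server sends at
    full rate toward a single node for burst_ms milliseconds."""
    return (server_mbps - node_mbps) * burst_ms / 1000


print(nodes_to_saturate_server())   # 10
print(burst_backlog_megabits(10))   # 9.0
```

Under these assumptions, a 10 ms burst from one server toward one node leaves about 9 megabits queued at the switch, which is why the mismatch shows up at the switches and not at the end hosts.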
System Admin
• Network install (256 machines in < 1 hr)
• Patch management
• Power Up/Down
• Nightly maintenance
• System Ranger (monitor)
• Report summarization
• “A Cluster is a large Error Amplifier”
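Report summarization matters because one fault can produce the same message from hundreds of nodes, which is one reading of the "error amplifier" remark. The sketch below shows one possible approach: group identical messages and report the affected hosts once. It is not the tooling used at SLAC; the host names, messages, and the summarize function are all hypothetical.

```python
from collections import defaultdict


def summarize(reports):
    """Group (host, message) pairs by message text.

    Returns a list of (message, sorted_hosts) tuples, most widespread
    message first; ties are broken alphabetically by message.
    """
    hosts_by_message = defaultdict(set)
    for host, message in reports:
        hosts_by_message[message].add(host)
    return sorted(
        ((msg, sorted(hosts)) for msg, hosts in hosts_by_message.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


# Hypothetical nightly reports from four farm nodes.
reports = [
    ("node001", "NFS server not responding"),
    ("node002", "NFS server not responding"),
    ("node003", "disk /tmp 95% full"),
    ("node004", "NFS server not responding"),
]

for message, hosts in summarize(reports):
    print(f"{len(hosts):4d} host(s): {message} (first: {hosts[0]})")
```

Sorting by host count puts cluster-wide problems at the top of the report, so a single-node issue does not get buried under repeated copies of the same message.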
User Application Issues
• Workload scheduling
• Startup effects
• Distribution vs Hot Spots
• System and Network Limits
• File descriptors
• Memory
• Cache contention
• NIS, DNS, AMD
• Job Scheduling
• Test Beds
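Of the limits listed above, the file descriptor limit is one a job can check for itself before it starts opening files. The sketch below uses Python's standard `resource` module (Unix only) to read the per-process limit. The `fd_headroom` helper and the reserve of 16 descriptors for stdin, stdout, sockets, and libraries are assumptions made for illustration.

```python
import resource


def fd_headroom(files_needed, soft_limit, reserved=16):
    """Descriptors left over after the job opens files_needed files.

    reserved is an assumed allowance for descriptors the process already
    uses. A negative result means the job would hit the limit.
    """
    return soft_limit - reserved - files_needed


def fits_current_process(files_needed):
    """Check files_needed against this process's soft descriptor limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True  # no per-process limit configured
    return fd_headroom(files_needed, soft) >= 0


print(fd_headroom(files_needed=200, soft_limit=256))  # 40
print(fd_headroom(files_needed=300, soft_limit=256))  # -60
print(fits_current_process(files_needed=10))
```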