Building a Cluster Support Service: Implementation of the SCS Program
UC Computing Services Conference
Gary Jung, SCS Project Manager
http://scs.lbl.gov/
August 8, 2005

Agenda
SCS Program
• Overview
• Implementation
• Areas for collaboration
UCCSC – August 8, 2005

Background
• The 1990s – Computing at the Desktop
• The “Gap” between desktops and NERSC
• 2001 - MRC Working Group
  • Large institutional system originally envisioned by working group
  • Users not ready to share large system
  • Recommendation to support Linux clusters
• December 2002 - SCS Program approved
  • $1.3M four-year program started January 2003
  • Ten strategic science projects selected
  • IT Division provides support for Linux clusters
• Goals
  • Enable our scientists to use and take advantage of computing
  • HPC that works. Avoid security incidents, lost time and expensive mistakes
  • More effective science

Strategy
• Planning
  • Formal project mgmt methods
  • Steering Committee
• Support Methodology
  • Use proven technical approaches that enable us to quickly provide production capability
  • Adopt standards to facilitate scaling support to several clusters
• Staffing
  • Need to develop a core of expertise to address changes in technology (e.g. 64-bit Linux, kernel hacking, cluster mgmt, Myrinet, MPI, schedulers)
• Costs
  • Drive down support costs

Support Methodology: Balance Choice vs. Standardization
• Users choose the components that are important to them (e.g. CPU, memory, interconnect)
• We standardize on the aspects that allow us to scale support and reduce costs
• Leading edge, but not bleeding edge. Nothing exotic (e.g. no Lustre yet)
• On the other hand, tightly coupled, parallel systems are a must to push the paradigm shift
• Remember that the goal is a production system
• The real trick is in the integration: making the correct choices so that the components all work together and perform well

Support Methodology: The Standard
• Hardware - ia32 or AMD64
• Interconnect – GigE, Myrinet, or InfiniBand
• Operating system - Red Hat Enterprise Linux or CentOS
• LBNL Warewulf Cluster Toolkit (http://warewulf-cluster.org)
• MPI implementation - LAM-MPI
• Scheduler - Sun Grid Engine, Torque
• Monitoring – Nagios, Ganglia (http://metacluster.lbl.gov)
• Cybersecurity – Host-based measures, PIX Firewall, OTP, specific user policies

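One way to picture the balance between user choice and standardization is a small check of a proposed cluster specification against the options listed on this slide. The Python sketch below is illustrative only: the dictionary keys, the `check_spec` function, and both example specifications (including the non-standard `ia64` value) are hypothetical and are not part of any SCS tooling.

```python
# Allowed options, taken from "The Standard" slide.
# The key names are hypothetical labels chosen for this sketch.
STANDARD = {
    "hardware": {"ia32", "AMD64"},
    "interconnect": {"GigE", "Myrinet", "InfiniBand"},
    "os": {"Red Hat Enterprise Linux", "CentOS"},
    "mpi": {"LAM-MPI"},
    "scheduler": {"Sun Grid Engine", "Torque"},
}


def check_spec(spec):
    """Return the (component, value) pairs that fall outside the standard."""
    deviations = []
    for component, allowed in STANDARD.items():
        value = spec.get(component)  # None if the spec omits the component
        if value not in allowed:
            deviations.append((component, value))
    return deviations


# Hypothetical user request: every choice is within the standard.
proposed = {
    "hardware": "AMD64",
    "interconnect": "Myrinet",
    "os": "CentOS",
    "mpi": "LAM-MPI",
    "scheduler": "Torque",
}

# Hypothetical request that strays from the standard on one component.
exotic = dict(proposed, hardware="ia64")

print(check_spec(proposed))
print(check_spec(exotic))
```

The user still picks the CPU architecture and interconnect, but only from a short list, which is what lets one small team support many clusters.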
Staffing
• Need team with specialized skills to meet technical expertise requirements
• Limited funding, tight timeline
• Team roles – Division of responsibilities
  • Project mgmt, facilities planning
  • Technology and procurement
  • Cluster architect, OS, kernel, MPI expert
  • Scheduler expert
  • Cluster installation and support
• 1.6 FTE total - 10 SCS clusters, 295 nodes

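The staffing figures on this slide imply some simple support ratios: about 29.5 nodes per cluster on average, and roughly 184 nodes per FTE. A minimal Python sketch of that arithmetic follows; the function and variable names are hypothetical, and only the three input figures come from the slide.

```python
# Figures taken from the Staffing slide.
FTE_TOTAL = 1.6
CLUSTERS = 10
NODES = 295


def support_ratios(fte, clusters, nodes):
    """Derive average cluster size and per-FTE support load."""
    return {
        "nodes_per_cluster": nodes / clusters,
        "clusters_per_fte": clusters / fte,
        "nodes_per_fte": nodes / fte,
    }


ratios = support_ratios(FTE_TOTAL, CLUSTERS, NODES)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

These are averages only; the slides do not say how the 1.6 FTE is split across the team roles or across individual clusters.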
Costs: Driving Down Costs
• Standardization of components critical
• In-house integration reduces hardware costs and facilitates standards
• Leverage relations with open source community
• Outsourcing of various pieces - wiring, seismic
• Long-term planning for electrical infrastructure saves on cost
• Develop lower-cost staff - college interns
• Competitive bid procurement
• Benchmarking costs - other national labs, private industry

Success Factors
• Adherence to standards
• Effective Steering Committee
• Initial funding key to get started
• Prominent scientists were our customers
  • Funding, visibility, ROI
• Talented, motivated staff
  • Creativity allowed, but the focus is on production use
• Transparent costing model

Collaboration: What do we have from this?
• Methodology for cluster support
• New consulting offerings
  • Cluster architecture
  • Procurement specification
  • Facilities planning
• Development of cluster support business
  • Effort/cost analysis
  • Recharge model
• LBNL Warewulf software
  • GPL, 20,000 downloads

Collaboration: Challenges
• Larger systems
  • Scalability issues - e.g. parallel filesystems, power and cooling issues
  • Moving up the technology curve - InfiniBand, PCI-E
  • Assessing integration risks – Will it work? How will it perform?
  • Harder problems to debug
• Getting scientists to share a system
• New services - User facilities, application support
• Computer room space
• Funding and funding models
• Driving down costs further
• Charting path forward