
Building a Cluster Support Service Implementation of the SCS Program

Building a Cluster Support Service Implementation of the SCS Program. UC Computing Services Conference. Gary Jung SCS Project Manager http://scs.lbl.gov/. August 8, 2005. Agenda. SCS Program Overview Implementation Areas for collaboration. Background.


Presentation Transcript


  1. Building a Cluster Support Service
     Implementation of the SCS Program
     UC Computing Services Conference
     Gary Jung, SCS Project Manager
     http://scs.lbl.gov/
     August 8, 2005

  2. Agenda: SCS Program
     • Overview
     • Implementation
     • Areas for collaboration
     UCCSC – August 8, 2005

  3. Background
     • The 1990s – Computing at the Desktop
     • The “Gap” between desktops and NERSC
     • 2001 - MRC Working Group
     • Large institutional system originally envisioned by working group
     • Users not ready to share large system
     • Recommendation to support Linux clusters
     • December 2002 - SCS Program approved
     • $1.3M four-year program started January 2003
     • Ten strategic science projects are selected
     • IT Division provides support for Linux clusters
     • Goals
     • Enable our scientists to use and take advantage of computing
     • HPC that works. Avoid security incidents, lost time and expensive mistakes
     • More effective science

  4. Strategy
     • Planning
     • Formal project mgmt methods
     • Steering Committee
     • Support Methodology
     • Use proven technical approaches that enable us to quickly provide production capability
     • Adopt standards to facilitate scaling support to several clusters
     • Staffing
     • Need to develop a core of expertise to address changes in technology (e.g. 64-bit Linux, kernel hacking, cluster mgmt, Myrinet, MPI, schedulers)
     • Costs
     • Drive down support costs

  5. Support Methodology: Balance Choice vs. Standardization
     • User has choice over components that are important to them (e.g. CPU, memory, interconnect)
     • We standardize on the aspects that allow us to scale support and reduce costs
     • Leading, but not bleeding. No exotic stuff (e.g. no Lustre yet)
     • On the other hand, tightly coupled, parallel systems are a must to push paradigm shift
     • Remember that the goal is a production system
     • The real trick is in the integration: making the correct choices so that they all work together and perform well

  6. Support Methodology: The Standard
     • Hardware - ia32 or AMD64
     • Interconnect – GigE, Myrinet, or Infiniband
     • Operating system - Red Hat Enterprise Linux or CentOS
     • LBNL Warewulf Cluster Toolkit (http://warewulf-cluster.org)
     • MPI implementation - LAM-MPI
     • Scheduler - Sun Grid Engine, Torque
     • Monitoring – Nagios, Ganglia (http://metacluster.lbl.gov)
     • Cybersecurity – Host-based measures, PIX Firewall, OTP, specific user policies
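
The standard on slide 6 can be read as an allow-list per component. What follows is a minimal sketch of how a proposed cluster might be checked against it, not something taken from the presentation: the dictionary layout, the `check_cluster` helper, and the sample `proposed` cluster (including the "Quadrics" interconnect) are all illustrative assumptions.

```python
# Allowed values per component, copied from "The Standard" slide.
# The dictionary layout and the function name are illustrative assumptions.
STANDARD = {
    "hardware": {"ia32", "AMD64"},
    "interconnect": {"GigE", "Myrinet", "Infiniband"},
    "os": {"Red Hat Enterprise Linux", "CentOS"},
    "mpi": {"LAM-MPI"},
    "scheduler": {"Sun Grid Engine", "Torque"},
}


def check_cluster(spec):
    """Return a sorted list of components that fall outside the standard."""
    problems = []
    for component, allowed in STANDARD.items():
        value = spec.get(component)  # None if the component is missing
        if value not in allowed:
            problems.append(f"{component}: {value!r} is not in the standard")
    return sorted(problems)


# Hypothetical proposal: everything is standard except the interconnect.
proposed = {
    "hardware": "AMD64",
    "interconnect": "Quadrics",
    "os": "CentOS",
    "mpi": "LAM-MPI",
    "scheduler": "Torque",
}

for problem in check_cluster(proposed):
    print(problem)
```

A check like this keeps the "choice vs. standardization" balance explicit: users pick freely among the allowed values, and anything outside them is flagged before procurement.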

  7. Staffing
     • Need team with specialized skills to meet technical expertise requirements
     • Limited funding, tight timeline
     • Team roles – Division of responsibilities
     • Project mgmt, facilities planning
     • Technology and procurement
     • Cluster architect, OS, kernel, MPI expert
     • Scheduler expert
     • Cluster installation and support
     • 1.6 FTE total - 10 SCS clusters, 295 nodes
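
The staffing figure on slide 7 implies a support ratio that can be worked out directly. A quick sketch of that arithmetic follows; the three input figures come from the slide, while the derived ratios are computed here and are not stated in the presentation.

```python
# Figures quoted on the Staffing slide.
FTE_TOTAL = 1.6
CLUSTERS = 10
NODES = 295

# Derived ratios (not stated in the slides; simple division only).
nodes_per_fte = NODES / FTE_TOTAL
clusters_per_fte = CLUSTERS / FTE_TOTAL
avg_nodes_per_cluster = NODES / CLUSTERS

print(f"{nodes_per_fte:.1f} nodes per FTE")
print(f"{clusters_per_fte:.2f} clusters per FTE")
print(f"{avg_nodes_per_cluster:.1f} nodes per cluster on average")
```

That works out to roughly 184 nodes and about six clusters per full-time equivalent, which gives a sense of how much the standardization described earlier is being relied on.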

  8. Costs: Driving Down Costs
     • Standardization of components critical
     • In-house integration reduces hardware costs and facilitates standards
     • Leverage relations with open source community
     • Outsourcing of various pieces - wiring, seismic
     • Long term planning for electrical infrastructure saves on cost
     • Develop lower cost staff - college interns
     • Competitive bid procurement
     • Benchmarking costs - other National labs, private industry

  9. Success Factors
     • Adherence to standards
     • Effective Steering Committee
     • Initial funding key to get started
     • Prominent scientists were our customers
     • Funding, visibility, ROI
     • Talented, motivated staff
     • Creativity allowed, but the focus is on production use
     • Transparent costing model

  10. Collaboration: What do we have from this?
     • Methodology for cluster support
     • New Consulting Offerings
     • Cluster architecture
     • Procurement specification
     • Facilities planning
     • Development of cluster support business
     • Effort/cost analysis
     • Recharge model
     • LBNL Warewulf software
     • GPL, 20,000 downloads

  11. Collaboration Challenges
     • Larger systems
     • Scalability issues - e.g. parallel filesystems, power and cooling issues
     • Moving up the technology curve - Infiniband, PCI-E
     • Assessing integration risks – Will it work? How will it perform?
     • Harder problems to debug
     • Getting scientists to share a system
     • New services - User facilities, application support
     • Computer room space
     • Funding and funding models
     • Driving down costs further
     • Charting path forward
