
Using Description-Based Methods for Provisioning Clusters



Presentation Transcript


  1. Using Description-Based Methods for Provisioning Clusters Philip M. Papadopoulos San Diego Supercomputer Center University of California, San Diego

  2. Computing Clusters • How I spent my weekend • Why should you care about software configuration • Description-based configuration • Taking the administrator out of cluster administration • XML-based assembly instructions • What’s next

  3. A Story from this last Weekend • Attempt a TOP 500 run on two fused 128-node PIII (1 GHz, 1 GB mem) clusters • 100 Mbit Ethernet, Gigabit to frontend • Myrinet 2000, a 128-port switch on each cluster • Questions • What LINPACK performance could we get? • Would Rocks scale to 256 nodes? • Could we set up/tear down and run benchmarks in the allotted 48 hours? • Would we get any sleep?

  4. Setup • Fri: Started 5:30pm. Built new frontend. Physical rewiring of Myrinet, added ethernet switch. • Fri: Midnight. Solved some ethernet issues. Completely reinstalled all nodes. • Sat: 12:30a. Went to sleep. • Sat: 6:30a. Woke up. Submitted first LINPACK runs (225 nodes) [Diagram: new frontend joining two 128-node clusters (120 on Myrinet each) via 8 Myrinet cross connects]

  5. Mid-Experiment • Sat 10:30a. Fixed a failed disk and a bad SDRAM. Started runs on 240 nodes. Went shopping at 11:30a • 4:30p. Examined output. Started ~6 hrs of runs • 11:30p. Examined more output. Submitted final large runs. Went to sleep • Result: 285 GFlops, 59.5% of peak, over 22 hours of continuous computing on 240 dual PIII (1 GHz, 1 GB) nodes over Myrinet

  6. Final Clean-up • Sun 9a. Submitted 256 node Ethernet HPL • Sun 11:00a. Added GigE to frontend. Started complete reinstall of all 256 nodes. • ~40 Minutes for Complete reinstall (return to original system state) • Did find some (fixable) issues at this scale • Too Many open database connections (fixing for next release) • DHCP client not resilient enough. Server was fine. • Sun 1:00p. Restored wiring, rebooted original frontends. Started reinstall tests. Some debug of installation issues uncovered at 256 nodes. • Sun 5:00p. Went home. Drank Beer.

  7. Installation, Reboot, Performance • < 15 minutes to reinstall a 32-node subcluster (rebuilt Myrinet driver) • 2.3 min for a 128-node reboot [Chart: 32-node re-install timeline marking start, finish, reboot, and start of HPL]

  8. Key Software Components • HPL + ATLAS. Univ. of Tennessee • The truckload of open source software that we all build on • Redhat linux distribution and installer • de facto standard • NPACI Rocks • Complete cluster-aware toolkit that extends a Redhat Linux distribution • (Inherent Belief that “simple and fast” can beat out “complex and fast”)

  9. Why Should You Care about Cluster Configuration/Administration? • “We can barely make clusters work.” – Gordon Bell @ CCGCS02 • “No Real Breakthroughs in System Administration in the last 20 years” – P. Beckman @ CCGCS02 • System Administrators love to continuously twiddle knobs • Similar to a mechanic continually adjusting air/fuel mixture on your car • Or worse: Randomly unplugging/plugging spark plug wires • Turnkey/Automated systems remove the system admin from the equation • Sysadmin doesn’t like this. • This is good

  10. Clusters aren’t homogeneous • Real clusters are more complex than a pile of identical computing nodes • Hardware divergence as the cluster changes over time • Logical heterogeneity of specialized servers • IO servers, job submission, specialized configs for apps, … • The garden-variety cluster owner shouldn’t have to fight the provisioning issues • Get them to the point where they are fighting the hard application parallelization issues • They should be able to easily follow software/hardware trends • Moderate-sized (up to ~128 nodes) clusters are the “standard” grid endpoints • A non-uber administrator handles two 128-node Rocks clusters at SIO (a 260:1 system-to-admin ratio)

  11. NPACI Rocks Toolkit – rocks.npaci.edu • Techniques and software for easy installation, management, monitoring and update of Linux clusters • A complete cluster-aware distribution and configuration system. • Installation • Bootable CD + floppy which contains all the packages and site configuration info to bring up an entire cluster • Management and update philosophies • Trivial to completely reinstall any (all) nodes. • Nodes are 100% automatically configured • RedHat Kickstart to define software/configuration of nodes • Software is packaged in a query-enabled format • Never try to figure out if node software is consistent • If you ever ask yourself this question, reinstall the node • Extensible, programmable infrastructure for all node types that make up a real cluster.

  12. Tools Integrated • Standard cluster tools • MPICH, PVM, PBS, Maui (SSH, SSL -> Red Hat) • Rocks add ons • Myrinet support • GM device build (RPM), RPC-based port-reservation (usher-patron) • Mpi-launch (understands port reservation system) • Rocks-dist – distribution work horse • XML (programmable) Kickstart • eKV (console redirect to ethernet during install) • Automated mySQL database setup • Ganglia Monitoring (U.C. Berkeley and NPACI) • Stupid pet administration scripts • Other tools • PVFS • ATLAS BLAS, High Performance Linpack

  13. Support for Myrinet • The Myrinet device driver must be versioned to the exact kernel version (e.g. SMP, options) running on a node • Source is compiled at reinstallation on every (Myrinet) node (adds ~2 minutes to installation) (a source RPM, by the way) • Device module is then installed (insmod) • GM_mapper is run (adds the node to the network) • Myrinet ports are limited and must be identified with a particular rank in a parallel program • RPC-based reservation system for Myrinet ports • Client requests port reservations from the desired nodes • Rank mapping file (gm.conf) created on-the-fly • No centralized service needed to track port allocation • MPI-launch hides all the details of this (see the sketch below) • HPL (LINPACK) comes pre-packaged for Myrinet • Build your Rocks cluster, see where it sits on the Top500
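A rough sketch of the rank-mapping idea follows (Python, purely illustrative: the reservation call, the node names, and the output file format are invented stand-ins, not the actual usher-patron RPC protocol or the real gm.conf syntax).

    # Illustrative sketch only: each node "reserves" its next free Myrinet
    # port, and the launcher writes one line per MPI rank mapping
    # rank -> (host, port). Not the real Rocks/GM implementation.
    from collections import defaultdict
    from itertools import count

    # Stand-in for the per-node port-reservation daemons.
    _ports = defaultdict(lambda: count(2))

    def reserve_port(node):
        # In the real system this would be an RPC to the node; here we
        # just hand out the node's next unused port number.
        return next(_ports[node])

    def write_rank_map(ranks, path="rankmap.conf"):
        # One line per rank: rank number, hostname, reserved port.
        with open(path, "w") as f:
            for rank, node in enumerate(ranks):
                f.write(f"{rank} {node} {reserve_port(node)}\n")

    if __name__ == "__main__":
        hosts = ["compute-0-0", "compute-0-1", "compute-0-2", "compute-0-3"]
        write_rank_map(hosts + hosts)   # two ranks per node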

  14. Key Ideas • No difference between OS and application software • OS installation is completely disposable • Unique state that is kept only at a node is bad • Creating unique state at the node is even worse • Software bits (packages) are separated from configuration • Diametrically opposite from “golden image” methods • Description-based configuration rather than image-based • Installed OS is “compiled” from a graph. • Inheritance of software configurations • Distribution • Configuration • Single step installation of updated software OS • Security patches pre-applied to the distribution not post-applied on the node

  15. Rocks extends installation as a basic way to manage software on a cluster. It becomes trivial to ensure software consistency across a cluster.

  16. Rocks Disentangles Software Bits (Distributions) and Configuration [Diagram: RPMs (the collection of all possible software packages, AKA the distribution) combined with descriptive information that configures a node produce a kickstart file for each appliance: compute node, IO server, web server]

  17. Managing Software Distributions [Same diagram, highlighting the distribution side: the RPMs forming the collection of all possible software packages that feed the per-node kickstart files (compute node, IO server, web server)]

  18. Rocks-dist • Repeatable process for creation of localized distributions • # rocks-dist mirror: mirror the Rocks release (Rocks 2.2 release, Rocks 2.2 updates) • # rocks-dist dist: create the distribution (Rocks 2.2 release, Rocks 2.2 updates, local software, contributed software) • This is the same procedure NPACI Rocks uses • Organizations can customize Rocks for their site • Iterate, extend as needed (a toy illustration of the layering follows)
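To make the layering concrete, here is a toy sketch (my own simplification, not rocks-dist's actual logic, and the package names and versions are invented): later layers such as updates, local, and contributed software shadow same-named packages from the base release.

    # Toy model of distribution layering: each layer maps package name to
    # version, and later layers override earlier ones. A simplification of
    # the idea behind rocks-dist, not its implementation.

    def merge_layers(*layers):
        dist = {}
        for layer in layers:      # order matters: the last writer wins
            dist.update(layer)
        return dist

    base    = {"openssh": "2.9p2-7", "kernel": "2.4.9-31"}
    updates = {"openssh": "2.9p2-12"}           # security errata
    local   = {"kernel": "2.4.9-31.custom"}     # site-built kernel
    contrib = {"myapp": "1.0-1"}                # contributed software

    print(merge_layers(base, updates, local, contrib))
    # -> {'openssh': '2.9p2-12', 'kernel': '2.4.9-31.custom', 'myapp': '1.0-1'}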

  19. Description-based Configuration [Same diagram, now highlighting the configuration side: the descriptive information that, together with the RPMs, produces the kickstart file for each node type (compute node, IO server, web server)]

  20. Description-based Configuration • Infrastructure that “describes“ the roles of cluster nodes • Nodes are assembled using Red Hat's kickstart • Components are defined in XML files that completely describe and configure software subsystems • Software bits are in standard RPMs • Kickstart files are built on-the-fly, tailored for each node through database interaction

  21. What is a Kickstart file? [Two regions: Setup/Packages (20%), Package Configuration (80%)]

    cdrom
    zerombr yes
    bootloader --location mbr --useLilo
    skipx
    auth --useshadow --enablemd5
    clearpart --all
    part /boot --size 128
    part swap --size 128
    part / --size 4096
    part /export --size 1 --grow
    lang en_US
    langsupport --default en_US
    keyboard us
    mouse genericps/2
    timezone --utc GMT
    rootpw --iscrypted nrDG4Vb8OjjQ.
    text
    install
    reboot

    %packages
    @Base
    @Emacs
    @GNOME

    %post
    cat > /etc/nsswitch.conf << 'EOF'
    passwd:     files
    shadow:     files
    group:      files
    hosts:      files dns
    bootparams: files
    ethers:     files
    EOF

    cat > /etc/ntp.conf << 'EOF'
    server ntp.ucsd.edu
    server 127.127.1.1
    fudge 127.127.1.1 stratum 10
    authenticate no
    driftfile /etc/ntp/drift
    EOF

    /bin/mkdir -p /etc/ntp
    cat > /etc/ntp/step-tickers << 'EOF'
    ntp.ucsd.edu
    EOF

    /usr/sbin/ntpdate ntp.ucsd.edu
    /sbin/hwclock --systohc

Portable (ASCII), Not Programmable, O(30KB)

  22. What are the Issues? • The kickstart file is ASCII • There is some structure • Pre-configuration • Package list • Post-configuration • Not a “programmable” format • The most complicated section is post-configuration • Usually this is handcrafted • We really want to be able to build sections of the kickstart file from pieces • Straightforward extension to new software, different OS

  23. Focus on the notion of “appliances” How do you define the configuration of nodes with special attributes/capabilities?

  24. Assembly Graph of a Complete Cluster [Graph figure] • “Complete” appliances (compute, NFS, frontend, desktop, …) • Some key shared configuration nodes (slave-node, node, base)

  25. Describing Appliances • Purple appliances all include “slave-node” • Or derived from slave-node • Small differences are readily apparent • Portal, NFS has “extra-nic”. Compute does not • Compute runs “pbs-mom”, NFS, Portal do not • Can compose some appliances • Compute-pvfs IsA compute and IsA pvfs-io
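The appliance/graph idea can be sketched in a few lines of Python (the edge set and node names below are invented for illustration; the real Rocks graph is expressed in XML and processed by its own tools): an appliance's configuration is simply everything reachable from its node in the graph.

    # Illustrative only: a hand-made configuration graph. Each appliance
    # inherits every module reachable from it, so "compute-pvfs" is both a
    # compute node and a PVFS IO node.
    GRAPH = {
        "compute":      ["slave-node", "pbs-mom"],
        "nfs":          ["slave-node", "extra-nic"],
        "portal":       ["slave-node", "extra-nic"],
        "compute-pvfs": ["compute", "pvfs-io"],   # IsA compute and IsA pvfs-io
        "slave-node":   ["node"],
        "node":         ["base"],
    }

    def modules_for(appliance):
        """Depth-first walk: every reachable node is a configuration module
        that contributes packages and post-install scripts."""
        seen, stack = [], [appliance]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.append(n)
                stack.extend(GRAPH.get(n, []))
        return seen

    print(modules_for("compute-pvfs"))
    # ['compute-pvfs', 'pvfs-io', 'compute', 'pbs-mom', 'slave-node', 'node', 'base']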

  26. Architecture Dependencies • Focus only on the differences between architectures • Logically, an IA-64 compute node is identical to an IA-32 one • Architecture type is passed from the top of the graph • Software bits (x86 vs. IA-64) are managed in the distribution

  27. XML Used to Describe Modules • Abstract package names, versions, architecture: ssh-client, not ssh-client-2.1.5.i386.rpm • Allow an administrator to encapsulate a logical subsystem • Node-specific configuration is retrieved from our database: IP address, firewall policies, remote access policies, …

    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
    <kickstart>
      <description>
      Enable SSH
      </description>
      <package>&ssh;</package>
      <package>&ssh;-clients</package>
      <package>&ssh;-server</package>
      <package>&ssh;-askpass</package>
      <!-- include XFree86 packages for xauth -->
      <package>XFree86</package>
      <package>XFree86-libs</package>
      <post>
    cat &gt; /etc/ssh/ssh_config &lt;&lt; 'EOF'  <!-- default client setup -->
    Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        FallBackToRsh           no
        Protocol                1,2
    EOF
      </post>
    </kickstart>
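Conceptually, the kickstart generator expands modules like the one above into the %packages and %post sections of a node's kickstart file. The snippet below is a bare-bones sketch of that expansion only (standard-library XML parsing over a stripped-down module); the real Rocks tools also handle the entities, the node database, and graph merging.

    # Bare-bones sketch: turn a simplified XML module into kickstart
    # fragments. Not the actual kpp/kgen implementation.
    import xml.etree.ElementTree as ET

    MODULE = """<kickstart>
      <package>openssh</package>
      <package>openssh-clients</package>
      <post>
    echo 'Host *'           &gt;  /etc/ssh/ssh_config
    echo '  ForwardX11 yes' &gt;&gt; /etc/ssh/ssh_config
      </post>
    </kickstart>"""

    def to_kickstart(xml_text):
        # Collect package names and post-install script text from the module.
        root = ET.fromstring(xml_text)
        packages = [p.text.strip() for p in root.findall("package")]
        post = "".join(p.text for p in root.findall("post"))
        return "%packages\n" + "\n".join(packages) + "\n\n%post" + post

    print(to_kickstart(MODULE))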

  28. Space-Time and HTTP [Sequence diagram: node appliances on one side, frontends/servers on the other] • DHCP request → IP address + kickstart URL • Kickstart request → the server generates the file (kpp, database lookup, kgen) • Package requests → the server serves packages, the node installs them • Post config, then reboot • HTTP: • The kickstart URL (generator) can be anywhere • The package server can be (a different) anywhere
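Seen from the node, the whole transaction is just two kinds of HTTP fetches. The sketch below illustrates that shape only; the host names, URL paths, and query parameters are hypothetical placeholders, not the actual Rocks interface.

    # Node-side view of the install transaction, reduced to plain HTTP.
    # All URLs and parameters here are invented placeholders.
    from urllib.request import urlopen

    KICKSTART_URL = "http://frontend.example.org/kickstart.cgi?arch=i386"
    PACKAGE_BASE  = "http://packages.example.org/RedHat/RPMS/"

    def fetch_kickstart(url=KICKSTART_URL):
        # The generator tailors the returned file per requesting node
        # (database lookup), and it can live on any reachable web server.
        return urlopen(url).read().decode()

    def fetch_package(name, base=PACKAGE_BASE):
        # The package server can be a different host entirely; the node
        # only needs HTTP access to it.
        return urlopen(base + name).read()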

  29. Subsystem Replacement is Easy • Binaries are in de facto standard package format (RPM) • XML module files (components) are very simple • Graph interconnection (global assembly instructions) is separate from configuration • Examples • Replace PBS with Sun Grid Engine • Upgrade version of OpenSSH or GCC • Turn on RSH (not recommended) • Purchase commercial compiler (recommended)

  30. Reset • 100s of clusters have been built with Rocks on a wide variety of physical hardware • Installation/Customization is done in a straightforward programmatic way • Scaling is excellent • HTTP is used as a transport for reliability/performance • Configuration Server does not have to be in the cluster • Package Server does not have to be in the cluster • (Sounds grid-like)

  31. What’s still missing? • Improved monitoring • Monitoring grids of clusters • “Personal” cluster monitor • Straightforward integration with Grid (Software) • Will use NMI (NSF Middleware Initiative) software as a basis (a Grid endpoint should be no harder to set up than a cluster) • Any sort of real IO story • PVFS is a toy (no reliability) • NFS doesn’t scale properly (no stability) • MPI-IO is only good for people willing to retread large sections of code (most users want read/write/open/close) • Real parallel job control • MPD looks promising • …

  32. Meta Cluster Monitor, Built on Ganglia [Screenshots: meta-cluster node view and meta-cluster monitor view]

  33. Integration with Grid Software • As part of NMI, SDSC is working on a package called gridconfig • http://rocks.npaci.edu/nmi/gridconfig/overview.html • Provides a higher-level tool for creating and consistently managing configuration files for software components • Allows config data (e.g. path names, policy files, etc.) to be shared among unrelated components • Does NOT require changes to config file formats • Globus GT2 has over a dozen configuration files • Over 100 configurable parameters • Many files share similar information • Will be part of NMI-R2

  34. Some Gridconfig Details [Screenshots: top-level components view and Globus details (example)]

  35. Accepting Configuration • Parameters from the previous screens are stored in a simple schema (MySQL database backend) • From these parameters, native-format configuration files are written (a toy sketch follows)
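A toy sketch of that last step (illustrative only: the parameter names, file names, and formats below are invented, not gridconfig's schema or output): shared parameters are stored once and rendered into each component's native configuration format.

    # Toy illustration: one shared parameter store (a dict standing in for
    # the MySQL schema) rendered into several native-format config files.
    PARAMS = {
        "globus_location": "/usr/local/globus",
        "gatekeeper_port": 2119,
        "admin_email":     "root@cluster.example.org",
    }

    TEMPLATES = {
        # Two unrelated components sharing the same parameters, each
        # written in its own native file format.
        "gatekeeper.conf": "port {gatekeeper_port}\nlogdir {globus_location}/var\n",
        "admin.rc":        "set from={admin_email}\n",
    }

    for filename, template in TEMPLATES.items():
        print(f"--- {filename} ---")
        print(template.format(**PARAMS))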

  36. Rocks Near Term • Version 2.3 in alpha now, beta by the end of the first week of October • Red Hat 7.3 • Integration with Sun Grid Engine (SCS, Singapore) • Resource target for SCE (P. Uthayopas, KU, Thailand) • Bug fixes, additional software components • Some low-level engineering changes in the interaction with the Red Hat installer • Working toward NMI Ready (point release in Dec./early Jan.)

  37. Parting Thoughts • Why is grid software so hard to successfully deploy? • If large cluster configurations can be really thought of as disposable, shouldn’t virtual organizations be just as easily constructed? • http://rocks.npaci.edu • http://Ganglia.sourceforge.net • http://rocks.npaci.edu/nmi/gridconfig
