Leveraging Standard Core Technologies to Programmatically Build Linux Cluster Appliances Mason Katz San Diego Supercomputer Center IEEE Cluster 2002
Outline • Problem definition • What is so hard about clusters? • Distinction between • Software Packages (bits) • System Configuration (functionality and state) • Programmatic software installation with: • XML, SQL, HTTP, Kickstart • Future Work San Diego Supercomputer Center
Build this cluster • Build a 128 node cluster • Known configuration • Consistent configuration • Repeatable configuration • Do this in an afternoon • Problems • How to install software? • How to configure software? • We manage clusters with (re)installation • So we care a lot about this problem • Other strategies still must solve this San Diego Supercomputer Center
The Myth of the Homogeneous COTS Cluster • Hardware is not homogeneous • Different chipset revisions • Chipset of the day (e.g. Linksys Ethernet cards) • Different disk sizes (e.g. changing sector sizes) • Vendors do not know this is happening! • Entropy happens • Hardware components fail • Cannot replace with the same components past a single Moore cycle • A Cluster is not just compute nodes (appliances) • Fileserver Nodes • Management Nodes • Login Nodes San Diego Supercomputer Center
What Heterogeneity Means • Hardware • Cannot blindly replicate machine software • AKA system imaging / disk cloning • Requires patching the system after cloning • Need to manage system software at a higher level • Software • Subsets of a cluster have unique software configuration • One “golden image” cannot build a cluster • Multiple images replicate common configuration • Need to manage system software at a higher level San Diego Supercomputer Center
Packages vs. Configuration (figure) • Software Packages: the collection of all possible software packages, AKA the Distribution (RPMs) • System Configuration: descriptive information to configure a node (the Kickstart file) • Together these produce Appliances: Compute Node, IO Server, Web Server San Diego Supercomputer Center
Software Packages (same figure, emphasizing the Distribution / RPMs side) San Diego Supercomputer Center
System Configuration (same figure, emphasizing the Kickstart file side) San Diego Supercomputer Center
What is a Kickstart File?
Setup & Packages (20%):
cdrom
zerombr yes
bootloader --location mbr --useLilo
skipx
auth --useshadow --enablemd5
clearpart --all
part /boot --size 128
part swap --size 128
part / --size 4096
part /export --size 1 --grow
lang en_US
langsupport --default en_US
keyboard us
mouse genericps/2
timezone --utc GMT
rootpw --iscrypted nrDq4Vb42jjQ.
text
install
reboot
%packages
@Base
@Emacs
@GNOME
Post Configuration (80%):
%post
cat > /etc/nsswitch.conf << 'EOF'
passwd: files
shadow: files
group: files
hosts: files dns
bootparams: files
ethers: files
EOF
cat > /etc/ntp.conf << 'EOF'
server ntp.ucsd.edu
server 127.127.1.1
fudge 127.127.1.1 stratum 10
authenticate no
driftfile /etc/ntp/drift
EOF
/bin/mkdir -p /etc/ntp
cat > /etc/ntp/step-tickers << 'EOF'
ntp.ucsd.edu
EOF
/usr/sbin/ntpdate ntp.ucsd.edu
/sbin/hwclock --systohc
San Diego Supercomputer Center
Issues • High level description of software installation • List of packages (RPMs) • System configuration (network, disk, accounts, …) • Post installation scripts • De facto standard for Linux • Single ASCII file • Simple, clean, and portable • Installer can handle simple hardware differences • Monolithic • No macro language (as of RedHat 7.3 this is changing) • Differences require forking (and code replication) • Cut-and-Paste is not a code re-use model San Diego Supercomputer Center
It looks something like this San Diego Supercomputer Center
Implementation • Nodes • Single purpose modules • Kickstart file snippets (XML tags map to kickstart commands) • Over 100 node files in Rocks • Graph • Defines interconnections for nodes • Think OOP or dependencies (class, #include) • A single default graph file in Rocks • Macros • SQL Database holds site and node specific state • Node files may contain <var name="state"/> tags San Diego Supercomputer Center
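To make the macro step concrete, here is a minimal Python sketch of how <var name="..."/> tags in a node file could be expanded from site- and node-specific values. This illustrates the idea only; it is not the actual kpp code, and the key names shown are hypothetical.

# Hypothetical sketch of <var name="..."/> macro expansion (not the real kpp implementation).
import re

# Site- and node-specific state; in Rocks this comes out of the SQL database.
# The key names below are invented for illustration.
site_state = {
    "Kickstart_Timezone": "GMT",
    "Kickstart_PublicHostname": "frontend-0.example.org",
}

def expand_vars(xml_text, state):
    """Replace each <var name="key"/> tag with its value from the state table."""
    def lookup(match):
        return state.get(match.group(1), "")   # unset keys expand to nothing
    return re.sub(r'<var\s+name="([^"]+)"\s*/>', lookup, xml_text)

snippet = 'timezone --utc <var name="Kickstart_Timezone"/>'
print(expand_vars(snippet, site_state))        # -> timezone --utc GMT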
Composition • Aggregate Functionality • Scripting • IsA perl-development • IsA python-development • IsA tcl-development San Diego Supercomputer Center
Functional Differences • Specify only the deltas • Desktop IsA • Standalone • Laptop IsA • Standalone • Pcmcia San Diego Supercomputer Center
Architecture Differences • Conditional inheritance • Annotate edges with target architectures • if i386 • Base IsA lilo • if ia64 • Base IsA elilo San Diego Supercomputer Center
Putting it all together - “Complete” Appliances (compute, NFS, frontend, desktop, …) - Some key shared configuration nodes (slave-node, node, base) San Diego Supercomputer Center
Sample Node File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>
<description>
Enable SSH
</description>
<package>&ssh;</package>
<package>&ssh;-clients</package>
<package>&ssh;-server</package>
<package>&ssh;-askpass</package>
<post>
cat > /etc/ssh/ssh_config << 'EOF'
<!-- default client setup -->
Host *
ForwardX11 yes
ForwardAgent yes
EOF
chmod o+rx /root
mkdir /root/.ssh
chmod o+rx /root/.ssh
</post>
</kickstart>
San Diego Supercomputer Center
Sample Graph File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@GRAPH_DTD@">
<graph>
<description>
Default Graph for NPACI Rocks.
</description>
<edge from="base" to="scripting"/>
<edge from="base" to="ssh"/>
<edge from="base" to="ssl"/>
<edge from="base" to="lilo" arch="i386"/>
<edge from="base" to="elilo" arch="ia64"/>
…
<edge from="node" to="base" weight="80"/>
<edge from="node" to="accounting"/>
<edge from="slave-node" to="node"/>
<edge from="slave-node" to="nis-client"/>
<edge from="slave-node" to="autofs-client"/>
<edge from="slave-node" to="dhcp-client"/>
<edge from="slave-node" to="snmp-server"/>
<edge from="slave-node" to="node-certs"/>
<edge from="compute" to="slave-node"/>
<edge from="compute" to="usher-server"/>
<edge from="master-node" to="node"/>
<edge from="master-node" to="x11"/>
<edge from="master-node" to="usher-client"/>
</graph>
San Diego Supercomputer Center
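As a rough illustration of how such a graph drives kickstart generation, below is a minimal Python sketch of a kpp-style traversal: starting from an appliance node, it follows only edges whose arch attribute is missing or matches the target architecture and collects the reachable node files. This is an assumed sketch of the mechanism, not the actual Rocks implementation.

# Hypothetical sketch of a kpp-style graph traversal (not the real Rocks code).
import xml.etree.ElementTree as ET

def collect_nodes(graph_xml, root, target_arch):
    """Return every node reachable from 'root' via edges valid for target_arch."""
    graph = ET.fromstring(graph_xml)
    edges = {}
    for edge in graph.findall("edge"):
        arch = edge.get("arch")
        if arch is None or arch == target_arch:          # conditional inheritance
            edges.setdefault(edge.get("from"), []).append(edge.get("to"))
    visited, stack = [], [root]
    while stack:
        name = stack.pop()
        if name not in visited:
            visited.append(name)                          # one kickstart snippet per node
            stack.extend(edges.get(name, []))
    return visited

graph_xml = """<graph>
  <edge from="compute" to="slave-node"/>
  <edge from="slave-node" to="node"/>
  <edge from="node" to="base"/>
  <edge from="base" to="lilo" arch="i386"/>
  <edge from="base" to="elilo" arch="ia64"/>
</graph>"""
print(collect_nodes(graph_xml, "compute", "i386"))
# ['compute', 'slave-node', 'node', 'base', 'lilo'] -- elilo is skipped on i386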
Nodes and Groups Nodes Table Memberships Table San Diego Supercomputer Center
Groups and Appliances Memberships Table Appliances Table San Diego Supercomputer Center
Simple key-value pairs • Used to configure DHCP and to customize appliance kickstart files San Diego Supercomputer Center
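To illustrate how these tables might be used together, the Python sketch below joins a node through its membership to its appliance type and reads the key-value globals that would feed dhcpd.conf generation and kickstart customization. The actual Rocks schema is not shown on these slides, so every table and column name here is hypothetical.

# Hypothetical sketch of the cluster database lookups (schema names are invented).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE appliances  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE memberships (id INTEGER PRIMARY KEY, name TEXT, appliance_id INTEGER);
CREATE TABLE nodes       (name TEXT, mac TEXT, ip TEXT, membership_id INTEGER);
CREATE TABLE app_globals (key TEXT, value TEXT);

INSERT INTO appliances  VALUES (1, 'compute');
INSERT INTO memberships VALUES (1, 'Compute', 1);
INSERT INTO nodes       VALUES ('compute-0-0', '00:11:22:33:44:55', '10.1.255.254', 1);
INSERT INTO app_globals VALUES ('Kickstart_PublicHostname', 'frontend-0.example.org');
""")

# Which appliance (and therefore which graph root) does this node kickstart as?
row = db.execute("""
    SELECT a.name FROM nodes n
    JOIN memberships m ON n.membership_id = m.id
    JOIN appliances  a ON m.appliance_id  = a.id
    WHERE n.name = ?""", ("compute-0-0",)).fetchone()
print("appliance:", row[0])

# Key-value pairs drive dhcpd.conf entries and <var> substitution in kickstart files.
globals_ = dict(db.execute("SELECT key, value FROM app_globals"))
print("kickstart server:", globals_["Kickstart_PublicHostname"])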
Space-Time and HTTP • (Diagram: message sequence between node appliances and frontends/servers) • DHCP hands the node an IP address plus a Kickstart URL • The node sends a kickstart request; the frontend generates the file with kpp, the SQL DB, and kgen • The node requests packages, the server serves them, and each package is installed • Post configuration runs, then the node reboots • HTTP: • Kickstart URL (Generator) can be anywhere • Package Server can be (a different) anywhere San Diego Supercomputer Center
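Because the transport is plain HTTP, a kickstart generator is just a URL. The Python sketch below shows the shape of such a service, with generate_kickstart standing in for the kpp/SQL/kgen pipeline; it is an assumed illustration rather than the actual Rocks CGI, and the port and file contents are invented.

# Hypothetical sketch of a kickstart-generator URL (not the actual Rocks implementation).
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_kickstart(client_ip):
    # Stand-in for the real pipeline: kpp traverses the graph, pulls state from
    # the SQL database, and kgen emits the final kickstart file for this node.
    return f"# kickstart generated for {client_ip}\ninstall\nreboot\n%packages\n@Base\n"

class KickstartHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = generate_kickstart(self.client_address[0]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Nodes point their kickstart URL at this server; the package server
    # can live on an entirely different host.
    HTTPServer(("", 8080), KickstartHandler).serve_forever()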
256 Node Scaling • Attempt a TOP 500 run on two fused 128-node PIII (1 GHz, 1 GB memory) clusters • 100 Mbit Ethernet, Gigabit to the frontend • Myrinet 2000, 128-port switch on each cluster • Questions • What LINPACK performance could we get? • Would Rocks scale to 256 nodes? • Could we set up/tear down and run benchmarks in the allotted 48 hours? • SDSC's TeraGrid Itanium2 system is about this size San Diego Supercomputer Center
Setup New Frontend • Fri night: built the new frontend; physical rewiring of Myrinet, added an Ethernet switch • Sat: initial LINPACK runs and debugging of hardware failures; 240-node Myrinet run • Sun: submitted the 256-node Ethernet run, re-partitioned the clusters, complete re-installation (40 min) • (Diagram: two 128-node clusters, 120 on Myrinet each, joined by 8 Myrinet cross connects) San Diego Supercomputer Center
Some Results • 240 dual PIII (1 GHz, 1 GB) nodes on Myrinet • 285 GFlops • 59.5% of peak • Over 22 hours of continuous computing San Diego Supercomputer Center
Installation, Reboot, Performance • < 15 minutes to reinstall a 32-node subcluster (rebuilt the Myrinet driver) • 2.3 min for a 128-node reboot • (Chart: timeline marking 32-node re-install start and finish, reboot, and HPL start) San Diego Supercomputer Center
Future Work • Other backend targets • Solaris Jumpstart • Windows Installation • Supporting on-the-fly system patching • Cfengine approach • But using the XML graph for programmability • Traversal order • Subtleties with order of evaluation for XML nodes • Ordering requirements != Code reuse requirements • Dynamic cluster re-configuration • Node re-targets appliance type according to system need • Autonomous clusters? San Diego Supercomputer Center
Summary • Installation/Customization is done in a straightforward programmatic way • Leverages existing standard technologies • Scaling is excellent • HTTP is used as a transport for reliability/performance • Configuration Server does not have to be in the cluster • Package Server does not have to be in the cluster • (Sounds grid-like) San Diego Supercomputer Center
www.rocksclusters.org San Diego Supercomputer Center