Who the Heck Are You? • Joe Kaiser • Fermilab Employee • Computing Division/Core Services Support • Scientific Computing Support – CD/CSS-SCS • Lead sys admin for US-CMS • Sisyphus • Socrates – Gadfly
Installation, Configuration, Monitoring, and Maintenance of Multiple Grid Subclusters
Managing Heterogeneous-Use Grid Clusters with a Minimum of Fuss, Muss, and Cuss for US-CMS
Managing Grid Clusters So People Will Go Away and Let You Get Some Real Work Done
Total Pajama Management
Deceptively Simple Idea • Remotely manage: • Power up/power down • Interact with the BIOS • Install – and interact with the install • Configuration • Never touch another floppy or CD-ROM • Minimize human-machine interaction • Anytime – anywhere (We already do... sorta.)
Why TPM? • Commodity hardware creates more machines, but no more sysadmins to take care of them • The ratio of machines to sysadmins keeps increasing • Old model: sys admin = "it's up and functioning" – a low-level skill • New model: sys admin = manager of large Grid-enabled resources – interesting systems problems
TPM and The US-CMS Computing Facility Story • Where we were. • Where we are and how we got here. • Where we are going.
Data Grid Hierarchy (CMS) • Online System (~PBytes/sec) → Offline Farm at ~100 MBytes/sec (bunch crossing every 25 nsec, ~100 triggers per second, each event ~1 MByte) • Offline Farm → CERN Computer Center (Tier 0) at ~100 MBytes/sec • Tier 0 → Tier 1 Regional Centers (France, Germany, Italy, Fermilab) at ~622 Mbits/sec • Tier 1 → Tier 2 Centers (Caltech, U. of Florida, UC San Diego) at ~2.4 Gbits/sec • Tier 2 → Tier 3 Institutes (~0.25 TIPS, physics data cache) at ~622 Mbits/sec • Tier 3 → Tier 4 Workstations at 100 - 1000 Mbits/sec • Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server
Where we were • 1998-99ish? • One Sun E4500 and a couple of Hitecs • Solaris, FRHL 5.2? • 1999-2000ish • 1 Sun, 3 Dell 6450s + PowerVaults • Solaris, RH with Fermi overlay • 2000ish • 1 Sun, 3 Dells, 40 Atipa farm nodes • FRHL 6.1 NFS installs
Cont'd • 2000-2001ish • 1 Sun, 3 Dells, 40 farm nodes, 3 IBM eSeries w/disk • 2.6: RH6.1, AFS, and bigmem don't mix • Mixed-mode machines start • 2001-2002ish • See above, plus 3ware terabyte servers, RAIDzone, 16 farm nodes • RH6.2 because of Objectivity • Rise of the subcluster
Where we are • Hardware • 65 – dual Athlon 1900+ • 16 – dual 1 GHz PIII • 40 – 750 MHz PII • 4 – quad Dells • 7 – 3ware servers • 3 – quad IBMs • 2 Suns • Total is 137 machines • 2 admins
Cont'd • Software • 6 subclusters, three more requested • Current: DGT, IGT, PG, UAF, dCache, LCG-0 • Requested: LCG0-CMS, LCG-1, LCFGng • Combination of RH6.2 and RH7.3 • Install/configuration mechanisms • Rocks, Fermi network installs, voldemort, YUM • Fermi software – dfarm, FBSNG, dCache, NGOP • Grid software – VDT, soon to be 1.1.8
Current configuration of computing resources • Cisco 6509 core switch with WAN links to ESnet (OC12) and MREN (OC3) • Production cluster (>80 dual nodes) • US-CMS testbed • R&D nodes • User analysis (1 TB + 250 GB disk) • Enstore (17 tape drives) • dCache (>7 TB)
The Impetus • Frustrations: • FRHL didn't work on CMS resources • Dells, farm nodes, mixed-mode clusters, Mosix • Couldn't do FRHL installs at Tier-2 sites • !^$%@# floppies! • Desires: • One BIOS-to-login (B2L) install, configuration, and management path • Dynamic partitioning of worker nodes • Works with my small cluster and the future large one • No more floppies!
The Four Pillars of TPM • Power control • Serial console • Network installation and configuration • Monitoring
Power Control • APCs (network power controllers) • Power on, off, or reboot a single node, or all nodes on a controller • Scriptable (see the sketch below) • On private subnet
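What "scriptable" can mean in practice: a minimal sketch that drives an APC outlet over SNMP, assuming the net-snmp snmpset tool is available and the unit speaks APC's PowerNet MIB. The OID, state values, host name, and community string below are assumptions to verify against your own controllers, not a description of the actual US-CMS scripts.

```python
#!/usr/bin/env python
"""Sketch: power on/off/reboot a node's APC outlet over SNMP.

Assumes the net-snmp 'snmpset' CLI is installed and that the APC unit
speaks APC's PowerNet MIB; the OID and state values are assumptions to
check against the controller's documentation.
"""
import subprocess
import sys

APC_HOST = "apc-01.cluster.example"   # hypothetical controller on the private subnet
SNMP_COMMUNITY = "private"            # hypothetical SNMP write community

# Assumed sPDUOutletCtl OID from the PowerNet MIB; the outlet number is appended.
OUTLET_CTL_OID = ".1.3.6.1.4.1.318.1.1.4.4.2.1.3"
STATES = {"on": "1", "off": "2", "reboot": "3"}   # assumed control values

def set_outlet(outlet, state):
    """Set one outlet to on/off/reboot by writing its control OID via snmpset."""
    cmd = ["snmpset", "-v1", "-c", SNMP_COMMUNITY, APC_HOST,
           "%s.%d" % (OUTLET_CTL_OID, outlet), "i", STATES[state]]
    subprocess.check_call(cmd)

if __name__ == "__main__":
    # Usage: apc_power.py <outlet-number> on|off|reboot
    set_outlet(int(sys.argv[1]), sys.argv[2])
```

Wrapped in a loop over a node-to-outlet map, the same call power-cycles a whole subcluster without anyone walking into the machine room.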
Serial Console • BIOS serial redirection at 115200 • Lantronix SCS console servers – ports set to 115200 • Boot loader and kernel append line set to 115200 • Conserver software for central management
Network Install Mechanisms • FRHL network installs – used at work • Farm Team and CMS difficulties • OSCAR – used at a conference • Too little, too late, gotta know too much • SystemImager – used for mixed mode • Rocks • Tier-2s were already using it and looking to us for a distro • Needed to do more than what I had • Had to apply to future problems – not a stopgap
Rocks Is: • Reliable, fast enough, cheap, simple, scalable • Handles the remote-installation part of B2L administration • Portable – you can build your own distributions • Open source, active community • Complete cluster management, or one part of a suite • Allows middleware to be built on top
Rocks Philosophy • Make clusters easy – when in doubt reinstall. • Sys admins aren't cheap and don't scale • Enable non-experts to build clusters • Use installation as common mechanism to manage cluster • All nodes are 100% automatically installed • Zero “hand” configuration • Everything you need for a cluster
Cluster Software Management • Software packages: RPMs – standard Red Hat (desktop) packaged software, or your own add-ons • rocks-dist – manages the RPM repository; this is the distribution (Red Hat workgroup) • Software configuration: tuning RPMs – for clusters, for your site, other customization • XML kickstart – programmatic system building, scalable appliance creation
Basic Installation • Fully automated cluster deployment • Get and burn ISO CD (DVD for IA64) image from http://www.rocksclusters.org • Boot frontend with CD/DVD • Fill out 7 configuration screens (mostly Red Hat) • Reboot frontend machine • Integrate compute nodes with insert-ethers • Ready to go! • Complete out of the box solution with rational default settings • Identical environment for x86 or IA64
Fermi Extends It • Two words: Kerberos and networking • Create our own distribution without disturbing Rocks; pass it down to Tier 2 • XFS Fermi Rocks for 3ware servers • Add appliances, RPMs, workgroups
Dynamic partitioning and configuring the farm • Cisco 6509 switch connecting web servers, DB servers, and multiple front-end nodes, each heading its own pool of worker nodes • Network-attached storage and Enstore behind the switch • Subclusters: production, user interactive/batch, dCache
Dynamic Partitioning • Move hardware resources to a different subcluster • Crudely possible with Rocks and voldemort • voldemort – rsync-based utility by Steve Timm • Install as far as possible with Rocks; the critical differences between worker nodes can then be pushed/pulled, and the node rebooted behind another front end with a different disk layout (see the sketch below) • YUM for updating
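A minimal sketch of the push-and-reboot idea described above, not the actual voldemort tool: it assumes passwordless root ssh/rsync to the worker nodes and a hypothetical per-subcluster overlay directory of config differences kept on the front end.

```python
#!/usr/bin/env python
"""Sketch: repartition a worker node by pushing a subcluster's config
overlay onto it and rebooting it, in the spirit of the rsync-based
approach above. Hosts and paths are hypothetical; assumes passwordless
root ssh/rsync to the node.
"""
import subprocess
import sys

# Hypothetical: one overlay directory of config differences per subcluster.
OVERLAYS = {
    "prod":   "/export/overlays/prod/",
    "uaf":    "/export/overlays/uaf/",
    "dcache": "/export/overlays/dcache/",
}

def repartition(node, subcluster):
    """Push the subcluster's overlay onto the node, then reboot it."""
    overlay = OVERLAYS[subcluster]
    # Copy only the files that differ between subclusters (configs, init scripts).
    subprocess.check_call(["rsync", "-a", overlay, "root@%s:/" % node])
    # Reboot so the node comes back up as a member of the new subcluster.
    subprocess.check_call(["ssh", "root@%s" % node, "shutdown", "-r", "now"])

if __name__ == "__main__":
    # Usage: repartition.py <node> <subcluster>, e.g. repartition.py node037 uaf
    repartition(sys.argv[1], sys.argv[2])
```

Routine package updates on nodes that are already in the right subcluster stay with YUM; the overlay push is only for the handful of files that define which front end a node answers to.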
The Fourth Pillar – Monitoring • What is happening on my cluster? • NGOP • Ganglia (see the query sketch below) • Grid monitoring • Open protocols, open interfaces • Information must be presented in ways that let you answer that question quickly
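As an example of what "open protocols, open interfaces" buys you: a minimal sketch that reads the cluster state Ganglia's gmond serves as XML on its default TCP port (8649) and prints a per-host one-minute load. The host name is hypothetical, and the element/attribute names are assumptions to verify against the gmond version in use.

```python
#!/usr/bin/env python
"""Sketch: answer "what is happening on my cluster?" from the command
line by reading the XML that Ganglia's gmond serves on TCP port 8649.
Hostname is hypothetical; element/attribute names should be checked
against the gmond version in use.
"""
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "frontend.cluster.example"   # hypothetical gmond host
GMOND_PORT = 8649                         # gmond's default XML port

def fetch_cluster_xml(host=GMOND_HOST, port=GMOND_PORT):
    """Connect to gmond and read the full XML dump it sends on connect."""
    sock = socket.create_connection((host, port))
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)

def report(xml_bytes):
    """Print one line per host: name and 1-minute load average."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        load = None
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                load = metric.get("VAL")
        print("%-20s load_one=%s" % (host.get("NAME"), load))

if __name__ == "__main__":
    report(fetch_cluster_xml())
```

Because the data is plain XML over a socket, the same few lines can feed a web page, an alarm script, or a Grid information provider without touching the collectors on the nodes.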
Where are we going? • Farm Management Project • Produce multiuse clusters (grid, interactive, and batch) with a common infrastructure • Tools needed: • TPM tools – install, configure, monitor, guarantee • Partitioning meta-tools • Resource and process management (FBSNG is current) • Any tool to be adopted must be "better" than what I've got
Links • http://www.rocksclusters.org • http://ganglia.sourceforge.net • http://www-isd.fnal.gov/ngop • http://linux.duke.edu/projects/yum