Causeway—The Drew Cluster Project
Mike Richichi (mrichich@drew.edu)
Paul Coen (pcoen@drew.edu)
Drew University
TTP 2002
Drew at a glance
• ~2200 students
• All students receive a laptop as part of tuition
• Technology relatively well integrated in the curriculum
• eDirectory is the standardized authentication mechanism
• NetWare, NetMail, ZENworks, BorderManager . . .
The Problem
• All faculty and staff were dependent on a single file server (NW 5.1) for network storage and client applications
• Little downtime, but when it happened, people were, strangely, irritated
• At least 20 minutes of downtime for a failure (mounting volumes); more if things were really messed up
• Any downtime is a hit to our credibility, no matter how good we are at our jobs
The Solution
• December 2001: Convince managers to spend $$ on a Compaq SAN
• Try to get delivery before the end of the calendar year, to install in January
• Have the equipment delayed until early January—window of opportunity closes for the install
• Have all spring to play around with configuration and testing
• Actually a much better solution overall
The Configuration
• 3 NW6 SP1 file servers:
  • Compaq DL360 G2, 2 x 1.4 GHz PIII, 2.25 GB RAM
  • Compaq ML370 G2, 2 x 1.266 GHz PIII, 2.125 GB RAM
  • Compaq ML370, 1 GHz PIII, 1.125 GB RAM
• All with Gigabit Ethernet (2 onboard on the DL360 G2) and 2 Fibre Channel cards
SAN Hardware
• Compaq MA8000 controller (dual controllers, each dual-ported, dual power supplies)
• 6 disk shelves (dual power supplies)
• 26 x 36 GB disks:
  • 5 arrays (3 x 6 disks, 2 x 3 disks)
  • 1 disk for a spool volume
  • 1 hot spare
• 2 StorageWorks SAN Switch 8-ELs
  • In 2 separate buildings
  • Each controller attached to each switch
  • Multimode fiber connections
• Compaq Modular Data Router (MDR)
  • Supports the SCSI-attached MSL5026SL dual-SDLT tape library
Configuration
• All servers connected to both fibre switches
• SAN array and controllers have dual power supplies—one side connected to a local UPS, the other to the building UPS
• 2 servers in one building, one in the other
• Redundant network core
• Basically, everything keeps going if we lose a building (except for what’s connected directly to that building, or unless the disk array itself dies)
• MDR and tape library in the second building, away from the SAN array
SAN implementation
• SAN brought online in February
• First server added at that point
• Second server and clustering added over Spring Break (the second server in the cluster was an existing server)
• Third server added in early June
• Backups moved to the new tape library with Backup Exec 9 in May, backing up the old servers plus the new cluster node
Migration Issues
• Look at the network in terms of services:
  • Course files
  • Home directories
  • Departmental/group directories
  • Academic and general applications
  • Network printing
• Provide cluster services (volumes) for each (a load-script sketch follows)
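To make “cluster services (volumes)” concrete: in Novell Cluster Services on NetWare 6, each such service is a resource with a virtual IP address plus load/unload scripts. A rough sketch of a load script follows; the pool, volume, virtual server, and address names are invented for illustration and are not Drew’s actual names:

    # activate the NSS pool on whichever node takes the resource
    nss /poolactivate=CAUSEWAY_HOME_POOL
    # mount the cluster-enabled volume
    mount CAUSEWAY_HOME VOLID=254
    # advertise the virtual NCP server name and bring up its secondary IP
    NUDP ADD CAUSEWAY_HOME_SERVER 10.10.1.31
    add secondary ipaddress 10.10.1.31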
Performing migration
• Create new volumes
• Update login scripts (see the sketch below)
• Provide drive letters for each cluster volume
  • Abstraction
  • Ease of use
• Added mappings to the old login scripts to ease migration
• Educate users to use drive letters or the new Causeway volume names
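A minimal sketch of what the updated login script mappings might look like. %HOME_DIRECTORY is the standard login script variable that reads the user object’s home directory attribute (see the gotcha on the next slide); the CAUSEWAY_* volume names and drive letters are illustrative, not Drew’s actual ones:

    REM Causeway drive-letter mappings (volume names illustrative)
    MAP ROOT H:=%HOME_DIRECTORY
    MAP ROOT G:=CAUSEWAY_GROUPS:
    MAP ROOT K:=CAUSEWAY_COURSES:
    MAP INS S16:=CAUSEWAY_APPS:\APPS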
The big night
• Tell people to stop messing with stuff
• Use JRB Netcopy to get all the trustees, volume and directory quotas, etc. right
• Wait
• Decide to go home at 3am so you can get at least a few hours of sleep before you have to pack for vacation the next day
• Users log in the following morning and have everything mapped to the new services, with no loss of service
Gotchas
• New login scripts use the home directory attribute of the user object; on some old accounts it was never set (a query sketch follows)
• Migration to NDPS
  • Had legacy queues serviced by NDPS, but had to move the queue volumes to the new virtual server volume
  • This is not really supported, but it seems to work
• Some files didn’t copy
• Residual volume weirdness
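One way to hunt down accounts with an unset home directory attribute is an LDAP search against eDirectory. A rough sketch, assuming eDirectory’s default LDAP mapping of Home Directory to ndsHomeDirectory and a made-up bind DN and search base:

    # list users (cn only) that have no home directory attribute set
    ldapsearch -x -h ldap.drew.edu -D "cn=admin,o=drew" -W -b "o=drew" \
      "(&(objectClass=inetOrgPerson)(!(ndsHomeDirectory=*)))" cn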
More gotchas
• Nomenclature
  • \\servername\volumename? \\tree\.volume.organization? Directory maps? Win2K/XP or 9x? Argh!
  • Drive letters, while so ’80s, were our only practical solution to the consistent-nomenclature problem across 9x and XP
  • Other clients? Have fun!
• CIFS and failover
Current status
• All users now using the cluster
• Old servers still up, but data volumes renamed to OLD_volumename
  • Can still get files if necessary
• Some users still running apps off of the old application volumes (UNC path issues)
  • Search and destroy
Backup configuration
• Using Veritas Backup Exec 9 with the Additional Drive option and the SAN Shared Storage option
• Using 4 once-a-week tapes for each primary volume and half a dozen daily differential tapes, plus smaller numbers of more frequently rotating tapes for SYS volumes, NDS, server-specific information, and the Linux and NT/2000 application servers
Backup Limitations
• Only one node in the cluster acts as a media server
  • Cost was a factor: we would have had to buy another server license, plus options, per media server
  • Having had the multi-server edition of BE 8.5 with two years of upgrade protection, we received three remote NetWare and three remote NT/2000 agent licenses, enough for our current needs
  • The largest data volumes are usually attached to the media server
• Backups and cluster virtual servers
  • Few (if any) backup products support virtual servers and can find a volume that has failed over
  • Edit sys:nsn/user/smsrun.bas and change the nlmArray line, replacing “TSA600” with “TSA600 /cluster=off”, to access volumes as standard server-attached volumes, per TID 10065605 (see the sketch below)
  • Setting cluster volume pools to migrate back when the original server becomes available again helps prevent backup problems, when it works
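The smsrun.bas edit above just makes the backup TSA load with cluster awareness turned off; the same switch can be tried by hand at the server console. A sketch, with the caveat that the exact smsrun.bas contents vary by service pack:

    unload TSA600
    load TSA600 /cluster=off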
Cluster-enabled LDAP
• 2 virtual IP address resources: LDAP-1 and LDAP-2
• DNS round-robins “ldap.drew.edu”, and also has “ldap-1.drew.edu” and “ldap-2.drew.edu” (a zone sketch follows)
• Clients configured to use ldap.drew.edu
• LDAP will bind to all addresses on the server, so the NLM doesn’t need to be reloaded
• Client timeouts hide most cluster failovers
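A sketch of how the round-robin might look in a drew.edu zone file (BIND syntax); the addresses are made up, standing in for the two virtual IP resources:

    ; LDAP round robin (illustrative addresses)
    ldap      IN  A   10.10.1.21
    ldap      IN  A   10.10.1.22
    ldap-1    IN  A   10.10.1.21
    ldap-2    IN  A   10.10.1.22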
What’s in a name?
• Drew is acronym crazy
• Wanted an easy-to-remember name to brand the project, one that didn’t stand for anything or really mean anything
• “Causeway” implies things in a sort of abstract way without actually meaning anything
• People can refer to “Causeway” and it means something, but nothing too specific, which is actually good in this case
Cluster-enabled?
• What does it mean?
• Can I cluster-enable:
  • iFolder
  • NetStorage
  • Any product Novell sells
• Service availability criteria
Discussion
• Problems, issues, concerns?
• Other cluster sites? Issues?
• NW 5.1 versus 6?