Riding the Dot-Com Wave: A Case Study in Extreme VMScluster Scalability. CETS2001 Session 1255 Wednesday, Sept. 12, 2:45 pm, 303B Keith Parris. Topics. Scale of workload and configuration growth Changes made to scale the cluster Challenges to extreme scalability and high availability
Riding the Dot-Com Wave: A Case Study in Extreme VMScluster Scalability CETS2001 Session 1255 Wednesday, Sept. 12, 2:45 pm, 303B Keith Parris
Topics • Scale of workload and configuration growth • Changes made to scale the cluster • Challenges to extreme scalability and high availability • Surprises along the way • Lessons learned
Hardware Growth • 1981: 1 Alpha Microsystems PC • 1983: 1 VAX 11/750 • 1993: 1 MicroVAX 4100 • 1996: 4 VAX 7700s, 1 8400 • 1997: 6 VAX 7700s, 2 VAX 7800s, 2 8400s • 1999: 18 GS-140s • 2001: 2 clusters; one with 12 GS-140s, the other with 3 GS-140s and 2 GS-160s
Workload Growth Rate • As measured in yearly peak transaction counts • 1996-1997: 2X • 1997-1998: 2X • 1998-1999: 2X • 1999-2000: 3X • We’ll focus on these years
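The yearly multipliers on this slide compound. A minimal sketch of that arithmetic follows; the variable and function names are illustrative and not from the talk. The result agrees with the 24X workload growth cited later under Lessons Learned.

```python
from math import prod

# Year-over-year growth in yearly peak transaction counts, as listed on the slide.
yearly_growth = {
    "1996-1997": 2,
    "1997-1998": 2,
    "1998-1999": 2,
    "1999-2000": 3,
}

def cumulative_growth(factors):
    """Multiply the yearly factors to get total growth over the whole period."""
    return prod(factors)

total = cumulative_growth(yearly_growth.values())
print(f"Cumulative workload growth, 1996-2000: {total}X")
```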
Scaling the Cluster: Memory • Went from 1 GB to 20 GB on systems
Scaling the Cluster: CPU • Upgraded VAX 7700 nodes by adding CPUs • Upgraded key nodes from VAX 7700 to VAX 7800 CPUs • Ported application from VAX to Alpha • Went from 2-CPU 8400s to 6-CPU GS-140s, then added 12-CPU GS-160s • From 200 MHz EV4 chips to 731 MHz EV67
Scaling the Cluster: I/O • Went from JBOD to RAID, and raised number of members in RAID arrays over time • Added RMS Global Buffers • 3600 RPM magnetic disks to 5400 RPM to 7200 RPM to 10K RPM • Put hot files onto large arrays of Solid-State Disks • Upgraded from CMD controllers to HSJ40s; added writeback cache; upgraded to HSJ50s and doubled # of controllers; upgraded to HSJ80s
Scaling the Cluster: I/O • Changed from shadowsets of controller-based stripesets to host-based RAID 0+1 arrays of controller-based mirrorsets • Avoided any single controller being a bottleneck by spreading RAID array members across multiple controllers • Provided faster time-to-repair for shadowset member failures • Provided faster cross-site shadow copy times
Shadowsets of Stripesets • Volume shadowing thinks it sees large disks • Shadow copies and merges occur sequentially across entire surface • Failure of 1 member implies full shadow copy of stripeset to fix (Diagram: one host-based shadowset whose members are two controller-based stripesets)
Host-based RAID 0+1 arrays • Individual disks are combined first into shadowsets • Host-based RAID software combines the shadowsets into a RAID 0 array • Shadowset members can be spread across multiple controllers (Diagram: one host-based RAID 0+1 array built from three host-based shadowsets)
Host-based RAID 0+1 arrays • Shadow copies and merges occur in parallel on all shadowsets at once • Failure of 1 member requires full shadow copy of only that member to fix (Diagram: one host-based RAID 0+1 array built from three host-based shadowsets)
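The repair-cost difference between the two layouts can be sketched in a few lines. This is only an illustration of the reasoning on the preceding slides: the disk size and stripeset width below are hypothetical values, not figures from the talk, and the function names are invented for the example.

```python
def copy_gb_shadowed_stripesets(disk_gb, disks_per_stripeset):
    """Shadowset of stripesets: shadowing sees each stripeset as one large
    disk, so losing one member forces a copy of the whole stripeset."""
    return disk_gb * disks_per_stripeset

def copy_gb_raid_0_plus_1(disk_gb):
    """RAID 0 over shadowsets: each shadowset covers single disks, so only
    the failed member has to be copied."""
    return disk_gb

# Hypothetical configuration, chosen only to make the ratio visible.
disk_gb = 4
disks_per_stripeset = 6

before = copy_gb_shadowed_stripesets(disk_gb, disks_per_stripeset)
after = copy_gb_raid_0_plus_1(disk_gb)
print(f"Shadowsets of stripesets: copy {before} GB after one disk failure")
print(f"Host-based RAID 0+1:      copy {after} GB after one disk failure")
print(f"Reduction in data copied: {before // after}x")
```

Under these assumptions the amount of data to copy after a single-disk failure shrinks by a factor equal to the stripeset width.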
Scaling the Cluster: I/O & Locking • Implemented Fast_Path on CI • Tried Memory Channel and failed • CPU 0 saturation in interrupt state occurred when lock traffic moved from CI (with Fast_Path) to MC (without Fast_Path) • Went from 2 CI star couplers to 6 • Distributed lock traffic across CIs
Scaling the Cluster: • Implemented Disaster-Tolerant Cluster • Effectively doubled hardware: CPUs, I/O Subsystems, Memory • Significantly improved availability • But relatively long inter-site distance added inter-site latency as a new potential factor in performance issues
Scaling the Cluster: Datacenter space • Multi-site clustering and Volume Shadowing provided the opportunity to move to larger datacenters, without downtime, on 3 separate occasions
Challenges: • Downtime cost millions of dollars per event • Longer downtime meant even larger risk • Had to resist initial pressure to favor quick fixes over any diagnostic efforts that would lengthen downtime • e.g. crash dump files
Challenges: • Network focus in application design rather than Cluster focus • Triggered by history of adding node after node, connected by DECnet, rather than forming a VMS Cluster early on • Important functions assigned to specific nodes • Failover and load balancing problematic • Systems had to boot/reboot in specific order
Challenges: • Web interface design provided quick time-to-market using screen scraping, but had fragile 3-process chain with link to Unix
Challenges: • Fragile 3-process chain with link to Unix • Failure of Unix program, TCP/IP link, or any of the 3 processes on VMS caused all 3 VMS processes to die, incurring: • Process run-down and cleanup • Process creation and image activations for 3 new processes to replace the 3 which just died • Slowing response times could cause time-outs and initiate “meltdowns”
Challenges: • Interactive system capacity requirements in an industry with historically batch-processing mentality: • Can’t run CPUs to 100% with interactive users like you can with overnight batch jobs
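One way to see why interactive systems cannot be run near 100% CPU is a simple queueing model. The sketch below assumes an M/M/1 queue, which is a textbook simplification and not something the talk itself uses; the 10 ms service time is a hypothetical value.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    service_time: mean time to service one request with no queueing
    utilization:  fraction of time the server is busy, 0 <= rho < 1
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

service_ms = 10.0  # hypothetical unloaded service time per request
for rho in (0.50, 0.70, 0.90, 0.95, 0.99):
    r = mm1_response_time(service_ms, rho)
    print(f"utilization {rho:4.0%}: mean response time {r:7.1f} ms")
```

In this model response time grows without bound as utilization approaches 100%. A batch job only cares about total throughput, but an interactive user feels every one of those delays.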
Challenges: • Adding Solid-State Disks • Hard to isolate hot blocks • Ended up moving entire hot RMS files to SSD array
Challenges: • Application program techniques which worked fine under low workloads failed at higher workloads • Closing files for backups • ‘Temporary’ infinite loop
Challenges: • Standardization on Cisco network hardware and SNMP monitoring • Even on GIGAswitch-based inter-site cluster link
Challenges: • Constant pressure to port to Unix: • Sun proponents continually told management: • “We will be off the VAX in 6 months” • Adversely affected VMS investments at critical times • e.g. RZ28D disks, star couplers
Surprises Along the Way: • As more Alpha nodes were added, lock tree remastering activity caused “pauses” of 10 to 50 seconds every 2-3 minutes • Controlled with PE1=100 during workday
Surprises Along the Way: • Shadowing patch c. 1997 changed algorithm for selecting disk for read operations, suddenly sending ½ of the read requests to the other site, 130 miles (4-5 milliseconds) farther away • Subsequent patch kit allowed control of behavior with SHADOW_SYS_DISK SYSGEN parameter
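The effect of that change on average read latency can be estimated directly from the figures on the slide (half the reads, 4-5 milliseconds extra). The sketch below is a back-of-the-envelope calculation only; the function name is illustrative.

```python
def mean_added_read_latency_ms(remote_fraction, extra_latency_ms):
    """Average extra latency per read when a fraction of reads goes to the
    remote site and each remote read pays extra_latency_ms."""
    return remote_fraction * extra_latency_ms

remote_fraction = 0.5  # half of the read requests, per the slide
for extra_ms in (4.0, 5.0):  # inter-site penalty range, per the slide
    added = mean_added_read_latency_ms(remote_fraction, extra_ms)
    print(f"{extra_ms:.0f} ms inter-site penalty -> "
          f"{added:.1f} ms added to the average read")
```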
Surprises Along the Way: • VMS may allow a lock master node to take on so much workload that CPU 0 ends up saturated in interrupt state later • Caused CLUEXIT bugchecks and performance anomalies • With the help of VMS Source Listings and advice from VMS Engineering, wrote programs to spread lock mastership of the hot files across a set of several nodes, and held them there using PE1
Surprises Along the Way: • CI Load Sharing code never got ported from VAX to Alpha • Nodes crashing and rebooting changed assignments of which star couplers were used for lock traffic between pairs of nodes • Caused unpredictable performance anomalies • CSC and VMS Engineering came to the rescue with a program called MOVE_REMOTENODE_CONNECTIONS to set up a (static) load-balanced configuration
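The idea of a static, load-balanced assignment can be illustrated as follows. This is not the MOVE_REMOTENODE_CONNECTIONS program, whose internals the talk does not describe; it is a hypothetical sketch in which node names are invented and each pair of nodes is pinned to one star coupler in a way that does not depend on boot order.

```python
from collections import Counter
from itertools import combinations

def assign_pairs_to_couplers(nodes, coupler_count):
    """Map every pair of nodes to a star coupler index, round-robin.

    Sorting the nodes first makes the result depend only on the set of
    node names, not on the order in which nodes joined the cluster.
    """
    pairs = combinations(sorted(nodes), 2)
    return {pair: i % coupler_count for i, pair in enumerate(pairs)}

# Hypothetical node names; 6 star couplers as mentioned earlier in the talk.
nodes = ["NODE_D", "NODE_A", "NODE_C", "NODE_B"]
assignment = assign_pairs_to_couplers(nodes, coupler_count=6)

for pair, coupler in assignment.items():
    print(f"{pair[0]} <-> {pair[1]}: star coupler {coupler}")
print("pairs per coupler:", dict(Counter(assignment.values())))
```

Because the mapping is a pure function of the node names, a node that crashes and reboots ends up on the same couplers as before, which is the property the dynamic behavior lacked.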
Surprises Along the Way: • As disks grew larger, default extent sizes and RMS bucket sizes grew by default as files were CONVERTed onto larger disks using default optimize script • Data transfer sizes gradually grew by a factor of 14X over 4 years • Solid-state disks don’t benefit from increased RMS bucket sizes like magnetic disks do • Fixed by manually selecting RMS bucket sizes for hot files on solid-state disks
Challenges Left in VMScluster Scalability and High Availability • Can’t enlarge disks or RAID arrays on-line • Can’t re-pack RMS files on-line • Can’t de-fragment open files (with DFO) • Disks are getting much bigger but not correspondingly faster • I/Os per second per gigabyte is actually falling
Lessons Learned: • To provide good system performance one must gain knowledge of application behavior
Lessons Learned: • High-availability systems require: • Best possible people to run them • Best available vendor support: • Remedial • Engineering
Lessons Learned: • Many problems can be avoided entirely (or at least deferred) by providing “reserve” computing capacity • Avoids saturation conditions • Avoids error paths and other seldom-exercised code paths • Provides headroom for peak loads, and to accommodate rapid workload growth when procurement efforts have long lead times
Lessons Learned: • Staff size must grow with workload growth and cluster size, but with VMS clusters, not at as high a rate • Staff size went from 1 person to 8 people (plus vendor HW/SW support) with 24X workload growth
Lessons Learned: • Visibility of system workload and system performance is key, to: • Spot surges in workload • Identify bottlenecks as each new one arises • Provide quick turn-around of performance info into changes and optimizations • Overnight, and even mid-day
Lessons Learned: • With present technology, some scheduled downtime will be needed: • to optimize performance • to do hardware upgrades & maintenance • You’re going to have to have some downtime: do you want to schedule some or just deal with it when it happens on its own? • Scheduled downtime helps prevent or minimize unscheduled downtime
Lessons Learned: • Despite the redundancy within a cluster, a VMS Cluster viewed as a whole can be a Single Point of Failure • Solution: Use multiple clusters, with the ability to shift customer data quickly between them if needed • Can hide scheduled downtime from users
Lessons Learned: • It was impossible to optimize system performance by system tuning alone • Deep knowledge of application program behavior had to be gained, by: • Code examination • Constant discussions with development staff • Observing system behavior under load
Lessons Learned: • Application code improvements are often sorely needed, but their effect on performance can be hard to predict; they may actually hurt things, or make dramatic order-of-magnitude improvements • They are also often found due to serendipity or sudden inspiration, so it’s also hard to plan them or to predict when they might occur
Lessons Learned: • Effect of hardware upgrades is easier to predict: doubling the hardware will double the cost, and will generally provide close to double the performance • Order-of-magnitude improvements are harder to obtain, and more expensive
Success Factors: • Excellent people • Best technology • Quick procurement, preferably proactive • Top-notch vendor support • Services (CSC, MSE) • VMS Engineering; Storage Engineering
Speaker Contact Info Keith Parris E-mail: parris@encompasserve.org or keithparris@yahoo.com Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/ Integrity Computing, Inc. 2812 Preakness Way Colorado Springs, CO 80916-4375 (719) 392-6696