OpenVMS “Marvel” EV7 Proof Points of Olympic Proportions
Tech Update – Sept 2003
Steve.Lieman@hp.com
OpenVMS Marvel EV7 Proof Points of Olympic Proportions
• Live, mission-critical, production systems
• Multi-dimensional
• Before and after comparisons
• Upgrades from GS140 & GS160 to GS1280
• Proof of impact on maximum headroom
Can your enterprise benefit from an upgrade to a GS1280?
• Systems with high MPsynch
• Systems with high primary CPU interrupt load
• Poor SMP scaling
• Heavy locking
• Heavy IO: direct, buffered, mailbox
• Heavy use of Oracle, TCPIP, Multinet
Look closer if:
• Systems with poor response time
• Systems with insufficient peak-period throughput
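As a rough, purely illustrative companion to this checklist, a first screening pass over averaged MONITOR- or T4-style metrics might look like the sketch below. The dictionary keys and threshold values are assumptions chosen to mirror the bullets above, not HP guidance.

```python
# Hypothetical screening helper: flags a system as a GS1280 upgrade candidate
# when the symptoms listed above show up in its averaged metrics.
# The metric names and threshold values are illustrative assumptions.

def upgrade_candidate(metrics):
    """Return the list of symptoms found in a dict of averaged metrics."""
    symptoms = []
    if metrics.get("mpsynch_pct", 0) > 30:            # heavy MP synchronization time
        symptoms.append("high MPsynch")
    if metrics.get("cpu0_interrupt_pct", 0) > 60:     # primary CPU consumed by interrupts
        symptoms.append("high primary CPU interrupt load")
    if metrics.get("dirio_per_sec", 0) > 5000 or metrics.get("bufio_per_sec", 0) > 5000:
        symptoms.append("heavy IO load")
    return symptoms

# Example: a system resembling the Case 1 "before" profile
print(upgrade_candidate({"mpsynch_pct": 90, "cpu0_interrupt_pct": 95, "dirio_per_sec": 9500}))
```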
T4 – Data Sources
• Data for these comparisons was collected using the internally developed T4 (tabular timeline tracking tool) suite of coordinated collection utilities and analyzed with TLViz.
• The T4 kit and TLViz have consistently proved invaluable for this kind of before-and-after comparison project. We have now made T4 publicly available for download (it will ship with OpenVMS 7.3-2 in SYS$ETC:).
• T4 could be a useful adjunct to your performance management program.
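For readers who want to script their own before/after overlays outside TLViz, a minimal sketch is shown below. It assumes the T4 timelines have been exported as CSV files with a timestamp column and one column per metric; the file and column names are placeholders, not the actual T4 format.

```python
# Minimal before/after overlay of one metric from two T4-style timelines.
# File names and column names are assumptions; adjust them to match your
# actual T4 export before use.
import pandas as pd
import matplotlib.pyplot as plt

before = pd.read_csv("gs140_timeline.csv", parse_dates=["Sample Time"])
after = pd.read_csv("gs1280_timeline.csv", parse_dates=["Sample Time"])

metric = "Direct IO Rate"  # hypothetical column name
plt.plot(before["Sample Time"], before[metric], color="green", label="GS140 (before)")
plt.plot(after["Sample Time"], after[metric], color="red", label="GS1280 (after)")
plt.ylabel(metric)
plt.legend()
plt.show()
```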
Would you like to participate in our Marvel Proof Point Program?
• Contact steve.lieman@hp.com for more information about how you can take part.
• Download the T4 kit from the public web site: http://h71000.www7.hp.com/OpenVMS/products/t4/index.html
• Start creating a compact, portable, T4-based performance history of your most important systems.
• The T4 data will give us a common and efficient language for our discussions. We can then work with you to evaluate your unique pattern of use and the degree to which the advantages of Marvel EV7 on OpenVMS can most benefit you.
Want even more detail?
• The electronic version of this presentation contains extensive captions and notes on each slide for your further study, reflection, and review.
Case 1 – Production System: 12P GS140 700 MHz vs. 16P GS1280 1.15 GHz
Tremendous Gains in Headroom
Oracle Database Server with Multinet
Compute Queue Completely Evaporates with GS1280
Peak queues of 57 drop to queues of 1 or 2.
Green is GS140 at 700 MHz with 12 CPUs; Red is GS1280 at 1.15 GHz with 16 CPUs.
CPU 0 Idle Time
With the GS1280, 73% of CPU 0 is spare during the absolute peak. With the GS140, CPU 0 is completely consumed during peaks (e.g. at 11 AM).
Green is GS140 at 700 MHz with 12 CPUs; Red is GS1280 at 1.15 GHz with 16 CPUs.
Almost 4 to 1 Reduction in CPU Busy with GS1280
The GS140 is nearly maxed out at more than 1150% busy of a possible 1200%, while the GS1280 is cruising along at 250% to 350% busy of a possible 1600%.
Green is GS140 at 700 MHz with 12 CPUs; Red is GS1280 at 1.15 GHz with 16 CPUs.
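The roughly 4 to 1 headroom claim follows directly from these utilization figures; here is a quick back-of-the-envelope check using the peak values quoted above (a sketch, not part of the original analysis):

```python
# Back-of-the-envelope headroom comparison from the peak utilizations above.
# 100% = one fully busy CPU, so a 12-CPU box tops out at 1200% and a 16-CPU
# box at 1600%.
gs140_busy, gs140_capacity = 1150, 12 * 100     # nearly saturated
gs1280_busy, gs1280_capacity = 350, 16 * 100    # upper end of the observed 250-350%

gs140_headroom = gs140_capacity / gs140_busy     # ~1.04x: almost no room to grow
gs1280_headroom = gs1280_capacity / gs1280_busy  # ~4.6x the current load
print(f"GS140 headroom:  ~{gs140_headroom:.1f}x of the current load")
print(f"GS1280 headroom: ~{gs1280_headroom:.1f}x of the current load")
```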
Direct IO (includes network traffic)
The GS1280 is able to push to higher peaks when the load gets heavy, while still having huge spare capacity for more work. The GS140 is close to maxed out at 10,000 DIRIO per second.
Green is GS140 at 700 MHz with 12 CPUs; Red is GS1280 at 1.15 GHz with 16 CPUs.
MPsynch
MPsynch drops from 90% to under 10%, leaving plenty of room for further scaling.
Green is GS140 at 700 MHz with 12 CPUs; Red is GS1280 at 1.15 GHz with 16 CPUs.
Packets Per Second Sent – a key throughput metric
We estimate the actual maximum rate for the GS1280 at more than 20,000/sec. The GS140 maxes out at about 5,000 packets per second with little or no spare capacity; the GS1280 reaches 6,000 with substantial spare capacity.
Blue is GS140 at 700 MHz with 12 CPUs; Red is GS1280 at 1.15 GHz with 16 CPUs.
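The 20,000/sec figure is an extrapolation. One simple way to reproduce an estimate of that kind is to scale the observed rate by the unused CPU capacity, assuming packet handling scales roughly linearly with available CPU (a first-order simplification, since other resources eventually cap the rate):

```python
# Linear extrapolation of the maximum packet rate from the observed rate and
# CPU utilization. Assumes packet handling scales with available CPU, which
# is only a first-order approximation.
observed_pkts_per_sec = 6000
cpu_busy_pct, cpu_capacity_pct = 350, 1600   # peak busy vs. 16 CPUs

estimated_ceiling = observed_pkts_per_sec * cpu_capacity_pct / cpu_busy_pct
print(f"Estimated ceiling: ~{estimated_ceiling:,.0f} packets/sec")  # ~27,000, above the quoted 20,000
```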
Case 1 Summary: 12P GS140 to 16P GS1280
• GS1280 delivers an estimated increase in headroom of at least 4X
• Eliminates the CPU 0 bottleneck
• Drastically cuts MPsynch
• Able to handle higher peaks as they arrive
• Almost 4 to 1 reduction in CPU use while doing slightly more work
Case 2 – Production System: 10P GS140 700 MHz vs. 8P GS1280 1.15 GHz
Tremendous Gains in Headroom for an Oracle Database Server despite reduced CPU count
Poised to Scale
Compute Queue Completely Evaporates with GS1280 under the current workload demand
Peak queues of 32 drop to 3.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
CPU 0 Idle Time
With the GS1280, 69% of CPU 0 is spare during the absolute peak with this workload. With the GS140, CPU 0 is completely consumed during peaks (e.g. at 10:30, for many minutes at a time).
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
More than 3 to 1 Reduction in CPU Busy with GS1280
The GS140 is completely maxed out at more than 1000% busy, while the GS1280 is cruising along at 200% to 350% busy of a possible 800%.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
Direct IO (includes network traffic)
The GS1280 is able to push to higher peaks of 10,500 when the load temporarily gets heavier, while still having huge spare capacity (approximately 5 CPUs) for more work. The 10P GS140 is maxed out at slightly over 8,000 DIRIO per second.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
MPsynch (more than a 9 to 1 reduction with this workload)
MPsynch drops from peaks of 67% to peaks of only 7%, leaving plenty of room for further scaling.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
Packets Per Second Sent – a key throughput metric
We estimate the actual maximum rate for the 8P GS1280 at more than 11,000/sec; with 16P this would rise to 20,000/sec. The 10P GS140 maxes out at about 4,200 packets per second with no spare capacity; the 8P GS1280 reaches 4,800 with more than 4.5 CPUs to spare.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
CPU 0 Interrupt – well poised for scaling to 8, 12, and even more CPUs with the GS1280
During peak periods, despite the fact that the 8P GS1280 is doing slightly more work, it uses a factor of 3.5X less CPU 0 time for interrupt activity. At peaks of only 20%, the GS1280 stands ready to handle substantially higher workloads.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
Disk Operations Rate – shows the same head-and-shoulders pattern as direct IO and packets per second
During peak periods, the 10P GS140 maxes out at 2,200 disk operations per second. With this workload, the 8P GS1280 is able to reach 2,900 per second with lots of room to spare. As the load demand on the GS1280 increases, this 8P model looks capable of driving the disk operation rate to 6,000/sec.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
Interrupt load during peak periods drops by a factor of almost 5 to 1, from 240% to 50%. This is another excellent sign of the potential future scalability of this GS1280 to 8 CPUs, 12 CPUs, and beyond.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
Microseconds of CPU per Direct IO
Normalized statistics like this show that the relative power of each GS1280 CPU at 1.15 GHz is 3 to 4 times that of the GS140's 700 MHz CPUs.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
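This normalized statistic can be reproduced from any pair of samples: a CPU at 100% busy delivers 1,000,000 microseconds of CPU time per second, so dividing delivered CPU time by the Direct IO rate gives the cost per IO. The sketch below uses illustrative numbers, not the exact values behind this chart:

```python
# CPU microseconds consumed per Direct IO, the normalized metric plotted above.
# One CPU at 100% busy delivers 1,000,000 microseconds of CPU time per second.
def microseconds_per_dirio(cpu_busy_pct, dirio_per_sec):
    cpu_us_per_sec = cpu_busy_pct / 100.0 * 1_000_000
    return cpu_us_per_sec / dirio_per_sec

# Illustrative samples (not the measured values): same IO rate, very different CPU cost
print(microseconds_per_dirio(cpu_busy_pct=1000, dirio_per_sec=8000))  # GS140-like: ~1250 us/IO
print(microseconds_per_dirio(cpu_busy_pct=300, dirio_per_sec=8000))   # GS1280-like: ~375 us/IO, ~3.3x cheaper
```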
Disk Reads Per Second
This shows the same head-and-shoulders pattern, but even more pronounced than what we saw with network packets.
Red is GS1280 at 1.15 GHz with 8 CPUs; Green is GS140 at 700 MHz with 10 CPUs.
Case 2 Summary: 10P GS140 to 8P GS1280
• GS1280 with fewer CPUs delivers an estimated headroom increase of more than 2X
• Eliminates the CPU busy bottleneck
• Drastically cuts MPsynch
• Able to handle higher peaks as they arrive
• Well positioned to scale to 8, 12, or more CPUs and achieve headroom increases of 3.5X or even higher (a rough projection sketch follows)
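The 3.5X projection rests on the assumption that, with MPsynch and interrupt load this low, additional CPUs translate almost linearly into usable capacity. A rough projection under that assumption (a sketch, not a capacity-planning tool):

```python
# Rough headroom projection if the 8P GS1280 is grown to 12 or 16 CPUs,
# assuming near-linear scaling. That assumption is plausible here given the
# low MPsynch and interrupt figures, but it is still an assumption.
current_peak_busy_pct = 350   # peak CPU busy observed on the 8P GS1280
for cpus in (8, 12, 16):
    capacity_pct = cpus * 100
    print(f"{cpus}P: ~{capacity_pct / current_peak_busy_pct:.1f}x headroom over today's peak")
```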
Proof Point Patterns
• Dramatic cuts in MPsynch
• Large drops in interrupt mode
• Higher, short-lived bursts of throughput (direct IO, packets per second, etc.) – the “HEAD and SHOULDERS” pattern (quantified in the sketch below)
• Large increase in spare capacity and headroom (overall CPU, primary CPU)
Where the workload stays relatively flat at the point of transition, the overall throughput numbers are not that different, but the shape of the new curve, with its sharp peaks, tells an important story.
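The head-and-shoulders shape can be put into a single number by comparing short-lived peaks against the sustained rate in a T4 timeline. The sketch below is one way to do that with pandas; it assumes the metric is a series sampled at a fixed interval.

```python
# Quantify the "head and shoulders" pattern: how far short bursts rise above
# the sustained rate. Assumes a pandas Series of a throughput metric sampled
# at a fixed interval (for example, one T4 sample per minute).
import pandas as pd

def burst_ratio(series, burst_window=5):
    sustained = series.median()                             # typical sustained rate
    burst_peak = series.rolling(burst_window).mean().max()  # highest short-lived burst
    return burst_peak / sustained

# Example with synthetic data: a flat load with one brief 2x burst
demo = pd.Series([5000] * 55 + [10000] * 5)
print(f"Burst peak is {burst_ratio(demo):.1f}x the sustained rate")
```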
Case 3 – Stress Test: Marvel 32P – RMS1
• This case shows a segment of our RMS1 testing on the 32P Marvel EV7 at 1.15 GHz
• Using multiple 4 GB RAMdisks
• Started at 16P and ramped up the workload
• Then increased to 24P; throughput dropped
• Then affinitized jobs; throughput jumped
• Combines timeline data from t4, spl, and bmesh
Background to this test
• RMS1 is based on a customer-developed database benchmark test originally written using Rdb and converted to carry out the same task with RMS.
• To generate extremely high rates of IO and discover the limits of Marvel 32P performance, we ran multiple copies of RMS1, each using its own dedicated RAMdisk.
• Caution: the net effect is a test that generates an extremely heavy load, but it cannot be considered to mirror any typical production environment.
Timing of Changes
• 12:05 – 16 CPUs
• 12:11 – Start ramp-up with 4 GB RAMdisks
• 12:30 – Increase to 24 CPUs
• 12:38 – Set process affinity
• 12:55 – Turn off dedicated lock manager
Observe how timelines help make sense of this complicated situation.
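One way to make sense of a sequence like this is to draw the change times directly on the metric timelines. The sketch below does that with matplotlib; the file name, column names, and date are placeholders for whatever your exported T4 data actually contains.

```python
# Overlay the configuration-change times on a throughput timeline so each
# step of the test can be read straight off the plot.
import pandas as pd
import matplotlib.pyplot as plt

events = {                      # times taken from the list above
    "12:11": "start ramp-up",
    "12:30": "24 CPUs",
    "12:38": "set affinity",
    "12:55": "lock manager off",
}

df = pd.read_csv("rms1_timeline.csv", parse_dates=["Sample Time"])  # hypothetical export
plt.plot(df["Sample Time"], df["Direct IO Rate"])                   # hypothetical column
for hhmm, label in events.items():
    t = pd.Timestamp(f"2003-09-01 {hhmm}")   # assumed date, for illustration only
    plt.axvline(t, linestyle="--")
    plt.text(t, df["Direct IO Rate"].max(), label, rotation=90, va="top")
plt.ylabel("Direct IO per second")
plt.show()
```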
Direct IO up to 70,000 per second!
For the RMS1 workload, the rate of direct IO per second is a key metric of maximum throughput. Increasing to 24 CPUs at 12:30 does not increase throughput. Turning on affinity causes throughput to jump from 55,000 to over 70,000, an increase of approximately 30% (1.3X).
Kernel & MPsynch Switch Roles
12:30 is when we jumped from 16 CPUs to 24 CPUs. Note how MPsynch (green) jumps up substantially at that time to over 950%. At 12:37, we started affinitizing the different processes to CPUs we believed to be close to where their associated RAMdisk was located. Note how MPsynch and kernel mode cross over at that point.
Lock Busy % from T4 shows a jump with affinity
We had the dedicated lock manager turned on for this test, which creates a very heavy locking load. Note that there is no change when the number of CPUs is increased at around 12:30. Note the big jump in lock busy % that happens when we affinitize. At over 90% busy, locking is a clear primary bottleneck that will prevent further increases in throughput even with more CPUs.
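When one resource is this close to saturation, the utilization law gives a quick ceiling estimate: maximum throughput is roughly the current throughput divided by the bottleneck's utilization. A sketch using the figures quoted on these slides (an approximation, not a measured limit):

```python
# Utilization-law ceiling: with the locking resource already over 90% busy,
# throughput can only grow until that resource reaches 100%, no matter how
# many CPUs are added.
current_dirio_per_sec = 70_000
lock_busy_fraction = 0.90

ceiling = current_dirio_per_sec / lock_busy_fraction
print(f"Locking-limited ceiling: ~{ceiling:,.0f} direct IO per second")  # ~78,000
```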
Lock Requests per Second vs. XFC Writes – A True Linear Relationship
The maximum rate of lock requests is an astounding 450,000 per second.
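A relationship this linear is easy to confirm from the raw T4 samples with a least-squares fit. The sketch below uses NumPy; the file and column names are assumptions about how the data was exported.

```python
# Check how linear the relationship between two T4 metrics really is by
# fitting a line and reporting the correlation coefficient.
import numpy as np
import pandas as pd

df = pd.read_csv("rms1_timeline.csv")                # hypothetical export
x = df["XFC Write Rate"].to_numpy(dtype=float)       # hypothetical column
y = df["Lock Request Rate"].to_numpy(dtype=float)    # hypothetical column

slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]
print(f"lock requests ~= {slope:.1f} * XFC writes + {intercept:.0f}  (r = {r:.3f})")
```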
Case 3 – Summary
• These are by far the best throughput numbers we have ever seen on this workload for direct IO and lock requests per second.
• Performance is great out of the box.
• New tools simplify bottleneck identification.
• Straightforward tuning pushes throughput to even higher values, with a surprisingly large upward jump.
• Workloads show consistent ratios between key statistics (e.g. lock requests per DIRIO).
• Spinlock-related bottlenecks remain with us, albeit at dramatically higher throughput levels.
Case 4 – Production System
• Upgrade from a 16-CPU Wildfire EV68 running at 1.224 GHz (the fastest Wildfire)
• Compared to a 16-CPU Marvel EV7 running at 1.15 GHz
• Oracle, TCPIP, mixed database server and application server
CPU Busy cut in half – note the color switch!
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
CPU 0 Interrupt is cut by a factor of more than 3 to 1
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
Buffered IO – sustained higher peaks
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
Direct IO – sustained higher peaks
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
System-wide Interrupt diminished by a factor of 4 to 1
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
MPsynch shrinks by more than 8 to 1
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
Kernel Mode decreases from 260 to 150
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
User Mode decreases from about 480 to 240
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.
Compute Queue disappears
Red is GS1280 with 16 CPUs at 1.15 GHz; Green is GS160 with 16 CPUs at 1.224 GHz.