State of the Benchmarks Daniel Bowers daniel.bowers@gartner.com @Daniel_Bowers
Agenda • Server Benchmarks • Benchmark Types • Public Result Pools • Why aren’t there more results? • TPC and SPEC • Vendor Benchmarks • Virtualization Benchmarks • Public and Open-Source Benchmarks • “Good” and “Bad” Benchmarks
Server Benchmarks: A Definition The level of performance for a given server configuration that vendors promise you can’t achieve
What People Do With Benchmark Results • Purchase or deployment decisions • Right-sizing & configuration definition • Capacity planning / consolidation • Normalizing server value for chargeback • Performance baselining and monitoring • Troubleshooting • Setting hardware and software pricing • Configuration advice • Result pools
Using Benchmarks: Good to Great? Bad? • Smart, Hard-Working, Rich • Basically: better than nothing
Why Use Published Results? • They’re free • They’re quick • No skill required • You don’t need to have the servers • More accurate than MHz or MIPS • They’re often audited (or at least scrutinized)
Examples of Benchmarks with Public Result Pools • Consortia: SPEC, TPC • Vendor application benchmarks: Oracle, SAP, VMware • Vendor relative metrics: IBM rPerf, Fujitsu OLTP2 • Government: APP, ENERGY STAR • Academic & HPC: LINPACK / HPL, STREAM • Desktop: SiSoftware Sandra, Geekbench • Embedded: EEMBC CoreMark • “Open”: DVDStore, BogoMIPS • Purchasing metrics: Amazon ECU, Google GCE
Benchmarks “Dilbert”: December 25, 2004. Source: Dilbert.com.
The Gap: only 19% of server configurations have published SPEC CPU2006 results
Why aren’t there more results? • Vendors won’t publish losers • Publishing is rarely required, and can be prohibited • Can take lots of manpower & money • Little incentive for end users to publish results • Benchmarks get broken or become stale
Why Aren’t There More Results? “Benchmarks are chosen to show off a platform, not to allow comparisons” – from an IBM presentation to CMG, 2008
Why Aren’t There More Results? “A TPC-E measurement can take up to 3 months from preparation through the official acceptance by the TPC committee.” – from a Fujitsu document on server performance
TPC • Non-profit industry consortium • Creates benchmarks to compare database systems • Membership is primarily server hardware, OS, and database vendors • 1997: 53 members (including associates); today: 21 members • Disclosure: Gartner is an associate member • All benchmark results are audited by a third party & subject to challenge • Full disclosure required (actually two reports, including an executive summary); must include pricing • Estimates and non-audited results not usually allowed • Produces specifications, not programs • Benchmark setups are large, usually dominated by storage
TPC benchmark timeline, 1990–2012 • Retired: TPC-A (Lightweight OLTP), TPC-B (Batch/Database Stress), TPC-D (Decision Support), TPC-R (Business Reporting), TPC-W (Web commerce transactional), TPC-APP (Application Server) • Active: TPC-C (OLTP, Product Supplier), TPC-E (OLTP, Brokerage House), TPC-H (Ad Hoc Decision Support / DSS), TPC-Energy • New: TPC-DS (Decision Support, 2012), TPC-VMS (2012)
TPC-C • Long history, tremendous mindshare • Results, estimates, and predictions for “tpmC” are plentiful • Allows comparisons across many server generations • OLTP workload that’s old and weak • Disparity between processor & I/O performance growth • Storage costs dominate • Server-to-storage I/O path is the bottleneck • Quiz: Why don’t SSDs yield higher results? • TPC has tried to replace it • Cost breakdown of example TPC-C result: DL385 G1 using Microsoft SQL Server 2005. Full report: http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=106032001
TPC-C: Example IBM TPC-C Full Disclosure Report – http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=112041101
Microsoft & TPC-C • Microsoft: “TPC-E is far superior to TPC-C.” – Charles Levine, Principal Program Manager, SQL Server Performance Engineering • Microsoft won’t approve TPC-C publications using SQL Server 2008 or later • Chart source: Microsoft (http://blogs.technet.com/b/dataplatforminsider/archive/2012/08/28/tpc-e-raising-the-bar-in-oltp-performance.aspx)
Benchmarks “Pepper…and Salt” January 29, 2013 Source: Wall Street Journal
TPC-E • OLTP, like TPC-C • More tables (33) • More transaction types (~100), including more complex transactions • Only results to date are on x86 with Microsoft SQL Server • Trivia: dataset based on the NYSE company list and uses some US census data • Helpful hint: Fujitsu OLTP2 • Results for all recent Xeon processor models • Search for PDF files titled “Fujitsu PRIMERGY Servers Performance Report”
TPC-H • Benchmark results are for specific database sizes (scale factors) • TPC: don’t compare different sizes (but my research says that’s OK) • Parts of the data set scale linearly with performance; some have become unrealistic, e.g. 50 billion customer records • Smaller database sizes are “broken” by in-memory, columnar databases • Actian VectorWise results are about double the expected results • Benchmark appears to be fading away, but may see a surge of activity as Oracle & Microsoft add columnar support to their databases • Source: HP whitepaper
TPC-DS • Decision support database benchmark meant to replace TPC-H • Released in mid-2012; no results to date (and no auditors either) • Includes many more query types than TPC-H • Periodic database update process that more closely matches how today’s databases are used • “Modern”: modified star schema with fact tables and dimension tables
Other TPC Benchmarks • TPC-Energy • Optional add-on to other TPC benchmarks • TPC-VMS • Just released • Runs three copies of an existing TPC benchmark simultaneously in VMs on a single system under test • In development: • TPC-V • TPC-ETL • TPC-Express
Benchmarks • Why pay attention to BogoMIPS? “To see whether your system is faster than mine. Of course this is completely wrong, unreliable, ill-founded, and utterly useless, but all benchmarks suffer from this same problem. So why not use it?” – from the Linux BogoMIPS HOWTO
Standard Performance Evaluation Corporation (SPEC) • “The goal of SPEC is to ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems.” • Sells source code, including common ports • Searchable results pool • 115 members across four independent groups: • Open Systems Group (mostly vendors) • Workstation group • Graphics group • Research group (mostly academics) • Disclosure: Gartner is a member • Results generally require a standard-format report • Lists intermediate results and optimizations used • Price not included • Estimates are allowed for most benchmarks
SPEC CPU2006 • Measures CPU Integer and Floating Point capacity • Often correlates with overall system performance because server designs typically balance memory, IO, and CPU • Actually 8 different metrics: • Integer and Floating Point tests • Speed and Rate tests • Base and Peak results • ~25 different individual workloads, from games to quantum computing • Changes versions every 6-8 years • CPU92, CPU2000, CPU2006 • CPUv6 currently under development • Results not comparable between versions
SPEC CPU2006 • Depends almost entirely on CPU model, core count, clock speed • Some impact from Compiler (e.g. +15%) • Small impact from OS, cache • Floating Point impacted by memory speed • “Turbo Mode” frequency correlation • Benchmarked configurations must be “offered” • Published results are peer reviewed (by ‘competitors’) • Reviewers are PICKY!
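To make the scoring concrete: each SPEC CPU metric is a geometric mean of per-workload ratios against a fixed reference machine (reference runtime divided by measured runtime). A minimal sketch in C, using invented ratios rather than real results and making no claim to match SPEC’s own tooling:

```c
#include <math.h>
#include <stdio.h>

/* Sketch only: a SPEC CPU composite score is the geometric mean of
 * per-workload ratios (reference runtime / measured runtime).
 * The ratios below are invented for illustration, not real results. */
static double geomean(const double *ratios, size_t n) {
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(ratios[i]);
    return exp(log_sum / (double)n);
}

int main(void) {
    double ratios[] = { 18.3, 24.9, 21.1, 30.4, 26.7 };  /* hypothetical per-workload ratios */
    size_t n = sizeof ratios / sizeof ratios[0];
    printf("Composite score: %.1f\n", geomean(ratios, n));  /* prints ~23.9 */
    return 0;
}
```

Compile with something like `cc spec_sketch.c -lm`. The “rate” metrics run multiple concurrent copies, but the composite is still reported as a geometric mean of the normalized ratios.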
“Benchmark results are usually for: sales and marketing, customer awareness, customer confidence” – from a Fujitsu presentation
SPECjbb2005 • Server-side Java benchmark • Heavily dependent on the JVM • Also highly dependent on processor speed, core count, and Java garbage collection • Performance plateaus beyond a certain amount of cache and memory • Disk and network I/O play no part • Emulates a 3-tier system on a single host • Database tier is emulated, memory-resident • Useful tidbits: • Cheap & easy to run, so lots of results • Measures throughput in operations per second, with transactions similar to TPC-C’s • Full report includes performance beyond the peak • Being replaced (SPECjbb2013)
SPECjbb2013 • Released last month! • Scales more realistically than SPECjbb2005 • Includes inter-JVM communication • Includes a response-time requirement & reporting, in addition to “operations per second” • Like jbb2005, a key design goal was making it easy to run
SPECjEnterprise2010 • Java server benchmark designed to test whole-system Java EE performance • Includes database and storage • The system under test can span more than one server • Harder to set up and run than SPECjbb2005, so fewer results • Top 15 results (as of 1 Feb 2013)
Other SPEC benchmarks • Power: SPECpower_ssj2008 • HPC: SPEC MPI, SPEC OMP • File system: SPECsfs2008 • SIP messaging: SPECsip • Web: SPECweb2009, SPECweb2005 • Mail: SPECmail2009 • Cloud: SPEC Cloud • Handheld working group • SPEC also has a research group that creates benchmarks for research & development purposes
Vendor-Sponsored Application Benchmarks • SAP • Various, but SD 2-Tier is the most popular • Results published on x86 due to support requirements • Correlates with clock, cores, OS, database • Plateaus at relatively low memory sizes • Pre-2009 results not comparable to current results • Used for the SAP “QuickSizer” system-sizing tool • Oracle • Official: EBS, Siebel & PeopleSoft benchmarks, etc. • Good: workload-specific • Bad: seeing fewer results than in the past • Microsoft • Fast Track system benchmarks: MCR/BCR
Benchmarks “Dilbert”: March 02, 2009. Source: Dilbert.com.
Virtualization Benchmarks • VMmark • Includes both application and “infrastructure” workloads • DVDStore, OLIO, Exchange 2007 • Idle machine • vMotion, storage vMotion • Based on the concept of “tiles”; each tile = 8 workloads • VMware (and therefore x86) only • CPU is the bottleneck, not memory • With the same CPU, results from different vendors are almost identical • vSphere license contains a DeWitt clause • SPECvirt • Uses 3 other SPEC benchmarks as its workloads: • SPECweb2005 • SPECjAppServer2004 • SPECmail2008 • Uses a similar “tiles” concept to VMmark • Just vSphere, KVM, Xen results
Government “Benchmarks” • ENERGY STAR • Sponsored by US EPA • Rewards servers that achieve “best in class” power efficiency targets • Version 1 and upcoming Version 2 disqualify some server categories • APP • Calculated number used by US for export control reasons • Similar to MIPS
Some commercial benchmark software • Server-oriented: Quest (Dell) Benchmark Factory • Desktop-oriented: SiSoftware Sandra*, Primate Labs Geekbench*, SYSmark, Phoronix Test Suite*, Maxon Cinebench • Repositories: CloudHarmony (cloud instances) • Tools with metrics: BMC Capacity Optimization, Computer Associates Hyperformix, VMware Capacity Planner • * Includes a public repository of user-submitted results
Popular Open Source or Public Domain benchmarks • STREAM • Simple memory bandwidth test • Gets close to server theoretical maximums • LINPACK / HPL • Floating-point tests used to compare HPC & supercomputer performance • “Results should not be taken too seriously” • Other examples • PRIME95 • Terasort • DVDStore • ApacheBench • OLIO
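To show what STREAM’s headline number represents, here is a rough, unofficial C sketch of its “triad” kernel. The real benchmark runs several kernels, repeats them, and validates the results; the array size below is an arbitrary assumption chosen to exceed typical caches, not STREAM’s default.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32 * 1024 * 1024)   /* doubles per array (~256 MB each); assumed size, not STREAM's */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    const double scalar = 3.0;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);          /* POSIX monotonic timer */
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];              /* the "triad": two reads, one write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* 24 bytes move per element (2 reads + 1 write), ignoring write-allocate traffic */
    printf("Approximate triad bandwidth: %.1f GB/s (check: a[0] = %.1f)\n",
           24.0 * N / secs / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}
```

A single pass on a single thread will typically land well below the official STREAM figure (which uses repetitions and, usually, OpenMP across cores); the point here is only to illustrate the kernel being timed.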
Vendor attitudes towards benchmarks • Source: http://online.asbis.sk/documents/ORACLE/dokumenty/Kubacek_Dalibor_OS_Positioning.pdf
Benchmarks We Lack • Converged systems • Public cloud versus on-premises • “Microserver” power • 3rd-party mainframe
My advice for using other people’s benchmark results • Only use them when you’re lazy and poor (full disclosure: I am lazy and poor) • Ask vendors for non-published results • Ignore differences < 10% • For big servers, don’t divide results by the number of cores • If you’re going to use just SPEC CPU2006, use the SPEC CPU Integer Rate base results
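A trivial illustration of the “ignore differences under 10%” rule of thumb; the threshold and the two scores below are assumptions made up for the example, not values from SPEC, TPC, or any vendor:

```c
#include <stdio.h>

/* Hypothetical helper: treat two published results as a tie unless the
 * faster one leads by at least 10% (this deck's rule of thumb, not an
 * official guideline). */
static int meaningfully_faster(double a, double b) {
    double lo = a < b ? a : b;
    double hi = a < b ? b : a;
    return (hi - lo) / lo >= 0.10;
}

int main(void) {
    double server_a = 1050.0, server_b = 1000.0;   /* made-up Integer Rate base scores */
    printf("%s\n", meaningfully_faster(server_a, server_b)
                       ? "Difference is large enough to matter"
                       : "Within the noise: treat as a tie");
    return 0;
}
```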
Questions?