State of the Benchmarks Daniel Bowers daniel.bowers@gartner.com @Daniel_Bowers
Agenda • Server Benchmarks • Benchmark Types • Public Result Pools • Why aren’t there more results? • TPC and SPEC • Vendor Benchmarks • Virtualization Benchmarks • Public and Open-Source Benchmarks • “Good” and “Bad” Benchmarks
Server Benchmarks: A Definition The level of performance for a given server configuration that vendors promise you can’t achieve
What People Do With Benchmark Results • Purchase or deployment decisions • Right-sizing & configuration definition • Capacity planning / consolidation • Normalizing server value for chargeback • Performance baselining and monitoring • Troubleshooting • Setting hardware and software pricing • Configuration advice • Result pools
Using Benchmarks: Good to Great? Bad? • Smart, Hard-Working, Rich • Basically: better than nothing
Why Use Published Results? • They’re free • They’re quick • No skill required • You don’t need to have the servers • More accurate than MHz or MIPS • They’re often audited (or at least scrutinized)
Examples of Benchmarks with Public Result Pools • Consortia: SPEC, TPC • Vendor application benchmarks: Oracle, SAP, VMware • Vendor relative metrics: IBM rPerf, Fujitsu OLTP2 • Government: APP, ENERGY STAR • Academic & HPC: LINPACK / HPL, STREAM • Desktop: SiSoftware Sandra, Geekbench • Embedded: EEMBC CoreMark • “Open”: DVDStore, BogoMIPS • Purchasing metrics: Amazon ECU, Google GCE
Benchmarks “Dilbert”: December 25, 2004. Source: Dilbert.com.
The Gap: only 19% of server configurations have published SPEC CPU2006 results
Why aren’t there more results? • Vendors won’t publish losers • Publishing is rarely required, and can be prohibited • Can take lots of manpower & money • Little incentive for end users to publish results • Benchmarks get broken or become stale
Why Aren’t There More Results? “Benchmarks are chosen to show off a platform, not to allow comparisons” – from an IBM presentation to CMG, 2008
Why Aren’t There More Results? “A TPC-E measurement can take up to 3 months from preparation through the official acceptance by the TPC committee.” – from a Fujitsu document on server performance
TPC • Non-profit industry consortium • Creates benchmarks to compare database systems • Membership is primarily server hardware, OS, and database vendors • 1997: 53 members (including associates); today: 21 members • Disclosure: Gartner is an associate member • All benchmark results are audited by a third party & subject to challenge • Full disclosure required (actually two reports, including an executive summary); must include pricing • Estimates and non-audited results not usually allowed • Produces specifications, not programs • Benchmark setups are large, usually dominated by storage
TPC benchmark timeline, 1990–2012 • Retired: TPC-A (Lightweight OLTP), TPC-B (Batch/Database Stress), TPC-D (Decision Support), TPC-R (Business Reporting), TPC-W (Web commerce transactional), TPC-APP (Application Server) • Active: TPC-C (OLTP, Product Supplier), TPC-E (OLTP, Brokerage House), TPC-H (Ad Hoc Decision Support / DSS), TPC-Energy • New: TPC-DS (Decision Support, 2012), TPC-VMS (2012)
TPC-C • Long history, tremendous mindshare • Results, estimates, and predictions for “tpmC” are plentiful • Allows comparisons across many server generations • OLTP workload that’s old and weak • Disparity between processor & I/O performance growth • Storage costs dominate • Server-to-storage I/O path is the bottleneck • Quiz: Why don’t SSDs yield higher results? • TPC has tried to replace it • Cost breakdown of example TPC-C result: DL385 G1 using Microsoft SQL Server 2005. Full report: http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=106032001
TPC-C: Example IBM TPC-C Full Disclosure Report – http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=112041101
Microsoft & TPC-C • Microsoft: “TPC-E is far superior to TPC-C.” – Charles Levine, Principal Program Manager, SQL Server Performance Engineering • Microsoft won’t approve TPC-C publications using SQL Server 2008 or later • Chart source: Microsoft (http://blogs.technet.com/b/dataplatforminsider/archive/2012/08/28/tpc-e-raising-the-bar-in-oltp-performance.aspx)
Benchmarks “Pepper…and Salt” January 29, 2013 Source: Wall Street Journal
TPC-E • OLTP, like TPC-C • More tables (33) • More transaction types (~100), including more complex transactions • Only results to date are on x86 with Microsoft SQL Server • Trivia: dataset based on the NYSE company list and uses some US census data • Helpful hint: Fujitsu OLTP2 • Results for all recent Xeon processor models • Search for PDF files titled “Fujitsu PRIMERGY Servers Performance Report”
TPC-H • Benchmark results are for specific database sizes (scale factors) • TPC: don’t compare different sizes (but my research says that’s OK) • Parts of the data set scale linearly with performance; some have become unrealistic, e.g. 50 billion customer records • Smaller database sizes are “broken” by in-memory, columnar databases • Actian VectorWise results are about double the expected results • Benchmark appears to be fading away, but may see a surge of activity as Oracle & Microsoft add columnar support to their databases • Source: HP whitepaper
TPC-DS • Decision support database benchmark meant to replace TPC-H • Released in mid-2012; no results to date (and no auditors either) • Includes many more query types than TPC-H • Periodic database update process that more closely matches how today’s databases are used • “Modern”: modified star schema with fact tables and dimension tables
Other TPC Benchmarks • TPC-Energy • Optional add-on to other TPC benchmarks • TPC-VMS • Just released • Runs three copies of an existing TPC benchmark simultaneously in VMs on a single system under test • In development: • TPC-V • TPC-ETL • TPC-Express
Benchmarks • Why pay attention to BogoMIPS? “To see whether your system is faster than mine. Of course this is completely wrong, unreliable, ill-founded, and utterly useless, but all benchmarks suffer from this same problem. So why not use it?” – from the Linux BogoMIPS HOWTO
Standard Performance Evaluation Corporation (SPEC) • “The goal of SPEC is to ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems.” • Sells source code, including common ports • Searchable results pool • 115 members across four independent groups: • Open Systems Group (mostly vendors) • Workstation group • Graphics group • Research group (mostly academics) • Disclosure: Gartner is a member • Results generally require a standard-format report • Lists intermediate results and optimizations used • Price not included • Estimates are allowed for most benchmarks
SPEC CPU2006 • Measures CPU Integer and Floating Point capacity • Often correlates with overall system performance because server designs typically balance memory, IO, and CPU • Actually 8 different metrics: • Integer and Floating Point tests • Speed and Rate tests • Base and Peak results • ~25 different individual workloads, from games to quantum computing • Changes versions every 6-8 years • CPU92, CPU2000, CPU2006 • CPUv6 currently under development • Results not comparable between versions
SPEC CPU2006 • Depends almost entirely on CPU model, core count, clock speed • Some impact from Compiler (e.g. +15%) • Small impact from OS, cache • Floating Point impacted by memory speed • “Turbo Mode” frequency correlation • Benchmarked configurations must be “offered” • Published results are peer reviewed (by ‘competitors’) • Reviewers are PICKY!
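To make the scoring concrete: each SPEC CPU metric is a geometric mean of per-workload ratios against a fixed reference machine (reference runtime divided by measured runtime). A minimal sketch in C, using invented ratios rather than real results and making no claim to match SPEC’s own tooling:

```c
#include <math.h>
#include <stdio.h>

/* Sketch only: a SPEC CPU composite score is the geometric mean of
 * per-workload ratios (reference runtime / measured runtime).
 * The ratios below are invented for illustration, not real results. */
static double geomean(const double *ratios, size_t n) {
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(ratios[i]);
    return exp(log_sum / (double)n);
}

int main(void) {
    double ratios[] = { 18.3, 24.9, 21.1, 30.4, 26.7 };  /* hypothetical per-workload ratios */
    size_t n = sizeof ratios / sizeof ratios[0];
    printf("Composite score: %.1f\n", geomean(ratios, n));  /* prints ~23.9 */
    return 0;
}
```

Compile with something like `cc spec_sketch.c -lm`. The “rate” metrics run multiple concurrent copies, but the composite is still reported as a geometric mean of the normalized ratios.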
“Benchmark results are usually for: sales and marketing, customer awareness, customer confidence” – from a Fujitsu presentation
SPECjbb2005 • Server-side Java benchmark • Heavily dependent on the JVM • Also highly dependent on processor speed, core count, and Java garbage collection • Performance plateaus beyond a certain amount of cache and memory • Disk and network I/O play no part • Emulates a 3-tier system on a single host • Database tier is emulated, memory-resident • Useful tidbits: • Cheap & easy to run, so lots of results • Measures throughput in operations per second, with transactions similar to TPC-C’s • Full report includes performance beyond the peak • Being replaced (SPECjbb2013)
SPECjbb2013 • Released last month! • Scales more realistically than SPECjbb2005 • Includes inter-JVM communication • Includes a response-time requirement & reporting, in addition to “operations per second” • Like jbb2005, a key design goal was making it easy to run
SPECjEnterprise2010 • Java server benchmark designed to test whole-system Java EE performance • Includes database and storage • The system under test can span more than one server • Harder to set up and run than SPECjbb2005, so fewer results • Top 15 results (as of 1 Feb 2013)
Other SPEC benchmarks • Power: SPECpower_ssj2008 • HPC: SPEC MPI, SPEC OMP • File system: SPECsfs2008 • SIP messaging: SPECsip • Web: SPECweb2009, SPECweb2005 • Mail: SPECmail2009 • Cloud: SPEC Cloud • Handheld working group • SPEC also has a research group that creates benchmarks for research & development purposes
Vendor-Sponsored Application Benchmarks • SAP • Various, but SD 2-Tier is the most popular • Results published on x86 due to support requirements • Correlates with clock, cores, OS, database • Plateaus at relatively low memory sizes • Pre-2009 results not comparable to current results • Used for the SAP “QuickSizer” system-sizing tool • Oracle • Official: EBS, Siebel & PeopleSoft benchmarks, etc. • Good: workload-specific • Bad: seeing fewer results than in the past • Microsoft • Fast Track system benchmarks: MCR/BCR
Benchmarks “Dilbert”: March 02, 2009. Source: Dilbert.com.
Virtualization Benchmarks • VMmark • Includes both application and “infrastructure” workloads • DVDStore, OLIO, Exchange 2007 • Idle machine • vMotion, storage vMotion • Based on the concept of “tiles”; each tile = 8 workloads • VMware (and therefore x86) only • CPU is the bottleneck, not memory • With the same CPU, results from different vendors are almost identical • vSphere license contains a DeWitt clause • SPECvirt • Uses 3 other SPEC benchmarks as its workloads: • SPECweb2005 • SPECjAppServer2004 • SPECmail2008 • Uses a similar “tiles” concept to VMmark • Just vSphere, KVM, Xen results
Government “Benchmarks” • ENERGY STAR • Sponsored by US EPA • Rewards servers that achieve “best in class” power efficiency targets • Version 1 and upcoming Version 2 disqualify some server categories • APP • Calculated number used by US for export control reasons • Similar to MIPS
Some commercial benchmark software • Server-oriented: Quest (Dell) Benchmark Factory • Desktop-oriented: SiSoftware Sandra*, Primate Labs Geekbench*, SYSmark, Phoronix Test Suite*, Maxon Cinebench • Repositories: CloudHarmony (cloud instances) • Tools with metrics: BMC Capacity Optimization, Computer Associates Hyperformix, VMware Capacity Planner • * Includes a public repository of user-submitted results
Popular Open Source or Public Domain benchmarks • STREAM • Simple memory bandwidth test • Gets close to server theoretical maximums • LINPACK / HPL • Floating-point tests used to compare HPC & supercomputer performance • “Results should not be taken too seriously” • Other examples • PRIME95 • Terasort • DVDStore • ApacheBench • OLIO
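To show what STREAM’s headline number represents, here is a rough, unofficial C sketch of its “triad” kernel. The real benchmark runs several kernels, repeats them, and validates the results; the array size below is an arbitrary assumption chosen to exceed typical caches, not STREAM’s default.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32 * 1024 * 1024)   /* doubles per array (~256 MB each); assumed size, not STREAM's */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    const double scalar = 3.0;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);          /* POSIX monotonic timer */
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];              /* the "triad": two reads, one write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* 24 bytes move per element (2 reads + 1 write), ignoring write-allocate traffic */
    printf("Approximate triad bandwidth: %.1f GB/s (check: a[0] = %.1f)\n",
           24.0 * N / secs / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}
```

A single pass on a single thread will typically land well below the official STREAM figure (which uses repetitions and, usually, OpenMP across cores); the point here is only to illustrate the kernel being timed.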
Vendor attitudes towards benchmarks • Source: http://online.asbis.sk/documents/ORACLE/dokumenty/Kubacek_Dalibor_OS_Positioning.pdf
Benchmarks We Lack • Converged systems • Public cloud versus on-premises • “Microserver” power • 3rd-party mainframe
My advice for using other people’s benchmark results • Only use them when you’re lazy and poor (full disclosure: I am lazy and poor) • Ask vendors for non-published results • Ignore differences < 10% • For big servers, don’t divide results by the number of cores • If you’re going to use just SPEC CPU2006, use the SPEC CPU Integer Rate base results
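A trivial illustration of the “ignore differences under 10%” rule of thumb; the threshold and the two scores below are assumptions made up for the example, not values from SPEC, TPC, or any vendor:

```c
#include <stdio.h>

/* Hypothetical helper: treat two published results as a tie unless the
 * faster one leads by at least 10% (this deck's rule of thumb, not an
 * official guideline). */
static int meaningfully_faster(double a, double b) {
    double lo = a < b ? a : b;
    double hi = a < b ? b : a;
    return (hi - lo) / lo >= 0.10;
}

int main(void) {
    double server_a = 1050.0, server_b = 1000.0;   /* made-up Integer Rate base scores */
    printf("%s\n", meaningfully_faster(server_a, server_b)
                       ? "Difference is large enough to matter"
                       : "Within the noise: treat as a tie");
    return 0;
}
```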
Questions?