CS411 Design of Database Management Systems Lecture 12: DB Benchmarking Kevin C. Chang
Why benchmark? What does it measure? Price Functionality Performance CS411
Why Benchmarks? • The three most important aspects of a DBMS? • functionality, price, and performance • Performance is hard to figure out • what to implement? How to implement? • Performance is hard to compare • response time? throughput? cost? • ease of use? maintenance? CS411
Before the Wisconsin Benchmarks • Vendors quoted performance numbers for marketing • but none were published, verified, or even comparable • Big customers could afford benchmarking competitions • using real target applications • but difficult and confusing without a standard procedure • Vendors only as serious as necessary to make the sale • “Vendors had little incentive to publish their performance because it was often embarrassing.” [TBH91] • contributing little to move the state of the art forward CS411
The Wisconsin Benchmarks The Wisconsin benchmarks changed all that • around 1981 – 1983 • The benchmark: • a synthesized data set: the WISC database • controlling various parameters: • selectivity, # of duplicate tuples, # of aggregate groups • a set of 32 single-user complex SQL queries • selections, joins, projections, aggregates, updates CS411
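To make the idea of controlled parameters concrete, here is a minimal sketch (in Python with SQLite) of generating a Wisconsin-style synthetic relation; the column names and modulus choices are illustrative assumptions loosely patterned on the published WISC schema, not the official definition.

```python
import random
import sqlite3

def make_wisc_rows(n_tuples):
    """Generate rows in the spirit of the WISC synthetic relation:
    each column's value range fixes, by construction, the selectivity of
    predicates on it and the number of aggregate groups."""
    unique = list(range(n_tuples))
    random.shuffle(unique)          # unique values in random order
    for i in range(n_tuples):
        yield (
            unique[i],   # unique1 -> an equality predicate selects exactly one tuple
            i,           # unique2 -> unique, sequential
            i % 2,       # two     -> 50% selectivity
            i % 10,      # ten     -> 10% selectivity, 10 aggregate groups
            i % 100,     # hundred -> 1% selectivity, 100 aggregate groups
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenk1 (unique1 INT, unique2 INT, two INT, ten INT, hundred INT)")
conn.executemany("INSERT INTO tenk1 VALUES (?,?,?,?,?)", make_wisc_rows(10_000))

# A Wisconsin-style 1% selection via the 'hundred' column.
print(conn.execute("SELECT COUNT(*) FROM tenk1 WHERE hundred = 7").fetchone())  # (100,)
```

Because the value distributions are known by construction, a benchmark query's selectivity can be dialed in exactly rather than guessed from real data.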
The Wisconsin Benchmarking Results Several major vendors were benchmarked by name: • INGRES (university and commercial versions) • IDM (Intelligent Database Machine) of Britton-Lee • with and without DAC (database accelerator) • DIRECT (a multiprocessor DB machine) of Wisconsin • ORACLE CS411
The Wisconsin Benchmarking Results DeWitt (then an assistant prof.) vs. Ellison: “The relative poor performance of ORACLE made it apparent that the system had some fairly serious problems that needed correction, for ORACLE was typically a factor of 5 slower than INGRES and the IDM 500 on most selection queries.” “In retrospect, the reasons for this popularity were only partially due to its technical quality. The primary reason for its success was that it was the first evaluation containing impartial measures of real products.” -- DeWitt [TBH91] CS411
Consequently • Angry vendors: • one angry vendor called the author’s boss • demanded a recount (a recoding or remeasuring) • published their own numbers for the query set • began to patch problems in their DBMSs • used the Wisconsin benchmarks for regression testing • The DeWitt clause: now in most software license agreements • customers can’t publish performance numbers • DB gurus criticized the shortcomings: • the synthesized relations were hard to scale (make larger) • it is not “real” • Customers began to demand Wisconsin benchmark results CS411
Benchmark Wars Followed “Benchmark wars start if someone loses an important or visible benchmark evaluation. The loser reruns it using regional specialists and gets new and winning numbers. Then the opponent reruns it using his regional specialists and of course gets even better numbers. The loser then reruns it using some one-star gurus. This progression can continue all the way to five-star gurus. At a certain point, a special version of the system is employed, with promises that the enhanced performance features will be included in the next regular release.” [TBH91] CS411
The WB Shortcomings • Not “realistic”: • queries were of interest for the authors’ parallel platform • but did not reflect the OLTP systems of the day (banks) • Not multiuser • System price wasn’t factored in • The data set was hard to scale up (make larger) • 2MB, 10,000 tuples • systems will grow, so should benchmarks • Successors: • the Anon et al. paper: DebitCredit or “TP1” • TPC: TPC-A, TPC-B, TPC-C • addressed these shortcomings, measuring concurrent TPS CS411
The Long-term Effects of the WB • Vendors equalized their performance on the Wisc. Bench. queries (cross-vendor, release-to-release) • Gurus thought long and hard about the characteristics of a “good” DB benchmark---and they are still thinking • Vendors started to learn how to cheat on benchmarks • Customers and gurus began to think about how to stop cheating CS411
The Anon et al. Paper • Jim Gray, 1984 • an early version was distributed to professionals in academia and industry for comments • published as “Anon et al.” to suggest a group effort • The benchmark tests: • an interactive OLTP emulation: DebitCredit • modeled after the actual state of Bank of America in the early 1970s • two batch tests that stress I/O: • Scan: scans and updates 1000 records • Sort: disc sort of one million records CS411
What’s Good About DebitCredit? • Why was it popular and influential? CS411
The DebitCredit Benchmark • Relevance: modeled bank OLTP • Simplicity: • one deposit transaction over the ABTH files: • account, branch, teller, and history (logs) • Scalability: DB size scales up as TPS does • for each TPS: 100k accounts, 10 branches, 100 tellers • e.g., 10 TPS: 1000k A, 100 B, and 1000 T • Comparability: • system requirements: • 95% of transactions within a 1-second response time • configuration control: • instead of specifying “equivalent” configurations • use cost as the normalization factor • simple summary metrics: TPS, $/TPS CS411
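A small sketch of the two ideas above, the fixed scaling rule and the single deposit transaction over the account/branch/teller/history files; the SQL table and column names here are assumptions for illustration, not the wording of the actual specification.

```python
def debitcredit_scale(target_tps):
    """Scaling rule from the slide: for every 1 TPS of claimed throughput,
    the database must hold 100,000 accounts, 10 branches, and 100 tellers."""
    return {
        "accounts": 100_000 * target_tps,
        "branches": 10 * target_tps,
        "tellers":  100 * target_tps,
    }

# The single DebitCredit deposit transaction, sketched as parameterized SQL
# (hypothetical table/column names):
DEBIT_CREDIT_TXN = [
    "UPDATE account SET balance = balance + :delta WHERE account_id = :aid",
    "UPDATE teller  SET balance = balance + :delta WHERE teller_id  = :tid",
    "UPDATE branch  SET balance = balance + :delta WHERE branch_id  = :bid",
    "INSERT INTO history(account_id, teller_id, branch_id, delta, ts) "
    "VALUES (:aid, :tid, :bid, :delta, CURRENT_TIMESTAMP)",
]

print(debitcredit_scale(10))  # {'accounts': 1000000, 'branches': 100, 'tellers': 1000}
```

The scaling rule is what keeps a 10 TPS claim honest: the reported rate only counts if it is sustained against a database ten times larger than the 1 TPS configuration.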
Who Uses Benchmarks Today? • DBMS vendors • to tune, tweak, optimize, and hide weaknesses • Hardware vendors • with DBMS vendor liaisons, to prompt sales • Third-party product reviews • Academic institutions • as part of research projects • Customers • to help make buying decisions CS411
How To Cheat on a Benchmark? • Use a special version of the system that is not released or that customers will never use • Use a data organization or other non-standard approach that customers never use in practice • Disable every feature not explicitly required by the benchmark • Price the benchmarked system using special discounts • Just lie • More later CS411
How to Resolve These Problems? • Cheaters’ creativity knows no bounds • details later • “Benchmarketing” wars heated up • gurus/consultants often hired as auditors to certify results • The Transaction Processing Performance Council, 1988 • an independent “benchmark” body: • to design and publish standardized DB benchmarks • consultant Omri Serlin’s proposal, with 8 vendors initially CS411
TPC Today • http://www.tpc.org • 20+ member companies (all the major DB vendors) • Early benchmarks: • TPC-A: TPC version of DebitCredit • TPC-B: stress test of core backend DB servers • Current benchmarks: • TPC-C: OLTP warehouse order entry and inventory monitoring • TPC-H: ad-hoc decision support queries • TPC-R: also decision support, but with “non ad-hoc” queries • TPC-W: Web e-commerce transactions CS411
TPC Web Site Highlights (www.tpc.org) • Who is the fastest on each benchmark? What are the trends? What do the published numbers mean? Why are there both hardware and software vendors listed? What happened to TPC-A and TPC-B? How do we know the numbers aren’t lies? How do we find out the details of how the benchmark was run? What are the “withdrawn” results? How detailed are the benchmark specs? How is the price computed? CS411
Point to Ponder • Why are there more hardware vendors than software vendors in TPC results? CS411
Why More HW than SW Vendors? • HW evolves faster while SW is relatively “stable”? • More HW players than SW? • the SW market is saturated by just a few vendors • HW vendors view DBs as driving applications? • SW marketed on more than performance? • compatibility, market dominance CS411
What makes a good benchmark? (Gray) • Relevant • must resemble an actual class of apps • extrapolation across domains is impossible • hence special benchmarks for ECAD (OO7), Sci DBs (Sequoia), … • Portable • should be easy to implement on many different platforms • Scalable • should apply to small and large systems • e.g.: different-size DBs for different-size platforms • Simple • must be understandable, for implementability and credibility CS411
Writing a Benchmark: Focuses • Core speed and potential bottlenecks • “micro-benchmarks”, e.g., Sort • Functionalities • e.g.: Wisconsin • End-to-end “scenarios” performance: • e.g.: DebitCredit, TPC-C CS411
Writing a Benchmark: Metrics • Performance: • Wisconsin: total elapsed time of various queries • DebitCredit: TPS • defines the DebitCredit transaction as the “unit transaction” • simple, allows easy comparison • the single-number metric made the benchmark popular • Price/Performance: • Anon et al.’s approach • inspired by the bank bidding case study • 100 TPS at $5M: $50K/TPS • 100 TPS at $25M: $250K/TPS • adopted by TPC CS411
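The price/performance figures above are simply total system price divided by measured throughput; a tiny helper makes the arithmetic explicit, using the numbers quoted on the slide.

```python
def price_per_tps(system_price_dollars, measured_tps):
    """Anon et al.-style normalization: total system price / throughput."""
    return system_price_dollars / measured_tps

print(price_per_tps(5_000_000, 100))    # 50000.0  -> $50K/TPS
print(price_per_tps(25_000_000, 100))   # 250000.0 -> $250K/TPS
```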
Famous Cheating Methods • Put the entire benchmark inside a single precompiled stored procedure in the DBMS • run it with a single call (much cheaper) • no run-time query optimization • Divide the DB into several physical DBs • to hide the fact that locking is only at the DB level • Use local clients instead of remote ones CS411
Famous Cheating Methods • Use an unreleased version of the DBMS • and promise it will ship soon • Use your 5-star wizards to tune the DBMS • Leads to escalating wizard wars, especially for customer-supplied benchmarks • Help the query optimizer: • reorder query conditions for optimizer to pick the “right” plan • break query into a series of queries, or unnest subqueries • Remove functionalities: • turn off logging, locking, or anything not explicitly required CS411
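As an illustration of the “help the query optimizer” trick, here is a hypothetical hand-unnesting of a subquery into a join; the table and column names are invented for the example. The rewrite preserves the answer but steers a weak optimizer toward the plan the benchmarker wants.

```python
# Query as a benchmark might state it (illustrative schema):
NESTED = """
SELECT o.order_id
FROM   orders o
WHERE  o.customer_id IN (SELECT c.customer_id FROM customers c WHERE c.region = 'WEST')
"""

# The version a vendor's "specialist" might quietly substitute, betting that
# this particular optimizer handles joins better than nested subqueries:
UNNESTED = """
SELECT o.order_id
FROM   orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE  c.region = 'WEST'
"""
```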
Famous Problems Uncovered • Lock the entire table for one insert • Can’t handle tables larger than one disk • Bulk load too slow to be practical • Nested queries always use sequential scan • Core dumps during the benchmark • Can’t support tuples longer than 2KB CS411
Famous Problems Uncovered • During heavy multi-user insertions, the DBMS corrupts the DB • Application hangs if multiple users call SQL within 5 seconds of each other • Different number of rows in the answer with/without an Order-By clause in the query • Query requires temporary space equal to the DB size CS411
To Find Out More • TPC’s web site is a great starting point • www.tpc.org • The Benchmark Handbook [TBH91] • edited by Jim Gray, is the authoritative text • online at www.benchmarkresources.com/handbook/ CS411
Conclusion: Benchmark = Bound • The guru’s promise: • while cleverness should be rewarded, the clever people may disappear after you own the equipment • The performance guarantee: • a benchmark result is a guarantee that your performance will never exceed the published result • well, but sometimes comparing their “upper bounds” is still meaningful, better than nothing! CS411
What’s Next? • The “context” stage of the class – • some IR concepts CS411
End Of Talk CS411
When customers do their own • “A customer will typically involve one or more db vendors and a hardware vendor in this process. These organizations will not encourage the customer to conduct more thorough and detailed tests because such tests take longer and are more likely to uncover problems that might kill the sale. The customer will be encouraged to hurry the testing process and make the selection.” [TBH91] • “A [customer-defined] benchmark will present many opportunities for debate [over interpretation]. Both managers and technicians will be involved in rulings that require fundamental tradeoffs between realism, fairness, and expense. … A complex benchmark will leave managers with the feeling that Solomon had it easy.” [TBH91] CS411
When DB vendors do their own • “They like to set up and perform preliminary testing in private, bring the customer in to witness the test, and then get the customer out quickly before anything can go wrong.” [TBH91] CS411
What operations to include? • A specified mix of common operations? • E.g., concurrent sales assembly 30%, summary reports 10%, and everything else can be new order entry • Use a probability distribution to determine the type of the next operation (see the sketch below) • Harder than running each operation in isolation, or in a predetermined order CS411
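A minimal sketch of driving such a mix by drawing the next operation type from a weighted distribution; the operation names and weights come from the slide, with the remaining 60% assigned to new order entry.

```python
import random

# 30% sales assembly, 10% summary reports, 60% new order entry.
OPERATIONS = ["sales_assembly", "summary_report", "new_order_entry"]
WEIGHTS    = [0.30, 0.10, 0.60]

def next_operation():
    """Draw the next operation type at random according to the specified mix,
    so the stream interleaves unpredictably instead of running each
    operation in isolation or in a predetermined order."""
    return random.choices(OPERATIONS, weights=WEIGHTS, k=1)[0]

# e.g., a 10-operation script for one simulated client
print([next_operation() for _ in range(10)])
```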
What operations to include? • Utility functions? • Recovery, data loading, index construction… • Is logging required? Locking? What granularity? Can they be turned on/off or tuned during the benchmark? Same physical DB for all operations? • Not glamorous, but customers need them • “Panicked vendors may state that none of their customers uses level-x but uses level-y with no dire consequences. Vendor user groups are good sources of sanity.” [TBH91] CS411
What operations to include? • Where do operations originate? • Inside the DBMS (no process boundary crossings) • Remotely (How many clients? How many operations per client? How does server handle client connections? Is there think time? Where is performance measured?) • “A system that uses ten processes, each submitting one tps, may behave much differently than a system with one thousand users, each submitting at 0.01 tps.” [TBH91] • A constant arrival rate for transactions roughly doubles the tps of the system. [TBH91] CS411
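To make the client-origination questions concrete, here is a rough sketch of a closed-loop client with think time, the kind of driver whose offered load behaves quite differently from an open system with a constant arrival rate; the function and parameter names are assumptions for illustration.

```python
import random
import time

def closed_loop_client(run_txn, think_time_sec, duration_sec):
    """Closed-loop user: submit a transaction, wait out a randomized 'think
    time', then submit the next. With N such clients the offered load is
    roughly N / (think_time + service_time), unlike an open system where
    transactions arrive at a fixed rate regardless of how the server is doing."""
    response_times = []
    deadline = time.monotonic() + duration_sec
    while time.monotonic() < deadline:
        start = time.monotonic()
        run_txn()                                      # e.g., one DebitCredit transaction
        response_times.append(time.monotonic() - start)
        time.sleep(random.expovariate(1.0 / think_time_sec))  # exponential think time
    return response_times

# e.g., one client hammering a stubbed 10 ms transaction for half a second
print(closed_loop_client(lambda: time.sleep(0.01), think_time_sec=0.05, duration_sec=0.5))
```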
What operations to include? • Application logic? • “Unless you have compute-intensive applications, application logic should be the first thing eliminated from a benchmark. Application logic rarely accounts for more than 10-15% of the CPU load. Specifying application logic, then verifying it across all bidders, is time consuming and error prone.” CS411
What operations to include? • Performance requirements? • Typically: x% of the interactive transactions must complete within y seconds. • For x: “Naïve folks will use the average response time; more sophisticated specifiers will opt for the 90th … percentile.” • For y: “Your friendly vendors will help you keep it realistic.” Very low values for y mean very low CPU utilization. CS411
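A small sketch of checking an "x% within y seconds" requirement against measured response times, using the 90th percentile rather than the average, as the slide suggests; the percentile and limit values are placeholders that a real benchmark specification would pin down.

```python
import statistics

def meets_requirement(response_times_sec, pct=90, limit_sec=2.0):
    """True if the pct-th percentile of measured response times is within
    the limit, i.e. at least pct% of transactions completed in time."""
    cut_points = statistics.quantiles(response_times_sec, n=100)  # 99 percentile cut points
    return cut_points[pct - 1] <= limit_sec

times = [0.3, 0.5, 0.8, 1.1, 1.4, 0.7, 2.5, 0.6, 0.9, 1.0]
# False here: the single 2.5 s outlier pushes the 90th percentile past the limit.
print(meets_requirement(times, pct=90, limit_sec=2.0))
```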