  Design and Evaluation of Architectures for Commercial Applications. Part I: benchmarks. Luiz André Barroso

  Why should architects learn about commercial applications? • Because they are very different from typical benchmarks • Because they are demanding on many interesting architectural features • Because they are driving the sales of mid-range and high-end systems

  Shortcomings of popular benchmarks • SPEC • uniprocessor-oriented • small cache footprints • exacerbates impact of CPU core issues • SPLASH • small cache footprints • extremely optimized sharing • STREAMS • no real sharing/communication • mainly bandwidth-oriented

  SPLASH vs. Online Transaction Processing (OLTP) • Compared with an OLTP application, a typical SPLASH application has: • > 3x the issue rate • ~26x fewer cycles spent in memory barriers • 1/4 of the TLB miss ratio • < 1/2 the fraction of cache-to-cache transfers • ~22x smaller instruction cache miss ratio • ~1/2 the L2 cache miss ratio

  But the real reason we care? $$$! • Server market: • Total: > $50 billion • Numeric/scientific computing: < $2 billion • Remaining $48 billion? • OLTP • DSS • Internet/Web • Trend is for numerical/scientific to remain a niche

  Relevance of server vs. PC market • High profit margins • Performance is a differentiating factor • If you sell the server you will probably sell: • the client • the storage • the networking infrastructure • the middleware • the service • ...

  Need for speed in the commercial market • Applications pushing the envelope • Enterprise resource planning (ERP) • Electronic commerce • Data mining/warehousing • ADSL servers • Specialized solutions • Intel splitting Pentium line into 3 tiers • Oracle's raw iron initiative • Network Appliance's machines

  Seminar disclaimer • Hardware-centric approach: • target is to build better machines, not better software • focus on fundamental behavior, not on software "features" • Stick to general-purpose paradigm • Emphasis on CPU+memory system issues • Lots of things missing: • object-relational and object-oriented databases • public domain/academic database engines • many others

  Overview • Day I: Introduction and workloads • Background on commercial applications • Software structure of a commercial RDBMS • Standard benchmarks • TPC-B • TPC-C • TPC-D • TPC-W • Cost and pricing trends • Scaling down TPC benchmarks

  Overview (2) • Day II: Evaluation methods/tools • Introduction • Software instrumentation (ATOM) • Hardware measurement & profiling • IPROBE • DCPI • ProfileMe • Tracing & trace-driven simulation • User-level simulators • Complete machine simulators (SimOS)

  Overview (3) • Day III: Architecture studies • Memory system characterization • Out-of-order processors • Simultaneous multithreading • Final remarks

  Background on commercial applications • Database applications: • Online Transaction Processing (OLTP) • massive number of short queries • read/update indexed tables • canonical example: banking system • Decision Support Systems (DSS) • smaller number of complex queries • mostly read-only over large (non-indexed) tables • canonical example: business analysis

  Background (2) • Web/Internet applications • Web server • many requests for small/medium files • Proxy • many short-lived connection requests • content caching and coherence • Web search index • DSS with a Web front-end • E-commerce site • OLTP with a Web front-end

  Background (3) • Common characteristics • Large amounts of data manipulation • Interactive response times required • Highly multithreaded by design • suitable for large multiprocessors • Significant I/O requirements • Extensive/complex interactions with the operating system • Require robustness and resiliency to failures

  Database performance bottlenecks • I/O-bound until recently (Thakkar, ISCA'90) • Many improvements since then • multithreading of DB engine • I/O prefetching • VLM (very large memory) database caching • more efficient OS interactions • RAIDs • non-volatile DRAM (NVDRAM) • Today's bottlenecks: • Memory system • Processor architecture

  Structure of a database workload • [Figure: three tiers connected in sequence] • Clients: simple logic checks • Application server (optional): formulates and issues DB query • Database server: executes query

  Who is who in the database market? • DB engine: • Oracle is dominant • other players: Microsoft, Sybase, Informix • Database applications: • SAP is dominant • other players: Oracle Apps, PeopleSoft, Baan • Hardware: • players: Sun, IBM, HP and Compaq

  Who is who in the database market? (2) • Historically, mainly mainframes with proprietary OS • Today: • Unix: 40% • NT: 8% • Proprietary: 52% • In two years: • Unix: 46% • NT: 19% • Proprietary: 35%

  Overview of an RDBMS: Oracle8 • Similar in structure to most commercial engines • Runs on: • uniprocessors • SMP multiprocessors • NUMA multiprocessors* • For clusters or message passing multiprocessors: • Oracle Parallel Server (OPS)

  The Oracle RDBMS • Physical structure • Control files • basic info on the database, its structure and status • Data files • tables: actual database data • indexes: sorted list of pointers to data • rollback segments: keep data for recovery upon a failed transaction • Log files • compressed storage of DB updates

  Index files • Critical in speeding up access to data by avoiding expensive scans • The more selective the index, the faster the access • Drawbacks: • Very selective indexes may occupy lots of storage • Updates to indexed data are more expensive
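
The trade-off on this slide can be illustrated with a minimal sketch. The table contents and the function names (`full_scan`, `index_lookup`) are hypothetical, and a sorted in-memory list stands in for an on-disk index; real engines use B-tree style structures.

```python
import bisect

# Hypothetical table: (account_id, balance) rows stored in arbitrary order.
rows = [(17, 250.0), (3, 90.0), (42, 10.0), (8, 1200.0)]


def full_scan(rows, key):
    """Examine rows one by one until the key is found: O(n) comparisons."""
    for row in rows:
        if row[0] == key:
            return row
    return None


# The index is a sorted list of (key, row position) pointers, built once.
index = sorted((row[0], pos) for pos, row in enumerate(rows))
index_keys = [k for k, _ in index]


def index_lookup(rows, key):
    """Binary-search the sorted keys, then follow the pointer: O(log n)."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return rows[index[i][1]]
    return None


print(index_lookup(rows, 42))
```

Both drawbacks listed on the slide are visible here: the index is extra storage alongside the table, and any insert or key update would also have to keep `index` sorted.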

  Files or raw disk devices • Most DB engines can directly access disks as raw devices • Idea is to bypass the file system • Manageability/flexibility somewhat compromised • Performance boost not large (~10-15%) • Most customer installations use file systems

  Transactions & rollback segments • Single transaction can access/update many items • Atomicity is required: • transaction either happens or not • old value of balance(X) is kept in a rollback segment • rollback: old values restored, all locks released • Example: bank transfer
    Transaction A (accounts X,Y; value M) {
      read account balance(X)
      subtract M from balance(X)
      add M to balance(Y)
      commit
    }
  [Figure annotation: a failure strikes while the transaction is in progress]
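
A minimal sketch of how a rollback segment gives atomicity to the bank transfer. The dictionary-based "database", the `InjectedFailure` exception, and the `fail_midway` switch are all assumptions made for illustration; they are not Oracle's implementation.

```python
class InjectedFailure(Exception):
    """Stands in for a crash or abort in the middle of a transaction."""


def transfer(balances, x, y, m, fail_midway=False):
    """Move m from account x to account y, atomically."""
    rollback_segment = {}                 # old value of every item we modify
    try:
        rollback_segment[x] = balances[x]
        balances[x] -= m
        if fail_midway:
            raise InjectedFailure("failure between the two updates")
        rollback_segment[y] = balances[y]
        balances[y] += m
        return True                       # commit: old values can be discarded
    except InjectedFailure:
        balances.update(rollback_segment) # rollback: restore the old values
        return False


accounts = {"X": 100, "Y": 50}
transfer(accounts, "X", "Y", 30)                     # commits
transfer(accounts, "X", "Y", 30, fail_midway=True)   # rolls back
print(accounts)
```

After the failed second transfer the balances are exactly what the first, committed transfer left behind, so the transaction "either happens or not".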

  Transactions & log files • A transaction is only committed after its side effects are in stable storage • Writing all modified DB blocks would be too expensive • random disk writes are costly • a whole DB block has to be written back • no coalescing of updates • Alternative: write only a log of modifications • sequential I/O writes (enables NVDRAM optimizations) • batching of multiple commits • Background process periodically writes dirty data blocks out

  Transactions & log files (2) • When a block is written to disk the log file entries are deleted • If the system crashes: • in-memory dirty blocks are lost • Recovery procedure: • goes through the log files and applies all updates to the database
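
The logging and recovery scheme from the last two slides can be sketched as follows. `TinyDB` and its method names are hypothetical; Python dictionaries and lists stand in for data blocks and the sequential log file, and the "checkpoint" is a simplification of the background writer.

```python
class TinyDB:
    def __init__(self):
        self.disk = {}    # stable storage: data blocks
        self.log = []     # stable storage: sequential redo log
        self.cache = {}   # volatile memory: cached (possibly dirty) blocks

    def commit(self, updates):
        # One sequential log append makes the transaction durable;
        # the modified blocks themselves stay dirty in memory.
        self.log.append(dict(updates))
        self.cache.update(updates)

    def checkpoint(self):
        # Background writer: flush dirty blocks, then drop the log
        # entries that are no longer needed.
        self.disk.update(self.cache)
        self.log.clear()

    def crash(self):
        self.cache = {}   # in-memory dirty blocks are lost

    def recover(self):
        # Replay the log in order and apply every update to the database.
        for updates in self.log:
            self.disk.update(updates)
        self.log.clear()
        self.cache = dict(self.disk)


db = TinyDB()
db.commit({"X": 70})
db.checkpoint()
db.commit({"Y": 80})   # durable in the log only
db.crash()
db.recover()
print(db.disk)
```

The update to `Y` never reached a data block before the crash, yet it survives because the log entry did.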

  Transactions & concurrency control • Many transactions in-flight at any given time • Locking of data items is required • Lock granularity: • Efficient row-level locking is needed for high transaction throughput

  Row-level locking • Each new transaction is assigned a unique ID • A transaction table keeps track of all active transactions • Lock: write ID in directory entry for row • Unlock: remove ID from transaction table • Simultaneous release of all locks • [Figure: transaction table and data block]
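
A minimal sketch of the locking scheme described on this slide, under the assumption that a row counts as locked only while the ID stored in its directory entry still appears in the transaction table. The `LockManager` class and its method names are hypothetical.

```python
import itertools


class LockManager:
    def __init__(self):
        self._next_id = itertools.count(1)
        self.active = set()    # transaction table: IDs of active transactions
        self.row_owner = {}    # per-row directory entry: ID of the last locker

    def begin(self):
        txn = next(self._next_id)   # each new transaction gets a unique ID
        self.active.add(txn)
        return txn

    def lock(self, txn, row):
        owner = self.row_owner.get(row)
        if owner is not None and owner != txn and owner in self.active:
            return False            # held by another active transaction
        self.row_owner[row] = txn   # write our ID in the row's entry
        return True

    def end(self, txn):
        # Removing one ID from the transaction table releases every row
        # lock at once; stale IDs left in row entries are simply ignored.
        self.active.discard(txn)


lm = LockManager()
a, b = lm.begin(), lm.begin()
lm.lock(a, "row1")
lm.lock(a, "row2")
print(lm.lock(b, "row1"))   # blocked while a is active
lm.end(a)
print(lm.lock(b, "row1"))   # granted after a ends
```

The design choice worth noting is that unlock cost does not depend on how many rows the transaction touched.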

  Transaction read consistency • A transaction that reads a full table should see a consistent snapshot • For performance, reads shouldn't lock a table • Problem: intervening writes • Solution: leverage rollback mechanism • intervening write saves old value in rollback segment
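
One way to picture the solution is the simplified sketch below. The logical clock and the `SnapshotTable` class are assumptions introduced for illustration; the slide only states that the intervening write saves the old value in a rollback segment.

```python
class SnapshotTable:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.rollback = {}   # key -> [(time old value was replaced, old value)]
        self.clock = 0       # hypothetical logical clock, one tick per write

    def write(self, key, value):
        self.clock += 1
        # Save the old value before overwriting it.
        self.rollback.setdefault(key, []).append((self.clock, self.rows[key]))
        self.rows[key] = value

    def start_read(self):
        return self.clock    # the reader's snapshot time; no lock is taken

    def read(self, key, snapshot):
        # If the row was overwritten after the snapshot, return the value
        # that was current at snapshot time (the oldest such saved value).
        for replaced_at, old_value in self.rollback.get(key, []):
            if replaced_at > snapshot:
                return old_value
        return self.rows[key]


t = SnapshotTable({"X": 100, "Y": 50})
snap = t.start_read()
t.write("Y", 80)            # intervening writes during the long read
t.write("Y", 90)
print(t.read("X", snap), t.read("Y", snap))
```

The reader sees both rows as they were when it started, while the writer was never blocked.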

  Oracle: software structure • Server processes • actual execution of transactions • DB writer • flush dirty blocks to disk • Log writer • writes redo logs to disk at commit time • Process and system monitors • misc. activity monitoring and recovery • Processes communicate through SGA and IPC

  Oracle: software structure (2) • System Global Area (SGA): • shared memory segment mapped by all processes • Block buffer area • cache of database blocks • larger portion of physical memory • Metadata area • where most communication takes place • synchronization structures • shared procedures • directory information • [Figure: SGA layout along increasing virtual address; labels: fixed region, shared pool, metadata area, data dictionary, redo buffers, block buffer area]

  Oracle: software structure (3) • Hiding I/O latency: • many server processes per processor • large block buffer area • Process dynamics: • server reads/updates database • (allocates entries in the redo buffer pool) • at commit time server signals Log writer and sleeps • Log writer wakes up, coalesces multiple commits and issues log file write • after log is written, Log writer signals suspended servers
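
The process dynamics above amount to a group commit. The sketch below simulates it sequentially, without real processes or signals; the `LogWriter` class, its method names, and the string redo entries are hypothetical.

```python
class LogWriter:
    def __init__(self):
        self.redo_buffer = []   # redo entries not yet written to the log file
        self.waiting = []       # servers sleeping until their commit is durable
        self.log_file = []      # each element represents one physical log write

    def server_commit(self, server, redo_entries):
        # The server adds its redo entries, signals the log writer, and sleeps.
        self.redo_buffer.extend(redo_entries)
        self.waiting.append(server)

    def flush(self):
        # The log writer coalesces all pending commits into one sequential
        # write, then wakes every server that was waiting on it.
        if not self.redo_buffer:
            return []
        self.log_file.append(list(self.redo_buffer))
        self.redo_buffer.clear()
        woken, self.waiting = self.waiting, []
        return woken


lw = LogWriter()
lw.server_commit("server1", ["X=70"])
lw.server_commit("server2", ["Y=80", "Z=1"])
print(lw.flush(), len(lw.log_file))
```

Two commits cost a single log write here, which is the batching benefit the logging slides refer to.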

  Oracle: NUMA issues • Single SGA region complicates NUMA localization • Single log writer process becomes a bottleneck • Oracle8 is incorporating NUMA-friendly optimizations • Current large NUMA systems use OPS even on a single address space

  Oracle Parallel Server (OPS) • Runs on clusters of SMPs/NUMAs • Layered on top of RDBMS engine • Shared data through disk • Performance very dependent on how well data can be partitioned • Not supported by most application vendors

  Running Oracle: other issues • Most memory allocated to block buffer area • Need to eliminate OS double buffering • Best performance attained by limiting process migration • In large SMPs, dedicating one processor to I/O may be advantageous

  TPC Database Benchmarks • Transaction Processing Performance Council (TPC) • Established about 10 years ago • Mission: define representative benchmark standards for vendors (hardware/software) to compare their products • Focus on both performance and price/performance • Strict rules about how the benchmark is run • The only widely used database benchmarks

  TPC pricing rules • Must include • All hardware • server, I/O, networking, switches, clients • All software • OS, any middleware, database engine • 5-year maintenance contract • Can include usual discounts • Audited components must be products

  TPC history of benchmarks • TPC-A • First OLTP benchmark • Based on Jim Gray's Debit-Credit benchmark • TPC-B • Simpler version of TPC-A • Meant as a stress test of the server only • TPC-C • Current TPC OLTP benchmark • Much more complex than TPC-A/B • TPC-D • Current TPC DSS benchmark • TPC-W • New Web-based e-commerce benchmark

  The TPC-B benchmark • Models a bank with many branches • 1 transaction type: account update • Begin transaction • Update account balance • Write entry in history table • Update teller balance • Update branch balance • Commit • Metrics: • tpsB (transactions/second) • $/tpsB • Scale requirement: • 1 tpsB needs 100,000 accounts • [Figure: schema with Branch, Teller, Account and History tables; each branch has 10 tellers and 100,000 accounts]
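
The account-update transaction can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the transaction's shape, not a compliant TPC-B implementation: the table and column names are assumptions, and the sketch loads 100 accounts rather than the 100,000 the scale rule requires.

```python
import sqlite3


def make_bank(conn, tellers=10, accounts=100):
    """Create one branch with its tellers and accounts (toy scale)."""
    conn.executescript("""
        CREATE TABLE branch  (id INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE teller  (id INTEGER PRIMARY KEY, branch_id INTEGER, balance INTEGER);
        CREATE TABLE account (id INTEGER PRIMARY KEY, branch_id INTEGER, balance INTEGER);
        CREATE TABLE history (account_id INTEGER, teller_id INTEGER,
                              branch_id INTEGER, delta INTEGER);
    """)
    conn.execute("INSERT INTO branch VALUES (1, 0)")
    conn.executemany("INSERT INTO teller VALUES (?, 1, 0)",
                     [(t,) for t in range(1, tellers + 1)])
    conn.executemany("INSERT INTO account VALUES (?, 1, 0)",
                     [(a,) for a in range(1, accounts + 1)])
    conn.commit()


def account_update(conn, account_id, teller_id, branch_id, delta):
    """Run the four updates as one transaction; return the new balance."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (delta, account_id))
        conn.execute("INSERT INTO history VALUES (?, ?, ?, ?)",
                     (account_id, teller_id, branch_id, delta))
        conn.execute("UPDATE teller SET balance = balance + ? WHERE id = ?",
                     (delta, teller_id))
        conn.execute("UPDATE branch SET balance = balance + ? WHERE id = ?",
                     (delta, branch_id))
        row = conn.execute("SELECT balance FROM account WHERE id = ?",
                           (account_id,)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
make_bank(conn)
print(account_update(conn, account_id=7, teller_id=3, branch_id=1, delta=50))
```

Every transaction updates the single branch row, which hints at why this benchmark stresses concurrency control on the server.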

  TPC-B: other requirements • System must be ACID • (A)tomicity • transactions either commit or leave the system as if they were never issued • (C)onsistency • transactions take the system from one consistent state to another • (I)solation • concurrent transactions execute as if in some serial order • (D)urability • results of committed transactions are resilient to faults

  The TPC-C benchmark • Current TPC OLTP benchmark • Moderately complex OLTP • Models a wholesale supplier managing orders • Workload consists of five transaction types • Users and database scale linearly with throughput • Specification was approved July 23, 1992

  TPC-C: schema

  TPC-C: transactions • New-order: enter a new order from a customer • Payment: update customer balance to reflect a payment • Delivery: deliver orders (done as a batch transaction) • Order-status: retrieve status of customer's most recent order • Stock-level: monitor warehouse inventory

  TPC-C: transaction flow

  TPC-C: other requirements • Transparency • tables can be split horizontally and vertically provided it is hidden from the application • Skew • 1% of new-order txn are to a random remote warehouse • 15% of payment txn are to a random remote warehouse • Metrics: • performance: new-order transactions/minute (tpmC) • cost/performance: $/tpmC

  TPC-C: scale • Maximum of 12 tpmC per warehouse • Consequently: • A quad-Xeon system today (~20,000 tpmC) needs • over 1668 warehouses • over 1 TB of disk storage!! • That's a VERY expensive benchmark to run!
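
The warehouse count follows directly from the per-warehouse cap. A small sketch of the arithmetic, using the slide's limit of 12 tpmC per warehouse; the function name is hypothetical, and the slide's throughput figure is approximate, so its warehouse count differs slightly from the value computed for exactly 20,000 tpmC.

```python
import math

MAX_TPMC_PER_WAREHOUSE = 12   # cap quoted on the slide


def min_warehouses(target_tpmc):
    """Smallest number of warehouses that permits the target throughput."""
    return math.ceil(target_tpmc / MAX_TPMC_PER_WAREHOUSE)


print(min_warehouses(20_000))    # quad-Xeon class result
print(min_warehouses(100_000))   # 100 KtpmC class result
```

Because users and database size scale linearly with throughput, every gain in tpmC forces a proportionally larger database and disk farm.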

  TPC-C: side effects of the skew rules • Very small fraction of transactions go to remote warehouses • Transparency rules allow data partitioning • Consequence: • Clusters of powerful machines show exceptional numbers • Compaq holds the current TPC-C record of over 100 KtpmC with an 8-node Memory Channel cluster • Skew rules are expected to change in the future

  The TPC-D benchmark • Current DSS benchmark from TPC • Moderately complex decision support workload • Models a worldwide reseller of parts • Queries ask real world business questions • 17 ad hoc DSS queries (Q1 to Q17) • 2 update queries

  TPC-D: schema (table: number of rows) • Customer: SF*150K • Nation: 25 • Region: 5 • Order: SF*1500K • Supplier: SF*10K • Part: SF*200K • LineItem: SF*6000K • PartSupp: SF*800K

  TPC-D: scale • Unlike TPC-C, scale not tied to performance • Size determined by a Scale Factor (SF) • SF = {1,10,30,100,300,1000,3000,10000} • SF=1 means a 1GB database size • Majority of current results are in the 100GB and 300GB range • Indices and temporary tables can significantly increase the total disk capacity required (3-5x is typical)
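
A short sketch of how the scale factor translates into table sizes and disk capacity, using the row counts from the schema slide. The function names are hypothetical, and the disk estimate assumes that "3-5x" means a multiple of the raw database size (SF gigabytes).

```python
# Rows per unit of scale factor, from the schema slide.
ROWS_PER_SF = {
    "Customer": 150_000,
    "Order": 1_500_000,
    "Supplier": 10_000,
    "Part": 200_000,
    "LineItem": 6_000_000,
    "PartSupp": 800_000,
}
FIXED_ROWS = {"Nation": 25, "Region": 5}   # do not grow with SF


def table_rows(sf):
    """Row count of every table at the given scale factor."""
    rows = {name: sf * per_sf for name, per_sf in ROWS_PER_SF.items()}
    rows.update(FIXED_ROWS)
    return rows


def disk_estimate_gb(sf, overhead=(3, 5)):
    """(low, high) disk capacity in GB, assuming SF=1 is a 1GB database."""
    low, high = overhead
    return (sf * low, sf * high)


print(table_rows(100)["LineItem"])
print(disk_estimate_gb(300))
```

At SF=100 the LineItem table alone holds 600 million rows, which is why most queries are dominated by scans of that table.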

  TPC-D example query • Forecasting Revenue Query (Q6) • This query quantifies the amount of revenue increase that would have resulted from eliminating company-wide discounts in a given percentage range in a given year. Asking this type of "what if" query can be used to look for ways to increase revenues • Considers all line-items shipped in a year • Query definition: SELECT SUM(

