Staged Database Systems Thesis Oral Stavros Harizopoulos
Database world: a 30,000 ft view • OLTP: Online Transaction Processing, many short-lived requests arriving over the internet (Sarah: “Buy this book”) • DSS: Decision Support Systems, few long-running queries over data offloaded from the OLTP DBMS (Jeff: “Which store needs more advertising?”) • DB systems fuel most e-applications: improved performance has an impact on everyday life
New HW/SW requirements • More capacity, throughput, efficiency • CPUs run much faster than they can access data: relative to one CPU cycle of work, a memory access cost roughly 10 cycles in the ’80s and costs roughly 300 cycles today [figure: CPU vs. memory access latency, then and now] • DSS stress the I/O subsystem • Need to optimize all levels of the memory hierarchy
The further, the slower • Keep data close to the CPU: locality and predictability are key • Overlap memory accesses with computation • Modify algorithms and structures to exhibit more locality • DBMS core design contradicts these goals
Thread-based execution in DBMS • Queries are handled by a pool of threads • Threads execute independently, with no coordination • No means to exploit common operations across threads • StagedDB: a new design to expose locality across threads
Staged Database Systems (StagedDB) • Organize system components into stages through which queries flow • No need to change algorithms / structures • High concurrency, locality across requests
Thesis “By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests, thereby improving performance.”
Summary of main results • STEPS (OLTP, where L1-I stalls are 20-40% of execution time): 56%-96% fewer I-misses, full-system evaluation on Shore • QPipe (DSS, variable gains): 1.2x-2x throughput, full-system evaluation on BerkeleyDB • Together they target all levels of the memory hierarchy (L1 and L2-L3 I- and D-caches, RAM, disks)
Contributions and dissemination • Introduced the StagedDB design and scheduling algorithms for staged systems (CIDR’03, IEEE Data Eng. ’05, CMU-TR’02) • Built a novel query engine design: the QPipe engine maximizes data and work sharing (SIGMOD’05, ICDE’06 demo sub., CMU-TR’05, HDMS’05, VLDB J. subm.) • Addressed the instruction cache in OLTP: STEPS applies to any DBMS with few changes (VLDB’04, TODS subm.)
Outline • Introduction • QPipe • STEPS • Conclusions
Query-centric design of DB engines • Queries are evaluated independently • No means to share across queries • Need a new design to exploit common data, instructions, and work across operators
QPipe: operator-centric engine • Conventional: “one-query, many-operators” • QPipe: “one operator, many-queries” • Relational operators become mEngines • Queries break up into tasks that queue up at the mEngines
QPipe design [figure: a packet dispatcher breaks incoming query plans into packets and routes them to per-operator mEngines (mEngine-A, mEngine-J, mEngine-S, ...), each with its own queue and thread pool on top of the storage engine, in contrast to the conventional one-thread-per-query design]; a sketch of a mEngine follows
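To make the mEngine idea concrete, here is a minimal sketch in C++ (hypothetical names and structure, not the actual QPipe code): one relational operator owns a queue of work packets and a small pool of worker threads, so packets from different queries line up at the same operator code.

```cpp
// Minimal sketch of a QPipe-style mEngine (hypothetical names, not the
// actual QPipe code): one queue of work packets per relational operator,
// drained by that operator's own pool of worker threads.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Packet {              // one operator invocation on behalf of a query
    int query_id;
    std::string args;        // operator parameters (e.g., table, predicate)
};

class MicroEngine {
public:
    MicroEngine(std::string op, int workers) : op_(std::move(op)) {
        for (int i = 0; i < workers; ++i)
            pool_.emplace_back([this] { serve(); });
    }
    void enqueue(Packet p) {                 // packets from any query queue up here
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(p)); }
        cv_.notify_one();
    }
    void shutdown() {                        // drain the queue, then stop workers
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : pool_) t.join();
    }
private:
    void serve() {                           // workers run ONLY this operator's
        for (;;) {                           // code, so its instructions and
            std::unique_lock<std::mutex> lk(m_);   // data stay hot in cache
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;
            Packet p = std::move(q_.front());
            q_.pop();
            lk.unlock();
            std::cout << op_ << ": serving Q" << p.query_id
                      << " (" << p.args << ")\n";  // operator logic goes here
        }
    }
    std::string op_;
    std::queue<Packet> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::vector<std::thread> pool_;
};

int main() {
    MicroEngine scan("SCAN", 2);             // one mEngine per operator
    scan.enqueue({1, "LINEITEM"});           // tasks from different queries
    scan.enqueue({2, "LINEITEM"});           // meet at the same operator
    scan.shutdown();
}
```

Routing all requests for one operator through one queue is what creates the cross-query locality the design is after; the real engine adds scheduling policies, operator parameters, and the SP hooks described next.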
Reusing data & work in QPipe • Detect overlap at run time • Shared pages and intermediate results are simultaneously pipelined to all parent nodes [figure: Q1 and Q2 plans with the overlapping operator highlighted]; see the shared-scan sketch below
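A toy illustration of the shared-scan case of simultaneous pipelining (an assumed simplification; SP in QPipe covers other operators and the order-sensitive cases discussed later): a circular scan reads each page once and pipelines it to every attached query, and a query that arrives mid-scan attaches immediately, picking up the pages it missed on the wrap-around.

```cpp
// Toy sketch of simultaneous pipelining for an order-insensitive shared
// scan (hypothetical simplification of QPipe's SP): pages are read once
// and pipelined to every attached query; a query arriving mid-scan
// attaches at the current position and receives the missed pages on the
// wrap-around of the circular scan.
#include <cstdio>
#include <vector>

struct Consumer {
    int query_id;
    int pages_seen = 0;       // detaches after seeing every page once
};

int main() {
    const int n_pages = 4;
    std::vector<Consumer> attached = {{1}};  // Q1 starts the scan
    bool q2_attached = false;

    // Circular scan: advance the page position mod n_pages until every
    // attached consumer has received each page exactly once.
    for (int pos = 0; !attached.empty(); pos = (pos + 1) % n_pages) {
        if (pos == 2 && !q2_attached) {      // Q2 arrives mid-scan and
            attached.push_back({2});         // attaches to the live scan:
            q2_attached = true;              // no second read of the table
        }
        for (auto it = attached.begin(); it != attached.end();) {
            std::printf("page %d -> Q%d\n", pos, it->query_id);
            if (++it->pages_seen == n_pages)
                it = attached.erase(it);     // consumer has seen all pages
            else
                ++it;
        }
    }
}
```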
Mechanisms for sharing • Multi-query optimization: not used in practice • Materialized views: require workload knowledge • Buffer pool management: opportunistic • Shared scans (RedBrick, Teradata, SQL Server): limited use • QPipe complements the above approaches
Experimental setup • QPipe prototype • Built on top of BerkeleyDB, 7,000 C++ lines • Shared-memory buffers, native OS threads • Platform • 2GHz Pentium 4, 2GB RAM, 4 SCSI disks • Benchmarks • TPC-H (4GB)
Sharing order-sensitive scans [figure: two instances of TPC-H Query 4 (Q1, Q2), each an aggregate over a merge-join of index scans on LINEITEM and ORDERS, annotated with order-insensitive and order-sensitive scans] • Two clients send the query at different intervals • QPipe performs 2 separate joins
Sharing order-sensitive scans [figure: total response time (sec) vs. time difference between arrivals] • Two clients send the query at different intervals • QPipe performs 2 separate joins
TPC-H workload [figure: throughput (queries/hr) vs. number of clients] • Clients use a pool of 8 TPC-H queries • QPipe reuses large scans, runs up to 2x faster...while maintaining low response times
QPipe: conclusions • DB engines evaluate queries independently • Limited existing mechanisms for sharing • QPipe requires few code changes • SP is a simple yet powerful technique • Allows dynamic sharing of data and work • Other benefits (not described here): I-cache and D-cache performance, efficient execution of MQO plans
Outline • Introduction • QPipe • STEPS • Conclusions
Online Transaction Processing [figure: cache size vs. year introduced (’96-’04): max on-chip L2/L3 caches grow toward 10MB while L1-I sizes for various CPUs stay around 10KB-100KB] • High-end servers, non I/O bound • L1-I stalls are 20-40% of execution time • Instruction caches cannot grow • Need a solution for instruction cache-residency
Related work • Hardware and compiler approaches • Increased block size, stream buffer [Ranganathan98] • Code layout optimizations [Ramirez01] • Database software approaches • Instruction cache for DSS [Padmanabhan01] [Zhou04] • Instruction cache for OLTP: challenging!
STEPS for cache-resident code • STEPS: Synchronized Transactions through Explicit Processor Scheduling • Transaction code (Begin, Select, Update, Insert, Delete, Commit) is still larger than the I-cache • Keep the thread model, insert synchronization points: multiplex execution, reuse instructions • Microbenchmark: eliminates 96% of L1-I misses • TPC-C: eliminates 2/3 of misses, 1.4 speedup
I-cache aware context-switching [figure: without STEPS, thread 1 runs all of select() (code larger than the I-cache) and misses on every segment s1-s7, then thread 2 misses on all of them again; with STEPS, a context-switch (CTX) point after each cache-sized group of segments lets thread 2 hit on the instructions thread 1 just loaded]; a runnable sketch of the idea follows
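Below is a runnable sketch of the scheduling idea using POSIX ucontext user-level contexts on Linux (the real STEPS instead modifies Shore's thread package; segment boundaries and names here are illustrative): every team member executes the same operator code, and an explicit CTX call after each cache-sized segment hands the CPU round-robin to the next member, which then runs on the just-loaded instructions.

```cpp
// Sketch of STEPS-style explicit processor scheduling with ucontext
// (hypothetical names; the real system modifies Shore's thread package).
// A team of threads runs the same operator; CTX points after each
// cache-sized code segment multiplex execution so instructions are reused.
#include <cstdio>
#include <ucontext.h>

const int TEAM = 3;
ucontext_t ctx[TEAM], main_ctx;
char stacks[TEAM][64 * 1024];
int current = 0;
int alive = TEAM;
bool done[TEAM] = {};

void ctx_point() {                     // explicit CTX: round-robin to next live thread
    int prev = current;
    do { current = (current + 1) % TEAM; } while (done[current]);
    if (current != prev) swapcontext(&ctx[prev], &ctx[current]);
}

void finish() {                        // leave the team when the transaction is done
    done[current] = true;
    int prev = current;
    if (--alive == 0) { swapcontext(&ctx[prev], &main_ctx); return; }
    do { current = (current + 1) % TEAM; } while (done[current]);
    swapcontext(&ctx[prev], &ctx[current]);
}

void index_fetch() {                   // same operator code for every team member
    std::printf("thread %d: probe index root\n", current);
    ctx_point();                       // this segment is now in the I-cache:
    std::printf("thread %d: traverse leaf\n", current);   // let the peers reuse it
    ctx_point();
    std::printf("thread %d: fetch record\n", current);
    finish();
}

int main() {
    for (int i = 0; i < TEAM; ++i) {
        getcontext(&ctx[i]);
        ctx[i].uc_stack.ss_sp = stacks[i];
        ctx[i].uc_stack.ss_size = sizeof stacks[i];
        ctx[i].uc_link = &main_ctx;    // safety net; finish() normally exits first
        makecontext(&ctx[i], index_fetch, 0);
    }
    swapcontext(&main_ctx, &ctx[0]);   // run the team to completion
    std::puts("team finished");
}
```

The fixed round-robin order is what makes the CTX cheap: no scheduler decision is needed, only a swap to the next team member.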
Placing CTX calls in source • AutoSTEPS tool: valgrind records the instruction memory references of the DBMS binary, a STEPS simulation picks the memory addresses where CTX calls belong, and gdb maps those addresses to the source lines (e.g., file1.c:30, file2.c:40) where CTX is inserted • Evaluation: comparable performance to manual placement...while being more conservative • A sketch of the core pass follows
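The core pass can be sketched as follows (a hypothetical simplification: the real AutoSTEPS simulates the STEPS execution and the I-cache rather than merely counting distinct cache lines): walk the instruction-address trace and emit a CTX point whenever the code touched since the last CTX no longer fits in the L1-I cache.

```cpp
// Sketch of the AutoSTEPS core pass (hypothetical simplification): read an
// instruction-address trace (one hex address per line, e.g. collected with
// valgrind) and emit a CTX point whenever the code footprint since the last
// CTX would overflow the L1-I cache. gdb would then map each emitted
// address back to a source line where the CTX call is inserted.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

int main() {
    const uint64_t BLOCK = 64;                    // I-cache line size (bytes)
    const uint64_t CAPACITY = 64 * 1024 / BLOCK;  // 64KB L1-I, in lines
    std::unordered_set<uint64_t> footprint;       // lines touched since last CTX
    unsigned long long addr;
    while (std::scanf("%llx", &addr) == 1) {
        footprint.insert(addr / BLOCK);
        if (footprint.size() > CAPACITY) {        // code no longer fits:
            std::printf("insert CTX before 0x%llx\n", addr);
            footprint.clear();                    // a new segment starts here
            footprint.insert(addr / BLOCK);
        }
    }
}
```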
Experimental setup (1st part) • Implemented on top of Shore • AMD AthlonXP • 64KB L1-I + 64KB L1-D, 256KB L2 • Microbenchmark • Index fetch, in-memory index • Fast CTX for both systems, warm cache
Microbenchmark: L1-I misses [figure: L1-I cache misses (up to 4K) vs. concurrent threads (1-10), AthlonXP] • STEPS eliminates 92-96% of misses for additional threads
L1-I misses & speedup [figure: miss reduction (up to 100%) and speedup (up to 1.4) vs. concurrent threads (10-80), AthlonXP] • STEPS achieves max performance for 6-10 threads • No need for larger thread groups
Challenges in full-system operation So far: • Threads are interested in same Op • Uninterrupted flow • No thread scheduler Full-system requirements • High concurrency on similar Ops • Handle exceptions • Disk I/O, locks, latches, abort • Co-exist with system threads • Deadlock detection, buffer pool housekeeping
System design [figure: transactions queue up at per-operator STEPS wrappers (Op X, Op Y, Op Z); each wrapper schedules an execution team, and a thread that hits an exception leaves as a stray thread before moving on to other Ops] • Fast CTX through fixed scheduling • Repair thread structures at exceptions • Modify only the thread package
Experimental setup (2nd part) • AMD AthlonXP • 64KB L1-I + 64KB L1-D, 256KB L2 • TPC-C (wholesale parts supplier) • 2GB RAM, 2 disks • 10-30 Warehouses (1-3GB), 100-300 users • Zero think time, in-memory, lazy commits
One transaction: payment [figure: normalized cycles and L1-I misses vs. number of users] • STEPS outperforms the baseline system • 1.4 speedup, 65% fewer L1-I misses
Mix of four transactions [figure: normalized cycles and L1-I misses vs. number of users] • The transaction mix reduces team size • Still, 56% fewer L1-I misses
STEPS: conclusions • STEPS can handle full OLTP workloads • Significant improvements in TPC-C • 65% fewer L1-I misses • 1.2 – 1.4 speedup STEPS minimizes both capacity / conflict misses without increasing I-cache size / associativity
StagedDB: future work • Promising platform for Chip-Multiprocessors • DBMS suffer from CPU-to-CPU cache misses • StagedDB allows work to follow data -- not the other way around! • Resource scheduling • Stages cluster requests for DB locks, I/O • Potential for deeper, more effective scheduling
Conclusions • New hardware, new requirements • Server core design remains the same • Need a new design to fit modern hardware • StagedDB optimizes all memory hierarchy levels: a promising design for future installations
The speaker would like to thank: his academic advisor Anastassia Ailamaki his thesis committee members Panos K. Chrysanthis, Christos Faloutsos, Todd C. Mowry, and Michael Stonebraker and his coauthors Kun Gao, Vladislav Shkapenyuk, and Ryan Williams Thank you
A mEngine in detail [figure: a mEngine wraps the relational operator code with a queue, a main routine and its parameters, a scheduling thread, and pools of busy and free threads; simultaneous pipelining hooks into the queue] • Tuple batching improves I-cache behavior; query grouping improves I- and D-cache behavior [Padmanabhan01 (ICDE), Zhou04 (SIGMOD), Harizopoulos04 (VLDB), Zhou03 (VLDB)]
Simultaneous Pipelining in QPipe [figure: without SP, Q1 and Q2 each read, join, and write independently; with SP, the SP coordinator attaches Q2 to Q1's in-progress join (steps 1-4), and the shared results are pipelined or copied to both queries until COMPLETE]
Sharing data & work across queries • Query 1: “Find average age of students enrolled in both class A and class B” (plan: aggregate over a merge-join of scans on TABLE A and TABLE B) • Query 2 contains the same join of TABLE A and TABLE B: a work sharing opportunity (max sharing) • Query 3 only scans TABLE A: a data sharing opportunity (min sharing)
Sharing opportunities at run time [figure: without SP, Q1 and Q2 each run operator R and produce its results independently; with SP, the coordinator pipelines the results of Q1’s R to Q2, so Q2’s sharing potential depends on how far Q1’s result production for R has progressed] • Q1 executes operator R • Q2 arrives with R in its plan
TPC-H workload [figure: average response time vs. think time (sec), and throughput (queries/hr) vs. number of clients] • Clients use a pool of 8 TPC-H queries • QPipe reuses large scans, runs up to 2x faster...while maintaining low response times
Smaller L1-I cache [figure: normalized counts for 10 threads on AthlonXP and Pentium III: cycles, instruction stalls (cycles), branches, branch mispredictions, L1-I misses, L1-D misses, branches missing the BTB] • STEPS outperforms Shore even on smaller caches (PIII) • 62-64% fewer mispredicted branches on both CPUs
SimFlex: L1-I misses [figure: L1-I cache misses (up to 10K) for 10 threads, 64B cache blocks, AthlonXP, across associativities from direct-mapped through 2-way, 4-way, and 8-way to fully associative] • STEPS eliminates all capacity misses (16KB, 32KB caches) • Up to 89% overall miss reduction (upper limit is 90%)
One Xaction: payment [figure: normalized cycles, mispredicted branches, L1-I, L2-I, L1-D, and L2-D misses vs. number of warehouses] • STEPS outperforms Shore • 1.4 speedup, 65% fewer L1-I misses • 48% fewer mispredicted branches
Mix of four Xactions [figure: normalized cycles, mispredicted branches, L1-I, L2-I, L1-D, and L2-D misses vs. number of warehouses] • The Xaction mix reduces average team size (4.3 in 10W) • Still, STEPS has 56% fewer L1-I misses (out of 77% max)