Staged Database Systems Thesis Oral Stavros Harizopoulos
Database world: a 30,000 ft view • OLTP: Online Transaction Processing, many short-lived requests arriving over the internet (Sarah: “Buy this book”) • DSS: Decision Support Systems, few long-running queries over data offloaded from the OLTP DBMS (Jeff: “Which store needs more advertising?”) • DB systems fuel most e-applications: improved performance has an impact on everyday life
New HW/SW requirements • More capacity, throughput, efficiency • CPUs run much faster than they can access data: relative to one CPU cycle of work, a memory access cost roughly 10 cycles in the ’80s and costs roughly 300 cycles today [figure: CPU vs. memory access latency, then and now] • DSS stress the I/O subsystem • Need to optimize all levels of the memory hierarchy
The further, the slower • Keep data close to the CPU: locality and predictability are key • Overlap memory accesses with computation • Modify algorithms and structures to exhibit more locality • DBMS core design contradicts these goals
Thread-based execution in DBMS • Queries are handled by a pool of threads • Threads execute independently, with no coordination • No means to exploit common operations across threads • StagedDB: a new design to expose locality across threads
Staged Database Systems (StagedDB) • Organize system components into stages through which queries flow • No need to change algorithms / structures • High concurrency, locality across requests
Thesis “By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests, thereby improving performance.”
Summary of main results • STEPS (OLTP, where L1-I stalls are 20-40% of execution time): 56%-96% fewer I-misses, full-system evaluation on Shore • QPipe (DSS, variable gains): 1.2x-2x throughput, full-system evaluation on BerkeleyDB • Together they target all levels of the memory hierarchy (L1 and L2-L3 I- and D-caches, RAM, disks)
Contributions and dissemination • Introduced the StagedDB design and scheduling algorithms for staged systems (CIDR’03, IEEE Data Eng. ’05, CMU-TR’02) • Built a novel query engine design: the QPipe engine maximizes data and work sharing (SIGMOD’05, ICDE’06 demo sub., CMU-TR’05, HDMS’05, VLDB J. subm.) • Addressed the instruction cache in OLTP: STEPS applies to any DBMS with few changes (VLDB’04, TODS subm.)
Outline • Introduction • QPipe • STEPS • Conclusions
Query-centric design of DB engines • Queries are evaluated independently • No means to share across queries • Need a new design to exploit common data, instructions, and work across operators
QPipe: operator-centric engine • Conventional: “one-query, many-operators” • QPipe: “one operator, many-queries” • Relational operators become mEngines • Queries break up into tasks that queue up at the mEngines
QPipe design [figure: a packet dispatcher breaks incoming query plans into packets and routes them to per-operator mEngines (mEngine-A, mEngine-J, mEngine-S, ...), each with its own queue and thread pool on top of the storage engine, in contrast to the conventional one-thread-per-query design]; a sketch of a mEngine follows
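To make the mEngine idea concrete, here is a minimal sketch in C++ (hypothetical names and structure, not the actual QPipe code): one relational operator owns a queue of work packets and a small pool of worker threads, so packets from different queries line up at the same operator code.

```cpp
// Minimal sketch of a QPipe-style mEngine (hypothetical names, not the
// actual QPipe code): one queue of work packets per relational operator,
// drained by that operator's own pool of worker threads.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Packet {              // one operator invocation on behalf of a query
    int query_id;
    std::string args;        // operator parameters (e.g., table, predicate)
};

class MicroEngine {
public:
    MicroEngine(std::string op, int workers) : op_(std::move(op)) {
        for (int i = 0; i < workers; ++i)
            pool_.emplace_back([this] { serve(); });
    }
    void enqueue(Packet p) {                 // packets from any query queue up here
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(p)); }
        cv_.notify_one();
    }
    void shutdown() {                        // drain the queue, then stop workers
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : pool_) t.join();
    }
private:
    void serve() {                           // workers run ONLY this operator's
        for (;;) {                           // code, so its instructions and
            std::unique_lock<std::mutex> lk(m_);   // data stay hot in cache
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;
            Packet p = std::move(q_.front());
            q_.pop();
            lk.unlock();
            std::cout << op_ << ": serving Q" << p.query_id
                      << " (" << p.args << ")\n";  // operator logic goes here
        }
    }
    std::string op_;
    std::queue<Packet> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::vector<std::thread> pool_;
};

int main() {
    MicroEngine scan("SCAN", 2);             // one mEngine per operator
    scan.enqueue({1, "LINEITEM"});           // tasks from different queries
    scan.enqueue({2, "LINEITEM"});           // meet at the same operator
    scan.shutdown();
}
```

Routing all requests for one operator through one queue is what creates the cross-query locality the design is after; the real engine adds scheduling policies, operator parameters, and the SP hooks described next.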
Reusing data & work in QPipe • Detect overlap at run time • Shared pages and intermediate results are simultaneously pipelined to all parent nodes [figure: Q1 and Q2 plans with the overlapping operator highlighted]; see the shared-scan sketch below
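A toy illustration of the shared-scan case of simultaneous pipelining (an assumed simplification; SP in QPipe covers other operators and the order-sensitive cases discussed later): a circular scan reads each page once and pipelines it to every attached query, and a query that arrives mid-scan attaches immediately, picking up the pages it missed on the wrap-around.

```cpp
// Toy sketch of simultaneous pipelining for an order-insensitive shared
// scan (hypothetical simplification of QPipe's SP): pages are read once
// and pipelined to every attached query; a query arriving mid-scan
// attaches at the current position and receives the missed pages on the
// wrap-around of the circular scan.
#include <cstdio>
#include <vector>

struct Consumer {
    int query_id;
    int pages_seen = 0;       // detaches after seeing every page once
};

int main() {
    const int n_pages = 4;
    std::vector<Consumer> attached = {{1}};  // Q1 starts the scan
    bool q2_attached = false;

    // Circular scan: advance the page position mod n_pages until every
    // attached consumer has received each page exactly once.
    for (int pos = 0; !attached.empty(); pos = (pos + 1) % n_pages) {
        if (pos == 2 && !q2_attached) {      // Q2 arrives mid-scan and
            attached.push_back({2});         // attaches to the live scan:
            q2_attached = true;              // no second read of the table
        }
        for (auto it = attached.begin(); it != attached.end();) {
            std::printf("page %d -> Q%d\n", pos, it->query_id);
            if (++it->pages_seen == n_pages)
                it = attached.erase(it);     // consumer has seen all pages
            else
                ++it;
        }
    }
}
```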
Mechanisms for sharing • Multi-query optimization: not used in practice • Materialized views: require workload knowledge • Buffer pool management: opportunistic • Shared scans (RedBrick, Teradata, SQL Server): limited use • QPipe complements the above approaches
Experimental setup • QPipe prototype • Built on top of BerkeleyDB, 7,000 C++ lines • Shared-memory buffers, native OS threads • Platform • 2GHz Pentium 4, 2GB RAM, 4 SCSI disks • Benchmarks • TPC-H (4GB)
Sharing order-sensitive scans [figure: two instances of TPC-H Query 4 (Q1, Q2), each an aggregate over a merge-join of index scans on LINEITEM and ORDERS, annotated with order-insensitive and order-sensitive scans] • Two clients send the query at different intervals • QPipe performs 2 separate joins
Sharing order-sensitive scans [figure: total response time (sec) vs. time difference between arrivals] • Two clients send the query at different intervals • QPipe performs 2 separate joins
TPC-H workload [figure: throughput (queries/hr) vs. number of clients] • Clients use a pool of 8 TPC-H queries • QPipe reuses large scans, runs up to 2x faster...while maintaining low response times
QPipe: conclusions • DB engines evaluate queries independently • Limited existing mechanisms for sharing • QPipe requires few code changes • SP is a simple yet powerful technique • Allows dynamic sharing of data and work • Other benefits (not described here): I-cache and D-cache performance, efficient execution of MQO plans
Outline • Introduction • QPipe • STEPS • Conclusions
Online Transaction Processing [figure: cache size vs. year introduced (’96-’04): max on-chip L2/L3 caches grow toward 10MB while L1-I sizes for various CPUs stay around 10KB-100KB] • High-end servers, non I/O bound • L1-I stalls are 20-40% of execution time • Instruction caches cannot grow • Need a solution for instruction cache-residency
Related work • Hardware and compiler approaches • Increased block size, stream buffer [Ranganathan98] • Code layout optimizations [Ramirez01] • Database software approaches • Instruction cache for DSS [Padmanabhan01] [Zhou04] • Instruction cache for OLTP: challenging!
STEPS for cache-resident code • STEPS: Synchronized Transactions through Explicit Processor Scheduling • Transaction code (Begin, Select, Update, Insert, Delete, Commit) is still larger than the I-cache • Keep the thread model, insert synchronization points: multiplex execution, reuse instructions • Microbenchmark: eliminates 96% of L1-I misses • TPC-C: eliminates 2/3 of misses, 1.4 speedup
I-cache aware context-switching [figure: without STEPS, thread 1 runs all of select() (code larger than the I-cache) and misses on every segment s1-s7, then thread 2 misses on all of them again; with STEPS, a context-switch (CTX) point after each cache-sized group of segments lets thread 2 hit on the instructions thread 1 just loaded]; a runnable sketch of the idea follows
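Below is a runnable sketch of the scheduling idea using POSIX ucontext user-level contexts on Linux (the real STEPS instead modifies Shore's thread package; segment boundaries and names here are illustrative): every team member executes the same operator code, and an explicit CTX call after each cache-sized segment hands the CPU round-robin to the next member, which then runs on the just-loaded instructions.

```cpp
// Sketch of STEPS-style explicit processor scheduling with ucontext
// (hypothetical names; the real system modifies Shore's thread package).
// A team of threads runs the same operator; CTX points after each
// cache-sized code segment multiplex execution so instructions are reused.
#include <cstdio>
#include <ucontext.h>

const int TEAM = 3;
ucontext_t ctx[TEAM], main_ctx;
char stacks[TEAM][64 * 1024];
int current = 0;
int alive = TEAM;
bool done[TEAM] = {};

void ctx_point() {                     // explicit CTX: round-robin to next live thread
    int prev = current;
    do { current = (current + 1) % TEAM; } while (done[current]);
    if (current != prev) swapcontext(&ctx[prev], &ctx[current]);
}

void finish() {                        // leave the team when the transaction is done
    done[current] = true;
    int prev = current;
    if (--alive == 0) { swapcontext(&ctx[prev], &main_ctx); return; }
    do { current = (current + 1) % TEAM; } while (done[current]);
    swapcontext(&ctx[prev], &ctx[current]);
}

void index_fetch() {                   // same operator code for every team member
    std::printf("thread %d: probe index root\n", current);
    ctx_point();                       // this segment is now in the I-cache:
    std::printf("thread %d: traverse leaf\n", current);   // let the peers reuse it
    ctx_point();
    std::printf("thread %d: fetch record\n", current);
    finish();
}

int main() {
    for (int i = 0; i < TEAM; ++i) {
        getcontext(&ctx[i]);
        ctx[i].uc_stack.ss_sp = stacks[i];
        ctx[i].uc_stack.ss_size = sizeof stacks[i];
        ctx[i].uc_link = &main_ctx;    // safety net; finish() normally exits first
        makecontext(&ctx[i], index_fetch, 0);
    }
    swapcontext(&main_ctx, &ctx[0]);   // run the team to completion
    std::puts("team finished");
}
```

The fixed round-robin order is what makes the CTX cheap: no scheduler decision is needed, only a swap to the next team member.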
Placing CTX calls in source • AutoSTEPS tool: valgrind records the instruction memory references of the DBMS binary, a STEPS simulation picks the memory addresses where CTX calls belong, and gdb maps those addresses to the source lines (e.g., file1.c:30, file2.c:40) where CTX is inserted • Evaluation: comparable performance to manual placement...while being more conservative • A sketch of the core pass follows
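The core pass can be sketched as follows (a hypothetical simplification: the real AutoSTEPS simulates the STEPS execution and the I-cache rather than merely counting distinct cache lines): walk the instruction-address trace and emit a CTX point whenever the code touched since the last CTX no longer fits in the L1-I cache.

```cpp
// Sketch of the AutoSTEPS core pass (hypothetical simplification): read an
// instruction-address trace (one hex address per line, e.g. collected with
// valgrind) and emit a CTX point whenever the code footprint since the last
// CTX would overflow the L1-I cache. gdb would then map each emitted
// address back to a source line where the CTX call is inserted.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

int main() {
    const uint64_t BLOCK = 64;                    // I-cache line size (bytes)
    const uint64_t CAPACITY = 64 * 1024 / BLOCK;  // 64KB L1-I, in lines
    std::unordered_set<uint64_t> footprint;       // lines touched since last CTX
    unsigned long long addr;
    while (std::scanf("%llx", &addr) == 1) {
        footprint.insert(addr / BLOCK);
        if (footprint.size() > CAPACITY) {        // code no longer fits:
            std::printf("insert CTX before 0x%llx\n", addr);
            footprint.clear();                    // a new segment starts here
            footprint.insert(addr / BLOCK);
        }
    }
}
```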
Experimental setup (1st part) • Implemented on top of Shore • AMD AthlonXP • 64KB L1-I + 64KB L1-D, 256KB L2 • Microbenchmark • Index fetch, in-memory index • Fast CTX for both systems, warm cache
Microbenchmark: L1-I misses [figure: L1-I cache misses (up to 4K) vs. concurrent threads (1-10), AthlonXP] • STEPS eliminates 92-96% of misses for additional threads
L1-I misses & speedup [figure: miss reduction (up to 100%) and speedup (up to 1.4) vs. concurrent threads (10-80), AthlonXP] • STEPS achieves max performance for 6-10 threads • No need for larger thread groups
Challenges in full-system operation So far: • Threads are interested in same Op • Uninterrupted flow • No thread scheduler Full-system requirements • High concurrency on similar Ops • Handle exceptions • Disk I/O, locks, latches, abort • Co-exist with system threads • Deadlock detection, buffer pool housekeeping
System design [figure: transactions queue up at per-operator STEPS wrappers (Op X, Op Y, Op Z); each wrapper schedules an execution team, and a thread that hits an exception leaves as a stray thread before moving on to other Ops] • Fast CTX through fixed scheduling • Repair thread structures at exceptions • Modify only the thread package
Experimental setup (2nd part) • AMD AthlonXP • 64KB L1-I + 64KB L1-D, 256KB L2 • TPC-C (wholesale parts supplier) • 2GB RAM, 2 disks • 10-30 Warehouses (1-3GB), 100-300 users • Zero think time, in-memory, lazy commits
One transaction: payment [figure: normalized cycles and L1-I misses vs. number of users] • STEPS outperforms the baseline system • 1.4 speedup, 65% fewer L1-I misses
Mix of four transactions [figure: normalized cycles and L1-I misses vs. number of users] • The transaction mix reduces team size • Still, 56% fewer L1-I misses
STEPS: conclusions • STEPS can handle full OLTP workloads • Significant improvements in TPC-C • 65% fewer L1-I misses • 1.2 – 1.4 speedup STEPS minimizes both capacity / conflict misses without increasing I-cache size / associativity
StagedDB: future work • Promising platform for Chip-Multiprocessors • DBMS suffer from CPU-to-CPU cache misses • StagedDB allows work to follow data -- not the other way around! • Resource scheduling • Stages cluster requests for DB locks, I/O • Potential for deeper, more effective scheduling
Conclusions • New hardware, new requirements • Server core design remains the same • Need a new design to fit modern hardware • StagedDB optimizes all memory hierarchy levels: a promising design for future installations
The speaker would like to thank: his academic advisor Anastassia Ailamaki his thesis committee members Panos K. Chrysanthis, Christos Faloutsos, Todd C. Mowry, and Michael Stonebraker and his coauthors Kun Gao, Vladislav Shkapenyuk, and Ryan Williams Thank you
A mEngine in detail [figure: a mEngine wraps the relational operator code with a queue, a main routine and its parameters, a scheduling thread, and pools of busy and free threads; simultaneous pipelining hooks into the queue] • Tuple batching improves I-cache behavior; query grouping improves I- and D-cache behavior [Padmanabhan01 (ICDE), Zhou04 (SIGMOD), Harizopoulos04 (VLDB), Zhou03 (VLDB)]
Simultaneous Pipelining in QPipe [figure: without SP, Q1 and Q2 each read, join, and write independently; with SP, the SP coordinator attaches Q2 to Q1's in-progress join (steps 1-4), and the shared results are pipelined or copied to both queries until COMPLETE]
Sharing data & work across queries • Query 1: “Find average age of students enrolled in both class A and class B” (plan: aggregate over a merge-join of scans on TABLE A and TABLE B) • Query 2 contains the same join of TABLE A and TABLE B: a work sharing opportunity (max sharing) • Query 3 only scans TABLE A: a data sharing opportunity (min sharing)
Sharing opportunities at run time [figure: without SP, Q1 and Q2 each run operator R and produce its results independently; with SP, the coordinator pipelines the results of Q1’s R to Q2, so Q2’s sharing potential depends on how far Q1’s result production for R has progressed] • Q1 executes operator R • Q2 arrives with R in its plan
TPC-H workload [figure: average response time vs. think time (sec), and throughput (queries/hr) vs. number of clients] • Clients use a pool of 8 TPC-H queries • QPipe reuses large scans, runs up to 2x faster...while maintaining low response times
Smaller L1-I cache [figure: normalized counts for 10 threads on AthlonXP and Pentium III: cycles, instruction stalls (cycles), branches, branch mispredictions, L1-I misses, L1-D misses, branches missing the BTB] • STEPS outperforms Shore even on smaller caches (PIII) • 62-64% fewer mispredicted branches on both CPUs
SimFlex: L1-I misses [figure: L1-I cache misses (up to 10K) for 10 threads, 64B cache blocks, AthlonXP, across associativities from direct-mapped through 2-way, 4-way, and 8-way to fully associative] • STEPS eliminates all capacity misses (16KB, 32KB caches) • Up to 89% overall miss reduction (upper limit is 90%)
One Xaction: payment [figure: normalized cycles, mispredicted branches, L1-I, L2-I, L1-D, and L2-D misses vs. number of warehouses] • STEPS outperforms Shore • 1.4 speedup, 65% fewer L1-I misses • 48% fewer mispredicted branches
Mix of four Xactions [figure: normalized cycles, mispredicted branches, L1-I, L2-I, L1-D, and L2-D misses vs. number of warehouses] • The Xaction mix reduces average team size (4.3 in 10W) • Still, STEPS has 56% fewer L1-I misses (out of 77% max)