Scaling up analytical queries with column-stores
Ioannis Alagiannis, Manos Athanassoulis, Anastasia Ailamaki
École Polytechnique Fédérale de Lausanne
Drinking from a data firehose
• Fast and high quality data analysis for smart business decisions
• Data warehouses
• 1/3 of the database market ($$$)
• Column-stores are here to stay!
• Need for multiple concurrent users
• 100s to 1000s queries*
Many concurrent queries + column-stores = ???
*"High-performance data warehousing", TDWI best practices report
Multiple concurrent queries
[Figure: several users query a DBMS at the same time ("Find all restaurants with rating over 3.5 and close to East Village", "pasta?", "steak?", "vegan?", "indian?"); the DBMS runs on cores 1 to 8, MEM, and HDD.]
High contention for resources
• response time
• throughput
Throughput (memory-resident workload)
[Figure: throughput for TPCH (sf:30), with the saturation point and the total #HW contexts marked.]
Concurrency can hurt performance
Experimental setup
• Column stores
  • System-A and System-B (Commercial)
  • System-C (Open-source)
• Hardware
  • Dual socket Intel(R) Xeon(R) CPU E5-2660
  • 2 sockets x 8 cores x 2 threads (32 HW contexts)
  • 128 GB RAM, 1600 MHz DIMMs
  • L1: 64KB and L2: 256KB (per core), L3: 20MB (shared)
Workloads
• TPC-H
  • Scale factor: 30 (32GB on disk)
  • Qtpch = {10 query templates}
• SSB (Star Schema Benchmark)
  • Scale factor: 30 (18GB on disk)
  • Qssb = {all 13 query templates}
• Throughput experiments with 25 query instances
• Memory-resident, hot runs
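The experiments that follow vary the number of concurrent clients and record response time and throughput. A minimal sketch of such a driver is given here, assuming closed-loop clients (each client submits its next query only after the previous one returns), which the slides do not state. The names `run_clients` and `fake_query` are hypothetical, and `fake_query` only sleeps instead of contacting a DBMS. The value 25 echoes the "25 query instances" above, but applying it per client is an assumption.

```python
import threading
import time


def run_clients(run_query, num_clients, queries_per_client):
    """Run num_clients closed-loop clients and report latency and throughput."""
    latencies = []
    lock = threading.Lock()

    def client():
        # Closed loop: the next query starts only after the previous one ends.
        for _ in range(queries_per_client):
            start = time.perf_counter()
            run_query()
            elapsed = time.perf_counter() - start
            with lock:
                latencies.append(elapsed)

    threads = [threading.Thread(target=client) for _ in range(num_clients)]
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start

    return {
        "clients": num_clients,
        "completed": len(latencies),
        "mean_response_s": sum(latencies) / len(latencies),
        "throughput_qps": len(latencies) / wall,
    }


def fake_query():
    # Hypothetical stand-in for a real query: sleeps instead of calling a DBMS.
    time.sleep(0.001)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(run_clients(fake_query, num_clients=n, queries_per_client=25))
```

With a real system, `run_query` would submit one query instance over a database connection; the rest of the driver stays the same.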
Experiment 1: How does increased concurrency affect response time?
Scaling up TPCH Q1
Linear increase in response time
Scaling up SSB Q3.1
Similar behavior in SSB
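One standard way to read the linear trend is Little's law. The relation below is a sketch under assumptions the slides do not state: clients are closed-loop with no think time, and throughput stays roughly constant at its maximum once the system is saturated. Here N is the number of concurrent clients, X the throughput, R the mean response time, and N_sat the saturation point.

```latex
N = X \cdot R
\quad\Longrightarrow\quad
R(N) = \frac{N}{X_{\max}} \qquad \text{for } N \ge N_{\text{sat}}
```

Under these assumptions, doubling the number of clients doubles the mean response time. If throughput falls past the saturation point instead of staying flat, response time grows faster than linearly.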
Experiment 2: What is the variability of query response time?
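The slides do not say which variability metric was used. As an illustration only, the sketch below summarizes a list of per-query response times with the coefficient of variation (standard deviation divided by mean) and the min/max range. The function name `variability` is hypothetical.

```python
import statistics


def variability(response_times):
    """Summarize the spread of a non-empty list of response times (seconds)."""
    mean = statistics.fmean(response_times)
    stdev = statistics.pstdev(response_times)  # population standard deviation
    return {
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation; 0 means no spread
        "min": min(response_times),
        "max": max(response_times),
    }
```

A coefficient of variation near zero means all clients see similar response times; larger values indicate that some queries wait much longer than others.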
Variability of System-A
TPCH (64 clients)
Groups of short, medium and long running queries
Variability of System-B
TPCH (64 clients)
Balanced resource allocation, lower variation
Variability of System-C
TPCH (64 clients)
System-C uses an admission control mechanism
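The slides do not describe how System-C's admission control works. The sketch below shows the general idea only, assuming a fixed cap on the number of queries that may execute at once, with later arrivals waiting for a free slot. The class `AdmissionController` and the function `demo` are hypothetical names.

```python
import threading
import time


class AdmissionController:
    """Admit at most max_active queries at a time; the rest wait."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)
        self._lock = threading.Lock()
        self.active = 0
        self.peak_active = 0

    def run(self, query):
        with self._slots:  # blocks while max_active queries are running
            with self._lock:
                self.active += 1
                self.peak_active = max(self.peak_active, self.active)
            try:
                return query()
            finally:
                with self._lock:
                    self.active -= 1


def demo(num_clients=16, max_active=4):
    """Submit num_clients sleeping 'queries' and return the peak concurrency."""
    controller = AdmissionController(max_active)
    threads = [
        threading.Thread(target=controller.run, args=(lambda: time.sleep(0.005),))
        for _ in range(num_clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return controller.peak_active
```

Capping the number of running queries trades queueing delay for less contention inside the engine, so admitted queries run closer to their uncontended speed.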
Experiment 3: How does increasing concurrency affect throughput?
Throughput - TPCH
[Figure: throughput curves with annotations "48%", "32% drop" and "35% drop".]
Throughput decreases after the saturation point
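Given a measured throughput curve, the saturation point and the drop after it can be computed as sketched below. The function names are hypothetical and the numbers in the usage note are made up for illustration; they are not the measurements behind the percentages on the slide.

```python
def saturation_point(throughput_by_clients):
    """Return (clients, throughput) at the peak of the throughput curve."""
    clients = max(throughput_by_clients, key=throughput_by_clients.get)
    return clients, throughput_by_clients[clients]


def drop_after_peak(throughput_by_clients):
    """Percentage drop from peak throughput to the highest concurrency level."""
    _, peak = saturation_point(throughput_by_clients)
    last = throughput_by_clients[max(throughput_by_clients)]
    return 100.0 * (peak - last) / peak
```

For a made-up curve `{1: 10.0, 2: 20.0, 4: 40.0, 8: 30.0}` (clients mapped to queries per unit time), the peak is at 4 clients and the drop at 8 clients is 25%.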
Throughput - SSB
[Figure: throughput plateaus.]
Exploiting sharing sustains peak performance
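The slides do not explain which sharing mechanism is at work here. As a generic illustration of work sharing, the sketch below contrasts independent scans with a shared scan, where a single pass over a column evaluates the predicates of all concurrent queries. The function names and the sample data are hypothetical.

```python
def independent_scans(column, predicates):
    """Baseline: every query scans the whole column on its own."""
    return {
        name: sum(1 for value in column if pred(value))
        for name, pred in predicates.items()
    }


def shared_scan(column, predicates):
    """One pass over the column; each value is tested for every active query."""
    counts = {name: 0 for name in predicates}
    for value in column:
        for name, pred in predicates.items():
            if pred(value):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    ratings = [2.0, 3.6, 4.5, 3.5, 5.0]  # made-up sample column
    queries = {
        "over_3_5": lambda v: v > 3.5,
        "at_least_4_5": lambda v: v >= 4.5,
    }
    print(shared_scan(ratings, queries))
```

Both functions return the same counts; the shared version reads the column once regardless of how many queries are active, which is why sharing can help when many similar queries run together.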
When concurrency in column-stores is increased:
• Response time increases linearly
• … with high variability
• After saturation, peak performance is not sustained (except for System-B on SSB)
Where do we go from here?
• QPipe, Datapath, CJoin, ShareDB, Blink
• Recycler (MonetDB), cooperative scans, CCM (cracking)
• Adaptive resource (re)allocation
• Work sharing techniques
• Contention-aware scheduling
Thank you!