LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases
Problem
[Figure: queries Q1, Q2, Q3, and Q4]
Goals
• Eliminate redundant I/O to improve query throughput
• Batch queries that exhibit data sharing
• Pre-process queries to identify data sharing
• Co-schedule queries that access the same data
• Access contentious data first to maximize sharing
• Starvation resistance
• Avoid indefinite queuing times (response time)
• Enforce some constraints on completion order
Target Applications
• Data-intensive scan queries
• Executed against a clustered index
• Clustered and federated databases (e.g. joins that correlate multiple nodes)
• Peta-scale astronomy (Pan-STARRS)
• Data are partitioned spatially
• Many queries scan the full DB and last hours or days
• Cross-match
• Probabilistic spatial join across multiple databases
Filter and Refine
• Filter queries
• Pre-process queries to determine join buckets
• Buckets B1,…,Bn and queries Q1,…,Qm
• Workload Wij denotes the objects from Qi that overlap Bj
• Refinement
• Read buckets one at a time
• Sort-merge join (sort by HTM ID)
• Query-specific predicates are applied to the output tuples
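The filter and refine steps above can be sketched as follows. This is a minimal illustration, not the LifeRaft implementation: it assumes each object carries an integer HTM ID, that each bucket covers a contiguous ID range described by a sorted list of lower bounds, and it uses plain equality on the HTM ID as a stand-in for the probabilistic spatial join. The names `filter_queries`, `merge_join`, and `refine_bucket` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def filter_queries(queries, bucket_starts):
    """Filter step: build workload[j][i], the objects of query i that overlap bucket j.

    queries: dict query_id -> list of (htm_id, payload)
    bucket_starts: sorted lower bounds of each bucket's HTM ID range (assumed layout)
    """
    workload = defaultdict(lambda: defaultdict(list))
    for qid, objects in queries.items():
        for htm_id, payload in objects:
            j = bisect_right(bucket_starts, htm_id) - 1
            if j >= 0:  # skip objects below the first bucket
                workload[j][qid].append((htm_id, payload))
    return workload


def merge_join(left, right):
    """Sort-merge join of two lists of (key, value) pairs on equal keys."""
    left = sorted(left, key=lambda p: p[0])
    right = sorted(right, key=lambda p: p[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair this left row with the whole run of equal keys on the right.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


def refine_bucket(bucket_rows, pending, predicates=None):
    """Refine step: one read of a bucket serves every query with work in it."""
    predicates = predicates or {}
    results = {}
    for qid, objects in pending.items():
        keep = predicates.get(qid, lambda row: True)
        results[qid] = [row for row in merge_join(objects, bucket_rows) if keep(row)]
    return results
```

Because `refine_bucket` receives the rows of a bucket once and loops over all queries with pending work in it, the I/O for that bucket is shared by every co-scheduled query.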
Workload Throughput Metric
• Schedule buckets greedily in order of decreasing workload throughput
• Exploits data regions that experience contention
• May starve requests
• Favors buckets experiencing frequent reuse
• No guarantee that a particular bucket or query receives service
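A rough sketch of greedy, contention-driven ordering is shown below. The slide does not give the metric's formula, so the definition used here (pending objects divided by the cost of reading and joining the bucket) is an assumption for illustration, as are the names `workload_throughput` and `greedy_order` and the default cost values.

```python
def workload_throughput(pending_objects, cached, read_cost=1.0, join_cost_per_object=0.0):
    """Assumed metric: objects joined per second if this bucket is served next."""
    if pending_objects == 0:
        return 0.0
    cost = (0.0 if cached else read_cost) + join_cost_per_object * pending_objects
    # A cached bucket with negligible join cost is free to serve.
    return float("inf") if cost == 0 else pending_objects / cost


def greedy_order(pending, cache=()):
    """Return bucket IDs in order of decreasing workload throughput.

    pending: dict bucket_id -> number of query objects waiting on that bucket
    cache: bucket IDs currently held in memory
    """
    return sorted(
        pending,
        key=lambda b: (-workload_throughput(pending[b], b in cache), b),
    )


if __name__ == "__main__":
    print(greedy_order({"B1": 30, "B2": 5, "B3": 12}))
    print(greedy_order({"B1": 30, "B2": 5, "B3": 12}, cache={"B2"}))
```

Under this ordering a bucket with little pending work stays at the back for as long as busier buckets keep receiving new requests, which is the starvation risk the slide points out.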
Aged Workload Throughput Metric
• Inspired by disk-drive head scheduling
• Balance arrival order (low response time) with contention (high throughput)
• Adaptive trade-offs based on workload saturation
• Maximize the rate at which objects are joined during saturated workloads
• Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
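One way to picture the trade-off is a score that blends contention with request age through a single bias parameter, where bias 0 is purely contention based and bias 1 is purely age based (the two settings named later in the deck). The linear blend and the normalization below are assumptions made for this sketch, not the formula from the talk; `aged_score` and `pick_next` are hypothetical names.

```python
def aged_score(throughput, age, max_throughput, max_age, bias):
    """Blend normalized workload throughput with normalized age.

    bias = 0.0 ranks by contention only; bias = 1.0 ranks by age only.
    """
    t = throughput / max_throughput if max_throughput > 0 else 0.0
    a = age / max_age if max_age > 0 else 0.0
    return (1.0 - bias) * t + bias * a


def pick_next(buckets, bias):
    """Choose the next bucket to read.

    buckets: dict bucket_id -> (workload_throughput, age_of_oldest_request_sec)
    """
    max_t = max(t for t, _ in buckets.values())
    max_a = max(a for _, a in buckets.values())
    return max(buckets, key=lambda b: aged_score(*buckets[b], max_t, max_a, bias))


if __name__ == "__main__":
    buckets = {"B1": (30.0, 1.0), "B2": (5.0, 9.0)}
    for bias in (0.0, 0.5, 1.0):
        print(bias, pick_next(buckets, bias))
```

With these example numbers the heavily shared bucket B1 wins at bias 0, while the long-waiting bucket B2 wins at bias 0.5 and bias 1.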
Scheduling Behavior
[Figure: queries Qi, Qj, and Qk]
Assumptions:
• Inter-query time of 1 sec
• I/O for each bucket of 1 sec
• Cache size of 2
• Join cost is negligible
Sub-divide queries by bucket:
Qi – Qi1, Qi2, Qi3
Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
Qj – Qj5, Qj6, Qj7, Qj8
Arrival order with no sharing
[Timeline figure: arrival (Arr) and end markers for Qi, Qj, and Qk. Sub-query labels: Qi1 Qi2 Qi3 Qk8 Qj7 Qj1 Qj6 Qj8 Qk1 Qj3 Qk4 Qj4. Bucket labels: B1 B2 B3 B7 B1 B1 B3 B6 B4 B8 B4 B8 …]
Completion Times:
Qi – 3 sec
Qj – 8 sec
Qk – 13 sec
Avg – 8 sec
Tp – .2 qry/sec
Age based scheduling (bias 1)
[Timeline figure: arrival (Arr) and end markers for Qi, Qj, and Qk. Sub-query labels: Qi1 Qi2 Qi5 Qj4Qk4 Qj7Qk7 Qj6Qk6 Qj1Qk1 Qi3Qj3 Qj8Qk8. Bucket labels: B1 B2 B5 B3 B1 B4 B7 B8 B6]
Completion Times:
Qi – 3 sec
Qj – 7 sec
Qk – 7 sec
Avg – 5.6 sec
Tp – .33 qry/sec
Contention based scheduling (bias 0)
[Timeline figure: arrival (Arr) and end markers for Qi, Qj, and Qk. Sub-query labels: Qi1 Qi2 Qi3Qj3 Qk5 Qj6Qk6 Qj7Qk7 Qj8Qk8 Qj1Qk1Qj4Qk4. Bucket labels: B1 B2 B5 B3 B7 B8 B1 B4 B6]
Completion Times:
Qi – 7 sec
Qj – 5 sec
Qk – 6 sec
Avg – 6 sec (age based: 5.6 sec)
Tp – .38 qry/sec (age based: .33 qry/sec)
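The summary figures on the three schedule slides can be checked with a few lines of arithmetic. The sketch below rests on an inferred reading, not one stated on the slides: completion times are measured from each query's arrival, queries arrive 1 sec apart (the stated inter-query time), and Tp is the number of queries divided by the time at which the last one finishes. Under that reading the slide values are reproduced up to rounding. `summarize` is a hypothetical helper.

```python
def summarize(response_times, inter_arrival=1.0):
    """Return (average response time, throughput in queries/sec).

    response_times: per-query completion times measured from arrival,
    listed in arrival order (Qi, Qj, Qk).
    """
    n = len(response_times)
    avg = sum(response_times) / n
    # Query k arrives at k * inter_arrival and finishes response_times[k] later.
    makespan = max(k * inter_arrival + r for k, r in enumerate(response_times))
    return avg, n / makespan


if __name__ == "__main__":
    schedules = {
        "arrival order, no sharing": [3, 8, 13],
        "age based (bias 1)": [3, 7, 7],
        "contention based (bias 0)": [7, 5, 6],
    }
    for name, times in schedules.items():
        avg, tp = summarize(times)
        print(f"{name}: avg {avg:.2f} sec, Tp {tp:.3f} qry/sec")
```

The comparison shows the trade-off the deck describes: the contention based schedule finishes the batch soonest, while the age based schedule gives the lowest average response time and never delays the first query.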
Tuning the age bias
• The throughput performance gap grows with saturation, while the response time gap is insensitive to saturation
• Increasing the age bias is more attractive at low saturation
Discussion
• Impact of caching strategies
• Workload overflow
• Large intermediate join results
• Migrate pairs of workload and bucket
• Beyond completion order
• Higher priority for interactive queries
• Batch processing in a clustered environment
P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.
Filter and refine
• Partition data into buckets
Outline
• Motivation
• Goals for data-driven, batch scheduling
• Target application (SkyQuery)
• LifeRaft scheduler
• Filter and refine queries
• Throughput maximizing metric
• Starvation resistance
• Differences in outcomes
• Workload adaptive parameter selection