
LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases




Presentation Transcript


  1. LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

  2. BETTER LUCK NEXT TIME!

  3. Problem Q1 Q4 Q2 Q3

  4. Goals • Eliminate redundant I/O to improve query throughput • Batch queries that exhibit data sharing • Pre-process queries to identify data sharing • Co-schedule queries that access the same data • Access contentious data first to maximize sharing • Starvation resistance • Avoid indefinite queuing times (response time) • Enforce some constraints on completion order

  5. Target Applications • Data intensive scan queries • Executed against a clustered index • Clustered and federated databases (e.g. joins that correlate multiple nodes) • Peta-scale astronomy (Pan-STARRS) • Data are partitioned spatially • Many queries scan full DB and last hours or days • Cross-match • Probabilistic spatial join across multiple databases

  6. Filter and Refine • Filter queries • Pre-process queries to determine join buckets • Buckets B1,…,Bn and queries Q1,…,Qm • Workload Wij denotes the objects from Qi that overlap Bj • Refinement • Read buckets one at a time • Sort-merge join (sort by HTM ID) • Query-specific predicates applied on output tuples
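The filter and refine steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LifeRaft implementation: objects are reduced to integer keys standing in for HTM IDs, `bucket_of` is a hypothetical fixed-width partitioning function, and the names `filter_step` and `refine_bucket` are invented for the example.

```python
from collections import defaultdict

BUCKET_WIDTH = 10  # assumption: buckets are fixed-width ranges of the join key


def bucket_of(key):
    """Hypothetical partitioning: map an integer key to a bucket number."""
    return key // BUCKET_WIDTH


def filter_step(queries):
    """Filter: group each query's objects by the bucket they overlap.

    Returns workload[j][i], the objects of query i that fall in bucket j (W_ij).
    """
    workload = defaultdict(lambda: defaultdict(list))
    for qid, keys in queries.items():
        for key in keys:
            workload[bucket_of(key)][qid].append(key)
    return workload


def refine_bucket(bucket_rows, query_keys, predicate):
    """Refine: sort-merge join one bucket with one query's keys, then filter.

    bucket_rows is a list of (key, payload) tuples read from a single bucket.
    """
    rows = sorted(bucket_rows)
    keys = sorted(set(query_keys))
    out, i, j = [], 0, 0
    while i < len(rows) and j < len(keys):
        if rows[i][0] < keys[j]:
            i += 1
        elif rows[i][0] > keys[j]:
            j += 1
        else:
            # Keys match: apply the query-specific predicate to the output tuple.
            if predicate(rows[i]):
                out.append(rows[i])
            i += 1
    return out


queries = {"Q1": [3, 12, 14], "Q2": [13, 25]}
workload = filter_step(queries)

bucket_1_rows = [(12, "a"), (13, "b"), (14, "c"), (17, "d")]
q1_matches = refine_bucket(bucket_1_rows, workload[1]["Q1"], lambda row: True)
```

Because each bucket is read once and joined against every query that has a workload for it, queries sharing a bucket also share its I/O.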

  7. Workload Throughput Metric • Schedule buckets greedily in order of decreasing workload throughput • Exploits data regions that experience contention • May starve requests • Favors buckets experiencing frequent reuse • No guarantee a particular bucket or query receives service
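A minimal sketch of greedy scheduling by workload throughput follows. The slides do not give the metric's formula, so the definition used here (pending objects in a bucket divided by the cost of reading it) is an assumption, as are the names `workload_throughput`, `next_bucket`, and `greedy_schedule`.

```python
def workload_throughput(pending_by_query, io_cost=1.0):
    """Assumed metric: objects waiting on a bucket per unit of I/O cost."""
    pending = sum(len(objs) for objs in pending_by_query.values())
    return pending / io_cost


def next_bucket(workload, io_cost=1.0):
    """Pick the bucket with the highest throughput; lower bucket id breaks ties."""
    return min(
        workload,
        key=lambda b: (-workload_throughput(workload[b], io_cost), b),
    )


def greedy_schedule(workload, io_cost=1.0):
    """Drain a static workload in order of decreasing workload throughput."""
    remaining = dict(workload)
    order = []
    while remaining:
        b = next_bucket(remaining, io_cost)
        order.append(b)
        del remaining[b]
    return order


# workload[bucket][query] = objects of that query overlapping the bucket
workload = {
    1: {"Q1": [12, 14], "Q2": [13]},
    2: {"Q2": [25]},
    3: {"Q1": [31], "Q2": [33], "Q3": [35, 38]},
}
order = greedy_schedule(workload)
```

With new queries arriving continuously, a lightly used bucket such as bucket 2 here could keep losing to busier buckets, which is the starvation risk the slide points out.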

  8. Aged Workload Throughput Metric • Inspired by disk-drive head scheduling • Balance arrival order (low response time) with contention (high throughput) • Adaptive trade-offs based on workload saturation • Maximize rate at which objects are joined during saturated workloads • Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
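One way to picture the age bias is as a weight that blends contention and waiting time. The linear blend and the normalization below are assumptions made for illustration; the slides only establish that a bias of 1 gives age-based scheduling and a bias of 0 gives contention-based scheduling. The names `aged_score` and `pick_bucket` are hypothetical.

```python
def aged_score(throughput, age, bias, max_throughput, max_age):
    """Assumed blend: bias=0 ranks by throughput only, bias=1 by age only."""
    t = throughput / max_throughput if max_throughput else 0.0
    a = age / max_age if max_age else 0.0
    return (1.0 - bias) * t + bias * a


def pick_bucket(buckets, bias):
    """buckets maps bucket id -> (workload throughput, age of oldest request in sec)."""
    max_t = max(t for t, _ in buckets.values())
    max_a = max(a for _, a in buckets.values())
    return max(
        buckets,
        key=lambda b: aged_score(buckets[b][0], buckets[b][1], bias, max_t, max_a),
    )


buckets = {
    "B1": (10.0, 1.0),  # heavily shared, but requested only recently
    "B2": (2.0, 8.0),   # little sharing, but its request has waited longest
}
contention_choice = pick_bucket(buckets, bias=0.0)
age_choice = pick_bucket(buckets, bias=1.0)
```

Intermediate bias values trade throughput for response time, which is the parameter the later tuning slides explore.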

  9. Scheduling Behavior • Sub-divide queries by bucket: Qi – Qi1, Qi2, Qi3; Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8; Qj – Qj5, Qj6, Qj7, Qj8 • Assumptions: • Inter-query time of 1 sec • I/O for each bucket of 1 sec • Cache size of 2 • Join cost is negligible

  10. Arrival order with no sharing • [Timeline figure: arrival and end times of Qi, Qj, and Qk, with their sub-queries placed against the buckets read] • Completion times: Qi – 3 sec, Qj – 8 sec, Qk – 13 sec • Avg – 8 sec • Tp – .2 qry/sec

  11. Age-based scheduling (bias 1) • [Timeline figure: arrival and end times of Qi, Qj, and Qk, with sub-queries that share a bucket scheduled together] • Completion times: Qi – 3 sec, Qj – 7 sec, Qk – 7 sec • Avg – 5.6 sec • Tp – .33 qry/sec

  12. Contention-based scheduling (bias 0) • [Timeline figure: arrival and end times of Qi, Qj, and Qk, with sub-queries that share a bucket scheduled together] • Completion times: Qi – 7 sec, Qj – 5 sec, Qk – 6 sec • Avg – 6 sec (5.6) • Tp – .38 qry/sec (.33)
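The average completion times quoted on slides 10 to 12 can be checked with a short calculation. The completion times are taken from the slides; the dictionary layout and labels are only for illustration. The slides' 5.6 sec for age-based scheduling appears to be 17/3 ≈ 5.67 truncated.

```python
# Completion times in seconds, as listed on slides 10, 11, and 12.
completion = {
    "arrival order, no sharing": {"Qi": 3, "Qj": 8, "Qk": 13},
    "age based (bias 1)": {"Qi": 3, "Qj": 7, "Qk": 7},
    "contention based (bias 0)": {"Qi": 7, "Qj": 5, "Qk": 6},
}

# Mean completion time per scheduling policy.
averages = {
    policy: sum(times.values()) / len(times)
    for policy, times in completion.items()
}

for policy, avg in averages.items():
    print(f"{policy}: {avg:.2f} sec")
```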

  13. Throughput Performance

  14. Tuning the age bias • Throughput performance gap grows while response time gap is insensitive to saturation • Increasing age bias is more attractive at low saturation

  15. Parameter tuning using trade-off curves

  16. Discussion • Impact of caching strategies • Workload overflow • Large intermediate join results • Migrate pairs of workload and bucket • Beyond completion order • Higher priority for interactive queries • Batch processing in a clustered environment • Reference: P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.

  17. WHAT ABOUT US?

  18. Filter and refine • Partition data into buckets

  19. Average Response Time

  20. Outline • Motivation • Goals for data-driven, batch scheduling • Target application (SkyQuery) • LifeRaft scheduler • Filter and refine queries • Throughput maximizing metric • Starvation resistance • Differences in outcomes • Workload adaptive parameter selection
