Network-Aware Join Processing in Global-Scale Database Federations

This presentation covers network-aware join processing in global-scale database federations, focusing on join scheduling in SkyQuery, a publicly accessible federation of sky surveys. It describes how to incorporate network structure into scheduling, introduces a balanced network utilization metric, and presents scheduling algorithms: a two-approximate, MST-based solution and heuristic extensions (clustering, semi-joins, bushy plans).

Presentation Transcript


  1. Network-Aware Join Processing in Global-Scale Database Federations. X. Wang, R. Burns, A. Terzis (Johns Hopkins University); A. Deshpande (University of Maryland)

  2. Time/Cost Trade-off for Reaching Isla Mujeres [Diagram: map showing "You are here", Downtown, Puerto Juarez, Playa Tortuga, and Isla Mujeres. Leg labels: 5 min, 30p; 30 min, 70p; 20 min, 7p; 35 min, 150p; 5 min, 7p. Route totals: 55 min, 107 pesos; 40 min, 157 pesos.]

  3. Outline • Target Application • Join scheduling in SkyQuery • Incorporating network structure • Balanced network utilization metric • Exploit high throughput paths • Limitations • Algorithms • Two-approximate, MST-based solution • Heuristic extensions (clustering, semi-joins, bushy plans)

  4. SkyQuery • Publicly accessible federation of sky surveys (a virtual telescope) • Autonomous, heterogeneous, and geographically distributed sites (30 across NA, EA, EU) • Data intensive workload • Terabyte data sets • Hundred megabyte intermediate join results • Queries take ten to over a hundred seconds • Network transfers consume up to 70% of the time • Principal federated query is cross-match

  5. Cross-Match Queries • Join by increasing cardinality (count *) • Minimal I/O • Fewer bytes on the network [Diagram labels: Mediator; Query; Probe Query; Result (three times); Count: 800; Count: 100; Count: 30]
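The "join by increasing cardinality" rule on the cross-match slide can be sketched as a sort over probe counts. This is only an illustration: the site names are hypothetical, and the counts are the three values shown on the slide.

```python
def cross_match_order(counts):
    """Return site names sorted by ascending probe count (count *)."""
    return sorted(counts, key=counts.get)


# Hypothetical site names; the counts are the ones shown on the slide.
probe_counts = {"SiteA": 800, "SiteB": 100, "SiteC": 30}

# The join would start at the site with the fewest matching rows, so the
# smallest intermediate results are the ones shipped over the network.
print(cross_match_order(probe_counts))
```

Note that this ordering looks only at cardinality and ignores where the sites sit on the network, which is the gap the following slides address.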

  6. Incorporating Network Structure

  7. Balanced Network Utilization Metric • Exploits excess capacity and avoids long-haul paths • Minimizes aggregate time on the network • Similar metrics used for stream processing, multicast, and optimal link-layer routing (Bertsekas & Gallager) • Minimizes response time for serial schedules • Avoids overutilizing resources for bushy schedules • Does not account for I/O
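One way to read "aggregate time on the network" for a serial schedule is as the sum, over hops, of bytes shipped divided by path throughput. The sketch below assumes that reading; the cost model, function name, site names, and numbers are all hypothetical and are not taken from the slides.

```python
def network_time(schedule, size_mb, throughput_mb_per_s):
    """Sum transfer time in seconds over the hops of a serial schedule.

    schedule: site names in visiting order
    size_mb: size_mb[i] is the intermediate result (MB) shipped on hop i
    throughput_mb_per_s: maps (src, dst) to throughput in MB/s
    """
    total = 0.0
    for i in range(len(schedule) - 1):
        hop = (schedule[i], schedule[i + 1])
        total += size_mb[i] / throughput_mb_per_s[hop]
    return total


# Hypothetical throughputs: A->C is a slow long-haul path.
throughput = {("A", "B"): 10, ("B", "C"): 5, ("A", "C"): 1, ("C", "B"): 5}
sizes = [100, 50]  # MB shipped on the first and second hop

print(network_time(["A", "B", "C"], sizes, throughput))  # avoids the slow path
print(network_time(["A", "C", "B"], sizes, throughput))  # uses the slow path first
```

Under this assumed model, a schedule that routes the large first transfer over the high-throughput path spends far less time on the network, even though both schedules visit the same sites.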

  8. How to Extract Network Structure?

  9. Volatility in TCP Throughput

  10. Limitations • Perfect join selectivity assumption • Observations against the same sky • Allows for polynomial-time solutions • No attribute aggregation • Address heuristically • Local optimizations at the mediators • Decentralized to achieve scale using aggregate stats • Routing at the application layer • Improve end performance and preserve I/O

  11. Spanning Tree Approximation (STA) [Figure: graph of sites A-H and the mediator, with a node marked "min"]

  12. STA: Find MST [Figure: graph of sites A-H and the mediator, with a node marked "min"]

  13. STA: Join Using Paths on the MST [Figure: graph of sites A-H and the mediator, with a node marked "min"; steps numbered 1-13]

  14. STA: Shortcutting in Metric Regions [Figure: graph of sites A-H and the mediator, with a node marked "min"; steps numbered 1-10]
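The three STA slides (find an MST, join along its paths, shortcut where the metric allows) resemble the classic MST-walk construction, where a preorder walk with shortcuts costs at most twice the tree weight when costs obey the triangle inequality. The sketch below shows that generic construction, not the authors' implementation; the site names, positions, and the symmetric cost function are hypothetical.

```python
import heapq


def minimum_spanning_tree(nodes, cost):
    """Prim's algorithm; cost[(u, v)] is a symmetric edge cost.

    Returns the tree as adjacency lists.
    """
    start = nodes[0]
    tree = {n: [] for n in nodes}
    visited = {start}
    heap = [(cost[(start, v)], start, v) for v in nodes if v != start]
    heapq.heapify(heap)
    while heap and len(visited) < len(nodes):
        _, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # stale entry: v was reached by a cheaper edge
        visited.add(v)
        tree[u].append(v)
        tree[v].append(u)
        for x in nodes:
            if x not in visited:
                heapq.heappush(heap, (cost[(v, x)], v, x))
    return tree


def shortcut_walk(tree, start):
    """Preorder walk of the tree; skipping already-seen sites is the shortcut."""
    order, stack, seen = [], [start], set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        stack.extend(sorted(tree[u], reverse=True))
    return order


# Hypothetical sites placed on a line; cost is the distance between them.
pos = {"A": 0, "B": 1, "C": 3, "D": 6}
nodes = sorted(pos)
cost = {(u, v): abs(pos[u] - pos[v]) for u in nodes for v in nodes if u != v}
tree = minimum_spanning_tree(nodes, cost)
print(shortcut_walk(tree, "A"))
```

In a federation, the edge cost would presumably come from measured throughput between sites rather than a geometric distance, and shortcuts are only safe in regions where those costs behave like a metric.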

  15. C-STA: Clustering TCP Throughput

  16. C-STA: Clustering TCP Throughput

  17. C-STA: Combine STA & Count * [Figure: graph of sites A-H and the mediator, with a node marked "min"; steps numbered 1-9]

  18. STA-SJ: Semi-joins and Attribute Agg. [Figure: graph of sites A-H and the mediator, with a node marked "min"; steps numbered 1-13; annotations "Join Attr." and "Aggregation"]
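The semi-join idea behind STA-SJ is that only the join attribute crosses the network, and the remote site returns just its matching rows. A minimal sketch follows, assuming an equality join on a hypothetical `obj_id` key; an actual cross-match is a spatial match between observations, which this simplification does not model.

```python
def semi_join(local_rows, remote_rows, key):
    """Reduce remote_rows to those whose key value appears in local_rows.

    Only the set of key values would cross the network, not full local rows.
    """
    keys = {row[key] for row in local_rows}  # projection shipped to the remote site
    return [row for row in remote_rows if row[key] in keys]  # reduced remote relation


# Hypothetical rows; `obj_id` and `mag` are illustrative column names.
local = [{"obj_id": 1}, {"obj_id": 3}]
remote = [
    {"obj_id": 1, "mag": 17.2},
    {"obj_id": 2, "mag": 18.0},
    {"obj_id": 3, "mag": 16.5},
]
print(semi_join(local, remote, "obj_id"))
```

The trade-off is an extra round trip in exchange for smaller transfers, which pays off when the join attribute is much narrower than the full rows.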

  19. STA-BP: Exploring Bushy Plans • Poly-time DP Algorithm that explores bushy plans using MST paths • Evaluates regions in parallel when beneficial (avoids sending data down the tree) • May operate on larger intermediate results • Intuition: Do not need to traverse STA paths twice if sites have low cardinality [Diagram labels: R; R; ≤ 2R; > 2R]

  20. Experiments: Network Utilization

  21. Experiments: I/O Overhead

  22. Experiments: Algorithms Compared

  23. Discussion • DP solution w/o selectivity, aggregation, MST-based assumptions – T: O(n·3^n), S: O(n·2^n) • Applicability beyond SkyQuery (distributed OLAP/DSS) • May tolerate exponential complexity • Value in capturing network structure • Don’t address multi-query optimization • Incomplete info about link layer • Global knowledge incurs high overhead
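Assuming the time bound on the discussion slide reads as $O(n \cdot 3^n)$, the $3^n$ factor is what one would expect from a dynamic program over subsets of sites that, for each subset, enumerates every way of splitting it into two sub-plans. The count below is the standard binomial identity; the origin of the leading factor $n$ is not stated in the slides.

```latex
% A subset S with |S| = k has 2^k ways to choose a (possibly empty) part S_1 \subseteq S.
% Summing over all subsets of the n sites:
\sum_{k=0}^{n} \binom{n}{k} \, 2^{k}
  = \sum_{k=0}^{n} \binom{n}{k} \, 2^{k} \, 1^{\,n-k}
  = (2 + 1)^{n}
  = 3^{n}
```

The $2^n$ factor in the space bound would similarly match a table holding an entry for each subset of sites.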

  24. Which Path to Choose? [Diagram: map showing "You are here", Downtown, Puerto Juarez, Playa Tortuga, and Isla Mujeres. Leg labels: 5 min, 30p; 30 min, 70p; 20 min, 7p. Route total: 55 min, 107 pesos.]

  25. Questions ???
