This paper discusses network-aware join processing in global-scale database federations, with a focus on join scheduling in SkyQuery. It explores incorporating network structure, balanced network utilization metrics, and algorithms for achieving optimal join performance. The study is based on the SkyQuery system, a publicly accessible federation of sky surveys.
Network-Aware Join Processing in Global-Scale Database Federations
X. Wang, R. Burns, A. Terzis (Johns Hopkins University)
A. Deshpande (University of Maryland)
Time/Cost Trade-off for Reaching Isla Mujeres
[Figure: route map from "You are here" to Isla Mujeres through Downtown, Puerto Juarez, and Playa Tortuga. Leg labels: 5 min, 30p; 30 min, 70p; 20 min, 7p; 35 min, 150p; 5 min, 7p. Route totals: 55 min, 107 pesos and 40 min, 157 pesos.]
Outline
• Target Application
  • Join scheduling in SkyQuery
• Incorporating network structure
  • Balanced network utilization metric
  • Exploit high throughput paths
  • Limitations
• Algorithms
  • Two-approximate, MST-based solution
  • Heuristic extensions (clustering, semi-joins, bushy plans)
SkyQuery
• Publicly accessible federation of sky surveys (a virtual telescope)
• Autonomous, heterogeneous, and geographically distributed sites (30 across NA, EA, EU)
• Data-intensive workload
  • Terabyte data sets
  • Hundred-megabyte intermediate join results
• Queries take ten to over a hundred seconds
  • Network transfers consume up to 70% of the time
• Principal federated query is cross-match
Cross-Match Queries
• Join by increasing cardinality (count *)
  • Minimal I/O
  • Fewer bytes on the network
[Figure: a mediator sends a query and probe queries to three sites and results flow back; the sites are labeled Count: 800, Count: 100, and Count: 30.]
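The count-based ordering on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SkyQuery's implementation: the function name `count_star_order` and the site names are hypothetical, and the counts are the ones shown in the figure.

```python
def count_star_order(counts):
    """Order sites by increasing count(*) estimate.

    `counts` maps a site name to the number of rows its probe query
    reported. Joining the smallest sites first keeps intermediate
    results small, which is the idea stated on the slide.
    """
    return sorted(counts, key=counts.get)


# Hypothetical site names; the counts are the ones in the figure.
probe_counts = {"site_a": 800, "site_b": 100, "site_c": 30}
print(count_star_order(probe_counts))
```

Note that this ordering looks only at cardinality and ignores where the sites sit on the network, which is the gap the rest of the talk addresses.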
Balanced Network Utilization Metric
• Exploit excess capacity and avoid long-haul paths
• Minimizes aggregate time on the network
• Similar metrics used for stream processing, multicast, and optimal link-layer routing (Bertsekas & Gallager)
• Minimizes response time for serial schedules
• Avoids over-utilizing resources for bushy schedules
• Does not account for I/O
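The slide does not give a formula for the metric. One plausible reading of "aggregate time on the network" for a serial schedule is the sum of per-hop transfer times, sketched below. Everything here is an assumption made for illustration: the function name, the `throughput` table keyed by (source, destination) pairs, and a fixed intermediate result size on every hop.

```python
def aggregate_network_time(schedule, throughput, result_bytes):
    """Sum transfer times over consecutive hops of a serial schedule.

    schedule     -- list of site names in visiting order (hypothetical)
    throughput   -- dict mapping (src, dst) to bytes per second (assumed known)
    result_bytes -- size of the intermediate result shipped on each hop,
                    assumed constant for simplicity
    """
    total = 0.0
    for src, dst in zip(schedule, schedule[1:]):
        total += result_bytes / throughput[(src, dst)]
    return total


# Two hops: a slow link (10 MB/s) followed by a fast one (50 MB/s).
links = {("a", "b"): 10e6, ("b", "mediator"): 50e6}
print(aggregate_network_time(["a", "b", "mediator"], links, 100e6))
```

Under this reading, a schedule that routes around the slow link scores better even when it visits sites out of count order, and disk I/O does not appear in the cost at all.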
Limitations
• Perfect join selectivity assumption
  • Observations against the same sky
  • Allows for polynomial-time solutions
• No attribute aggregation
  • Address heuristically
• Local optimizations at the mediators
  • Decentralized to achieve scale using aggregate stats
• Routing at the application layer
  • Improve end performance and preserve I/O
Spanning Tree Approximation (STA)
[Figure: network graph of sites A through H and the mediator, with one node marked "min".]
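STA: Find MST
[Figure: the same graph of sites A through H and the mediator, with the minimum spanning tree highlighted and one node marked "min".]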
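The MST step can be illustrated with Prim's algorithm. This is a generic sketch, not the authors' code: the slides do not say which MST algorithm is used, and the graph, its edge weights, and the function name are made up for the example. Edge weights stand for whatever per-link network cost the metric assigns.

```python
import heapq


def minimum_spanning_tree(graph, root):
    """Prim's algorithm on an undirected weighted graph.

    graph -- dict of dicts: graph[u][v] is the cost of link u-v,
             stored symmetrically
    root  -- node to grow the tree from
    Returns a list of (parent, child, weight) tree edges.
    """
    visited = {root}
    frontier = [(w, root, v) for v, w in graph[root].items()]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(graph):
        w, u, v = heapq.heappop(frontier)
        if v in visited:
            continue  # stale entry: v was reached by a cheaper edge
        visited.add(v)
        tree.append((u, v, w))
        for nxt, w2 in graph[v].items():
            if nxt not in visited:
                heapq.heappush(frontier, (w2, v, nxt))
    return tree


# Hypothetical four-node network: a mediator M and sites A, B, C.
network = {
    "M": {"A": 1, "B": 4},
    "A": {"M": 1, "B": 2, "C": 5},
    "B": {"M": 4, "A": 2, "C": 3},
    "C": {"A": 5, "B": 3},
}
print(minimum_spanning_tree(network, "M"))
```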
STA: Join Using Paths on the MST
[Figure: the graph of sites A through H and the mediator, with numbered labels 1 through 13 along the MST paths and one node marked "min".]
STA: Shortcutting in Metric Regions
[Figure: the graph of sites A through H and the mediator, with numbered labels 1 through 10 and one node marked "min".]
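Shortcutting can be illustrated with the classic construction behind MST-based two-approximations: walk the tree depth-first and, instead of backtracking through sites already visited, go straight to the next unvisited one. In a metric space the triangle inequality guarantees the direct hop costs no more than the tree path it replaces. The sketch below assumes that construction; whether STA shortcuts in exactly this way is not stated on the slide, and the function name and example tree are hypothetical.

```python
from collections import defaultdict


def shortcut_order(tree_edges, root):
    """Return the depth-first preorder of a spanning tree.

    tree_edges -- list of (u, v, weight) edges of the tree
    root       -- node where the walk starts
    Visiting nodes in this order, hopping directly between consecutive
    nodes, skips every backtracking step of the full tree walk.
    """
    neighbors = defaultdict(list)
    for u, v, _weight in tree_edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    order, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push in reverse so neighbors are visited in sorted order.
        for nxt in sorted(neighbors[node], reverse=True):
            if nxt not in seen:
                stack.append(nxt)
    return order


# Hypothetical tree: walking it fully gives M, A, B, A, C;
# with shortcutting the hop B -> A -> C becomes B -> C.
edges = [("M", "A", 1), ("A", "B", 2), ("A", "C", 3)]
print(shortcut_order(edges, "M"))
```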
C-STA: Combine STA & Count *
[Figure: the graph of sites A through H and the mediator, with numbered labels 1 through 9 and one node marked "min".]
STA-SJ: Semi-joins and Attribute Agg.
[Figure: the graph of sites A through H and the mediator, with numbered labels 1 through 13, one node marked "min", and a callout reading "Join Attr. Aggregation".]
STA-BP: Exploring Bushy Plans
• Polynomial-time DP algorithm that explores bushy plans using MST paths
• Evaluates regions in parallel when beneficial (avoids sending data down the tree)
• May operate on larger intermediate results
• Intuition: no need to traverse STA paths twice if sites have low cardinality
[Figure: diagram with labels R, R, ≤ 2R, and > 2R.]
Discussion
• DP solution w/o selectivity, aggregation, MST-based assumptions
  – T: O(n·3^n), S: O(n·2^n)
• Applicability beyond SkyQuery (distributed OLAP/DSS)
  • May tolerate exponential complexity
  • Value in capturing network structure
• Doesn't address multi-query optimization
• Incomplete info about link layer
• Global knowledge incurs high overhead
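The 3^n factor in the time bound is what one would expect from a dynamic program that, for every subset of the n sites, tries every way of splitting it into two parts. That the talk's DP has this shape is an assumption; the short sketch below only checks the counting argument, using a hypothetical helper name. A subset with k sites has 2^k sub-subsets, and summing over all subsets gives 3^n.

```python
def count_subset_splits(n):
    """Count (subset, sub-subset) pairs over n items.

    Each pair is one candidate split a subset-based DP would examine.
    The inner loop uses the standard submask-enumeration trick.
    """
    splits = 0
    for mask in range(1 << n):
        sub = mask
        while True:
            splits += 1          # count this sub-subset of mask
            if sub == 0:
                break
            sub = (sub - 1) & mask  # next smaller submask
    return splits


for n in range(1, 6):
    print(n, count_subset_splits(n), 3 ** n)
```

The table of best plans holds one entry per subset, which accounts for the 2^n factor in the space bound.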
Which Path to Choose?
[Figure: route map from "You are here" to Isla Mujeres through Downtown, Puerto Juarez, and Playa Tortuga. Leg labels: 5 min, 30p; 30 min, 70p; 20 min, 7p. Route total: 55 min, 107 pesos.]
Questions ???