200 likes | 215 Views
This paper discusses the challenges and techniques for join processing in large-scale scientific federations, focusing on throughput optimization and global-scale join processing in heterogeneous networks. The authors propose a new metric and optimization goal to balance network usage and computation, and introduce algorithms for identifying network structure and optimizing join processing in non-uniform and non-metric networks.
E N D
Throughput-Optimized, Global-Scale Join Processing in Scientific Federations Xiaodan Wang, Randal Burns, and Andreas Terzis The Johns Hopkins University
Data volume and geography deter scalability • Performance is network bound • Intermediate results are often hundreds of megabytes • 30 sites across North America, Europe, Asia • Community has identified 100 sites to be included
Join Processing in Heterogeneous Networks • Query plans optimized for scalability • Without latency/response time constraints • On global-scale, heterogeneous networks • For applications that transfer hundreds of MBs among continents • Balanced utilization of all network paths • A new query optimization goal (metric) • Exploit excess capacity where available • Avoid narrow, long-haul paths when possible • Join processing techniques and algorithms • Identifying network structure: clusters of sites and path throughput • Optimize for non-uniform and non-metric networks • Balance network usage and computation
Why do we need a new metric? • Minimizing response time • Consumes all available resources to achieve the goal • Minimizing computation costs • Does not address network bound applications • Minimizing the volume of network traffic • Insensitive to network heterogeneity • And we are concerned with • Polynomial-time algorithms for large-scale federations • Avoiding multi-objective optimization
count * • SkyQuery’s computation oriented optimization • Schedule sites in order of increasing cardinality • Minimizes computation costs under several assumptions • Perfect join selectivity (holds in practice) • Computation costs linear in the size of intermediate results (because it’s an index join) • Occasionally transfers data across the Atlantic multiple times
Balanced Network Utilization • Cost of using a path is product of the volume of data transmitted and the inverse TCP throughput • Cost of a schedule is the sum over all paths • Takes advantage of path heterogeneity • By using higher-throughput paths proportionally more • Reduces contention on narrow, long-haul paths • By making them costly • But, its not a direct measure of scalability • Does not load balance paths over multiple queries
Path Throughput • Measure throughput among all federation sites pairwise • Using a nearby PlanetLab proxy site for each SkyQuery site • 3 times a day, bulk TCP transfer • TCP throughput reflects geography • Dominant 1/distance trend correlates well with 1/RTT • But, highly non-metric • Input to scheduling
Throughput Stability • Should we measure throughput more often? • Accurate measurements are intrusive (bulk-transfer) • Short duration measures are error prone (cross-traffic) • The most volatile paths are stable • <30% throughput variation
Join Scheduling • Assumptions • Accurate cardinality estimates • Perfect join selectivity • Ignore the effect of attribute aggregation • Simplify one aspect of optimization (selectivity) in order to consider non-uniform, non-metric networks • cannot use Dynamic Programming in this environment as it lacks sub-problem optimality • Two algorithms based on Minimum Spanning Trees • Two-approximate balanced network utilization • Clustering variant defines computation and utilization trade-offs
Spanning Tree Approximation (STA) • Inputs: pairwise throughputs, site cardinalities, and a node to which we deliver results • Min: node with lowest cardinality
Spanning Tree Approximation (STA) • Construct a minimum spanning tree
Achieving the Bound • From min to sink visiting all sites • Cost(STA) 2*cost (MST) 2*OPT • Same intuition as 2-approximate Euclidean TSP • STA can visit each site more than once • Applies to non-metric networks
Heuristic Improvement • For paths on which the triangle inequality holds • Route directly to next unvisited node • 30% improvement in practice • Identify and use metric regions in the network
Clustered-STA • Well-connected clusters separated by narrow, long-haul paths • Optimize for computation inside clusters (count *) • Optimize balanced network utilization among clusters (STA)
Clustering Sites • Organize sites using Bond-Energy Algorithm • Minimize difference between adjacent elements • Extract clusters with a threshold • 3 Mbps produces 6 clusters for 30 SkyQuery sites • Define computation versus utilization tradeoff • By tuning the extraction threshold
Network Utilization • Results are independent of assumptions • OPT is best serial plan • STA often finds OPT plan • C-STA performs poorly within clusters • Also poor on narrow paths due to attribute aggregation
Computation Time • count * represents a “soft” lower bound • C-STA reduces computation costs
Discussion • Balanced network utilization metric captures path heterogeneity • Avoids narrow, long-haul paths • Scheduling algorithms of low complexity • OPT is a viable alternative for serial plans • Limitations of C-STA • Does not really create meaningful utilization/computation tradeoffs • Threshold can only find natural clusters • Systematically aggregate attributes in each cluster • Semi-joins address these limitations • Extending this work to parallel schedules • Applicability to other workloads? OLAP?
A World-Wide Telescope • Federations of sky surveys make the world’s best telescope • whole sky coverage • multi-spectral (optical, radio, infrared, x-ray) • data are always available (no clouds, no moon, day or night) • Multi-spectral and temporal experiments have already lead to many new discoveries
The Crossmatch Query SELECT O.object_id, O.right_accession, T.object_id FROM SDSS:Photo_Object O, TWOMASS:Photo_Primary T, FIRST:Primary_Object P WHERE AREA (185.0,-0.5,4.5) AND XMATCH (O,T,P) <3.5 AND O.type= GALAXY AND (O.i_flux - T.i_flux)>2}