Scalable Join Processing in Global-Scale Scientific Federations

Throughput-Optimized, Global-Scale Join Processing in Scientific Federations Xiaodan Wang, Randal Burns, and Andreas Terzis The Johns Hopkins University

Data volume and geography deter scalability • Performance is network bound • Intermediate results are often hundreds of megabytes • 30 sites across North America, Europe, Asia • Community has identified 100 sites to be included

Join Processing in Heterogeneous Networks • Query plans optimized for scalability • Without latency/response time constraints • On global-scale, heterogeneous networks • For applications that transfer hundreds of MBs among continents • Balanced utilization of all network paths • A new query optimization goal (metric) • Exploit excess capacity where available • Avoid narrow, long-haul paths when possible • Join processing techniques and algorithms • Identifying network structure: clusters of sites and path throughput • Optimize for non-uniform and non-metric networks • Balance network usage and computation

Why do we need a new metric? • Minimizing response time • Consumes all available resources to achieve the goal • Minimizing computation costs • Does not address network bound applications • Minimizing the volume of network traffic • Insensitive to network heterogeneity • And we are concerned with • Polynomial-time algorithms for large-scale federations • Avoiding multi-objective optimization

count * • SkyQuery’s computation oriented optimization • Schedule sites in order of increasing cardinality • Minimizes computation costs under several assumptions • Perfect join selectivity (holds in practice) • Computation costs linear in the size of intermediate results (because it’s an index join) • Occasionally transfers data across the Atlantic multiple times

Balanced Network Utilization • Cost of using a path is product of the volume of data transmitted and the inverse TCP throughput • Cost of a schedule is the sum over all paths • Takes advantage of path heterogeneity • By using higher-throughput paths proportionally more • Reduces contention on narrow, long-haul paths • By making them costly • But, its not a direct measure of scalability • Does not load balance paths over multiple queries

Path Throughput • Measure throughput among all federation sites pairwise • Using a nearby PlanetLab proxy site for each SkyQuery site • 3 times a day, bulk TCP transfer • TCP throughput reflects geography • Dominant 1/distance trend correlates well with 1/RTT • But, highly non-metric • Input to scheduling

Throughput Stability • Should we measure throughput more often? • Accurate measurements are intrusive (bulk-transfer) • Short duration measures are error prone (cross-traffic) • The most volatile paths are stable • <30% throughput variation

Join Scheduling • Assumptions • Accurate cardinality estimates • Perfect join selectivity • Ignore the effect of attribute aggregation • Simplify one aspect of optimization (selectivity) in order to consider non-uniform, non-metric networks • cannot use Dynamic Programming in this environment as it lacks sub-problem optimality • Two algorithms based on Minimum Spanning Trees • Two-approximate balanced network utilization • Clustering variant defines computation and utilization trade-offs

Spanning Tree Approximation (STA) • Inputs: pairwise throughputs, site cardinalities, and a node to which we deliver results • Min: node with lowest cardinality

Spanning Tree Approximation (STA) • Construct a minimum spanning tree

Achieving the Bound • From min to sink visiting all sites • Cost(STA)  2*cost (MST)  2*OPT • Same intuition as 2-approximate Euclidean TSP • STA can visit each site more than once • Applies to non-metric networks

Heuristic Improvement • For paths on which the triangle inequality holds • Route directly to next unvisited node • 30% improvement in practice • Identify and use metric regions in the network

Clustered-STA • Well-connected clusters separated by narrow, long-haul paths • Optimize for computation inside clusters (count *) • Optimize balanced network utilization among clusters (STA)

Clustering Sites • Organize sites using Bond-Energy Algorithm • Minimize difference between adjacent elements • Extract clusters with a threshold • 3 Mbps produces 6 clusters for 30 SkyQuery sites • Define computation versus utilization tradeoff • By tuning the extraction threshold

Network Utilization • Results are independent of assumptions • OPT is best serial plan • STA often finds OPT plan • C-STA performs poorly within clusters • Also poor on narrow paths due to attribute aggregation

Computation Time • count * represents a “soft” lower bound • C-STA reduces computation costs

Discussion • Balanced network utilization metric captures path heterogeneity • Avoids narrow, long-haul paths • Scheduling algorithms of low complexity • OPT is a viable alternative for serial plans • Limitations of C-STA • Does not really create meaningful utilization/computation tradeoffs • Threshold can only find natural clusters • Systematically aggregate attributes in each cluster • Semi-joins address these limitations • Extending this work to parallel schedules • Applicability to other workloads? OLAP?

A World-Wide Telescope • Federations of sky surveys make the world’s best telescope • whole sky coverage • multi-spectral (optical, radio, infrared, x-ray) • data are always available (no clouds, no moon, day or night) • Multi-spectral and temporal experiments have already lead to many new discoveries

The Crossmatch Query SELECT O.object_id, O.right_accession, T.object_id FROM SDSS:Photo_Object O, TWOMASS:Photo_Primary T, FIRST:Primary_Object P WHERE AREA (185.0,-0.5,4.5) AND XMATCH (O,T,P) <3.5 AND O.type= GALAXY AND (O.i_flux - T.i_flux)>2}

Scalable Join Processing in Global-Scale Scientific Federations

Scalable Join Processing in Global-Scale Scientific Federations

Presentation Transcript

Processing Data Intensive Queries in Scientific Database Federations

Global Scale Winds

Large-scale Processing with MapReduce

Authenticated Join Processing in Outsourced Databases

High Throughput Scientific Computing with Condor: Computer Science Challenges in Large Scale Parallelism

Cloud Federations

Large-scale Data Processing Challenges

The Global Scale

Efficient Temporal Join Processing using Indices

Cache-aware batch scheduling policies for large-scale scientific data processing

Large scale data processing

Governance in Identity Management Federations

Global Scientific Flows

Dynamic Federations

Revisiting Pipelined Parallelism in Multi-Join Query Processing

High Throughput and Large Scale Proteomics Analysis

GP Federations

Processing Data-Intensive Queries in Petabyte-Scale Scientific Databases

Federations in a WebAuthn World

Network-Aware Join Processing in Global-Scale Database Federations