A Fault-Tolerant Environment for Large-Scale Query Processing

Mehmet Can Kurt, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
HiPC’12, Pune, India

Motivation
• the “big data” problem
• Walmart handles 1 million customer transactions every hour; the estimated data volume is 2.5 petabytes
• Facebook handles more than 40 billion images
• LSST generates 6 petabytes every year
• massive parallelism is the key

Motivation
• Mean Time To Failure (MTTF) decreases
• typical first year for a new cluster*:
  • 1000 individual machine failures
  • 1 PDU failure (~500-1000 machines suddenly disappear)
  • 20 rack failures (40-80 machines disappear, 1-6 hours to get back)
* taken from Jeff Dean’s talk in Google IO (http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx)

Our Work
• supporting fault-tolerant query processing and data analysis for a massive scientific dataset
• focusing on two specific query types:
  • range queries on spatial datasets
  • aggregation queries on point datasets
• supported failure types: single-machine failures and rack failures*
* rack: a number of machines connected to the same hardware (network switch, …)

Our Work: Primary Goals
• high efficiency of execution when there are no failures (indexing if applicable, ensuring load balance)
• handling failures efficiently up to a certain number of nodes (low-overhead fault tolerance through data replication)
• only a modest slowdown in processing times after recovering from a failure (preserving load balance)

Range Queries on Spatial Data
• nature of the task:
  • each data object is a rectangle in 2D space
  • each query is defined with a rectangle
  • return intersecting data rectangles
• computational model:
  • master/worker model
  • master serves as coordinator
  • each worker is responsible for a portion of the data
[Figure: a query rectangle in X-Y space; the master forwards queries to the workers, each holding its own data]
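The task on this slide can be illustrated with a minimal sketch of what a single worker might do for one query: scan its rectangles and return those that intersect the query rectangle. The names Rect, intersects and range_query are hypothetical, and treating rectangles that only touch along an edge as intersecting is an assumption; the slides do not say how that case is handled.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in 2D space (hypothetical representation)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Rect, b: Rect) -> bool:
    """Return True if the two rectangles overlap; touching edges count (assumption)."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def range_query(data: List[Rect], query: Rect) -> List[Rect]:
    """Linear scan over one worker's data, keeping the intersecting rectangles."""
    return [r for r in data if intersects(r, query)]
```

In the system described here the master would first filter chunks through its index, so a worker would scan only the chunks it was told to look at rather than all of its data.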

Range Queries on Spatial Data
• data organization:
  • a chunk is the smallest data unit
  • create chunks by grouping data objects together
  • assign chunks to workers in round-robin fashion
[Figure: the X-Y space divided into chunks 1-4*, assigned to two workers]
* actual number of chunks depends on the chunk size parameter.
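A small sketch of this data organization, assuming objects are kept in a list and chunks and workers are numbered from 0. The function names are hypothetical, and chunk_size stands for the chunk size parameter mentioned in the footnote.

```python
def make_chunks(objects, chunk_size):
    """Group consecutive objects into chunks of at most chunk_size objects."""
    return [objects[i:i + chunk_size] for i in range(0, len(objects), chunk_size)]


def assign_round_robin(num_chunks, num_workers):
    """Map each worker id to the list of chunk ids it stores (chunk c goes to worker c mod N)."""
    assignment = {w: [] for w in range(num_workers)}
    for c in range(num_chunks):
        assignment[c % num_workers].append(c)
    return assignment
```

With 4 chunks and 2 workers, this gives worker 0 the chunks 0 and 2 and worker 1 the chunks 1 and 3.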

Range Queries on Spatial Data
• ensuring load balance:
  • enumerate and sort data objects according to the Hilbert space-filling curve, then pack the sorted data objects into chunks
• spatial index support:
  • Hilbert R-Tree deployed on the master node
  • leaf nodes correspond to data chunks
  • initial filtering at the master, which tells workers which chunks to look at
[Figure: objects o1-o8 along a Hilbert curve; sorted objects: o1, o3, o8, o6, o2, o7, o4, o5; packed into chunks 1-4]
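The sort-then-pack step can be sketched as follows. This uses the standard iterative mapping from grid cell to position along the Hilbert curve. It assumes each object has already been reduced to one integer grid cell (for example the cell containing its rectangle's center), which the slides do not specify. grid_size must be a power of two, and all names are hypothetical.

```python
def hilbert_index(grid_size, x, y):
    """Position of cell (x, y) along the Hilbert curve on a grid_size x grid_size grid."""
    d = 0
    s = grid_size // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x = grid_size - 1 - x
                y = grid_size - 1 - y
            x, y = y, x
        s //= 2
    return d


def pack_by_hilbert(cells, grid_size, chunk_size):
    """Sort cells by Hilbert position, then cut the sorted list into chunks."""
    ordered = sorted(cells, key=lambda c: hilbert_index(grid_size, c[0], c[1]))
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
```

Because consecutive positions on the curve are neighboring cells, each chunk tends to cover a compact region, which is what lets the leaf level of the Hilbert R-Tree filter chunks effectively.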

Range Queries on Spatial Data
• Fault-Tolerance Support: Sub-chunk Replication
  • step 1: divide each data chunk into k sub-chunks
  • step 2: distribute the sub-chunks in round-robin fashion*
[Figure: Workers 1-4 each hold one chunk; with k = 2, each chunk i is split into sub-chunks chunk i,1 and chunk i,2, which are placed on other workers]
* rack failure: same approach, but distribute the sub-chunks to nodes in a different rack
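The two steps can be sketched as below. The exact placement rule is not recoverable from the slide, so this sketch assumes one primary chunk per worker and places replica j of the chunk owned by worker w on worker (w + 1 + j) mod N. That rule and all function names are hypothetical.

```python
def split_into_subchunks(chunk, k):
    """Step 1: split a chunk (list of objects) into k nearly equal contiguous sub-chunks."""
    base, extra = divmod(len(chunk), k)
    parts, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        parts.append(chunk[start:start + size])
        start += size
    return parts


def place_replicas(num_workers, k):
    """Step 2: map (owner, sub-chunk index) to the worker holding that replica."""
    if k >= num_workers:
        raise ValueError("need k < num_workers so no replica lands on its owner")
    return {(owner, j): (owner + 1 + j) % num_workers
            for owner in range(num_workers) for j in range(k)}


def takeover_after_failure(placement, failed):
    """Workers that pick up the failed worker's sub-chunks, one sub-chunk each."""
    return sorted(h for (owner, _), h in placement.items() if owner == failed)
```

The point of splitting before replicating is visible in takeover_after_failure: the failed worker's load is spread over k surviving workers instead of landing on a single one, as it would with whole-chunk replication.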

Range Queries on Spatial Data
• Fault-Tolerance Support: Bookkeeping
  • add a sub-leaf level to the bottom of the Hilbert R-Tree
  • the Hilbert R-Tree serves both as a filtering structure and as a failure management tool

Aggregation Queries on Point Data
• nature of the task:
  • each data object is a point in 2D space
  • each query is defined with a dimension (X or Y) and an aggregation function (SUM, AVG, …)
• computational model:
  • master/worker model
  • divide the space into M partitions
  • no indexing support
  • standard 2-phase algorithm: local and global aggregation
[Figure: X-Y space divided into M = 4 partitions held by workers 1-4, with the partial result computed in worker 2 highlighted]
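A minimal sketch of the 2-phase algorithm, under two assumptions the slides do not spell out: each point carries a numeric value, and aggregating along X collapses the X coordinate so that one result is produced per Y coordinate (and vice versa). Each worker keeps a (sum, count) pair per key so that both SUM and AVG can be finished in the global phase. All names are hypothetical.

```python
from collections import defaultdict


def local_aggregate(points, axis):
    """Phase 1 (on a worker): partial [sum, count] per remaining coordinate.

    points: iterable of (x, y, value); axis: 'X' or 'Y', the dimension to collapse.
    """
    partial = defaultdict(lambda: [0.0, 0])
    for x, y, value in points:
        key = y if axis == "X" else x
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)


def global_aggregate(partials, func):
    """Phase 2 (on the master): merge the workers' partial results and finalize."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    if func == "SUM":
        return {k: s for k, (s, c) in merged.items()}
    if func == "AVG":
        return {k: s / c for k, (s, c) in merged.items()}
    raise ValueError("unsupported aggregation function: " + func)
```

The number of keys each worker sends in phase 2 depends on how the space was partitioned, which is why the partitioning scheme affects communication volume.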

Aggregation Queries on Point Data
• reducing communication volume
  • the initial partitioning scheme has a direct impact
  • have insights about the data and the query workload:
    • P(X) and P(Y) = probability of aggregation along the X and Y axis
    • |rx| and |ry| = range of X and Y coordinates
  • expected communication volume Vcomm defined as: [equation not preserved in the transcript]
• Goal: choose a partitioning scheme (cv and ch) that minimizes Vcomm

Aggregation Queries on Point Data
• Fault-Tolerance Support: Sub-partition Replication
  • step 1: divide each partition evenly into M’ sub-partitions
  • step 2: send each of the M’ sub-partitions to a different worker node
• Important questions:
  • how many sub-partitions (M’)?
  • how to divide a partition (cv’ and ch’)?
  • where to send each sub-partition? (random vs. rule-based)
• a better distribution reduces communication overhead
• rule-based selection: assign to nodes which share the same coordinate range
[Figure: a partition in X-Y space divided with M’ = 4, ch’ = 2, cv’ = 2]
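The two steps can be sketched as follows, assuming cv’ is the number of divisions along X and ch’ the number along Y, so that M’ = cv’ × ch’ (consistent with M’ = 4, ch’ = 2, cv’ = 2 in the figure). Only the random placement is shown; the rule-based selection needs the coordinate ranges of all nodes and is not reconstructed here. All names are hypothetical.

```python
import random


def divide_partition(bounds, cv, ch):
    """Step 1: cut a partition (x0, y0, x1, y1) evenly into cv * ch sub-partitions."""
    x0, y0, x1, y1 = bounds
    dx = (x1 - x0) / cv
    dy = (y1 - y0) / ch
    return [(x0 + i * dx, y0 + j * dy, x0 + (i + 1) * dx, y0 + (j + 1) * dy)
            for j in range(ch) for i in range(cv)]


def assign_random(owner, num_workers, num_subpartitions, rng):
    """Step 2 (random variant): pick a distinct worker, other than the owner, per sub-partition."""
    candidates = [w for w in range(num_workers) if w != owner]
    return rng.sample(candidates, num_subpartitions)
```

For example, divide_partition((0, 0, 10, 10), 2, 2) yields four 5 × 5 squares, and assign_random(0, 8, 4, random.Random(0)) picks four distinct workers among workers 1-7.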

Experiments
• local cluster; each node has two quad-core 2.53 GHz Xeon(R) processors and 12 GB RAM
• entire system implemented in C using the MPI library
• range queries:
  • comparison with the chunk replication scheme
  • 32 GB of spatial data
  • 1000 queries are run, and the aggregate time is reported
• aggregation queries:
  • comparison with the partition replication scheme
  • 24 GB of point data
• 64 nodes used, unless noted otherwise

Experiments: Range Queries
• Execution Times with No Replication and No Failures
[Figures: Optimal Chunk Size Selection; Scalability (chunk size = 10000)]

Experiments: Range Queries
• Execution Times under Failure Scenarios (64 workers in total)
• k is the number of sub-chunks for a chunk
[Figures: Single-Machine Failure; Rack Failure]

Experiments: Aggregation Queries
[Figures: Effect of Partitioning Scheme on Normal Execution; Single-Machine Failure. Settings shown: P(X) = P(Y) = 0.5, |rx| = |ry| = 10000 and P(X) = P(Y) = 0.5, |rx| = |ry| = 100000]

Conclusion
• a fault-tolerant environment that can process:
  • range queries on spatial data and aggregation queries on point data
  • but the proposed approaches can be extended to other types of queries and analysis tasks
• high efficiency under normal execution
• sub-chunk and sub-partition replication:
  • preserve load balance in the presence of failures, and hence
  • outperform traditional replication schemes

Thank you for listening …
Questions
