A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures
Wei Jiang
Data-Intensive and High Performance Computing Research Group
Department of Computer Science and Engineering, The Ohio State University
Advisor: Dr. Gagan Agrawal
The Era of “Big Data” • When data size becomes a problem: need easy-to-use tools! • What other aspects? • Performance? Analysis and Management? Security and Privacy? Data-Intensive and High Performance Computing Research Group
Motivation • Growing need for Data-Intensive SuperComputing • Performance is the highest priority in HPC! • Efficient data processing & high programming productivity • Emergence of various parallel architectures • Traditional CPU clusters (multi-cores) • GPU clusters (many-cores) • CPU-GPU clusters (heterogeneous systems) • Given big data, high-end apps, and parallel architectures… • We need Programming Models and Middleware Support! Data-Intensive and High Performance Computing Research Group
Map-Reduce is good, but… • Map-Reduce and its variants • Simple API: map and reduce • Easy to write parallel programs • Fault-tolerant for large-scale data centers with commodity nodes • High programming productivity • Performance? • Always a concern for the HPC community and also the database community • Data-intensive applications • Various subclasses: • Data center-oriented: search technologies • Data mining, graph mining, and scientific computing • Large intermediate structures: pre-processing/post-processing Data-Intensive and High Performance Computing Research Group
Parallel Computing Environments • Parallel Architectures • CPU clusters (multi-cores) • Most widely used as traditional HPC platforms • Motivated MapReduce and many of its variants • GPU clusters (many-cores) • Higher performance with better cost & energy efficiency • Low programming productivity • Limited MapReduce-like support • CPU-GPU clusters • Emerging heterogeneous systems • No general MapReduce-like support to date • New Hybrid Architectures • CPU+GPU on the same chip: Sandy Bridge, Fusion, etc. Data-Intensive and High Performance Computing Research Group
Our Middleware Series [Figure: the evolution of the MATE series: MATE, Ex-MATE, MATE-CG (with GPUs), and FT-MATE; "Tall oaks grow from little acorns!"] • Bridge the gap between the parallel architectures and the applications • Higher programming productivity than MPI • Better performance efficiency than MapReduce Data-Intensive and High Performance Computing Research Group
Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group
The Programming Model • The generalized reduction model • Based on user-declared reduction objects • Motivated by a set of data mining applications • For example, K-Means could have a very large set of data points to process but only needs to update a small set of centroids (the reduction object!) • Forms a compact summary of computational states • Helps achieve more efficient fault tolerance and recovery than replication/job re-execution in Map-Reduce • Avoids large-sized intermediate data • Applies updates directly on the reduction object instead of going through Map---Intermediate Processing---Reduce Data-Intensive and High Performance Computing Research Group
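To make the reduction-object idea concrete, below is a minimal C++ sketch of K-Means written in the generalized reduction style. It is illustrative only: the type and member names (KMeansReductionObject, reduce, combine, finalize) are assumptions and do not reproduce the actual MATE API.

```cpp
// A minimal sketch of K-Means in the generalized reduction style: the reduction
// object holds only the k centroids (coordinate sums + counts), not the points.
// Names are illustrative, not the real MATE API.
#include <vector>
#include <cstddef>

struct Point { double x, y, z; };

struct KMeansReductionObject {               // compact summary of computational state
    int k;
    std::vector<Point> sum;                  // per-cluster coordinate sums
    std::vector<long>  count;                // per-cluster point counts
    std::vector<Point> centroid;             // current centroids

    explicit KMeansReductionObject(int k_) : k(k_), sum(k_), count(k_, 0), centroid(k_) {}

    // reduce(): fold one input element directly into the reduction object.
    void reduce(const Point& p) {
        int best = 0; double bestDist = 1e300;
        for (int c = 0; c < k; ++c) {
            double dx = p.x - centroid[c].x, dy = p.y - centroid[c].y, dz = p.z - centroid[c].z;
            double d = dx*dx + dy*dy + dz*dz;
            if (d < bestDist) { bestDist = d; best = c; }
        }
        sum[best].x += p.x; sum[best].y += p.y; sum[best].z += p.z;
        ++count[best];
    }

    // combine(): merge another (e.g., per-core) copy; commutative and associative.
    void combine(const KMeansReductionObject& other) {
        for (int c = 0; c < k; ++c) {
            sum[c].x += other.sum[c].x; sum[c].y += other.sum[c].y; sum[c].z += other.sum[c].z;
            count[c] += other.count[c];
        }
    }

    // finalize(): recompute centroids for the next iteration.
    void finalize() {
        for (int c = 0; c < k; ++c)
            if (count[c] > 0)
                centroid[c] = { sum[c].x / count[c], sum[c].y / count[c], sum[c].z / count[c] };
    }
};
```

Because reduce() folds each point directly into the running sums, no per-point (key, value) intermediate data is materialized, which is the point of the model.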
Comparing Processing Structures • Reduction Object represents the intermediate state of the execution • Reduce func. is commutative and associative • Sorting, grouping, shuffling... overheads are eliminated with the red. func/obj. • But we need global combination. • Insight: we could even provide a better implementation of the same map-reduce API! --- e.g., Turbo MapReduce from Quantcast! Data-Intensive and High Performance Computing Research Group
Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group
Shared-Memory Parallelization in MATE • Basic one-stage dataflow in the Full Replication scheme • Locking-free: each CPU core has a private copy of the reduction object • A parallel merge is performed in the combination phase [Figure: dataflow: input splits are reduced into per-core reduction objects, which are merged in the combination phase to produce the output] Data-Intensive and High Performance Computing Research Group
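Below is a small C++ sketch of this full-replication scheme using std::thread; the HistogramRO type and helper names are illustrative stand-ins, not MATE code.

```cpp
// A sketch of the locking-free full replication scheme: each thread (core)
// reduces its splits into a private copy of the reduction object, and the
// private copies are merged in the final combination phase.
#include <thread>
#include <vector>
#include <cstddef>

struct HistogramRO {                           // example reduction object: a histogram
    std::vector<long> bins;
    explicit HistogramRO(std::size_t n) : bins(n, 0) {}
    void reduce(int value) {                   // assumes non-negative values
        ++bins[static_cast<std::size_t>(value) % bins.size()];
    }
    void combine(const HistogramRO& o) {
        for (std::size_t i = 0; i < bins.size(); ++i) bins[i] += o.bins[i];
    }
};

HistogramRO full_replication_run(const std::vector<std::vector<int>>& splits,
                                 unsigned num_threads, std::size_t num_bins) {
    std::vector<HistogramRO> privateCopies(num_threads, HistogramRO(num_bins));
    std::vector<std::thread> workers;

    // Reduction phase: splits are assigned round-robin; no locks are needed
    // because every thread updates only its own private copy.
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t s = t; s < splits.size(); s += num_threads)
                for (int v : splits[s])
                    privateCopies[t].reduce(v);
        });
    }
    for (auto& w : workers) w.join();

    // Combination phase: merge the private copies (sequential here; MATE
    // performs a parallel merge).
    HistogramRO result(num_bins);
    for (const auto& copy : privateCopies) result.combine(copy);
    return result;
}
```

Each worker touches only its own copy, so no locking is needed on the reduction object during the reduction phase.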
Function APIs defined/customized by the user Data-Intensive and High Performance Computing Research Group
Experiments Design • For comparison against Phoenix, we used three data mining applications • K-Means Clustering, Principal Component Analysis (PCA), and Apriori Associative Mining • Also evaluated the single-node performance of Hadoop on K-Means and Apriori • A combine function is used in Hadoop with careful tuning • Experiments on two multi-core platforms • 8 cores on one 8-core node (Intel CPU) • 16 cores on one 16-core node (AMD CPU) Data-Intensive and High Performance Computing Research Group
Results: Data Mining (I) • K-Means on 8-core and 16-core machines: 400MB dataset, 3-dim points, k = 100 [Chart: Avg. Time Per Iteration (sec) vs. # of threads] Data-Intensive and High Performance Computing Research Group
Results: Data Mining (II) • PCA on 8-core and 16-core machines: 8000 * 1024 matrix [Chart: Total Time (sec) vs. # of threads] Data-Intensive and High Performance Computing Research Group
Extending MATE • Main issue with the original MATE: • Assumes that the reduction object MUST fit in memory • We extended MATE to address this limitation • Focus on graph mining: an emerging class of apps • Requires large-sized reduction objects as well as large-scale datasets • E.g., PageRank could have a 16GB reduction object! • Support for managing arbitrary-sized reduction objects • Large-sized reduction objects are disk-resident • Evaluated Ex-MATE using PEGASUS • PEGASUS: a Hadoop-based graph mining system Data-Intensive and High Performance Computing Research Group
Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group
Ex-MATE Runtime Overview • Basic one-stage execution [Figure: Execution overview of the Extended MATE] Data-Intensive and High Performance Computing Research Group
Implementation Considerations • Support for processing very large datasets • Partitioning function: partition and distribute to a number of computing nodes • Splitting function: use the multi-core CPU on each node • Management of a large reduction object on disk: how to reduce disk I/O? • Outputs (R.O.) are updated in a demand-driven way • Partition the reduction object into splits • Inputs are re-organized based on data access patterns • Reuse a R.O. split as much as possible in memory • Example: Matrix-Vector Multiplication Data-Intensive and High Performance Computing Research Group
A MV-Multiplication Example [Figure: Matrix-Vector Multiplication using checkerboard partitioning. B(i,j) represents a matrix block, I_V(j) represents an input vector split, and O_V(i) represents an output vector split. The matrix/vector multiplies are done block-wise, not element-wise.] Data-Intensive and High Performance Computing Research Group
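The sketch below illustrates the block-wise processing described in the figure caption, reusing each output split O_V(i) across all blocks B(i,*). The data structures are simple in-memory stand-ins for the disk-resident reduction-object splits, and the function names are illustrative.

```cpp
// Block-wise matrix-vector multiplication under checkerboard partitioning:
// B(i,j) is a matrix block, I_V(j) an input-vector split, O_V(i) an
// output-vector split (a reduction-object split). Blocks are processed grouped
// by row i so each O_V(i) split is loaded once and reused, which is the
// disk-I/O-saving idea from the previous slide.
#include <vector>
#include <cstddef>

using Block = std::vector<std::vector<double>>;   // dense block of the matrix

void multiply_block(const Block& B, const std::vector<double>& IV,
                    std::vector<double>& OV) {
    // O_V(i) += B(i,j) * I_V(j): accumulate directly into the reduction-object split.
    for (std::size_t r = 0; r < B.size(); ++r)
        for (std::size_t c = 0; c < B[r].size(); ++c)
            OV[r] += B[r][c] * IV[c];
}

void checkerboard_mv(const std::vector<std::vector<Block>>& blocks,    // blocks[i][j] = B(i,j)
                     const std::vector<std::vector<double>>& inSplits, // inSplits[j] = I_V(j)
                     std::vector<std::vector<double>>& outSplits) {    // outSplits[i] = O_V(i)
    for (std::size_t i = 0; i < blocks.size(); ++i) {          // load O_V(i) once...
        for (std::size_t j = 0; j < blocks[i].size(); ++j)     // ...reuse it for all B(i,*)
            multiply_block(blocks[i][j], inSplits[j], outSplits[i]);
        // here Ex-MATE would write the updated O_V(i) split back to disk
    }
}
```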
Experiments Design • Applications: • Three graph mining algorithms: • PageRank, Diameter Estimation (HADI), and Finding Connected Components (HCC), parallelized using the GIM-V method • Evaluation: • Performance comparison with PEGASUS • PEGASUS provides a naïve version and an optimized version • Speedups with an increasing number of nodes • Scalability with increasing dataset sizes • Experimental platform: • A cluster of multi-core CPU machines • Used up to 128 cores (16 nodes) Data-Intensive and High Performance Computing Research Group
Results: Graph Mining (I) • 16GB datasets: Ex-MATE: ~10x speedup [Charts for PageRank, HADI, and HCC: Avg. Time Per Iteration (min) vs. # of nodes] Data-Intensive and High Performance Computing Research Group
Scalability: Graph Mining (II) • HCC: better scalability with larger datasets [Charts for 8GB, 32GB, and 64GB datasets: Avg. Time Per Iteration (min) vs. # of nodes] Data-Intensive and High Performance Computing Research Group
Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group
MATE for CPU-GPU Clusters • Still adopts Generalized Reduction • Built on top of MATE and Ex-MATE • Accelerates data-intensive computations on heterogeneous systems • Focus on CPU-GPU clusters • A multi-level data partitioning scheme • Proposed a novel auto-tuning framework • Exploits the iterative nature of many data-intensive apps • Automatically decides the workload distribution between CPUs and GPUs Data-Intensive and High Performance Computing Research Group
MATE-CG Overview [Figure: Execution workflow] Data-Intensive and High Performance Computing Research Group
Auto-Tuning Framework • Auto-tuning problem: • Given an application, find the optimal data distribution between the CPU and the GPU to minimize the overall running time on each node • For example: which is best, 20/80, 50/50, or 70/30? • Our approach: • Exploits the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations • Constructs an analytical model to predict performance • The optimal value is computed and learned over the first few iterations • No compile-time search or tuning is needed • Low runtime overheads with a large number of iterations Data-Intensive and High Performance Computing Research Group
The Analytical Model • T_c: processing time on the CPU for a data fraction p; T_o: fixed overheads on the CPU • T_g: processing time on the GPU for the remaining fraction (1-p); T_g_o: fixed overheads on the GPU • T_cg: overall processing time on CPU+GPU as a function of p • T_c, T_g, and T_cg all vary with p [Figure: illustration of the relationship between T_cg and p] Data-Intensive and High Performance Computing Research Group
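The slide only defines the symbols; the formulas below are a hedged reconstruction of a model consistent with those definitions (CPU and GPU working concurrently on fractions p and 1-p), not the exact equations from the original work. The per-unit processing times T_{c,unit} and T_{g,unit} are introduced here for clarity and are not symbols from the slide.

```latex
% Hedged reconstruction; exact functional forms are assumptions.
\begin{align*}
T_c(p)    &= p \cdot T_{c,\mathrm{unit}} + T_o             && \text{CPU: fraction } p \text{ plus fixed overhead}\\
T_g(p)    &= (1-p) \cdot T_{g,\mathrm{unit}} + T_{g,o}     && \text{GPU: fraction } (1-p) \text{ plus fixed overhead}\\
T_{cg}(p) &= \max\bigl(T_c(p),\, T_g(p)\bigr)              && \text{CPU and GPU run concurrently}
\end{align*}
% The minimum of T_{cg} lies where the two curves cross, i.e. T_c(p^*) = T_g(p^*):
\[
  p^* = \frac{T_{g,\mathrm{unit}} + T_{g,o} - T_o}{T_{c,\mathrm{unit}} + T_{g,\mathrm{unit}}}
\]
```

Under this reading, measuring the CPU and GPU times over the first few iterations is enough to estimate the unknown terms and solve for p*, which matches the "learned over the first few iterations" behavior described on the previous slide.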
Experiments Design • Experiments Platform • A heterogeneous CPU-GPU cluster • Each node has one multi-core CPU and one GPU • Intel 8-core CPU • NVIDIA Tesla (Fermi) GPU (14 * 32 = 448 cores) • Used up to 128 CPU cores and 7168 GPU cores on 16 nodes • Three representative applications • Gridding Kernel, EM, and PageRank • For each application, we run it in four modes in the cluster: • CPU-1, CPU-8, GPU-only, and CPU-8-n-GPU Data-Intensive and High Performance Computing Research Group
Results: Scalability with increasing # of GPUs • GPU-only is better than CPU-8 [Charts for PageRank, EM, and the Gridding Kernel: Avg. Time Per Iteration (sec) vs. # of nodes] Data-Intensive and High Performance Computing Research Group
Results: Auto-tuning (on 16 nodes) [Charts for EM, PageRank, and the Gridding Kernel: Execution Time in One Iteration (sec) vs. Iteration Number] Data-Intensive and High Performance Computing Research Group
Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group
Reduction Object-based Fault Tolerance • Fault tolerance is important for data-intensive computing • The reduction object helps provide fault tolerance at lower costs than Hadoop • Example: FREERIDE-G vs. Hadoop • K-Means Clustering: one node fails at 50% of data processing • Overheads: Hadoop: 23.06 | 71.78 | 78.11; FREERIDE-G: 20.37 | 8.18 | 9.18 (taken from Tekin Bicer et al., IPDPS 2010) Data-Intensive and High Performance Computing Research Group
Applying the Reduction Object-based Approach to MPI Programs • The reduction object model achieves better fault tolerance for data mining and graph mining, as in the FREERIDE-G and MATE systems • Fault-tolerance support for MPI applications remains an ongoing challenge, and existing solutions will not work in the future • Checkpoint time will exceed the Mean Time To Failure in the exascale era (exascale systems expected in 2018)! • So, can our ideas from the reduction object work help other types of applications, like MPI programs? Data-Intensive and High Performance Computing Research Group
Our Fault Tolerance Approach (I) • Based on the Extended Generalized Reduction Model • Aims to improve the expensive checkpoint/restart (C/R) for MPI applications • Divides the reduction object into two parts • One inter-node global reduction object • One set of intra-node local reduction objects • Only the global reduction object participates in the global combination phase • Targets applications that can be written in the extended generalized reduction model • No redundant/backup nodes are used/needed • Deals with fail-stop failures (a failure does not affect other nodes) • Assumes failure detection is accurate and instant Data-Intensive and High Performance Computing Research Group
Our Fault Tolerance Approach (II) • Using distributed memory and reliable storage • Cache the global reduction object in the memory of other nodes • Save the local reduction objects onto persistent storage • Can be viewed as an adaptation of the application-level checkpointing approach • The key difference is that we exploit the reduction object structures and re-distribute the remaining data upon a failure • There is no need to restart a failed process • Suitable for a diverse set of applications • Data mining: only the global reduction object is needed • Stencil computations: both global and local ones are needed • Irregular reductions: only the local reduction objects are needed Data-Intensive and High Performance Computing Research Group
MPI Application Examples • Dense Grid Computations • Stencil computations like Jacobi and Sobel Filter • Sparse Grid Computations • Irregular Reductions like Euler Solver and Molecular Dynamics Data-Intensive and High Performance Computing Research Group
Implementing Dense Grid Computations using EGR (I) • A simple way is based on output matrix partitioning • The input data needed for computing an output partition consists of the corresponding input partition and the elements on the border of its neighboring input partitions • Each output partition is a local reduction object, and no global reduction object is used [Figure: output matrix partitioning and the corresponding input matrix] Data-Intensive and High Performance Computing Research Group
Implementing Dense Grid Computations using EGR (II) • An alternative is based on input matrix partitioning • A data-driven approach: for each point in the input matrix, determine the corresponding points to be updated in the output matrix • The ghost output rows shared by two neighboring output partitions form the global reduction object, and the other output rows are local reduction objects [Figure: input matrix partitioning and the corresponding output matrix] Data-Intensive and High Performance Computing Research Group
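For concreteness, here is a minimal C++ sketch of the first (output-matrix-partitioning) approach for a Jacobi-style stencil: one output partition, treated as a local reduction object, is computed from its own input rows plus one ghost row borrowed from each neighboring partition. The layout and function names are illustrative, not the FT-MATE implementation.

```cpp
// One output partition of a Jacobi-style 4-point stencil computed as a local
// reduction object. `top` and `bottom` are the ghost rows borrowed from the
// neighboring input partitions (empty at the global boundary, where values
// are left unchanged).
#include <vector>
#include <cstddef>

using Rows = std::vector<std::vector<double>>;

Rows compute_partition(const Rows& myRows, const std::vector<double>& top,
                       const std::vector<double>& bottom) {
    const std::size_t n = myRows.size(), m = myRows[0].size();
    Rows out = myRows;                                   // local reduction object
    for (std::size_t i = 0; i < n; ++i) {
        const std::vector<double>* up =
            (i > 0) ? &myRows[i - 1] : (top.empty() ? nullptr : &top);
        const std::vector<double>* down =
            (i + 1 < n) ? &myRows[i + 1] : (bottom.empty() ? nullptr : &bottom);
        if (!up || !down) continue;                      // global boundary row: keep as-is
        for (std::size_t j = 1; j + 1 < m; ++j)
            out[i][j] = 0.25 * ((*up)[j] + (*down)[j] + myRows[i][j - 1] + myRows[i][j + 1]);
    }
    return out;
}
```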
Implementing Sparse Grid Computations using EGR (I) • For irregular reductions, the corresponding points in the output space are not known at compile time • Input space partitioning would have to treat the entire output space as the global reduction object, and it resulted in poor scalability in our preliminary experiments • We choose output space partitioning in our implementation and re-organize the corresponding input in the pre-processing stage Data-Intensive and High Performance Computing Research Group
Implementing Sparse Grid Computations using EGR (II) [Figure: Partitioning on the reduction space for irregular reductions] Data-Intensive and High Performance Computing Research Group
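A small C++ sketch of the pre-processing step implied by this figure: the reduction space (e.g., mesh nodes) is split into partitions, and the irregular input (edges) is re-organized so that each partition can update its local reduction object independently. The routing policy and names here are assumptions for illustration.

```cpp
// Partitioning on the reduction space for an irregular reduction: nodes (the
// reduction space) are block-partitioned, and each edge is routed to the
// partition(s) owning its endpoints.
#include <vector>
#include <cstddef>

struct Edge { std::size_t u, v; double weight; };

struct PartitionedInput {
    std::vector<std::vector<Edge>> edgesByPartition;   // re-organized input
};

PartitionedInput partition_on_reduction_space(const std::vector<Edge>& edges,
                                              std::size_t numNodes,
                                              std::size_t numPartitions) {
    auto owner = [&](std::size_t node) { return node * numPartitions / numNodes; };
    PartitionedInput out;
    out.edgesByPartition.resize(numPartitions);
    for (const Edge& e : edges) {
        std::size_t pu = owner(e.u), pv = owner(e.v);
        out.edgesByPartition[pu].push_back(e);          // owner of e.u updates its side
        if (pv != pu)
            out.edgesByPartition[pv].push_back(e);      // crossing edge: also sent to e.v's owner
    }
    return out;
}

// During reduction, partition p applies each of its edges but writes only to
// the nodes it owns, so its local reduction object never needs locking.
```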
The FT-MATE System (I) [Figure: the processing structure with fault tolerance support, covering checkpointing and recovery] Data-Intensive and High Performance Computing Research Group
The FT-MATE System (II) • Fault tolerance runtime components: • Configuration --- MATE_FTSetup() • Set up the checkpoint interval, the directory for saving checkpoints, etc. • Check-pointing • MATE_MemCheckpoint() --- synchronous/asynchronous data exchange • MATE_DiskCheckpoint() --- single-thread/multi-thread data output • Detecting Failures --- MATE_DetectFailure() • Peer-to-peer communication among the nodes is kept alive, and timeouts are used to detect node failures with the aid of the MPI stack • Recovering Failures --- MATE_RecoverFailure() • Data re-distribution and processing of unfinished data • Output space recovery if needed Data-Intensive and High Performance Computing Research Group
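The function names above come from the slide; their signatures are not given, so the sketch below strings them together with assumed signatures and stub bodies purely to show the intended control flow of an iterative FT-MATE application.

```cpp
// Assumed signatures and stub bodies standing in for the FT-MATE runtime calls
// named on the slide; they only illustrate the intended control flow.
void MATE_FTSetup(int /*checkpointInterval*/, const char* /*checkpointDir*/) {}
void MATE_MemCheckpoint()  {}                 // would cache the global reduction object in peer memory
void MATE_DiskCheckpoint() {}                 // would write local reduction objects to reliable storage
bool MATE_DetectFailure()  { return false; }  // would use timeouts on peer-to-peer channels
void MATE_RecoverFailure() {}                 // would re-distribute remaining data, recover output space

void run_iterations(int numIterations, int checkpointInterval) {
    MATE_FTSetup(checkpointInterval, "/tmp/ft_mate_checkpoints");   // illustrative directory
    for (int iter = 0; iter < numIterations; ++iter) {
        // ... reduction and combination phases for this iteration ...

        if ((iter + 1) % checkpointInterval == 0) {
            MATE_MemCheckpoint();    // checkpoint the (small) global reduction object
            MATE_DiskCheckpoint();   // checkpoint the local reduction objects
        }
        if (MATE_DetectFailure())
            MATE_RecoverFailure();   // no process restart: surviving nodes take over the data
    }
}
```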
The FT-MATE System (III) [Figure: example of the fault recovery process for an irregular application] Data-Intensive and High Performance Computing Research Group
Experiments Design • Experiments Platform • A cluster of multi-core nodes; each node has one Intel 8-core CPU • Four representative apps in scientific computing • Stencil Computations: Jacobi and Sobel Filter • Irregular Reductions: Euler Solver and Molecular Dynamics • For each application, we could run it in two modes: • CPU-1: use 1 CPU core per node • CPU-8: use 8 CPU cores per node • Evaluated against the fault tolerant MPICH2 library • FT-MATE vs. MPICH2-BLCR Data-Intensive and High Performance Computing Research Group
Results: Scalability Study • Scalability without a failure in FT-MATE • MPI's absolute performance is similar to that of FT-MATE [Charts: Avg. Time Per Iteration (secs) vs. # of Nodes, for the CPU-1 and CPU-8 versions on 2, 4, 8, and 16 nodes for each of the four applications] Data-Intensive and High Performance Computing Research Group
Results: Checkpointing Overheads • Low checkpointing overheads • Checkpoint size for 7.4GB Moldyn: 9.3GB vs. 48MB • On 8 nodes, running for 1000 iterations in CPU-8 mode [Charts for Sobel Filter and Molecular Dynamics: Normalized Checkpointing Costs (%) vs. # of Iterations per Checkpoint Interval; annotated values include 5.54% and 2.99%] Data-Intensive and High Performance Computing Research Group
Results: Fault Recovery • W/REBIRTH in FT-MATE: the failed node recovers one iteration after the fault recovery starts • Low absolute recovery costs • On 32 nodes; the checkpoint interval is 100/1000 [Charts for Jacobi and Euler Solver: Normalized Recovery Costs (%) vs. Failure Point (%); annotated values include 0.02% and 0.19%] Data-Intensive and High Performance Computing Research Group
Summary • MATE, Ex-MATE, MATE-CG, and FT-MATE for multi-cores, homogeneous clusters, and heterogeneous clusters • A diverse set of applications: data mining, graph mining, scientific computing, stencil computations, irregular reductions, etc. • FT-MATE supports more efficient fault tolerance for MPI programs • Also, the MATE series has been used internally in our group • MATE-EC2: allows starting MATE instances on cloud providers like Amazon Web Services • Sci-MATE: supports scientific data formats like NetCDF and HDF5 • As a backend to run generated C code from python/R code • Some ongoing follow-up projects: • Using an implicit reduction object with the same map-reduce API? • What are the opportunities in Sandy Bridge and Fusion APUs? • Dealing with GPU failures in MATE-CG? Data-Intensive and High Performance Computing Research Group
Thank You! • Questions, comments, and suggestions? Data-Intensive and High Performance Computing Research Group