

  1. A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures Wei Jiang Data-Intensive and High Performance Computing Research Group Department of Computer Science and Engineering The Ohio State University Advisor: Dr. Gagan Agrawal

  2. The Era of “Big Data” • When data size becomes a problem: we need easy-to-use tools! • What other aspects matter? Performance? Analysis and management? Security and privacy? Data-Intensive and High Performance Computing Research Group

  3. Motivation • Growing need for Data-Intensive SuperComputing • Performance is the highest priority in HPC! • Efficient data processing & high programming productivity • Emergence of various parallel architectures • Traditional CPU clusters (multi-cores) • GPU clusters (many-cores) • CPU-GPU clusters (heterogeneous systems) • Given big data, high-end apps, and parallel architectures… • We need Programming Models and Middleware Support! Data-Intensive and High Performance Computing Research Group

  4. Map-Reduce is good, but… • Map-Reduce and its variants • Simple API: map and reduce • Easy to write parallel programs • Fault-tolerant for large-scale data centers with commodity nodes • High programming productivity • Performance? • Always a concern for the HPC community, and for databases as well • Data-intensive applications • Various subclasses: • Data center-oriented: search technologies • Data mining, graph mining, and scientific computing • Large intermediate structures: pre-processing/post-processing Data-Intensive and High Performance Computing Research Group

  5. Parallel Computing Environments • Parallel Architectures • CPU clusters (multi-cores) • Most widely used as traditional HPC platforms • Motivated MapReduce and many of its variants • GPU clusters (many-cores) • Higher performance with better cost & energy efficiency • Low programming productivity • Limited MapReduce-like support • CPU-GPU clusters • Emerging heterogeneous systems • No general MapReduce-like support to date • New Hybrid Architectures • CPU+GPU on the same chip: Sandy Bridge, Fusion, etc. Data-Intensive and High Performance Computing Research Group

  6. Our Middleware Series • MATE → Ex-MATE → MATE-CG → FT-MATE (tall oaks grow from little acorns!) [Figure: the MATE middleware series on CPU and GPU architectures] • Bridge the gap between the parallel architectures and the applications • Higher programming productivity than MPI • Better performance efficiency than MapReduce Data-Intensive and High Performance Computing Research Group

  7. Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group

  8. The Programming Model • The generalized reduction model • Based on user-declared reduction objects • Motivated by a set of data mining applications • For example, K-Means could have a very large set of data points to process but only needs to update a small set of centroids (the reduction object; sketched below) • Forms a compact summary of computational state • Helps achieve more efficient fault tolerance and recovery than replication/job re-execution in Map-Reduce • Avoids large-sized intermediate data • Applies updates directly on the reduction object instead of going through Map --- Intermediate Processing --- Reduce Data-Intensive and High Performance Computing Research Group
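To make the reduction-object idea concrete, here is a minimal sketch of what a user-declared K-Means reduction object and its reduction function could look like under the generalized reduction model. This is an illustration only, not the actual MATE code; the type and function names (KMeansReductionObject, reduce) are hypothetical.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical reduction object for K-Means: one accumulator per centroid.
// It stays small (k * dim values) even when the input has millions of points.
struct KMeansReductionObject {
    int k, dim;
    std::vector<double> sums;    // k * dim partial coordinate sums
    std::vector<long>   counts;  // number of points assigned to each centroid
    KMeansReductionObject(int k_, int dim_)
        : k(k_), dim(dim_), sums(k_ * dim_, 0.0), counts(k_, 0) {}
};

// The user-declared reduction: fold one data point directly into the
// reduction object, with no intermediate (key, value) pairs.
void reduce(const double* point, const std::vector<double>& centroids,
            KMeansReductionObject& ro) {
    int best = 0;
    double best_dist = 1e300;
    for (int c = 0; c < ro.k; ++c) {
        double d = 0.0;
        for (int j = 0; j < ro.dim; ++j) {
            double diff = point[j] - centroids[c * ro.dim + j];
            d += diff * diff;
        }
        if (d < best_dist) { best_dist = d; best = c; }
    }
    for (int j = 0; j < ro.dim; ++j) ro.sums[best * ro.dim + j] += point[j];
    ro.counts[best] += 1;
}
```

At the end of an iteration, the new centroids are obtained by dividing each accumulated sum by its count, so the per-iteration summary that must be exchanged or checkpointed is just this small object.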

  9. Comparing Processing Structures • Reduction Object represents the intermediate state of the execution • Reduce func. is commutative and associative • Sorting, grouping, and shuffling overheads are eliminated with the reduction function/object • But we need a global combination • Insight: we could even provide a better implementation of the same map-reduce API! --- e.g., Turbo MapReduce from Quantcast! Data-Intensive and High Performance Computing Research Group

  10. Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group

  11. Shared-Memory Parallelization in MATE • Basic one-stage dataflow in the Full Replication scheme • Locking-free: each CPU core has a private copy of the reduction object • Parallel merge is performed in the combination phase [Figure: input splits → per-core Reduction into private Reduction Objects → Combination → Output] Data-Intensive and High Performance Computing Research Group
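A minimal sketch of the locking-free full-replication scheme described above: each worker thread folds its input splits into a private copy of a (hypothetical) reduction object, and the private copies are merged in a combination phase. The histogram reduction object, the function names, and the sequential merge are illustrative simplifications, not the MATE implementation.

```cpp
#include <thread>
#include <vector>
#include <cstddef>

// Hypothetical reduction object: a fixed-size histogram (names illustrative).
struct HistReductionObject {
    std::vector<long> bins;
    explicit HistReductionObject(int nbins) : bins(nbins, 0) {}
};

// Each thread folds its elements into a PRIVATE copy: no locks required.
// (Non-negative element values are assumed for this sketch.)
void process_splits(const std::vector<int>& my_elements, HistReductionObject& priv) {
    for (int v : my_elements)
        priv.bins[static_cast<size_t>(v) % priv.bins.size()] += 1;
}

// Combination phase: merge the private copies (shown sequentially here;
// MATE performs a parallel merge).
void combine(const std::vector<HistReductionObject>& priv, HistReductionObject& global) {
    for (const auto& ro : priv)
        for (size_t i = 0; i < global.bins.size(); ++i)
            global.bins[i] += ro.bins[i];
}

// Full-replication driver: one private reduction object per core.
void run(const std::vector<std::vector<int>>& splits, int nbins, HistReductionObject& out) {
    std::vector<HistReductionObject> priv(splits.size(), HistReductionObject(nbins));
    std::vector<std::thread> workers;
    for (size_t t = 0; t < splits.size(); ++t)
        workers.emplace_back(process_splits, std::cref(splits[t]), std::ref(priv[t]));
    for (auto& w : workers) w.join();
    combine(priv, out);
}
```

Because each core only writes its own copy, no synchronization is needed during the reduction phase; the only extra cost is the final merge, which is cheap when the reduction object is small.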

  12. Functions/APIs defined or customized by the user [Table: the user-defined API functions] Data-Intensive and High Performance Computing Research Group
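The table of user-defined functions did not survive in this transcript. As a hedged placeholder, the sketch below shows the general shape such callbacks take in a generalized-reduction API: splitting the input, reducing elements into the reduction object, combining objects, and finalizing the output. These signatures are assumptions for illustration, not the actual MATE API.

```cpp
#include <cstddef>

// Illustrative only: the shape of the user-supplied callbacks in a
// generalized-reduction API (not the actual MATE signatures).
struct reduction_object;   // opaque, user-declared accumulator
struct input_split { const void* data; size_t length; };

typedef void (*splitter_t)(const void* input, size_t length,
                           int num_splits, input_split* out_splits);
typedef void (*reduction_t)(const input_split* split,
                            reduction_object* ro);          // element(s) -> R.O.
typedef void (*combination_t)(const reduction_object* local,
                              reduction_object* global);    // merge private copies
typedef void (*finalize_t)(const reduction_object* global,
                           void* output);                   // e.g., the new centroids
```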

  13. Experiments Design • For comparison against Phoenix, we used three data mining applications • K-Means Clustering, Principal Component Analysis (PCA), and Apriori Associative Mining • Also evaluated the single-node performance of Hadoop on K-Means and Apriori • A Combine function is used in Hadoop, with careful tuning • Experiments on two multi-core platforms • 8 cores on one 8-core node (Intel CPU) • 16 cores on one 16-core node (AMD CPU) Data-Intensive and High Performance Computing Research Group

  14. Results: Data Mining (I) • K-Means on 8-core and 16-core machines: 400MB dataset, 3-dim points, k = 100 [Charts: Avg. Time Per Iteration (sec) vs. # of threads] Data-Intensive and High Performance Computing Research Group

  15. Results: Data Mining (II) • PCA on 8-core and 16-core machines: 8000 × 1024 matrix [Charts: Total Time (sec) vs. # of threads] Data-Intensive and High Performance Computing Research Group

  16. Extending MATE • The main issue with the original MATE: • Assumes that the reduction object MUST fit in memory • We extended MATE to address this limitation • Focus on graph mining: an emerging class of applications • Requires large-sized reduction objects as well as large-scale datasets • E.g., PageRank could have a 16GB reduction object! • Support for managing arbitrary-sized reduction objects • Large-sized reduction objects are disk-resident • Evaluated Ex-MATE against PEGASUS • PEGASUS: a Hadoop-based graph mining system Data-Intensive and High Performance Computing Research Group

  17. Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group

  18. Ex-MATE Runtime Overview [Figure: execution overview of the Extended MATE (basic one-stage execution)] Data-Intensive and High Performance Computing Research Group

  19. Implementation Considerations • Support for processing very large datasets • Partitioning function: partition and distribute the input to a number of computing nodes • Splitting function: use the multi-core CPU on each node • Management of a large reduction object on disk: how to reduce disk I/O? • Outputs (R.O.) are updated in a demand-driven way • Partition the reduction object into splits • Inputs are re-organized based on data access patterns • Reuse a R.O. split as much as possible in memory (sketched below) • Example: Matrix-Vector Multiplication Data-Intensive and High Performance Computing Research Group
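A rough sketch of the demand-driven, disk-resident reduction object idea described above: keep one reduction-object split in memory, reuse it for as many updates as possible, and only touch disk when a different split is needed. The class name, file layout, and double-typed splits are assumptions for illustration, not Ex-MATE internals.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical manager for a disk-resident reduction object, one split at a time.
class ROSplitCache {
    std::string dir_;            // directory holding one binary file per split
    int cached_id_ = -1;         // id of the split currently held in memory
    std::vector<double> buf_;    // in-memory copy of that split
    size_t split_len_;
public:
    ROSplitCache(std::string dir, size_t split_len)
        : dir_(std::move(dir)), split_len_(split_len) {}

    // Demand-driven access: only load a split when an update actually needs it,
    // and write the previous split back first.  The input is re-organized so
    // that consecutive updates hit the same split as often as possible.
    std::vector<double>& acquire(int split_id) {
        if (split_id == cached_id_) return buf_;   // reuse: no disk I/O
        flush();
        buf_.assign(split_len_, 0.0);
        std::string path = dir_ + "/ro_split_" + std::to_string(split_id) + ".bin";
        if (FILE* f = std::fopen(path.c_str(), "rb")) {
            std::fread(buf_.data(), sizeof(double), split_len_, f);
            std::fclose(f);
        }
        cached_id_ = split_id;
        return buf_;
    }

    // Write the cached split back to disk (no-op if nothing is cached).
    void flush() {
        if (cached_id_ < 0) return;
        std::string path = dir_ + "/ro_split_" + std::to_string(cached_id_) + ".bin";
        if (FILE* f = std::fopen(path.c_str(), "wb")) {
            std::fwrite(buf_.data(), sizeof(double), buf_.size(), f);
            std::fclose(f);
        }
    }
};
```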

  20. A MV-Multiplication Example • Matrix-Vector Multiplication using checkerboard partitioning: B(i,j) represents a matrix block, I_V(j) represents an input vector split, and O_V(i) represents an output vector split • The matrix/vector multiplies are done block-wise, not element-wise [Figure: checkerboard partitioning of the input matrix, input vector, and output vector] Data-Intensive and High Performance Computing Research Group
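To make the block-wise computation concrete, a small sketch of checkerboard matrix-vector multiplication: block B(i,j) is combined with input-vector split I_V(j) and accumulated into output-vector split O_V(i), which plays the role of a reduction-object split. Dense in-memory blocks and the function names are simplifying assumptions.

```cpp
#include <vector>

// One dense block B(i,j) of the input matrix (row-major, b x b).
using Block = std::vector<double>;

// y_i += B(i,j) * x_j  --- the per-block update.
void block_mv(const Block& B, const std::vector<double>& x_split,
              std::vector<double>& y_split, int b) {
    for (int r = 0; r < b; ++r)
        for (int c = 0; c < b; ++c)
            y_split[r] += B[r * b + c] * x_split[c];
}

// Process blocks grouped by output-split index i, so each O_V(i) is zeroed
// once, updated by every B(i,j) in its row of blocks, and written out once.
void checkerboard_mv(const std::vector<std::vector<Block>>& B,   // B[i][j]
                     const std::vector<std::vector<double>>& IV, // I_V(j)
                     std::vector<std::vector<double>>& OV,       // O_V(i)
                     int b) {
    for (size_t i = 0; i < B.size(); ++i) {
        OV[i].assign(b, 0.0);
        for (size_t j = 0; j < B[i].size(); ++j)
            block_mv(B[i][j], IV[j], OV[i], b);
    }
}
```

Grouping the work by output-split index is exactly the reuse pattern the previous slide describes: one O_V(i) split stays resident while the matrix blocks stream through.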

  21. Experiments Design • Applications: • Three graph mining algorithms: PageRank, Diameter Estimation (HADI), and Finding Connected Components (HCC) – parallelized using the GIM-V method • Evaluation: • Performance comparison with PEGASUS • PEGASUS provides a naïve version and an optimized version • Speedups with an increasing number of nodes • Scalability with increasing dataset sizes • Experimental platform: • A cluster of multi-core CPU machines • Used up to 128 cores (16 nodes) Data-Intensive and High Performance Computing Research Group

  22. Results: Graph Mining (I) • 16GB datasets: Ex-MATE achieves a ~10x speedup (vs. PEGASUS) [Charts: Avg. Time Per Iteration (min) vs. # of nodes for PageRank, HADI, and HCC] Data-Intensive and High Performance Computing Research Group

  23. Scalability: Graph Mining (II) • HCC: better scalability with larger datasets (8GB, 32GB, and 64GB) [Charts: Avg. Time Per Iteration (min) vs. # of nodes] Data-Intensive and High Performance Computing Research Group

  24. Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group

  25. MATE for CPU-GPU Clusters • Still adopts the generalized reduction model • Built on top of MATE and Ex-MATE • Accelerates data-intensive computations on heterogeneous systems • Focus on CPU-GPU clusters • Uses a multi-level data partitioning • Proposed a novel auto-tuning framework • Exploits the iterative nature of many data-intensive apps • Automatically decides the workload distribution between CPUs and GPUs Data-Intensive and High Performance Computing Research Group

  26. MATE-CG Overview [Figure: execution workflow of MATE-CG] Data-Intensive and High Performance Computing Research Group

  27. Auto-Tuning Framework • Auto-tuning problem: • Given an application, find the optimal data distribution between the CPU and the GPU to minimize the overall running time on each node • For example: which is best, 20/80, 50/50, or 70/30? • Our approach: • Exploits the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations • Constructs an analytical model to predict performance • The optimal value is computed and learned over the first few iterations (sketched below) • No compile-time search or tuning is needed • Low runtime overheads with a large number of iterations Data-Intensive and High Performance Computing Research Group
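A hedged sketch of how such tuning could work: measure the CPU and GPU times during the first few iterations and adjust the CPU share p so that the two sides finish at about the same time. The update rule and names below are illustrative assumptions, not the MATE-CG implementation.

```cpp
#include <algorithm>

// Illustrative only: adjust the CPU share p over the first few iterations so
// that the CPU and GPU parts take about the same time (the overall time is
// roughly max(CPU time, GPU time), so balancing the two minimizes it).
double update_cpu_share(double p, double t_cpu, double t_gpu) {
    // Assume processing time is roughly proportional to the data share:
    // t_cpu ~ a*p, t_gpu ~ b*(1-p).  Solve a*p = b*(1-p) for the new p.
    double a = t_cpu / std::max(p, 1e-6);
    double b = t_gpu / std::max(1.0 - p, 1e-6);
    double next = b / (a + b);
    return std::min(0.95, std::max(0.05, next));   // keep both devices busy
}

// Usage sketch inside a hypothetical iterative driver:
//   double p = 0.5;                                  // start with a 50/50 split
//   for (int it = 0; it < num_iters; ++it) {
//       auto [t_cpu, t_gpu] = run_iteration(p);      // measure both parts
//       if (it < tuning_iters) p = update_cpu_share(p, t_cpu, t_gpu);
//   }
```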

  28. The Analytical Model • p: fraction of the data assigned to the CPU (the remaining 1-p goes to the GPU) • T_c(p): processing time on the CPU; T_c_o: fixed overheads on the CPU • T_g(1-p): processing time on the GPU; T_g_o: fixed overheads on the GPU • T_cg(p): overall processing time on CPU+GPU [Figure: illustration of the relationship between T_cg and p] Data-Intensive and High Performance Computing Research Group
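Restated as equations, a reconstruction consistent with the variable list above; the assumption that the CPU and GPU portions overlap (so the overall time is the maximum of the two) and the linear forms used for the optimum are not taken verbatim from the slides.

```latex
% Overall time if the CPU and GPU parts run concurrently (assumption):
T_{cg}(p) = \max\bigl(T_c(p) + T_{c,o},\ T_g(1-p) + T_{g,o}\bigr),
\qquad p^{*} = \arg\min_{0 \le p \le 1} T_{cg}(p)

% With roughly linear processing times, T_c(p) \approx p\,t_c and
% T_g(1-p) \approx (1-p)\,t_g, the optimum is where the two branches meet:
p^{*} \approx \frac{t_g + T_{g,o} - T_{c,o}}{t_c + t_g}
```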

  29. Experiments Design • Experiments platform: • A heterogeneous CPU-GPU cluster • Each node has one multi-core CPU and one GPU • Intel 8-core CPU • NVIDIA Tesla (Fermi) GPU (14 × 32 = 448 cores) • Used up to 128 CPU cores and 7168 GPU cores on 16 nodes • Three representative applications • Gridding Kernel, EM, and PageRank • For each application, we run it in four modes in the cluster: CPU-1, CPU-8, GPU-only, and CPU-8-n-GPU Data-Intensive and High Performance Computing Research Group

  30. Results: Scalability with Increasing # of GPUs • GPU-only is better than CPU-8 [Charts: Avg. Time Per Iteration (sec) vs. # of nodes for PageRank, EM, and the Gridding Kernel] Data-Intensive and High Performance Computing Research Group

  31. Results: Auto-tuning • On 16 nodes [Charts: Execution Time In One Iteration (sec) vs. Iteration Number for EM, PageRank, and the Gridding Kernel] Data-Intensive and High Performance Computing Research Group

  32. Our Current Work • Four systems on different parallel architectures: • MATE (Map-reduce with an AlternaTE API) • For multi-core environments and data mining • Ex-MATE (Extended MATE) • For clusters of multi-cores • Provided large-sized reduction object support • MATE-CG (MATE for Cpu-Gpu) • For heterogeneous CPU-GPU clusters • Provided an auto-tuning framework for data distribution • FT-MATE (Fault Tolerant MATE) • Supports more efficient fault tolerance for MPI programs • Makes use of distributed memory and reliable storage Data-Intensive and High Performance Computing Research Group

  33. Reduction Object-based Fault Tolerance • Fault tolerance is important for data-intensive computing • The reduction object helps to provide fault tolerance at lower costs than Hadoop • Example: FREERIDE-G vs. Hadoop (taken from Tekin Bicer et al., IPDPS 2010) • K-Means Clustering: one node fails at 50% of data processing • Overheads --- Hadoop: 23.06 | 71.78 | 78.11; FREERIDE-G: 20.37 | 8.18 | 9.18 Data-Intensive and High Performance Computing Research Group

  34. Applying the Reduction Object-based Approach to MPI Programs • The reduction object model achieves better fault tolerance for data mining and graph mining, as in the FREERIDE-G and MATE systems • Fault-tolerance support for MPI applications remains an ongoing challenge, and existing solutions will not work in the future • Checkpoint time will exceed the Mean Time To Failure in the exascale era (exascale systems expected in 2018)! • So, can our ideas from the reduction object work help for other types of applications, like MPI programs? Data-Intensive and High Performance Computing Research Group

  35. Our Fault Tolerance Approach (I) • Based on the Extended Generalized Reduction model • Aims to improve on the expensive checkpoint/restart (C/R) used for MPI applications • Divides the reduction object into two parts • One inter-node global reduction object • One set of intra-node local reduction objects • Only the global reduction object participates in the global combination phase • Targets applications that can be written using the extended generalized reduction model • No redundant/backup nodes are used or needed • Deals with fail-stop failures (a failed node does not affect others) • Assumes failure detection is accurate and instant Data-Intensive and High Performance Computing Research Group

  36. Our Fault Tolerance Approach (II) • Uses distributed memory and reliable storage • Cache the global reduction object in the memory of other nodes • Save the local reduction objects onto persistent storage • Can be viewed as an adaptation of the application-level checkpointing approach • The key difference is that we exploit the reduction object structures and re-distribute the remaining data upon a failure • There is no need to restart a failed process • Suitable for a diverse set of applications • Data mining: only the global reduction object is needed • Stencil computations: both global and local ones are needed • Irregular reductions: only the local reduction objects are needed Data-Intensive and High Performance Computing Research Group

  37. MPI Application Examples • Dense Grid Computations: stencil computations like Jacobi and Sobel Filter • Sparse Grid Computations: irregular reductions like Euler Solver and Molecular Dynamics Data-Intensive and High Performance Computing Research Group

  38. Implementing Dense Grid Computations using EGR (I) • A simple way is based on output matrix partitioning • The input data needed for computing an output partition consists of the corresponding input partition and the elements on the border of its neighboring input partitions • Each output partition is a local reduction object, and there is no use of a global reduction object (see the sketch below) [Figure: output matrix partitioning and the corresponding input matrix] Data-Intensive and High Performance Computing Research Group
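As referenced above, a small sketch of the output-partitioning scheme for a stencil such as Jacobi: each output partition (a local reduction object) is computed from its own input rows plus one border row from each neighboring input partition. The function and its indexing conventions are illustrative assumptions, not the FT-MATE code.

```cpp
#include <vector>

// One Jacobi sweep for a single output partition (rows [r0, r1) of the grid).
// The input rows needed are r0-1 .. r1, i.e. the corresponding input partition
// plus one border row from each neighbor.  The output rows form a LOCAL
// reduction object.  Assumes 1 <= r0 and r1 <= total_rows - 1 (the outermost
// boundary rows are handled separately).
void jacobi_partition(const std::vector<std::vector<double>>& in,  // globally indexed rows
                      std::vector<std::vector<double>>& out_part,  // rows r0 .. r1-1
                      int r0, int r1, int ncols) {
    for (int r = r0; r < r1; ++r)
        for (int c = 1; c < ncols - 1; ++c)
            out_part[r - r0][c] = 0.25 * (in[r - 1][c] + in[r + 1][c] +
                                          in[r][c - 1] + in[r][c + 1]);
}
```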

  39. Implementing Dense Grid Computations using EGR (II) • An alternative is based on input matrix partitioning • A data-driven approach: for each point in the input matrix, determine the corresponding points to be updated in the output matrix • The ghost output rows shared by two neighboring output partitions form the global reduction object, and the other output rows are local reduction objects [Figure: input matrix partitioning and the corresponding output matrix] Data-Intensive and High Performance Computing Research Group

  40. Implementing Sparse Grid Computations using EGR (I) • For irregular reductions, the corresponding points in the output space are not known at compile time • Input space partitioning would have to treat the entire output space as the global reduction object, which resulted in poor scalability in our preliminary experiments • We therefore choose output space partitioning in our implementation and re-organize the corresponding input in the pre-processing stage Data-Intensive and High Performance Computing Research Group

  41. Implementing Sparse Grid Computations using EGR (II) [Figure: partitioning on the reduction space for irregular reductions] Data-Intensive and High Performance Computing Research Group

  42. The FT-MATE System (I) [Figure: the processing structure with fault tolerance support (checkpointing and recovery)] Data-Intensive and High Performance Computing Research Group

  43. The FT-MATE System (II) • Fault tolerance runtime components: • Configuration --- MATE_FTSetup() • Set the checkpoint interval, the directory for saving checkpoints, etc. • Checkpointing • MATE_MemCheckpoint() --- synchronous/asynchronous data exchange • MATE_DiskCheckpoint() --- single-thread/multi-thread data output • Detecting Failures --- MATE_DetectFailure() • Peer-to-peer communication among the nodes is kept alive, and timeouts are used to detect node failures with the aid of the MPI stack • Recovering Failures --- MATE_RecoverFailure() • Data re-distribution and processing of the unfinished data • Output space recovery if needed Data-Intensive and High Performance Computing Research Group
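To show how the components listed above could fit together, here is a hypothetical driver skeleton. Only the MATE_* function names come from the slide; their signatures, the helper functions, and the checkpoint path are assumptions for illustration, not the actual FT-MATE interface.

```cpp
// Hypothetical declarations -- the real FT-MATE signatures may differ.
void MATE_FTSetup(int checkpoint_interval, const char* checkpoint_dir);
void MATE_MemCheckpoint();
void MATE_DiskCheckpoint();
bool MATE_DetectFailure();
void MATE_RecoverFailure();
void process_local_splits();   // application's generalized reduction step
void global_combination();     // combine the global reduction object

// Sketch of an iterative application driven by the fault tolerance runtime.
void driver(int num_iterations, int checkpoint_interval) {
    MATE_FTSetup(checkpoint_interval, "/path/to/checkpoints");
    for (int it = 0; it < num_iterations; ++it) {
        process_local_splits();
        global_combination();
        if ((it + 1) % checkpoint_interval == 0) {
            MATE_MemCheckpoint();   // global R.O. -> memory of peer nodes
            MATE_DiskCheckpoint();  // local R.O.s -> reliable storage
        }
        if (MATE_DetectFailure())   // timeouts on peer-to-peer links
            MATE_RecoverFailure();  // re-distribute unfinished data; recover
    }                               // the output space if needed
}
```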

  44. The FT-MATE System (III) [Figure: example of the fault recovery process for an irregular application] Data-Intensive and High Performance Computing Research Group

  45. Experiments Design • Experiments platform: • A cluster of multi-core nodes; each node has one Intel 8-core CPU • Four representative apps in scientific computing • Stencil Computations: Jacobi and Sobel Filter • Irregular Reductions: Euler Solver and Molecular Dynamics • For each application, we run it in two modes: • CPU-1: use 1 CPU core per node • CPU-8: use 8 CPU cores per node • Evaluated against the fault-tolerant MPICH2 library • FT-MATE vs. MPICH2-BLCR Data-Intensive and High Performance Computing Research Group

  46. Results: Scalability Study • Scalability without a failure in FT-MATE • MPI's absolute performance is similar to that of FT-MATE [Charts: Avg. Time Per Iteration (secs) vs. # of Nodes for the CPU-1 and CPU-8 versions on 2, 4, 8, and 16 nodes, for each of the four applications] Data-Intensive and High Performance Computing Research Group

  47. Results: Checkpointing Overheads • Low checkpointing overheads • Checkpoint size for 7.4GB Moldyn: 9.3GB vs. 48MB • On 8 nodes, running for 1000 iterations in CPU-8 mode [Charts: Normalized Checkpointing Costs (%) vs. # of Iterations per Checkpoint Interval for Sobel Filter and Molecular Dynamics] Data-Intensive and High Performance Computing Research Group

  48. Results: Fault Recovery • Low absolute recovery costs • W/REBIRTH in FT-MATE: the failed node recovers one iteration after the fault recovery starts • On 32 nodes; the checkpoint interval is 100/1000 [Charts: Normalized Recovery Costs (%) vs. Failure Point (%) for Jacobi and Euler Solver] Data-Intensive and High Performance Computing Research Group

  49. Summary • MATE, Ex-MATE, MATE-CG, and FT-MATE for multi-cores, homogeneous clusters, and heterogeneous clusters • A diverse set of applications: data mining, graph mining, scientific computing, stencil computations, irregular reductions, etc. • FT-MATE supports more efficient fault tolerance for MPI programs • Also, the MATE series has been used internally in our group • MATE-EC2: allows starting MATE instances on cloud providers like Amazon Web Services • Sci-MATE: supports scientific data formats like NetCDF and HDF5 • As a backend to run C code generated from Python/R code • Some ongoing follow-up projects: • Using an implicit reduction object with the same map-reduce API? • What are the opportunities in Sandy Bridge and Fusion APUs? • Dealing with GPU failures in MATE-CG? Data-Intensive and High Performance Computing Research Group

  50. Thank You! • Questions, comments, and suggestions? Data-Intensive and High Performance Computing Research Group
