
Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster


Presentation Transcript


  1. Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster Vignesh Ravi and Gagan Agrawal {raviv,agrawal}@cse.ohio-state.edu

  2. OUTLINE • Motivation • FREERIDE Middleware • Generalized Reduction structure • Shared Memory Parallelization techniques • Scalability results - k-means, Apriori & E-M • Performance Analysis results • Related work & Conclusion

  3. Motivation • Availability of huge amounts of data • Data-intensive applications • Advent of multi-core • Need for abstractions and parallel programming systems • The best shared memory parallelization (SMP) technique is still not clear.

  4. Context: FREERIDE • A middleware for parallelizing data-intensive applications • Motivated by difficulties in implementing parallel data mining applications • Provides high-level APIs for easier parallel programming • Based on the observation that many data mining and scientific applications share a similar generalized reduction structure

  5. FREERIDE – Core • Reduction Object – a shared data structure where results from processed data instances are stored. Types of reduction: • Local Reduction – reduction within a single node • Global Reduction – reduction across the nodes of a cluster
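To make the reduction object concrete, here is a minimal C++ sketch; the names, types, and signatures are illustrative assumptions, not FREERIDE's actual API.

```cpp
#include <cstddef>
#include <vector>

// A reduction object holds accumulated results, e.g. per-cluster
// running sums and counts for k-means. (Illustrative only.)
struct ReductionObject {
    std::vector<double> sums;    // one slot per cluster/candidate
    std::vector<long>   counts;
};

// Local reduction: each processed data instance updates the shared object.
void local_reduce(ReductionObject& robj, const double* point, int cluster) {
    robj.sums[cluster]   += point[0];  // fine-grained, data-dependent update
    robj.counts[cluster] += 1;
}

// Global reduction: combine per-node objects across the cluster
// (e.g. after exchanging them over the network).
void global_reduce(ReductionObject& mine, const ReductionObject& other) {
    for (std::size_t i = 0; i < mine.sums.size(); ++i) {
        mine.sums[i]   += other.sums[i];
        mine.counts[i] += other.counts[i];
    }
}
```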

  6. Generalized Reduction structure
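The figure for this slide is not reproduced in the transcript, but the pattern it depicts is, roughly, the loop below: each element is processed independently, and the result is folded into the reduction object at a data-dependent index. This sketch reuses the illustrative ReductionObject above and assumes a k-means-style processing step for concreteness.

```cpp
#include <array>
#include <vector>

static double dist2(const std::array<double, 3>& a,
                    const std::array<double, 3>& b) {
    double s = 0.0;
    for (int i = 0; i < 3; ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}

// Generalized reduction: for each element e, compute (index, value),
// then reduce the value into the reduction object at that index.
void process_chunk(ReductionObject& robj,
                   const std::vector<std::array<double, 3>>& chunk,
                   const std::vector<std::array<double, 3>>& centers) {
    for (const auto& pt : chunk) {
        // "process(e)": find the nearest cluster center.
        int best = 0;
        for (int c = 1; c < static_cast<int>(centers.size()); ++c)
            if (dist2(pt, centers[c]) < dist2(pt, centers[best])) best = c;
        // "reduce": a fine-grained update to a data-dependent location;
        // this is the step the SMP techniques below must make thread-safe.
        local_reduce(robj, pt.data(), best);
    }
}
```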

  7. Parallelization Challenges • The reduction object cannot be statically partitioned between threads/nodes • Data races must be handled at runtime • The reduction object can be large • Replication can cause memory overhead • Updates to the reduction object are fine-grained • Locking schemes can cause significant overhead

  8. Techniques in FREERIDE • Full-replication (f-r) • Locking-based techniques: • Full-locking (f-l) • Optimized full-locking (o-f-l) • Cache-sensitive locking (cs-l)

  9. Memory Layout of locking schemes
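The layout figure is likewise not reproduced here; the sketch below conveys the idea behind each scheme. The spinlock type, field sizes, and a 64-byte cache line are assumptions, not FREERIDE's actual code.

```cpp
#include <atomic>
#include <vector>

// A 1-byte test-and-set spinlock keeps the layouts compact.
struct SpinLock {
    std::atomic<bool> held{false};
    void lock()   { while (held.exchange(true, std::memory_order_acquire)) {} }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Full locking (f-l): one lock per element, but locks live in a separate
// array, so an update can touch two cache lines (lock + element).
struct FullLocking {
    std::vector<double>   elems;
    std::vector<SpinLock> locks;   // locks[i] guards elems[i]
};

// Optimized full locking (o-f-l): each lock sits next to the element it
// guards, so a single cache-line fetch brings in both.
struct LockedElem {
    SpinLock lock;
    double   value;
};

// Cache-sensitive locking (cs-l): one lock per cache line, shared by the
// elements packed into the rest of that line; far fewer locks, and lock
// and data still arrive together.
struct alignas(64) CsLine {
    SpinLock lock;      // guards every element in this line
    double   elems[7];  // lock slot (padded to 8 bytes) + 7 doubles = 64 bytes
};
```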

  10. Applications Implemented on FREERIDE • Apriori (association mining) • k-means (clustering) • Expectation Maximization (E-M) (clustering)

  11. Goals in Experimental Study • Scalability of data-intensive applications on multi-core • Comparison of different shared memory parallelization (SMP) techniques and MPI • Performance analysis of SMP techniques

  12. Experimental setup Each node in the cluster has: • Two quad-core Intel Xeon E5345 CPUs (8 cores per node) • 2.33 GHz per core • 6 GB main memory Nodes in the cluster are connected by InfiniBand

  13. Experiments Two sets of experiments: • Comparison of scalability results for f-r, cs-l, o-f-l, and MPI with k-means, Apriori, and E-M • Single node • Cluster of nodes • Performance analysis results with k-means, Apriori, and E-M

  14. Applications data setup • Apriori • Dataset size: 900 MB • Support = 3%, confidence = 9% • k-means • Dataset size: 6.4 GB • 3-dimensional points • Number of clusters: 250 • E-M • Dataset size: 6.4 GB • 3-dimensional points • Number of clusters: 60

  15. Apriori (single node)

  16. Apriori (cluster)

  17. k-means (single node)

  18. k-means (cluster)

  19. E-M (single node)

  20. E-M (cluster)

  21. Performance Analysis of SMP techniques • Given an application, can we predict the factors that determine the best SMP technique? • Why do locking techniques suffer with Apriori, yet compete well on the other applications? • What factors limit the overall scalability of data-intensive applications?

  22. Performance Analysis setup • Valgrind used for dynamic binary analysis • Cachegrind used for analyzing cache utilization
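For reference, a typical Cachegrind run looks like the following; the binary name and input file are hypothetical stand-ins, since the slides do not show the exact command line.

```sh
# Collect cache-simulation statistics for one application run
valgrind --tool=cachegrind ./kmeans points.dat
# Summarize hits/misses per function from the generated output file
cg_annotate cachegrind.out.<pid>
```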

  23. Performance Analysis: locking vs. merge overhead

  24. Performance Analysis (contd.) Relative L2 misses for the reduction object

  25. Performance Analysis (contd.) Total program read/write misses

  26. Analysis • Important trade-off: • Memory needs of the application • Frequency of updates to the reduction object • E-M is compute- and memory-intensive • Locking overhead is very low • Replication overhead is high • Apriori has a high update fraction and very little computation • Locking overhead is extremely high • Replication performs best

  27. Related Work • Google MapReduce • Yahoo Hadoop • Phoenix – Stanford University • SALSA – Indiana University

  28. Conclusion • Replication and locking schemes can each outperform the other • Locking schemes incur huge overhead when there is little computation between updates to the reduction object • MPI processes compete well up to 4 threads, but experience communication overheads at 8 threads • Performance analysis shows that an application's memory needs and update fraction are the significant factors for scalability

  29. Thank you! Questions?
