Performance Issues in Parallelizing Data-Intensive applications on a Multi-core Cluster. Vignesh Ravi and Gagan Agrawal. OUTLINE. Motivation FREERIDE Middleware Generalized Reduction structure Shared Memory Parallelization techniques Scalability results - Kmeans, Apriori & EM
Performance Issues in Parallelizing Data-Intensive applications on a Multi-core Cluster Vignesh Ravi and Gagan Agrawal {raviv,agrawal}@cse.ohio-state.edu
OUTLINE • Motivation • FREERIDE Middleware • Generalized Reduction structure • Shared Memory Parallelization techniques • Scalability results - Kmeans, Apriori & EM • Performance Analysis results • Related work & Conclusion
Motivation • Availability of huge amounts of data • Data-intensive applications • Advent of multi-core processors • Need for abstractions and parallel programming systems • The best Shared Memory Parallelization (SMP) technique is still not clear
Context: FREERIDE • A middleware for parallelizing data-intensive applications • Motivated by difficulties in implementing parallel data mining applications • Provides high-level APIs for easier parallel programming • Based on the observation that many data mining and scientific applications share a similar generalized reduction structure
FREERIDE – Core • Reduction Object – A shared data structure where results from processed data instances are stored Types of Reduction • Local Reduction – Reduction within a single node • Global Reduction – Reduction among a cluster of nodes
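The two reduction levels above can be sketched in a few lines. This is only an illustration of the generalized reduction pattern, not the FREERIDE API: the function names `local_reduction` and `global_reduction`, the use of a `Counter` as the reduction object, and the tiny data set are all assumptions made for the example.

```python
from collections import Counter

def local_reduction(chunk):
    """Process one node's data instances into that node's reduction object."""
    reduction_object = Counter()
    for item in chunk:
        # Each update is associative and commutative, so the order in
        # which data instances are processed does not change the result.
        reduction_object[item] += 1
    return reduction_object

def global_reduction(local_objects):
    """Combine the per-node reduction objects into one final result."""
    result = Counter()
    for obj in local_objects:
        result.update(obj)  # element-wise sum of the counts
    return result

# Hypothetical data set, split across two "nodes".
node_chunks = [["a", "b", "a"], ["b", "c", "b"]]
local_objects = [local_reduction(chunk) for chunk in node_chunks]
final = global_reduction(local_objects)
print(dict(final))
```

Because every update only accumulates into the reduction object, the local step can run independently on each node and the global step reduces to a merge.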
Parallelization Challenges • Reduction object cannot be statically partitioned between threads/nodes • Data races must be handled at runtime • Size of the reduction object could be large • Replication can cause memory overhead • Updates to the reduction object are fine-grained • Locking schemes can cause significant overhead
Techniques in FREERIDE • Full-replication (f-r) • Locking-based techniques • Full-locking (f-l) • Optimized full-locking (o-f-l) • Cache-sensitive locking (cs-l)
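A minimal sketch of the contrast between full replication and full locking, assuming that f-r gives each thread a private copy of the reduction object that is merged at the end, and that f-l uses one lock per element of a single shared object. The function names and the modulo-based update are hypothetical, and Python threads are used only to show the structure, not to measure performance. The optimized and cache-sensitive locking variants are not modeled here.

```python
import threading

def full_replication(data, num_threads, size):
    """Each thread updates a private copy; copies are merged at the end."""
    copies = [[0] * size for _ in range(num_threads)]

    def work(tid):
        for key in data[tid::num_threads]:
            copies[tid][key % size] += 1  # no lock: the copy is private

    threads = [threading.Thread(target=work, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Merge step: its cost grows with size * num_threads.
    return [sum(column) for column in zip(*copies)]

def full_locking(data, num_threads, size):
    """All threads update one shared object; one lock guards each element."""
    shared = [0] * size
    locks = [threading.Lock() for _ in range(size)]

    def work(tid):
        for key in data[tid::num_threads]:
            i = key % size
            with locks[i]:  # lock cost is paid on every update
                shared[i] += 1

    threads = [threading.Thread(target=work, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

# Hypothetical input: 1000 integer keys reduced into 10 counters.
data = list(range(1000))
replicated = full_replication(data, num_threads=4, size=10)
locked = full_locking(data, num_threads=4, size=10)
print(replicated == locked)
```

Both functions compute the same reduction object. They differ in where the cost lands: replication pays in memory and in the final merge, while locking pays a small cost on every update.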
Applications Implemented on FREERIDE • Apriori (Association mining) • Kmeans (Clustering based) • Expectation Maximization (E-M) (clustering based)
Goals in Experimental Study • Scalability of data-intensive applications on multi-core • Comparison of different shared memory parallelization (SMP) techniques and MPI • Performance analysis of SMP techniques
Experimental setup Each node in the cluster has: • Intel Xeon E5345 CPUs • 2 quad-core processors (8 cores per node) • Each core at 2.33 GHz • 6 GB main memory Nodes in the cluster are connected by InfiniBand
Experiments Two sets of experiments: • Comparison of scalability results for f-r, cs-l, o-f-l and MPI with k-means, Apriori and E-M • Single node • Cluster of nodes • Performance analysis results with k-means, Apriori and E-M
Applications data setup • Apriori • Dataset size: 900 MB • Support = 3%, Confidence = 9% • K-means • Dataset size: 6.4 GB • 3-dimensional points • Number of clusters: 250 • E-M • Dataset size: 6.4 GB • 3-dimensional points • Number of clusters: 60
Performance Analysis of SMP techniques • Given an application, can we predict the factors that determine the best SMP technique? • Why do locking techniques suffer with Apriori but compete well with other applications? • What factors limit the overall scalability of data-intensive applications?
Performance Analysis setup • Valgrind used for the Dynamic Binary Analysis • Cachegrind used for the analysis of cache utilization
Performance Analysis Locking vs Merge Overhead
Performance Analysis (contd…) Relative L2 misses for reduction object
Performance Analysis (contd …) Total program read/write misses
Analysis • Important trade-off • Memory needs of the application • Frequency of updating the reduction object • E-M is compute and memory intensive • Locking overhead is very low • Replication overhead is high • Apriori has a high update fraction and very little computation • Locking overhead is extremely high • Replication performs the best
Related Work • Google MapReduce • Yahoo Hadoop • Phoenix – Stanford University • SALSA – Indiana University
Conclusion • Replication and locking schemes can each outperform the other, depending on the application • Locking schemes have huge overhead when there is little computation between updates to the reduction object • MPI processes compete well up to 4 threads, but experience communication overheads with 8 threads • Performance analysis shows that the memory needs of an application and its update fraction are significant factors for scalability
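The memory side of this trade-off can be made concrete with a little arithmetic. The sketch below assumes that full replication keeps one copy of the reduction object per thread and that full locking keeps one copy plus one lock per element. The element count, element size, and lock size are hypothetical values chosen for illustration, not measurements from the experiments.

```python
def replication_bytes(num_threads, num_elements, element_bytes):
    """Memory for one private copy of the reduction object per thread."""
    return num_threads * num_elements * element_bytes

def full_locking_bytes(num_elements, element_bytes, lock_bytes):
    """Memory for one shared reduction object plus one lock per element."""
    return num_elements * (element_bytes + lock_bytes)

# Hypothetical sizes: one million 8-byte elements, 8-byte locks.
elements, element_size, lock_size = 1_000_000, 8, 8
for threads in (1, 2, 4, 8):
    print(threads,
          replication_bytes(threads, elements, element_size),
          full_locking_bytes(elements, element_size, lock_size))
```

Under these assumptions the replication footprint grows linearly with the thread count, while the locking footprint stays fixed. That only describes memory; the per-update cost of locking is a separate factor.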
Thank you!!!! Questions???