Mukesh Agrawal Scaling Parallel Applications
Introduction • Parallel systems are ccNUMA • ...so is ccNUMA useful? • How much faster is it? • How can we make it faster? • How hard is it?
ccNUMA (review) • Multiple processors • Private physical memories • Shared address space • Hardware support for cache coherence
Scenario • Scientific computation problems (SPLASH-2) • Metric: • Simulation study (simulate Stanford FLASH) • Experimental study (SGI Origin 2000, 128 proc)
Efficiency and Size • What is the smallest problem instance to achieve 60% efficiency? • Why might this be a bad metric? • Assumes more efficiency for larger instances • May not happen if data is laid out poorly (cache usage) • Why might larger instances run more efficiently? • Better communication/computation ratio (nearest neighbor) • Less load imbalance (less waiting for others) • Cache capacity (many misses on uniprocessor) • Cache sharing (small problem may share lines)
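The efficiency question above can be made concrete with a small sketch. It assumes the usual definition of parallel efficiency (speedup divided by processor count), which the slides do not spell out, and a simple model of a nearest-neighbor computation on an n x n grid split into square blocks. The function names and the timing numbers are hypothetical and purely illustrative; they are not results from the study.

```python
import math

def efficiency(t_serial, t_parallel, procs):
    """Parallel efficiency, assuming the usual definition: speedup / procs."""
    speedup = t_serial / t_parallel
    return speedup / procs

def smallest_efficient_size(timings, procs, threshold=0.6):
    """Return the smallest problem size whose efficiency meets the threshold.

    timings maps problem size -> (serial time, parallel time on `procs`).
    Returns None if no measured size reaches the threshold.
    """
    for size in sorted(timings):
        t_serial, t_parallel = timings[size]
        if efficiency(t_serial, t_parallel, procs) >= threshold:
            return size
    return None

def comm_to_comp_ratio(n, procs):
    """Boundary-to-area ratio of one block in an n x n nearest-neighbor grid.

    Assumes procs is a perfect square and each processor owns one square
    block. Communication scales with the block perimeter, computation with
    its area, so the ratio shrinks as n grows.
    """
    side = n / math.sqrt(procs)
    return (4 * side) / (side * side)

# Invented timings (seconds), for illustration only.
timings = {1024: (10.0, 1.0), 4096: (40.0, 2.0), 16384: (160.0, 6.0)}
best = smallest_efficient_size(timings, procs=32)
```

With these made-up numbers the smallest size reaching 60% efficiency on 32 processors is the middle one. The ratio function shows why larger instances tend to do better for nearest-neighbor codes: doubling n halves the communication-to-computation ratio at a fixed processor count.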
Efficiency and Size (results) • Depends on problem • For some, efficiency on reasonable sizes (Barnes-Hut) • Others never efficient (Radix) • Experiments show: reality requires larger instances than simulation
Efficiency and Structure • Can we get higher efficiency on small instances by modifying computation structure? • What might we try? • Reduce communication! • Algorithmic changes • Cache management (keep remote data in cache) • Static partitioning • Most programs can scale after restructuring • Bonus: changes for ccNUMA often help with SVM (cluster) systems as well
Programming Guidelines • Partition statically; optimize for locality • Load balance should not be compromised • Separate partitions, avoid write sharing
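The guidelines above can be illustrated with a minimal sketch of static block partitioning. It is an assumption-laden illustration, not the method used in the study: `block_partition` and `partial_sums` are hypothetical helper names, and the worker loop runs sequentially here just to show the ownership pattern.

```python
def block_partition(n_items, n_procs):
    """Split range(n_items) into n_procs contiguous blocks.

    Block sizes differ by at most one item, so the load stays balanced,
    and each block is contiguous, which favors locality.
    """
    base, extra = divmod(n_items, n_procs)
    bounds = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def partial_sums(data, n_procs):
    """Each rank reads only its own block and writes only its own slot.

    No two ranks write the same location, which is the 'avoid write
    sharing' guideline. (Sequential stand-in for per-processor work.)
    """
    results = [0] * n_procs
    for rank, (lo, hi) in enumerate(block_partition(len(data), n_procs)):
        results[rank] = sum(data[lo:hi])
    return results
```

On real ccNUMA hardware the per-rank output slots would also need to sit on separate cache lines to avoid false sharing; that layout detail is outside what this Python sketch can express.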
Conclusion • ccNUMA can deliver scalable performance for scientific computation • Restructuring the program is usually required • ccNUMA and SVM machines need similar program modifications • Simulation is good for qualitative questions, less reliable for quantitative ones