Performance Evaluation of the Parallel Fast Multipole Algorithm Using the Optimal Effectiveness Metric Ioana Banicescu and Mark Bilderback Department of Computer Science and NSF/ERC for Computational Field Simulation Mississippi State University
Overview • Scientific Applications • Performance Evaluation • Scalability Analysis • Optimal Effectiveness Metric • Parallel Fast Multipole Algorithm • Experimental Results • Conclusions and Future Work
Scientific Applications • Large, computationally intensive, irregular • Parallel implementations (various algorithms) • Performance degradation factors • communication and load imbalance • architecture independent • architecture dependent
Architecture-Independent Factors • Problem characteristics • nonuniformity of input data • Algorithmic characteristics • serial sections • communication patterns • local / non-local dependencies
Architecture-Dependent Factors • Architectural characteristics • Language, OS • Interconnection network • Characteristics of each component processor • speed, memory, etc.
Performance Evaluation • Parallel Applications • Scalability • algorithm, architecture, mapping • Evaluation • Isolated to particular applications • Different types of performance metrics • Performance metric characteristics • Relevant, consistent, quantitative, predictive
Performance Metrics • Commonly used (time, speedup, efficiency, cost) • Speedup [Amdahl ‘67] • Scaled Speedup [Gustafson ‘88] • Fixed time size-up [Sun and Gustafson ‘91] • Isoefficiency [Gupta & Kumar ‘93] • Optimal effectiveness [Luke, Banicescu, Li ‘98]
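For reference, the standard definitions behind these metrics, with T(p) denoting execution time on p processors:

```latex
S(p) = \frac{T(1)}{T(p)} \;\;\text{(speedup)}, \qquad
E(p) = \frac{S(p)}{p} \;\;\text{(efficiency)}, \qquad
C(p) = p\,T(p) \;\;\text{(cost)}
```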
Isoefficiency • Algorithms that can add processors at a faster rate (while maintaining efficiency) achieve higher performance. • Does not identify the number of processors required before an algorithm becomes an effective option. • Discounts valuable parallel algorithms for which an isoefficiency function does not exist.
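In the isoefficiency framework of [Gupta & Kumar '93], with W the problem size (serial work) and To(W,p) the total parallel overhead, efficiency can be held constant only if W grows fast enough to balance the overhead; the required growth rate is the isoefficiency function:

```latex
E = \frac{1}{1 + T_o(W,p)/W}
\;\;\Longrightarrow\;\;
W = \frac{E}{1-E}\,T_o(W,p) = K\,T_o(W,p)
```

If no such function exists (no growth of W can hold E constant), the metric is silent, which motivates the point above.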
Performance - Cost Tradeoffs • High-performance applications seek a performance-cost balance. • Scalability analysis - theoretical, experimental. • Optimal effectiveness [Luke, Banicescu, Li ‘98] • Similar to (E*S)max [Tang, Li ‘90] • Asymptotic relationship between isoefficiency and (E*S)max
Optimal Effectiveness • Cost Effectiveness: performance delivered per unit cost • Optimal Effectiveness: the maximum cost effectiveness over the number of processors (reconstructed formulas below)
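The slide's formulas were lost in extraction; a plausible reconstruction, consistent with the stated similarity to (E*S)max, writes cost effectiveness Γ(p) as speedup per unit cost and optimal effectiveness as its maximum (the exact notation in [Luke, Banicescu, Li '98] may differ):

```latex
\Gamma(p) = \frac{S(p)}{C(p)} = \frac{T(1)}{p\,T(p)^2}, \qquad
\Gamma_{opt} = \max_{p}\,\Gamma(p), \qquad
P_{opt} = \operatorname*{arg\,max}_{p}\,\Gamma(p)
```

Since E(p)·S(p) = T(1)²/(p·T(p)²), maximizing Γ(p) for a fixed problem size selects the same Popt as maximizing E·S, matching the (E*S)max connection.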
Optimal Effectiveness (contd.) • Compares the performance of different parallel algorithms. • Identifies the specific conditions on problem size and number of processors that characterize crossover points and intervals where one algorithm becomes more cost effective than another. • Prescribes the number of processors relevant to a particular problem size: Popt (see the sketch below).
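A minimal sketch of locating Popt from measured timings, assuming the Γ(p) reconstruction above; the timing values are hypothetical and purely illustrative:

```python
# Hypothetical measured execution times T(p) in seconds, keyed by processor count p.
times = {1: 512.0, 4: 140.0, 8: 76.0, 16: 44.0, 32: 30.0, 64: 26.0}

def effectiveness(p, t_p, t_1):
    """Cost effectiveness Gamma(p) = S(p)/C(p) = T(1) / (p * T(p)^2)."""
    return t_1 / (p * t_p * t_p)

t_1 = times[1]  # serial baseline T(1)
# P_opt is the processor count that maximizes Gamma(p).
p_opt = max(times, key=lambda p: effectiveness(p, times[p], t_1))
print(f"P_opt = {p_opt}, Gamma_opt = {effectiveness(p_opt, times[p_opt], t_1):.5f}")
```

With these numbers Γ(p) peaks at p = 32: adding processors beyond that still shortens the runtime slightly, but no longer pays for its cost.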
The N-body Problem • Problem: simulate the evolution of N particles over time (given initial positions and velocities) • Compute new positions and velocities of the N particles after one time step • Applications: astrophysics, molecular dynamics • Naive algorithm: O(N²) (see the sketch below) [Figure: resulting force on a particle]
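For concreteness, a minimal sketch of the naive O(N²) direct-summation time step (illustrative only; a softened gravitational kernel with G = 1 is assumed):

```python
import numpy as np

def direct_forces(pos, mass, eps=1e-3):
    """Naive O(N^2) pairwise gravitational forces with Plummer softening."""
    forces = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = pos[j] - pos[i]            # displacement vector
            r2 = d.dot(d) + eps * eps      # softened squared distance
            forces[i] += mass[i] * mass[j] * d / r2**1.5
    return forces

def step(pos, vel, mass, dt=1e-2):
    """Advance all N particles by one time step (simple Euler update)."""
    acc = direct_forces(pos, mass) / mass[:, None]
    vel += acc * dt
    pos += vel * dt
    return pos, vel
```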
Approximation Algorithms • O(N) [Appel ‘85] • O(N log N) [Barnes-Hut ‘86] • O(N) Fast Multipole Algorithm (FMA) [Greengard ‘87a] • Approximates particle interactions within a specified accuracy (Zhao, Board, Pringle, ...) • O(N) Adaptive Fast Multipole Algorithm (AFMA) [Greengard ‘87b] • Singh et al., Nyland et al., etc.
The Greengard Algorithm • Two traversals: • upward • downward • 2D: Quad-tree • 3D: Oct-tree
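A minimal sketch of the 2D quad-tree the algorithm traverses; this version splits a cell when it exceeds a particle capacity, which is closer to the adaptive variant (names and the capacity rule are illustrative, not Greengard's exact construction):

```python
class QuadNode:
    """Square cell of a quad-tree; leaves hold at most `capacity` particles."""
    def __init__(self, cx, cy, half, capacity=8):
        self.cx, self.cy, self.half = cx, cy, half   # cell center and half-width
        self.capacity = capacity
        self.particles = []    # (x, y, charge) tuples stored at leaves
        self.children = None   # four sub-cells once the node splits

    def insert(self, particle):
        if self.children is None:
            self.particles.append(particle)
            if len(self.particles) > self.capacity:
                self._split()
        else:
            self._child_for(particle).insert(particle)

    def _split(self):
        h = self.half / 2
        self.children = [QuadNode(self.cx + dx * h, self.cy + dy * h, h, self.capacity)
                         for dx in (-1, 1) for dy in (-1, 1)]
        for p in self.particles:           # push existing particles down
            self._child_for(p).insert(p)
        self.particles = []

    def _child_for(self, p):
        x, y, _ = p                        # quadrant index matches children order
        return self.children[2 * (x >= self.cx) + (y >= self.cy)]
```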
Traversing the Tree Upwards • Computing combined field effects of particles in regions • Multipole expansion [Figure: a well-separated group of particles is replaced by an equivalent particle when computing the field at an evaluation point]
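The multipole expansion referenced here, in the 2D complex-variable form of [Greengard '87]: m charges qi at points zi inside a cell centered at z0 induce, at any well-separated point z, the potential

```latex
\phi(z) = Q\,\log(z - z_0) + \sum_{k=1}^{p} \frac{a_k}{(z - z_0)^k},
\qquad
Q = \sum_{i=1}^{m} q_i,
\qquad
a_k = -\sum_{i=1}^{m} \frac{q_i\,(z_i - z_0)^k}{k}
```

truncated at p terms for the specified accuracy; the upward pass merges children's expansions into their parent's.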
Traversing the Tree Downwards [Figure: expansions are translated from a higher level to a lower level of the tree]
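In the downward pass, the multipole expansions of well-separated cells on the interaction list are converted into a local (Taylor) expansion about the target cell center zc, and each parent translates its accumulated local expansion down to its children:

```latex
\psi(z) = \sum_{l=0}^{p} b_l\,(z - z_c)^l
```

so that each leaf finally evaluates one local expansion plus direct interactions with its near neighbors.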
Implementation • 3D-PFMA, LB [Duke], Fractiling • KSR-1, IBM-SP2, SuperMSPARC • Pthreads, MPI • Uniform, Nonuniform (Gaussian, Corner) • 4 - 64 processors, 1k - 100k particles
3-d Cost: nonuniform (corner) (KSR-1) • Densely packed (50K5) • Lightly packed (50K6) • LB better for 4-16 processors [Plot: cost in seconds vs. number of processors]
Cost vs. Cost Effectiveness • 10k nonuniform corner distribution • Fractiling cost < LB cost < PFMA cost (regardless of the number of processors). • The IDEAL number of processors to use for a cost-effective execution is unknown. • Allocate only Popt processors and leave the rest for other simultaneously executing applications.
Conclusions • Cost effectiveness analysis - a novel approach. • Qualitative and quantitative characteristics. • Optimal effectiveness derived from cost effectiveness curves. • Measurement of Γopt gives the exact number of processors relevant to a particular problem size.
Conclusions (contd.) • Cost effectiveness / Optimal effectiveness: • Quantifies the specific conditions that make a particular algorithm optimal. • Can compare any set of algorithms regardless of the existence of an isoefficiency function. • Γopt shows the point at which using one algorithm becomes more advantageous than using another.
Conclusions (contd.) • Cost effectiveness / Optimal effectiveness: • Allows intelligent allocation of available processors to other applications. • Improves throughput for the entire system. • Captures the impact of, and tradeoffs among, the conditions that dictate performance.