Automatic Performance Diagnosis of Parallel Computations with Compositional Models
Li Li, Allen D. Malony
{lili, malony}@cs.uoregon.edu
Performance Research Laboratory
Department of Computer and Information Science
University of Oregon
Parallel Performance Diagnosis
• Performance tuning process
  • Process to find and fix performance problems
  • Performance diagnosis: detect and explain problems
  • Performance optimization: repair the problems found
• Diagnosis is critical to the efficiency of performance tuning
• Focus on performance diagnosis
  • Capture diagnosis processes
  • Integrate with performance experimentation and evaluation
  • Formalize (expert) performance cause inference
  • Support diagnosis in an automated manner
Generic Performance Diagnosis Process
• Design and run performance experiments
  • Observe performance under specific circumstances
  • Generate the desired performance evaluation data
• Find symptoms
  • Observations that deviate from performance expectations
  • Detect them by evaluating performance metrics
• Infer causes of symptoms
  • Relate symptoms to the program
  • Interpret symptoms at different levels of abstraction
• Iterate the process to refine the performance bug search (a minimal code sketch of this loop follows the list)
  • Refine performance hypotheses based on the symptoms found
  • Generate more data to validate the hypotheses
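The loop above can be summarized in a few lines of Python. This is an illustrative sketch only: diagnose and its helper arguments (run_experiment, evaluate_metrics, infer_causes, refine_experiment) are hypothetical placeholders for the experimentation and inference machinery, not part of any existing tool, and the assumption that a larger metric value means a symptom is made purely for brevity.

def diagnose(run_experiment, evaluate_metrics, infer_causes, refine_experiment,
             initial_experiment, expectations, max_iterations=10):
    """Iterative performance diagnosis: experiment -> symptoms -> causes -> refine."""
    experiment = initial_experiment
    findings = []
    for _ in range(max_iterations):
        data = run_experiment(experiment)              # design and run a performance experiment
        symptoms = [m for m in evaluate_metrics(data)  # observations deviating from expectation
                    if m.value > expectations[m.name]]
        if not symptoms:
            break                                      # nothing left to explain
        causes = infer_causes(symptoms)                # relate symptoms to program-level causes
        findings.extend(causes)
        experiment = refine_experiment(experiment, causes)  # generate more data to validate them
    return findings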
Knowledge-Based Automatic Performance Diagnosis
• Experts analyze systematically and use experience
  • Implicitly use knowledge of code structure and parallelism
  • Are guided by this knowledge in conducting diagnostic analysis
• Knowledge-based approach
  • Capture knowledge about performance problems
  • Capture knowledge about how to detect and explain them
  • Apply the knowledge to performance diagnosis
• Performance knowledge (one possible encoding is sketched after this list)
  • Experiment design and specifications
  • Performance models
  • Performance metrics and evaluation rules
  • High-level performance factors / design parameters (causes)
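As an illustration only, the fragment below sketches one way the four kinds of performance knowledge listed above could be organized per model; the class fields and the master-worker example rule are assumptions made for this sketch, not the actual Hercule knowledge-base format.

from dataclasses import dataclass, field

@dataclass
class ModelKnowledge:
    name: str                                          # computational model, e.g. "master-worker"
    experiments: list = field(default_factory=list)    # experiment design and specifications
    metrics: dict = field(default_factory=dict)        # metric name -> definition
    rules: list = field(default_factory=list)          # (condition, inferred cause) evaluation rules
    factors: list = field(default_factory=list)        # high-level performance factors (causes)

# Hypothetical entry for the master-worker model, for illustration only.
master_worker = ModelKnowledge(
    name="master-worker",
    metrics={"worker_idle_ratio": "worker idle time / total worker time"},
    rules=[("worker_idle_ratio > 0.2",
            "master bottleneck or task granularity too fine")],
    factors=["task granularity", "number of workers", "task scheduling strategy"],
)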
Implications
• Where does the knowledge come from?
  • Extract it from parallel computational models
  • Structural and operational characteristics
  • Reusable parallel design patterns
  • Associate computational models with performance models
  • Well-defined computation and communication patterns
• Model examples
  • Single models: master-worker, pipeline, AMR, ...
  • Compositional models
• Use model knowledge to diagnose performance problems
  • Engineer model knowledge
  • Integrate model knowledge with cause inference
Model-based Generic Knowledge Generation and Algorithm-specific Knowledge Extension (diagram)
• Generic knowledge generation proceeds through behavioral modeling (abstract events, e.g. event1, event2), performance modeling (performance composition and coupling descriptions), metrics definition (model-based metrics), and inference modeling (metric-driven performance bug search and cause inference, backed by a performance factor library)
• Each stage is extended and instantiated with algorithm-specific knowledge: algorithm-specific events, algorithmic performance modeling, algorithm-specific metrics, and algorithm-specific factors
Hercule Automatic Performance Diagnosis System (diagram)
• Knowledge from computational models, extended with algorithm-specific information, populates the Hercule knowledge base (experiment specifications, inference rules)
• A measurement system collects performance data from the parallel program; the event recognizer, metric evaluator, and inference engine use the knowledge base to turn this data into diagnosis results and explanations of problems
• Goals of automation, adaptability, extension, and reuse
Single Model Knowledge Engineered
• Master-worker
• Divide-and-conquer
• Wavefront (2D pipeline)
• Adaptive Mesh Refinement
• Parallel Recursive Tree
• Geometric Decomposition
• Related publications
  • L. Li and A. D. Malony, "Model-based Performance Diagnosis of Master-worker Parallel Computations", in Proceedings of Euro-Par 2006.
  • L. Li, A. D. Malony, and K. Huck, "Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations", in Proceedings of HPCC 2006.
  • L. Li and A. D. Malony, "Knowledge Engineering for Automatic Parallel Performance Diagnosis", to appear in Concurrency and Computation: Practice and Experience.
Characteristics of Model Composition
• Compositional model
  • Combine two or more models
  • Interaction changes the behavior of the individual models
  • The composition pattern affects performance
• Model abstraction for describing composition
  • Computational component set {C1, C2, ..., Ck}
  • Relative control order F(C1, C2, ..., Ck)
  • Integrate component sets in a compositional model
• Composition forms
  • Model nesting
  • Model restructuring
  • Different implications for performance knowledge engineering
Model Nesting
• Formal representation (a small code sketch follows this list)
  • Two models: a root F(C1, C2, ..., Ck) and a child G(D1, D2, ..., Dl)
  • F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) = F(C1{G(D1, D2, ..., Dl)}, C2{G(D1, D2, ..., Dl)}, ..., Ck{G(D1, D2, ..., Dl)})
  • where Ci{G(D1, D2, ..., Dl)} means that component Ci implements the G model
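A minimal sketch of this nesting rule, assuming the simple component-set abstraction from the previous slide; the Model and Component classes and the nest function are illustrative names for this sketch, not notation used by Hercule.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    name: str          # control order F or G
    components: list   # [C1, ..., Ck] or [D1, ..., Dl]

@dataclass
class Component:
    name: str
    implemented_by: Optional[Model] = None   # Ci{G(...)}: Ci implements the child model G

def nest(root: Model, child: Model) -> Model:
    """F(C1, ..., Ck) + G(D1, ..., Dl) = F(C1{G}, ..., Ck{G})."""
    return Model(root.name,
                 [Component(c.name, implemented_by=child) for c in root.components])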
Model Nesting (contd.)
• Examples
  • Iterative, multi-phase applications
  • FLASH, developed by the DOE-supported ASC/Alliances Center for Astrophysical Thermonuclear Flashes
• Implications for performance diagnosis
  • The hierarchical model structure dictates the analysis order
  • Refine problem discovery from root to child
  • Preserve the performance features of the individual models
Model Restructuring
• Formal representation (sketched in code after this list)
  • Two models: F(C1, C2, ..., Ck) and G(D1, D2, ..., Dl)
  • F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) = H(({C1F, ..., CkF} | {D1G, ..., DlG})+)
  • where {C1F, ..., CkF} | {D1G, ..., DlG} selects a component CiF or DjG while preserving the relative component order within F and within G, and H is the new control function ruling all components
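A sketch of the restructuring rule under the assumption that the new control function H can be modeled as a choice over the two remaining component sequences; restructure and choose_next are illustrative names introduced here, not part of the original formalism.

def restructure(f_components, g_components, choose_next):
    """H(({C1F, ..., CkF} | {D1G, ..., DlG})+): interleave two ordered component lists.

    choose_next(f_rest, g_rest) stands in for H and returns "F" or "G";
    the relative order inside each model is preserved.
    """
    merged, i, j = [], 0, 0
    while i < len(f_components) or j < len(g_components):
        if i == len(f_components):
            pick = "G"
        elif j == len(g_components):
            pick = "F"
        else:
            pick = choose_next(f_components[i:], g_components[j:])
        if pick == "F":
            merged.append(f_components[i]); i += 1
        else:
            merged.append(g_components[j]); j += 1
    return merged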
Adapt Performance Knowledge to Composition
• Objective: discover and interpret performance effects caused by model interaction
• Model nesting
  • Behavioral modeling
    • Derive F(C1, C2, ..., Ck) from single-model behaviors
    • Replace affected root components with child-model behaviors
  • Performance modeling and metric formulation
    • Unite overhead categories according to the nesting hierarchy
    • Evaluate overheads according to the model hierarchy
  • Inference modeling (the tree extension is sketched after this list)
    • Represent the inference process as an inference tree
    • Merge the inference steps of the participating models
    • Extend root-model inferences with those of the implementing child model
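The tree extension mentioned under inference modeling might look like the following sketch: the root model's inference tree is walked, and at nodes that a child-model subtree helps explain, that subtree is grafted on. InferenceNode and the relevance map are assumptions made for illustration, not the actual Hercule data structures.

from dataclasses import dataclass, field

@dataclass
class InferenceNode:
    label: str                                    # symptom, intermediate observation, or factor
    children: list = field(default_factory=list)

def extend_inference_tree(root, child_subtrees, relevant):
    """Graft child-model subtrees under the root-model nodes they refine.

    child_subtrees: label -> InferenceNode of the child model (e.g. PRT)
    relevant: root node label -> labels of child subtrees that refine it
    """
    for node in list(root.children):              # walk the original root-model tree
        extend_inference_tree(node, child_subtrees, relevant)
    for label in relevant.get(root.label, []):    # then graft the relevant child subtrees
        if label in child_subtrees:
            root.children.append(child_subtrees[label])
    return root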
Model Nesting Case Study - FLASH
• FLASH
  • Parallel simulations in astrophysical hydrodynamics
  • Use Adaptive Mesh Refinement (AMR) to manage meshes
  • Use a Parallel Recursive Tree (PRT) to manage mesh data
• Model nesting
  • Root: AMR model
  • Child: PRT model
  • AMR implements PRT data operations
Single Model Characteristics
• AMR operations
  • AMR_Refinement – refine a mesh grid
  • AMR_Derefinement – coarsen a mesh grid
  • AMR_LoadBalancing – even out the workload after refinement or derefinement
  • AMR_Guardcell – update the guard cells at the boundary of every grid block with data from its neighbors
  • AMR_Prolong – prolong the solution to newly created leaf blocks after refinement
  • AMR_Restrict – restrict the solution up the block tree after derefinement
  • AMR_MeshRedistribution – redistribute meshes when balancing the workload
• PRT operations
  • PRT_comm_to_parent – communicate with the parent processor
  • PRT_comm_to_child – communicate with a child processor
  • PRT_comm_to_sibling – communicate with a sibling processor
  • PRT_build_tree – initialize the tree structure, or migrate part of the tree to another processor and rebuild the connections
AMR Inference Tree (diagram)
• Legend: symptoms, intermediate observations, performance factors, inference direction
• The symptom low speedup is decomposed into computation, communication, and other-communication costs; AMR communication is further split into AMR_Guardcell, AMR_Restrict, AMR_Refine, AMR_Derefine, AMR_Prolong, and AMR_Workbalance, and each of these is traced to lower-level operations and performance factors such as refine_freq., guardcell_size, refine_levels, leaf_restrict, data_fetch, parent_prolong, check_neighbor, check_refine, inform_neighbor, migrate_blocks, calculate_blocks_weight, sort_blocks, rebuild_block_connection, physical_block_contiguity, data_contiguity_in_cache, block_weight_assign_method, and balance_strategy
PRT Inference Tree (diagram)
• Legend: symptoms, intermediate observations, performance factors, inference direction; the subtrees are numbered 1–5 for later cross-reference
• The symptom low speedup is decomposed into computation, communication, and other-communication costs; PRT communication is further split into parent_comm, sibling_comm, child_comm, build_tree, and rebuild_tree (covering lookup_parent, lookup_sib., lookup_child, link_parent, link_sib., link_child, data_transfer, init._tree, migrate_subtree, and rebuild_connection), and these are traced to performance factors such as tree_node_contiguity, tree_depth, data_contiguity, fetch_freq., freq., and migrate_strategy
FLASH Inference Tree (diagram)
• The AMR inference tree, extended for the nesting with PRT: a node marked "A" indicates that the performance problem search is refined by following the PRT subtrees relevant to A, with the numbers identifying the corresponding subtrees (1–5) of the PRT inference tree
• The AMR observations and factors (AMR_Guardcell, AMR_Refine, AMR_Workbalance, refine_freq., guardcell_size, refine_levels, leaf_restrict, data_fetch, parent_prolong, check_neighbor, check_refine, inform_neighbor, physical_block_contiguity, data_contiguity_in_cache, block_weight_assign_method, balance_strategy, migrate_blocks, calculate_blocks_weight, sort_blocks, rebuild_block_connection) are annotated with PRT subtree numbers such as 1,3 / 1,2,3 / 3 / 4 / 5
• A sketch of how such a combined tree can drive the search follows
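As an illustration of how a combined inference tree can drive the level-by-level search shown in the diagnosis output later in the deck, the sketch below walks from a symptom toward performance factors by repeatedly following the child node with the largest measured contribution. It reuses the InferenceNode shape from the earlier sketch; measure_contributions is a hypothetical placeholder for the experiment and metric-evaluation step, and the 30% threshold is an arbitrary example value.

def trace_symptom(node, measure_contributions, threshold=0.3):
    """Follow the dominant contributors from a symptom down to performance factors."""
    path = [node.label]
    while node.children:
        shares = measure_contributions(node)     # e.g. fraction of cost attributed to each child
        dominant = max(node.children, key=lambda c: shares.get(c.label, 0.0))
        if shares.get(dominant.label, 0.0) < threshold:
            break                                # no single child explains enough of the cost
        path.append(dominant.label)
        node = dominant
    return path                                  # e.g. ["low speedup", "communication", ...]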
Experiment with FLASH v3.0
• Sedov explosion simulation in FLASH3
• Test platform
  • IBM pSeries 690 SMP cluster with 8 processors
• Execution profiles of a problematic run (ParaProf view)
Diagnosis Results Output (Steps 1 & 2)
• Step 1: find the performance symptom
• Step 2: look at root (AMR) model performance

Begin diagnosing ...
========================================================
Begin diagnosing AMR program ...
...
Level 1 experiment -- collect performance profiles with respect to
computation and communication.
________________________________________________________________
do experiment 1...
...
Communication accounts for 80.70% of run time.
Communication cost of the run degrades performance.
========================================================

========================================================
Level 2 experiment -- collect performance profiles with respect to AMR
refine, derefine, guardcell-fill, prolong, and workload-balance.
________________________________________________________________
do experiment 2...
...
Processes spent 4.35% of communication time in checking refinement,
2.22% in refinement, 13.83% in checking derefinement (coarsening),
1.43% in derefinement, 49.44% in guardcell filling, 3.44% in
prolongating data, 9.43% in dealing with work balancing,
========================================================
Step 3: Interpret Expensive guardcell_filling with PRT Performance
====================================================================
Level 3 experiment for diagnosing grid guardcell-filling related problems --
collect performance event traces with respect to restriction, intra-level,
and inter-level communication associated with the grid block tree.
____________________________________________________________________
do experiment 3...
...
[AMR model performance]
Among the guardcell-filling communication, 53.01% is spent restricting the
solution up the block tree, 8.27% is spent in building tree connections
required by guardcell-filling (updating the neighbor list in terms of Morton
order), and 38.71% in transferring guardcell data among grid blocks.
____________________________________________________________________
[PRT operation performance in AMR_Restrict]
The restriction communication time consists of 94.77% in transferring
physical data among grid blocks and 5.23% in building tree connections.
Among the restriction communication, 92.26% is spent in collective
communications. Looking at the performance of data transfer in restrictions
from the PRT perspective, remote fetch parent data comprises 0.0%, remote
fetch sibling comprises 0.0%, and remote fetch child comprises 100%.
Improving block contiguity at the inter-level of the PRT will reduce
restriction data communication.
____________________________________________________________________
[PRT operation performance in transferring guardcell data]
Among the guardcell data transfer, 65.78% is spent in collective
communications. Looking at the performance of guardcell data transfer from
the PRT perspective, remote fetch parent data comprises 3.42%, remote fetch
sibling comprises 85.93%, and remote fetch child comprises 10.64%.
Improving block contiguity at the intra-level of the PRT will reduce
guardcell data communication.
====================================================================
Conclusion and Future Directions
• Model-based performance diagnosis approach
  • Provide performance feedback at a high level of abstraction
  • Support automatic problem discovery and interpretation
  • Enable novice programmers to use established expertise
• Compositional model diagnosis
  • Adapt the knowledge engineering approach to model integration
  • Disentangle cross-model performance effects
  • Enhance the applicability of the model-based approach
• Future directions
  • Automate performance knowledge adaptation
    • Algorithmic knowledge, compositional model knowledge
  • Incorporate a system utilization model
    • Reveal the interplay between programming model and system utilization
    • Explain performance with the model-system relationship