Parallel Performance Wizard: Introduction

Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant
Mr. Max Billingsley, Research Assistant
Mr. Josh Hartman, Undergraduate Volunteer

HCS Research Laboratory
University of Florida
Outline
• Motivations and Objectives
• Background
• Framework & Key Features
• Phase II Tasks & Schedules
• Today’s Schedule
Motivations and Objectives
• Motivations
  • UPC/SHMEM programs often do not yield the expected performance. Why?
  • Due to the complexity of parallel computing, the cause is difficult to determine without tools for performance analysis and optimization
  • Discouraging for users, new & old; few tool options exist for shared-memory computing in the UPC and SHMEM communities
• Objectives
  • Research topics relating to performance analysis
  • Develop a framework for a performance analysis tool
  • Design with both performance and user productivity in mind
  • Develop a performance analysis tool for UPC and SHMEM
Need for Performance Analysis
• Performance analysis of sequential applications can be challenging
• Performance analysis of explicitly communicating parallel applications is significantly more difficult
  • Mainly due to the increase in the number of processing nodes
• Performance analysis of implicitly communicating parallel applications is even more difficult
  • Non-blocking, one-sided communication is tricky to track and analyze accurately
Background - SHMEM
• SHared MEMory library
  • Based on the SPMD model
  • Available for C / Fortran
  • Available for servers and clusters
  • Easier to program than MPI
• Hybrid programming model
  • Traits of message passing
    • Explicit communication, replication, and synchronization
    • Need to give the remote data location (processing element ID)
  • Traits of shared memory
    • Provides a logically shared memory system view
• Non-blocking, one-sided communication
  • Lower-latency, higher-bandwidth communication
• PSHMEM available for some implementations
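To make the one-sided, explicitly addressed style concrete, here is a minimal SHMEM sketch (not from the slides). It is written in the classic SGI SHMEM style of this era; the header name (`<mpp/shmem.h>` vs. `<shmem.h>`) and the initialization call vary by implementation, and the program must be built and launched with a SHMEM-aware toolchain.

```c
/* Sketch only: requires a SHMEM library and parallel launcher. */
#include <mpp/shmem.h>
#include <stdio.h>

int main(void)
{
    static int src, dst;          /* symmetric (remotely accessible) data */

    start_pes(0);                 /* initialization; name varies by library */
    int me   = _my_pe();
    int npes = _num_pes();

    src = me;
    /* One-sided put: write src into dst on the next PE.
       No matching receive call is needed on the target. */
    shmem_int_put(&dst, &src, 1, (me + 1) % npes);
    shmem_barrier_all();          /* explicit synchronization */

    printf("PE %d received %d\n", me, dst);
    return 0;
}
```

Note how the caller must name the target processing element explicitly (a message-passing trait) while simply reading and writing symmetric variables (a shared-memory trait).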
Background - UPC • Unified Parallel C (UPC) • Partitioned GAS parallel programming language • Common and familiar syntax and semantics for parallel C with simple extensions to ANSI C • Many implementations • Open source: Berkeley UPC, Michigan UPC, GCC-UPC • Proprietary: HP-UPC, Cray-UPC • Easier to program than MPI, software more scalable • With hand-tuning, UPC performance compares favorably with MPI
Background – Performance Analysis
• Three general performance analysis approaches
  • Analytical modeling
    • Mostly predictive methods
    • Can also be used in conjunction with experimental performance measurement
    • Pros: easy to use, fast, can be performed without running the program
    • Cons: usually not very accurate
  • Simulation
    • Pros: allows performance estimation of a program on various system architectures
    • Cons: slow; not generally practical for regular UPC/SHMEM users
  • Experimental performance measurement
    • Strategy used by most modern performance analysis tools (PATs)
    • Uses actual event measurement to perform analysis
    • Pros: most accurate
    • Cons: can be time-consuming
Background - Experimental Performance Measurement Stages
• Instrumentation – user-assisted or automatic insertion of instrumentation code
• Measurement – the actual measuring stage
• Analysis – data analysis toward bottleneck detection & resolution
• Presentation – display of analyzed data to the user; deals directly with the user
• Optimization – process of finding and resolving bottlenecks
Key Features
• Semi-automatic source-level instrumentation as the default
  • Only the P module and part of the I module are visible to the user
• PAPI will be used
• Tracing mode as the default, with profiling support
• Post-mortem data processing and analysis
  • Analyses: load balancing, scalability, memory system
  • Visualizations: timeline display, speedup chart, call-tree graph, communication volume graph, memory access graph, profiling table
Discussion Topic: Target Platforms
Our current platform list; changes needed?
• Open
  • Quadrics SHMEM on Opterons + RHEL4 (qsnet)
  • Berkeley UPC on Opterons + RHEL4 (iba)
• Proprietary
  • Cray UPC on X1E (src. inst.)
  • Cray SHMEM on X1E
Today’s Schedule
09:00 – 09:30 AM  Project overview
09:30 – 10:15 AM  Instrumentation (I) module presentation
10:15 – 10:30 AM  BREAK
10:30 – 11:15 AM  Measurement (M) module presentation
11:15 – 11:45 AM  I & M module demo
11:45 – 13:00 PM  LUNCH
13:00 – 13:45 PM  Analysis (A) module presentation
13:45 – 14:00 PM  A-module demo
14:00 – 14:45 PM  Presentation (P) module presentation
14:45 – 15:00 PM  P-module demo
15:00 – 15:15 PM  BREAK
15:15 – 16:00 PM  Wrap-up & planning discussion