
Incorporating Performance Tracking and Regression Testing Into GenIDLEST Using POINT Tools


Presentation Transcript


  1. Incorporating Performance Tracking and Regression Testing Into GenIDLEST Using POINT Tools

  2. GenIDLEST
  • Generalized Incompressible Direct and Large Eddy Simulation of Turbulence
  • General-purpose turbulent fluid flow and heat transfer solver
  • In development over a period of 15 years
  • Besides turbulent flow, has capabilities for two-phase modeling and fluid-structure interaction
  • Library of 325 subroutines and ~145,000 lines of code (Fortran 90)
  • Mixed-mode (hybrid MPI/OpenMP) parallelism

  3. Projects Using GenIDLEST
  • The Adaptive Code Kitchen & Flexible Tools for Dynamic Application Composition (NSF)
  • Syngas Particulate Deposition and Erosion at the Leading Edge of a Turbine Blade with Film Cooling (DOE-NETL)
  • Fluid-Structure Interaction in Arterial Blood Flow (ICTAS-VT)
  • Evaluation of Engine Scale Combustor Liner Heat Transfer for Can and Annular Combustor (Solar Turbines)
  • Investigation of Air-side Fouling in Exhaust Gas Recirculators (Modine-US Army)
  • Extreme OpenMP: A Productive Programming Model for High End Computing (NSF)
  • Unsteady Aerodynamics of Flapping Wings at Re=10,000-100,000 for Micro-Air Vehicles (Army Research Office)
  • Sand Ingestion and Deposition in Internal and External Cooling Configurations in the Turbine Flow Path (Pratt & Whitney)
  • Advanced Fuel Gasification: A Coupled Experimental Computational Program (ICTAS-VT)

  4. Initial Motivation
  • August 2009: the Tafti group contacted NCSA's Advanced Application Support group regarding a severe slowdown in performance on the Altix system ("Cobalt") since 5/2008
  • Code execution times were reported for a ribbed duct CFD problem

  5. What Had Changed?
  • NCSA had upgraded the Altix system software from ProPack 5 to ProPack 6 in August '08
    • Substantial software changes: kernel, compiler, glibc, MKL
  • The Intel compiler versions used had possibly changed
    • However, the compiler used at NCSA was not noted by VaTech, so it was unknown
    • The NASA compiler versions were identified only as 9.1 and 10.1
  • New hardware had arrived at NCSA
    • Cobalt was now a "hybrid", with older (Madison) IA-64 processors on co-compute1 and co-compute2, and newer (Montvale, multicore) processors on co-compute3

  6. Puzzle: Great Variability Between Runs at a Single Site (NCSA) in the Same Week
  • What? These are the two groups of runs labeled "OMP-1" and "OMP-2" in the comparison table on the prior slide
  • Execution times on the same processor counts: 370.4 vs. 673.8 s (8 procs), 383.7 vs. 696 s (16), 404 vs. 721.9 s (32)
  • Requested and received the PBS job IDs for these runs
  • Examining the job histories revealed that the slower times had run on the Madison (older) system, while the faster times were obtained on the Montvale (multicore) system
  • The magnitude of the difference was surprising, though…
  • …but either way, the code was still in the intensive care unit

  7. First Step: Thread Placement
  • Initial hunch: threads were not being placed optimally
    • With respect to each other
    • With respect to "their data"
  • System tools used to launch/examine at runtime: "dplace", "omplace", "ps", "top", and "dlook" (an SGI tool for examining data placement across nodes)
  • Verdict: all threads were running on the same core!
  • Why?

  8. From the man page for SGI's "omplace"

     "omplace does not pin OpenMP threads properly when used with the Intel OpenMP library built on Feb 15 2008. The build date of the OpenMP library will be printed at run time if KMP_VERSION is set to 1. Note that this version of the OpenMP library was shipped with Intel compiler package versions 10.1.015 and 10.1.017. This library is incompatible with dplace and omplace because it introduces CPU affinity functionality without the ability to disable it."

  • … so the tools (compiler and placement utility) were fighting each other
  • Solution? Don't use that compiler version!
  • Moved to 11.1.038
    • This was not the default at NCSA (and still isn't)
  • Runtime improved (8 cores) from 673.8 secs to 152.8 secs (~4.4x speedup)

  9. Next Step
  • While the compiler change resulted in a substantial improvement over the initial version, the code was still ~2.5x slower than the 2008 timings
  • An obvious (but often overlooked) question: Have there been any changes to the code in the interim?
  • Answer: many changes, but a belief that none would impact performance…
  • Time to bring in performance tools…

  10. PerfSuite
  • Entry-level, easy-to-use performance toolset developed at NCSA [1]
  • Currently funded by the NSF SDCI (Software Development for Cyberinfrastructure) program as part of the "POINT" project
    • Collaborators: NCSA, Oregon, Tennessee, and PSC
  • Two primary modes of operation for measurement:
    • "counting mode", which counts overall occurrences of one or more hardware performance events (e.g., retired instructions, cache misses, branching, …)
    • "profiling mode", which produces a statistical profile of an application triggered by hardware event overflows (a generalization of "gprof")

  [1] Kufrin, R. PerfSuite: An Accessible, Open Source Performance Analysis Environment for Linux. 6th International Conference on Linux Clusters: The HPC Revolution 2005. Chapel Hill, NC, April 2005.

  11. Profiling With PerfSuite
  • When all is working properly, the PerfSuite tool "psrun" can be used with unmodified applications to obtain profiles, which is attractive to typical users
  • An XML "configuration file" is used to specify what to measure and how (see the sketch after this list)
  • Unfortunately, psrun was not functional after the Altix upgrade to ProPack 6
    • Currently investigating SGI-supplied/developed patches to PerfSuite that address ProPack-related problems (SGI releases "SGI/PerfSuite" through their online SupportFolio)
    • Alternate approach: use the PerfSuite API rather than the tools, as a workaround for the lack of psrun
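  For reference, a counting-mode configuration file is sketched below. This follows the ps_hwpc_eventlist format as best recalled from the PerfSuite documentation; the particular events chosen are illustrative assumptions, and the examples shipped with PerfSuite should be consulted for the exact schema (profiling mode uses a variant that additionally carries a sampling threshold):

     <?xml version="1.0" encoding="UTF-8" ?>
     <!-- Illustrative psrun configuration: count cycles and L2 cache misses -->
     <ps_hwpc_eventlist class="PAPI">
        <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" />
        <ps_hwpc_event type="preset" name="PAPI_L2_TCM" />
     </ps_hwpc_eventlist>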

  12. PerfSuite Performance API

  C / C++:
     ps_hwpc_init (void)
     ps_hwpc_start (void)
     ps_hwpc_read (long long *values)
     ps_hwpc_suspend (void)
     ps_hwpc_stop (char *prefix)
     ps_hwpc_shutdown (void)

  Fortran:
     call psf_hwpc_init (ierr)
     call psf_hwpc_start (ierr)
     call psf_hwpc_read (integer*8 values, ierr)
     call psf_hwpc_suspend (ierr)
     call psf_hwpc_stop (prefix, ierr)
     call psf_hwpc_shutdown (ierr)

  • Call "init" once; call "start", "read", and "suspend" as many times as you like
  • Call "stop" (supplying a file name prefix of your choice) to write the performance data results to an XML document
  • Optionally, call "shutdown"
  • Additional routines ps_hwpc_numevents() and ps_hwpc_eventnames() allow querying the current configuration

  13. FORTRAN API Example

      include 'fperfsuite.h'

      call PSF_hwpc_init(ierr)

c$omp parallel private(ierr)
      call PSF_hwpc_start(ierr)
c$omp end parallel

      do j = 1, n
         do i = 1, m
            do k = 1, l
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do

c$omp parallel private(ierr)
      call PSF_hwpc_stop('perf', ierr)
c$omp end parallel

      call PSF_hwpc_shutdown(ierr)

  • The "ierr" argument to PerfSuite routines should be tested for error conditions (omitted here for brevity)
  • In the multithreaded case (e.g., OpenMP), each thread must make the calls individually
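  To round out the API from the previous slide, a minimal sketch of the remaining counting-mode calls is shown below, suspending and resuming measurement around an uninstrumented region. The array size of 16 is an illustrative assumption (one slot is needed per configured event; in C, ps_hwpc_numevents() can be queried), and error checks on ierr are again omitted:

      include 'fperfsuite.h'
      integer ierr
      integer*8 values(16)

      call psf_hwpc_init(ierr)
      call psf_hwpc_start(ierr)
c     ... instrumented phase 1 ...
      call psf_hwpc_read(values, ierr)
c     values now holds the counts accumulated so far
      call psf_hwpc_suspend(ierr)
c     ... uninstrumented work (e.g., I/O) is not counted ...
      call psf_hwpc_start(ierr)
c     ... instrumented phase 2 ...
      call psf_hwpc_stop('phase', ierr)
      call psf_hwpc_shutdown(ierr)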

  14. Sample Profile Excerpt

     PerfSuite Hardware Performance Summary Report
     Version       : 1.0
     Created       : Fri Sep 04 04:21:30 PM CDT 2009
     Generator     : psprocess 0.4
     XML Source    : genidlest.0.168652.co-compute1.xml

     Profile Information
     ==========================================================================
     Class         : PAPI
     Version       : 3.6.2
     Event         : PAPI_TOT_CYC (Total cycles)
     Period        : 14500000
     Samples       : 17724
     Domain        : all
     Run Time      : 162.59 (seconds)
     Min Self %    : (all)

     Function:File:Line Summary
     --------------------------------------------------------------------------
     Samples   Self %   Total %   Function:File:Line
        5190   29.28%    29.28%   exchange_var:exchange_var.f:119
        1220    6.88%    36.17%   sgs_test_filter:sgs_test_filter.f:54
         723    4.08%    40.24%   sum_domain:sum_domain.f:55
         649    3.66%    43.91%   mpi_sendbuf:mpi_sendbuf.f:137

  15. PerfSuite Profile Displayed with TAU's "ParaProf" Visualization Tool
  • "Stacked" and "unstacked" views allow a quick overview of a parallel run, and showed wide variability between threads in various subroutines within the code
  • This led to suspicion of the data layout among threads (locality)
  • Profiling based on total cycles and processor stalls ("bubbles") isolated the offending routine(s)

  16. Addressing Locality/First-Touch Policy
  • The PerfSuite profiling results pointed out suspect routines. VaTech examined them and found that code changes introduced since 2008 used F90 array-syntax initialization for local arrays; these were changed to a parallel version

  Original version:

      buf2ds = 0.0
      buf2dr = 0.0

  Modified for first-touch policy:

c$omp parallel do private(m)
      do m = 1, m_blk(myproc)
         buf2ds(:,:,:,m) = 0.0
         buf2dr(:,:,:,m) = 0.0
      end do
c$omp end parallel do

  • Under Linux's first-touch policy, a page is placed on the memory node of the thread that first writes it, so the serial initialization had placed these arrays far from most of the threads that later used them
  • After these changes were implemented, the code performance improved by a further factor of nearly 2x and began to approach the target

  17. Runtime Improvements Due To Node Optimizations (8 procs)

  18. Aside: Miscellaneous Items Uncovered During the Optimization Cycle
  • As noted, NCSA's Cobalt is a hybrid system (a mix of Madison/Montvale Itanium 2)
  • These CPUs also differ in the number of available PMUs (performance monitoring units): Madison has 4, Montvale has 12
  • The PAPI 3.x library decides at build time how much space to allocate for these registers; as a result, separate library builds would be necessary to work on both machines
  • After reporting/discussing this with the PAPI team, work was done by Haihang You and Dan Terpstra to address the deficiency. The mods to PAPI were not released in version 3.x, but were made generally available in the recently released PAPI 4.x (PAPI-C). This benefits the community as a whole!

  19. Miscellaneous Items Uncovered (cont'd)
  • VaTech noted an unusual discrepancy between the internal time (collected with MPI_Wtime()) and the PerfSuite-reported wall-clock time. We realized that the ordering of the two mattered, especially in multithreaded (OpenMP) runs:

  Ordering 1 (PerfSuite measurement inside the MPI_Wtime window):

     start = MPI_Wtime()
     start PerfSuite measurement
     compute
     stop PerfSuite measurement
     end = MPI_Wtime()
     MPI_time = end - start

  Ordering 2 (MPI_Wtime window inside the PerfSuite measurement):

     start PerfSuite measurement
     start = MPI_Wtime()
     compute
     end = MPI_Wtime()
     stop PerfSuite measurement
     MPI_time = end - start

  • Reason? Output in "stop" is serialized among threads (to minimize filesystem contention), so the first ordering includes that serialized output in the MPI-reported time
  • Comparisons of MPI/PerfSuite times are important to validate results and provide a sanity check
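  A minimal Fortran rendering of the second ordering above, which keeps the serialized output performed by psf_hwpc_stop outside the MPI_Wtime window (a sketch; init/shutdown calls and the computation itself are elided):

      include 'mpif.h'
      include 'fperfsuite.h'
      integer ierr
      double precision t0, t1, mpi_time

      call psf_hwpc_start(ierr)
      t0 = MPI_Wtime()
c     ... region of interest ...
      t1 = MPI_Wtime()
      call psf_hwpc_stop('perf', ierr)
c     the serialized file output above is excluded from mpi_time
      mpi_time = t1 - t0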

  20. Miscellaneous Items Uncovered (cont'd)
  • For runs at high processor counts, the nature of the measurement can impact the system. These issues were known since the NCSA Altix was installed:
    • PAPI "multiplexing" is achieved through regular interrupts
    • Profiling through statistical sampling also generates regular interrupts
    • A single system image with a large number of processors must deal with these interrupts being generated concurrently, and can become overwhelmed
  • We have adjusted the default interrupt frequency on Cobalt when using PerfSuite to help address this
  • Linux kernel developers have modified the relevant code for scalability; the changes are implemented in the upcoming Altix UV system and its associated software

  21. A Final Hurdle with Higher Processor-Count Jobs
  • With the initial optimizations/changes implemented using smaller (8-32 processor) jobs, moving to the largest run (256 procs) gave:

     MPI: mmap failed (memmap_base) for 8068448256 pages (132193456226304 bytes) on 256 ranks
     Attempted byte sizes of regions:
        static+heap 0x76c97d4000
        symheap     0x0
        stack       0x17132c000
        mpibuf      0x0
     MPI: daemon terminated: co-compute2 - job aborting

  • Initial suspicion was that the memory reserved for PerfSuite profiling (sample) buffers might have excessive requirements
  • Internal memory-usage tracking showed ~70 MB/thread, so this was not the source of the problem
  • Setting MPI_MEMMAP_OFF disabled SGI's MPT memory-mapping optimizations and allowed the jobs to complete

  22. First "Legitimate" Timings: OpenMP
  • Initial interpretation at VaTech: superlinear scaling occurring between 32-128 processors on co-compute1, and at 64 processors on co-compute2
  • Why the substantial differences between two "identical" machines at low core counts?

  23. Speedup (Main Timestep Loop)

  24. Machine Comparison
  • Although co-compute1 and co-compute2 are, in many ways, identical, there is an important difference: the amount of memory per node
  • These runs use ~50 GB of memory, more than can be serviced from a single node at low core counts
  • The cost of remote memory access for the lower core-count runs resulted in sublinear scaling there, more pronounced on co-compute1 since each of its nodes has only 1/3 the memory of co-compute2's
  • Additional runs using the "dlook" utility revealed exactly how many pages were allocated across how many nodes

  25. Timings (Main Timestep Loop)
  • At this point, 8 nodes (co-compute2) supply 96 GB

  26. Development/Optimization Observations
  • Many runs were made to arrive at the current "optimized" version:
    • By geographically distinct groups
    • With multiple compiler versions
    • With in-progress code changes
    • With various compiler flags
    • With multiple performance tools
  • It is very easy to get buried in the volume of data
  • Consistency in recording results is critical, but you cannot control how others handle this
  • Hence the need for performance regression testing and tracking

  27. The TAU Portal
  • Web-based access to performance data
  • Supports collaborative performance study
    • Secure performance data sharing
  • Does not require a TAU installation
  • Launches the TAU performance tools (ParaProf, PerfExplorer) via Java WebStart
  • http://tau.nic.uoregon.edu/

  28. TAU Portal Entry Page
  • Access is free of charge
  • Create your account at this page
  • Passwords are required to access/upload data
  • Do not reuse an existing password; security is light

  29. TAU Portal Workspaces
  • The basic unit of organization for performance experiments
  • Can be shared between users
  • Each experiment is initially shown as "metadata"
  • ParaProf can be launched directly from a workspace

  30. Example Basic ParaProf Display
  • Bar-chart displays of profiles are a commonly used display technique (we showed one earlier with PerfSuite data)
  • All experiment trials previously uploaded to the portal are accessible and can be viewed and compared off the portal

  31. Using the TAU Portal from a Batch Job
  • It is extremely easy to incorporate uploading of data to the portal from a batch job through command-line utilities
  • For TAU-generated profiles:

     paraprof --pack myprof.ppk profile.*

  • For PerfSuite-generated data:

     paraprof --pack myprof.ppk *.xml

  • This results in a "packed" data file; to upload it:

     tau_portal.py up -u name -p pw -w wkspace -e exp packed_data_file

  32. For More Information
  • GenIDLEST: http://www.hpcfd.me.vt.edu/codes.shtml
  • PerfSuite: http://perfsuite.ncsa.uiuc.edu/ and http://perfsuite.sourceforge.net/
  • POINT: http://nic.uoregon.edu/point/
