
Incorporating Performance Tracking and Regression Testing Into GenIDLEST Using POINT Tools


Presentation Transcript


  1. Incorporating Performance Tracking and Regression Testing Into GenIDLEST Using POINT Tools

  2. GenIDLEST
  • Generalized Incompressible Direct and Large Eddy Simulation of Turbulence
  • General-purpose turbulent fluid flow and heat transfer solver
  • In development over a period of 15 years
  • Besides turbulent flow, has capabilities for two-phase modeling and fluid-structure interaction
  • Library of 325 subroutines and ~145,000 lines of code (Fortran 90)
  • Mixed-mode (hybrid MPI/OpenMP) parallelism

  3. Projects Using GenIDLEST
  • The Adaptive Code Kitchen & Flexible Tools for Dynamic Application Composition (NSF)
  • Syngas Particulate Deposition and Erosion at the Leading Edge of a Turbine Blade with Film Cooling (DOE-NETL)
  • Fluid-Structure Interaction in Arterial Blood Flow (ICTAS-VT)
  • Evaluation of Engine Scale Combustor Liner Heat Transfer for Can and Annular Combustor (Solar Turbines)
  • Investigation of Air-side Fouling in Exhaust Gas Recirculators (Modine-US Army)
  • Extreme OpenMP: A Productive Programming Model for High End Computing (NSF)
  • Unsteady Aerodynamics of Flapping Wings at Re=10,000-100,000 for Micro-Air Vehicles (Army Research Office)
  • Sand Ingestion and Deposition in Internal and External Cooling Configurations in the Turbine Flow Path (Pratt & Whitney)
  • Advanced Fuel Gasification: A Coupled Experimental Computational Program (ICTAS-VT)

  4. Initial Motivation
  • August 2009: the Tafti group contacted NCSA's Advanced Application Support group regarding a severe slowdown in performance on the Altix system ("Cobalt") since 5/2008
  • Code execution times were reported for a ribbed duct CFD problem

  5. What Had Changed?
  • NCSA had upgraded the Altix system software from ProPack 5 to ProPack 6 in August '08
    • Substantial software changes: kernel, compiler, glibc, MKL
  • The Intel compiler versions used had possibly changed
    • However, the compiler used at NCSA was not noted by VaTech, so it was unknown
    • The NASA compiler versions were identified only as 9.1 and 10.1
  • New hardware had arrived at NCSA
    • Cobalt was now a "hybrid", with older (Madison) IA-64 processors on co-compute1 and co-compute2, and newer (Montvale, multicore) processors on co-compute3

  6. Puzzle: Great Variability Between Runs at a Single Site (NCSA) in the Same Week
  • What? These are the two groups of runs labeled "OMP-1" and "OMP-2" in the comparison table on the prior slide
  • Execution times on the same processor counts: 370.4 vs. 673.8 s (8 procs), 383.7 vs. 696 s (16), 404 vs. 721.9 s (32)
  • Requested and received the PBS job IDs for these runs
  • Examining the job histories revealed that the slower times had run on the Madison (older) system, while the faster times were obtained on the Montvale (multicore) system
  • The magnitude of the difference was surprising, though…
  • …but either way, the code was still in the intensive care unit

  7. First Step: Thread Placement
  • Initial hunch: threads were not being placed optimally
    • With respect to each other
    • With respect to "their data"
  • System tools used to launch/examine at runtime: "dplace", "omplace", "ps", "top", and "dlook" (an SGI tool for examining data placement across nodes)
  • Verdict: all threads were running on the same core!
  • Why?

  8. From the man page for SGI's "omplace"

     "omplace does not pin OpenMP threads properly when used with the Intel OpenMP library built on Feb 15 2008. The build date of the OpenMP library will be printed at run time if KMP_VERSION is set to 1. Note that this version of the OpenMP library was shipped with Intel compiler package versions 10.1.015 and 10.1.017. This library is incompatible with dplace and omplace because it introduces CPU affinity functionality without the ability to disable it."

  • … so the tools (compiler and placement utility) were fighting each other
  • Solution? Don't use that compiler version!
  • Moved to 11.1.038
    • This was not the default at NCSA (and still isn't)
  • Runtime improved (8 cores) from 673.8 secs to 152.8 secs (~4.4x speedup)

  9. Next Step
  • While the compiler change resulted in a substantial improvement over the initial version, the code was still ~2.5x slower than the 2008 timings
  • An obvious (but often overlooked) question: Have there been any changes to the code in the interim?
  • Answer: many changes, but a belief that none would impact performance…
  • Time to bring in performance tools…

  10. PerfSuite
  • Entry-level, easy-to-use performance toolset developed at NCSA [1]
  • Currently funded by the NSF SDCI (Software Development for Cyberinfrastructure) program as part of the "POINT" project
    • Collaborators: NCSA, Oregon, Tennessee, and PSC
  • Two primary modes of operation for measurement:
    • "counting mode", which counts overall occurrences of one or more hardware performance events (e.g., retired instructions, cache misses, branching, …)
    • "profiling mode", which produces a statistical profile of an application triggered by hardware event overflows (a generalization of "gprof")

  [1] Kufrin, R. PerfSuite: An Accessible, Open Source Performance Analysis Environment for Linux. 6th International Conference on Linux Clusters: The HPC Revolution 2005. Chapel Hill, NC, April 2005.

  11. Profiling With PerfSuite
  • When all is working properly, the PerfSuite tool "psrun" can be used with unmodified applications to obtain profiles, which is attractive to typical users
  • An XML "configuration file" is used to specify what to measure and how (see the sketch after this list)
  • Unfortunately, psrun was not functional after the Altix upgrade to ProPack 6
    • Currently investigating SGI-supplied/developed patches to PerfSuite that address ProPack-related problems (SGI releases "SGI/PerfSuite" through their online SupportFolio)
    • Alternate approach: use the PerfSuite API rather than the tools, as a workaround for the lack of psrun
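  For reference, a counting-mode configuration file is sketched below. This follows the ps_hwpc_eventlist format as best recalled from the PerfSuite documentation; the particular events chosen are illustrative assumptions, and the examples shipped with PerfSuite should be consulted for the exact schema (profiling mode uses a variant that additionally carries a sampling threshold):

     <?xml version="1.0" encoding="UTF-8" ?>
     <!-- Illustrative psrun configuration: count cycles and L2 cache misses -->
     <ps_hwpc_eventlist class="PAPI">
        <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" />
        <ps_hwpc_event type="preset" name="PAPI_L2_TCM" />
     </ps_hwpc_eventlist>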

  12. PerfSuite Performance API

  C / C++:
     ps_hwpc_init (void)
     ps_hwpc_start (void)
     ps_hwpc_read (long long *values)
     ps_hwpc_suspend (void)
     ps_hwpc_stop (char *prefix)
     ps_hwpc_shutdown (void)

  Fortran:
     call psf_hwpc_init (ierr)
     call psf_hwpc_start (ierr)
     call psf_hwpc_read (integer*8 values, ierr)
     call psf_hwpc_suspend (ierr)
     call psf_hwpc_stop (prefix, ierr)
     call psf_hwpc_shutdown (ierr)

  • Call "init" once; call "start", "read", and "suspend" as many times as you like
  • Call "stop" (supplying a file name prefix of your choice) to write the performance data results to an XML document
  • Optionally, call "shutdown"
  • Additional routines ps_hwpc_numevents() and ps_hwpc_eventnames() allow querying the current configuration

  13. FORTRAN API Example

      include 'fperfsuite.h'

      call PSF_hwpc_init(ierr)

c$omp parallel private(ierr)
      call PSF_hwpc_start(ierr)
c$omp end parallel

      do j = 1, n
         do i = 1, m
            do k = 1, l
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do

c$omp parallel private(ierr)
      call PSF_hwpc_stop('perf', ierr)
c$omp end parallel

      call PSF_hwpc_shutdown(ierr)

  • The "ierr" argument to PerfSuite routines should be tested for error conditions (omitted here for brevity)
  • In the multithreaded case (e.g., OpenMP), each thread must make the calls individually
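  To round out the API from the previous slide, a minimal sketch of the remaining counting-mode calls is shown below, suspending and resuming measurement around an uninstrumented region. The array size of 16 is an illustrative assumption (one slot is needed per configured event; in C, ps_hwpc_numevents() can be queried), and error checks on ierr are again omitted:

      include 'fperfsuite.h'
      integer ierr
      integer*8 values(16)

      call psf_hwpc_init(ierr)
      call psf_hwpc_start(ierr)
c     ... instrumented phase 1 ...
      call psf_hwpc_read(values, ierr)
c     values now holds the counts accumulated so far
      call psf_hwpc_suspend(ierr)
c     ... uninstrumented work (e.g., I/O) is not counted ...
      call psf_hwpc_start(ierr)
c     ... instrumented phase 2 ...
      call psf_hwpc_stop('phase', ierr)
      call psf_hwpc_shutdown(ierr)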

  14. Sample Profile Excerpt

     PerfSuite Hardware Performance Summary Report
     Version       : 1.0
     Created       : Fri Sep 04 04:21:30 PM CDT 2009
     Generator     : psprocess 0.4
     XML Source    : genidlest.0.168652.co-compute1.xml

     Profile Information
     ==========================================================================
     Class         : PAPI
     Version       : 3.6.2
     Event         : PAPI_TOT_CYC (Total cycles)
     Period        : 14500000
     Samples       : 17724
     Domain        : all
     Run Time      : 162.59 (seconds)
     Min Self %    : (all)

     Function:File:Line Summary
     --------------------------------------------------------------------------
     Samples   Self %   Total %   Function:File:Line
        5190   29.28%    29.28%   exchange_var:exchange_var.f:119
        1220    6.88%    36.17%   sgs_test_filter:sgs_test_filter.f:54
         723    4.08%    40.24%   sum_domain:sum_domain.f:55
         649    3.66%    43.91%   mpi_sendbuf:mpi_sendbuf.f:137

  15. PerfSuite Profile Displayed with TAU's "ParaProf" Visualization Tool
  • "Stacked" and "unstacked" views allow a quick overview of a parallel run, and showed wide variability between threads in various subroutines within the code
  • This led to suspicion of the data layout among threads (locality)
  • Profiling based on total cycles and processor stalls ("bubbles") isolated the offending routine(s)

  16. Addressing Locality/First-Touch Policy
  • The PerfSuite profiling results pointed out suspect routines. VaTech examined them and found that code changes introduced since 2008 used F90 array-syntax initialization for local arrays; these were changed to a parallel version

  Original version:

      buf2ds = 0.0
      buf2dr = 0.0

  Modified for first-touch policy:

c$omp parallel do private(m)
      do m = 1, m_blk(myproc)
         buf2ds(:,:,:,m) = 0.0
         buf2dr(:,:,:,m) = 0.0
      end do
c$omp end parallel do

  • Under Linux's first-touch policy, a page is placed on the memory node of the thread that first writes it, so the serial initialization had placed these arrays far from most of the threads that later used them
  • After these changes were implemented, the code performance improved by a further factor of nearly 2x and began to approach the target

  17. Runtime Improvements Due To Node Optimizations (8 procs)

  18. Aside: Miscellaneous Items Uncovered During the Optimization Cycle
  • As noted, NCSA's Cobalt is a hybrid system (a mix of Madison/Montvale Itanium 2)
  • These CPUs also differ in the number of available PMUs (performance monitoring units): Madison has 4, Montvale has 12
  • The PAPI 3.x library decides at build time how much space to allocate for these registers; as a result, separate library builds would be necessary to work on both machines
  • After reporting/discussing this with the PAPI team, work was done by Haihang You and Dan Terpstra to address the deficiency. The mods to PAPI were not released in version 3.x, but were made generally available in the recently released PAPI 4.x (PAPI-C). This benefits the community as a whole!

  19. Miscellaneous Items Uncovered (cont'd)
  • VaTech noted an unusual discrepancy between the internal time (collected with MPI_Wtime()) and the PerfSuite-reported wall-clock time. We realized that the ordering of the two mattered, especially in multithreaded (OpenMP) runs:

  Ordering 1 (PerfSuite measurement inside the MPI_Wtime window):

     start = MPI_Wtime()
     start PerfSuite measurement
     compute
     stop PerfSuite measurement
     end = MPI_Wtime()
     MPI_time = end - start

  Ordering 2 (MPI_Wtime window inside the PerfSuite measurement):

     start PerfSuite measurement
     start = MPI_Wtime()
     compute
     end = MPI_Wtime()
     stop PerfSuite measurement
     MPI_time = end - start

  • Reason? Output in "stop" is serialized among threads (to minimize filesystem contention), so the first ordering includes that serialized output in the MPI-reported time
  • Comparisons of MPI/PerfSuite times are important to validate results and provide a sanity check
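  A minimal Fortran rendering of the second ordering above, which keeps the serialized output performed by psf_hwpc_stop outside the MPI_Wtime window (a sketch; init/shutdown calls and the computation itself are elided):

      include 'mpif.h'
      include 'fperfsuite.h'
      integer ierr
      double precision t0, t1, mpi_time

      call psf_hwpc_start(ierr)
      t0 = MPI_Wtime()
c     ... region of interest ...
      t1 = MPI_Wtime()
      call psf_hwpc_stop('perf', ierr)
c     the serialized file output above is excluded from mpi_time
      mpi_time = t1 - t0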

  20. Miscellaneous Items Uncovered (cont'd)
  • For runs at high processor counts, the nature of the measurement can impact the system. These issues were known since the NCSA Altix was installed:
    • PAPI "multiplexing" is achieved through regular interrupts
    • Profiling through statistical sampling also generates regular interrupts
    • A single system image with a large number of processors must deal with these interrupts being generated concurrently, and can become overwhelmed
  • We have adjusted the default interrupt frequency on Cobalt when using PerfSuite to help address this
  • Linux kernel developers have modified the relevant code for scalability; the changes are implemented in the upcoming Altix UV system and its associated software

  21. A Final Hurdle with Higher Processor-Count Jobs
  • With the initial optimizations/changes implemented using smaller (8-32 processor) jobs, moving to the largest run (256 procs) gave:

     MPI: mmap failed (memmap_base) for 8068448256 pages (132193456226304 bytes) on 256 ranks
     Attempted byte sizes of regions:
        static+heap 0x76c97d4000
        symheap     0x0
        stack       0x17132c000
        mpibuf      0x0
     MPI: daemon terminated: co-compute2 - job aborting

  • Initial suspicion was that the memory reserved for PerfSuite profiling (sample) buffers might have excessive requirements
  • Internal memory-usage tracking showed ~70 MB/thread, so this was not the source of the problem
  • Setting MPI_MEMMAP_OFF disabled SGI's MPT memory-mapping optimizations and allowed the jobs to complete

  22. First "Legitimate" Timings: OpenMP
  • Initial interpretation at VaTech: superlinear scaling occurring between 32-128 processors on co-compute1, and at 64 processors on co-compute2
  • Why the substantial differences between two "identical" machines at low core counts?

  23. Speedup (Main Timestep Loop)

  24. Machine Comparison
  • Although co-compute1 and co-compute2 are, in many ways, identical, there is an important difference: the amount of memory per node
  • These runs use ~50 GB of memory, more than can be serviced from a single node at low core counts
  • The cost of remote memory access for the lower core-count runs resulted in sublinear scaling there, more pronounced on co-compute1 since each of its nodes has only 1/3 the memory of co-compute2's
  • Additional runs using the "dlook" utility revealed exactly how many pages were allocated across how many nodes

  25. Timings (Main Timestep Loop)
  • At this point, 8 nodes (co-compute2) supply 96 GB

  26. Development/Optimization Observations
  • Many runs were made to arrive at the current "optimized" version:
    • By geographically distinct groups
    • With multiple compiler versions
    • With in-progress code changes
    • With various compiler flags
    • With multiple performance tools
  • It is very easy to get buried in the volume of data
  • Consistency in recording results is critical, but you cannot control how others handle this
  • Hence the need for performance regression testing and tracking

  27. The TAU Portal
  • Web-based access to performance data
  • Supports collaborative performance study
    • Secure performance data sharing
  • Does not require a TAU installation
  • Launches the TAU performance tools (ParaProf, PerfExplorer) via Java WebStart
  • http://tau.nic.uoregon.edu/

  28. TAU Portal Entry Page
  • Access is free of charge
  • Create your account at this page
  • Passwords are required to access/upload data
  • Do not reuse an existing password; security is light

  29. TAU Portal Workspaces
  • The basic unit of organization for performance experiments
  • Can be shared between users
  • Each experiment is initially shown as "metadata"
  • ParaProf can be launched directly from a workspace

  30. Example Basic ParaProf Display
  • Bar-chart displays of profiles are a commonly used display technique (we showed one earlier with PerfSuite data)
  • All experiment trials previously uploaded to the portal are accessible and can be viewed and compared off the portal

  31. Using the TAU Portal from a Batch Job
  • It is extremely easy to incorporate uploading of data to the portal from a batch job through command-line utilities
  • For TAU-generated profiles:

     paraprof --pack myprof.ppk profile.*

  • For PerfSuite-generated data:

     paraprof --pack myprof.ppk *.xml

  • This results in a "packed" data file; to upload it:

     tau_portal.py up -u name -p pw -w wkspace -e exp packed_data_file

  32. For More Information
  • GenIDLEST: http://www.hpcfd.me.vt.edu/codes.shtml
  • PerfSuite: http://perfsuite.ncsa.uiuc.edu/ and http://perfsuite.sourceforge.net/
  • POINT: http://nic.uoregon.edu/point/
