Advanced User Support for MPCUGLES code at the University of Minnesota
October 09, 2008
Mahidhar Tatineni (SDSC), Lonnie Crosby (NICS), John Cazes (TACC)
Overview of MPCUGLES Code
• MPCUGLES is an unstructured-grid large eddy simulation code (written in F90/MPI), developed by Prof. Krishnan Mahesh's group at the University of Minnesota, and can be used for very complex geometries.
• The incompressible flow algorithm employs a staggered approach, with face-normal velocities stored at face centroids and velocity and pressure stored at cell centroids. The non-linear terms are discretized so that discrete energy conservation is imposed.
• The code also uses the HYPRE library (developed at LLNL), a set of high-performance preconditioners, to solve the sparse linear systems that arise in the main algorithm (see the sketch below).
• MPCUGLES has been run at scale using up to 2048 cores and 50 million control volumes on Blue Gene (SDSC), DataStar (SDSC), Ranger (TACC), and Kraken (NICS).
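For reference, in incompressible staggered-grid solvers of this type the sparse system handed to HYPRE typically arises from a pressure-projection (fractional-step) update. A generic sketch of that step, not necessarily MPCUGLES' exact discretization (here $u^{*}$ is the predicted velocity, $\mathcal{N}$ the non-linear terms, and $v_f$ a face-normal velocity):

```latex
% Generic pressure-projection step for an incompressible staggered scheme
% (illustrative only; not MPCUGLES' exact discretization)
\begin{align}
  \frac{u^{*} - u^{n}}{\Delta t} &= -\,\mathcal{N}(u^{n}) + \nu \nabla^{2} u^{n}
      && \text{predictor: convection and diffusion} \\
  \nabla^{2} p^{\,n+1} &= \frac{1}{\Delta t}\,\nabla \cdot u^{*}
      && \text{pressure Poisson system, solved via HYPRE (BoomerAMG) or CG} \\
  v_{f}^{\,n+1} &= v_{f}^{*} - \Delta t\,\nabla p^{\,n+1} \cdot \hat{n}_{f}
      && \text{projection of face-normal velocities to enforce } \nabla \cdot u = 0
\end{align}
```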
General Requirements
• Grid and initial condition generation and partitioning for the runs are done using the METIS software. For the larger grids the experimental metis-5.0pre1 version is required (a previous ASTA project uncovered a problem with the metis-4.0 version for large-scale cases).
• I/O in the code is done with NetCDF: each processor writes its own files in NetCDF format, so there is no MPI-IO or parallel NetCDF requirement (a minimal sketch of this pattern follows this list).
• The HYPRE library (from LLNL) provides high-performance preconditioners, including parallel multigrid methods for both structured and unstructured grid problems; version 1.8.2b is used. The algebraic multigrid solver (HYPRE_BoomerAMG) is used from the library, and MPCUGLES also has the option of using a conjugate-gradient method as an alternative.
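Because every MPI task writes its own file, the I/O path is plain serial NetCDF called independently on each rank. A minimal sketch of that pattern using the Fortran 90 NetCDF interface (the file, dimension, and variable names here are illustrative, not MPCUGLES' actual ones):

```fortran
! Minimal per-rank NetCDF output sketch (illustrative names, not MPCUGLES' own)
program write_per_rank
  use mpi
  use netcdf
  implicit none
  integer :: ierr, rank, ncid, dimid, varid
  integer, parameter :: ncv_local = 1000      ! control volumes owned by this rank
  double precision :: p(ncv_local)            ! e.g. a cell-centered field
  character(len=32) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  p = 0.0d0

  ! One file per MPI task -- no MPI-IO or parallel NetCDF needed
  write(fname, '(a,i6.6,a)') 'field.', rank, '.nc'
  ierr = nf90_create(trim(fname), NF90_CLOBBER, ncid)
  ierr = nf90_def_dim(ncid, 'ncv', ncv_local, dimid)
  ierr = nf90_def_var(ncid, 'pressure', NF90_DOUBLE, (/ dimid /), varid)
  ierr = nf90_enddef(ncid)
  ierr = nf90_put_var(ncid, varid, p)
  ierr = nf90_close(ncid)

  call MPI_Finalize(ierr)
end program write_per_rank
```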
Porting to Ranger and Kraken
• The code was recently ported to both available Track 2 systems (Ranger and Kraken).
• Compiling the code on both machines was relatively straightforward. Both Ranger and Kraken already had the NetCDF libraries installed, and the needed versions of the HYPRE library (v1.8.2b) and METIS (v5.0pre1) were easy to install on both machines.
• The grid and initial condition generation codes are currently serial. For the current scaling studies they were run on Ranger (1 process/node, 32 GB) or DataStar (1 process per p690 node, 128 GB). This is a potential bottleneck for larger runs (>50 million CVs), and part of the current AUS project will focus on parallelizing this step so that much larger grid sizes can be considered.
Performance on Ranger
• Strong scaling plot (257^3 grid)
• Weak scaling plot (64K CVs/task)
Performance on Kraken
• Strong scaling plot (257^3 grid)
• Weak scaling plot (64K CVs/task)
Comments on Performance
• Strong scaling for the 16 million control volume case is acceptable up to 256 cores on Ranger and 512 cores on Kraken. The primary factor is the network bandwidth available per core (higher on Kraken). Overall the code scales reasonably well when there are ~32-64K CVs per task, which is consistent with previous results on DataStar.
• The code should exhibit good weak scaling based on the communication pattern seen in older runs (mostly nearest neighbor; see the sketch below). The results are acceptable up to 256 cores but show a jump in run times beyond that. One likely problem is that the underlying solver takes longer to converge as the number of CVs increases (this is not an isotropic problem but a wall-bounded channel flow).
• Weak scaling runs at 65K CVs/task beyond 512 cores are currently restricted by grid size limitations; this needs to be addressed.
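The nearest-neighbor pattern mentioned above is essentially a halo (ghost-data) exchange between adjacent grid partitions, which is why weak scaling is expected to hold. A minimal sketch of such an exchange with non-blocking MPI (the neighbor lists and packed buffers are illustrative, not taken from MPCUGLES):

```fortran
! Illustrative nearest-neighbor halo exchange (not MPCUGLES' actual routine)
subroutine exchange_halo(nneigh, neigh, maxlen, nsend, sendbuf, nrecv, recvbuf)
  use mpi
  implicit none
  integer, intent(in) :: nneigh                    ! number of neighboring partitions
  integer, intent(in) :: neigh(nneigh)             ! MPI ranks of those neighbors
  integer, intent(in) :: maxlen                    ! max faces shared with any neighbor
  integer, intent(in) :: nsend(nneigh), nrecv(nneigh)
  double precision, intent(in)  :: sendbuf(maxlen, nneigh)  ! packed face data to send
  double precision, intent(out) :: recvbuf(maxlen, nneigh)  ! packed face data received
  integer :: i, ierr
  integer :: req(2*nneigh), stat(MPI_STATUS_SIZE, 2*nneigh)

  ! Each task only talks to the partitions it shares faces with, so the message
  ! count and volume per task stay roughly fixed in a weak-scaling run.
  do i = 1, nneigh
     call MPI_Irecv(recvbuf(1, i), nrecv(i), MPI_DOUBLE_PRECISION, neigh(i), &
                    100, MPI_COMM_WORLD, req(i), ierr)
  end do
  do i = 1, nneigh
     call MPI_Isend(sendbuf(1, i), nsend(i), MPI_DOUBLE_PRECISION, neigh(i), &
                    100, MPI_COMM_WORLD, req(nneigh + i), ierr)
  end do
  call MPI_Waitall(2*nneigh, req, stat, ierr)
end subroutine exchange_halo
```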
Future Work
• Near term:
  • Redo the weak scaling runs with an isotropic case to see if that avoids the extra work required of the underlying solver.
  • Run at larger processor counts on both Ranger and Kraken with profiling/performance tools to analyze the performance.
• Long term:
  • Parallelize the initial condition and grid generation codes to enable scaling to much larger processor counts.
  • Investigate the performance implications of changing the underlying linear solver and see whether any improvements can be made. For example, the CG algorithm scales much better (tests on Kraken already show this) but takes longer to converge, so there is a tradeoff (see the sketch below).
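As a starting point for the solver investigation, both solve paths are available through HYPRE itself, and switching between them is a small, localized change. The sketch below uses the call names found in HYPRE's bundled Fortran examples; whether MPCUGLES' existing CG option actually goes through HYPRE is an assumption here, and the assembly of the matrix and vector handles is omitted:

```fortran
! Sketch of swapping the HYPRE solve path, following the naming used in
! HYPRE's Fortran examples (assumption: the CG alternative is taken from
! HYPRE as well). parcsr_A, par_b, par_x are already-assembled ParCSR handles.
subroutine solve_pressure(parcsr_A, par_b, par_x, use_cg)
  use mpi
  implicit none
  integer*8, intent(in) :: parcsr_A, par_b, par_x
  logical,   intent(in) :: use_cg
  integer*8 :: solver
  integer   :: ierr

  if (.not. use_cg) then
     ! BoomerAMG used directly as the solver (the option described earlier)
     call HYPRE_BoomerAMGCreate(solver, ierr)
     call HYPRE_BoomerAMGSetTol(solver, 1.0d-7, ierr)
     call HYPRE_BoomerAMGSetMaxIter(solver, 100, ierr)
     call HYPRE_BoomerAMGSetup(solver, parcsr_A, par_b, par_x, ierr)
     call HYPRE_BoomerAMGSolve(solver, parcsr_A, par_b, par_x, ierr)
     call HYPRE_BoomerAMGDestroy(solver, ierr)
  else
     ! Conjugate gradient: cheaper and better-scaling per iteration, but may
     ! need more iterations to converge (the tradeoff noted above)
     call HYPRE_ParCSRPCGCreate(MPI_COMM_WORLD, solver, ierr)
     call HYPRE_ParCSRPCGSetTol(solver, 1.0d-7, ierr)
     call HYPRE_ParCSRPCGSetMaxIter(solver, 1000, ierr)
     call HYPRE_ParCSRPCGSetup(solver, parcsr_A, par_b, par_x, ierr)
     call HYPRE_ParCSRPCGSolve(solver, parcsr_A, par_b, par_x, ierr)
     call HYPRE_ParCSRPCGDestroy(solver, ierr)
  end if
end subroutine solve_pressure
```

Timing the two branches within the profiling runs planned above would quantify the scaling-versus-convergence tradeoff directly.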